Re: [PATCH 0/4] Finer granularity and task/cgroup irq timeaccounting

From: Peter Zijlstra
Date: Wed Sep 08 2010 - 07:12:58 EST


On Tue, 2010-08-24 at 19:02 -0700, Venkatesh Pallipadi wrote:
> On Tue, Aug 24, 2010 at 1:39 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:
> >
> >
> >> - I started looking as not accounting this time to tasks themselves.
> >> This was really tricky as things are tightly tied to scheduler
> >> vruntime to get it right.
> >
> > I'm not exactly sure where that would get complicated, simply treat
> > interrupts the same as preemptions by other tasks and things should
> > basically work out rather straight forward from that.
> >
>
> Atleast the way I tried it turned out to be messy. Keep track of time
> at si and hi
> and remove it from update_curr delta. I did it that way as I didn't
> want to take rq lock
> on hi si path. Doing it as preemption with put_prev/pick_next would be
> expensive. No?
> Or you meant it some other way?

No, removing it from update_curr()'s delta is exactly what I meant. That
gives the same end result as if the task were preempted (ie. it ran
less).

> >> I am not even sure I got it totally right
> >> :(, but I did play with the patch a bit. And noticed there were
> >> multiple issues. 1) A silly case as in of two tasks on one CPU, one
> >> task totally CPU bound and another task doing network recv. This is
> >> how task and softirq time looks like for this (10s samples)
> >> (loop) (nc)
> >> 503 9 502 301
> >> 502 8 502 303
> >> 502 9 501 302
> >> 502 8 502 302
> >> 503 9 501 302
> >> Now, when I did "not account si time to task", the loop task ended up
> >> getting a lot less CPU time and doing less work as nc task doing rcv
> >> got more CPU share, which was not right thing to do. IIRC, I had
> >> something like <300 centiseconds for loop after the change (with si
> >> activity increasing due to higher runtime of nc task).
> >
> > Well, that actually makes sense and I wouldn't call it wrong.
>
> I meant, that will make nc run for more than its fair share.

No it wouldn't, it would make nc run exactly its fair share. SoftIRQ
time isn't nc time, so by not accounting it to nc, nc gets to run more.

> >> So, even though it seems accounting irq time as "system time" seems
> >> the right thing to do, it can break scheduling in many ways. May be
> >> hardirq can be accounted as system time. But, dealing with softirq is
> >> tricky as they can be related to the task.
> >
> > I haven't yet seen any scheduler breakage here, it will divide time
> > differently, but not in a broken way, if the system consumes 1/3rd of
> > the time, there's only 2/3rd left to fairly distribute between tasks, so
> > something like, 1/3-loop 1/3-nc 1/3-softirq makes perfect sense.
> >
> > You'd get exactly the same kind of thing if you replace (soft)irq with a
> > FIFO task.
> >
>
> But, FIFO in that case would be some unrelated task taking away CPU.
> Here one task can take more than its share due to si.

No, how is being preempted by an unrelated task (FIFO whatever)
different from being 'preempted' by unrelated SoftIRQ processing?


> > This is where I strongly disagree, providing an interface that cannot
> > possibly be implemented correctly just so you can fudge something (still
> > not sure what from userspace) seems a very bad idea indeed.
> >
>
> I don't think correctness is a problem. TSC is pretty good for this
> purpose on current hardware. I agree that usability is debatable.

Like already argued, it is a correctness issue, TSC is only an accuracy
issue, but since you cannot properly attribute this very accurate time
measurement, you're still totally incorrect.

> If you strongly think that the right way is to make both si and hi
> "system time" and that will not cause unfairness and slowdown for some
> unrelated tasks, I can try to cleanup the patch I had for that and
> send it out. I am afraid though, it will cause some regression and we
> will end up back at square one after a month or so. :(

Sure it will affect some workloads, but its also more correct. If
network workloads suffer we can look at sorting out some of those
problems. But really SoftIRQ != task context and thus should not be
accounted to said task.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/