Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

From: Ingo Molnar
Date: Thu Jun 28 2007 - 12:02:11 EST

Next message: Zhang, Rui: "RE: [patch -mm] s390: struct bin_attribute changes"
Previous message: Mark Brown: "Re: Userspace compiler support of "long long""
In reply to: Steven Rostedt: "Re: [RFC PATCH 0/6] Convert all tasklets to workqueues"
Next in thread: Jeff Garzik: "Re: [RFC PATCH 0/6] Convert all tasklets to workqueues"

* Alexey Kuznetsov <kuznet@xxxxxxxxxxxxx> wrote:

> > the context-switch argument i'll believe if i see numbers. You'll
> > probably need in excess of tens of thousands of irqs/sec to even be
> > able to measure its overhead. (workqueues are driven by nice kernel
> > threads so there's no TLB overhead, etc.)
>
> It was authors of the patch who were supposed to give some numbers, at
> least one or two, just to prove the concept. :-)

sure enough! But it was not me who claimed that 'workqueues are slow'.

firstly, i'm not here at all to tell people what tools to use. I'm not
trying to 'force' people away from a perfectly logical technological
choice. I am just wondering out loud whether this particular tool, in
its current usage pattern, makes much technological sense. My claim is:
it could very well be that it doesnt make _much_ sense, and in that case
we should provide a non-intrusive migration path away in terms of a
compatible API wrapper to a saner (albeit by virtue of trying to emulate
an existing API, slower) mechanism. The examples cited so far had the
tasklet as an intermediary towards a softirq - what's the technological
point in such a splitup?

> According to my measurements (maybe, wrong) on 2.5GHz P4 tasklet
> schedule and execution eats ~300ns, workqueue eats ~4usec. On my
> 1.8GHz PM notebook (UP kernel), the numbers are 170ns and 1.2usec.

I find the 4usecs cost on a P4 interesting and a bit too high - how did
you measure it? (any test-patch for it i could try?) But i think even
your current numbers partly prove my point: with 1.2 usecs and 10,000
irqs/sec the cost is 12 msecs/sec, or 1.2%. And 10K irqs/sec themselves
will eat up much more CPU time than that already.

> Formally looking awful, this result is positive: tasklets are almost
> never used in hot paths. I am sure only about one such place: acenic
> driver uses tasklet to refill rx queue. This generates not more than
> 3000 tasklet schedules per second. Even on P4 it pure workqueue
> schedule will eat ~1% of bare cpu ticks.

... and the irq cost itself will eat 5-10% of bare CPU ticks already.
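
The overhead figures in this exchange are all the same arithmetic: per-event cost times event rate gives busy time per second of wall time. A minimal userspace sketch of that calculation follows; the function name cpu_percent() is made up for illustration, and the inputs are the measurements quoted above, not new data.

```c
#include <assert.h>

/*
 * Share of one CPU, in percent, consumed by a deferred-work mechanism.
 * cost_ns:        per-event schedule+execute cost in nanoseconds
 * events_per_sec: how many times per second the work is scheduled
 * (hypothetical helper, only for checking the arithmetic in the thread)
 */
double cpu_percent(double cost_ns, double events_per_sec)
{
	assert(cost_ns >= 0.0 && events_per_sec >= 0.0);

	/* busy nanoseconds per second, divided by 1e9 ns, scaled to percent */
	return cost_ns * events_per_sec / 1e9 * 100.0;
}
```

For example, cpu_percent(4000, 3000) (the P4 workqueue cost at the acenic rate) and cpu_percent(1200, 10000) (the PM workqueue cost at 10K irqs/sec) both come out at about 1.2, while the quoted tasklet costs (300ns, 170ns) stay below 0.2 at the same rates.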

> > ... workqueues are also possibly much more scalable
>
> I cannot figure out - scale in what direction? :-)

workqueues can be per-cpu - for tasklets to be per-cpu you have to
open-code them into per-cpu like rcu-tasklets did (which in essence
turns them into more expensive softirqs).

> > (percpu workqueues
> > are easy without changing anything in your code but the call where
> > you create the workqueue).
>
> I do not see how it is related to scalability. And the statement does
> not even make sense. The patch already uses per-cpu workqueue for
> tasklets, otherwise it would be a disaster: guaranteed cpu
> non-locality.

my argument was: workqueues are more scalable than tasklets in general.

Just look at the tasklet_disable() logic. We basically have a per-cpu
list of tasklets that we poll in tasklet_action:

static void tasklet_action(struct softirq_action *a)
{
[...]
	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {

and if the trylock fails, we just continue to meet this activated
tasklet again and again, in this nice linear list.

this happens to work in practice because 1) tasklets are used quite
rarely, 2) tasklet_disable() is done relatively rarely and nobody truly
runs tons of the same devices (which depend on a tasklet) on the same
box, but still it's quite an unhealthy approach. Every time i look at
the tasklet code it hurts - having fundamental stuff like that in the
heart of Linux ;-)
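
To make the polling behaviour concrete, here is a small userspace model of such a loop. It is a sketch under assumptions, not the kernel code: struct model_tasklet and model_tasklet_pass() are invented names, and a single `disabled` flag stands in for both a failed trylock and a tasklet that has been disabled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a tasklet; NOT the kernel's struct tasklet_struct. */
struct model_tasklet {
	struct model_tasklet *next;
	int disabled;          /* models "cannot run right now" */
	unsigned long visits;  /* how often a pass looked at this entry */
	unsigned long runs;    /* how often the handler would have run */
};

/*
 * One pass over the pending list, in the spirit of tasklet_action().
 * Runnable entries run once and leave the list. Entries that cannot
 * run are put back, so every later pass has to look at them again.
 * Returns the list that is still pending after this pass.
 */
struct model_tasklet *model_tasklet_pass(struct model_tasklet *list)
{
	struct model_tasklet *requeued = NULL;

	while (list) {
		struct model_tasklet *t = list;

		list = list->next;
		t->visits++;

		if (!t->disabled) {
			t->runs++;      /* the handler would be called here */
			t->next = NULL;
			continue;
		}

		/* cannot run now: keep it pending, it gets polled again */
		t->next = requeued;
		requeued = t;
	}
	return requeued;
}
```

In this model, a single disabled entry that stays pending across 1000 passes is visited 1000 times without ever running, and the visit cost grows linearly with the number of such entries on the list.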

also, the "be afraid of the hardirq or the process context" mantra is
overblown as well. If something is too heavy for a hardirq, _it's too
heavy for a tasklet too_. Most hardirqs are (or should be) running with
interrupts enabled, which makes their difference to softirqs minuscule.

The most scalable workloads dont involve any (or many) softirq middlemen
at all: you queue work straight from the hardirq context to the target
process context. And that's what you want to do _anyway_, because you
want to create as little locally cached data for the hardirq context as
possible, as the target task could easily be on another CPU. (this is
generally true for things like block IO, but it's also true for things
like network IO.)

the most scalable solution would be _for the network adapter to figure
out the target CPU for the packet_. Not many (if any) such adapters
exist at the moment. (as it would involve allocating NR_CPUs irqs to
that adapter alone.)
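
A rough userspace sketch of what "figuring out the target CPU" could look like: a table that remembers on which CPU the consumer of a flow runs, consulted per packet to pick the per-CPU queue (and thus the irq) to deliver to. Everything here is hypothetical: the names, the table size and the CPU count are made up, and no existing adapter or kernel interface is being described.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS   4u    /* hypothetical CPU count */
#define MODEL_NR_FLOWS  256u  /* hypothetical flow-table size */

/* flow slot -> CPU where the consuming task was last seen (starts at 0) */
static uint8_t model_flow_cpu[MODEL_NR_FLOWS];

/* Record that the task consuming this flow is running on `cpu`. */
void model_note_consumer(uint32_t flow_hash, unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	model_flow_cpu[flow_hash % MODEL_NR_FLOWS] = (uint8_t)cpu;
}

/* Pick the per-CPU queue for a packet of this flow. */
unsigned int model_target_cpu(uint32_t flow_hash)
{
	return model_flow_cpu[flow_hash % MODEL_NR_FLOWS];
}
```

The point of the sketch is only the direction of the lookup: the decision is made before any softirq-level processing touches the packet, so the data is first pulled into cache on the CPU that will consume it.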

> Tasklet is single thread by definition and purpose. Those a few places
> where people used tasklets to do per-cpu jobs (RCU f.e.) exist just
> because they had troubles with allocating new softirq. [...]

no. The following tale is the true and only history of the RCU tasklet
;-) The RCU guys first used a tasklet, then noticed its bad scalability
(a particular VFS-intense benchmark regressed because only a single CPU
would do RCU completion on an 8-way box) so they switched it to a
per-cpu tasklet - without realizing that a per-cpu tasklet is in essence
a softirq. I pointed it out to them (years down the road ...) then the
"convert rcu-tasklet to softirq" patch was born.

> > the only remaining argument is latency:
>
> You could set realtime priority by default, not a poor nice -5. If
> some network adapters were killed just because I run some task with
> nice --22, it would be just ridiculous.

there are only 20 negative nice levels ;-) And i dont really get the
'you might kill the network adapter' argument, because the opposite is
true just as much: tasklets from a totally uninteresting network adapter
can kill your latency-sensitive application too.

So providing more flexibility in the prioritization of the work that
goes on in the system (as long as it has no other drawbacks) can not be
wrong. The "but you will shoot yourself in the foot" argument is really
backwards in that context.

Tasklets are called 'task'-lets for a reason: they are poorly scheduled,
inflexible tasks. They were written in an age when we didnt have
workqueues, we didnt have kthreads and real men thought they wanted to
do all their TCP/IP processing in softirq context [ am i heading down
the road towards a showdown with DaveM here? ;-) ].

Now ... you (and Jeff, and others) are right and workqueues could be too
slow for some of the cases (i said before that i'd be surprised if it
were more than 1-2), in which case my argument changes to what i
outlined above: if you want good scalability, dont use middlemen :-)
Figure out the target task as early as possible and let it do as much of
the remaining work as possible. _Increasing_ the amount of cached
context (by doing delayed processing in tasklets or even softirqs on the
same CPU where the hardirq arrived) only increases the cross-CPU cost.
Keeping stuff in a softirq only makes (some) sense as long as you have
no target task at all (routing, filtering, etc.).

Ingo