Re: [PATCH 0/4] workqueue_tracepoint: Add worklet tracepoints for worklet lifecycle tracing

From: Frederic Weisbecker
Date: Fri Apr 24 2009 - 18:59:27 EST


On Fri, Apr 24, 2009 at 01:06:16PM -0700, Andrew Morton wrote:
> On Fri, 24 Apr 2009 19:42:19 +0800
> Zhaolei <zhaolei@xxxxxxxxxxxxxx> wrote:
>
> > These patches add tracepoints for per-worklet tracing.
> > Now we have enough tracepoints to start making trace_workqueue.c support
> > worklet time measurement.
>
> I'm not seeing anywhere in this patchset a description of the user
> interface. What does the operator/programmer actually get to see from
> the kernel as a result of these changes?
>
> A complete usage example would be an appropriate way of communicating
> all of this.
>
> The patches introduce a moderate amount of tracing-specific hooks into
> the core workqueue code, which inevitably increases the maintenance
> load for that code. It is important that it be demonstrated that the
> benefits of the code are worth that cost. Hence it is important that
> these benefits be demonstrated to us, by yourself. Please.
>
> Another way of looking at it: which previously-encountered problems
> would this facility have helped us to solve? How will this facility
> help us to solve problems in the future? Looking at this patch series
> I cannot answer those questions!
>


Hi Andrew,

Although I'm not the author of this patchset, I'm somewhat
involved in the workqueue tracer and I would like to express
my opinion on what is happening.

Until recently, the workqueue tracer was a fairly basic machine.
It was designed to trace events at the workqueue level only; we were
not yet thinking about the worklet level. It covered the following
events (an illustrative sketch of such a hook follows the list):

- creation of a workqueue thread
- cleanup/destruction of these threads
- insertion of a work in a workqueue
- execution of this work
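
For reference, each of these hooks boils down to a TRACE_EVENT()
declaration plus a single trace_*() call at the right spot in
kernel/workqueue.c. The sketch below is purely illustrative: the "my_"
names and the exact fields are made up for this mail, not the tracer's
actual definitions.

#undef TRACE_SYSTEM
#define TRACE_SYSTEM my_workqueue

#if !defined(_TRACE_MY_WORKQUEUE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MY_WORKQUEUE_H

#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/tracepoint.h>

/* Fired when a work is queued on a workqueue thread (illustrative). */
TRACE_EVENT(my_work_insertion,

	TP_PROTO(struct task_struct *wq_thread, struct work_struct *work),

	TP_ARGS(wq_thread, work),

	TP_STRUCT__entry(
		__array(char,		thread_comm,	TASK_COMM_LEN)
		__field(pid_t,		thread_pid)
		__field(work_func_t,	func)
	),

	TP_fast_assign(
		memcpy(__entry->thread_comm, wq_thread->comm, TASK_COMM_LEN);
		__entry->thread_pid	= wq_thread->pid;
		__entry->func		= work->func;
	),

	TP_printk("thread=%s:%d func=%pF", __entry->thread_comm,
		  __entry->thread_pid, __entry->func)
);

#endif /* _TRACE_MY_WORKQUEUE_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

In the core workqueue code itself, each event then appears as a single
trace_my_work_insertion()-style call at the corresponding spot.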

The output looked like the following histogram:

# CPU  INSERTED  EXECUTED   WORKQUEUE THREAD
#
   1       125       125   reiserfs/1
   1         0         0   scsi_tgtd/1
   1         0         0   aio/1
   1         0         0   ata/1
   1       114       114   kblockd/1
   1         0         0   kintegrityd/1
   1      2147      2147   events/1

   0         0         0   kpsmoused
   0       105       105   reiserfs/0
   0         0         0   scsi_tgtd/0
   0         0         0   aio/0
   0         0         0   ata_aux
   0         0         0   ata/0
   0         0         0   cqueue
   0         0         0   kacpi_notify
   0         0         0   kacpid
   0       149       149   kblockd/0
   0         0         0   kintegrityd/0
   0      1000      1000   khelper
   0      2270      2270   events/0


Its purpose and the information it gave were limited, though still
somewhat useful: it showed the frequency at which works were inserted
on each workqueue.

Why is that useful? Because it provides information that helps a kernel
developer decide how to defer async jobs: should they use the
traditional, general-purpose events/%d workqueues, or a dedicated
workqueue? (A minimal sketch of both alternatives follows the list
below.)

The decision has to be made according to:

- The workload the job induces. If too many works are inserted at a
  time, we can create unacceptable latencies that delay the other
  queued works. The events/%d threads are especially fragile in this
  respect.

- The workload is not only about insertion frequency but also about
  how much time is spent processing each work. Does it try to do too
  many things and burn too much cpu time? Does it sleep for too long,
  waiting for an event, a heavily contended lock, or something else?

- Any other cause that may encumber the events/%d threads and create
  bad latencies.
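
To make the alternative concrete, here is that minimal sketch of the
two options, using nothing but the plain workqueue API. Everything
prefixed with "my_" is a name made up for the example, not code from
this patchset.

#include <linux/workqueue.h>
#include <linux/errno.h>

static void my_work_func(struct work_struct *work)
{
	/* the deferred job runs here, in process context */
}

static DECLARE_WORK(my_work, my_work_func);

/*
 * Option 1: piggyback on the shared, general purpose events/%d
 * threads. Cheap, but the job now competes with every other work
 * queued there.
 */
static void my_defer_on_events(void)
{
	schedule_work(&my_work);
}

/*
 * Option 2: pay for a dedicated (per-cpu) workqueue. Costs kernel
 * threads, but the job can no longer delay, or be delayed by, the
 * works queued on events/%d.
 */
static struct workqueue_struct *my_wq;

static int my_defer_on_dedicated_wq(void)
{
	if (!my_wq) {
		my_wq = create_workqueue("my_wq");
		if (!my_wq)
			return -ENOMEM;
	}
	queue_work(my_wq, &my_work);
	return 0;
}

The tracer's job is precisely to provide the numbers that tell you
which of these two options a given async job deserves.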

After reading this list of requirements, you have probably noticed
that the frequency of inserted works is only one variable in the
workload recipe. Not only is that incomplete information, it may also
mislead the user into focusing on the largest insertion counts, whereas
a single work can create more burden than hundreds of others.


Then Zhaolei and Kosaki Motohiro recently came up with a good update
for this tracer: a patchset which adds the per-worklet tracing
dimension. The output on my laptop became the following:

$ cat /debug/tracing/trace_stat/workqueues
# CPU  INSERTED  EXECUTED   TASKNAME:PID
#  |      |         |         `-WORKFUNC
#  |      |         |           |
   0      3822      3898   events/0:9
   0         1         1     |-hpet_work+0x0/0x340
   0        47        46     |-rt_worker_func+0x0/0x400
   0      2655      2654     |-vmstat_update+0x0/0x40
   0         1         1     |-power_supply_changed_work+0x0/0x60
   0       212       212     |-fb_flashcursor+0x0/0x3c0
   0        80        80     |-test_work+0x0/0x64
   0        10         9     |-rekey_seq_generator+0x0/0x70
   0         8         7     |-lru_add_drain_per_cpu+0x0/0x10
   0       455       455     |-console_callback+0x0/0x1e0
   0       249       249     |-flush_to_ldisc+0x0/0x2f0
   0         1         1     |-atkbd_event_work+0x0/0xb0
   0       103        96     `-dst_gc_task+0x0/0x340
   0         0         0   cpuset:11
   0      1378      1378   khelper:12
   0      1378      1378     `-__call_usermodehelper+0x0/0xe0
   0         0         0   kintegrityd/0:178
   0       333       333   kblockd/0:184
   0       233       233     |-blk_unplug_work+0x0/0x110
   0       100       100     `-cfq_kick_queue+0x0/0x50
   0         0         0   kacpid:190
   0         0         0   kacpi_notify:191
   0         0         0   tifm:498
   0      8048      8048   ata/0:507
   0      8048      8048     `-ata_pio_task+0x0/0x280
   0         0         0   ata_aux:511
   0         0         0   aio/0:871
   0         0         0   crypto/0:897
   0         0         0   scsi_tgtd/0:2368
   0         0         0   iscsi_eh:2387
   0         0         0   kpsmoused:2521
   0         0         0   hd-audio0:2660
   0         0         0   krxrpcd/0:2745
   0        59        59   reiserfs/0:2895
   0        59        59     `-flush_async_commits+0x0/0x90
   1      5480      5503   events/1:10
   1         1         1     |-hpet_work+0x0/0x340
   1      2819      2818     |-vmstat_update+0x0/0x40
   1       464       464     |-console_callback+0x0/0x1e0
   1        80        80     |-test_work+0x0/0x64
   1         1         1     |-sysfs_add_func+0x0/0xb0
   1         8         8     |-lru_add_drain_per_cpu+0x0/0x10
   1         1         1     |-sync_cmos_clock+0x0/0x140
   1      1823      1823     |-fb_flashcursor+0x0/0x3c0
   1       282       282     |-flush_to_ldisc+0x0/0x2f0
   1         1         1     `-atkbd_event_work+0x0/0xb0
   1         0         0   kintegrityd/1:181
   1       254       254   kblockd/1:185
   1       148       148     |-blk_unplug_work+0x0/0x110
   1       106       106     `-cfq_kick_queue+0x0/0x50
   1     10988     10988   ata/1:509
   1     10988     10988     `-ata_pio_task+0x0/0x280
   1         0         0   aio/1:878
   1         0         0   crypto/1:898
   1         0         0   scsi_tgtd/1:2370
   1         0         0   krxrpcd/1:2746
   1        25        25   reiserfs/1:2896
   1        25        25     `-flush_async_commits+0x0/0x90


The tracer has evolved by decomposing these frequency stats per
worklet. Not only does it help a driver writer instrument his own
async jobs, it can also help any developer find the source of a
latency coming from a workqueue.

It can at least point to a first direction of investigation, though we
are still only talking about frequency. And speaking of incomplete
information, this output is only one step toward the main goal, which
is to provide the average and maximum execution time of each worklet:


# CPU  INSERTED  EXECUTED   AVG TIME  MAX TIME   TASKNAME:PID
#  |      |         |          |         |        `-WORKFUNC
#  |      |         |          |         |          |
   0      3822      3898     100 us    200 us    events/0:9
   0         1         1      12 us     15 us      |-hpet_work+0x0/0x340
   0        47        46      10 us     45 us      |-rt_worker_func+0x0/0x400


With that we'll have many more clues to start a latency investigation.
But to get this end result, we needed the changes done in this latest
patchset from Zhaolei and Kosaki which, among other useful changes,
splits the workqueue_execution event into a dichotomous capture with:

- trace_worklet_execute
- trace_worklet_complete

That split was the required piece for measuring the execution time, as
the sketch below illustrates.
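
Here is a rough sketch of the bookkeeping the stat tracer can now do
with that pair. This is not the actual trace_workqueue.c code; the
"my_" names are invented for the example, and only the ktime helpers
are existing kernel API.

#include <linux/ktime.h>
#include <linux/hrtimer.h>
#include <linux/types.h>

struct my_worklet_stat {
	ktime_t		start;		/* timestamp taken at worklet_execute */
	u64		total_ns;	/* sum of all execution times */
	u64		max_ns;		/* worst execution time seen so far */
	unsigned long	executed;	/* number of completed executions */
};

/* Probe hooked on the worklet_execute event. */
static void my_probe_worklet_execute(struct my_worklet_stat *stat)
{
	stat->start = ktime_get();
}

/* Probe hooked on the worklet_complete event. */
static void my_probe_worklet_complete(struct my_worklet_stat *stat)
{
	u64 delta = ktime_to_ns(ktime_sub(ktime_get(), stat->start));

	stat->total_ns += delta;
	if (delta > stat->max_ns)
		stat->max_ns = delta;
	stat->executed++;
	/*
	 * AVG TIME = total_ns / executed, MAX TIME = max_ns,
	 * both reported when the trace_stat file is read.
	 */
}

Without a distinct completion event there would be nothing to subtract
the execution timestamp from, hence the need for the split.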

Also, as I said, the statistical/histogram output is only the first
step of worklet instrumentation. Once we focus on a specific culprit,
we can trace its whole lifecycle using the trace events: insertion,
execution, completion, delayed insertion, cancellation, and so on.

All these events provide the fine-grained instrumentation required to
learn the details of any worklet's lifecycle. Combined with well-chosen
event toggling and event field filtering, we can focus on only the
relevant events at a given time.

An example of such a trace:

# cd /debug/tracing

# ls events/workqueue/
filter worklet_complete worklet_enqueue_delayed workqueue_creation workqueue_flush
worklet_cancel worklet_enqueue worklet_execute workqueue_destruction

# for e in $(ls events/workqueue); do echo 1 > events/workqueue/$e/enable; done

# tail -f trace

events/1-10 [001] 4687.187314: worklet_execute: thread=events/1:10 work=ffff880001b27200 func=vmstat_update+0x0/0x40
events/1-10 [001] 4687.187330: worklet_enqueue_delayed: thread=events/1:10 func=vmstat_update+0x0/0x40 cpu=-1 delay=250
events/1-10 [001] 4687.187331: worklet_complete: thread=events/1:10 work=ffff880001b27200
events/0-9 [000] 4687.250236: worklet_execute: thread=events/0:9 work=ffff880001b0c200 func=vmstat_update+0x0/0x40
events/0-9 [000] 4687.250250: worklet_enqueue_delayed: thread=events/0:9 func=vmstat_update+0x0/0x40 cpu=-1 delay=237
events/0-9 [000] 4687.250251: worklet_complete: thread=events/0:9 work=ffff880001b0c200

If more details are needed, such events would also fit nicely inside a
function graph tracer output.

So this latest patchset provides all the required information at the
event tracing level. The next step will likely be on the stat tracing
side, to provide the avg/max execution times.


Thanks,
Frederic.
