Re: [rfc] lru_add_drain_all() vs isolation

From: KOSAKI Motohiro
Date: Tue Sep 08 2009 - 07:41:46 EST

Next message: Ingo Molnar: "Re: [tip:sched/core] sched: Ensure that a child can't gain timeover it's parent after fork()"
Previous message: Jan Kara: "[PATCH] fs: Make sure data stored into inode is properly seen before unlocking new inode"
In reply to: Peter Zijlstra: "Re: [rfc] lru_add_drain_all() vs isolation"
Next in thread: Peter Zijlstra: "Re: [rfc] lru_add_drain_all() vs isolation"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> On Tue, 2009-09-08 at 19:06 +0900, KOSAKI Motohiro wrote:
> > > On Tue, 2009-09-08 at 08:56 +0900, KOSAKI Motohiro wrote:
> > > > Hi Peter,
> > > >
> > > > > On Mon, 2009-09-07 at 10:17 +0200, Mike Galbraith wrote:
> > > > >
> > > > > > [ 774.651779] SysRq : Show Blocked State
> > > > > > [ 774.655770] task PC stack pid father
> > > > > > [ 774.655770] evolution.bin D ffff8800bc1575f0 0 7349 6459 0x00000000
> > > > > > [ 774.676008] ffff8800bc3c9d68 0000000000000086 ffff8800015d9340 ffff8800bb91b780
> > > > > > [ 774.676008] 000000000000dd28 ffff8800bc3c9fd8 0000000000013340 0000000000013340
> > > > > > [ 774.676008] 00000000000000fd ffff8800015d9340 ffff8800bc1575f0 ffff8800bc157888
> > > > > > [ 774.676008] Call Trace:
> > > > > > [ 774.676008] [<ffffffff812c4a11>] schedule_timeout+0x2d/0x20c
> > > > > > [ 774.676008] [<ffffffff812c4891>] wait_for_common+0xde/0x155
> > > > > > [ 774.676008] [<ffffffff8103f1cd>] ? default_wake_function+0x0/0x14
> > > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [ 774.676008] [<ffffffff810c0e63>] ? lru_add_drain_per_cpu+0x0/0x10
> > > > > > [ 774.676008] [<ffffffff812c49ab>] wait_for_completion+0x1d/0x1f
> > > > > > [ 774.676008] [<ffffffff8105fdf5>] flush_work+0x7f/0x93
> > > > > > [ 774.676008] [<ffffffff8105f870>] ? wq_barrier_func+0x0/0x14
> > > > > > [ 774.676008] [<ffffffff81060109>] schedule_on_each_cpu+0xb4/0xed
> > > > > > [ 774.676008] [<ffffffff810c0c78>] lru_add_drain_all+0x15/0x17
> > > > > > [ 774.676008] [<ffffffff810d1dbd>] sys_mlock+0x2e/0xde
> > > > > > [ 774.676008] [<ffffffff8100bc1b>] system_call_fastpath+0x16/0x1b
> > > > >
> > > > > FWIW, something like the below (prone to explode since its utterly
> > > > > untested) should (mostly) fix that one case. Something similar needs to
> > > > > be done for pretty much all machine wide workqueue thingies, possibly
> > > > > also flush_workqueue().
> > > >
> > > > Can you please explain reproduce way and problem detail?
> > > >
> > > > AFAIK, mlock() call lru_add_drain_all() _before_ grab semaphoe. Then,
> > > > it doesn't cause any deadlock.
> > >
> > > Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > > cpu0 does mlock()->lru_add_drain_all(), which does
> > > schedule_on_each_cpu(), which then waits for all cpus to complete the
> > > work. Except that cpu1, which is busy with the RT task, will never run
> > > keventd until the RT load goes away.
> > >
> > > This is not so much an actual deadlock as a serious starvation case.
> >
> > This seems flush_work vs RT-thread problem, not only lru_add_drain_all().
> > Why other workqueue flusher doesn't affect this issue?
>
> flush_work() will only flush workqueues on which work has been enqueued
> as Oleg pointed out.
>
> The problem is with lru_add_drain_all() enqueueing work on all
> workqueues.

Thank you for the kind explanation. I am gradually coming to understand this issue.
Yes, lru_add_drain_all() uses schedule_on_each_cpu(), which has the following code:

	for_each_online_cpu(cpu)
		flush_work(per_cpu_ptr(works, cpu));

However, I don't think your approach solves this issue.
lru_add_drain_all() flushes lru_add_pvecs and lru_rotate_pvecs.

lru_add_pvecs gets pages added on
  - LRU moves,
    e.g. read(2), write(2), page fault, vmscan, page migration, etc.

lru_rotate_pvecs gets pages added on
  - page writeback

IOW, if the RT thread calls write(2) or takes a page fault, we face the same
problem. I don't think we can assume an RT thread never takes a page fault.
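
A minimal userspace sketch of this argument, assuming the approach under
discussion amounts to queueing drain work only on CPUs whose pagevecs are
non-empty. struct cpu_model and both functions are hypothetical names made
up for illustration; none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU; not a kernel structure. */
struct cpu_model {
	unsigned int pagevec_pages;	/* pages pending in the per-cpu pagevecs */
	bool rt_hog;			/* a SCHED_FIFO task never lets keventd run */
};

/*
 * Models the current behaviour: work is queued on every online CPU and
 * the caller waits for all of them. Returns true if the wait can never
 * finish because some CPU cannot run the queued work.
 */
bool drain_all_stalls(const struct cpu_model *cpus, size_t ncpus)
{
	assert(cpus != NULL || ncpus == 0);

	for (size_t i = 0; i < ncpus; i++)
		if (cpus[i].rt_hog)
			return true;
	return false;
}

/*
 * Models the assumed alternative: work is queued only on CPUs that have
 * pending pages. The caller still stalls if such a CPU is hogged.
 */
bool drain_selective_stalls(const struct cpu_model *cpus, size_t ncpus)
{
	assert(cpus != NULL || ncpus == 0);

	for (size_t i = 0; i < ncpus; i++)
		if (cpus[i].pagevec_pages > 0 && cpus[i].rt_hog)
			return true;
	return false;
}
```

With an RT CPU whose pagevecs are empty, only drain_all_stalls() reports a
stall. As soon as the RT task itself has put one page into its pagevec,
both variants stall.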

Hmm, this seems to be a difficult problem. I guess mm code should not use
schedule_on_each_cpu() at all. I'll keep thinking about this issue for a while.

> There is nothing that makes lru_add_drain_all() the only such site, its
> the one Mike posted to me, and my patch was a way to deal with that.

Well, schedule_on_each_cpu() has very few users.
In practice we can ignore the other callers.

> I also explained that its not only RT related in that the HPC folks also
> want to avoid unneeded work -- for them its not starvation but a
> performance issue.

I think you are talking about the OS jitter issue. If so, I don't think this
causes a serious problem. OS jitter is mainly caused by periodic actions
(e.g. tick update, timers, vmstat update), because
  little-delay x plenty-times = large-delay

lru_add_drain_all() is called from very few places, e.g. mlock, shm-lock,
page migration, memory hotplug. None of these callers is periodic.
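
To put rough numbers on "little-delay x plenty-times", here is a small
sketch. The function name and every figure in the note below are assumptions
chosen for illustration, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total delay in nanoseconds caused by `events` interruptions that each
 * cost `per_event_ns` nanoseconds. Hypothetical helper for illustration.
 */
uint64_t total_delay_ns(uint64_t per_event_ns, uint64_t events)
{
	/* Refuse inputs whose product would overflow 64 bits. */
	assert(events == 0 || per_event_ns <= UINT64_MAX / events);

	return per_event_ns * events;
}
```

Assuming a 1000 Hz tick that costs 5 us each time, one minute gives
total_delay_ns(5000, 60000) = 300,000,000 ns, i.e. 0.3 s. A single drain
that costs a whole millisecond gives total_delay_ns(1000000, 1) =
1,000,000 ns, i.e. 1 ms.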

> In generic we should avoid doing work when there is no work to be done.

Probably. But I'm not sure ;)
