Re: [PATCH] vfs: Avoid IPI storm due to bh LRU invalidation

From: Andrew Morton
Date: Mon Feb 06 2012 - 16:17:28 EST


On Mon, 6 Feb 2012 17:47:32 +0100
Jan Kara <jack@xxxxxxx> wrote:

> On Mon 06-02-12 21:12:36, Srivatsa S. Bhat wrote:
> > On 02/06/2012 07:25 PM, Jan Kara wrote:
> >
> > > When discovery of lots of disks happens in parallel, we call
> > > invalidate_bh_lrus() once for each disk from partitioning code, resulting in a
> > > storm of IPIs and causing softlockup detection to fire (it takes several
> > > *minutes* for a machine to execute all the invalidate_bh_lrus() calls).

Gad. How many disks are we talking about here?

> > > Fix the issue by allowing only a single invalidation to run using a mutex and let
> > > waiters on the mutex figure out whether someone invalidated LRUs for them while
> > > they were waiting.
> > >
> > > Signed-off-by: Jan Kara <jack@xxxxxxx>
> > > ---
> > > fs/buffer.c | 23 ++++++++++++++++++++++-
> > > 1 files changed, 22 insertions(+), 1 deletions(-)
> > >
> > > I feel this is a slightly hacky approach but it works. If someone has a better
> > > idea, please speak up.
> > >
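
To make the mutex scheme in the changelog concrete, here is a minimal
userspace sketch. It is not the code from the patch (the diff is not
quoted here): the names inval_mutex, inval_done, ipi_rounds and
invalidate_lrus_coalesced() are made up, a pthread mutex stands in for a
kernel mutex, and expensive_invalidate() stands in for the IPI broadcast.
A waiter skips its own pass only if two passes completed while it
waited, since a single completed pass may have started before the waiter
arrived.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* All names here are hypothetical; this is not fs/buffer.c. */
static pthread_mutex_t inval_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_ulong inval_done;   /* completed passes, bumped under inval_mutex */
static unsigned long ipi_rounds;  /* how often the expensive work actually ran */

/* Stand-in for the cross-CPU broadcast that empties every per-CPU LRU. */
static void expensive_invalidate(void)
{
	ipi_rounds++;
}

/*
 * A pass that finished after we sampled the counter may have started
 * before we arrived, so one completion is not enough.  A second pass can
 * only start once the first has finished, i.e. after we arrived.
 * Unsigned subtraction keeps this correct across counter wraparound.
 */
static int covered_by_others(unsigned long seen, unsigned long now)
{
	return now - seen >= 2;
}

static void invalidate_lrus_coalesced(void)
{
	/* Sample the counter before queueing up on the mutex. */
	unsigned long seen = atomic_load(&inval_done);
	int rc;

	rc = pthread_mutex_lock(&inval_mutex);
	assert(rc == 0);
	if (!covered_by_others(seen, atomic_load(&inval_done))) {
		expensive_invalidate();
		atomic_fetch_add(&inval_done, 1);
	}
	pthread_mutex_unlock(&inval_mutex);
}
```

With N callers arriving at once, the first two run a pass each and the
rest see the counter advanced by two and return without doing any work.
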
> >
> >
> > Something related that you might be interested in:
> > https://lkml.org/lkml/2012/2/5/109
> >
> > (This is part of Gilad's patchset that tries to reduce cross-CPU IPI
> > interference.)
> Thanks for the pointer. I didn't know about it. As Hannes wrote, this
> need not be enough for our use case as there might indeed be some bhs in
> the LRU. But I'd be interested in how well the patchset works anyway. Maybe it
> would be enough because, after all, when we invalidate LRUs subsequent
> callers will see them empty and not issue an IPI? Hannes, can you give the
> patches a try?

If that doesn't work then an option to think about is to have a bool to
disable the bh LRU code. That would add a test-n-branch to
__find_get_block(), which wouldn't kill us. Arrange for the LRU code
to be disabled during device probing. Or just leave the LRU disabled
until very late in boot, perhaps.
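
Roughly the shape I have in mind, as an untested userspace sketch rather
than a patch against fs/buffer.c. Everything named here is invented for
illustration (bh_lru_enabled, find_get_block_sketch(), the toy
single-CPU lru[] array and the slow_lookup() stand-in for the pagecache
walk). The point is only that while the flag is clear nothing ever
enters the LRU, so there is nothing to invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LRU_SIZE   8    /* illustrative size only */
#define NR_BLOCKS  64   /* size of the toy backing store */

struct buf {
	unsigned long block;
};

static bool bh_lru_enabled;           /* hypothetical flag; false during probing */
static struct buf *lru[LRU_SIZE];     /* stand-in for one CPU's bh LRU */
static struct buf backing[NR_BLOCKS]; /* stand-in for the pagecache lookup */

/* The slow path: always works, never touches the LRU. */
static struct buf *slow_lookup(unsigned long block)
{
	assert(block < NR_BLOCKS);
	backing[block].block = block;
	return &backing[block];
}

static struct buf *lookup_lru(unsigned long block)
{
	for (size_t i = 0; i < LRU_SIZE; i++)
		if (lru[i] && lru[i]->block == block)
			return lru[i];
	return NULL;
}

/* Push to the front; the oldest entry falls off the end. */
static void install_lru(struct buf *b)
{
	for (size_t i = LRU_SIZE - 1; i > 0; i--)
		lru[i] = lru[i - 1];
	lru[0] = b;
}

static size_t lru_population(void)
{
	size_t n = 0;

	for (size_t i = 0; i < LRU_SIZE; i++)
		if (lru[i])
			n++;
	return n;
}

static struct buf *find_get_block_sketch(unsigned long block)
{
	struct buf *b = NULL;

	if (bh_lru_enabled)             /* the extra test-n-branch */
		b = lookup_lru(block);
	if (!b) {
		b = slow_lookup(block);
		if (bh_lru_enabled)
			install_lru(b); /* only cache once the LRU is switched on */
	}
	return b;
}
```

Whoever flips the flag back off later would still need one real
invalidation pass to flush what was cached while it was on.
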

Also, I'm wondering why we call invalidate_bh_lrus() at all during
partition reading. Presumably it's where we're shooting down the
blockdev pagecache (you didn't tell us and I'm too lazy to hunt it
down). But do we really need to drop the pagecache at
whatever-this-callsite-is?