Re: smp race fix between invalidate_inode_pages* and do_no_page

From: Hugh Dickins
Date: Fri Apr 07 2006 - 15:18:56 EST


On Sat, 1 Apr 2006, Andrew Morton wrote:
> Andrea Arcangeli <andrea@xxxxxxx> wrote:
> >
> > The new PG_truncate bitflag can be used as well to eliminate the
> > truncate_count from all vmas etc... that would save substantial memory
> > and remove some complexity, truncate_count grown a lot since the time we
> > introduced it.

truncate_count is playing a useful role with the vma tree, allowing
us to find our place again after allowing preemption in. Though I'm
sometimes wondering whether just to make the vma _bigger_ and add a
list instead of that truncate_count: construct a list of vmas from
the prio_tree for unmap_mapping_range, so it can keep place in that,
instead of having to restart the tree whenever preempted. But I
think NFS doesn't hold i_sem across unmapping, so it wouldn't work.

> > The PG_truncate is needed as well because we can't know in do_no_page if
> > page->mapping is legitimate null or not (think bttv and other device
> > drivers returning page->mapping null because they're private but not
> > reserved pages etc..)

That part is easily dealt with, as Nick and I have suggested,
just by marking vmas liable to having their pages truncated.
But that's certainly no more than one part of a larger solution.

> The patch will pretty clearly fix that. But it really would be better to
> place all the cost over on the invalidate side, rather than adding overhead
> to the pagefault path.

I see ~2% slowdown in relevant faulting tests e.g. the lmbench ones.
I'd certainly prefer not to be locking pages here: or not without
word from the high-scalability people that Andrea's patch really in
practice doesn't hurt them (but the ~2% I see is on UP as much as MP).

> Could we perhaps check the page_count(page) and/or page_mapped(page) in
> invalidate_complete_page(), when we have tree_lock?

I'm thinking on it, but not finding it easy. And I've just realized
that invalidate_inode_pages2 isn't the only problematic case here
(though agreed the more urgent one): MADV_REMOVE likewise needs to
invalidate pages without adjusting i_size, so also cannot rely on
the truncate_count/i_size method.

> Do you have handy any test code which others can use to recreate this bug?

I'd be glad of that too.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/