Re: [patch 1/2] fs: mnt_want_write speedup

From: Nick Piggin
Date: Thu Apr 02 2009 - 14:22:27 EST


On Wed, Mar 18, 2009 at 12:13:43PM -0700, Dave Hansen wrote:
> On Thu, 2009-03-12 at 05:13 +0100, Nick Piggin wrote:
> > On Wed, Mar 11, 2009 at 03:11:17PM -0700, Dave Hansen wrote:
> > > I'm feeling a bit better about these, although I am still honestly quite
> > > afraid of the barriers. I also didn't like all the #ifdefs much, but
> > > here's some help on that.
> >
> > FWIW, we have this in suse kernels because page fault performance was
> > so bad compared with SLES10. mnt_want_write & co was I think the 2nd
> > biggest offender for file backed mappings (after pvops). I think we're
> > around parity again even with pvops.
>
> Page faults themselves? Which path was that from?

Yes. file_update_time.


> > Basically I think we have to improve this one way or another in mainline
> > too. Is there any way to make you feel better about the barriers? More
> > comments?
> >
> > mnt_make_readonly() mnt_want_write()
> > 1. mnt_flags |= MNT_WRITE_HOLD A. mnt_writers[x]++
> > 2. smp_mb() B. smp_mb()
> > 3. count += mnt_writers[0] C. while (mnt_flags & MNT_WRITE_HOLD) ;
> > ... D. smp_rmb()
> > count += mnt_writers[N] E. if (mnt_flags & MNT_READONLY)
> > 4. if (count == 0) F. mnt_writers[x]-- /* fail */
> > 5. mnt_flags |= MNT_READONLY G. else /* success */
> > 6. else /* fail */
> > 7. smp_wmb()
> > 8. mnt_flags &= ~MNT_WRITE_HOLD
> >
> > * 2 ensures that 1 is visible before 3 is loaded
> > * B ensures that A is visible before C is loaded
> > * Therefore, either count != 0 at 4, or C will loop (or both)
> > * If count == 0
> > * (make_readonly success)
> > * C will loop until 8
> > * D ensures E is not loaded until loop ends
> > * 7 ensures 5 is visible before 8 is
> > * Therefore E will find MNT_READONLY (want_write fail)
> > * If C does not loop
> > * 4 will find count != 0 (make_readonly fail)
> > * Therefore 5 is not executed.
> > * Therefore E will not find MNT_READONLY (want_write success)
> > * If count != 0 and C loops
> > * (make_readonly fail)
> > * 5 will not be executed
> > * Therefore E will not find MNT_READONLY (want_write success)
>
> It is great to spell it out like that. But, honestly, I think the code
> and comments that are there are probably better than looking at an out
> of line description like that.

Well I think the above gives a higher confidence of correctness
than the code and comments posted (at least, until the reader
workss through similar steps themselves). At least if you don't
like it, then it could just go int othe patch changelog.


> > I don't know if that helps (I should reference which statements rely
> > on which). I think it shows that either one or the other only must
> > succeed.
> >
> > It does not illustrate how the loop in the want_write side prevents
> > the sumation from getting confused by decrementing count on a different
> > CPU than it was incremented, but I've commented that case in the code
> > fairly well I think.
>
> I think you mentioned a seqlock being a possibility here. That would
> slot in as a replacement for MNT_WRITE_HOLD, right? mnt_make_readonly()
> takes a seq_write, mnt_want_write() takes a seq_read and doesn't
> consider MNT_READONLY valid until it gets a clear read of it. It would
> internalize all the barriers into the seqlock implementation except for
> the mnt_writers[]-related ones.
>
> As for mnt_writers, you'd need an smp_rmb() in mnt_make_readonly()
> before reading the counters and an smp_wmb() after writing the counter
> in mnt_want_write().
>
> Is see that those are effectively consolidated down in your version into
> a single smp_mb() at both call sites when combined with the
> MNT_WRITE_HOLD barriers.
>
> If we can get this down to two explicit calls to the barrier functions
> that paired and very clear, I think I'd be much more OK with it.

seqlock as I said doesn't get its barriers quite right for having
stores in the read protected section. I was musing the possibility,
but I think it ends up being less clear if you just end up having
to a) look at seqlock implementation, b) work out it is insufficnet,
c) add barriers to make up for it. May as well just open code it
here with sufficient comment to justify correctness (which I think
I have come up with).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/