Re: [PATCH 1/6] Extend completions to provide XFS object flushrequirements

From: Dave Chinner
Date: Thu Jun 26 2008 - 08:21:26 EST


On Thu, Jun 26, 2008 at 05:42:42AM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 09:32:09PM +1000, Dave Chinner wrote:
> > On Thu, Jun 26, 2008 at 05:26:12AM -0600, Matthew Wilcox wrote:
> > > On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> > > > XFS object flushing doesn't quite match existing completion semantics. It
> > > > mixed exclusive access with completion. That is, we need to mark an object as
> > > > being flushed before flushing it to disk, and then block any other attempt to
> > > > flush it until the completion occurs.
> > >
> > > This sounds like mutex semantics. Why are the existing mutexes not
> > > appropriate for your needs?
> >
> > Different threads doing wait and complete.
>
> Then let's leave it as a semaphore. You can get rid of the sema_t if
> you like, but I don't think that turning completions into semaphores is
> a good idea (because it's confusing).

So remind me what the point of the semaphore removal tree is again?

As Christoph suggested, I can put this under another API that
is implemented using completions. If I have to do that in XFS,
so be it....

The main reason for this that we've just uncovered the fact that the
way XFS uses semaphores is completely unsafe [*] on x86/x86_64 for
kernels prior to the new generic semaphores.

[*] 2.6.20 panics in up() because of this race when I/O completion
(the up call) races with a simultaneous down() (iowaiter):

T1 T2
up() down()
kmem_free()

When the down() call completes, the up() call can still be
referencing the semaphore, and hence if we free the structure after
the down call then the up() will reference freed memory. This is
probably the cause of many unexplained log replay or unmount panics
that we've been hitting for years with buffers that been freed while
apparently still in use....

Hence I'd prefer just to move completely away from semaphores for
this flush interface. I'd like to start with getting the upstream
code fixed in a sane manner so all the backports to older kernels
start from the same series of commits.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/