Re: [PATCH 0/6 RFC] Mapping range lock

From: Jan Kara
Date: Mon Feb 04 2013 - 07:38:38 EST


On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:48 +0100
> Jan Kara <jack@xxxxxxx> wrote:
>
> > There are several different motivations for implementing mapping range
> > locking:
> >
> > a) Punch hole is currently racy wrt mmap (page can be faulted in in the
> > punched range after page cache has been invalidated) leading to nasty
> > results as fs corruption (we can end up writing to already freed block),
> > user exposure of uninitialized data, etc. To fix this we need some new
> > mechanism of serializing hole punching and page faults.
>
> This one doesn't seem very exciting - perhaps there are local fixes
> which can be made?
I agree this probably won't be triggered by accident since punch hole
uses are limited. But a malicious user is a different thing...

Regarding a local fix - local in what sense? We could fix it inside each
filesystem separately, but the number of filesystems supporting punch hole
is growing, so I don't think it's a good decision for each of them to devise
its own synchronization mechanism. Fixing it 'locally' in the sense that we
fix just the mmap vs punch hole race is possible, but we need some
synchronization of page fault and punch hole - likely in the form of an rwsem
where page fault takes the reader side and punch hole the writer side. So
this "minimal" fix requires an additional rwsem in struct address_space and
also adds some cost to the page fault path. That cost is likely lower than
the cost of range locking, but it is not zero.

> > b) There is an uncomfortable number of mechanisms serializing various paths
> > manipulating pagecache and data underlying it. We have i_mutex, page lock,
> > checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
> > Different pairs of operations are serialized by different mechanisms and
> > not all the cases are covered. Case (a) above is likely the worst but DIO
> > vs buffered IO isn't ideal either (we provide only limited consistency).
> > The range locking should somewhat simplify serialization of pagecache
> > operations. So i_dio_count can be removed completely, i_mutex to certain
> > extent (we still need something for things like timestamp updates,
> > possibly for i_size changes although those can be dealt with I think).
>
> Those would be nice cleanups and simplifications, to make kernel
> developers' lives easier. And there is value in this, but doing this
> means our users incur real costs.
>
> I'm rather uncomfortable with changes which make our lives easier at the
> expense of our users. If we had an infinite amount of labor, we
> wouldn't do this. In reality we have finite labor, but a small cost
> dispersed amongst millions or billions of users becomes a very large
> cost.
I agree there's a cost (as with everything) and personally I feel the
cost is larger than I'd like so we mostly agree on that. OTOH I don't quite
buy the argument "multiplied by millions or billions of users" - the more
machines running the code, the more wealth these machines hopefully
generate ;-). So the additional cost starts to matter when it makes the
code not worth using for some purposes. But this is really
philosophy :)

> > c) i_mutex doesn't allow any parallelism of operations using it and some
> > filesystems work around this for specific cases (e.g. DIO reads). Using
> > range locking allows for concurrent operations (e.g. writes, DIO) on
> > different parts of the file. Of course, range locking itself isn't
> > enough to make the parallelism possible. Filesystems still have to
> > somehow deal with the concurrency when manipulating inode allocation
> > data. But the range locking at least provides a common VFS mechanism for
> > serialization VFS itself needs and it's up to each filesystem to
> > serialize more if it needs to.
>
> That would be useful to end-users, but I'm having trouble predicting
> *how* useful.
As Zheng said, there are people interested in this for DIO. Currently
filesystems each invent their own tweaks to avoid the serialization at
least for the easiest cases.
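
To illustrate why range locking permits the parallelism discussed in (c),
here is a minimal userspace sketch of the conflict rule only: two ranges
conflict if and only if they overlap, so operations on disjoint parts of a
file need not serialize. This is not the implementation from the patch set.
struct range_lock_sketch, range_trylock() and range_unlock() are hypothetical
names, all locks are treated as exclusive, there is no waiting or threading,
and a fixed array stands in for whatever data structure a real
implementation would use.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 16

/* One held range, inclusive on both ends: [start, end]. */
struct held_range {
	unsigned long start;
	unsigned long end;
	bool used;
};

/* Hypothetical per-mapping table of currently held ranges. */
struct range_lock_sketch {
	struct held_range held[MAX_HELD];
};

static bool ranges_overlap(unsigned long a_start, unsigned long a_end,
			   unsigned long b_start, unsigned long b_end)
{
	return a_start <= b_end && b_start <= a_end;
}

/*
 * Try to take [start, end]. Returns a slot index (>= 0) on success, or -1
 * if the range overlaps a held range or the table is full.
 */
static int range_trylock(struct range_lock_sketch *rl,
			 unsigned long start, unsigned long end)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_HELD; i++) {
		if (!rl->held[i].used) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (ranges_overlap(start, end,
				   rl->held[i].start, rl->held[i].end))
			return -1;
	}
	if (free_slot < 0)
		return -1;
	rl->held[free_slot] = (struct held_range){ start, end, true };
	return free_slot;
}

/* Release a range previously returned by range_trylock(). */
static void range_unlock(struct range_lock_sketch *rl, int slot)
{
	assert(slot >= 0 && slot < MAX_HELD && rl->held[slot].used);
	rl->held[slot].used = false;
}
```

With i_mutex, the two disjoint writers in such a scheme would serialize;
here only the overlapping request is refused.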

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR