Re: [RFC] New locking primitive for 2.5

From: Anton Altaparmakov (aia21@cam.ac.uk)
Date: Fri Feb 08 2002 - 16:55:32 EST


At 20:38 08/02/02, yodaiken@fsmlabs.com wrote:
>On Fri, Feb 08, 2002 at 08:14:32PM +0000, Anton Altaparmakov wrote:
> > The value of allowing multiple cpus to read the same data simultaneously
> > by far offsets the priority problems IMVHO. At least the way I am using
> > rw semaphores in ntfs it is. Readlocks are grabbed loads and loads of
> > times to serialize meta data access in the page cache while writelocks
> > are a minute number in comparison and because the data required to be
> > accessed may not
>
>this is absolutely correct. However, once the decision has been made (or
>fallen into) to go to a priority inheritance scheme, Linux will find
>itself in the same bind as Solaris.
>
> > be cached in memory (page cache page is not read in, is swapped out,
> > whatever) a disk access may be required which means a rw spin lock is no
> > good. In fact ntfs would be the perfect candidate for automatic rw combi
> > locks where the locking switches from spinning to sleeping if the code
> > path reaches a disk access. I can't use a manually controlled lock as
> > the page
>
>Seems like the lock is simply grabbed way too far up.

It cannot be taken any later. NTFS is a database both in memory and on
disk, and in fact the in-memory metadata is the on-disk structures
themselves (i.e. no conversion between the two is performed, except in a
few speed-optimization cases and in cases where the metadata is compressed
on disk).

Because everything is organized in database records, I need to grab the
lock as soon as I intend to access the record. (In fact the lock itself is
taken by map_mft_record(READ or WRITE), which returns a locked record,
mapped and pinned in memory, to the caller.) I can then do all sorts of
things with the record, like search it, parse data from it, etc. When
finished, I call unmap_mft_record(READ or WRITE), which undoes everything
map_mft_record() did, i.e. unpins the record, unmaps it and releases the
lock.
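
For illustration, a minimal sketch of that pattern (the rw-semaphore
implementation, the structure layout and the exact signatures here are
just shorthand for this mail, not the actual ntfs code):

#include <linux/fs.h>           /* READ, WRITE */
#include <linux/rwsem.h>

struct mft_record;              /* the on-disk record, page cache backed */

struct ntfs_inode {
        struct rw_semaphore mrec_lock;  /* serializes mft record access */
        /* ... */
};

/* Returns the mft record locked, mapped and pinned in memory. */
struct mft_record *map_mft_record(int rw, struct ntfs_inode *ni)
{
        if (rw == READ)
                down_read(&ni->mrec_lock);
        else
                down_write(&ni->mrec_lock);
        /* ... look up, map and pin the page cache page holding the
           record, reading it in from disk if necessary ... */
        return NULL;    /* really: pointer to the record in that page */
}

void unmap_mft_record(int rw, struct ntfs_inode *ni)
{
        /* ... unpin and unmap the page ... */
        if (rw == READ)
                up_read(&ni->mrec_lock);
        else
                up_write(&ni->mrec_lock);
}

A caller does map_mft_record(READ, ni), searches/parses the record, and
unmap_mft_record(READ, ni) when finished.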

All transactions with the on-disk storage are done via the page cache and
read_cache_page(), using a special readpage() and a special async i/o
completion handler.
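
In outline (ntfs_mft_readpage is again just a shorthand name and the
error handling is simplified; read_cache_page() itself is the stock VFS
interface):

#include <linux/fs.h>
#include <linux/pagemap.h>

int ntfs_mft_readpage(void *data, struct page *page); /* special readpage */

/* Get the page cache page holding mft record @index, reading it from
   disk with our own readpage if it is not already cached. */
struct page *ntfs_get_mft_page(struct address_space *mapping,
                unsigned long index)
{
        struct page *page = read_cache_page(mapping, index,
                        ntfs_mft_readpage, NULL);

        if (!IS_ERR(page)) {
                wait_on_page(page);     /* async i/o completion */
                if (!Page_Uptodate(page)) {
                        page_cache_release(page);
                        page = ERR_PTR(-EIO);
                }
        }
        return page;
}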

Because of the layered access abstraction it is undesirable to operate on
the same locks in different layers, hence the topmost layer is the one
locking and unlocking the records; the lower layers don't even know the
locks exist. Pushing the lock deep into the layers is not actually
possible either, as it would have to be taken in read_cache_page() (the
only place where the code knows whether the page is present in memory or
a disk access is required), which is a VFS/MM function, so that is not an
option.

Further, I may need to allocate memory, for example to store decompressed
metadata, but I won't know whether I need to do that until I have already
locked the record and accessed it to determine whether the metadata is
compressed.

So basically I can only obtain sufficient information to decide which lock
I need after I have already accessed parts of the record, but to access it
I need to have locked it. A chicken-and-egg situation... So the locks are
always semaphores and are taken in the top layer.

A combi lock would allow spinning in the case where the code paths turn
out not to hit the disk or sleep in kmalloc, etc., and sleeping in the
case where they do.

(Yes, I know atomic kmalloc exists, but I actually need vmalloc in some
cases, depending on how much memory is needed to handle the metadata...)
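
Just to illustrate the mechanism (a toy, not a proposal, and deliberately
the manually controlled variant - the automatic one I want would trigger
the switch from inside read_cache_page()/the allocator rather than via an
explicit call), a combi lock could look something like this:

/* Waiters busy-wait while the holder is in a fast section and go to
   sleep once the holder announces it may block.  No fairness, no
   read/write distinction, no barrier audit - illustration only. */
#include <asm/bitops.h>
#include <asm/processor.h>      /* cpu_relax() */
#include <asm/semaphore.h>

struct combi_lock {
        unsigned long sleepy;   /* bit 0 set: waiters sleep, not spin */
        struct semaphore sem;   /* the single ownership token */
};

static inline void combi_lock_init(struct combi_lock *cl)
{
        cl->sleepy = 0;
        init_MUTEX(&cl->sem);
}

static inline void combi_lock(struct combi_lock *cl)
{
        /* Spin while the holder has not announced it may block. */
        while (down_trylock(&cl->sem)) {
                if (test_bit(0, &cl->sleepy)) {
                        down(&cl->sem); /* holder may hit disk: sleep */
                        break;
                }
                cpu_relax();
        }
        clear_bit(0, &cl->sleepy);      /* we are the holder now */
}

/* The holder calls this before anything that may sleep (disk i/o,
   vmalloc, ...); from then on waiters sleep instead of spinning. */
static inline void combi_may_sleep(struct combi_lock *cl)
{
        set_bit(0, &cl->sleepy);
}

static inline void combi_unlock(struct combi_lock *cl)
{
        up(&cl->sem);
}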

Best regards,

Anton

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
