Re: [PATCH] anobjrmap 9 priority mjb tree

From: Rajesh Venkatasubramanian
Date: Wed Apr 14 2004 - 22:45:54 EST



> > 2) However, vmas can be added and removed from a vm_set list
> > by just holding the read lock and a bit lock (vm_set_lock)
> > in the corresponding prio_tree node.
>
> no way, you cannot bitflip vm_flags unless you own the mmap_sem, this
> patch seems very broken to me, it should randomly corrupt memory in
> vma->vm_flags while racing against mprotect etc.. or am I missing
> something?

I don't know why bit_spin_lock on vma->vm_flags should be a problem
when it is used without mmap_sem. Can you explain?
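For reference, the interaction being debated looks roughly like this
(a minimal sketch only; the bit number is made up and the list walk is
elided, so this is not the exact patch code):

/* Sketch only: illustrative bit number, not the exact patch code. */
#include <linux/mm.h>
#include <linux/spinlock.h>

#define VM_SET_LOCK_BIT	27	/* assumed free bit in vma->vm_flags */

/* Pageout side (no mmap_sem held): lock the vm_set list hanging off
 * this prio_tree node before walking it. */
static void vm_set_walk(struct vm_area_struct *vma)
{
	bit_spin_lock(VM_SET_LOCK_BIT, &vma->vm_flags);
	/* ... walk the vm_set list ... */
	bit_spin_unlock(VM_SET_LOCK_BIT, &vma->vm_flags);
}

/* Meanwhile mprotect, holding mmap_sem but not this bit lock, updates
 * the flags with a plain non-atomic store:
 *
 *	vma->vm_flags = newflags;
 *
 * bit_spin_lock() itself uses atomic test_and_set_bit(); the question
 * is whether the plain store above can overwrite the lock bit.
 */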

Anyway, in my patch, the only places that use vm_set_lock without
mmap_sem are __vma_prio_tree_first_lock and __vma_prio_tree_next_lock.
If it is really racy to use bit_spin_lock on vm_flags without mmap_sem
(I am not sure, and I am not convinced that it is racy), then we can
revert those changes and take down_write on the pageout path.

Well, in that case we can use an rwsem as you mentioned below: take
down_write for all modifications and down_read for pageout. There you
allow multiple page_referenced and try_to_unmap walks in parallel on
the same inode, but only one modification at a time.
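To make the comparison concrete, that scheme would look roughly like
this (sketch only; the i_shared_rwsem field name is made up and the
tree operations are elided):

/* Sketch of the plain rwsem scheme -- field name assumed. */
#include <linux/rwsem.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Modification side: inserting/removing a vma in i_mmap{_shared}. */
static void vma_link(struct address_space *mapping,
		     struct vm_area_struct *vma)
{
	down_write(&mapping->i_shared_rwsem);	/* excludes everybody */
	/* ... __vma_prio_tree_insert(vma, ...) ... */
	up_write(&mapping->i_shared_rwsem);
}

/* Pageout side: page_referenced()/try_to_unmap() walk the tree. */
static void objrmap_walk(struct address_space *mapping)
{
	down_read(&mapping->i_shared_rwsem);	/* walks run in parallel */
	/* ... walk the prio_tree and each vm_set list ... */
	up_read(&mapping->i_shared_rwsem);
}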

Whereas my solution allows multiple modifications at the same time
(where possible), with only one pageout routine at a time. I chose
this solution because Martin's SDET took a big hit in the common cases
of adding and removing vmas from the i_mmap{_shared} data structure.
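For contrast, the modification path in my scheme is roughly the
following (again only a sketch, reusing the made-up names from the
sketches above; the real code lives in __vma_prio_tree_insert and
friends):

/* Sketch of the patch's modification path -- names assumed as above. */
static void vma_add(struct address_space *mapping,
		    struct vm_area_struct *node,	/* prio_tree node */
		    struct vm_area_struct *vma)
{
	down_read(&mapping->i_shared_rwsem);	/* adds run in parallel */
	bit_spin_lock(VM_SET_LOCK_BIT, &node->vm_flags);
	/* ... add vma to this node's vm_set list ... */
	bit_spin_unlock(VM_SET_LOCK_BIT, &node->vm_flags);
	up_read(&mapping->i_shared_rwsem);
}

/* Truncate still takes the lock in write mode, so it excludes both
 * modifications and the vm_set walks (point 4 above). */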

Thanks,
Rajesh

>
> > 3) All objrmap functions just hold read lock now. So when we
> > walk a vm_set list we have to hold the corresponding
> > vm_set_lock.
> > 4) Since truncate uses write lock (provides exclusion) we don't
> > have to take vm_set_locks.
> >
> > Martin! When you get time to test your SDET with this patch, please
> > let me know whether this patch helps you at all. The patch applies
> > on top of 2.6.5-mjb1+anobjrmap9_prio_tree.
>
> I considered converting it to an rwsem too; details are in the email
> I posted while providing the rwspinlock solution to the parisc cache
> flushing code.
>
> As I wrote there, I wasn't convinced that in the common case this is
> going to gain anything significant (the only thing that sleeps while
> the semaphore is held is truncate, and truncate during paging on the
> same inode isn't an extremely common case, especially for the big
> apps), and it makes things a bit more complicated, but giving it a
> try will be interesting. I was mostly interested in having the
> objrmap code very rarely fail the trylock during paging (that
> semaphore is by far the biggest scalability hit during paging of
> shm, but the cacheline bouncing won't be avoided by the rwsem). To
> make the paging scale better (something SDET cannot measure) I don't
> need a safe vm_set_lock; I believe simply making it an rwsem is the
> way to go just to make the paging potentially scale a bit better. I
> rated implementing the locking abstraction to fix up the basic
> parisc race as a bit higher priority; after that it should be easy
> to have it implement an rwsem for all archs w/o cache flushing,
> since the abstraction will have to expose read/write functionality
> for the rwlock. I'm not convinced your double locking is going to
> boost anything even if it were safe; I'd just take it in write mode
> when the tree is being modified, with the only object of keeping the
> paging from blocking (and potentially avoiding blocking against big
> truncates too).
>