Re: [PATCH v2] mm: per-thread vma caching

From: Davidlohr Bueso
Date: Tue Feb 25 2014 - 23:04:46 EST


On Tue, 2014-02-25 at 18:04 -0800, Michel Lespinasse wrote:
> On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <davidlohr@xxxxxx> wrote:
> > This patch is a continuation of efforts trying to optimize find_vma(),
> > avoiding potentially expensive rbtree walks to locate a vma upon faults.
> > The original approach (https://lkml.org/lkml/2013/11/1/410), where the
> > largest vma was also cached, ended up being too specific and random, thus
> > further comparison with other approaches was needed. There are two things
> > to consider here: the cache hit rate and the latency of find_vma().
> > Improving the hit rate does not necessarily translate into finding the
> > vma any faster, as the overhead of a fancier caching scheme can be too
> > high to be worthwhile.
>
> Actually there is also the cost of keeping the cache up to date. I'm
> not saying that it's an issue in your proposal - I like the proposal,
> especially now that you are replacing the per-mm cache rather than
> adding something on top - but it is a factor to consider.

True, although the numbers show that the cost of maintaining the cache
is quite small. Invalidations are practically free (except in the rare
event of a seqnum overflow), so updating the cache is where most of the
cycles would go; but then again, the hit rate is good enough that I'm
not worried about that either.
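
For reference, invalidation in this scheme boils down to bumping a
per-mm sequence number, roughly like the sketch below (an illustration
of the idea, not the exact hunk from the patch):

static inline void vmacache_invalidate(struct mm_struct *mm)
{
	mm->vmacache_seqnum++;

	/* rare: on seqnum wrap-around, flush every thread's cached
	 * pointers so a stale entry cannot look valid again */
	if (unlikely(mm->vmacache_seqnum == 0))
		vmacache_invalidate_all();
}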

>
> > +static inline void __vmacache_invalidate(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_MMU
> > + vmacache_invalidate(mm);
> > +#else
> > + mm->vmacache = NULL;
> > +#endif
> > +}
>
> Is there any reason why we can't use your proposal for !CONFIG_MMU as well ?
> (I'm assuming that we could reduce preprocessor checks by doing so)

Based on Linus' feedback today, I'm getting rid of this ugliness and
trying to have per-thread caches for both configs.

> > +void vmacache_invalidate_all(void)
> > +{
> > + struct task_struct *g, *p;
> > +
> > + rcu_read_lock();
> > + for_each_process_thread(g, p) {
> > + /*
> > + * Only flush the vmacache pointers as the
> > + * mm seqnum is already set and curr's will
> > + * be set upon invalidation when the next
> > + * lookup is done.
> > + */
> > + memset(p->vmacache, 0, sizeof(p->vmacache));
> > + }
> > + rcu_read_unlock();
> > +}
>
> Two things:
>
> - I believe we only need to invalidate vma caches for threads that
> share a given mm ? we should probably pass in that mm in order to
> avoid over-invalidation

I think you're right: since the overflows always occur on mm->seqnum,
tasks that do not share the mm shouldn't be affected.

So the danger here is that when a lookup occurs, vmacache_valid() will
return true, having:

mm == curr->mm && mm->vmacache_seqnum == curr->vmacache_seqnum (both 0).

Then we just iterate the cache and potentially return some bogus vma.

However, since we're now going to reset the seqnum on every fork/clone
(before it was just the oldmm->seqnum + 1 thing), I doubt we'll ever
overflow.
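
To make that hazard concrete, the lookup-side check is roughly the
following (a sketch; the actual patch differs in the details):

static bool vmacache_valid(struct mm_struct *mm)
{
	struct task_struct *curr = current;

	if (mm != curr->mm)
		return false;

	if (mm->vmacache_seqnum != curr->vmacache_seqnum) {
		/* first lookup after an invalidation: resync the
		 * seqnum and flush this thread's cached pointers */
		curr->vmacache_seqnum = mm->vmacache_seqnum;
		memset(curr->vmacache, 0, sizeof(curr->vmacache));
		return false;
	}
	return true;
}

The overflow case is exactly the one where both seqnums read 0 while
curr->vmacache still holds pointers from a previous generation.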

> - My understanding is that the operation is safe because the caller
> has the mm's mmap_sem held for write, and other threads accessing the
> vma cache will have mmap_sem held at least for read, so we don't need
> extra locking to maintain the vma cache.

Yes, that's how I see things as well.

> Please 1- confirm this is the
> intention, 2- document this, and 3- only invalidate vma caches for
> threads that match the caller's mm so that mmap_sem locking can
> actually apply.

Will do.
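
Roughly, I'm thinking of something along these lines for the mm-scoped
flush (a sketch, not the final code):

/*
 * Flush the vma caches of all threads sharing @mm.  The caller holds
 * mm->mmap_sem for write; cache readers hold it at least for read,
 * so no additional locking is needed here.
 */
void vmacache_invalidate_all(struct mm_struct *mm)
{
	struct task_struct *g, *p;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (p->mm == mm)
			memset(p->vmacache, 0, sizeof(p->vmacache));
	}
	rcu_read_unlock();
}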

> > +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
> > + unsigned long addr)
> > +{
> > + int i;
> > +
> > + if (!vmacache_valid(mm))
> > + return NULL;
> > +
> > + for (i = 0; i < VMACACHE_SIZE; i++) {
> > + struct vm_area_struct *vma = current->vmacache[i];
> > +
> > + if (vma && vma->vm_start <= addr && vma->vm_end > addr)
> > + return vma;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
> > + struct vm_area_struct *newvma)
> > +{
> > + /*
> > + * Hash based on the page number. Provides a good
> > + * hit rate for workloads with good locality and
> > + * those with random accesses as well.
> > + */
> > + int idx = (addr >> PAGE_SHIFT) & 3;
> > + current->vmacache[idx] = newvma;
> > +}
>
> I did read the previous discussion about how to compute idx here. I
> did not at the time realize that you are searching all 4 vmacache
> entries on lookup - that is, we are only talking about eviction policy
> here, not a lookup hash policy.

Right.

> My understanding is that the reason both your current and your
> previous idx computations work is that a random eviction policy would
> work too. Basically, what you do is pick some address bits that are
> 'random enough' to use as an eviction policy.

What do you mean by "random enough"? I assume that would be something
like my original scheme where I used the last X bits of the offset
within the page.
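
(That is, something along the lines of:

	/* pick the eviction slot from the low bits of the in-page
	 * offset rather than the page number */
	int idx = addr & (VMACACHE_SIZE - 1);

give or take exactly which bits end up being used.)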

>
> This is more of a question for Linus, but I am very surprised that I
> can't find an existing LRU eviction policy scheme in Linux. What I
> have in mind is to keep track of the order the cache entries have last
> been used. With 4 entries, there are 4! = 24 possible orders, which
> can be represented with an integer between 0 and 23. When
> vmacache_find succeeds, that integer is updated using a table lookup
> (table takes 24*4 = 96 bytes). In vmacache_update, the lru value
> modulo 4 indicates which cache way to evict (i.e. it's the one that's
> been least recently used).
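
Just to make sure I read you right: the bookkeeping you're describing
would be something like the sketch below. The names are made up and I'm
packing the order into one byte rather than the 0..23 integer plus
lookup table you mention, but the policy is the same - vmacache_find()
would touch the way it hit, and vmacache_update() would evict the
victim instead of hashing the address:

#define VMACACHE_WAYS	4

/* per-thread in a real implementation; MRU-to-LRU order packed as
 * four 2-bit way indices, initialised to the order 0,1,2,3 */
static unsigned char vmacache_lru = 0xe4;

/* on a cache hit, move 'way' to the most-recently-used position */
static void vmacache_lru_touch(int way)
{
	unsigned char order = vmacache_lru, neworder = way;
	int i, shift = 2;

	for (i = 0; i < VMACACHE_WAYS; i++) {
		int w = (order >> (2 * i)) & 3;

		if (w == way)
			continue;
		neworder |= w << shift;
		shift += 2;
	}
	vmacache_lru = neworder;
}

/* on update, evict the way sitting in the least-recently-used slot */
static int vmacache_lru_victim(void)
{
	return (vmacache_lru >> (2 * (VMACACHE_WAYS - 1))) & 3;
}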

While not completely related, I did play with a mod 4 hashing scheme
before I got to the one I'm proposing now. It was just not as effective.

Thanks,
Davidlohr
