Re: [PATCH v2] mm: per-thread vma caching

From: Michel Lespinasse
Date: Wed Feb 26 2014 - 02:52:50 EST


On Tue, Feb 25, 2014 at 8:04 PM, Davidlohr Bueso <davidlohr@xxxxxx> wrote:
> On Tue, 2014-02-25 at 18:04 -0800, Michel Lespinasse wrote:
>> On Tue, Feb 25, 2014 at 10:16 AM, Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>> > This patch is a continuation of efforts trying to optimize find_vma(),
>> > avoiding potentially expensive rbtree walks to locate a vma upon faults.
>> > The original approach (https://lkml.org/lkml/2013/11/1/410), where the
>> > largest vma was also cached, ended up being too specific and random, thus
>> > further comparison with other approaches was needed. There are two things
>> > to consider when dealing with this: the cache hit rate and the latency of
>> > find_vma(). Improving the hit rate does not necessarily translate into
>> > finding the vma any faster, as the overhead of any fancy caching scheme
>> > can be too high to consider.
>>
>> Actually there is also the cost of keeping the cache up to date. I'm
>> not saying that it's an issue in your proposal - I like the proposal,
>> especially now that you are replacing the per-mm cache rather than
>> adding something on top - but it is a factor to consider.
>
> True, although numbers show that the cost of maintaining the cache is
> quite minimal. Invalidations are a free lunch (except in the rare event
> of a seqnum overflow), so the updating part would consume the most
> cycles, but then again, the hit rate is quite good so I'm not worried
> about that either.

Yes. I like your patch precisely because it keeps maintenance costs low.

>> > +void vmacache_invalidate_all(void)
>> > +{
>> > +	struct task_struct *g, *p;
>> > +
>> > +	rcu_read_lock();
>> > +	for_each_process_thread(g, p) {
>> > +		/*
>> > +		 * Only flush the vmacache pointers as the
>> > +		 * mm seqnum is already set and curr's will
>> > +		 * be set upon invalidation when the next
>> > +		 * lookup is done.
>> > +		 */
>> > +		memset(p->vmacache, 0, sizeof(p->vmacache));
>> > +	}
>> > +	rcu_read_unlock();
>> > +}
>>
>> Two things:
>>
>> - I believe we only need to invalidate vma caches for threads that
>> share a given mm? We should probably pass in that mm in order to
>> avoid over-invalidation.
>
> I think you're right: since the overflows will always occur on
> mm->seqnum, tasks that do not share the mm shouldn't be affected.
>
> So the danger here is that when a lookup occurs, vmacache_valid() will
> return true, having:
>
> mm == curr->mm && mm->vmacache_seqnum == curr->vmacache_seqnum (both 0).
>
> Then we just iterate the cache and potentially return some bogus vma.
>
> However, since we're now going to reset the seqnum on every fork/clone
> (before it was just the oldmm->seqnum + 1 thing), I doubt we'll ever
> overflow.

I'm concerned about it *precisely* because it won't happen often, and
so it'd be hard to debug if we had a problem there. 64 bits would be
safe, but a 32-bit counter doesn't take that long to overflow, and I'm
sure it will happen once in a while in production.

Actually, I think there is a case for masking the seqnum with a
constant (all ones in production builds, but something shorter when
CONFIG_DEBUG_VM is enabled) so that this code is easier to exercise.
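
Roughly, as a sketch (VMACACHE_SEQNUM_MASK is a made-up name, and this
assumes the mm-filtered flavor of vmacache_invalidate_all() discussed
above):

#ifdef CONFIG_DEBUG_VM
#define VMACACHE_SEQNUM_MASK	0xffU	/* wrap every 256 bumps */
#else
#define VMACACHE_SEQNUM_MASK	(~0U)
#endif

	/* wherever the mm seqnum gets bumped: */
	mm->vmacache_seqnum = (mm->vmacache_seqnum + 1) & VMACACHE_SEQNUM_MASK;
	if (unlikely(!mm->vmacache_seqnum))
		vmacache_invalidate_all(mm);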

>> - My understanding is that the operation is safe because the caller
>> has the mm's mmap_sem held for write, and other threads accessing the
>> vma cache will have mmap_sem held at least for read, so we don't need
>> extra locking to maintain the vma cache.
>
> Yes, that's how I see things as well.
>
>> Please 1- confirm this is the
>> intention, 2- document this, and 3- only invalidate vma caches for
>> threads that match the caller's mm so that mmap_sem locking can
>> actually apply.
>
> Will do.

Thanks :)
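
FWIW, I was thinking of something along these lines (untested sketch,
just to illustrate the mm filtering):

void vmacache_invalidate_all(struct mm_struct *mm)
{
	struct task_struct *g, *p;

	/*
	 * The caller holds mm->mmap_sem for write, and lookups take
	 * mmap_sem at least for read, so no extra locking is needed
	 * to flush the caches of the threads sharing this mm.
	 */
	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (p->mm == mm)
			memset(p->vmacache, 0, sizeof(p->vmacache));
	}
	rcu_read_unlock();
}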

>> > +struct vm_area_struct *vmacache_find(struct mm_struct *mm,
>> > +				     unsigned long addr)
>> > +{
>> > +	int i;
>> > +
>> > +	if (!vmacache_valid(mm))
>> > +		return NULL;
>> > +
>> > +	for (i = 0; i < VMACACHE_SIZE; i++) {
>> > +		struct vm_area_struct *vma = current->vmacache[i];
>> > +
>> > +		if (vma && vma->vm_start <= addr && vma->vm_end > addr)
>> > +			return vma;
>> > +	}
>> > +
>> > +	return NULL;
>> > +}
>> > +
>> > +void vmacache_update(struct mm_struct *mm, unsigned long addr,
>> > +		     struct vm_area_struct *newvma)
>> > +{
>> > +	/*
>> > +	 * Hash based on the page number. Provides a good
>> > +	 * hit rate for workloads with good locality and
>> > +	 * those with random accesses as well.
>> > +	 */
>> > +	int idx = (addr >> PAGE_SHIFT) & 3;
>> > +
>> > +	current->vmacache[idx] = newvma;
>> > +}
>>
>> I did read the previous discussion about how to compute idx here. I
>> did not at the time realize that you are searching all 4 vmacache
>> entries on lookup - that is, we are only talking about eviction policy
>> here, not a lookup hash policy.
>
> Right.
>
>> My understanding is that the reason both your current and your
>> previous idx computations work is that a random eviction policy would
>> work too. Basically, what you do is pick some address bits that are
>> 'random enough' to use as an eviction policy.
>
> What do you mean by random enough? I assume that would be something like
> my original scheme where I used the last X bits of the offset within the
> page.

What I mean is that if you used a per-thread random number generator
to compute idx, such as prandom_u32_state(), you'd get nice enough
results already - after all, random eviction is not that bad. So, my
hunch is that the particular way you compute idx based on the address
works not because it exploits some particular property of the access
patterns, but just because the sequence of indexes it ends up
generating is as good as a random one.
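
For illustration, something like this (vmacache_prandom is a
hypothetical per-task struct rnd_state, which doesn't exist - this is
a sketch, not a proposal):

	/* random eviction: every slot is equally likely to be replaced */
	int idx = prandom_u32_state(&current->vmacache_prandom) & 3;
	current->vmacache[idx] = newvma;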

>> This is more of a question for Linus, but I am very surprised that I
>> can't find an existing LRU eviction policy scheme in Linux. What I
>> have in mind is to keep track of the order in which the cache entries
>> have last been used. With 4 entries, there are 4! = 24 possible orders,
>> which can be represented with an integer between 0 and 23. When
>> vmacache_find succeeds, that integer is updated using a table lookup
>> (the table takes 24*4 = 96 bytes). In vmacache_update, the lru value
>> modulo 4 indicates which cache way to evict (i.e. the one that's
>> been least recently used).
>
> While not completely related, I did play with a mod 4 hashing scheme
> before I got to the one I'm proposing now. It was just not as effective.

I believe you mean using idx = addr % 4? I can see that this wouldn't
work well, because index 0 would be chosen way too often (e.g. any
time the address is aligned, or the first access into a given page is
from reading a byte stream that crosses into it, etc).

But what I propose above is entirely different: it just picks
whatever cache slot was least recently used for lookups. In this
context, the 'modulo 4' is only an implementation detail of how I
propose we track LRU order between the 4 cache slots.
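
To make that concrete, here is a rough user-space sketch of the
table-driven bookkeeping (the state numbering below is arbitrary, so I
evict perm[s][3] instead of relying on the state modulo 4; a real
implementation would pick an encoding where the two coincide):

#include <stdio.h>

static unsigned char perm[24][4];	/* state -> ways ordered MRU..LRU */
static unsigned char on_hit[24][4];	/* state x hit way -> new state (96 bytes) */

static int state_of(const unsigned char *o)
{
	int s;

	for (s = 0; s < 24; s++)
		if (perm[s][0] == o[0] && perm[s][1] == o[1] &&
		    perm[s][2] == o[2] && perm[s][3] == o[3])
			return s;
	return -1;
}

static void build_tables(void)
{
	int a, b, c, d, s = 0, w;

	/* enumerate the 4! = 24 orderings of the 4 cache ways */
	for (a = 0; a < 4; a++)
	for (b = 0; b < 4; b++)
	for (c = 0; c < 4; c++)
	for (d = 0; d < 4; d++) {
		if (a == b || a == c || a == d ||
		    b == c || b == d || c == d)
			continue;
		perm[s][0] = a; perm[s][1] = b;
		perm[s][2] = c; perm[s][3] = d;
		s++;
	}

	/* a hit on way w moves it to the MRU slot, keeping the rest in order */
	for (s = 0; s < 24; s++) {
		for (w = 0; w < 4; w++) {
			unsigned char o[4];
			int i, j = 1;

			o[0] = w;
			for (i = 0; i < 4; i++)
				if (perm[s][i] != w)
					o[j++] = perm[s][i];
			on_hit[s][w] = state_of(o);
		}
	}
}

int main(void)
{
	int s;

	build_tables();
	/* start from state 0, hit ways 2, 0, 2; perm[s][3] is the next victim */
	s = on_hit[0][2];
	s = on_hit[s][0];
	s = on_hit[s][2];
	printf("evict way %d on the next miss\n", perm[s][3]);
	return 0;
}

On a hit of way i, the LRU update is then a single lookup,
lru = on_hit[lru][i], and vmacache_update evicts perm[lru][3].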

By the way, I'm happy enough with your patch going in with your
proposed eviction scheme; LRU eviction is a refinement that is best
done as a separate patch. It just came up because I think it would be
applicable here.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.