Re: [PATCH] Percpu tag allocator

From: Kent Overstreet
Date: Wed Jun 12 2013 - 13:59:25 EST


On Wed, Jun 12, 2013 at 07:08:35PM +0200, Oleg Nesterov wrote:
> On 06/11, Kent Overstreet wrote:
> >
> > + * This is done by keeping track of which cpus have tags on their percpu
> > + * freelists in a bitmap, and then on allocation failure if too many cpus have
> > + * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
> > + * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
> > + * random).
>
> Interesting... I'll try to read the patch later.
>
> I have to admit, I do not really understand what this bitmap can actually
> buy after a quick glance. It is only a hint afaics, and obviously it is
> not accurate. alloc_local_tag() never clears the bit, tag_free()->set_bit()
> is racy, etc. I guess this is fine, just this doesn't look clear.

Yeah, I'll add a comment explaining how it's used.

The bitmap is actually fairly important because that's how steal_tags()
knows whether it should run - without the bitmap, every time
alloc_global_tags() fails (and if you're keeping your device's queue
full and all tags allocated, that'll happen a lot) steal_tags() would
have to touch every other percpu freelist.

So we'd need at least an atomic counter, but a bitmap isn't really any
more trouble and it lets us skip most of the percpu lists that are empty
- which should make a real difference in scalability to huge nr_cpus.

The main thing is it's fine for a freelist to be empty when the bitmap
is set - steal_tags() will just keep iterating - but the bitmap _must_
be set whenever there's tags on a percpu freelist.

That's why it's ok for alloc_local_tag() to skip clearing it, and it's
probably better for it not to because it means not touching a shared
cacheline, and most likely we'll be doing more work on that cpu and
refilling the percpu freelist soon anyways.

>
> But the code looks correct, just a couple of minor nits, feel freee to
> ignore.
>
> > +enum {
> > + TAG_FAIL = -1U,
> > + TAG_MAX = TAG_FAIL - 1,
> > +};
>
> This can probably go to .c, and it seems that TAG_MAX is not needed.
> tag_init() can check "nr_tags >= TAG_FAIL" instead.

Users need TAG_FAIL to check for allocation failure.

>
> > +static bool steal_tags(struct percpu_tag_pool *pool,
> > + struct percpu_tag_cpu_freelist *tags)
> > +{
> > + unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
> > + struct percpu_tag_cpu_freelist *remote;
> > +
> > + for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
> > + cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);
>
> bitmap_weight(pool->cpus_have_tags, pool->nr_tags).

I couldn't remember what that was called, thanks :)

> > +
> > + while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
> > + cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
> > +
> > + if (cpu == nr_cpu_ids)
> > + cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
> > +
> > + if (cpu == nr_cpu_ids)
> > + BUG();
> > +
> > + pool->cpu_last_stolen = cpu;
> > + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > +
> > + if (remote == tags)
> > + continue;
> > +
> > + clear_bit(cpu, pool->cpus_have_tags);
> > +
> > + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> > +
> > + if (nr_free) {
> > + memcpy(tags->freelist,
> > + remote->freelist,
> > + sizeof(unsigned) * nr_free);
> > + smp_mb();
> > + remote->nr_free = 0;
> > + tags->nr_free = nr_free;
> > + return true;
> > + } else {
> > + remote->nr_free = 0;
> > + }
>
> Both branches clear remote->nr_free.

Yeah, but clearing it has to happen after the memcpy() and the smp_mb().
I couldn't find a way to combine them that I liked, but if you've got
any suggestions I'm all ears.

> BITS_TO_LONGS(nr_cpu_ids)

Aha, thanks.

I've made a few tweaks since doing some profiling this morning - here's
an updated version. Main thing was adding a fast path to
percpu_tag_alloc(), the finish_wait() was more expensive than I
expected: