Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

From: Jens Axboe
Date: Fri Apr 25 2014 - 22:03:14 EST


On 2014-04-25 18:01, Ming Lei wrote:
> Hi Jens,

> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> On 04/25/2014 03:10 AM, Ming Lei wrote:

>> Sorry, I did run it the other day. It has little to no effect here, but
>> that's mostly because there's so much other crap going on in there. The
>> most effective way to make it work better currently is just to ensure
>> the caching pool is of a sane size.

> Yes, that is just what the patch is doing, :-)

But it's not enough. For instance, in my test case it's 255 tags and 64 CPUs. We end up in cross-CPU spinlock nightmare mode.

> From the percpu_ida view, it is easy to observe that it can improve
> allocation performance. I have several patches that export
> this information via sysfs for monitoring percpu_ida
> performance.

Sounds good!

>> I've got an alternative tagging scheme that I think would be useful for
>> the cases where the tag space to CPU ratio isn't big enough. So I think
>> we'll retain percpu_ida for the cases where it can cache enough, and
>> punt to an alternative scheme when not.

> OK, care to comment on the patch, or on the idea of setting the percpu
> cache size to (nr_tags / hctx->nr_ctx)?

I think it's a good idea. The problem is that for percpu_ida to be effective, you need a bigger cache than the 3 tags per CPU I'd get above (255 tags spread over 64 software queues). If that isn't the case, it performs poorly. I'm just not convinced the design can ever work in the realm of realistic queue depths.
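As a minimal sketch of that sizing idea (not the actual patch), a driver
could pick the cache size itself via __percpu_ida_init() instead of taking
the percpu_ida_init() defaults. init_tag_pool(), the nr_ctx argument and
the floor of 8 below are illustrative assumptions, not existing kernel
code:

#include <linux/kernel.h>
#include <linux/percpu_ida.h>

/*
 * Hypothetical helper: cap the per-cpu cache at nr_tags / nr_ctx so a
 * few busy CPUs cannot soak up the whole tag space, but keep a floor
 * of 8 so the cache is still worth having.  nr_ctx stands for the
 * number of software queues mapped to the hardware queue
 * (hctx->nr_ctx in blk-mq).
 */
static int init_tag_pool(struct percpu_ida *pool, unsigned int nr_tags,
			 unsigned int nr_ctx)
{
	unsigned int max_size = max_t(unsigned int, nr_tags / nr_ctx, 8);
	unsigned int batch = max_t(unsigned int, max_size / 2, 1);

	/*
	 * With 255 tags and 64 software queues this gives a per-cpu
	 * cache of 8 (255 / 64 = 3, raised to the floor), rather than
	 * the much larger IDA_DEFAULT_PCPU_SIZE default.
	 */
	return __percpu_ida_init(pool, nr_tags, max_size, batch);
}

Tying the batch size to half the cache size is just one possible choice;
the point is only that max_size has to be driven by the tag/queue ratio
rather than by a fixed constant.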


>> That doesn't mean we should not improve percpu_ida. There's quite a bit
>> of low hanging fruit in there.

> IMO percpu_max_size in percpu_ida is very important for performance,
> and it might need to be adjusted dynamically according to the per-cpu
> allocation load, but that is far more complicated to implement. Fixing
> the parameter before percpu_ida_init() might be the simplest way.

That's what I did, essentially. Ensuring that percpu_max_size is at least 8 makes it a whole lot better here. But it is still slower than a plain bitmap, which makes me sad. A fairly straightforward cmpxchg-based scheme I tested here is around 20% faster than the bitmap approach on a basic desktop machine, and around 35% faster on a 4-socket box. Outside of NVMe, I can't think of cases where that approach would not be faster than percpu_ida. That means all of SCSI, basically, and the basic block drivers.
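For reference, the general shape of a cmpxchg-based bitmap allocator could
look something like the sketch below. This is not the actual scheme being
referred to; the names are made up, and a real version would at least add
a per-cpu hint for the starting word and a wait queue for the out-of-tags
case:

#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/types.h>

/*
 * Illustrative only.  Tags live in an array of words; allocation finds
 * a clear bit and claims it with cmpxchg(), freeing clears it with an
 * atomic clear_bit().  Assumes init code pre-sets the unused bits past
 * the last valid tag so they are never handed out.
 */
struct bitmap_tags {
	unsigned int nr_words;
	unsigned long *words;
};

static int bt_get(struct bitmap_tags *bt)
{
	unsigned int w;

	for (w = 0; w < bt->nr_words; w++) {
		unsigned long old = bt->words[w];

		while (old != ~0UL) {
			unsigned int bit = ffz(old);	/* first zero bit */
			unsigned long prev;

			prev = cmpxchg(&bt->words[w], old, old | (1UL << bit));
			if (prev == old)
				return w * BITS_PER_LONG + bit;
			old = prev;	/* lost the race, retry on the updated word */
		}
	}
	return -1;	/* out of tags; a real allocator would wait here */
}

static void bt_put(struct bitmap_tags *bt, unsigned int tag)
{
	clear_bit(tag & (BITS_PER_LONG - 1), &bt->words[tag / BITS_PER_LONG]);
}

The attraction of something like this is that the fast path is lockless
and touches one shared cache line per word, instead of bouncing per-cpu
freelists and spinlocks around the way the percpu_ida steal path does
when the tag space is small.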

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/