Re: [patch V3] percpu_counter: scalability works

From: Tejun Heo
Date: Tue May 17 2011 - 08:45:39 EST


Hello, Eric.

On Tue, May 17, 2011 at 02:20:07PM +0200, Eric Dumazet wrote:
> Spikes are expected and have no effect by design.
>
> batch value is chosen so that granularity of the percpu_counter
> (batch*num_online_cpus()) is the spike factor, and thats pretty
> difficult when number of cpus is high.
>
> In Shaohua workload, 'amount' for a 128Mbyte mapping is 32768, while the
> batch value is 48. 48*24 = 1152.
> So the percpu s32 being in [-47 .. 47] range would not change the
> accuracy of the _sum() function [ if it was eventually called, but its
> not ]
>
> No drift in the counter is the only thing we care - and _read() being
> not too far away from the _sum() value, in particular if the
> percpu_counter is used to check a limit that happens to be low (against
> granularity of the percpu_counter : batch*num_online_cpus()).
>
> I claim extra care is not needed. This might give the false impression
> to reader/user that percpu_counter object can replace a plain
> atomic64_t.

We already had this discussion. Sure, we can argue about it again all
day, but I just don't think it's a necessary compromise, and it really
makes _sum() quite dubious. It's not about strict correctness (it
can't be), but if I pay the overhead of walking all the percpu
counters, I'd like to get a rather exact number when there's nothing
much going on (freeblock count, for example). Also, I want to be able
to use a large @batch when the situation allows for it without worrying
about _sum() accuracy.
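
To make the numbers in the quoted example concrete, here is a minimal
single-threaded userspace sketch of a batched per-cpu counter. It is
not the kernel implementation: the model_* names are hypothetical, the
locking is left out, and NR_CPUS/BATCH are simply taken from the
quoted example (24 cpus, batch 48). It only shows how far _read() can
sit from _sum() while every local delta stays below @batch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 24	/* assumption: cpu count from the quoted example */
#define BATCH   48	/* assumption: batch value from the quoted example */

/* Hypothetical userspace model, not the kernel's struct percpu_counter. */
struct pcpu_counter_model {
	int64_t count;			/* global, approximate value */
	int32_t local[NR_CPUS];		/* per-cpu deltas not yet folded in */
};

/* Add to one cpu's local delta; fold into the global count at +-BATCH. */
static void model_add(struct pcpu_counter_model *c, int cpu, int32_t amount)
{
	int32_t v;

	assert(cpu >= 0 && cpu < NR_CPUS);
	v = c->local[cpu] + amount;
	if (v >= BATCH || v <= -BATCH) {
		c->count += v;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Cheap read: the global count only, local deltas are ignored. */
static int64_t model_read(const struct pcpu_counter_model *c)
{
	return c->count;
}

/* Slow path: walk every cpu and add its unfolded delta. */
static int64_t model_sum(const struct pcpu_counter_model *c)
{
	int64_t sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* Worst-case |_sum() - _read()| when every cpu holds BATCH - 1. */
static int64_t model_max_deviation(void)
{
	return (int64_t)(BATCH - 1) * NR_CPUS;
}
```

If every cpu adds BATCH - 1 to a zeroed counter, nothing is folded,
so model_read() stays at 0 while model_sum() returns 47 * 24 = 1128.
That gap scales with @batch, which is why the _sum() walk should
account for the local deltas.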

Given that _sum() is a super-slow path and we have a lot of latitude
there, this should be possible without resorting to a heavy-handed
approach like lglock. I was hoping that someone would come up with a
better solution, which doesn't seem to have happened. Maybe I was
wrong, I don't know. I'll give it a shot.

But, anyways, here's my position regarding the issue.

* If we're gonna just fix up the slow path, I don't want to make
  _sum() less useful by making its accuracy dependent upon @batch.

* If somebody is interested, it would be worthwhile to see whether we
  can integrate vmstat and percpu counters so that its deviation is
  automatically regulated and we don't have to think about all this
  anymore.

I'll see if I can come up with something.

Thank you.

--
tejun