Re: [RFC PATCH 1/3] getcpu_cache system call: cache CPU number of running thread

From: Paul E. McKenney
Date: Tue Jan 05 2016 - 16:47:32 EST


On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
> > For instance, an application could create a linked list or hash map
> > of thread control structures, which could contain the current CPU
> > number of each thread. A dispatch thread could then traverse or
> > lookup this structure to see on which CPU each thread is running and
> > do work queue dispatch or scheduling decisions accordingly.
>
> So, what happens if the linked list is walked from thread X, and we
> discover that thread Y is allegedly running on CPU1. We decide that
> we want to dispatch some work on that thread due to it being on CPU1,
> so we send an event to thread Y.
>
> Thread Y becomes runnable, and the scheduler decides to schedule the
> thread on CPU3 instead of CPU1.
>
> My point is that the above idea is inherently racy. The only case
> where it isn't racy is when thread Y is bound to CPU1, and so can't
> move - but then you'd know that thread Y is on CPU1 and there
> wouldn't be a need for the inherent complexity suggested above.
>
> The behaviour I've seen on ARM from the scheduler (on a quad CPU
> platform, observing the system activity with top reporting the last
> CPU number used by each thread) is that threads often migrate
> between CPUs - especially in the case of (eg) one or two threads
> running in a quad-CPU system.
>
> Given that, I'm really not sure what the use of reading and making
> decisions on the current CPU number would be within a program -
> unless the thread is bound to a particular CPU or group of CPUs,
> it seems that you can't rely on being on the reported CPU by the
> time the system call returns.

As I understand it, the idea is -not- to eliminate synchronization
like we do with per-CPU variables in the kernel, but rather to
reduce the average cost of synchronization. For example, there
might be a separate data structure per CPU, each structure guarded
by its own lock. A thread could sample the current running CPU,
acquire that CPU's corresponding lock, and operate on that CPU's
structure. This would work correctly even if there were an
arbitrarily high number of preemptions/migrations, but would have
improved performance (compared to a single global lock) in the
common case where no preemption/migration occurs.

This approach can also be used in conjunction with Paul Turner's
per-CPU atomics.

Make sense, or am I missing your point?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/