Re: [RFC PATCH v2 1/3] getcpu_cache system call: cache CPU number of running thread

From: Alexei Starovoitov
Date: Wed Jan 27 2016 - 22:12:34 EST


On Wed, Jan 27, 2016 at 11:54:41AM -0500, Mathieu Desnoyers wrote:
> Expose a new system call allowing threads to register one userspace
> memory area where to store the CPU number on which the calling thread is
> running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
> current thread. Upon return to user-space, a notify-resume handler
> updates the current CPU value within each registered user-space memory
> area. User-space can then read the current CPU number directly from
> memory.
>
> This getcpu cache is an improvement over current mechanisms available to
> read the current CPU number, which has the following benefits:
>
> - 44x speedup on ARM vs system call through glibc,
> - 14x speedup on x86 compared to calling glibc, which calls vdso
> executing a "lsl" instruction,
> - 11x speedup on x86 compared to inlined "lsl" instruction,
> - Unlike vdso approaches, this cached value can be read from an inline
> assembly, which makes it a useful building block for restartable
> sequences.
> - The getcpu cache approach is portable (e.g. ARM), which is not the
> case for the lsl-based x86 vdso.
>
> On x86, yet another possible approach would be to use the gs segment
> selector to point to user-space per-cpu data. This approach performs
> similarly to the getcpu cache, but it has two disadvantages: it is
> not portable, and it is incompatible with existing applications already
> using the gs segment selector for other purposes.

Great work! My only concern is that every arch has to implement
a call to getcpu_cache_handle_notify_resume() so that put_user() can
be done from a safe place, which is not pretty.
Can we do better?
Here is one crazy idea:
The kernel can allocate the memory that user space will mmap()
(ideally reusing the perf ring-buffer alloc/mmap mechanism).
Then the kernel can just write cpuid into it from any place.
User space will then register the 'offset' into this area for a given
user space thread (or the kernel will return it, or a pointer within
this area), and in finish_task_switch() the kernel will do:
  *task->offset_converted_to_ptr = smp_processor_id();
At init time user space will do:
  __thread int *cpuid;
  cpuid = (void *)addr_from_mmap + registered_offset;
and at runtime '*cpuid' will give user space what it wants.
It's two loads to get cpuid, versus one with the getcpu_cache approach,
but that is probably still fast enough?
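
To make the user-space side concrete, here is a rough sketch that only
simulates the idea in a normal process. Everything in it is an assumption
for illustration: the area is plain anonymous memory rather than a
kernel-allocated mapping, simulate_kernel_update() stands in for the store
in finish_task_switch(), and the helper names and AREA_SIZE are made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical size of the shared area. */
#define AREA_SIZE 4096

static void *addr_from_mmap;           /* base of the shared area */
static __thread volatile int *cpuid;   /* per-thread pointer, set once at init */

/* Stand-in for mmap()ing the kernel-allocated area: anonymous memory here. */
static int map_area(void)
{
    void *p = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    addr_from_mmap = p;
    return 0;
}

/* Init time: turn the registered offset into a thread-local pointer. */
static void register_thread(size_t registered_offset)
{
    assert(registered_offset % sizeof(int) == 0);
    assert(registered_offset + sizeof(int) <= AREA_SIZE);
    cpuid = (volatile int *)((char *)addr_from_mmap + registered_offset);
}

/* Stand-in for the kernel writing smp_processor_id() into the slot. */
static void simulate_kernel_update(size_t registered_offset, int cpu)
{
    *(volatile int *)((char *)addr_from_mmap + registered_offset) = cpu;
}

/* Runtime: load the TLS pointer, then load the value it points to. */
static inline int read_cpuid(void)
{
    return *cpuid;
}
```

The volatile qualifier is there because, in the real scheme, the value
would change behind the compiler's back whenever the thread migrates.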
And this way we can have a mechanism to return much bigger
structures to user space. The kernel can update such an area from any
place, and user space only needs one extra load to get the base of
such a per-cpu area and another load to fetch cpuid.
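
A bigger structure could be read the same way, roughly as sketched here.
The layout is purely hypothetical: only the cpu id comes from the idea
above, the second field is a placeholder, and attach_area() is a made-up
helper that takes the mapped base and the registered offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one slot in the shared area. */
struct thread_area {
    int32_t cpu_id;               /* written by the kernel on task switch */
    uint32_t hypothetical_field;  /* placeholder for other published data */
};

static __thread volatile struct thread_area *area;

/* Init time: remember where this thread's slot lives. */
static void attach_area(void *base, size_t offset)
{
    assert(offset % _Alignof(struct thread_area) == 0);
    area = (volatile struct thread_area *)((char *)base + offset);
}

/* Runtime: one load for the base pointer, one for the field. */
static inline int32_t area_cpu_id(void)
{
    return area->cpu_id;
}
```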
Thoughts?