Re: [RFC PATCH] percpu system call: fast userspace percpu critical sections

From: Mathieu Desnoyers
Date: Tue May 26 2015 - 16:39:06 EST


----- Original Message -----
> [cc: hpa, Borislav and Andi]
>
> On Mon, May 25, 2015 at 11:30 AM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> > ----- Original Message -----
> >> On May 23, 2015 10:09 AM, "Mathieu Desnoyers"
> >> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> >
> >> > ----- Original Message -----
> >> > > On Fri, May 22, 2015 at 2:34 PM, Mathieu Desnoyers
> >> > > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> > > > ----- Original Message -----
> >> > > >> On Fri, May 22, 2015 at 1:26 PM, Michael Kerrisk
> >> > > >> <mtk.manpages@xxxxxxxxx>
> >> > > >> wrote:
> >> > > >> > [CC += linux-api@]
> >> > > >> >
> >> > > >> > On Thu, May 21, 2015 at 4:44 PM, Mathieu Desnoyers
> >> > > >> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >> > > >> >> Expose a new system call allowing userspace threads to register
> >> > > >> >> a TLS area used as an ABI between the kernel and userspace to
> >> > > >> >> share information required to create efficient per-cpu critical
> >> > > >> >> sections in user-space.
> >> > > >> >>
> >> > > >> >> This ABI consists of a thread-local structure containing:
> >> > > >> >>
> >> > > >> >> - a nesting count surrounding the critical section,
> >> > > >> >> - a signal number to be sent to the thread when preempting a
> >> > > >> >> thread
> >> > > >> >> with non-zero nesting count,
> >> > > >> >> - a flag indicating whether the signal has been sent within the
> >> > > >> >> critical section,
> >> > > >> >> - an integer where to store the current CPU number, updated
> >> > > >> >> whenever
> >> > > >> >> the thread is preempted. This CPU number cache is not strictly
> >> > > >> >> needed, but performs better than getcpu vdso.
> >> > > >> >>
> >> > > >> >> This approach is inspired by Paul Turner and Andrew Hunter's
> >> > > >> >> work
> >> > > >> >> on percpu atomics, which lets the kernel handle restart of
> >> > > >> >> critical
> >> > > >> >> sections, ref.
> >> > > >> >> http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >> > > >> >>
> >> > > >> >> What is done differently here compared to percpu atomics: we
> >> > > >> >> track
> >> > > >> >> a single nesting counter per thread rather than many ranges of
> >> > > >> >> instruction pointer values. We deliver a signal to user-space
> >> > > >> >> and
> >> > > >> >> let the logic of restart be handled in user-space, thus moving
> >> > > >> >> the complexity out of the kernel. The nesting counter approach
> >> > > >> >> allows us to skip the complexity of interacting with signals
> >> > > >> >> that
> >> > > >> >> would be otherwise needed with the percpu atomics approach,
> >> > > >> >> which
> >> > > >> >> needs to know which instruction pointers are preempted,
> >> > > >> >> including
> >> > > >> >> when preemption occurs on a signal handler nested over an
> >> > > >> >> instruction
> >> > > >> >> pointer of interest.
> >> > > >> >>
> >> > > >>
> >> > > >> I talked about this kind of thing with PeterZ at LSF/MM, and I was
> >> > > >> unable to convince myself that the kernel needs to help at all. To
> >> > > >> do
> >> > > >> this without kernel help, I want to relax the requirements
> >> > > >> slightly.
> >> > > >> With true per-cpu atomic sections, you have a guarantee that you
> >> > > >> are
> >> > > >> either really running on the same CPU for the entire duration of
> >> > > >> the
> >> > > >> atomic section or you abort. I propose a weaker primitive: you
> >> > > >> acquire one of an array of locks (probably one per cpu), and you
> >> > > >> are
> >> > > >> guaranteed that, if you don't abort, no one else acquires the same
> >> > > >> lock while you hold it.
> >> > > >
> >> > > > In my proof of concept (https://github.com/compudj/percpu-dev) I
> >> > > > actually implement an array of per-cpu lock. The issue here boils
> >> > > > down to grabbing this per-cpu lock efficiently. Once the lock is
> >> > > > taken,
> >> > > > the thread has exclusive access to that per-cpu lock, even if it
> >> > > > migrates.
> >> > > >
> >> > > >> Here's how:
> >> > > >>
> >> > > >> Create an array of user-managed locks, one per cpu. Call them
> >> > > >> lock[i]
> >> > > >> for 0 <= i < ncpus.
> >> > > >>
> >> > > >> To acquire, look up your CPU number. Then, atomically, check that
> >> > > >> lock[cpu] isn't held and, if so, mark it held and record both your
> >> > > >> tid
> >> > > >> and your lock acquisition count. If you learn that the lock *was*
> >> > > >> held after all, signal the holder (with kill or your favorite other
> >> > > >> mechanism), telling it which lock acquisition count is being
> >> > > >> aborted.
> >> > > >> Then atomically steal the lock, but only if the lock acquisition
> >> > > >> count
> >> > > >> hasn't changed.
> >> > > >>
> >> > > >> This has a few benefits over the in-kernel approach:
> >> > > >>
> >> > > >> 1. No kernel patch.
> >> > > >>
> >> > > >> 2. No unnecessary abort if you are preempted in favor of a thread
> >> > > >> that
> >> > > >> doesn't content for your lock.
> >> > > >>
> >> > > >> 3. Greatly improved debuggability.
> >> > > >>
> >> > > >> 4. With long critical sections and heavy load, you can improve
> >> > > >> performance by having several locks per cpu and choosing one at
> >> > > >> random.
> >> > > >>
> >> > > >> Is there a reason that a scheme like this doesn't work?
> >> > > >
> >> > > > What do you mean exactly by "atomically check that lock is not
> >> > > > held and, if so, mark it held" ? Do you imply using a lock-prefixed
> >> > > > atomic operation ?
> >> > >
> >> > > Yes.
> >> > >
> >> > > >
> >> > > > The goal of this whole restart section approach is to allow grabbing
> >> > > > a lock (or doing other sequences of operations ending with a single
> >> > > > store) on per-cpu data without having to use slow lock-prefixed
> >> > > > atomic operations.
> >> > >
> >> > > Ah, ok, I assumed it was to allow multiple threads to work in
> >> > > parallel.
> >> > >
> >> > > How arch-specific are you willing to be?
> >> >
> >> > I'd want this to be usable on every major architectures.
> >> >
> >> > > On x86, it might be possible
> >> > > to play some GDT games so that an unlocked xchg relative
> >> >
> >> > AFAIK, there is no such thing as an unlocked xchg. xchg always
> >> > imply the lock prefix on x86. I guess you mean cmpxchg here.
> >> >
> >>
> >> Right, got my special cases mixed up.
> >>
> >> I wonder if we could instead have a vdso function that did something like:
> >>
> >> unsigned long __vdso_cpu_local_exchange(unsigned long *base, int
> >> shift, unsigned long newval)
> >> {
> >> unsigned long *ptr = base + (cpu << shift);
> >> unsigned long old = *ptr;
> >> *ptr = new;
> >> return *ptr;
> >> }
> >>
> >> I think this primitive would be sufficient to let user code do the
> >> rest. There might be other more simple primitives that would work.
> >> It could be implemented by fiddling with IP ranges, but we could
> >> change the implementation later without breaking anything. The only
> >> really hard part would be efficiently figuring out what CPU we're on.
> >
> > The "fiddling with IP ranges" is where the restart sections come into
> > play. Paul Turner's approach indeed knows about IP ranges, and performs
> > the restart from the kernel. My alternative approach uses a signal and
> > page protection in user-space to reach the same result. It appears that
> > CONFIG_PREEMPT kernels are difficult to handle with Paul's approach, so
> > perhaps we could combine our approaches to get the best of both.
>
> I'm not sure why CONFIG_PREEMPT would matter. What am I missing?

With CONFIG_PREEMPT, the scheduler can preempt the kernel while
it's running a page fault, or a system call, on behalf of a thread.

So in addition of having to check which instruction pointer is preempted,
and which IP is having a signal delivered on, we may need to add a check
when entering into the kernel on pretty much every path, including
page faults and system calls. I presume the interrupt handler is OK,
provided that interrupt handlers cannot be preempted, and that the explicit
preempt check is done before the interrupt handler returns to user-space,
AFAIU passing the user-space pt_regs, which should be OK already.

I may very have missed other subtleties of issues related to pjt's approach
with CONFIG_PREEMPT. Hopefully he can enlighten us with the missing parts.

>
> Doing this in the vdso has some sneaky benefits: rather than aborting
> a very short vdso-based primitive on context switch, we could just fix
> it up in the kernel and skip ahead to the end.

The fixup rather than restart idea is quite interesting. However, it may
require to do the fixup in the kernel before being migrated, and I'm not
sure it's that easy to find a preemptable spot where we can do the user
accesses needed for the fixup before migration. One advantage of the
"restart" approach is that we can perform the user-accesses after the
migration, which allow us to call that code right before return to
userspace. Unfortunately, doing the fixup from the kernel after the
migration might be tricky, even if we use a lock-atomic op in the fixup,
because we might be doing a store concurrently with a non-locked atomic
op running in userspace on the original CPU.

Here is an idea I'm not totally proud of, but might work if we really
want to do this without restarting userspace: we could expose a vdso
that both performs an atomic operation on an array of pointers to
per-cpu values, and gets the cpu number:

void __vdso_cmpxchg_getcpu(unsigned long **p, unsigned long _old,
unsigned long _new, int *cpu);

If we preempt this vdso, the in-kernel fixup simply grabs the current CPU
number and updates the right array entry, all this locally. The advantage
here is that we can perform this after migration. It comes at the expense
of an indirection through the pointer array, which I rather dislike. If
we can rather find a way to perform the userspace fixup before migration,
it would be much better.

>
> >
> >>
> >> FWIW, 'xchg' to cache-hot memory is only about 20 cycles on my laptop,
> >> and cmpxchg seems to be about 6 cycles. Both are faster than getcpu.
> >> How much are we really saving with any of this over the pure userspace
> >> approach? I think that the most that the kernel can possibly help us
> >> is to give us a faster getcpu and to help us deal with being migrated
> >> to a different cpu if we want to avoid expensive operations and don't
> >> want to play unlocked cmpxchg tricks.
> >
> > You appear to summarize the two things that are added to the kernel with
> > this RFC patch:
> > - faster getcpu(): 12.7 ns -> 0.3 ns
>
> Yeah, I figured that out the second time I read your email and re-read
> the original message, but I apparently forgot to fix up the email I
> was typing.
>
> > - allow using less expensive operations to perform per-cpu-atomic ops:
> > (on my system):
> > - lock; cmpxchg (18 ns) -> load-test-branch-store (5.4 ns)
> >
>
> When you say "load-test-branch-store", do you mean that sequence of
> normal code or a literal cmpxchg? On x86, we really can pull off the
> sneaky trick of gs-relative cmpxchg in a single instruction, which
> gives us per-cpu atomic locking without any special kernel fixups,
> although that locks down gs and isn't currently supported.

It's a sequence of normal code. Ideally I'd want this to be doable
on other architectures too (powerpc and ARM at least).

>
> IIRC some other OS's use gs as a userspace per-cpu pointer instead of
> a per-thread pointer. Would we want to consider enabling that?
> This may interact oddly with the WIP wrgsbase stuff. (Grr, Intel, why
> couldn't you have made wrgsbase reset the gs selector to zero? That
> would have been so much nicer.)
>
> >>
> >> I was wrong about set_thread_area giving us efficient per-cpu data,
> >> BTW, although it would be easy to add a similar feature that would
> >> give us a per-cpu segment base on x86.
> >
> > ok
> >
>
> >>
> >> This ABI is kind of unfortunate in that it can't be shared between
> >> multiple libraries.
> >
> > Well ideally you'd want one library to implement this (e.g. glibc), and
> > other libraries to use it. I'm not sure doing a linked list of TLS entries
> > per thread would really be useful here.
>
> True, but that may not play so well with tracing injection.
>
> At least a pure vdso approach is guaranteed not to have this problem.

Yep.

Thanks!

Mathieu

>
>
> >> Oh, you're delivering a signal every time you're preempted, even
> >> outside the critical section? That sounds expensive.
> >
> > No, the signal is only delivered if the kernel preempts the critical
> > section.
>
> As noted above, I meant to delete this before sending :)
>
> --Andy
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/