Re: [PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9)

From: Ingo Molnar
Date: Thu Mar 04 2010 - 11:11:04 EST

Next message: David Teigland: "Re: [PATCH 1/4] dlm: fix ordering of bast and cast"
Previous message: Linus Torvalds: "Re: [PATCH] pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig()"
In reply to: Josh Triplett: "Re: [PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9)"
Next in thread: Mathieu Desnoyers: "Re: [PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9)"

* Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> I am proposing this patch for the 2.6.34 merge window, as I think it is
> ready for inclusion.

I think it's a bit late for this merge window.

> Here is an implementation of a new system call, sys_membarrier(), which
> executes a memory barrier on all threads of the current process. It can be
> used to distribute the cost of user-space memory barriers asymmetrically by
> transforming pairs of memory barriers into pairs consisting of
> sys_membarrier() and a compiler barrier. For synchronization primitives that
> distinguish between read-side and write-side (e.g. userspace RCU, rwlocks),
> the read-side can be accelerated significantly by moving the bulk of the
> memory barrier overhead to the write-side.

Why is this such a low-level and yet still special-purpose facility?

Synchronization facilities for high-performance threading may want to do a bit
more than just execute a barrier instruction on another CPU that has a
relevant thread running.

You cited signal-based numbers:

> (what we have now, with dynamic sys_membarrier check, expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 4316818891 reads, 503790 writes
>
> (dynamic sys_membarrier check, non-expedited scheme)
> memory barriers in reader: 907693804 reads, 817793 writes
> sys_membarrier scheme: 8698725501 reads, 313 writes

Much of that signal handler overhead is, I think, due to:

- FPU/SSE context save/restore
- the need to wake up, run and deschedule all threads

Instead, I'd suggest you try to implement user-space RCU speedups not via
the new sys_membarrier() syscall, but via two new signal extensions:

- SA_NOFPU: on x86, skip the FPU/SSE save/restore for such fast in/out
  special-purpose signal handlers. (I can whip up a quick patch for you if
  you want.)

- SA_RUNNING: a way to signal only running threads - as a way for user-space
  based concurrency control mechanisms to deschedule running threads (or, like
  in your case, to implement barrier / garbage collection schemes).

( Note: to properly sync back you'll also need an sa_info field to tell
target tasks how many tasks were woken up. That way a futex can be used
as a semaphore to signal back to the issuing thread, and make it all
properly event triggered and nicely scalable. Also, queued signals are a
must for such a scheme. )
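
For reference, the naive signal-based scheme that these extensions are meant
to speed up can be sketched roughly as below. This is only an illustration
under stated assumptions: SA_NOFPU and SA_RUNNING are proposals and do not
exist, so the sketch uses plain sigaction() and pthread_kill() and signals
every thread it is given, running or not. It also spins on an atomic counter
instead of sleeping on a futex. All names (BARRIER_SIG, barrier_install,
barrier_all, barrier_demo) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define BARRIER_SIG SIGUSR1      /* assumption: any otherwise unused signal */
#define MAX_DEMO_THREADS 16      /* hypothetical limit for the demo only */

static atomic_int pending_acks;  /* handlers that have not run yet */
static atomic_int stop_workers;  /* tells demo workers to exit */

/* Runs in each target thread: full fence, then acknowledge. */
static void barrier_handler(int sig)
{
    (void)sig;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_sub(&pending_acks, 1);
}

/* Install the handler once per process. */
static int barrier_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = barrier_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(BARRIER_SIG, &sa, NULL);
}

/* Force a memory barrier on each of the n threads in tids[]. */
static int barrier_all(const pthread_t *tids, int n)
{
    atomic_store(&pending_acks, n);
    for (int i = 0; i < n; i++) {
        if (pthread_kill(tids[i], BARRIER_SIG) != 0)
            return -1;
    }
    /* Spin until every handler has acknowledged (a futex wait in the
     * scheme described above would sleep here instead). */
    while (atomic_load(&pending_acks) > 0)
        sched_yield();
    atomic_thread_fence(memory_order_seq_cst);
    return 0;
}

/* Demo worker: stays alive until told to stop. */
static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_workers))
        sched_yield();
    return NULL;
}

/* Spawn nthreads workers, run one barrier over them, then clean up. */
int barrier_demo(int nthreads)
{
    pthread_t tids[MAX_DEMO_THREADS];
    int created = 0, rc = -1;

    if (nthreads < 0 || nthreads > MAX_DEMO_THREADS || barrier_install() != 0)
        return -1;
    atomic_store(&stop_workers, 0);
    while (created < nthreads &&
           pthread_create(&tids[created], NULL, worker, NULL) == 0)
        created++;
    if (created == nthreads)
        rc = barrier_all(tids, nthreads);
    atomic_store(&stop_workers, 1);
    for (int i = 0; i < created; i++)
        pthread_join(tids[i], NULL);
    return rc;
}
```

Calling barrier_demo(4) from main() (built with -pthread) exercises one
round trip. Every target thread has to take the signal and run the handler
here, which is the wake-up and context-save cost discussed above.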

My estimate is that it will be _much_ faster than the naive signal-based
approach - maybe even quite comparable to an open-coded sys_membarrier():

- as most of the overhead in a real scenario ought to be the IPI sending and
latency - not the syscall entry/exit. (with a signal approach we'd still go
into target thread user-mode, so one more syscall exit+re-entry)

- or for the common case where there are no other threads running, we are
just in/out of SA_RUNNING without having to do any synchronization. In that
case it should be quite close to sys_membarrier() - modulo some minimal
signal API overhead. [which we could optimize some more, if it's visible in
your benchmarks.]

Signals per se are pretty scalable these days - now that most of the fastpaths
are decoupled from tasklist_lock and everything is RCU-ized.

Further benefits are:

- both SA_NOFPU and SA_RUNNING could be used by a _lot_ more user-space
facilities than just user-space RCU.

- synergetic effects: growing some real high-performance facility based on
signals would ensure further signal speedups in the future as well.
Currently any server app that runs into signal limitations tends to shy
away from them and use some different (and often inferior) signalling
scheme. It would be better to extend signals with 'lightweight' capabilities
as well.

All in all, signals are used by something like 99.9% of Linux apps, while
sys_membarrier() would be used only by [WAG] 0.00001% of them.

So before we can merge this (at least via the RCU tree, which you have sent it
to), i'd like to see you try _much_, _MUCH_ harder to fix the very obvious
signal overhead performance problems you have demoed via the numbers above so
nicely.

If _that_ fails, and if we get all the fruits of that, _then_ we might
perhaps, with a lot of hesitation, concede defeat and think about adding yet
another syscall.

I know it's cool to add a brand new syscall - but, unfortunately, in practice
it doesn't help Linux apps all that much. (at least until we have tools/klibc/
or so.)

[ There are also a few small cleanliness details I noticed in your patch: enums
are a tiny bit nicer for ABIs than #define's, the #ifdef SMP is ugly, etc. -
but it doesn't really matter much as I think we should concentrate on the
scalability problems of signals first. ]

Thanks,

Ingo