Re: [RFC PATCH] membarrier: handle nohz_full with expedited thread registration

From: Mathieu Desnoyers
Date: Mon Jan 16 2017 - 18:04:42 EST


----- On Jan 16, 2017, at 3:15 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx wrote:

> Excuse my french, but this looks like incredible shit to me.

I'm currently trying to figure out how we can get membarrier
to play nicely with recent no-hz kernel features. Indeed, my
initial prototype is a mess. The good news is based on the number
of flaws you found in this RFC, there is plenty of room for
improvement. :)

>
> On Mon, Jan 16, 2017 at 11:51 AM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> +
>> +static int membarrier_register_expedited(struct task_struct *t)
>> +{
>> + struct rq *rq;
>> +
>> + if (t->membarrier_expedited == UINT_MAX)
>> + return -EOVERFLOW;
>> + rq = this_rq();
>> + raw_spin_lock(&rq->lock);
>> + t->membarrier_expedited++;
>> + raw_spin_unlock(&rq->lock);
>> + return 0;
>> +}
>
> Yeah, taking the rq lock with
>
> (a) a random "rq" that isn't stable
>
> (b) without disabling interrupts

So for both register and unregister functions, as well as the use in
membarrier_nohz_full_expedited(), disabling interrupts around the rq
lock should fix this. But perhaps it would be wiser trying not to use the
rq lock at all.

>
> (c) using an internal scheduler helper that isn't supposed to be used
> externally

Yeah, I hate doing that. Hence the TODO comment I've placed near the include:

* TODO: private sched.h is needed for runqueue. Should we move the
* sched code under kernel/sched/ ?

I'm open to better ideas.

>
> (d) when it doesn't even make any sense in the first place for a
> per-thread value that is never modified by any other threads!

The variable "membarrier_expedited" is indeed only modified by the
current thread, but it is read by other threads calling
membarrier_nohz_full_expedited().

For each nohz cpu, membarrier_nohz_full_expedited() needs to access
the cpu's current thread membarrier_expedited field (rq->curr->membarrier_expedited).
Are there ways to ensure loading this field can be done without having
it re-used after exit of its thread other than using the rq lock ?

We also need to be careful about a membarrier registration/unregistration
done concurrently with a membarrier_nohz_full_expedited() on another CPU.

Here is a basic membarrier CMD_SHARED ordering scenario. The CMD_SHARED
membarrier ensures that all memory accesses performed in program order
from each targeted thread are ordered with respect to membarrier().

Initial values:
A = B = 0

CPU 0 | CPU 1
|
x = load A |
| store B = 1
membarrier(CMD_SHARED) |
| barrier() (compiler-level barrier)
y = load B |
| store A = 1

Expect: if x == 1, then y == 1


Now if we add nohz membarrier_expedited registration and unregistration
to the picture for a nohz full cpu, here is how I envision we could do
it without the rq lock:

Initial values:
A = B = 0

CPU 0 | CPU 1 (no-hz full)
|
| membarrier(REGISTER_EXPEDITED)
| set t->membarrier_exped = 1
| smp_mb() [3]
| store B = 1
| barrier() (compiler-level barrier)
| store A = 1
x = load A |
membarrier(CMD_SHARED) |
smp_mb() [1] |
iter. on nohz cpus |
if iter_t->membarrier_exped == 0 |
(skip) |
else |
IPI smp_mb() to target cpu |
| membarrier(UNREGISTER_EXPEDITED)
| smp_mb() [4]
| set t->membarrier_exped = 0
smp_mb() [2] |
y = load B |

Expect: if x == 1, then y == 1

[1,2,3,4] If we don't have those barriers, CPU 0 could observe the store to A,
but not see the store to B.

[1] Orders user-space loads/stores before membarrier CMD_SHARED before load of
membarrier_exped fields,
[2] Orders user-space loads/stores after membarrier CMD_SHARED after load of
membarrier_exped fields,
[3] Orders user-space loads/stores after membarrier REGISTER_EXPEDITED after
store to t->membarrier_exped,
[4] Orders user-space loads/stores before membarrier UNREGISTER_EXPEDITED before
store to t->membarrier_exped,

[1] pairs with [3]: [1] orders load A before load membarrier_exped, matching [3]
which orders store membarrier_exped = 1 before store A = 1.
[2] pairs with [4]: [2] orders load B after load membarrier_exped, matching [4]
which orders store membarrier_exped = 0 after store B = 1.

Using the rq lock was one way to get those barriers, but if we can
use explicit memory barriers instead, it would allow us to do this
without interacting with the scheduler internals.

>
> (e) .. and you expose this ABSOLUTELY SHIT as a random system call.
>
> Oh, and the clone semantics make no sense either.

Currently, this patch clears the state on exec and when forking a new thread,
but keeps the thread state when forking a new process, which AFAIU is
in line with current practices. But perhaps not, what I am missing ?

>
> In fact, it just makes me doubt everything about the whole membarrier
> concept, because it appears *so* terminally broken.
>
> So unless I'm seriously missing something, this is just about the
> worst piece of code I have seen this year.

The good news is we are still in January. ;-)

>
> No.
>
> NO NO NO.
>
> It really smells so broken that I'm wondering if I'm missing anything.
> But I don't think I am. I think the code is just pure garbage.

If there are other ways to synchronize the loads of thread's membarrier_expedited
fields by remote CPUs when iterating on nohz mask wrt thread exiting, then a
refcount-based approach could be used along with explicit memory barriers, and
we could do all this without grabbing the rq lock.

Thanks,

Mathieu


>
> Linus

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com