Re: [PATCH 2/3] task: RCU protect tasks on the runqueue

From: Peter Zijlstra
Date: Wed Sep 25 2019 - 03:37:23 EST


On Mon, Sep 09, 2019 at 07:22:15AM -0500, Eric W. Biederman wrote:

> Let me see if I can explain my confusion in terms of task_numa_compare.
>
> The function task_numa_comare now does:
>
> rcu_read_lock();
> cur = rcu_dereference(dst_rq->curr);
>
> Then it proceeds to examine a few fields of cur.
>
> cur->flags;
> cur->cpus_ptr;
> cur->numa_group;
> cur->numa_faults[];
> cur->total_numa_faults;
> cur->se.cfs_rq;
>
> My concern here is the ordering between setting rq->curr and and setting
> those other fields. If we read values of those fields that were not
> current when we set cur than I think task_numa_compare would give an
> incorrect answer.

That code is explicitly without serialization. One memory barrier isn't
going to make any difference what so ever.

Specifically, those numbers will change _after_ we make it current, not
before.

> From what I can see pick_next_task_fair does not mess with any of
> the fields that task_numa_compare uses. So the existing memory barrier
> should be more than sufficient for that case.

There should not be, but even if there is, load-balancing in general
(very much including the NUMA balancing) is completely unserialized and
prone to reading stale data at all times.

This is on purpose; it minimizes the interference of load-balancing, and
if we make a 'wrong' choice, the next pass can fix it up.

> So I think I just need to write up a patch 4/3 that replaces the my
> "rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
> next)". And includes a great big comment to that effect.
>
>
>
> Now maybe I am wrong. Maybe we can afford to see data that was changed
> before the task became rq->curr in task_numa_compare. But I don't think
> so. I really think it is that full memory barrier just a bit earlier
> in __schedule that is keeping us safe.
>
> Am I wrong?

Not in so far that we now both seem to agree on using
RCU_INIT_POINTER(); but we're still disagreeing on why.

My argument is that since we can already obtain any task_struct pointer
through find_task_by_vpid() and friends, any access to its individual
members must already have serialzation rules (or explicitly none, as for
load-balancing).

I'm viewing this as just another variant of find_task_by_vpid(), except
we index by CPU instead of PID. There can be changes to task_struct
between to invocations of find_task_by_vpid(), any user will have to
somehow deal with that. This is no different.

Take for instance p->cpus_ptr, it has very specific serialzation rules
(which that NUMA balancer thing blatantly ignores), as do a lot of other
task_struct fields.

The memory barrier in question isn't going to help with any of that.

Specifically, if we care about the consistency of the numbers changed by
pick_next_task(), we must acquire rq->lock (of course, once we do that,
we can immediately use rq->curr without further RCU use).