Re: [RFC PATCH v2] sched_pair_cpu: Introduce scheduler task pairing system call

From: Mathieu Desnoyers
Date: Fri Jun 26 2020 - 13:44:36 EST


----- On Jun 26, 2020, at 12:00 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Thu, Jun 25, 2020 at 10:56:35AM -0400, Mathieu Desnoyers wrote:
>> ----- On Jun 24, 2020, at 3:50 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>
> I'll try and read the earlier bit later, I can't think today.
>
>> > That's exactly what that signal would do. It would send SIGIO when the
>> > state changes.
>> >
>> > So you want to access CPU-n's data, you open that file, register a
>> > signal and read its state, if offline, you good, do the rseq. If it
>> > suddenly decides to come back online, you're guaranteed that SIGIO
>> > before it reaches userspace.
>> >
>> > The nice thing is that it's all R/O so available to normal users, you
>> > don't have to write to the file.
>>
>> So let's say you have two threads trying to access (offline) CPU-n's data
>> with that algorithm concurrently. How are they serialized with each other ?
>
> Also implement F_SETLK or something :-)

I don't see this being available to non-root users.

>
>> >> We do not want to override the affinity restricted by cgroups because
>> >> we don't want to hurt performance characteristics of another partition
>> >> of the system.
>> >>
>> >> The sched_pair_cpu approach has the benefit of allowing us to touch
>> >> per-cpu data of a given CPU without requiring to run on that CPU, which
>> >> ensures that we do not thrash the cpu cache of cpus on which a thread
>> >> is not allowed to run. It takes care of issues caused by both cgroup
>> >> cpusets and cpu hotplug.
>> >
>> > But now I worry that your thing allows escaping the cgroup constraints,
>> > you can perturb random CPUs you're not allowed on. That's a really bad
>> > 'feature'.
>> >
>> > Offline cpus are okay, because you don't actually need to do anything as
>> > long as they're offline, but restricted CPUs we really should not be
>> > touching, not even a little.
>>
>> With sched_pair_cpu, the paired task never needs to run on the target CPU.
>> The kworker thread runs on the target CPU in the same way other existing
>> worker threads run today, e.g. the ones handling RCU callbacks. AFAIK the
>> priority of those threads can be configured by a system administrator.
>
> Ah, but the critical difference is that all those are only ever run if
> the initial work was initialized on _that_ CPU to begin with. Consider
> an isolated CPU that's spinning in userspace, it would _never_ get any
> kthreads running.
>
> Except now you can, and you even want this system call to be unpriv.
>
> It utterly and completely wrecks NOHZ_FULL.
>
>> Are there additional steps we should take to minimize the impact of this
>> worker thread ? In the same way "no rcu callbacks" CPU can be configured
>> at boot time, we could have "no sched pair cpu" configured at boot, which
>> would prevent sched_pair_cpu system calls from targeting that CPU entirely,
>> and not spawn any kworker on that cpu.
>
> No, no, no! "at boot time" is an utter trainwreck. I've been trying to
> get NOHZ_FULL runtime configurable. This means that your cpuset can
> change at runtime and the CPU you thought you had now is a NOHZ_FULL CPU.
>
> We must not allow pears on it.

One possibility is to simply treat NOHZ_FULL cpus as offline from the
perspective of sched_pair_cpu: no kworker thread on those, and the
queuing is handled elsewhere. As long as rseq is not used on NOHZ_FULL
cpus concurrently with sched_pair_cpu targeting those CPUs, it would work.
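
As a rough userspace illustration of that rule (this is not code from the
patch; cpu_in_list() and sched_pair_treat_as_offline() are hypothetical names
made up for this sketch), the decision could look like the following. It
assumes the cpu lists are given in the usual kernel list format, e.g. "0-3,8",
as exposed by /sys/devices/system/cpu/online and
/sys/devices/system/cpu/nohz_full:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if `cpu` appears in a kernel-style cpu list string such as
 * "0-3,8\n". An empty or malformed list matches no cpu.
 */
static bool cpu_in_list(const char *list, long cpu)
{
	const char *p = list;

	assert(list != NULL);

	while (*p) {
		char *end;
		long lo = strtol(p, &end, 10);
		long hi = lo;

		if (end == p)		/* no digits here: stop parsing */
			return false;
		if (*end == '-') {	/* range "lo-hi" */
			p = end + 1;
			hi = strtol(p, &end, 10);
			if (end == p)
				return false;
		}
		if (cpu >= lo && cpu <= hi)
			return true;
		if (*end != ',')	/* trailing newline or end of string */
			break;
		p = end + 1;
	}
	return false;
}

/*
 * Hypothetical policy helper: a cpu which is either not online or listed
 * as nohz_full is handled through the "offline" path, i.e. no kworker is
 * ever queued on it.
 */
static bool sched_pair_treat_as_offline(const char *online,
					const char *nohz_full, long cpu)
{
	return !cpu_in_list(online, cpu) || cpu_in_list(nohz_full, cpu);
}
```

With online = "0-7" and nohz_full = "4-7", cpu 5 would take the offline path
and cpu 1 would not. The sketch only captures the classification; it says
nothing about how the two lists are kept coherent when NOHZ_FULL becomes
runtime configurable, which is the hard part.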

> I'm thinking that the best option might be to treat CPUs outside of your
> cpuset the same as offline CPUs. That more-or-less requires that tasks
> outside of your cpuset partition don't have access to your shared
> memory, but that isn't an entirely insane assumption.
>
> If you want to share memory across cpuset partitions, you get to keep
> the pieces.

The main issue is that cpuset partitions are changed dynamically at runtime
by external "manager" processes (e.g. Android). To make things even more
interesting, cpusets support both "process" and "threaded" domain types.

There are quite a few scenarios which worry me with this approach, e.g.:

A) rseq and sched_pair_cpu are used by a memory allocator within a process,
and specific threads of that process eventually have their cpuset changed
to exclude some cpus (with cpusets applied per-thread rather than per-process).
Now the memory allocator needs to touch the per-cpu data of specific cpus to
which it does not have access from a given thread while other threads within
the process still use it concurrently.

B) A tracer ring buffer works on per-cpu data in shared memory across processes
with rseq and sched_pair_cpu, and races on per-cpu data because an external
manager process concurrently applies different cpusets (with process domain)
to processes interacting over that shared memory.

> And the nice thing about offline, is that you don't actually need to run
> anything. You only need some exclusion thing (and using a spin-loop on a
> random other CPU for that is bloody insane).

Indeed, for the offline case, the kworker really does not need to keep burning
CPU time. I should eventually have it sleep for a while instead, or have the work
queues for all offline cpus handled by a single kworker for the entire system.

So treating NOHZ_FULL cpus as offline, I'm all good with that. Treating cpus which
are not in the cpuset as offline, not so much.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com