Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu

From: Andrei Vagin
Date: Wed Jan 18 2023 - 01:24:52 EST


On Sat, Jan 14, 2023 at 8:00 AM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
>
> On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> > On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> > >
> > > On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <posk@xxxxxxxxxx>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx>
> > > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxx>
> > > > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > if (wake_flags & WF_TTWU) {
> > > > record_wakee(p);
> > > >
> > > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > + return cpu;
> > > I agree that cross-CPU wake up brings pain to fast context switching
> > > use cases, especially on high core count system. We suffered from this
> > > issue as well, so previously we presented this issue as well. The difference
> > > is that we used some dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it.
> > > That is, if the waker/wakee are both short duration tasks, let the waker wakes up
> > > the wakee on current CPU. So not only seccomp but also other components/workloads
> > > could benefit from this without having to set the WF_CURRENT_CPU flag.
> > >
> > > Link [1]:
> > > https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@xxxxxxxxx/
> >
> > Thanks for the link. I like the idea, but this change has no impact on the
> > seccom notify case. I used the benchmark from the fifth patch. It is
> > a ping-pong
> > benchmark in which one process triggers system calls, and another process
> > handles them. It measures the number of system calls that can be processed
> > within a specified time slice.
> >
> Thanks for this information.
> > $ cd tools/testing/selftests/seccomp/
> > $ make
> > $ ./seccomp_bpf 2>&1| grep user_notification_sync
> > # RUN global.user_notification_sync ...
> > # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> > # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> > # OK global.user_notification_sync
> > ok 51 global.user_notification_sync
> >
> > The results are the same with and without your change. I expected that
> > your change improves
> > the basic case so that it reaches the sync one.
> >
> The reason why the patch did not bring benefit might be that, that patch
> aims to wake task on current CPU only there is no idle cores. The seccomp
> notify would launch 2 processes which makes it hard to saturate the system
> if I understand correctly?

Yes, you understand it right. Our workloads do not always scale on all CPU-s,
and it is critical to work with maximum performance even when there are some
idle cores. I feel I need to say a few words about our use-case. I am working on
gVisor, it is a sandbox solution. In a few words, we intercept guest system
calls and handle them in our user-mode kernel. All these mean that we are
limited by a user application that is running inside a sandbox and how well it
scales on multiple cpu-s. The current benchmark emulates a uni-thread
process running in a sandbox.

> I built a kernel based on your repo, and launched
> above test, it seems that the average load is quit low on my system. Is this
> expected? Besides, is 100 ms long enough to test(USER_NOTIF_BENCH_TIMEOUT)?

The accuracy within the 20% range is good enough for this test. But if
we want to
have a real benchmark, we need to implement it in a separate binary or we can
add it to the perf benchmark suite.

> > I did some experiments and found that we can achieve the desirable
> > outcome if we move the "short-task" checks prior to considering waking
> > up on prev_cpu.
> >
> > For example, with this patch:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2f89e44e237d..af20b58e3972 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
> > static int
> > wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > {
> > + /* The only running task is a short duration one. */
> > + if (cpu_rq(this_cpu)->nr_running == 1 &&
> > + is_short_task(cpu_curr(this_cpu)))
> > + return this_cpu;
> > +
> In v1 we put above code before checking the prev_cpu, but is was reported
> to bring regression on some systems[2]. The result suggested that, we should
> try to pick idle CPU first, no matter it is current CPU, or previous CPU,
> if it failed, we can lower the bar and pick the current CPU.

Maybe the criteria of a short task should be lower for idle cpu-s. It should be
close to the cost of waking up an idle cpu.

>
> [2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@xxxxxxx/
> > /*
> > * If this_cpu is idle, it implies the wakeup is from interrupt
> > * context. Only allow the move if cache is shared. Otherwise an
> > @@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (available_idle_cpu(prev_cpu))
> > return prev_cpu;
> >
> > - /* The only running task is a short duration one. */
> > - if (cpu_rq(this_cpu)->nr_running == 1 &&
> > - is_short_task(cpu_curr(this_cpu)))
> > - return this_cpu;
> > -
> > return nr_cpumask_bits;
> > }
> >
> > @@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct
> > task_struct *p, int prev, int target)
> > asym_fits_cpu(task_util, util_min, util_max, target))
> > return target;
> >
> > + if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
> > + is_short_task(cpu_curr(target)) && is_short_task(p))
> > + return target;
> > +
> > /*
> > * If the previous CPU is cache affine and idle, don't be stupid:
> > */
> >
> >
> > the basic test case shows almost the same results as the sync one:
> >
> > $ ./seccomp_bpf 2>&1| grep user_notification_sync
> > # RUN global.user_notification_sync ...
> > # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> > # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> > # OK global.user_notification_sync
> > ok 51 global.user_notification_sync
> >
> > If you want to do any experiments, you can find my tree here:
> > [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
> >
> I'm happy to further test on this. One thing I'm curious about is, where does the
> benefit of waking up seccom task on current CPU come from? Is it due to hot L1 cache?

I don't think that it is due to cpu caches. It should be due to the
overhead of queuing a task to
a non-current queue, sending ipt, waking from the idle loop.

Thanks,
Andrei