Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu

From: Chen Yu
Date: Sat Jan 14 2023 - 11:00:14 EST


On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@xxxxxxxxx> wrote:
> >
> > On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > > From: Peter Oskolkov <posk@xxxxxxxxxx>
> > >
> > > Add WF_CURRENT_CPU wake flag that advises the scheduler to
> > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > context switching use cases.
> > >
> > > In addition, make ttwu external rather than static so that
> > > the flag could be passed to it from outside of sched/core.c.
> > >
> > > Signed-off-by: Peter Oskolkov <posk@xxxxxxxxxx>
> > > Signed-off-by: Andrei Vagin <avagin@xxxxxxxxx>
> > > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > if (wake_flags & WF_TTWU) {
> > > record_wakee(p);
> > >
> > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > + return cpu;
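
For context: once ttwu is external, a caller outside of sched/core.c
could request a same-CPU wakeup roughly like this (a sketch only; the
"wakee" variable is assumed, not taken from this patch):

	/* Ask the scheduler to place the wakee on the waker's CPU. */
	try_to_wake_up(wakee, TASK_NORMAL, WF_CURRENT_CPU);
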
> > I agree that cross-CPU wakeups bring pain to fast context-switching
> > use cases, especially on high-core-count systems. We suffered from this
> > issue as well and presented it previously. The difference is that we
> > used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it: if the
> > waker and wakee are both short-duration tasks, let the waker wake up
> > the wakee on the current CPU. That way not only seccomp but also other
> > components/workloads could benefit without having to set the
> > WF_CURRENT_CPU flag.
> >
> > Link [1]:
> > https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@xxxxxxxxx/
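
To recap the idea from [1]: a "short-duration" task is, roughly, one
whose measured average run time is below the migration cost. A sketch
of the criterion (the exact form is in the series; se.dur_avg is the
per-task average duration that the series introduces):

	/* Rough sketch of is_short_task(): treat a task as short if its
	 * tracked average run duration is below the migration cost. */
	static inline int is_short_task(struct task_struct *p)
	{
		return p->se.dur_avg &&
		       p->se.dur_avg < sysctl_sched_migration_cost;
	}
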
>
> Thanks for the link. I like the idea, but this change has no impact on the
> seccomp notify case. I used the benchmark from the fifth patch. It is
> a ping-pong benchmark in which one process triggers system calls, and
> another process handles them. It measures the number of system calls
> that can be processed within a specified time slice.
>
Thanks for this information.
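For readers following along, the handler side of such a ping-pong loop
looks roughly like this (a condensed sketch, not the actual selftest
code; notify_fd is the listener fd returned by seccomp(2) with
SECCOMP_FILTER_FLAG_NEW_LISTENER):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/seccomp.h>

	static void handle_notifications(int notify_fd)
	{
		struct seccomp_notif req;
		struct seccomp_notif_resp resp;

		for (;;) {
			/* The request buffer must be zeroed before
			 * SECCOMP_IOCTL_NOTIF_RECV. */
			memset(&req, 0, sizeof(req));
			if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
				break;

			memset(&resp, 0, sizeof(resp));
			resp.id = req.id;	/* echo the request id */
			resp.val = 0;		/* emulated syscall result */
			if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
				break;
		}
	}

Each round trip costs two wakeups (tracee -> handler, handler ->
tracee), which is why wakeup placement dominates the nsec/syscall
numbers below.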
> $ cd tools/testing/selftests/seccomp/
> $ make
> $ ./seccomp_bpf 2>&1| grep user_notification_sync
> # RUN global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> # OK global.user_notification_sync
> ok 51 global.user_notification_sync
>
> The results are the same with and without your change. I expected your
> change to improve the basic case so that it reaches the sync one.
>
The reason why the patch did not bring a benefit might be that it
aims to wake the task on the current CPU only when there are no idle
cores. The seccomp notify test launches only 2 processes, which makes
it hard to saturate the system, if I understand correctly. I built a
kernel based on your repo and launched the above test; the average load
is quite low on my system. Is this expected? Besides, is 100 ms long
enough for the test (USER_NOTIF_BENCH_TIMEOUT)?
> I did some experiments and found that we can achieve the desirable
> outcome if we move the "short-task" checks prior to considering waking
> up on prev_cpu.
>
> For example, with this patch:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f89e44e237d..af20b58e3972 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
> static int
> wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> {
> + /* The only running task is a short duration one. */
> + if (cpu_rq(this_cpu)->nr_running == 1 &&
> + is_short_task(cpu_curr(this_cpu)))
> + return this_cpu;
> +
In v1 we put the above code before checking the prev_cpu, but it was
reported to bring a regression on some systems[2]. The result suggested
that we should try to pick an idle CPU first, no matter whether it is
the current CPU or the previous CPU; only if that fails should we lower
the bar and pick the current CPU.

[2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@xxxxxxx/
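
In other words, the intended ordering is roughly the following (a
sketch of the logic only, not the exact kernel code; is_short_task()
is the heuristic from my series):

	static int pick_wakeup_cpu(int this_cpu, int prev_cpu)
	{
		/* Prefer an idle CPU first, current or previous. */
		if (available_idle_cpu(this_cpu))
			return this_cpu;
		if (available_idle_cpu(prev_cpu))
			return prev_cpu;

		/* No idle candidate: lower the bar and allow stacking on
		 * the current CPU if its only running task is short. */
		if (cpu_rq(this_cpu)->nr_running == 1 &&
		    is_short_task(cpu_curr(this_cpu)))
			return this_cpu;

		return nr_cpumask_bits;	/* no preference */
	}
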
> /*
> * If this_cpu is idle, it implies the wakeup is from interrupt
> * context. Only allow the move if cache is shared. Otherwise an
> @@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (available_idle_cpu(prev_cpu))
> return prev_cpu;
>
> - /* The only running task is a short duration one. */
> - if (cpu_rq(this_cpu)->nr_running == 1 &&
> - is_short_task(cpu_curr(this_cpu)))
> - return this_cpu;
> -
> return nr_cpumask_bits;
> }
>
> @@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> asym_fits_cpu(task_util, util_min, util_max, target))
> return target;
>
> + if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
> + is_short_task(cpu_curr(target)) && is_short_task(p))
> + return target;
> +
> /*
> * If the previous CPU is cache affine and idle, don't be stupid:
> */
>
>
> the basic test case shows almost the same results as the sync one:
>
> $ ./seccomp_bpf 2>&1| grep user_notification_sync
> # RUN global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> # OK global.user_notification_sync
> ok 51 global.user_notification_sync
>
> If you want to do any experiments, you can find my tree here:
> [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
>
I'm happy to test this further. One thing I'm curious about is: where
does the benefit of waking up the seccomp task on the current CPU come
from? Is it due to a hot L1 cache?

thanks,
Chenyu
> Thanks,
> Andrei