Re: [RFC PATCH v4 0/2] sched/fair: Choose the CPU where short task is running during wake up

From: Chen Yu
Date: Sun Jan 22 2023 - 10:38:18 EST


On 2023-01-20 at 17:09:28 +0800, Yicong Yang wrote:
> On 2023/1/20 0:11, Chen Yu wrote:
> > Hi Yicong,
> > On 2023-01-17 at 10:53:25 +0800, Yicong Yang wrote:
> >> Hi Chen Yu,
> >>
> >> On 2022/12/16 14:08, Chen Yu wrote:
> >>> The main purpose of this change is to avoid too many cross-CPU
> >>> wakeups when they are unnecessary. Frequent cross-CPU wakeups
> >>> bring significant damage to some workloads, especially on high
> >>> core count systems.
> >>>
> >>> This patch set inhibits the cross-CPU wakeup by placing the wakee
> >>> on the waking CPU or its previous CPU, if both the waker and the
> >>> wakee are short-duration tasks.
> >>>
> >>> The first patch introduces the definition of a short-duration
> >>> task. The second patch leverages the first to choose a local
> >>> or previous CPU for the wakee.
> >>>
> >>
> >> Tested with tbench and netperf locally and MySQL remotely. Unluckily I
> >> cannot reproduce the improvement and instead see some regression.
> > Thank you for the test.
> >> Detailed information below. The tests are done based on 6.2-rc1 on a 2x64-CPU
> >> Kunpeng 920 server with 4 nodes, no SMT, and CONFIG_SCHED_CLUSTER
> >> enabled.
> >>
> >> tbench and netperf were tested on a single node and on two nodes within one socket.
> >>
> > Does this platform have 2 sockets, each socket with 2 NUMA nodes, each node
> > attached to 32 CPUs, and are the benchmarks bound to the first socket?
> > And may I know how many CPUs there are in one LLC? Is it also 32?
>
> You're right, exactly. There are 32 CPUs sharing one LLC.
>
> >> Further looked into the 64-thread case since it regresses the most, mainly on the
> >> server (mysqld) side. In this case most of the load is on node 0 and partly
> >> on node 1. With SIS_SHORT the utilization on node 0 stays almost the same,
> >> but the utilization on node 1 drops to nearly half of the baseline. That
> >> may partly answer why the performance decreases.
> >>
> > OK, I see. So it seems that the NUMA sched domain load balance does not spread
> > tasks fast enough while SIS_SHORT is frequently waking tasks on the same node.
> > But I thought select_idle_sibling() would wake up the tasks
> > on the same node (LLC) anyway, so why does SIS_SHORT inhibit the load balance more?
>
> Wakeup happens much more frequently than load balance in this case and I don't
> think load balance can handle it in time. In the fast path of wakeup we care little
> about the load. SIS_SHORT may aggravate this since we're likely to put 2 short tasks
> on a single CPU.
>
I see. Suppose cpu0 ~ cpu31 are on node0 and cpu32 ~ cpu63 are on node1. With SIS_SHORT
enabled, wakees could be stacked on cpu0. With SIS_SHORT disabled, wakees could be
woken up on cpu0 ~ cpu31. I understand that SIS_SHORT inhibits searching for idle CPUs
in node0, but why does SIS_SHORT inhibit tasks migrating to node1?
> >> mpstat 60s, non-idle%
> >>            vanilla    sis_short
> >> node 0      94.25      91.67      ( -2.73%)
> >> node 1      46.41      30.03      (-35.29%)
> >>
> >> Previously there's an argument that a sync wakeup from interrupts may be incorrectly
> > I see. Do you mean that wakeup from an interrupt might cause tasks to stack on the
> > same CPU, because the interrupt is delivered to that CPU? I thought userspace irq
> > load balancing would spread the interrupts? Or do you mean something else?
>
> Nope. I'm using a HiSilicon HNS3 network card which has 32 queues and 33 interrupts
> spread evenly across 32 CPUs, not concentrated on a single CPU. I'm considering a
> case where p1 is on cpu0 and p2 is on cpu32: an interrupt happens on cpu0 and tries
> to wake up p2, and since p1 and p2 are both short tasks, p2 will likely be woken
> and wait on cpu0. p1 is not the actual waker and has not finished yet, so this will
> increase both the wait time and the load on cpu0. Previously we would wake within
> the LLC of cpu0 and try to find an idle CPU; with SIS_SHORT we'll just wake on cpu0.
>
> Does it make sense to allow the SIS_SHORT wakeup only from task context? I'll
> give it a try.
>
This sounds reasonable to me. I think we can further enhance it.
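For example, the shortcut could be gated on task context. A minimal sketch of
the idea (the exact placement of the SIS_SHORT check in select_idle_sibling()
is my assumption based on your patch; in_task() is the generic test for
non-interrupt context):

	/*
	 * Sketch: only take the SIS_SHORT shortcut when the wakeup
	 * comes from task context, so that an interrupt landing on
	 * the target CPU does not pull the wakee onto a CPU whose
	 * current task never asked to wake it.
	 */
	if (in_task() &&
	    is_short_task(cpu_curr(target)) && is_short_task(p))
		return target;
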
And before you send out the proposal, could you help test the following
experimental change, to see if it makes tbench, netperf and MySQL better?

Hi Prateek, could you also help check if the below change would make the Spec
regression go away?
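
To be specific, the change below makes the short-task threshold scale with the
LLC size (llc stands for the sd_llc_size of the waking CPU):

	old: dur_avg * 8    <= sysctl_sched_min_granularity
	     => dur_avg     <= sysctl_sched_min_granularity / 8

	new: dur_avg * 1024 <= llc * sysctl_sched_min_granularity
	     => dur_avg     <= (llc / 1024) * sysctl_sched_min_granularity

On your platform with 32 CPUs per LLC this becomes sysctl_sched_min_granularity / 32,
a 4x stricter bar than before, so fewer tasks are regarded as short and the local
wakeup is taken less aggressively.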

thanks,
Chenyu

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc8a636eca5e..6e1ab5e2686c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4974,7 +4974,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 static inline int is_short_task(struct task_struct *p)
 {
 	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
-		((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
+		((p->se.dur_avg * 1024) <= (__this_cpu_read(sd_llc_size) * sysctl_sched_min_granularity));
 }
 
 /*
--
2.25.1