Re: [PATCH] sched/fair: no sync wakeup from interrupt context

From: Libo Chen
Date: Thu Jul 14 2022 - 16:22:10 EST




On 7/14/22 07:18, Mel Gorman wrote:
On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:

On 7/13/22 09:40, Tim Chen wrote:
On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
Barry Song first pointed out that replacing sync wakeup with regular wakeup
seems to reduce overeager wakeup pulling and shows noticeable performance
improvement.[1]

This patch argues that allowing sync for wakeups from interrupt context
is a bug and fixing it can improve performance even when irq/softirq is
evenly spread out.

For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
waker can be any random task that happens to be running on that CPU when the
interrupt comes in. This is completely different from other wakups where the
task running on the waking CPU is the actual waker. For example, two tasks
communicate through a pipe or mutiple tasks access the same critical section,
etc. This difference is important because with sync we assume the waker will
get off the runqueue and go to sleep immedately after the wakeup. The
assumption is built into wake_affine() where it discounts the waker's presence
from the runqueue when sync is true. The random waker from interrupts bears no
relation to the wakee and don't usually go to sleep immediately afterwards
unless wakeup granularity is reached. Plus the scheduler no longer enforces the
preepmtion of waker for sync wakeup as it used to before
patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
wakeup preemption for wakeups from interrupt contexts doesn't seem to be
appropriate too but at least sync wakeup will do what it's supposed to do.
Will there be scenarios where you do want the task being woken up be pulled
to the CPU where the interrupt happened, as the data that needs to be accessed is
on local CPU/NUMA that interrupt happened? For example, interrupt associated with network
packets received. Sync still seems desirable, at least if there is no task currently
running on the CPU where interrupt happened. So maybe we should have some consideration
of the load on the CPU/NUMA before deciding whether we should do sync wake for such
interrupt.

There are only two places where sync wakeup matters: wake_affine_idle() and
wake_affine_weight().
In wake_affine_idle(), it considers pulling if there is one runnable on the
waking CPU because
of the belief that this runnable will voluntarily get off the runqueue. In
wake_affine_weight(),
it basically takes off the waker's load again assuming the waker goes to
sleep after the wakeup.
My argument is that this assumption doesn't really hold for wakeups from the
interrupt contexts
when the waking CPU is non-idle. Wakeups from task context? sure, it seems
to be a reasonable
assumption. For your idle case, I totally agree but I don't think having
sync or not will actually
have any impacts here giving what the code does. Real impact comes fromMel's
patch 7332dec055f2457c3
which makes it less likely to pull tasks when the waking CPU is idle. I
believe we should consider
reverting 7332dec055f2 because a significant RDS latency regression has been
spotted recently on our
system due to this patch.

The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
It still allowed cache-local migrations on the assumption that the incoming
data was critical enough to justify losing any other cache-hot data. You
I am not too against cache-local migrations here because they are still under the
same LLC. We really focus on cross-node migrations that sometimes hurt us,
sometimes don't because this is where you need to determine who has the warmer L3
cache.
state explicitly that "the interrupt CPU isn't as performance critical as
cache from its previous CPU" so that assumption was incorrect, at least
in your case. I don't have a counter example where the interrupt data *is*
more important than any other cache-hot data so the check can go.

I think a revert would not achieve what you want as a plain revert would
still allow an interrupt to pull a task from an arbitrary location as sync
This is the tricky part, I didn't explain it well. For rds-stress, it's a
lot (~30%) better to allow pulling across nodes when the waking CPU is idle.
I think this may be an example of interrupt data being more important. Something
like below will help a lot for this particular benchmark (rds-stress):

if (available_idle_cpu(this_cpu))
return this_cpu;

But for db workloads, yes, in general, pulling does more damages than goods as
said before esp. under light load. But I think pulling to an idle CPU across nodes
is okay, because this usually happens at the beginning of a workload and once a
task is pulled it stays there or at least in the same LLC sched domain for a while.
What is not okay is when the waking CPU already has a task and the wakeup still pulls
the wakee task to that CPU across nodes irrespective of its previous CPU. And that's
what this patch tries to address.

Mel, I am thinking about a follow-up patch like below then we can continue the discussion
there since this is kinda a separate issue:

- if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
- return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-

+       if (available_idle_cpu(this_cpu))
+               if (cpus_share_cache(this_cpu, prev_cpu))
+                       return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+       else
+               return this_cpu;

is not checked. A follow-up to your patch or an updated version should not
check available_idle_cpu at all in wake_affine_idle as it's only idle the
wake is from interrupt context and vcpu_is_preempted is not necessarily
justification for pulling a task due to an interrupt.

Something like this but needs testing with your target loads, particularly
the RDS (Relational Database Service?) latency regression;
Sorry, it's Reliable Datagram Sockets. We run rds-stress to measure rds
bandwidth and latency.

Libo
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7b275672c89..e55a3a67a442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
* soonest. For the purpose of speed we only consider the waking and previous
* CPU.
*
- * wake_affine_idle() - only considers 'now', it check if the waking CPU is
- * cache-affine and is (or will be) idle.
+ * wake_affine_idle() - only considers 'now', it checks if the waker task is a
+ * sync wakeup from a CPU that should be idle soon.
*
* wake_affine_weight() - considers the weight to reflect the average
* scheduling latency of the CPUs. This seems to work
@@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
- /*
- * If this_cpu is idle, it implies the wakeup is from interrupt
- * context. Only allow the move if cache is shared. Otherwise an
- * interrupt intensive workload could force all tasks onto one
- * node depending on the IO topology or IRQ affinity settings.
- *
- * If the prev_cpu is idle and cache affine then avoid a migration.
- * There is no guarantee that the cache hot data from an interrupt
- * is more important than cache hot data on the prev_cpu and from
- * a cpufreq perspective, it's better to have higher utilisation
- * on one CPU.
- */
- if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
- return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-
if (sync && cpu_rq(this_cpu)->nr_running == 1)
return this_cpu;