Re: [RFC PATCH v2 00/17] Core scheduling v2

From: Subhra Mazumdar
Date: Fri Apr 26 2019 - 14:42:04 EST



On 4/26/19 3:43 AM, Mel Gorman wrote:
> On Fri, Apr 26, 2019 at 10:42:22AM +0200, Ingo Molnar wrote:
> > [...]
> It should, but it's not perfect. For example, wake_affine_idle does not
> take sibling activity into account even though select_idle_sibling *may*
> take it into account. Even select_idle_sibling in its fast path may use
> an SMT sibling instead of searching.
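
To make the sibling check concrete, here is a rough sketch using existing
kernel primitives (cpu_smt_mask(), available_idle_cpu(), for_each_cpu()).
smt_siblings_idle() is a hypothetical helper for illustration, not what
wake_affine_idle() actually does today:

/*
 * Hypothetical helper, illustration only: report whether all SMT
 * siblings of @cpu (other than @cpu itself) are idle. Built from
 * existing primitives in <linux/cpumask.h> and <linux/topology.h>;
 * the current wake_affine_idle() does not make this check.
 */
static bool smt_siblings_idle(int cpu)
{
        int sibling;

        for_each_cpu(sibling, cpu_smt_mask(cpu)) {
                if (sibling == cpu)
                        continue;
                if (!available_idle_cpu(sibling))
                        return false;
        }
        return true;
}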

> There are also potential side-effects with cpuidle. Some workloads
> migrate around the socket as they are communicating because of how the
> search for an idle CPU works. With SMT on, there is potentially a longer
> opportunity for a core to reach a deep c-state and incur a bigger wakeup
> latency. This is a very weak theory, but I've seen cases where latency
> sensitive workloads with only two communicating tasks are affected by
> CPUs reaching low c-states due to migrations.
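
The wakeup cost being described is visible in state the scheduler already
tracks: idle_get_state() and cpuidle's exit_latency say how expensive it is
to wake a CPU out of its current idle state. A hypothetical sketch of
preferring the shallowest-idle CPU, similar in spirit to the exit-latency
check in the find_idlest_*() path but not a proposed change:

/*
 * Illustrative only: among the CPUs in @cpus, pick an idle one whose
 * current idle state has the smallest exit latency, i.e. the cheapest
 * to wake up. idle_get_state() and struct cpuidle_state::exit_latency
 * are existing kernel interfaces; this helper itself is hypothetical.
 */
static int pick_shallowest_idle_cpu(const struct cpumask *cpus)
{
        unsigned int min_exit_latency = UINT_MAX;
        int cpu, best_cpu = -1;

        for_each_cpu(cpu, cpus) {
                struct cpuidle_state *idle = idle_get_state(cpu_rq(cpu));
                unsigned int lat = idle ? idle->exit_latency : 0;

                if (!available_idle_cpu(cpu))
                        continue;

                if (lat < min_exit_latency) {
                        min_exit_latency = lat;
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}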

> > Clearly it doesn't.
> >
> It's more that it's best effort to wake up quickly instead of being perfect
> by using an expensive search every time.
> > Yeah, but your numbers suggest that for *most* not heavily interacting
> > under-utilized CPU-bound workloads we hurt in the 5-10% range compared to
> > no-SMT - more in some cases.
> >
> Indeed, it was higher than expected and we can't even use the excuse that
> more resources are available to a single logical CPU, as the scheduler is
> meant to keep them apart.

> > So we avoid a maybe 0.1% scheduler placement overhead but inflict 5-10%
> > harm on the workload, and also blow up stddev by randomly co-scheduling
> > two tasks on the same physical core? Not a good trade-off.
> >
> > I really think we should implement a relatively strict physical core
> > placement policy in the under-utilized case, and resist any attempts to
> > weaken this for special workloads that ping-pong quickly and benefit from
> > sharing the same physical core.
> >
> It's worth a shot at least. Changes should mostly be in the wake_affine
> path for most loads of interest.
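
As a strawman of what a stricter wake_affine-side policy could look like,
purely illustrative and not a proposed patch, reusing the hypothetical
smt_siblings_idle() helper sketched above:

/*
 * Illustrative strawman of a strict core-placement policy for the
 * under-utilized case: only treat a candidate CPU as a good target if
 * its whole physical core is idle, so two runnable tasks are not
 * packed onto SMT siblings. Not the current wake_affine_idle() logic.
 */
static int wake_affine_core_strict(int this_cpu, int prev_cpu)
{
        /* Prefer the waker's CPU only if its entire core is free. */
        if (available_idle_cpu(this_cpu) && smt_siblings_idle(this_cpu))
                return this_cpu;

        /* Otherwise the wakee's previous CPU, on the same condition. */
        if (available_idle_cpu(prev_cpu) && smt_siblings_idle(prev_cpu))
                return prev_cpu;

        /* No fully idle core among the candidates; let the caller decide. */
        return nr_cpumask_bits;
}
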
Doesn't select_idle_sibling already try to do that by calling
select_idle_core? For our OLTP workload we in fact found that the cost of
select_idle_core hurt more than it helped by finding a fully idle core,
so it was a net negative.
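
For reference, the reason that search is expensive: finding a fully idle
core means walking the CPUs sharing the LLC and checking every SMT sibling
of each candidate, roughly like the hypothetical sketch below (the real
select_idle_core() is similar in shape but also drops siblings it has
already scanned from the candidate mask):

/*
 * Hypothetical sketch of the fully-idle-core search cost: the outer
 * loop walks the LLC starting near @target, and the inner loop requires
 * every SMT sibling of a candidate core to be idle. On a large socket
 * that is a lot of state to touch on every wakeup.
 */
static int find_fully_idle_core(const struct cpumask *llc_cpus, int target)
{
        int core, sibling;

        for_each_cpu_wrap(core, llc_cpus, target) {
                bool idle = true;

                for_each_cpu(sibling, cpu_smt_mask(core)) {
                        if (!available_idle_cpu(sibling)) {
                                idle = false;
                                break;
                        }
                }
                if (idle)
                        return core;
        }

        return -1;
}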