Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

From: Kirill Tkhai
Date: Wed Sep 30 2015 - 15:16:44 EST




On 29.09.2015 20:29, Mike Galbraith wrote:
> On Tue, 2015-09-29 at 19:00 +0300, Kirill Tkhai wrote:
>>
>> On 29.09.2015 17:55, Mike Galbraith wrote:
>>> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
>>>
>>>> ---
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4df37a4..dfbe06b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>>>  	int want_affine = 0;
>>>>  	int sync = wake_flags & WF_SYNC;
>>>>
>>>> -	if (sd_flag & SD_BALANCE_WAKE)
>>>> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>>> +	if (sd_flag & SD_BALANCE_WAKE) {
>>>> +		want_affine = 1;
>>>> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>>>> +			goto want_affine;
>>>> +		if (wake_wide(p))
>>>> +			goto want_affine;
>>>> +	}
>>>
>>> That blew wake_wide() right out of the water.
>>>
>>> It's not only about things like pgbench. Drive multiple tasks in a Xen
>>> guest (single event channel dom0 -> domu, and no select_idle_sibling()
>>> to save the day) via network, and watch workers fail to be all they can
>>> be because they keep being stacked up on the irq source. Load balancing
>>> yanks them apart, next irq stacks them right back up. I met that in
>>> enterprise land, thought wake_wide() should cure it, and indeed it did.
>>
>> 1) Hm.. The patch makes select_task_rq_fair() prefer the old cpu over the
>> current one, doesn't it? More often than not we don't set affine_sd, so the
>> part of the patch skipped in the quote selects prev_cpu.
>
> Not the way I read it..
>
>>> -	if (affine_sd) {
>>> +want_affine:
>>> +	if (want_affine) {
>>>  		sd = NULL; /* Prefer wake_affine over balance flags */
>>> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>>> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>>>  			new_cpu = cpu;
>>> -	}
>>> -
>>> -	if (!sd) {
>>> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>>> -			new_cpu = select_idle_sibling(p, new_cpu);
>>> -
>>> +		new_cpu = select_idle_sibling(p, new_cpu);
>
> ..it sets new_cpu = cpu if wake_affine() says Ok, wake_wide() has no say
> in the matter.
>
>> 2) I thought about wakeups from irq handlers and was even going to ask why
>> we use affine logic for them. Device handlers usually aren't bound, and
>> timers may migrate now that the NO_HZ logic is present. The only explanation
>> I found is that unbound timers are a very unlikely case (I added a statistics
>> printk to my local sched_debug to check that). But if we have situations like
>> the one you described above, don't we have to disable affine logic for
>> in_interrupt() cases?
>
> BTDT. In my experience, the more you try to differentiate sources, the
> more corner cases you create. I've tried doing special things for irq,
> locks, wake_all, wake_one, and it always turned into a can of worms.
> IMHO, the best policy for the fast path is KISS.
>
>> 3) I ask just because (being outside of scheduler history) it seems a little
>> bit strange that we prefer smp_processor_id()'s sd_llc so much. The profit
>> of a sync wakeup is more or less clear: smp_processor_id()'s sd_llc may
>> contain some data which is interesting for the wakee, and this minimizes
>> cache misses. But we do the same in other cases too, and at every migration
>> we lose itlb, dtlb... Of course, this requires more accurate patches than
>> the one posted (not such crude patches).
>
> IMHO, the sync wakeup hint is more often a big fat lie than anything
> else, it really just gives us a bit more headroom for affine wakeups in
> cases where that's likely to be a very good thing (affine in the cache
> sense, not affine as in an individual CPU). What it means is that waker
> is likely to schedule RSN, but if you measure even very fast/light
> things, there is an overlap win to be had by NOT waking CPU affine,
> rather waking cache affine, that's why we cross core schedule so often.
> A real network app doing a wakeup is not necessarily gonna schedule
> RSN, there is very often a latency win to be had by scheduling to a
> nearby core, ie a thread pool worker doing a "sync" wakeup may very
> well instantly find that it has more work to do. If a fast/light wakee can
> slip into an idle crack and get to CPU instantly, it can generate more
> work a little bit sooner.

Yeah, in most places where a sync wakeup is used, the task is not going to
reschedule soon.
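
To make the overlap argument concrete for myself, here is a toy userspace
model (not kernel code). Every name and number in it is hypothetical: it
assumes the wakee stacked on the waker's CPU has to wait until the waker gets
off the CPU (no wakeup preemption), and that kicking an idle cache-sharing CPU
has a fixed cost.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of wakee start latency. Times are in microseconds, counted
 * from the moment of the wakeup. All fields are illustrative assumptions.
 */
struct wakeup_case {
	unsigned int waker_runs_on_us;    /* how long the waker keeps its CPU after the wakeup */
	unsigned int remote_wake_cost_us; /* assumed cost of kicking an idle CPU sharing cache */
};

/* CPU affine: the wakee is stacked behind the waker and waits for it to leave. */
static unsigned int latency_cpu_affine(const struct wakeup_case *c)
{
	assert(c != NULL);
	return c->waker_runs_on_us;
}

/* Cache affine: the wakee goes to an idle sibling and pays only the kick cost. */
static unsigned int latency_cache_affine(const struct wakeup_case *c)
{
	assert(c != NULL);
	return c->remote_wake_cost_us;
}

/* Nonzero when waking on the idle sibling gets the wakee running sooner. */
static int cache_affine_wins(const struct wakeup_case *c)
{
	return latency_cache_affine(c) < latency_cpu_affine(c);
}
```

With an honest sync hint (say, the waker sleeps after 1 us and the kick costs
5 us) stacking wins in this model; as soon as the waker finds more work and
runs on for, say, 200 us, the idle sibling wins by a wide margin.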

Thanks for the explanation, Mike!