Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

From: Kirill Tkhai
Date: Mon Sep 28 2015 - 11:49:49 EST




On 28.09.2015 18:36, Kirill Tkhai wrote:
> On 28.09.2015 16:12, Mike Galbraith wrote:
>> On Mon, 2015-09-28 at 13:28 +0300, Kirill Tkhai wrote:
>>
>>> It looks like a NAK may be better: not applying it preserves the L1 cache, while the patch always invalidates it.
>>
>> Yeah, bounce hurts more when there's no concurrency win waiting to be
>> collected. This mixed load wasn't a great choice, but it turned out to
>> be pretty interesting. Something waking a gaggle of waiters on a busy
>> big socket should do very bad things.
>>
>>> Could you tell me whether you run pgbench with just -cX -jY -T30, or with something special? I've tried it,
>>> but the results vary a lot from run to run.
>>
>> pgbench -T $testtime -j 1 -S -c $clients
>
> With -S the results stabilized. It looks like my db is enormous, and that causes some problem. I will
> investigate.
>
> Thanks!
>
>>>> Ok, that's what I want to see, full repeat.
>>>> master = twiddle
>>>> master+ = twiddle+patch
>>>>
>>>> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>>>>                    master                        master+
>>>> pgbench                1      2      3    avg        1      2      3    avg   comp
>>>> clients 1  tps =   18599  18627  18532  18586    17480  17682  17606  17589   .946
>>>> clients 2  tps =   32344  32313  32408  32355    25167  26140  23730  25012   .773
>>>> clients 4  tps =   52593  51390  51095  51692    22983  23046  22427  22818   .441
>>>> clients 8  tps =   70354  69583  70107  70014    66924  66672  69310  67635   .966
>>>>
>>>> Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.
>>>>
>>>>                   master                    master+
>>>> tbench               1     2     3   avg       1     2     3   avg   comp
>>>> pairs 1  MB/s =    430   426   436   430     481   481   494   485  1.127
>>>> pairs 2  MB/s =   1083  1085  1072  1080    1086  1090  1083  1086  1.005
>>>> pairs 4  MB/s =   1725  1697  1729  1717    2023  2002  2006  2010  1.170
>>>> pairs 8  MB/s =   2740  2631  2700  2690    3016  2977  3071  3021  1.123
>>>>
>>>> tbench without competition
>>>>                    master  master+   comp
>>>> pairs 1  MB/s =       694      692   .997
>>>> pairs 2  MB/s =      1268     1259   .992
>>>> pairs 4  MB/s =      2210     2165   .979
>>>> pairs 8  MB/s =      3586     3526   .983  (yawn, all within routine variance)
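
For reference, the avg and comp columns in the tables above appear to be plain arithmetic: avg is the mean of the three runs, and comp is the master+ average divided by the master average, so a value below 1.0 means the patched kernel was slower. A minimal sketch of that calculation follows. The function names avg3 and comp_ratio are mine and do not come from any tool used in this thread, and the results match the tables only up to rounding in the last digit.

```c
/* Mean of three benchmark runs, as in the "avg" columns. */
double avg3(double run1, double run2, double run3)
{
	return (run1 + run2 + run3) / 3.0;
}

/*
 * Ratio of the patched average to the baseline average, as in the
 * "comp" column. Below 1.0: patched kernel slower; above 1.0: faster.
 * Returns 0.0 for a zero baseline instead of dividing by zero.
 */
double comp_ratio(double master_avg, double patched_avg)
{
	if (master_avg == 0.0)
		return 0.0;
	return patched_avg / master_avg;
}
```

For the 4-client pgbench row, comp_ratio(avg3(52593, 51390, 51095), avg3(22983, 23046, 22427)) comes out near 0.44, in line with the .441 in the table.
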
>>>
>>> Hm, it seems tbench with competition does better only because a busy system makes tbench
>>> processes wake on the same cpu.
>>
>> Yeah. When box is really full, select_idle_sibling() (obviously) turns
>> into a waste of cycles, but even as you approach that, especially when
>> filling the box with identical copies of nearly fully synchronous high
>> frequency localhost packet blasters, stacking is a win.
>>
>> What bent my head up a bit was the combined effect of making wake_wide()
>> really keep pgbench from collapsing then adding the affine wakeup grant
>> for tbench. It's not at all clear to me why 2,4 would be so demolished.
>
> Mike, one more thing. wake_wide() and the current logic confuse me a bit.
> They make us decide whether we want an affine wakeup or not, but select_idle_sibling()
> is not a function for choosing within this_cpu's llc domain only. We also use it
> for searching the prev_cpu llc domain, and it seems we are not interested
> in current flips in that case. Imagine a situation where we share a mutex
> with a task on another NUMA node. When that task releases the mutex
> it wakes us, but we definitely won't use the affine logic in this case.
> We wake the wakee anywhere and lose hot cache. I changed the logic and
> tried pgbench 1:8. The results follow (I threw away the first 3 iterations, because
> they differ a lot from iterations >= 4. It looks like the reason is uncached disk IO).
>
>
> Before:
>
> trans. | tps (i) | tps (e)
> --------------------------------------
> 12098226 | 60491.067392 | 60500.886373
> 12030184 | 60150.874285 | 60160.654295
> 11882977 | 59414.829150 | 59424.830637
> 12020125 | 60100.579023 | 60111.600176
> 12161917 | 60809.547906 | 60827.321639
> 12154660 | 60773.249254 | 60783.085165
>
> After:
>
> trans. | tps (i) | tps (e)
> --------------------------------------
> 12770407 | 63849.883578 | 63860.310019
> 12635366 | 63176.399769 | 63187.152569
> 12676890 | 63384.396440 | 63400.930755
> 12639949 | 63199.526330 | 63210.460753
> 12670626 | 63353.079951 | 63363.274143
> 12647001 | 63209.613698 | 63219.812331

All of the above is pgbench -j 1 -S -c 8 -T 200.
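
To put a number on the difference between the two tables, here is a small sketch that averages the tps (e) column of each set of runs and reports the relative change. The array and function names are made up for illustration; the values are copied from the Before and After tables above. By my arithmetic the patched kernel comes out about 5% ahead.

```c
#include <stddef.h>

/* "tps (e)" column of the Before table (unpatched kernel). */
const double tps_before[] = {
	60500.886373, 60160.654295, 59424.830637,
	60111.600176, 60827.321639, 60783.085165,
};

/* "tps (e)" column of the After table (patched kernel). */
const double tps_after[] = {
	63860.310019, 63187.152569, 63400.930755,
	63210.460753, 63363.274143, 63219.812331,
};

/* Arithmetic mean of n samples; 0.0 for an empty set. */
double mean(const double *samples, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += samples[i];
	return n ? sum / (double)n : 0.0;
}

/*
 * Relative change of the "after" mean versus the "before" mean,
 * e.g. 0.05 means +5%. Returns 0.0 if the "before" mean is zero.
 */
double relative_gain(const double *before, size_t nb,
		     const double *after, size_t na)
{
	double base = mean(before, nb);

	if (base == 0.0)
		return 0.0;
	return mean(after, na) / base - 1.0;
}
```

Calling relative_gain(tps_before, 6, tps_after, 6) gives the fractional improvement. Six samples per side is a small set, so this is only a rough indication.
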

> I'm going to test other cases, but could you tell me (if you remember) whether there are reasons
> we skip prev_cpu, as I described above? Certain types of workloads, etc.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..dfbe06b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = wake_flags & WF_SYNC;
>
> -	if (sd_flag & SD_BALANCE_WAKE)
> -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = 1;
> +		if (cpu == prev_cpu || !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> +			goto want_affine;
> +		if (wake_wide(p))
> +			goto want_affine;
> +	}
>
>  	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  	}
>
> -	if (affine_sd) {
> +want_affine:
> +	if (want_affine) {
>  		sd = NULL; /* Prefer wake_affine over balance flags */
> -		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +		if (affine_sd && wake_affine(affine_sd, p, sync))
>  			new_cpu = cpu;
> -	}
> -
> -	if (!sd) {
> -		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
> -			new_cpu = select_idle_sibling(p, new_cpu);
> -
> +		new_cpu = select_idle_sibling(p, new_cpu);
>  	} else while (sd) {
>  		struct sched_group *group;
>  		int weight;
>
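
To make the intent of the diff above easier to follow, here is a user-space sketch of which CPU the patched wake path hands to select_idle_sibling(). It is a simplified model, not kernel code: struct wake_ctx and patched_wake_target() are hypothetical, the boolean fields stand in for the results of wake_wide(), cpumask_test_cpu() and wake_affine(), and it assumes new_cpu starts out as prev_cpu, which the diff context does not show. Only the SD_BALANCE_WAKE case is modelled.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the inputs to one wakeup decision. */
struct wake_ctx {
	int cpu;             /* CPU the waker is running on */
	int prev_cpu;        /* CPU the wakee last ran on */
	bool cpu_allowed;    /* stand-in for cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */
	bool wide;           /* stand-in for wake_wide(p) */
	bool has_affine_sd;  /* domain walk found an SD_WAKE_AFFINE domain spanning prev_cpu */
	bool wake_affine_ok; /* stand-in for wake_affine(affine_sd, p, sync) */
};

/*
 * Model of the patched flow for SD_BALANCE_WAKE: returns the CPU whose
 * llc domain would be searched by select_idle_sibling().
 */
int patched_wake_target(const struct wake_ctx *c)
{
	int new_cpu = c->prev_cpu; /* assumption: initial value is prev_cpu */

	/*
	 * The three early "goto want_affine" cases skip the domain walk,
	 * so affine_sd stays unset and the search starts from prev_cpu.
	 */
	if (c->cpu == c->prev_cpu || !c->cpu_allowed || c->wide)
		return new_cpu;

	/* Otherwise pull the wakee toward the waker only if wake_affine() agrees. */
	if (c->has_affine_sd && c->wake_affine_ok)
		new_cpu = c->cpu;

	return new_cpu;
}
```

In this model a wide wakeup still searches for an idle sibling around prev_cpu, which is the behaviour I am asking about.
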

Regards,
Kirill