Re: [PATCH v4 04/11] sched/fair: rework load_balance

From: Mel Gorman
Date: Fri Nov 08 2019 - 13:37:38 EST


On Fri, Nov 08, 2019 at 05:35:01PM +0100, Vincent Guittot wrote:
> > Fair enough, netperf hits the corner case where it does not work but
> > that is also true without your series.
>
> I run mmtest/netperf test on my setup. It's a mix of small positive or
> negative differences (see below)
>
> <SNIP>
>
> netperf-tcp
> 5.3-rc2 5.3-rc2
> tip +rwk+fix
> Hmean 64 871.30 ( 0.00%) 860.90 * -1.19%*
> Hmean 128 1689.39 ( 0.00%) 1679.31 * -0.60%*
> Hmean 256 3199.59 ( 0.00%) 3241.98 * 1.32%*
> Hmean 1024 9390.47 ( 0.00%) 9268.47 * -1.30%*
> Hmean 2048 13373.95 ( 0.00%) 13395.61 * 0.16%*
> Hmean 3312 16701.30 ( 0.00%) 17165.96 * 2.78%*
> Hmean 4096 15831.03 ( 0.00%) 15544.66 * -1.81%*
> Hmean 8192 19720.01 ( 0.00%) 20188.60 * 2.38%*
> Hmean 16384 23925.90 ( 0.00%) 23914.50 * -0.05%*
> Stddev 64 7.38 ( 0.00%) 4.23 ( 42.67%)
> Stddev 128 11.62 ( 0.00%) 10.13 ( 12.85%)
> Stddev 256 34.33 ( 0.00%) 7.94 ( 76.88%)
> Stddev 1024 35.61 ( 0.00%) 116.34 (-226.66%)
> Stddev 2048 285.30 ( 0.00%) 80.50 ( 71.78%)
> Stddev 3312 304.74 ( 0.00%) 449.08 ( -47.36%)
> Stddev 4096 668.11 ( 0.00%) 569.30 ( 14.79%)
> Stddev 8192 733.23 ( 0.00%) 944.38 ( -28.80%)
> Stddev 16384 553.03 ( 0.00%) 299.44 ( 45.86%)
>
> 5.3-rc2 5.3-rc2
> tip +rwk+fix
> Duration User 138.05 140.95
> Duration System 1210.60 1208.45
> Duration Elapsed 1352.86 1352.90
>

This roughly matches what I've seen. The interesting part to me for
netperf is the next section of the report that reports the locality of
numa hints. With netperf on a 2-socket machine, it's generally around
50% as the client/server are pulled apart. Because netperf is not
heavily memory bound, it doesn't have much impact on the overall
performance but it's good at catching the cross-node migrations.

> >
> > > I agree that additional patches are probably needed to improve load
> > > balance at NUMA level and I expect that this rework will make it
> > > simpler to add.
> > > I just wanted to get the output of some real use cases before defining
> > > more numa level specific conditions. Some want to spread on there numa
> > > nodes but other want to keep everything together. The preferred node
> > > and fbq_classify_group was the only sensible metrics to me when he
> > > wrote this patchset but changes can be added if they make sense.
> > >
> >
> > That's fair. While it was possible to address the case before your
> > series, it was a hatchet job. If the changelog simply notes that some
> > special casing may still be required for SD_NUMA but it's outside the
> > scope of the series, then I'd be happy. At least there is a good chance
> > then if there is follow-up work that it won't be interpreted as an
> > attempt to reintroduce hacky heuristics.
> >
>
> Would the additional comment make sense for you about work to be done
> for SD_NUMA ?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0ad4b21..7e4cb65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all };
> * group. see update_sd_pick_busiest().
> */
> enum group_type {
> + /*
> + * The group has spare capacity that can be used to process more work.
> + */
> group_has_spare = 0,
> + /*
> + * The group is fully used and the tasks don't compete for more CPU
> + * cycles. Nevetheless, some tasks might wait before running.
> + */
> group_fully_busy,
> + /*
> + * One task doesn't fit with CPU's capacity and must be migrated on a
> + * more powerful CPU.
> + */
> group_misfit_task,
> + /*
> + * One local CPU with higher capacity is available and task should be
> + * migrated on it instead on current CPU.
> + */
> group_asym_packing,
> + /*
> + * The tasks affinity prevents the scheduler to balance the load across
> + * the system.
> + */
> group_imbalanced,
> + /*
> + * The CPU is overloaded and can't provide expected CPU cycles to all
> + * tasks.
> + */
> group_overloaded
> };

Looks good.

>
> @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>
> /*
> * Try to use spare capacity of local group without overloading it or
> - * emptying busiest
> + * emptying busiest.
> + * XXX Spreading tasks across numa nodes is not always the best policy
> + * and special cares should be taken for SD_NUMA domain level before
> + * spreading the tasks. For now, load_balance() fully relies on
> + * NUMA_BALANCING and fbq_classify_group/rq to overide the decision.
> */
> if (local->group_type == group_has_spare) {
> if (busiest->group_type > group_fully_busy) {

Perfect. Any patch in that are can then update the comment and it
should be semi-obvious to the next reviewer that it's expected.

Thanks Vincent.

--
Mel Gorman
SUSE Labs