Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4

From: Srikar Dronamraju
Date: Mon Jan 20 2020 - 03:09:49 EST


* Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> [2020-01-17 21:58:53]:

> On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> [2020-01-14 10:13:20]:
> >
> > We certainly are seeing better results than v1.
> > However numa02, numa03, numa05, numa09 and numa10 still seem to be
> > regressing, while the others are improving.
> >
> > While numa04 improves by 14%, numa02 regresses by around 12%.
> >

> Ok, so it's both a win and a loss. It is curious that this patch
> may be the primary factor, given that the logic only triggers when the
> local group has spare capacity and the busiest group is nearly idle. The
> test cases you describe should have fairly busy local groups.
>

Right, your code only seems to take effect when the local group has spare
capacity and busiest->sum_nr_running <= 2.
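
For reference, this is how I am reading that condition; a minimal
user-space sketch, with made-up struct and field names standing in for
the sched_group stats, not the actual fair.c code:

/*
 * Toy model of the check being discussed: on an SD_NUMA domain, ignore
 * a small imbalance when the local group still has spare capacity and
 * the busiest group is nearly idle. Names are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

struct sg_stats {
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int group_weight;	/* CPUs in the group */
};

#define IMBALANCE_ADJ	2	/* "busiest is nearly idle" threshold */

static bool ignore_numa_imbalance(bool sd_numa,
				  const struct sg_stats *local,
				  const struct sg_stats *busiest)
{
	if (!sd_numa)
		return false;
	/* Local group must still have spare capacity... */
	if (local->sum_nr_running >= local->group_weight)
		return false;
	/* ...and the busiest group must be nearly idle. */
	return busiest->sum_nr_running <= IMBALANCE_ADJ;
}

int main(void)
{
	struct sg_stats local = { .sum_nr_running = 10, .group_weight = 16 };
	struct sg_stats busiest = { .sum_nr_running = 2, .group_weight = 16 };

	printf("imbalance ignored: %s\n",
	       ignore_numa_imbalance(true, &local, &busiest) ? "yes" : "no");
	return 0;
}

If that reading is off, please correct me and I will adjust the analysis
below accordingly.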

> >
> > numa01 is a set of 2 processes, each running 128 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
>
> Are the shared operations shared between the 2 processes? 256 threads
> in total would more than exceed the capacity of a local group; even 128
> threads per process would exceed the capacity of the local group. In such
> a situation, much would depend on the locality of the accesses as well
> as any shared accesses.

Except for numa02 and numa07 (both do local memory operations), all
shared operations are within the process, i.e. per-process sharing.
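
To make the two access patterns concrete, they look roughly like the
below (an abstract model, not the actual numa*.sh scripts; thread counts
and sizes are scaled down, build with -pthread):

/*
 * numa02-style threads each loop over a private buffer (no sharing),
 * while numa01/numa03-style threads all read one process-shared buffer.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS	4
#define PRIVATE_SZ	(1UL << 20)	/* stands in for 32MB thread-local */
#define SHARED_SZ	(4UL << 20)	/* stands in for 3GB process-shared */

static char *shared_buf;

/* numa02 pattern: memory is private to the thread, no sharing at all. */
static void *thread_local_work(void *arg)
{
	char *buf = malloc(PRIVATE_SZ);

	(void)arg;
	for (int loop = 0; loop < 8; loop++)
		memset(buf, loop, PRIVATE_SZ);
	free(buf);
	return NULL;
}

/* numa01/numa03 pattern: every thread walks the same shared buffer. */
static void *process_shared_work(void *arg)
{
	volatile unsigned long sum = 0;

	(void)arg;
	for (int loop = 0; loop < 8; loop++)
		for (unsigned long i = 0; i < SHARED_SZ; i++)
			sum += shared_buf[i];
	return NULL;
}

int main(void)
{
	pthread_t tids[NR_THREADS];

	shared_buf = calloc(1, SHARED_SZ);

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tids[i], NULL, thread_local_work, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tids[i], NULL);

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tids[i], NULL, process_shared_work, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tids[i], NULL);

	free(shared_buf);
	return 0;
}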

>
> > numa02 is a single process with 256 threads;
> > each thread doing 800 loops on 32MB thread local memory operations.
> >
>
> This one is more interesting. False sharing shouldn't be an issue so the
> threads should be independent.
>
> > numa03 is a single process with 256 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Similar.

This is similar to numa01, except that now all threads belong to just one
process.

>
> > numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Less clear as you don't say what is sharing the memory operations.

All sharing is within the process. In numa04/numa09, I spawn as many
processes as there are nodes; other than that, it is the same as numa02.

>
> > numa05 is a set of 16 processes (twice as many as there are nodes), each running 16 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> > Details below:
>
> How many iterations for each test?

I ran 5 iterations. Do you want me to run with more iterations?

>
>
> > ./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
> > ./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
> > ./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
>
> Before range: 58 to 72
> After range: 78 to 82
>
> This one is more interesting in general. Can you add trace_printks to
> the SD_NUMA check the patch introduces and dump the sum_nr_running
> for both local and busiest when the imbalance is ignored, please? That
> might give some hint as to the improper conditions under which the
> imbalance is ignored.

Can be done. I will get back with the results. But do let me know if you
want me to run with more iterations or rerun the tests.
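
Something like the below is what I have in mind (an untested sketch of
the shape of it; the exact hunk in your v4 may place the check and the
spare-capacity test differently inside calculate_imbalance()):

	if (env->sd->flags & SD_NUMA) {
		/*
		 * Sketch only: dump both groups' sum_nr_running whenever
		 * the small NUMA imbalance is about to be ignored.
		 */
		if (busiest->sum_nr_running <= 2) {
			trace_printk("numa imb ignored: local=%u busiest=%u\n",
				     local->sum_nr_running,
				     busiest->sum_nr_running);
			env->imbalance = 0;
		}
	}

Counting how many times the condition fires over the numa02 run should
fall out of the same trace.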

>
> However, knowing the number of iterations would be helpful. Can you also
> tell me if this is consistent between boots or if it is always roughly a
> 12% regression regardless of the number of iterations?
>

I have only measured 5 iterations and have not repeated the runs to see if
the numbers are consistent.

> > ./numa03.sh Real: 477.20 528.12 502.66 25.46 -4.85219%
> > ./numa03.sh Sys: 88.93 115.36 102.15 13.21 -25.629%
> > ./numa03.sh User: 119120.73 129829.89 124475.31 5354.58 -3.8219%
>
> Range before: 471 to 485
> Range after: 477 to 528
>
> > ./numa04.sh Real: 374.70 414.76 394.73 20.03 14.6708%
> > ./numa04.sh Sys: 357.14 379.20 368.17 11.03 3.27294%
> > ./numa04.sh User: 87830.73 88547.21 88188.97 358.24 5.7113%
>
> Range before: 450 to 454
> Range after: 374 to 414
>
> Big gain there but the fact the range changed so much is a concern and
> makes me wonder if this case is stable from boot to boot.
>
> > ./numa05.sh Real: 369.50 401.56 385.53 16.03 -5.64937%
> > ./numa05.sh Sys: 718.99 741.02 730.00 11.01 -3.76438%
> > ./numa05.sh User: 84989.07 85271.75 85130.41 141.34 -1.48142%
> >
>
> Big range changes again but the shared memory operations complicate
> matters. I think it's best to focus on numa02 for now and identify whether
> there is an improper condition where the patch has an impact: the local
> group has high utilisation but spare capacity while the busiest group is
> almost completely idle.
>
> > vmstat for numa01
>
> I'm not going to comment in detail on these other than noting that NUMA
> balancing is heavily active in all cases, which may be masking any effect
> of the patch and may give unstable results in general.
>
> > <SNIP vmstat>
> > <SNIP description of loads that showed gains>
> >
> > numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> No description of shared operations but NUMA balancing is very active so
> sharing is probably between processes.
>
> > numa10 is a set of 16 processes (twice as many as there are nodes), each running 2 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Again, shared accesses without description and heavy NUMA balancing
> activity.
>
> So bottom line, a lot of these cases have shared operations where NUMA
> balancing decisions should dominate and make it hard to detect any impact
> from the patch. The exception is numa02, so please add tracing and dump
> out local and busiest sum_nr_running when the imbalance is ignored. I
> want to see if it's as simple as the local group being very busy but
> having spare capacity while the busiest group is almost idle. I also want
> to see how many times over the course of the numa02 workload the
> conditions for the patch are even met.
>

--
Thanks and Regards
Srikar Dronamraju