Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Phil Auld
Date: Thu May 07 2020 - 13:50:09 EST


On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> Hi Mel,
>
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
>
> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
>
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
>

I'm happy to spin up a patch once I'm sure what exactly the tuning would
effect. At an initial glance I'm thinking it would be the imbalance_min
which is currently hardcoded to 2. But there may be something else...


Cheers,
Phil


> Thanks a lot!
> Jirka
>
> On Thu, May 7, 2020 at 5:54 PM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > > > Yes, it's indeed OMP. With low threads count, I mean up to 2x number of
> > > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> > > > > servers).
> > > >
> > > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > > be left idle.
> > >
> > > we have discussed today with my colleagues the performance drop for
> > > some workloads for low threads counts (roughly up to 2x number of NUMA
> > > nodes). We are worried that it can be a severe issue for some use
> > > cases, which require a full memory bandwidth even when only part of
> > > CPUs is used.
> > >
> > > We understand that scheduler cannot distinguish this type of workload
> > > from others automatically. However, there was an idea for a * new
> > > kernel tunable to control the imbalance threshold *. Based on the
> > > purpose of the server, users could set this tunable. See the tuned
> > > project, which allows creating performance profiles [1].
> > >
> >
> > I'm not completely opposed to it but given that the setting is global,
> > I imagine it could have other consequences if two applications ran
> > at different times have different requirements. Given that it's OMP,
> > I would have imagined that an application that really cared about this
> > would specify what was needed using OMP_PLACES. Why would someone prefer
> > kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> > specific knowledge of the application even to know that a particular
> > tuned profile is needed.
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
>
>
> --
> -Jirka
>

--