SD_LOAD_BALANCE

From: Julia Lawall
Date: Thu Sep 03 2020 - 10:58:00 EST


Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
released in v5.8), with the comment:

The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init().

I have the impression that this was not quite true. The NUMA domain was
not initialized with sd_init(), and did not have the SD_LOAD_BALANCE flag
set. The effect is that in v5.8, the for_each_domain loop in
select_task_rq_fair can always walk up to the global NUMA domain, and thus
consider any pair of waking cpu (cpu) and previous cpu of the wakee
(prev_cpu) as arguments to wake_affine. Up to v5.7, this was only
possible if cpu and prev_cpu were together in some lower-level domain,
i.e., sharing the LLC. As a result, in v5.8 wake_affine can often end up
choosing as a target a core that does not share the LLC with the core
where the thread ran previously, and threads then move around a lot
between the different sockets.
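
To make the change concrete, here is a small stand-alone model of the
domain walk in select_task_rq_fair. This is not the kernel code: the
flag values, the two-domain layout and the cpu numbers are made up for
illustration, and the real loop does quite a bit more. The point is
only that with the SD_LOAD_BALANCE check the walk stops before a NUMA
level that lacks the flag, while without the check it reaches the NUMA
level, so wake_affine can be considered for a cpu and prev_cpu that are
on different sockets:

#include <stdio.h>
#include <stdbool.h>

#define SD_LOAD_BALANCE  0x0001
#define SD_WAKE_AFFINE   0x0002

struct domain {
        const char *name;
        unsigned int flags;
        int span_lo, span_hi;           /* cpus covered, as a range */
        struct domain *parent;
};

static bool spans(const struct domain *d, int cpu)
{
        return cpu >= d->span_lo && cpu <= d->span_hi;
}

/* Walk upwards from the waking cpu's lowest domain, as the kernel does. */
static const char *affine_level(const struct domain *d, int prev_cpu,
                                bool check_load_balance)
{
        for (; d; d = d->parent) {
                /* This is the check that commit e669ac8ab952 removed. */
                if (check_load_balance && !(d->flags & SD_LOAD_BALANCE))
                        break;

                /* prev_cpu is in this domain: wake_affine candidate here */
                if ((d->flags & SD_WAKE_AFFINE) && spans(d, prev_cpu))
                        return d->name;
        }
        return "none (stay near prev_cpu's LLC)";
}

int main(void)
{
        /* two sockets of 4 cpus; NUMA level without SD_LOAD_BALANCE */
        struct domain numa = { "NUMA", SD_WAKE_AFFINE, 0, 7, NULL };
        struct domain llc0 = { "LLC0", SD_LOAD_BALANCE | SD_WAKE_AFFINE,
                               0, 3, &numa };

        /* the waking cpu is in socket 0 (domains LLC0 -> NUMA) */
        int prev_cpu = 5;       /* wakee previously ran on the other socket */

        printf("v5.7-style walk: wake_affine at %s\n",
               affine_level(&llc0, prev_cpu, true));
        printf("v5.8-style walk: wake_affine at %s\n",
               affine_level(&llc0, prev_cpu, false));
        return 0;
}

The v5.7-style walk stops at the LLC level, while the v5.8-style walk
reports the NUMA level, which is the cross-socket wake_affine behaviour
described above.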

Was this intentional?

The effect can be seen in the traces of the parsec vips benchmark at the
following URL:

https://pages.lip6.fr/Julia.Lawall/vips.pdf

The first two graphs (complete run and small fragment) are for Linux v5.7
and the next two are for Linux v5.8. The machine has 160 hardware threads
organized in 4 sockets, and the colors are by socket. In the small
fragment for v5.7 (second graph), one can see that a given pid pretty much
stays on the same socket, while in the corresponding fragment for v5.8
(fourth graph), the pids move around between the sockets. The x's mark
the unblocks that result in a migration: a pink x means that the
migration stays within the same socket, while a blue x means that the
migration goes to another socket. It's not apparent from the graphs, but
by adding some tracing, it seems that the new socket is always that of
the core that handles the wakeup.

I haven't yet studied the early part of the execution of vips in detail,
but I suspect that the same issue explains why all of the threads are
initially on the same socket in v5.7, while in v5.8 they are dispersed
to other sockets more quickly.

My impression from the parsec and NAS benchmarks is that v5.8
performance is a bit better than v5.7, probably because threads are
spread across more sockets earlier, but other benchmarks that rely more
on locality might react less well to threads moving around so much in
this way.

julia