At load balance time, balance of last level cache domains and
above needs to be serialized. The scheduler checks the atomic var
sched_balance_running first and then see if time is due for a load
balance. This is an expensive operation as multiple CPUs can attempt
sched_balance_running acquisition at the same time.
On a 2 socket Granite Rapid systems enabling sub-numa cluster and
running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
sched_balance_running. Most of the time, a balance attempt is aborted
immediately after acquiring sched_balance_running as load balance time
is not due.
Instead, check balance due time first before acquiring
sched_balance_running. This skips many useless acquisitions
of sched_balance_running and knocks the 7.6% CPU overhead on
sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
improved by 11%.
Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Reported-by: Mohini Narkhede <mohini.narkhede@xxxxxxxxx>
Tested-by: Mohini Narkhede <mohini.narkhede@xxxxxxxxx>
---