WW_MUTEX_SELFTEST hangs w/ 6.9-rc workqueue changes

From: John Stultz
Date: Fri May 03 2024 - 21:02:12 EST


Hey All,
While doing some local testing, I've started to see boot stalls
with CONFIG_WW_MUTEX_SELFTEST enabled on 6.9-rc in a 64-CPU qemu
environment.

I've bisected the problem down to:
5797b1c18919 (workqueue: Implement system-wide nr_active enforcement
for unbound workqueues)
+ the fix needed for that change:
15930da42f89 (workqueue: Don't call cpumask_test_cpu() with -1 CPU
in wq_update_node_max_active())

I've seen problems in the past with the ww_mutex selftest code, so
it's likely a problem in the test itself, but I wanted to raise the
issue so folks were aware and see if there were suggestions for a
solution.

It seems to get stuck in __test_cycle() after a few runs when it hits
flush_workqueue()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n344

That seems to be because, once the various work functions are queued,
not all of them get a chance to run (they use a circular chain of
completions, so the 0th workfunc won't finish until after the
nrthreads-th workfunc runs).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n295
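
To make that dependency concrete, here is a rough userspace model of the
chain. It is not the selftest code: toy_waits_on() and toy_deps_of_first()
are made-up names, and the direction of the chain (item i waiting on item
i + 1, with the last wrapping to 0) is an assumption for illustration.

```c
/*
 * Toy model of a circular chain of completions. Not kernel code.
 * Assumption: item i waits on item (i + 1) % n. Requires n >= 1.
 */
static int toy_waits_on(int i, int n)
{
	return (i + 1) % n;
}

/*
 * Count how many other items item 0 depends on, directly or
 * transitively, by following the waits-on links until they wrap
 * back around to item 0.
 */
static int toy_deps_of_first(int n)
{
	int count = 0;

	for (int cur = toy_waits_on(0, n); cur != 0; cur = toy_waits_on(cur, n))
		count++;
	return count;
}
```

In this model toy_deps_of_first(n) is always n - 1, i.e. the 0th item
transitively depends on every other item in the chain having run.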

I'm noticing this happens when the test gets to nrthreads=9 (the test
usually goes up to NR_CPUS), so we queue work items 0 through 8, but
the 9th work function never seems to run. Looking at __queue_work(), I
do see pwq_tryinc_nr_active() fail for that 9th work struct, and we end
up inserting the work as inactive.
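
As a sketch of what I think I'm seeing, here is a toy userspace model of
an active limit with an inactive list. It only mimics the behavior
described above and is not the workqueue implementation: struct toy_wq
and the toy_*() helpers are hypothetical, and the limit of 8 used below
is an assumption picked to match the observation, not a confirmed value.

```c
#include <stdbool.h>

/* Toy model only; field names are made up for illustration. */
struct toy_wq {
	int max_active;		/* assumed effective limit on active items */
	int nr_active;		/* items currently allowed to run */
	int nr_inactive;	/* items parked until a slot frees up */
};

/* Take an active slot if one is free. */
static bool toy_tryinc_nr_active(struct toy_wq *wq)
{
	if (wq->nr_active >= wq->max_active)
		return false;
	wq->nr_active++;
	return true;
}

/* Queue one item; returns true if it may run, false if it was parked. */
static bool toy_queue_work(struct toy_wq *wq)
{
	if (toy_tryinc_nr_active(wq))
		return true;
	wq->nr_inactive++;
	return false;
}

/* A running item finished: free its slot and promote one parked item. */
static void toy_work_done(struct toy_wq *wq)
{
	wq->nr_active--;
	if (wq->nr_inactive > 0) {
		wq->nr_inactive--;
		wq->nr_active++;
	}
}
```

With max_active = 8, nine calls to toy_queue_work() leave eight items
active and one parked. The parked item only gets promoted from
toy_work_done(), so if none of the active items can finish without it,
nothing ever makes progress.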

I notice that the change that uncovers this issue (5797b1c18919) both
tweaks pwq_tryinc_nr_active() and sets WQ_DFL_MIN_ACTIVE to 8, so
maybe that's a hint that the test is abusing the number of queued
work functions? Though that seems odd, because that's the min, not the
max (which seems to be 512).

Anyway, let me know if there's anything further I can share to help
debug this. I'll continue digging here as well.

thanks
-john