Nice work!
I've tested your patch based on commit fb4d33ab452e and found it
incredibly helpful for Verilator with large RTL simulations like
XiangShan [1] on AMD EPYC Geona.
I've created a simple benchmark [2] using a static build of an
8-thread Verilator of XiangShan. Simply clone the repository and
run `make run`.
In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
9T24, before the patch, we have a simulation time of 49.348ms. This
was because each thread was distributed across every CCX, resulting
in extremely high core-to-core latency. However, after applying the
patch, the entire 8-thread Verilator is allocated to a single CCX.
Consequently, the simulation time was reduced to 24.196ms, which
is a remarkable 2.03x faster than before. We don't need numactl
anymore!
[1] https://github.com/OpenXiangShan/XiangShan
[2] https://github.com/cyyself/chacha20-xiangshan
Tested-by: Yangyu Chen <cyy@xxxxxxxxxxxx>