Re: [RFC patch v3 00/20] Cache aware scheduling

From: Chen, Yu C
Date: Thu Jun 19 2025 - 09:23:23 EST


On 6/19/2025 2:39 PM, Yangyu Chen wrote:
Nice work!

I've tested your patch on top of commit fb4d33ab452e and found it
incredibly helpful for Verilator with large RTL simulations such as
XiangShan [1] on AMD EPYC Genoa.

I've created a simple benchmark [2] using a static build of an
8-thread Verilator of XiangShan. Simply clone the repository and
run `make run`.

In a statically allocated 8-CCX KVM guest (128 vCPUs in total) on an
EPYC 9T24, the simulation time before the patch was 49.348ms. This
was because the threads were distributed across every CCX, resulting
in extremely high core-to-core latency. After applying the patch,
all 8 Verilator threads are placed on a single CCX, and the
simulation time dropped to 24.196ms, a remarkable 2.03x speedup. We
don't need numactl anymore!
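For anyone reproducing the pre-patch numbers, the old workaround was to
pin the process to one CCX by hand with numactl. A minimal sketch,
assuming the 128 vCPUs are split evenly across the 8 CCXs as 16
contiguous vCPUs each (the exact topology and binary path are
assumptions; check lscpu on your system):

```shell
# Compute the vCPU range for a given CCX, assuming 128 vCPUs split
# evenly across 8 CCXs (16 contiguous vCPUs per CCX, as in this guest).
ccx=0
cpus_per_ccx=$((128 / 8))
start=$((ccx * cpus_per_ccx))
end=$((start + cpus_per_ccx - 1))
echo "${start}-${end}"

# The manual workaround was then roughly (binary path is hypothetical):
#   numactl --physcpubind="${start}-${end}" ./path/to/verilator-binary
```

With the patch applied, the scheduler makes this manual pinning
unnecessary.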

[1] https://github.com/OpenXiangShan/XiangShan
[2] https://github.com/cyyself/chacha20-xiangshan

Tested-by: Yangyu Chen <cyy@xxxxxxxxxxxx>


Thanks Yangyu for your test. May I know whether these 8 threads
share any data with each other, or whether each thread has its own
dedicated data? Or is there 1 main thread while the other 7 threads
do the chacha20 rotation and pass their results back to the main
thread?
Anyway, I tested it on a Xeon EMR with turbo disabled and saw ~20%
reduction in the total time.

Thanks,
Chenyu