Re: [RFC patch v3 00/20] Cache aware scheduling
From: Yangyu Chen
Date: Thu Jun 19 2025 - 10:18:51 EST
> On 19 Jun 2025, at 21:21, Chen, Yu C <yu.c.chen@xxxxxxxxx> wrote:
>
> On 6/19/2025 2:39 PM, Yangyu Chen wrote:
>> Nice work!
>> I've tested your patch based on commit fb4d33ab452e and found it
>> incredibly helpful for Verilator with large RTL simulations like
>> XiangShan [1] on AMD EPYC Genoa.
>> I've created a simple benchmark [2] using a static build of an
>> 8-thread Verilator simulation of XiangShan. Simply clone the
>> repository and run `make run`.
>> In a statically allocated 8-CCX KVM guest (128 vCPUs in total) on an
>> EPYC 9T24, the simulation time before the patch was 49.348ms, because
>> the threads were spread across all the CCXes, resulting in extremely
>> high core-to-core latency. After applying the patch, the entire
>> 8-thread Verilator run is placed on a single CCX, and the simulation
>> time drops to 24.196ms, a remarkable 2.03x speedup. We don't need
>> numactl anymore!
>> [1] https://github.com/OpenXiangShan/XiangShan
>> [2] https://github.com/cyyself/chacha20-xiangshan
>> Tested-by: Yangyu Chen <cyy@xxxxxxxxxxxx>
>
> Thanks, Yangyu, for your test. May I know whether these 8 threads
> share any data with each other, or does each thread have its own
> dedicated data? Or is there one main thread, with the other 7 threads
> doing the chacha20 rotations and passing their results back to the
> main thread?
Ah, I forgot to mention what the benchmark does. The workload is not
about chacha20 itself. The benchmark uses an RTL-level simulator [1]
that runs an open-source OoO CPU core called XiangShan [2]; the
chacha20 algorithm is executed on the guest CPU inside that simulator.
Verilator partitions the large RTL design into multiple blocks of
functions and distributes them across the threads. The signals that
cross these partition boundaries must be synchronized every guest
cycle, and further synchronization is needed wherever a dependency
exists. Given that we run approximately 5K guest cycles per second,
a significant amount of data has to be transferred between the
threads, and when there are signal dependencies the performance
becomes bound by core-to-core latency (see the sketch below).
[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/XiangShan
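
To make the pattern concrete, here is a minimal standalone sketch of
the per-cycle synchronization described above. This is not Verilator's
generated code; the thread count, cycle count, and the
boundary_signals layout are invented for illustration. Each worker
"evaluates" its partition, publishes its boundary signals, and then
meets the others at a barrier once per guest cycle:

/* build: cc -O2 -pthread sim_barrier.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NCYCLES  100000

static pthread_barrier_t cycle_barrier;
/* Double-buffered boundary signals exchanged between partitions. */
static unsigned long boundary_signals[2][NTHREADS];

static void *worker(void *arg)
{
        long id = (long)arg;

        for (long cycle = 0; cycle < NCYCLES; cycle++) {
                int cur = cycle & 1, prev = cur ^ 1;

                /* "Evaluate" this partition: read a neighbour's output
                 * from the previous cycle and produce our own. These
                 * shared cache lines are what bounce between cores on
                 * every guest cycle. */
                boundary_signals[cur][id] =
                        boundary_signals[prev][(id + 1) % NTHREADS] + 1;

                /* Every partition must finish this guest cycle before
                 * any partition may start the next one. */
                pthread_barrier_wait(&cycle_barrier);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        pthread_barrier_init(&cycle_barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&cycle_barrier);
        printf("signal[0] = %lu\n", boundary_signals[(NCYCLES - 1) & 1][0]);
        return 0;
}

When all 8 threads sit on one CCX, the barrier and the boundary-signal
cache lines stay inside a single shared L3; when they are spread over
CCXes, every cycle pays the cross-CCX latency. Before the patch we
forced the former with numactl; an equivalent in-process workaround is
an affinity call like the sketch below (assuming, purely for
illustration, that CPUs 0-7 form one CCX in the guest):

#define _GNU_SOURCE
#include <sched.h>

static int pin_to_first_ccx(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 8; cpu++)  /* assume CPUs 0-7 share one L3 */
                CPU_SET(cpu, &set);
        /* pid 0 == calling thread; threads created later inherit the mask */
        return sched_setaffinity(0, sizeof(set), &set);
}

With the cache-aware scheduler doing this placement automatically,
neither workaround is needed anymore.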
Thanks,
Yangyu Chen
> Anyway, I tested it on a Xeon EMR with turbo disabled and saw a ~20%
> reduction in the total time.
Nice result!
>
> Thanks,
> Chenyu