Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing

From: Huang, Ying
Date: Mon Feb 27 2023 - 02:55:30 EST


Bharata B Rao <bharata@xxxxxxx> writes:

> On 17-Feb-23 11:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@xxxxxxx> writes:
>>
>>> On 14-Feb-23 10:25 AM, Bharata B Rao wrote:
>>>> On 13-Feb-23 12:00 PM, Huang, Ying wrote:
>>>>>> I have a microbenchmark where two sets of threads bound to two
>>>>>> NUMA nodes access the two different halves of a memory region that
>>>>>> is initially allocated on the 1st node.
>>>>>>
>>>>>> On a two-node Zen4 system, with 64 threads in each set accessing
>>>>>> 8G of memory each from the initial allocation of 16G, I see that
>>>>>> IBS driven NUMA balancing (i.e., this patchset) takes 50% less time
>>>>>> to complete a fixed number of memory accesses. This could well
>>>>>> be the best case and real workloads/benchmarks may not get this much
>>>>>> uplift, but it does show the potential gain to be had.
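
For concreteness, a rough sketch of this kind of microbenchmark (my own
illustration with libnuma + pthreads, not necessarily what Bharata ran;
sizes and thread counts below are scaled down):

/*
 * Sketch only: memory is first allocated and touched on node 0, then
 * two sets of threads bound to node 0 and node 1 each access their own
 * half of it, so node 1's half starts out remote.
 * Build: gcc -O2 numa_bench.c -o numa_bench -lnuma -lpthread
 */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define TOTAL_SZ   (1UL << 30)   /* 1G here; 16G in the reported runs */
#define NTHREADS   8             /* 64 per node in the reported runs */
#define ITERS      100

static char *buf;

struct targ {
	int node;                /* node this thread is bound to */
	size_t off, len;         /* the half of the buffer it touches */
};

static void *worker(void *p)
{
	struct targ *t = p;
	volatile char sink;
	size_t i;
	int it;

	/* Bind to the node; node 1's half is initially remote (node 0). */
	numa_run_on_node(t->node);

	for (it = 0; it < ITERS; it++)
		for (i = 0; i < t->len; i += 4096)
			sink = buf[t->off + i];
	(void)sink;
	return NULL;
}

int main(void)
{
	pthread_t tid[2 * NTHREADS];
	struct targ targ[2 * NTHREADS];
	int i;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* Initial allocation entirely on the 1st node, as described. */
	buf = numa_alloc_onnode(TOTAL_SZ, 0);
	if (!buf)
		return 1;
	memset(buf, 1, TOTAL_SZ);

	/* First set of threads -> node 0 / first half,
	 * second set -> node 1 / second half. */
	for (i = 0; i < 2 * NTHREADS; i++) {
		targ[i].node = i < NTHREADS ? 0 : 1;
		targ[i].off  = targ[i].node ? TOTAL_SZ / 2 : 0;
		targ[i].len  = TOTAL_SZ / 2;
		pthread_create(&tid[i], NULL, worker, &targ[i]);
	}
	for (i = 0; i < 2 * NTHREADS; i++)
		pthread_join(tid[i], NULL);

	numa_free(buf, TOTAL_SZ);
	return 0;
}
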
>>>>>
>>>>> Can you find a way to show the overhead of the original implementation
>>>>> and of your method, so that we can compare them? Because you think
>>>>> the improvement comes from the reduced overhead.
>>>>
>>>> Sure, will measure the overhead.
>>>
>>> I used ftrace function_graph tracer to measure the amount of time (in us)
>>> spent in fault handling and task_work handling in both the methods when
>>> the above mentioned benchmark was running.
>>>
>>>                         Default          IBS
>>> Fault handling          29879668.71      1226770.84
>>> Task work handling      24878.894        10635593.82
>>> Sched switch handling                    78159.846
>>>
>>> Total                   29904547.6       11940524.51
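
For completeness, numbers like the above can be collected with the
function_graph tracer via a few writes into tracefs; the following is
only a sketch of those writes done from C (it assumes tracefs is
mounted at /sys/kernel/tracing, and for the IBS kernel the filter list
would name its handlers instead):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror(path);
	close(fd);
	return 0;
}

int main(void)
{
	/* Trace only the handlers of interest; for the IBS case the list
	 * would be ibs_overflow_handler, task_ibs_access_work and
	 * hw_access_sched_in instead. */
	write_str("/sys/kernel/tracing/set_ftrace_filter",
		  "do_numa_page task_numa_work");
	write_str("/sys/kernel/tracing/current_tracer", "function_graph");
	write_str("/sys/kernel/tracing/tracing_on", "1");

	/* ... run the benchmark, then post-process the per-call durations
	 * reported in /sys/kernel/tracing/trace ... */
	return 0;
}
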
>>
>> Thanks! You have shown the large overhead difference between the
>> original method and your method. Can you show the number of pages
>> migrated too? I think the overhead / page can be a good overhead
>> indicator too.
>>
>> Can it be translated into a performance improvement? Per my
>> understanding, the total overhead is small compared with the total
>> run time.
>
> I captured some of the numbers that you wanted for two different runs.
> The first case shows the data for a short run (fewer memory access
> iterations) and the second one is for a long run (more iterations).
>
> Short-run
> =========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>
>                           Default          IBS
> Fault handling            29017953.99      1196828.67
> Task work handling        10354.40         10356778.53
> Sched switch handling                      56572.21
> Total overhead            29028308.39      11610179.41
>
> Benchmark score (us)      194050290        53963650
> numa_pages_migrated       2097256          662755
> Overhead / page           13.84            17.51

From above, the overhead/page is similar.
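
(Spelling that out: overhead / page is just the total overhead divided
by numa_pages_migrated, i.e. 29028308.39 / 2097256 ~ 13.8 us/page for
the default case and 11610179.41 / 662755 ~ 17.5 us/page for IBS in
the short run.)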

> Pages migrated per sec 72248.64 57083.95
>
> Default
> -------
>                     Total          Min     Max       Avg
> do_numa_page        29017953.99    0.1     307.63    15.97
> task_numa_work      10354.40       2.86    4573.60   175.50
> Total               29028308.39
>
> IBS
> ---
>                       Total          Min     Max        Avg
> ibs_overflow_handler  1196828.67     0.15    100.28     1.26
> task_ibs_access_work  10356778.53    0.21    10504.14   28.42
> hw_access_sched_in    56572.21       0.15    16.94      1.45
> Total                 11610179.41
>
>
> Long-run
> ========
> Time taken or overhead (us) for fault, task_work and sched_switch
> handling
>                           Default          IBS
> Fault handling            27437756.73      901406.37
> Task work handling        1741.66          4902935.32
> Sched switch handling                      100590.33
> Total overhead            27439498.38      5904932.02
>
> Benchmark score (us)      306786210.0      153422489.0
> numa_pages_migrated       2097218          1746099
> Overhead / page           13.08            3.38

But from this, the overhead/page is quite different.

One possibility is that there are more "local" hint page faults in the
original implementation; we can check "numa_hint_faults" and
"numa_hint_faults_local" in /proc/vmstat for that.

If

numa_hint_faults_local / numa_hint_faults

is similar, then for each page migrated the number of hint page faults
should be similar, and the run time for each hint page fault handler
should be similar too, so the overhead / page should not differ this
much. Or did I make some mistake in the analysis?
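
In case it helps, below is a trivial sketch (my own, not from the
patchset) for reading those two counters; the fields are exported only
with CONFIG_NUMA_BALANCING and only cover the hint-fault path of the
original implementation. Snapshotting them before and after the
benchmark and comparing the deltas of the two runs would show whether
the local/total ratio really differs.

#include <stdio.h>
#include <string.h>

/* Return the value of one named counter from /proc/vmstat, or -1. */
static long vmstat_read(const char *name)
{
	char key[64];
	long v, val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &v) == 2) {
		if (!strcmp(key, name)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long total = vmstat_read("numa_hint_faults");
	long local = vmstat_read("numa_hint_faults_local");

	if (total > 0 && local >= 0)
		printf("numa_hint_faults_local / numa_hint_faults = %ld / %ld = %.3f\n",
		       local, total, (double)local / total);
	return 0;
}
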

> Pages migrated per sec 6836.08 11380.98
>
> Default
> -------
>                     Total          Min      Max       Avg
> do_numa_page        27437756.73    0.08     363.475   15.03
> task_numa_work      1741.66        3.294    1200.71   42.48
> Total               27439498.38
>
> IBS
> ---
>                       Total          Min     Max        Avg
> ibs_overflow_handler  901406.37      0.15    95.51      1.06
> task_ibs_access_work  4902935.32     0.22    11013.68   9.64
> hw_access_sched_in    100590.33      0.14    91.97      1.52
> Total                 5904932.02

Thank you very much for the detailed data. Can you provide some
analysis of your data?

Best Regards,
Huang, Ying