Re: Re: [RFC] autonuma: Support to scan page table asynchronously

From: SeongJae Park
Date: Fri Apr 17 2020 - 03:06:02 EST


On Wed, 15 Apr 2020 16:14:38 +0800 "Huang\, Ying" <ying.huang@xxxxxxxxx> wrote:

> Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> writes:
>
> > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> >> In current AutoNUMA implementation, the page tables of the processes
> >> are scanned periodically to trigger the NUMA hint page faults. The
> >> scanning runs in the context of the processes, so will delay the
> >> running of the processes. In a test with the 64-thread pmbench memory
> >> access benchmark on a 2-socket server machine with 104 logical CPUs
> >> and 256 GB memory, there are more than 20000 latency outliers longer
> >> than 1 ms in a 3600 s run. These latency outliers are almost all
> >> caused by the AutoNUMA page table scanning, because they almost all
> >> disappear after applying this patch to scan the page tables
> >> asynchronously.
> >>
> >> Because there are idle CPUs in system, the asynchronous running page
> >> table scanning code can run on these idle CPUs instead of the CPUs the
> >> workload is running on.
> >>
> >> So on a system with enough idle CPU time, it's better to scan the page
> >> tables asynchronously to take full advantage of this idle CPU time.
> >> Another scenario which can benefit from this is to scan the page
> >> tables on some service CPUs of the socket, so that the real workload
> >> can run on the isolated CPUs without the latency outliers caused by
> >> the page table scanning.
> >>
> >> But scanning page tables asynchronously isn't perfect either. For
> >> example, on a system without enough idle CPU time, the CPU time isn't
> >> scheduled fairly because the page table scanning is charged to the
> >> workqueue thread instead of the process/thread it works for. And
> >> although the page tables are scanned for the target process, it may
> >> run on a CPU that is not in the cpuset of the target process.
> >>
> >> One possible solution is to let the system administrator choose the
> >> better behavior for the system via a sysctl knob (implemented in the
> >> patch). But that isn't perfect either, because every user space knob
> >> adds maintenance overhead.
> >>
> >> A better solution may be to back-charge the CPU time to scan the page
> >> tables to the process/thread, and find a way to run the work on the
> >> proper cpuset. After some googling, I found there's some discussion
> >> about this as in the following thread,
> >>
> >> https://lkml.org/lkml/2019/6/13/1321
> >>
> >> So this patch may not be ready to be merged upstream yet. It
> >> quantifies the latency outliers caused by the page table scanning in
> >> AutoNUMA. And it provides a possible way to resolve the issue for
> >> users who care about it. And it is a potential customer of the work
> >> related to the cgroup-aware workqueue or other asynchronous execution
> >> mechanisms.
> >>
> >
> > The caveats you list are the important ones and the reason why it was
> > not done asynchronously. In an earlier implementation all the work was
> > done by a dedicated thread and ultimately abandoned.
> >
> > There is no guarantee there is an idle CPU available and one that is
> > local to the thread that should be doing the scanning. Even if there is,
> > it potentially prevents another task from scheduling on an idle CPU and
> > similarly other workqueue tasks may be delayed waiting on the scanner. The
> > hiding of the cost is also problematic because the CPU cost is hidden
> > and mixed with other unrelated workqueues. It also has the potential
> > to mask bugs. Let's say, for example, there is a bug whereby a task is
> > scanning excessively; that can be easily missed when the work is done by
> > a workqueue.
>
> Do you think something like a cgroup-aware workqueue is a solution worth
> trying when it's available? It will not hide the scanning cost,
> because the CPU time will be charged to the original cgroup or task.
> Although the other tasks may be disturbed, cgroup can provide some kind
> of management via cpusets.
>
> > While it's just an opinion, my preference would be to focus on reducing
> > the cost and amount of scanning done -- particularly for threads. For
> > example, all threads operate on the same address space but there can be
> > significant overlap where all threads are potentially scanning the same
> > areas or regions that the thread has no interest in. One option would be
> > to track the highest and lowest pages accessed and only scan within
> > those regions for example. The tricky part is that library pages may
> > create very wide windows that render the tracking useless but it could
> > at least be investigated.
>
> In general, I think it's good to reduce the scanning cost.

I think the main idea of DAMON[1] might be applicable here. Have you
considered it?

[1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@xxxxxxxxxx/


Thanks,
SeongJae Park

>
> Why do you think there will be overlap between the threads of a process?
> If my understanding is correct, the threads will scan one by one
> instead of simultaneously. And how do we determine whether a vma needs to
> be scanned or not? For example, there may be only a small portion of
> pages being accessed in a vma, but they may be accessed remotely and
> consume quite some inter-node bandwidth, so they need to be migrated.
>
> Best Regards,
> Huang, Ying
>