Re: [PATCH 7/8] sched: Split accounting of NUMA hinting faults that pass two-stage filter

From: Mel Gorman
Date: Fri Jun 28 2013 - 10:29:35 EST


On Fri, Jun 28, 2013 at 12:33:04PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 28, 2013 at 03:42:45PM +0530, Srikar Dronamraju wrote:
> > > >
> > > > > Ideally it would be possible to distinguish between NUMA hinting faults
> > > > > that are private to a task and those that are shared. This would require
> > > > > that the last task that accessed a page for a hinting fault would be
> > > > > recorded which would increase the size of struct page. Instead this patch
> > > > > approximates private pages by assuming that faults that pass the two-stage
> > > > > filter are private pages and all others are shared. The preferred NUMA
> > > > > node is then selected based on where the maximum number of approximately
> > > > > private faults were measured.
> > > >
> > > > Should we consider only private faults for preferred node?
> > >
> > > I don't think so; its optimal for the task to be nearest most of its pages;
> > > irrespective of whether they be private or shared.
> >
> > Then the preferred node should have been chosen based on both the
> > private and shared faults and not just private faults.
>
> Oh duh indeed. I totally missed it did that. Changelog also isn't giving
> rationale for this. Mel?
>
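
For context, the accounting being discussed amounts to roughly the
following. This is a minimal sketch, not the patch itself, and the
identifiers are illustrative rather than the ones the patch uses:

/*
 * A hinting fault that passes the two-stage filter is treated as
 * private to the task, everything else as shared. The preferred node
 * is whichever node accumulated the most private faults.
 */
#define NR_NODES	8	/* stand-in for the real node count */

enum fault_type { FAULT_PRIVATE, FAULT_SHARED };

static unsigned long faults[NR_NODES][2];	/* per-task counters */

static void account_numa_fault(int nid, int passed_two_stage_filter)
{
	faults[nid][passed_two_stage_filter ? FAULT_PRIVATE : FAULT_SHARED]++;
}

static int select_preferred_node(void)
{
	unsigned long max_faults = 0;
	int nid, preferred = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (faults[nid][FAULT_PRIVATE] > max_faults) {
			max_faults = faults[nid][FAULT_PRIVATE];
			preferred = nid;
		}
	}
	return preferred;
}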

There were a few reasons.

First, if there are many tasks sharing the page then they'll all move towards
the same node. The node will become compute overloaded and tasks will be
scheduled away later, only to bounce back again. Alternatively, the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way, I felt that counting shared faults together
with private faults would be slower overall.

The second reason was based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as the preferred node even though that is the wrong decision.
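
With made-up numbers the problem looks like this; a self-contained
illustration, not kernel code:

#include <stdio.h>

int main(void)
{
	/* Illustrative fault counts over one scan window. */
	unsigned long priv[2]   = { 100, 0 };	/* node 0: hot private pages */
	unsigned long shared[2] = { 0, 800 };	/* node 1: big shared array  */
	int nid, best_combined = 0, best_private = 0;

	for (nid = 1; nid < 2; nid++) {
		if (priv[nid] + shared[nid] >
		    priv[best_combined] + shared[best_combined])
			best_combined = nid;
		if (priv[nid] > priv[best_private])
			best_private = nid;
	}

	/*
	 * Combined accounting prefers node 1; private-only accounting
	 * prefers node 0, which is where the performance-critical
	 * pages actually are.
	 */
	printf("combined: node %d, private-only: node %d\n",
	       best_combined, best_private);
	return 0;
}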

The third reason was that multiple threads in a process will race each
other to fault the shared page, making the fault information unreliable.

It is important that *something* be done with shared faults, but I haven't
worked out exactly what yet. One possibility would be to give them a
different weight, maybe based on the number of active NUMA nodes, but I have
not tested anything yet. Peter suggested privately that if shared faults
dominate the workload then the shared pages could be migrated based on an
interleave policy, which has some potential.
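
To be concrete about the sort of weighting I mean, something like the
sketch below. It is untested and both the function and the divisor are
placeholders, not anything in the current series:

/*
 * Private faults count in full while shared faults are scaled down by
 * the number of nodes the workload is active on, so a widely shared
 * array cannot drown out a task's private pages when picking the
 * preferred node.
 */
static unsigned long node_score(unsigned long private_faults,
				unsigned long shared_faults,
				int nr_active_nodes)
{
	return private_faults + shared_faults / nr_active_nodes;
}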

--
Mel Gorman
SUSE Labs