Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"

From: Mel Gorman
Date: Tue Mar 02 2010 - 06:25:35 EST


On Tue, Mar 02, 2010 at 10:18:27PM +1100, Nick Piggin wrote:
> On Tue, Mar 02, 2010 at 11:01:50AM +0000, Mel Gorman wrote:
> > On Tue, Mar 02, 2010 at 09:36:46PM +1100, Nick Piggin wrote:
> > > On Tue, Mar 02, 2010 at 10:04:02AM +0000, Mel Gorman wrote:
> > > > On Tue, Mar 02, 2010 at 05:52:25PM +1100, Nick Piggin wrote:
> > > > > The zone pressure waitqueue patch makes sense.
> > > >
> > > > I've just started the rebase and considering what sort of test is best
> > > > for it.
> > > >
> > > > > We may even want to make
> > > > > it more strictly FIFO (eg. check upfront if there are waiters on the
> > > > > queue before allocating a page, and if yes then add ourself to the back
> > > > > of the waitqueue).
> > > >
> > > > To be really strict about this, we'd have to check in the hot-path of the
> > > > per-cpu allocator, which would be undesirable.
> > >
> > > Yes, but it would also be desirable for other reasons (eliminating
> > > all unnecessary latency here). Which is why I say it is a tradeoff
> > > but it might be worthwhile.
> > >
> > > We obviously wouldn't load the waitqueue itself in the fastpath, but
> > > probably some field in a cacheline we already touch.
> > >
> >
> > I'm struggling to justify it in my head. Granted, right now the pressure_wq is
> > located beside the wait_table as a "rarely-used" field. It could be moved in
> > front of free_area[] so that it is pulled in with those cachelines; that would
> > have some impact for the high-order lists but probably not too bad, and it
> > would make accessing pressure_wq a little cheaper.
>
> No, you would mark it in another field in the slowpath.
>

i.e. a flag in zone->flags that is set if something is waiting on the queue?
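
Something along these lines, as a rough sketch only. ZONE_PRESSURE_WAITERS
is an invented name here, but it would sit alongside the existing
ZONE_RECLAIM_LOCKED/ZONE_OOM_LOCKED bits in zone->flags:

/* Hypothetical new bit in zone->flags; name and value illustrative only */
enum {
        ZONE_PRESSURE_WAITERS = 8,      /* tasks asleep on zone->pressure_wq */
};

/*
 * Fastpath test: a single predicted-not-taken branch against a word
 * that can be placed in a cacheline the allocator already touches.
 */
static inline int zone_has_pressure_waiters(struct zone *zone)
{
        return unlikely(test_bit(ZONE_PRESSURE_WAITERS, &zone->flags));
}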

> It would be justified if the aggregate cost of the work is outweighed
> by the reduction in unnecessary wait time in the congestion code, plus
> some unquantifiable goodness from the extra fairness.
>
> Obviously that depends on the workload, but so does every change we
> do like this.
>

True. I'll sketch out what needs to happen along these lines and see
what it looks like.
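
For reference, the wait side I have in mind looks roughly like this. Very
much a sketch: the names are invented, and the flag/wakeup races and the
timeout handling are glossed over. Queueing exclusive waiters puts them at
the tail of the list, which is what gives the FIFO ordering:

static void zone_pressure_wait(struct zone *zone)
{
        DEFINE_WAIT(wait);

        /* Let the allocator fastpath see us before we sleep */
        set_bit(ZONE_PRESSURE_WAITERS, &zone->flags);

        /* Exclusive waiters are added at the tail => FIFO wakeups */
        prepare_to_wait_exclusive(&zone->pressure_wq, &wait,
                                  TASK_UNINTERRUPTIBLE);
        /* Recheck after queueing to avoid a lost wakeup */
        if (!zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
                schedule();
        finish_wait(&zone->pressure_wq, &wait);
}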

> > The impact of *not* checking is unfairness. A late process makes forward
> > progress while others sleep on a queue waiting for a reclaim pass to finish
> > and the queue to be woken up. At worst, processes continue sleeping for the
> > entire timeout because late processes kept the zone below the watermark.
> >
> > Is a little unfairness in a path concerned with heavy memory pressure
> > enough to justify checking on every page allocation?
>
> Well, we've had patches that do a lot more than that in the allocator
> that most people don't really use :) (and they've had relatively small
> impact).
>
> But no, it adds at least another branch, icache cost, and at least one
> more memory bit in the fastpath. Obviously that's a drawback.
>

Of those, the branch would be the greatest concern, but it's also a very
unlikely branch.
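
For completeness, the wake side (roughly what check_zone_pressure() would
do once the watermark is restored) clears the bit again, so the common
case stays a single not-taken branch. Again the names are invented and
the races glossed over:

static void zone_pressure_wake(struct zone *zone)
{
        /* Stop new arrivals queueing, then release everyone */
        if (test_and_clear_bit(ZONE_PRESSURE_WAITERS, &zone->flags))
                wake_up_all(&zone->pressure_wq);
}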

> > > > We could check further in the
> > > > slow-path but I bet it'd be very rare that the logic would be triggered. For
> > > > a process to enter the FIFO due to waiters that were not yet woken up, the
> > > > system would have to be a) under heavy memory pressure b) reclaim taking such
> > > > a long time that check_zone_pressure() is not being called in time and c)
> > > > a process exiting or otherwise freeing memory such that the watermarks are
> > > > cleared without reclaim being involved.
> > >
> > > I don't think it would be too rare. Things can get freed up and
> > > other allocations come in while reclaim is happening. But anyway
> > > the nasty thing about the "rare" events is that they do add a
> > > rare source of unexpected latency or starvation.
> > >
> >
> > If processes are asleep on the waitqueue, reclaim must be active (by kswapd
> > if nothing else). If pages are getting freed above the necessary watermark,
> > then the processes will be woken up when the current shrink_zone() finishes
> > unless unfair processes are keeping the zone below watermarks. But unless
> > reclaim is taking an extraordinarily long time, there would be little
> > difference between waking the queue in the free path and waking it in the
> > reclaim path.
>
> Reclaim can take quite a while, yes.
>
>
> > Again, is a little unfairness under heavy memory pressure enough to
> > alter the main paths?
>
> Well you didn't let me answer yet :) See above.
>
>
> > > I'm not saying necessarily that it will make a noticable improvement
> > > to throughput for this test case, but that now that we have the
> > > waitqueue there, we should think about exactly how far we want to
> > > go with it.
> >
> > Fair point, but I'd sooner look at the other places where the VM waits on
> > timeouts and see whether they can be converted to waiting on events. Just
> > in case, after these
> > tests complete (and assuming they generate usable figures), I'll prototype
> > some of the suggestions and see what the impact is.
>

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab