Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"

From: Christian Ehrhardt
Date: Thu Feb 11 2010 - 11:11:38 EST


Mel Gorman wrote:
>>>>> How reproducible are these results with patch e084b reverted? i.e. I know
>>>>> it does not work now, but did reverting on the old kernel always fix it
>>>>> or were there occasions where the figures would be still bad?
>>>>>
>>>> Reverting e084b in the past showed something like +40%, so I thought it
>>>> helps.
>>>> Afterwards I found out it wasn't a good testcase for reproducibility and
>>>> when looking at a series it had +5%,-7%,+20%,...
>>>> So by far too noisy to be useful for bisect, discussion or fix testing.
>>>> That's why I made my big round of reinventing the testcases in a more
>>>> reproducible way, as described above.
>>>> In those the original issue is still very visible - which means 4 runs
>>>> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
>>>> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>>> About all e084b can be affecting is cache hotness when fallback is occurring
>>> but a 60% regression still seems way too high for just that.
>> Yes, most of the ideas you, I and a few others have had so far went for cache
>> hotness - but the summary was always "it couldn't be that much".
>
> Well, it's cache hotness or the driver or hardware doing automatic merging
> of requests (which you say is not happening). There are two other
> possibilities but their effect should be very low as well.
>
>> On the other hand we have to consider that it could be a small timing issue, due
>> to cache or whatever else, which leads into congestion_wait, and then it waits
>> HZ/50. That way a 1µs cause can lead to 20001µs lost, which then causes the 60%.
>
> Maybe although it would imply the window where it can go wrong is
> very small. It seems unlikely that it would be hit every time.
>
>> Another important thing is that "cold page" afaik means the page content,
>> and not the management struct etc. As the time lost is in management only
>> (=> before accessing the actual 4k page content) it should not be important
>> if a page is cold/hot for the mm code itself.
>
> True. Even the zeroing should be a non-temporal store so the timing
> should not be affected.
>
>> So much for direct_reclaim, which either runs into "progress, but !page"
>> (BAD case) or not (GOOD case) - there, only cache effects on struct page,
>> the pcp lists etc. should have an effect.
>
> There are only two other possibilities I can think of.
>
> 1. The extra parameter to rmqueue_bulk() means that your platform
> generates particularly bad code to pass a 6th parameter.

I never checked that number, but off the top of my head I can tell you that on s390 up to 5 parameters can be passed in registers; starting with the 6th they have to go via the stack.
So there is some hope for that theory - some numbers below to keep the results together.
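To illustrate that (a toy sketch only, not kernel code - the function is just modeled on rmqueue_bulk()):

/*
 * On s390(x) the first five integer/pointer arguments are passed in
 * registers r2-r6; the sixth and later ones go via the caller's stack
 * frame, costing an extra store in the caller and a load in the callee
 * on every call.
 */
struct zone;
struct list_head;

int rmqueue_bulk_like(struct zone *zone, unsigned int order,      /* r2, r3 */
                      unsigned long count, struct list_head *list, /* r4, r5 */
                      int migratetype,                             /* r6 */
                      int cold)                                    /* stack */
{
        return cold;    /* placeholder body, just to make it compile */
}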

> 2. The branch inside the for loop is causing serious problems somehow

I also know that some branches can be tricky on s390, so I look forward to seeing your idea of how we could verify that without changing too much behavior.
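For reference, the per-iteration branch in question should be roughly the following (reconstructed from memory of e084b, not verbatim):

        /*
         * inside rmqueue_bulk()'s for loop, once __rmqueue() returned a page:
         * cold requests add to the tail so the caller sees pages in PFN
         * order, hot requests keep using list_add() as before
         */
        if (likely(cold == 0))
                list_add(&page->lru, list);
        else
                list_add_tail(&page->lru, list);

So it is one extra, highly predictable branch per page pulled onto the pcp list.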

>
> Causing a 60% slowdown from either of these seems unlikely but there isn't
> much choice than to eliminate them as possibilities. The attached patch is
> for "option 1" above. I'll think more on how to best handle "option 2"

>>>>>> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21
>>>>>> which came in later. Both patches unapplied individually don't
>>>>>> improve anything. But both patches reverted at the same time on git
>>>>>> v2.6.32 bring us back to our old values (remember that is more than
>>>>>> 2Gb/s improvement in throughput)!
>>>>>>
>>>>>> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this
>>>>>> can cause so much trouble.
>>>>>>
>>>>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>>>>> workloads can suffer. A second is that users of high-order allocations can
>>>>> enter direct reclaim a lot more than previously. This was fixed in commit
>>>>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>>>>> your case.
>>>>>
>>>>> The only other interesting thing in commit 5f8dcc21 is that it increases
>>>>> the memory overhead of a per-cpu structure. If your memory usage is really
>>>>> borderline, it might have been just enough to push it over the edge.
>>>>>
>>>> I had these thoughts too, but as it only shows an effect with 5f8dcc21
>>>> and e084b reverted at the same time I'm wondering if it can be that.
>>>> But I agree that this should be considered too until a verification can
>>>> be done.
>>> Another consequence of 5f8dcc is that it is possible for pages to be on the
>>> per-cpu lists when the page allocator is called. There is potential that
>>> page reclaim is being entered as a result even though pages were free on
>>> the per-cpu lists.
>>>
>>> It would not explain why reverting e084b makes such a difference but I
>>> can prototype a patch that falls back to other per-cpu lists before
>>> calling the page allocator as it used to do but without linear searching
>>> the per-cpu lists.
>> I tested the patch you submitted in another reply.
>> In short - it only gives ~+4%, probably by speeding things up a bit but
>> not touching the issue in any way.
>> Some more detail below at the other test result.
>>
>
> You mention it gains 4% which is not great, but what impact, if any, had
> it on the number of calls to direct reclaim?

Sorry, I forgot to add those numbers yesterday.
Unfortunately it had no effect. Comparing 2.6.32 to 2.6.32 with the avoid-direct-reclaim patch, these are the numbers for e.g. 16 threads:


with avoid direct reclaim (pcp list fallback)
perf_count_pages_direct_reclaim 32909
perf_count_failed_pages_direct_reclaim_but_progress 9802 =29.79%

remember, the original 2.6.32 had:
perf_count_pages_direct_reclaim 33662
perf_count_failed_pages_direct_reclaim_but_progress 10152 =30.16%

So no big progress due to that (~4% TP as mentioned before).
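For completeness, the perf_count_* values come from a local instrumentation patch; a minimal sketch of how such a counter can be wired up (simplified, hypothetical helper name, not the actual patch):

#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

static u64 perf_count_pages_direct_reclaim;

/* bumped from the allocator slow path right before direct reclaim is entered */
static inline void perf_count_direct_reclaim_hit(void)
{
        perf_count_pages_direct_reclaim++;      /* racy, but good enough for rough statistics */
}

static int __init perf_count_debugfs_init(void)
{
        /* shows up as /sys/kernel/debug/perf_count_pages_direct_reclaim */
        debugfs_create_u64("perf_count_pages_direct_reclaim", 0444, NULL,
                           &perf_count_pages_direct_reclaim);
        return 0;
}
late_initcall(perf_count_debugfs_init);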


>> <SNIP>
>>> Ok. Two avenues of attack then although both are focused on 5f8dcc21.
>>> The first is the prototype below that should avoid congestion_wait. The
>>> second is to reintroduce a fallback to other per-cpu lists and see if
>>> that was a big factor.
>> I tested both of your patches, btw thanks for the quick work!
>>
>
> Thanks for persisting with this! I know it's dragging on a lot longer
> than anyone would like.
>
>>> Hmm, the patch as it currently stands is below. However, I warn you that
>>> it has only been boot-tested on qemu with no proper testing done, either
>>> functional or performance testing. It may just go mad, drink your beer and
>>> chew on your dog.
>> Contrary to all your warnings it worked pretty well.
>
> Nice one.
>
>> The patch replacing the congestion_wait calls - which we know are futile in my scenario
>> anyway - with zone wait calls helps a lot. But, as you suggested, so far it only alleviates the symptoms.
>>
>> From a throughput perspective alone it looks almost fixed:
>> 4 thread 8 thread 16 thread
>> plain 2.6.32 100% 100% 100%
>> 2.6.32 avoid direct reclaim 103.85% 103.60% 104.04%
>> 2.6.32 zone wait 125.46% 140.87% 131.89%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 124.46% 143.35% 144.21%
>>
>
> Ok, the fact that the congestion_wait-avoidance patch makes such a large
> difference is very interesting although if the direct reclaim calls were
> lower, it would be less pronounced. I will be having those patches on the
> back-burner for another while unless we cannot resolve this problem a better
> way but it's nice to know they may have some potential.
>
>> Looking at the counters, this theory (only fixing symptoms) seems to be
>> confirmed. Before, the ratio of perf_count_pages_direct_reclaim calls that
>> ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the
>> good cases and ~30% in the bad cases. As congestion_wait in my case almost
>> always waits the full HZ/50, this is a lot.
>
>> With the zone wait patch I see a huge throughput win, most probably by
>> all the (still ~30%!) waits that don't have to wait for the full timeout.
>
> That would be my expectation. The underlying problem of
> too-many-direct-reclaims is not solved.
>
>> But still, reverting 5f8dcc21 and e084b fixes the issue of so many direct
>> reclaims running into that condition and that way gives throughput
>> another last push.
>> % ran into try_to_free did progress but get_page got !page
>> plain 2.6.32 30.16%
>> 2.6.32 zone wait 29.05%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 2.93%
>>
>
> And what impact did "Fallback to other per-cpu lists when the target list
> is empty and memory is low" have on the number of calls to direct reclaim?

see above

>
>> Note - with the zone wait patch there are also almost twice as many
>> direct_reclaim calls. Probably due to several waiters coming off the list when
>> the watermark is restored and then allocating too much.
>
> It could be either a thundering herd problem or just the fact that the
> processes are not asleep.

That sounds right, but like other things we found it is just related to the symptoms, so I guess we can neglect it for now.

>> Another good side note - 5f8dcc21 and e084b reverted plus your zone wait
>> patch is better than anything I have seen so far :-)
>
> Well, that's something positive out of all this. The congestion_wait() patch
> was developed in response to this thread. Even though I knew it was treating
> symptoms and not problems, the congestion_wait() was identified as a bad
> idea some time ago because the lack of pages may have nothing to do with IO.
>
>> Maybe it's worth measuring both of your patches plus 5f8dcc21 and e084b
>> reverted with 64 threads (the effect grows with the number of threads), just to
>> rejoice in huge throughput values for a short time.
>
> :)
>
>>> <SNIP>
>>>
>>> There is a good chance that 5f8dcc is causing direct reclaim to be entered
>>> when it could have been avoided as pages were on the per-cpu lists but
>>> I don't think the difference in cache hotness is enough to explain a 60%
>>> loss on your machine.
>>>
>> As far as I understand it, that's what your patch #2 fixed (page allocator: Fallback
>> to other per-cpu lists when the target list is empty and memory is low),
>> but as mentioned below that only gave me +4%.
>
> I need to know what effect, if any, it had on the direct reclaim
> figures. If it does drop them, but not by enough, then fallback can be
> allowed to occur in more circumstances - for example, by changing
>
> + if (pcp->count && free_pages <= low_wmark_pages(zone)) {
>
> with
>
> + if (pcp->count && free_pages <= high_wmark_pages(zone)) {
>
>> Which is actually a fine win, but not for the issue currently under discussion.
>> As a bonus I can now give you a little Tested-by for both patches :-)
>>
>
> Thanks :)
>
> Still, the grind continues to try to resolve this :(
>
> 1. If the "Fallback to other per-cpu lists" does reduce the number of
> calls to direct reclaim, but not by enough, please replace the
> comparison of low_wmark_pages with high_wmark_pages and let me know a)
> how much it affects throughput and b) how much it affects the number
> of calls to direct reclaim

The difference between the low and the high watermark is minimal.

Here avoid-direct-reclaim with the high watermark:
perf_count_pages_direct_reclaim 32839
perf_count_failed_pages_direct_reclaim_but_progress 488 =30.16%

As a reference from above, the one with the low watermark:
perf_count_pages_direct_reclaim 32909
perf_count_failed_pages_direct_reclaim_but_progress 9802 =29.79%

=> this is within the small amount of noise that is left.

>
> This is addressing potential problems in 5f8dcc
>
> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> number of parameters being passed is making some unexpected large
> difference

BINGO - this definitely hit something.
This experimental patch does two things - on one hand it closes the race we had:

4 THREAD READ 8 THREAD READ 16 THREAD READ % of calls
perf_count_congestion_wait 13 27 52
perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
perf_count_pages_direct_reclaim 30867 56811 131552
perf_count_failed_pages_direct_reclaim 14 24 52
perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!

On the other hand we see that the number of direct_reclaim calls increased by ~x4.

I assume (I might be totally wrong) that the ~4x increase in direct_reclaim calls could be caused by the fact that before we used higher orders, which worked on 4x the number of pages at once (e.g. an order-2 allocation covers 2^2 = 4 pages, so serving the same demand with order-0 calls takes roughly four times as many of them).

This leaves me with two ideas as to what the real issue could be:
1. either it really is the 6th parameter, as this is the first one that needs to go on the stack and that way might open a race and rob gcc of a big pile of optimization chances
2. maybe just the fact that we now only do "order 0" operations changes the behaviour

I think I'll try to cover that tomorrow by keeping the old call with 6 parameters but rewriting it internally to order 0 again (and hope I don't need too many tricks to stop gcc from optimizing it out).
Maybe I could also extend the statistics to track the order in the failing cases vs. all cases, or something like that.
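A rough sketch of that experiment (hypothetical and untested, body elided - the point is only to keep the 6-parameter signature while forcing order 0 inside):

static int rmqueue_bulk(struct zone *zone, unsigned int order,
                        unsigned long count, struct list_head *list,
                        int migratetype, int cold)
{
        /*
         * Keep the calling convention (and the 6th argument on the stack)
         * exactly as before, but force order 0 inside.  The volatile read
         * should at least keep gcc from constant-folding the forced value;
         * more may be needed to keep it from dropping the now-unused
         * parameter (the "tricks" mentioned above).
         */
        static volatile unsigned int forced_order;      /* always 0 */

        order = forced_order;

        /* ... original loop body and return unchanged, still using 'order' ... */
}

If throughput then stays as good as with your patch, the win came from the order-0-only behaviour; if it regresses again, the parameter passing itself is to blame.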

I don't have full access to the machines tomorrow, so a proper response might have to wait until Monday.
Thanks a lot Mel, this brought us much closer to a final fix.


> This is trying to figure out more what the problem with e084b is
>
> ==== CUT HERE ====
> rmqueue_bulk-fewer-parameters
[...]
--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization