post-3.18 performance regression in TLB flushing code

From: Dave Hansen
Date: Tue Dec 16 2014 - 16:37:05 EST


I'm running the 'brk1' test from will-it-scale:

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/brk1.c

on an 8-socket/160-thread system. It's seeing about a 6% drop in
performance (263M -> 247M ops/sec at 80 threads) from this commit:

commit fb7332a9fedfd62b1ba6530c86f39f0fa38afd49
Author: Will Deacon <will.deacon@xxxxxxx>
Date: Wed Oct 29 10:03:09 2014 +0000

mmu_gather: move minimal range calculations into generic code

tlb_finish_mmu() goes up about 9x in the profiles (~0.4% -> 3.6%), and
tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
applied but does not show up at all with the commit before it.
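
For context, the hot part of brk1 is its measurement loop, which roughly
does the following (a paraphrased, standalone sketch, not the actual test
source):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char *top = sbrk(0);		/* current program break */
	unsigned long ops;

	for (ops = 0; ops < 10000000; ops++) {
		brk(top + 4096);	/* grow the heap by one page */
		brk(top);		/* shrink it back, unmapping the page */
	}
	printf("%lu ops\n", ops);
	return 0;
}

Every shrink goes through do_munmap()/unmap_region(), so
tlb_gather_mmu()/tlb_finish_mmu() run once per op, which is presumably why
a change on the flush side is so visible in this test.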

This isn't a major regression, but it is rather unfortunate for a patch
that is apparently a code cleanup. It also _looks_ like it shows up even
when things are single-threaded, although I haven't looked at that in detail.

I suspect the tlb->need_flush logic was serving some role that the
modified code isn't capturing, for instance in this hunk:

> void tlb_flush_mmu(struct mmu_gather *tlb)
> {
> -	if (!tlb->need_flush)
> -		return;
> 	tlb_flush_mmu_tlbonly(tlb);
> 	tlb_flush_mmu_free(tlb);
> }

tlb_flush_mmu_tlbonly() has a tlb->end check (which replaces the
->need_flush logic), but tlb_flush_mmu_free() does not.
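
For reference, the two helpers now look roughly like this (paraphrased from
the post-commit mm/memory.c, not quoted verbatim):

static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
{
	if (!tlb->end)			/* nothing gathered: bail out early */
		return;

	tlb_flush(tlb);
	/* (notifier/table-free details elided) */
	__tlb_reset_range(tlb);
}

static void tlb_flush_mmu_free(struct mmu_gather *tlb)
{
	struct mmu_gather_batch *batch;

	/* no equivalent early return: the batch list is always walked */
	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
		free_pages_and_swap_cache(batch->pages, batch->nr);
		batch->nr = 0;
	}
	tlb->active = &tlb->local;
}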

If we add a !tlb->end check (patch attached) to tlb_flush_mmu(), that gets us
back up to ~258M ops/sec, but that's still ~2% down from where we started.


---

b/mm/memory.c | 3 +++
1 file changed, 3 insertions(+)

diff -puN mm/memory.c~fix-old-need_flush-logic mm/memory.c
--- a/mm/memory.c~fix-old-need_flush-logic 2014-12-16 13:24:27.338557014 -0800
+++ b/mm/memory.c 2014-12-16 13:24:50.412598019 -0800
@@ -258,6 +258,9 @@ static void tlb_flush_mmu_free(struct mm
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
+	if (!tlb->end)
+		return;
+
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }
_