Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

From: Benjamin Serebrin
Date: Thu Sep 08 2016 - 20:10:04 EST


Sorry for the delay, I was eaten by a grue.

I found that my initial study did not actually measure the number of
TLB shootdown IPIs sent per TLB shootdown. I think the intuition was
correct but I didn't actually observe what I thought I had; my
original use of probe points was incorrect. However, after fixing my
methodology, I'm having trouble proving that the existing Lazy TLB
mode is working properly.



I've spent some time trying to reproduce this in a microbenchmark.
One thread does mmap, touch page, munmap, while other threads in the
same process are configured to either busy-spin or busy-spin and
yield. All threads set their own affinity to a unique cpu, and the
system is otherwise idle. I look at the per-CPU deltas of the TLB
(TLB shootdown) and CAL (function call interrupt) lines in
/proc/interrupts over the run of the microbenchmark.
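
For concreteness, here is a stripped-down sketch of the never-yield
variant of the harness (not the exact code I ran: NSPIN and NUNMAP are
placeholder values and error handling is omitted):

/* Spinner threads never yield; the unmap thread churns mappings. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/mman.h>

#define NSPIN   4
#define NUNMAP  100000

static atomic_int done;

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *spinner(void *arg)
{
        pin_to_cpu((int)(long)arg);
        while (!atomic_load(&done))
                ;               /* busy-spin, never yield */
        return NULL;
}

int main(void)
{
        pthread_t tid[NSPIN];
        long i;

        for (i = 0; i < NSPIN; i++)
                pthread_create(&tid[i], NULL, spinner, (void *)(i + 1));

        pin_to_cpu(0);
        for (i = 0; i < NUNMAP; i++) {
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                p[0] = 1;       /* touch so the page is populated */
                munmap(p, 4096);/* triggers a shootdown to the other CPUs */
        }

        atomic_store(&done, 1);
        for (i = 0; i < NSPIN; i++)
                pthread_join(tid[i], NULL);
        return 0;
}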

Let's say I have 4 spin threads that never yield. The mmap thread
does N unmaps. I observe each spin-thread core receives N (+/- small
noise) TLB shootdown interrupts, and the total TLB interrupt count is
4N (+/- small noise). This is expected behavior.

Then I add some synchronization: the unmap thread rendezvouses with
all the spinners, and once they are all ready, the spinners busy-spin
for D milliseconds and then yield (pthread_yield and sched_yield
produce identical results, though I'm not confident this is the right
kind of yield). Meanwhile, the unmap thread busy-spins for D+E
milliseconds and then does M map/touch/unmaps (D and E are
single-digit milliseconds). The idea is that the unmaps happen a
little while after the spinners have yielded; the kernel on each
spinner CPU should still have the user process's mm loaded, but lazy
TLB mode should defer the flushes, so each such CPU should take one
interrupt and then suppress subsequent ones.
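
Roughly, the synchronized variant looks like the sketch below, built
on top of the one above (D, E, and M are placeholders for the real
parameters, the barrier is initialized for NSPIN + 1 threads, and
<time.h> is additionally needed):

#define D  3            /* ms the spinners stay on-CPU after the rendezvous */
#define E  3            /* extra ms the unmapper waits before unmapping */
#define M  1000         /* map/touch/unmaps per round */

static pthread_barrier_t barrier;       /* NSPIN + 1 participants */

static void spin_for_ms(int ms)
{
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000 +
                 (now.tv_nsec - start.tv_nsec) / 1000000 < ms);
}

static void *lazy_spinner(void *arg)
{
        pin_to_cpu((int)(long)arg);
        for (;;) {
                pthread_barrier_wait(&barrier); /* rendezvous with unmapper */
                spin_for_ms(D);                 /* stay on-CPU briefly */
                sched_yield();                  /* yield; blocking on the next
                                                   barrier should idle this CPU */
        }
        return NULL;
}

static void unmapper_round(void)
{
        int j;

        pthread_barrier_wait(&barrier);
        spin_for_ms(D + E);             /* spinners should have yielded */
        for (j = 0; j < M; j++) {
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                p[0] = 1;
                munmap(p, 4096);        /* lazy CPUs should take at most one
                                           of these IPIs per round */
        }
}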

I expect lazy TLB invalidation to take one interrupt on each spinner
CPU per rendezvous sequence, and I expect Rik's extra-lazy version to
take zero. Instead, I see M interrupts per spinner CPU in all cases.
This leads me to wonder whether I'm failing to trigger lazy TLB
invalidation, or whether lazy TLB invalidation is not working as
intended.

I get similar results using perf record on probe points: I filter by
CPU number and count the number of IPIs sent between each pair of
probe points in the TLB flush routines; the probe points are on
flush_tlb_mm_range and flush_tlb_mm_range%return. For counting the
IPIs themselves, in a VM that uses x2APIC physical mode it is usually
convenient to probe native_x2apic_icr_write or __x2apic_send_IPI_dest
(when it doesn't get inlined away, which sometimes happens), since
that function is called once per target CPU in the cpu_mask passed to
__x2apic_send_IPI_mask. I then filter the perf script output to look
at the distribution of CPUs targeted per TLB shootdown.


Rik's patch definitely looks correct, but I can't yet quantify the gains.

Thanks!
Ben





On Wed, Sep 7, 2016 at 11:56 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Rik van Riel <riel@xxxxxxxxxx> wrote:
>
>> On Sat, 27 Aug 2016 16:02:25 -0700
>> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> > Yeah, with those small fixes from Ingo, I definitely don't think this
>> > looks hacky at all. This all seems to be exactly what we should always
>> > have done.
>>
>> OK, so I was too tired yesterday to do kernel hacking, and
>> missed yet another bit (xen_flush_tlb_others). Sigh.
>>
>> Otherwise, the patch is identical.
>>
>> Looking forward to Ben's test results.
>
> Gentle ping to Ben.
>
> I can also apply this without waiting for the test result, the patch looks sane
> enough to me.
>
> Thanks,
>
> Ingo