Re: [PATCH 0/9] x86: Concurrent TLB flushes and other improvements

From: Nadav Amit
Date: Tue Jun 25 2019 - 21:34:23 EST


> On Jun 25, 2019, at 3:02 PM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 6/12/19 11:48 PM, Nadav Amit wrote:
>> Running sysbench on dax w/emulated-pmem, write-cache disabled, and
>> various mitigations (PTI, Spectre, MDS) disabled on Haswell:
>>
>> sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
>> --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
>>
>> events (avg/stddev)
>> -------------------
>> 5.2-rc3: 1247669.0000/16075.39
>> +patchset: 1290607.0000/13617.56 (+3.4%)
>
> Why did you decide on disabling the side-channel mitigations? While
> they make things slower, they're also going to be with us for a while,
> so they really are part of real-world testing IMNHO. I'd be curious
> whether this set has more or less of an advantage when all the
> mitigations are on.

It seemed reasonable since I wanted to avoid all kinds of "noise". I presume
the relative speedup would be smaller with the mitigations enabled, due to
their overhead. Note that in this benchmark every TLB invalidation is of a
single entry. The benefit (in terms of absolute time saved) would have been
greater if each flush covered multiple entries.
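
As a rough illustration of the reasoning above (a sketch, not a measurement):
the first helper recomputes the relative gain from the sysbench numbers quoted
earlier, and the second shows how a fixed absolute saving turns into a smaller
relative speedup once extra per-operation overhead is added. The function
names and the time values passed to the second helper are hypothetical and
chosen only for illustration.

```python
def relative_gain(baseline_events: float, patched_events: float) -> float:
    """Relative throughput gain of the patched kernel over the baseline."""
    return (patched_events - baseline_events) / baseline_events


def speedup_with_overhead(base_time: float, saved_time: float,
                          overhead: float) -> float:
    """Fraction of total time saved when a fixed absolute saving
    (saved_time) is diluted by additional overhead, e.g. mitigations."""
    return saved_time / (base_time + overhead)


# Event counts from the quoted benchmark (5.2-rc3 vs. +patchset).
print(f"{relative_gain(1247669.0, 1290607.0):.1%}")

# Hypothetical time units: same absolute saving, with and without overhead.
print(speedup_with_overhead(100.0, 5.0, 0.0))
print(speedup_with_overhead(100.0, 5.0, 50.0))
```

The absolute saving is identical in the last two calls; only the denominator
grows, which is why the relative figure shrinks.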

> Also, why only 4 threads? Does this set help most when using a moderate
> number of threads since the local and remote cost are (relatively) close
> vs. a large system where doing lots of remote flushes is *way* more
> time-consuming than a local flush?

Don't overthink it. My server was busy doing something else, so I was
running the tests on a lame desktop I have. I will rerun it on a bigger
machine.

I presume the performance benefit will be smaller when more cores are
involved, since the TLB shootdown time will be dominated by the inter-core
communication time (IPI+cache coherency) and the tail latency of the IPI
delivery (if interrupts are disabled on the target).

I am working on some patches to reduce these overheads as well.