Re: [PATCH v2 10/12] x86/mm: Move flush_tlb_info back to the stack

From: Chuyi Zhou

Date: Thu Mar 05 2026 - 02:01:59 EST


Hi Peter,

On 2026/3/2 22:58, Peter Zijlstra wrote:
> On Mon, Mar 02, 2026 at 03:52:14PM +0800, Chuyi Zhou wrote:
>> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
>> stack") converted flush_tlb_info from a stack variable to a per-CPU
>> variable. This brought a performance improvement of around 3% in an
>> extreme test. However, it also required that all flush_tlb* operations
>> keep preemption disabled entirely to prevent concurrent modifications of
>> flush_tlb_info. flush_tlb* needs to send IPIs to remote CPUs and
>> synchronously wait for all remote CPUs to complete their local TLB
>> flushes. This can take tens of milliseconds when interrupts are disabled
>> or when there is a large number of remote CPUs.
>>
>> To improve kernel real-time behavior, this patch moves flush_tlb_info
>> back to the stack. This is a preparation for enabling preemption during
>> TLB flush in the next patch.
>
> This isn't properly justified. You've got to show that 'most' workloads
> are not adversely affected by this.
>
> Most people still swing towards performance most of the time.

I attempted to reproduce the microbenchmark mentioned in Commit
3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
stack") using the script below.

The baseline was tip/sched/core: f74d204baf9f (sched/hrtick: Mark
hrtick_clear() as always used).

The test environment was an Ice Lake system (Intel(R) Xeon(R) Platinum
8336C) with 128 CPUs and 2 NUMA nodes.

Using the per-CPU flush_tlb_info showed only a marginal performance
advantage, approximately 1%.

                base       on-stack
                ----       ---------
avg (usec/op)   5.9362     5.9956 (+1%)
stddev          0.0240     0.0096

I also tested with the mmtests stress-ng-madvise configuration, which
randomly calls madvise() on pages within an mmap()ed range and triggers a
large number of high-frequency TLB flushes. I did not observe any
significant difference there either.

                          baseline               on-stack

Amean bops-madvise-1      13.64 (  0.00%)       13.56 (  0.59%)
Amean bops-madvise-2      27.32 (  0.00%)       27.26 (  0.24%)
Amean bops-madvise-4      53.35 (  0.00%)       53.54 ( -0.35%)
Amean bops-madvise-8     103.09 (  0.00%)      103.30 ( -0.20%)
Amean bops-madvise-16    191.88 (  0.00%)      191.75 (  0.07%)
Amean bops-madvise-32    287.98 (  0.00%)      291.01 * -1.05%*
Amean bops-madvise-64    365.84 (  0.00%)      368.09 * -0.61%*
Amean bops-madvise-128   422.72 (  0.00%)      423.47 ( -0.18%)
Amean bops-madvise-256   435.61 (  0.00%)      435.63 ( -0.01%)


Thanks.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define NUM_OPS 1000000
#define NUM_THREADS 3
#define NUM_RUNS 5
#define PAGE_SIZE 4096

volatile int stop_threads = 0;

/*
 * Background threads keep this mm active on other CPUs, so each
 * madvise() below has to send TLB-flush IPIs to remote CPUs.
 */
void *busy_wait_thread(void *arg)
{
	while (!stop_threads)
		__asm__ volatile ("nop");
	return NULL;
}

long long get_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main(void)
{
	pthread_t threads[NUM_THREADS];
	char *addr;
	int i, r;

	addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < NUM_THREADS; i++) {
		if (pthread_create(&threads[i], NULL, busy_wait_thread, NULL) != 0) {
			perror("pthread_create");
			exit(1);
		}
	}

	printf("Running benchmark: %d runs, %d ops each, %d background threads\n",
	       NUM_RUNS, NUM_OPS, NUM_THREADS);

	for (r = 0; r < NUM_RUNS; r++) {
		long long start, end;
		double duration, avg_lat;

		start = get_usec();
		for (i = 0; i < NUM_OPS; i++) {
			/* Fault the page back in, then zap it again. */
			addr[0] = 1;
			if (madvise(addr, PAGE_SIZE, MADV_DONTNEED) != 0) {
				perror("madvise");
				exit(1);
			}
		}
		end = get_usec();

		duration = (double)(end - start);
		avg_lat = duration / NUM_OPS;
		printf("Run %d: Total time %.2f us, Avg latency %.4f us/op\n",
		       r + 1, duration, avg_lat);
	}

	stop_threads = 1;
	for (i = 0; i < NUM_THREADS; i++)
		pthread_join(threads[i], NULL);

	munmap(addr, PAGE_SIZE);
	return 0;
}