Re: [PATCH v5 00/25] context_tracking,x86: Defer some IPIs until a user->kernel transition

From: Dave Hansen
Date: Wed Apr 30 2025 - 16:01:46 EST


On 4/30/25 12:42, Steven Rostedt wrote:
>> Look at the syscall code for instance:
>>
>>> SYM_CODE_START(entry_SYSCALL_64)
>>> swapgs
>>> movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
>>> SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>> You can _trivially_ audit this and know that swapgs doesn't touch memory
>> and that as long as PER_CPU_VAR()s and the process stack don't have
>> their mappings munged and flushes deferred that this would be correct.
> Hmm, so there is still a path for this?
>
> At least if it added more ways to debug it, and some other changes to make
> the locations where vmalloc is dangerous smaller?

Being able to debug it would be a good start. But, more generally, what
we need is for more people to be able to run the code in the first
place. Would a _normal_ system (without setups that are trying to do
NOHZ_FULL) ever be able to defer TLB flush IPIs?

If the answer is no, then, yeah, I'll settle for some debugging options.

But if you shrink the window as small as I'm talking about, it would
look very different from this series.

For instance, imagine when a CPU goes into the NOHZ mode. Could it just
unconditionally flush the TLB on the way back into the kernel (in the
same SWITCH_TO_KERNEL_CR3 spot)? Yeah, it'll make entry into the kernel
expensive for NOHZ tasks, but it's not *THAT* bad. And if the entire
point of a NOHZ_FULL task is to minimize the number of kernel entries
then a little extra overhead there doesn't sound too bad.
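
To make the idea above concrete, here is a toy userspace model in C. It
is only a sketch, not kernel code: every name in it (cpu_model,
model_enter_nohz_user(), model_flush_kernel_range(),
model_kernel_entry()) is hypothetical, and the "flush" is just a
counter. It assumes a remote kernel-range flush simply skips the IPI
for any CPU running a NOHZ_FULL task in userspace, and that such a CPU
flushes unconditionally on its next kernel entry, at the point where
SWITCH_TO_KERNEL_CR3 would run.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy per-CPU state. All names are hypothetical, for illustration only. */
struct cpu_model {
	bool nohz_user;              /* running a NOHZ_FULL task in userspace */
	bool tlb_maybe_stale;        /* a flush IPI was skipped for this CPU */
	unsigned long ipis;          /* flush IPIs actually delivered */
	unsigned long entry_flushes; /* flushes done on kernel entry */
};

struct cpu_model cpus[NR_CPUS];

/* The CPU returns to userspace to run its NOHZ_FULL task. */
void model_enter_nohz_user(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpus[cpu].nohz_user = true;
}

/*
 * A kernel mapping changed. Send a flush IPI to every CPU except those
 * sitting in NOHZ_FULL userspace; those are left alone and only marked
 * as possibly stale.
 */
void model_flush_kernel_range(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpus[cpu].nohz_user)
			cpus[cpu].tlb_maybe_stale = true;
		else
			cpus[cpu].ipis++;
	}
}

/*
 * Kernel entry, standing in for the SWITCH_TO_KERNEL_CR3 spot. A CPU
 * coming back from NOHZ_FULL userspace flushes unconditionally; it does
 * not check whether a flush was actually skipped while it was away.
 */
void model_kernel_entry(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpus[cpu].nohz_user) {
		cpus[cpu].entry_flushes++;
		cpus[cpu].tlb_maybe_stale = false;
		cpus[cpu].nohz_user = false;
	}
}
```

In this model the entry path never reads tlb_maybe_stale before
flushing. It is only there so the window between a skipped IPI and the
next kernel entry can be observed.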

Also, about the new hardware, I suspect there's some mystery customer
lurking in the shadows asking folks for this functionality. Could you at
least go _talk_ to the mystery customer(s) and see which hardware they
care about? They might already even have the magic CPUs they need for
this, or have them on the roadmap. If they've got Intel CPUs, I'd be
happy to help figure it out.