Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request

From: Nadav Amit
Date: Fri Jun 20 2025 - 17:24:58 EST




On 20/06/2025 4:10, Rik van Riel wrote:
> On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
> > > +
> > > +static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
> > > +			 long pages)
> > > +{
> > > +	p->must_be_zero_1 = 0;
> > > +	p->must_be_zero_2 = 0;
> > > +	p->must_be_zero_3 = 0;
> > > +	p->page_size = RAR_INVLPG_PAGE_SIZE_4K;

> > I think you can propagate the stride to this point instead of using a
> > fixed 4KB stride.

> Agreed. That's another future optimization, for
> once this code all works right.

It might not be an optimization if it leads to a performance regression relative to the current software-based flush, which does take the stride into account. IOW: flushing a 2MB huge page would end up doing a full TLB flush with RAR, in contrast to selectively flushing the specific entry with the software-based shootdown. Correct me if I'm wrong.
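Just to make the suggestion concrete, something along these lines is what I had in mind (completely untested, and I am assuming the payload format also has a 2MB encoding, which I call RAR_INVLPG_PAGE_SIZE_2M here):

static u8 rar_page_size(unsigned int stride_shift)
{
	/*
	 * Map the stride used by the software flush onto the RAR
	 * page-size encoding, falling back to 4KB.
	 */
	if (stride_shift == PMD_SHIFT)
		return RAR_INVLPG_PAGE_SIZE_2M;	/* assumed to exist */
	return RAR_INVLPG_PAGE_SIZE_4K;
}

set_payload() would then take the stride_shift from the flush info, set p->page_size = rar_page_size(stride_shift), and count num_pages in units of that stride.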


> Currently I am in a situation where the receiving
> CPU clears the action vector from RAR_PENDING to
> RAR_SUCCESS, but the TLB does not appear to always
> be correctly flushed.

Interesting. I really do not know anything about it, but you may want to look at the performance counters to see whether the flush is at least reported to take place.



> > > +	p->type = RAR_TYPE_INVPCID;
> > > +	p->pcid = pcid;
> > > +	p->linear_address = start;
> > > +
> > > +	if (pcid) {
> > > +		/* RAR invalidation of the mapping of a specific process. */
> > > +		if (pages < RAR_INVLPG_MAX_PAGES) {
> > > +			p->num_pages = pages;
> > > +			p->subtype = RAR_INVPCID_ADDR;
> > > +		} else {
> > > +			p->subtype = RAR_INVPCID_PCID;

> > I wonder whether it would be safer to set p->num_pages to something
> > (then we can do it unconditionally).

> We have a limited number of bits available for
> p->num_pages. I'm not sure we want to try
> writing a larger number than what fits in those
> bits.

I was just looking for a way to simplify the code and at the same time avoid an "undefined" state. History has shown that all kinds of unintended consequences happen when some state is left uninitialized (even if the spec does not require it).
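For example, something like this (untested; the exact clamp depends on how many bits the num_pages field really has):

	/*
	 * Always initialize num_pages so the payload never carries stale
	 * bits; presumably the value is ignored by the subtypes that do
	 * not use it.
	 */
	p->num_pages = min_t(long, pages, RAR_INVLPG_MAX_PAGES - 1);

	if (pages < RAR_INVLPG_MAX_PAGES)
		p->subtype = RAR_INVPCID_ADDR;
	else
		p->subtype = RAR_INVPCID_PCID;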



> > > +		}
> > > +	} else {
> > > +		/*
> > > +		 * Unfortunately RAR_INVPCID_ADDR excludes global translations.
> > > +		 * Always do a full flush for kernel invalidations.
> > > +		 */
> > > +		p->subtype = RAR_INVPCID_ALL;
> > > +	}
> > > +
> > > +	/* Ensure all writes are visible before the action entry is set. */
> > > +	smp_wmb();

> > Maybe you can drop the smp_wmb() here and instead change the
> > WRITE_ONCE() in set_action_entry() to smp_store_release()? It should
> > have the same effect and I think would be cleaner and convey your
> > intent better.

> We need protection against two different things here.
>
> 1) Receiving CPUs must see all the writes done by
> the originating CPU before we send the RAR IPI.
>
> 2) Receiving CPUs must see all the writes done by
> set_payload() before the write done by
> set_action_entry(), in case another CPU sends
> the RAR IPI before we do.
>
> That other RAR IPI could even be sent between
> when we write the payload, and when we write
> the action entry. The receiving CPU could take
> long enough processing other RAR payloads that
> it can see our action entry after we write it.
>
> Does removing the smp_wmb() still leave everything
> safe in that scenario?

Admittedly, I do not understand your concern. IIUC, until set_action_entry() is called and RAR_PENDING is set, IPIs are irrelevant and the RAR entry will not be processed. Once it is set, you want all the prior writes to be visible. Semantically, that is exactly what smp_store_release() means.

Technically, smp_wmb() + WRITE_ONCE() is equivalent to smp_store_release() on x86: smp_wmb() is actually just a compiler barrier, since x86 follows the TSO memory model, and smp_store_release() is a compiler barrier + WRITE_ONCE(). So I think it is all safe and at the same time clearer.
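In other words, what I had in mind is roughly the following (untested, and I am only guessing at the shape of set_action_entry() and of the action vector, which are not quoted above):

static void set_action_entry(u8 *action_vector, int idx)
{
	/*
	 * Publish the entry: the release store orders all of the
	 * set_payload() stores before the entry becomes RAR_PENDING,
	 * so any CPU that observes RAR_PENDING - whether it got our
	 * IPI or someone else's - also sees a complete payload.
	 */
	smp_store_release(&action_vector[idx], RAR_PENDING);
}

The originating side would then no longer need the explicit smp_wmb() at the end of set_payload().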