Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit

From: Baolin Wang
Date: Tue Oct 24 2023 - 22:02:47 EST




On 10/25/2023 7:31 AM, Barry Song wrote:
On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@xxxxxxxxx> wrote:

On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

Now ptep_clear_flush_young() is only called by folio_referenced() to
check if the folio was referenced, and now it will call a tlb flush on
ARM64 architecture. However the tlb flush can be expensive on ARM64
servers, especially for the systems with a large CPU numbers.

Similar to the x86 architecture, below comments also apply equally to
ARM64 architecture. So we can drop the tlb flush operation in
ptep_clear_flush_young() on ARM64 architecture to improve the performance.
"
/* Clearing the accessed bit without a TLB flush
* doesn't cause data corruption. [ It could cause incorrect
* page aging and the (mistaken) reclaim of hot pages, but the
* chance of that should be relatively low. ]
*
* So as a performance optimization don't flush the TLB when
* clearing the accessed bit, it will eventually be flushed by
* a context switch or a VM operation anyway. [ In the rare
* event of it not getting flushed for a long time the delay
* shouldn't really matter because there's no real memory
* pressure for swapout to react to. ]
*/
"
Running the thpscale to show some obvious improvements for compaction
latency with this patch:
base patched
Amean fault-both-1 1093.19 ( 0.00%) 1084.57 * 0.79%*
Amean fault-both-3 2566.22 ( 0.00%) 2228.45 * 13.16%*
Amean fault-both-5 3591.22 ( 0.00%) 3146.73 * 12.38%*
Amean fault-both-7 4157.26 ( 0.00%) 4113.67 * 1.05%*
Amean fault-both-12 6184.79 ( 0.00%) 5218.70 * 15.62%*
Amean fault-both-18 9103.70 ( 0.00%) 7739.71 * 14.98%*
Amean fault-both-24 12341.73 ( 0.00%) 10684.23 * 13.43%*
Amean fault-both-30 15519.00 ( 0.00%) 13695.14 * 11.75%*
Amean fault-both-32 16189.15 ( 0.00%) 14365.73 * 11.26%*
base patched
Duration User 167.78 161.03
Duration System 1836.66 1673.01
Duration Elapsed 2074.58 2059.75

Barry Song submitted a similar patch [1] before, that replaces the
ptep_clear_flush_young_notify() with ptep_clear_young_notify() in
folio_referenced_one(). However, I'm not sure if removing the tlb flush
operation is applicable to every architecture in kernel, so dropping
the tlb flush for ARM64 seems a sensible change.

Note: I am okay for both approach, if someone can help to ensure that
all architectures do not need the tlb flush when clearing the accessed
bit, then I also think Barry's patch is better (hope Barry can resend
his patch).


Thanks!

ptep_clear_flush_young() with "flush" in its name clearly says it needs a
flush. but it happens in arm64, all other code which needs a flush has
called other variants, for example __flush_tlb_page_nosync():

static inline void arch_tlbbatch_add_pending(struct
arch_tlbflush_unmap_batch *batch,
struct mm_struct *mm, unsigned long uaddr)
{
__flush_tlb_page_nosync(mm, uaddr);
}

so it seems folio_referenced is the only left user of this
ptep_clear_flush_young().
The fact makes Baolin's patch look safe now.

but this function still has "flush" in its name, so one day, one person might
call it with the understanding that it will flush tlb but actually it
won't. This is
bad smell in code.

Agree. I think this is jsut a start, we can replace ptep_clear_flush_young() once other architectures have completed the conversion, if we can confirm that other architectures also do not require tlb flush when clearing the accessed bit.

I guess one side effect of not flushing tlb while clearing the access
flag is that
hardware won't see this cleared flag in the tlb, so it might not set this bit in
memory even though the bit has been cleared before in memory(but not in tlb)
while the page is accessed *again*. >>
next time, someone reads the access flag in memory again by folio_referenced,
he/she will see the page is cold as hardware has lost a chance to set
the bit again
since it finds tlb already has a true access flag.

But anyway, tlb is so small, it will be flushed by context switch and
other running
code soon. so it seems we don't actually require the access flag being instantly
updated. the time gap, in which access flag might lose the new set by hardware,
seems to be too short to really affect the accuracy of page reclamation. but its
cost is large.

(A). Constant flush cost vs. (B). very very occasional reclaimed hot
page, B might
be a correct choice.

Plus, I doubt B is really going to happen. as after a page is promoted to
the head of lru list or new generation, it needs a long time to slide back
to the inactive list tail or to the candidate generation of mglru. the time
should have been large enough for tlb to be flushed. If the page is really
hot, the hardware will get second, third, fourth etc opportunity to set an
access flag in the long time in which the page is re-moved to the tail
as the page can be accessed multiple times if it is really hot.

Thanks Barry, that's also what I thought. On the other hand, even if there is no tlb flush for a long time, I think the system is not under memory pressure at that time, so the incorrect page aging would not have much impact.