Re: [PATCH v5 0/7] Optimize mprotect() for large folios

From: Dev Jain
Date: Fri Jul 18 2025 - 05:51:17 EST

Next message: Alireza Sanaee: "[PATCH v3 4/6] coresight: cti: Use of_cpu_phandle_to_id for grabbing CPU id"
Previous message: Alireza Sanaee: "[PATCH v3 3/6] arch_topology: update CPU map to use of_cpu_phandle_to_id"
In reply to: Catalin Marinas: "Re: [PATCH v5 7/7] arm64: Add batched versions of ptep_modify_prot_start/commit"
Next in thread: Lorenzo Stoakes: "Re: [PATCH v5 0/7] Optimize mprotect() for large folios"

On 18/07/25 2:32 pm, Dev Jain wrote:

Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all
16 entries to collect a/d bits. Hence this optimization will result in
a 16x reduction in the number of ptep_get() calls. Next,
ptep_modify_prot_start() will eventually call contpte_try_unfold() on
every contig block, thus flushing the TLB for the complete large folio
range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
each contig block, and only do them on the starting and ending
contig block.
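
To make the counting argument above concrete, here is a toy userspace model (not kernel code). It assumes 16 ptes per contig block, as stated above, and the helper names are hypothetical. It only counts reads and flushes; it does not model the actual arm64 contpte implementation.

```c
#include <stddef.h>

/* Assumption: 16 ptes per contig block, matching the figure quoted above. */
#define CONT_PTES 16

/* Per-pte walk: one ptep_get()-style read for every pte in the range. */
static size_t reads_per_pte_walk(size_t nr_ptes)
{
	return nr_ptes;
}

/* Batched walk: one read per batch of up to batch_size ptes. */
static size_t reads_batched_walk(size_t nr_ptes, size_t batch_size)
{
	return (nr_ptes + batch_size - 1) / batch_size;
}

/* Unfolding every contig block: one TLB flush per block touched. */
static size_t flushes_unfold_every_block(size_t nr_ptes)
{
	return (nr_ptes + CONT_PTES - 1) / CONT_PTES;
}

/*
 * Batched clear: flushes are needed at most on the starting and the
 * ending contig block, so the count is bounded by 2.
 */
static size_t flushes_batched_upper_bound(size_t nr_ptes)
{
	size_t blocks = flushes_unfold_every_block(nr_ptes);

	return blocks < 2 ? blocks : 2;
}
```

For a 2M range of 4K ptes (512 ptes), this model gives 512 reads versus 32, and 32 flushes versus at most 2.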

For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement
is expected due to a reduction in the number of function calls.

mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
no new tests fail as a result of this patchset.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages
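
A minimal sketch of the timing loop these test cases share is given below. It is not the original test program: the function name is hypothetical, and the THP/mTHP setup that distinguishes test cases 1 and 2 (sysfs configuration, pte-mapping the THPs) is assumed to happen outside of it.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Map len bytes of anonymous memory, touch it, then toggle it between
 * read-only and read-write iters times. Returns the elapsed time of the
 * toggle loop in seconds, or -1.0 on error.
 */
static double time_mprotect_toggle(size_t len, int iters)
{
	struct timespec start, end;
	char *buf;
	int i;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	/* Touch the memory so that the ptes are populated before timing. */
	memset(buf, 1, len);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iters; i++) {
		if (mprotect(buf, len, PROT_READ) ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE)) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	munmap(buf, len);
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_mprotect_toggle(1UL << 30, 40) would correspond to the 1G mapping and 40 iterations described above.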

Average execution time on arm64, Apple M3:

                        T1        T2        T3
Before the patchset:    2.1 s     2 s       1 s
After the patchset:     0.65 s    0.7 s     1.1 s
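
As a quick reading aid, the ratios implied by these averages can be computed as below. The helper name is hypothetical and the inputs are just the numbers quoted above.

```c
/* Ratio of the time before the patchset to the time after it. */
static double speedup(double before_secs, double after_secs)
{
	return before_secs / after_secs;
}
```

With the figures above, speedup(2.1, 0.65) is roughly 3.2 for T1 and speedup(2.0, 0.7) is roughly 2.9 for T2, while speedup(1.0, 1.1) is roughly 0.91 for T3, i.e. the 4K case is about 10% slower in this measurement.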

For the record: these numbers differ from those in the previous versions.
I must have run the test for a larger number of iterations back then, and
then pasted the test program here with 40 iterations; hence the mismatch.
