Re: [PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata
From: Kairui Song
Date: Fri Apr 24 2026 - 14:13:02 EST
On Tue, Apr 21, 2026 at 2:17 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
>
> This series unifies the allocation and charging of anon and shmem swap
> in folios, provides better synchronization, consolidates the metadata
> management, hence dropping the static array and map, and improves the
> performance. The static metadata overhead is now close to zero, and
> workload performance is slightly improved.
>
> For example, mounting a 1TB swap device saves about 512MB of memory:
>
> Before:
> free -m
>         total    used    free  shared  buff/cache  available
> Mem:     1464     805     346       1         382        658
> Swap: 1048575       0 1048575
>
> After:
> free -m
>         total    used    free  shared  buff/cache  available
> Mem:     1464     277     899       1         356       1187
> Swap: 1048575       0 1048575
>
> Memory usage is ~512M lower, and the static overhead is now close to
> zero: it was about 2 bytes per slot before, and is now roughly 0.09375
> bytes per slot (48 bytes of ci info per cluster, which is 512 slots).
>
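The per-slot figures above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes 4 KiB pages (the cover letter does not state the page size), and the names below are made up for illustration, they are not kernel identifiers.

```python
# Back-of-the-envelope check of the static swap metadata overhead.
# Assumption: one swap slot per 4 KiB page. All names are illustrative.
PAGE_SIZE = 4096
SLOTS_PER_CLUSTER = 512      # from the cover letter
CI_BYTES_PER_CLUSTER = 48    # from the cover letter
OLD_BYTES_PER_SLOT = 2       # from the cover letter


def static_overhead(swap_bytes):
    """Return (before, after) static metadata size in bytes."""
    slots = swap_bytes // PAGE_SIZE
    clusters = slots // SLOTS_PER_CLUSTER
    before = slots * OLD_BYTES_PER_SLOT
    after = clusters * CI_BYTES_PER_CLUSTER
    return before, after


before, after = static_overhead(1 << 40)  # 1 TiB swap device
per_slot = CI_BYTES_PER_CLUSTER / SLOTS_PER_CLUSTER
print(before >> 20, "MiB of static metadata before")
print(after >> 20, "MiB of per-cluster info after")
print(per_slot, "bytes per slot after")
```

Under that assumption the old layout costs 512 MiB for a 1 TiB device, which matches the saving seen in free -m.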
> Performance tests also look good. Testing Redis in a 1.5G VM using
> 5G ZRAM as swap:
>
> valkey-server --maxmemory 2560M
> redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> Before: 3289011.918750 RPS
> After: 3312087.142241 RPS (0.99% better)
>
> Testing a kernel build under global memory pressure on a 48c96t system,
> limiting the total memory to 8G, using 12G ZRAM, 24 test runs,
> with THP enabled:
>
> make -j96, using defconfig
>
> Before: user time 2904.59s system time 4773.99s
> After: user time 2909.38s system time 4641.55s (2.77% better)
>
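For reference, the percentage is the relative change between the two timings. A minimal sketch follows; the helper name is hypothetical, and the inputs are the numbers quoted above.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative = lower)."""
    return (after - before) / before * 100.0


# Timings from the kernel build test quoted above, in seconds.
sys_change = percent_change(4773.99, 4641.55)
user_change = percent_change(2904.59, 2909.38)
print(f"system time: {sys_change:+.2f}%")
print(f"user time:   {user_change:+.2f}%")
```

System time drops by about 2.77%, while user time stays within noise.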
> Testing with usemem on a 32c machine using a 48G brd ramdisk and 16G
> RAM, 12 test runs:
>
> usemem --init-time -O -y -x -n 48 1G
>
> Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us
> After: Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us
>
> Seems similar, or slightly better.
>
> This series also reduces memory thrashing. I no longer see any
> "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF" messages,
> which showed up several times during stress testing under great
> pressure before this series:
>
> Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18
> After: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0
>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
> Changes in v3:
> - This is based on mm-unstable, also applies to mm-new, and has no
> conflict with YoungJun's tier series, and only trivial conflict with
> Baoquan's swapops due to filename change.
> - Fix zero map build issue on 32 bit archs [ YoungJun Park ]
> - Cleanup memcg table allocation helpers [ YoungJun Park ]
> - Fix WARN for non NUMA build:
> https://lore.kernel.org/linux-mm/CAMgjq7ANih7u7SJB8uWcQHS8XRJySNRc3ti9V-SVey0nGE3gLQ@xxxxxxxxxxxxxx/
> - Improve commit messages.
> - Re-run several tests; the conclusion is the same as v2.
> - Link to v2: https://patch.msgid.link/20260417-swap-table-p4-v2-0-17f5d1015428@xxxxxxxxxxx
>
> Changes in v2:
> - Drop the RFC prefix and also the RFC part.
> - Now there is zero change to cgroup or refault tracking; RFC v1 changed
> some cgroup behavior. To achieve that, v2 uses a standalone memcg_table
> for each cluster. It can be dropped or better optimized later if we
> have a better solution. The performance gain is partly cancelled
> compared to RFC v1 since we now need an extra allocation for free cluster
> isolation and peak memory usage is 2 bytes higher. But it still looks
> good. The table size is acceptable (1024 bytes), needs no RCU, and
> fits kmalloc. Even if we keep it as it is in the future,
> it's still acceptable.
> - Link to v1: https://lore.kernel.org/r/20260220-swap-table-p4-v1-0-104795d19815@xxxxxxxxxxx
>
> To: linux-mm@xxxxxxxxx
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Chris Li <chrisl@xxxxxxxxxx>
> Cc: Kairui Song <kasong@xxxxxxxxxxx>
> Cc: Kemeng Shi <shikemeng@xxxxxxxxxxxxxxx>
> Cc: Nhat Pham <nphamcs@xxxxxxxxx>
> Cc: Baoquan He <bhe@xxxxxxxxxx>
> Cc: Barry Song <baohua@xxxxxxxxxx>
> Cc: Youngjun Park <youngjun.park@xxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Yosry Ahmed <yosry@xxxxxxxxxx>
> Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx>
> Cc: David Hildenbrand <david@xxxxxxxxxx>
> Cc: Lorenzo Stoakes <ljs@xxxxxxxxxx>
> Cc: Zi Yan <ziy@xxxxxxxxxx>
> Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Cc: Dev Jain <dev.jain@xxxxxxx>
> Cc: Lance Yang <lance.yang@xxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> Cc: Muchun Song <muchun.song@xxxxxxxxx>
> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> Cc: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: cgroups@xxxxxxxxxxxxxxx
>
> ---
> Kairui Song (12):
> mm, swap: simplify swap cache allocation helper
> mm, swap: move common swap cache operations into standalone helpers
> mm/huge_memory: move THP gfp limit helper into header
> mm, swap: add support for stable large allocation in swap cache directly
> mm, swap: unify large folio allocation
> mm/memcg, swap: tidy up cgroup v1 memsw swap helpers
> mm, swap: support flexible batch freeing of slots in different memcgs
> mm, swap: delay and unify memcg lookup and charging for swapin
> mm, swap: consolidate cluster allocation helpers
> mm/memcg, swap: store cgroup id in cluster table directly
> mm/memcg: remove no longer used swap cgroup array
> mm, swap: merge zeromap into swap table
>
> MAINTAINERS | 1 -
> include/linux/huge_mm.h | 30 +++
> include/linux/memcontrol.h | 16 +-
> include/linux/swap.h | 19 +-
> include/linux/swap_cgroup.h | 47 ----
> mm/Makefile | 3 -
> mm/huge_memory.c | 2 +-
> mm/internal.h | 11 +-
> mm/memcontrol-v1.c | 66 +++---
> mm/memcontrol.c | 32 +--
> mm/memory.c | 88 ++------
> mm/page_io.c | 58 ++++-
> mm/shmem.c | 122 +++--------
> mm/swap.h | 91 +++-----
> mm/swap_cgroup.c | 172 ---------------
> mm/swap_state.c | 516 +++++++++++++++++++++++++-------------------
> mm/swap_table.h | 169 ++++++++++++---
> mm/swapfile.c | 212 +++++++++---------
> mm/vmscan.c | 2 +-
> mm/zswap.c | 25 +--
> 20 files changed, 783 insertions(+), 899 deletions(-)
> ---
> base-commit: f1541b40cd422d7e22273be9b7e9edfc9ea4f0d7
> change-id: 20260111-swap-table-p4-98ee92baa7c4
>
> Best regards,
> --
> Kairui Song <kasong@xxxxxxxxxxx>
>
>
I checked sashiko's review. It seems sashiko itself is buggy or
something went wrong: most patches end up with:
Tool error: Review tool timed out (active time exceeded)
The rest of the results are all false positives. Maybe I can add a few
more comments in the code or commit messages so it can understand the
code better.
As for the review of v2:
https://sashiko.dev/#/patchset/20260417-swap-table-p4-v2-0-17f5d1015428%40tencent.com
Those were mostly false positives, and I've already fixed the two real
but trivial issues. Things should be fine.