Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
From: Wenchao Hao
Date: Sun Apr 26 2026 - 00:14:02 EST
On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
Hi Nhat, Kairui, Barry, Xueyuan,
Thanks for the review. I agree with the direction and have some ideas for
an alternative approach.
My approach: first eliminate pool->lock from zs_free() itself, then defer
free to per-cpu buffers with a lockless handoff, and finally reduce
class->lock overhead during drain by exploiting natural class locality.
Achieving both per-cpu and per-class batching is difficult, so the
class->lock optimization is a compromise, but one that works well in practice.
1. Encode class_idx in obj to eliminate pool->lock
OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
(chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
for obj_idx, leaving 14 spare bits.
We can split OBJ_INDEX into class_idx + obj_idx:
obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
(8 bits for 4K pages, 9 for 64K).
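As a quick sanity check of that formula, here is a minimal user-space sketch. The helper names `ilog2_u()` and `obj_class_bits()` are made up for illustration, and the class counts used in the usage note (255 and 257) are assumptions about what ZS_SIZE_CLASSES evaluates to for 4K and 64K pages, not values taken from this mail.

```c
#include <assert.h>

/* Floor of log2(n) for n > 0; a user-space stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int bits = 0;

	assert(n > 0);
	while (n >>= 1)
		bits++;
	return bits;
}

/* Bits needed to store any class index in [0, nr_classes - 1]. */
static unsigned int obj_class_bits(unsigned int nr_classes)
{
	assert(nr_classes > 1);
	return ilog2_u(nr_classes - 1) + 1;
}
```

Under those assumptions, `obj_class_bits(255)` gives 8 and `obj_class_bits(257)` gives 9, which lines up with the 4K and 64K figures quoted above. Any class count from 129 to 256 yields 8 bits.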
Since class_idx is invariant across migration (only PFN changes), zs_free()
can extract class_idx locklessly, then acquire class->lock and re-read obj for a
stable PFN. No pool->lock needed.
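The layout can be modelled in user space roughly as follows. This is only a sketch: the widths (8 class bits, 16 index bits) are hypothetical placeholders, tag bits are ignored, and `obj_pack()`, `obj_to_idx()`, `obj_to_pfn()` and `obj_migrate()` are illustrative names rather than existing zsmalloc helpers. The point it shows is that replacing the PFN leaves class_idx untouched, which is what makes the lockless class lookup safe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical widths, for illustration only. */
#define OBJ_CLASS_BITS 8u
#define OBJ_IDX_BITS   16u
#define OBJ_CLASS_MASK ((UINT64_C(1) << OBJ_CLASS_BITS) - 1)
#define OBJ_IDX_MASK   ((UINT64_C(1) << OBJ_IDX_BITS) - 1)

/* obj: [PFN | class_idx | obj_idx] */
static uint64_t obj_pack(uint64_t pfn, unsigned int class_idx,
			 unsigned int obj_idx)
{
	assert(class_idx <= OBJ_CLASS_MASK);
	assert(obj_idx <= OBJ_IDX_MASK);
	return (pfn << (OBJ_CLASS_BITS + OBJ_IDX_BITS)) |
	       ((uint64_t)class_idx << OBJ_IDX_BITS) | obj_idx;
}

static unsigned int obj_to_class_idx(uint64_t obj)
{
	return (unsigned int)((obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK);
}

static unsigned int obj_to_idx(uint64_t obj)
{
	return (unsigned int)(obj & OBJ_IDX_MASK);
}

static uint64_t obj_to_pfn(uint64_t obj)
{
	return obj >> (OBJ_CLASS_BITS + OBJ_IDX_BITS);
}

/* Model of migration: only the PFN field is replaced. */
static uint64_t obj_migrate(uint64_t obj, uint64_t new_pfn)
{
	uint64_t low = obj & ((UINT64_C(1) << (OBJ_CLASS_BITS + OBJ_IDX_BITS)) - 1);

	return (new_pfn << (OBJ_CLASS_BITS + OBJ_IDX_BITS)) | low;
}
```

A reader racing with `obj_migrate()` may see either the old or the new PFN, but `obj_to_class_idx()` returns the same value in both cases.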
2. Per-cpu deferred free with lockless buffer swap
Defer zs_free() to per-cpu dynamically-allocated buffers (~2048 entries).
Enqueue: one array write plus a WRITE_ONCE under preempt_disable(), with
no lock and no atomic. When a buffer fills up, schedule the drain worker;
further frees on that CPU fall back to synchronous zs_free() until the
buffer has been swapped.
Drain: allocate a fresh buffer, swap it in, reset count. Since
the producer stops writing at count==SIZE, the handoff is
race-free without any lock.
Pseudo-code:
    /* enqueue - hot path */
    def = get_cpu_ptr(pool->deferred);
    if (def->count < SIZE) {
        def->handles[def->count] = handle;
        WRITE_ONCE(def->count, def->count + 1);
        if (def->count == SIZE)
            schedule_work(&pool->drain_work);
    } else {
        zs_free(pool, handle); /* fallback */
    }
    put_cpu_ptr(pool->deferred);
    /* drain - worker */
    for_each_possible_cpu(cpu) {
        def = per_cpu_ptr(pool->deferred, cpu);
        if (def->count < SIZE)
            continue;   /* producer may still be writing */
        new_buf = kvmalloc_array(SIZE, sizeof(long), GFP_KERNEL);
        if (!new_buf)
            continue;   /* leave the buffer full; producer keeps falling back */
        old_buf = def->handles;
        old_count = def->count;
        def->handles = new_buf;
        WRITE_ONCE(def->count, 0);
        /* now drain old_buf[0..old_count-1] */
        ...
        kvfree(old_buf);
    }
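To make the handoff rule concrete, here is a single-producer, single-consumer user-space model of it. It is a sketch under several assumptions: `struct deferred`, `deferred_init()`, `deferred_enqueue()` and `deferred_take()` are hypothetical names, DEF_SIZE is shrunk to 4, and C11 acquire/release atomics are used as a conservative stand-in for the READ_ONCE/WRITE_ONCE pairing (the pseudo-code above does not spell out ordering).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEF_SIZE 4 /* tiny on purpose; the proposal suggests ~2048 */

struct deferred {
	unsigned long *handles; /* only touched by whoever "owns" the buffer */
	atomic_size_t count;    /* count == DEF_SIZE hands ownership to the drainer */
};

static bool deferred_init(struct deferred *def)
{
	def->handles = calloc(DEF_SIZE, sizeof(*def->handles));
	atomic_init(&def->count, 0);
	return def->handles != NULL;
}

/* Producer: returns false when full, so the caller must free synchronously. */
static bool deferred_enqueue(struct deferred *def, unsigned long handle)
{
	size_t n = atomic_load_explicit(&def->count, memory_order_acquire);

	if (n >= DEF_SIZE)
		return false;
	def->handles[n] = handle;
	atomic_store_explicit(&def->count, n + 1, memory_order_release);
	return true;
}

/*
 * Consumer: if the buffer is full, install a fresh one and return the old
 * buffer (DEF_SIZE valid entries, caller frees it). Returns NULL otherwise.
 */
static unsigned long *deferred_take(struct deferred *def)
{
	unsigned long *fresh, *old;

	if (atomic_load_explicit(&def->count, memory_order_acquire) < DEF_SIZE)
		return NULL; /* producer still owns the buffer */
	fresh = calloc(DEF_SIZE, sizeof(*fresh));
	if (!fresh)
		return NULL; /* stay full; producer keeps falling back */
	old = def->handles;
	def->handles = fresh;
	atomic_store_explicit(&def->count, 0, memory_order_release);
	return old;
}
```

The producer never writes once count reaches DEF_SIZE, and the consumer never touches the buffer before that, so at any moment exactly one side owns `handles`.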
3. Consecutive-class batching during drain
The drain worker extracts class_idx from each handle locklessly, and holds
class->lock across consecutive same-class handles.
On the exit path, compressed sizes tend to cluster, so consecutive handles
often share the same class. That amortizes class->lock much like explicit
batching would, without having to sort the handles.
Pseudo-code:
    cur_cls = -1;
    for (i = 0; i < count; i++) {
        obj = handle_to_obj(handles[i]);
        cls = obj_to_class_idx(obj);
        if (cls != cur_cls) {
            if (cur_cls >= 0)
                spin_unlock(&pool->size_class[cur_cls]->lock);
            spin_lock(&pool->size_class[cls]->lock);
            cur_cls = cls;
        }
        __zs_free(pool, handles[i]); /* free under lock */
    }
    if (cur_cls >= 0)
        spin_unlock(&pool->size_class[cur_cls]->lock);
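How much this saves depends entirely on how clustered the handles are. A small user-space sketch (the helper `count_batched_locks()` is hypothetical and only counts lock acquisitions, it frees nothing) mirrors the loop above so the effect can be estimated on a recorded sequence of class indices:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of lock acquisitions when the lock is held across each run of
 * consecutive equal class indices. Per-handle locking would take n.
 */
static size_t count_batched_locks(const int *cls, size_t n)
{
	size_t locks = 0;
	int cur = -1; /* no lock held; class indices are assumed non-negative */

	for (size_t i = 0; i < n; i++) {
		assert(cls[i] >= 0);
		if (cls[i] != cur) {
			locks++; /* unlock the previous class (if any), lock the new one */
			cur = cls[i];
		}
	}
	return locks;
}
```

For the sequence {3, 3, 3, 5, 5, 3} this takes 3 acquisitions instead of 6. A strictly alternating sequence such as {1, 2, 1, 2} gains nothing, which is the worst case for this scheme.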
---
Benefits over current mainline:
- Removes pool->lock from zs_free() entirely
- Deferred-free enqueue costs one array write plus a WRITE_ONCE, with no
  lock and no atomic
- class->lock is amortized across runs of same-class handles instead of
  being acquired per handle
- Producer-consumer handoff is fully lockless
I've prototyped this on 64-bit and it works. I still need to sort out
32-bit compatibility and Kconfig gating. Does this direction look reasonable?
Thanks,
Wenchao