Re: [RFC PATCH 2/2] mm: swap: apply per cgroup swap priority mechansim on swap layer
From: Kairui Song
Date: Thu Jun 12 2025 - 14:20:38 EST
On Fri, Jun 13, 2025 at 1:28 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Thu, Jun 12, 2025 at 4:14 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 12, 2025 at 6:43 PM <youngjun.park@xxxxxxx> wrote:
> > >
> > > From: "youngjun.park" <youngjun.park@xxxxxxx>
> > >
> >
> > Hi, Youngjun,
> >
> > Thanks for sharing this series.
> >
> > > This patch implements swap device selection and swap on/off propagation
> > > when a cgroup-specific swap priority is set.
> > >
> > > There is one workaround to this implementation as follows.
> > > Current per-cpu swap cluster enforces swap device selection based solely
> > > on CPU locality, overriding the swap cgroup's configured priorities.
> >
> > I've been thinking about this, we can switch to a per-cgroup-per-cpu
> > next cluster selector, the problem with current code is that swap
>
> What about per-cpu-per-order-per-swap-device :-? Number of swap
> devices is gonna be smaller than number of cgroups, right?

Hi Nhat,

The thing is, per-cgroup makes more sense. (Cgroup-level locality was
suggested to me on the mailing list at the very beginning of the
allocator implementation, but it was hard to do at that time.) In
container environments, a cgroup is a container that runs one type of
workload, so it has its own locality. Things like systemd also
organize different desktop workloads into cgroups. The whole point is
about cgroups.

There could be a lot of cgroups indeed, but not every one of them is
going to enable a cgroup-level swap configuration. Youngjun used a
pointer in mem_cgroup, so cgroups that leave it disabled have no
overhead.
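
To illustrate the pointer approach, here is a minimal userspace sketch,
not the code from the patch. All names (memcg_sketch, cg_swap_prio,
memcg_set_swap_prio, the fixed device count and the priority values) are
made up for the example. A cgroup that never sets a priority keeps a
NULL pointer and falls back to the global priorities.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NR_SWAP_DEVICES 3	/* hypothetical, fixed for the sketch */

/* Global (system-wide) priorities, indexed by device. Example values. */
static const int global_prio[NR_SWAP_DEVICES] = { 10, 5, 1 };

/* Hypothetical per-cgroup override table; exists only once enabled. */
struct cg_swap_prio {
	int prio[NR_SWAP_DEVICES];
};

/* Stand-in for struct mem_cgroup: a single pointer when disabled. */
struct memcg_sketch {
	struct cg_swap_prio *swap_prio;	/* NULL: use global priorities */
};

/* Allocate the table on first use, seeded from the global priorities. */
static int memcg_set_swap_prio(struct memcg_sketch *cg, int dev, int prio)
{
	assert(dev >= 0 && dev < NR_SWAP_DEVICES);
	if (!cg->swap_prio) {
		cg->swap_prio = malloc(sizeof(*cg->swap_prio));
		if (!cg->swap_prio)
			return -1;
		memcpy(cg->swap_prio->prio, global_prio, sizeof(global_prio));
	}
	cg->swap_prio->prio[dev] = prio;
	return 0;
}

/* Effective priority of a device as seen by this cgroup. */
static int memcg_swap_prio(const struct memcg_sketch *cg, int dev)
{
	assert(dev >= 0 && dev < NR_SWAP_DEVICES);
	return cg->swap_prio ? cg->swap_prio->prio[dev] : global_prio[dev];
}
```

The lookup path for a disabled cgroup is just one NULL check, which is
the property that matters here.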

We had a per-device-per-cpu-per-order table previously (before
1b7e90020eb77). It works. The only minor problem is that allocation
has to iterate the plist first and then use si->percpu. And there are
usually only a few swap devices, so it is much less flexible than
per-cgroup.

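
Roughly, that two-step lookup has the following shape. This is a
simplified userspace sketch, not the old kernel code: swap_dev_sketch,
pick_cluster and the array sizes are hypothetical, and a
priority-sorted array stands in for the plist.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS    2	/* hypothetical sizes for the sketch */
#define NR_ORDERS  3
#define NO_CLUSTER (-1)

/* Hypothetical stand-in for a swap device. */
struct swap_dev_sketch {
	int prio;				/* higher is tried first */
	int free_slots;				/* free slots left on the device */
	int next_cluster[NR_CPUS][NR_ORDERS];	/* per-CPU, per-order hint */
};

struct pick {
	int dev;	/* index into devs[], or -1 if nothing fits */
	int cluster;	/* cached cluster hint, or NO_CLUSTER */
};

/*
 * devs[] is assumed to be sorted by descending priority, which is the
 * order a plist walk would give. Step 1: walk the devices. Step 2:
 * index the chosen device's per-CPU, per-order table.
 */
static struct pick pick_cluster(const struct swap_dev_sketch *devs, int nr,
				int cpu, int order)
{
	struct pick p = { -1, NO_CLUSTER };

	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(order >= 0 && order < NR_ORDERS);

	for (int i = 0; i < nr; i++) {
		/* Skip devices that cannot hold 2^order slots. */
		if (devs[i].free_slots < (1 << order))
			continue;
		p.dev = i;
		p.cluster = devs[i].next_cluster[cpu][order];
		break;
	}
	return p;
}
```

The device walk comes before any locality information is used, and the
table is keyed by device only, so it cannot express a per-cgroup
preference.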
>
> At swap slot allocation time, we check the folio's swap device
> priority list, then pump that all the way to the swap allocator.
>
> swap allocator, given a priority list, for each priority level, try to
> allocate from that level first. It will get a cluster (either locally
> cached or a new one) from swap devices in that priority level, before
> moving on to the next priority level.
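
For reference, the flow described above could be sketched in userspace
C roughly as follows. This is only an illustration under assumed names:
dev_sketch, prio_level, dev_get_cluster, alloc_by_priority and
SLOTS_PER_CLUSTER are all hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS_PER_LEVEL 4	/* hypothetical sizes for the sketch */
#define SLOTS_PER_CLUSTER  512
#define NO_CLUSTER         (-1)

/* Hypothetical device state: one cached cluster plus unused clusters. */
struct dev_sketch {
	int cached_cluster;	/* NO_CLUSTER if nothing is cached */
	int cached_free;	/* free slots left in the cached cluster */
	int next_new_cluster;	/* id of the next unused cluster */
	int nr_new_clusters;	/* how many unused clusters remain */
};

/* One priority level: the devices that share that priority. */
struct prio_level {
	int prio;
	int nr_devs;
	struct dev_sketch *devs[MAX_DEVS_PER_LEVEL];
};

/* Take one slot: try the cached cluster first, then a fresh cluster. */
static int dev_get_cluster(struct dev_sketch *d)
{
	if (d->cached_cluster != NO_CLUSTER && d->cached_free > 0) {
		d->cached_free--;
		return d->cached_cluster;
	}
	if (d->nr_new_clusters > 0) {
		d->nr_new_clusters--;
		d->cached_cluster = d->next_new_cluster++;
		d->cached_free = SLOTS_PER_CLUSTER - 1;
		return d->cached_cluster;
	}
	return NO_CLUSTER;
}

/*
 * levels[] is assumed sorted from highest to lowest priority. A lower
 * level is consulted only after every device on the levels above it
 * has failed. Returns the cluster id and stores the device in *out.
 */
static int alloc_by_priority(struct prio_level *levels, int nr_levels,
			     struct dev_sketch **out)
{
	for (int l = 0; l < nr_levels; l++) {
		for (int i = 0; i < levels[l].nr_devs; i++) {
			int c = dev_get_cluster(levels[l].devs[i]);

			if (c != NO_CLUSTER) {
				*out = levels[l].devs[i];
				return c;
			}
		}
	}
	*out = NULL;
	return NO_CLUSTER;
}
```

In this sketch the cached cluster lives in the device. Where that cache
should live (per device, per CPU, or per cgroup) is the question under
discussion, and the sketch does not settle it.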