Re: [PATCH v5 00/21] Virtual Swap Space

From: Yosry Ahmed

Date: Thu Apr 23 2026 - 16:52:19 EST

> > > Yes, this absolutely works. In fact, I previously posted a working RFC
> > > based on this idea. In that series, clusters are dynamically
> > > allocated, allowing the swap space to be dynamically sized
> > > (essentially infinite) while reusing all the existing infrastructure:
> > > https://lore.kernel.org/all/20260220-swap-table-p4-v1-0-104795d19815@xxxxxxxxxxx/
> >
> > There are a few aspects that I don't agree with in this RFC, and I think
> > Nhat and Johannes raised most of them. Mostly that I don't want to
> > expose ghost swapfiles or similar to userspace.
> >
> > I think userspace's view of swapfiles should remain the same and reflect
> > the physical swap slots. The virtual swap layer should be completely
> > transparent in this case. Userspace shouldn't need to configure it in
> > any way.
>
> That approach is definitely doable. For example, with that RFC we
> could simply drop the interface I introduced and enable it via a
> different knob, and that would be very close to it. :)
>
> Using a swapfile to represent the virtual layer externally just made
> it more flexible.

I think it makes it less flexible, to be honest. Once it's exposed to
userspace, there's little we can change about it, and userspace needs
to set it up.

> I agree that the RFC design was a bit confusing and
> could be improved. There is no technical difficulty in hiding it from
> userspace; it's mostly a design choice. And even if we don't use a
> swapfile to represent it internally, all the other infrastructure can
> still be reused without much modification.

Yeah, that's what I was getting at. It doesn't even need to be a
swapfile in the kernel. At the very least, it should be named
differently to avoid confusion with actual swapfiles.

> Using a swapfile does have its benefits, though. For example, the
> virtual layer could act as an ordinary tier following YoungJun's
> design:
> https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@xxxxxxx/

Hmm, I didn't look too closely at this, but I don't understand how
making it a swapfile helps with tiering. If anything, I think it makes
tiering more difficult. For tiering to work, we need an
abstraction/redirection layer, such that we don't need to update the
page tables (or shmem pagecache) if we demote/promote pages. That is
exactly the use case for a virtual swap layer. The page tables point
at a virtual swap ID and the backend could change transparently (e.g.
for zswap writeback, or tiering).
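
To make the indirection concrete, here is a rough userspace sketch. It
is not code from the series; every name in it (vswap_desc, vswap_move,
the backend enum, the fixed-size table) is hypothetical and only there
to illustrate the idea. The page table side holds nothing but the
virtual ID, and moving the data between backends only rewrites the
descriptor:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical backend kinds; not the names used in the series. */
enum vswap_backend { VSWAP_NONE, VSWAP_ZSWAP, VSWAP_SLOT };

/* One descriptor per virtual swap ID (illustrative layout only). */
struct vswap_desc {
	enum vswap_backend backend;
	uint64_t handle;	/* zswap handle or physical slot offset */
};

/* What a PTE or shmem page cache entry would store: just the ID. */
typedef uint32_t vswap_id_t;

#define NR_VSWAP 16
static struct vswap_desc vswap_table[NR_VSWAP];

/* Record where the data for this virtual ID currently lives. */
static void vswap_store(vswap_id_t id, enum vswap_backend b, uint64_t handle)
{
	assert(id < NR_VSWAP);
	vswap_table[id].backend = b;
	vswap_table[id].handle = handle;
}

/*
 * Writeback, demotion or promotion: the descriptor is rewritten, but
 * the ID (and therefore whatever the page tables hold) is untouched.
 */
static void vswap_move(vswap_id_t id, enum vswap_backend b, uint64_t handle)
{
	vswap_store(id, b, handle);
}

/* Swapin path: resolve the virtual ID to its current backend. */
static struct vswap_desc vswap_lookup(vswap_id_t id)
{
	assert(id < NR_VSWAP);
	return vswap_table[id];
}
```

In this sketch, a zswap writeback is just vswap_move(id, VSWAP_SLOT,
offset). If the virtual layer were itself a swapfile, the value stored
in the page tables would have to encode that swapfile, which is the
problem raised in the next question.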

If we make the virtual layer a swapfile, how do we demote/promote
without updating page tables?

IOW, I think the whole reason we want a virtual layer is to separate
the backends, which would facilitate tiering. If the virtual layer is
itself a swapfile, wouldn't it become one of the tiers?

> It also means we wouldn't need to introduce things like a new,
> virtual-specific swapoff mechanism.

We don't *need* to introduce this, at least not initially. Only if we
have a good use case for it.

> > In an ideal world, the only noticeable change from userspace is that
> > with zswap, compressed pages would stop using slots in the swapfile and
> > charging the memcg for them -- and that zswap would work even without a
> > swapfile, by just enabling it. This is admittedly a user-visible
> > behavioral change, but I am hoping that's a good one that we can live
> > with.
>
> Totally agree with the ideal end goal for zswap. Just not sure if
> that's the right place to start for this usage; zswap doesn't always
> apply. For instance, we have SSDs with built-in compression,
> software-based storage stacks with built-in compression and
> deduplication, swap over RDMA, and, most notably, ZRAM users. They
> don't necessarily need zswap or a virtual layer, and the upper layer
> should be as simple as possible.

Right, it's not necessarily zswap at all. As I mentioned above, the
same logic applies for swap tiering. You can actually consider zswap
one of the tiers (more-or-less). If you have one swapfile (or one
tier) like the ones you mention above, the virtual layer just always
points to a single backend (e.g. the slot in the swapfile). There
might be some additional overhead, but I think it would be minimal if
we use the cluster-based approach you have been pushing to eliminate
static overhead and make it all dynamic based on actual usage.

At a high level, if we have a single tier/swapfile, I think the only
additional overhead would be the reverse mapping from the swap slot to
the virtual swap layer, which would be 8 bytes or so for every swapped
out entry, right?
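
For a sense of scale, the arithmetic can be written down as a small
helper. The 8-byte entry size is the estimate above; the 4 KiB page
size and the function name are assumptions made for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of reverse-map metadata needed when
 * `swapped_bytes` of memory is swapped out, assuming one reverse-map
 * entry of `entry_size` bytes per swapped-out page.
 */
static uint64_t rmap_overhead_bytes(uint64_t swapped_bytes,
				    uint64_t page_size,
				    uint64_t entry_size)
{
	assert(page_size != 0);
	return (swapped_bytes / page_size) * entry_size;
}
```

Under those assumptions, 1 GiB of swapped-out memory is 262144 entries,
so rmap_overhead_bytes(1ULL << 30, 4096, 8) comes to 2 MiB, or roughly
0.2% of the swapped-out size.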

I think this was discussed before, but I still wonder if we really need
a reverse mapping. If it's only to optimize swapoff, then I don't think
it's a requirement. We can still scan the virtual swap layer to look
for slots to swap in. It would still be better than scanning the page
tables as we do today. But I think there were other use cases for the
reverse mapping; I just forgot what they were.
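
A rough sketch of what swapoff without a reverse mapping could look
like: walk the virtual table and collect the IDs currently backed by
the device going away. This is illustrative only; the entry layout,
the device index and the function name are all hypothetical and not
taken from the series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VSWAP_NO_DEV (-1)	/* hypothetical: entry not on any swapfile */

/* Illustrative virtual swap entry; not the layout used in the series. */
struct vswap_entry {
	int in_use;
	int dev;		/* backing swap device index, or VSWAP_NO_DEV */
	uint64_t offset;	/* slot offset on that device */
};

/*
 * Scan the virtual table for entries backed by `dev`. Stores up to
 * `cap` matching virtual IDs in `ids` and returns the total number of
 * matches, which may be larger than `cap`.
 */
static size_t vswap_scan_for_dev(const struct vswap_entry *tbl, size_t n,
				 int dev, size_t *ids, size_t cap)
{
	size_t found = 0;

	assert(tbl != NULL && ids != NULL);
	for (size_t id = 0; id < n; id++) {
		if (!tbl[id].in_use || tbl[id].dev != dev)
			continue;
		if (found < cap)
			ids[found] = id;
		found++;
	}
	return found;
}
```

The cost of this scan grows with the number of virtual entries rather
than with the size of every process's page tables, which is the
trade-off described above.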