RE: [GIT PULL] mm: frontswap (for 3.2 window)

From: Dan Magenheimer
Date: Tue Nov 01 2011 - 17:01:46 EST


> From: Andrea Arcangeli [mailto:aarcange@xxxxxxxxxx]
> Sent: Tuesday, November 01, 2011 12:07 PM
> To: Dan Magenheimer
> Cc: Pekka Enberg; Cyclonus J; Sasha Levin; Christoph Hellwig; David Rientjes; Linus Torvalds; linux-
> mm@xxxxxxxxx; LKML; Andrew Morton; Konrad Wilk; Jeremy Fitzhardinge; Seth Jennings; ngupta@xxxxxxxxxx;
> Chris Mason; JBeulich@xxxxxxxxxx; Dave Hansen; Jonathan Corbet
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Pardon me for complaining about my typing fingers, but it seems
like you are making statements and asking questions as if you
are not reading the whole reply before you start responding
to the first parts. So it's going to be hard to answer each
sub-thread in order. So let me hit a couple of the high
points first.

> This basically proofs the API must be fixed....

Let me emphasize and repeat: Many of your comments
here are addressing zcache, which is a staging driver.
You are commenting on intra-zcache APIs only, _not_ on the
tmem ABI. I realize there is some potential confusion here
since the file in the zcache directory is called tmem.c;
but it is NOT defining or implementing the tmem ABI/API used
by the kernel. The ONLY kernel API that need be debated here
is the code in the frontswap patchset, which provides
registration for a set of function pointers (see the
struct frontswap_ops in frontswap.h in the patch) and
provides the function calls (API) between the frontswap
(and cleancache) "frontends" and the "backends" in
the driver directory. The zcache file "tmem.c" is simply
a very early attempt to tease out core operations and
data structures that are likely to be common to multiple
tmem users.

Everything in zcache (including tmem.c) is completely open
to evolution as needed (by KVM or other users) and this
will need to happen before zcache is promoted out of staging.
So your comments will be very useful when "we" work
on that promotion process.

So, I'm going to attempt to ignore the portions of your
reply that are commenting specifically about zcache coding
issues and reply to the parts that potentially affect
acceptance of the frontswap patchset, but if I miss anything
important to you, please let me know.

> About the rest of zcache I think it's interesting but because it works
> inside tmem I'm unsure how we're going to write it to disk.
>
> The local_irq_save would be nice to understand why it's needed for
> frontswap but not for pagecache.

It's because the already-merged cleancache hooks that call
cleancache_put are invoked from mm/vfs code where irqs are
already disabled. This is not true for the hook calling
frontswap_get, but since there's a lot of shared code,
I disabled irqs for frontswap_get also.

> All that VM code never runs from
> irqs, so it's hard to see how the irq disabling is relevant. A bit fat
> comment on why local_irq_save is needed in zcache code (in staging
> already) would be helpful. Maybe it's tmem that can run from irq? The
> only thing running from irqs is the tlb flush and I/O completion
> handlers, everything else in the VM isn't irq/softirq driven so we
> never have to clear irqs.

Other than the fact that cleancache_put is called with
irqs disabled (and, IIRC, sometimes cleancache_flush?)
and the coding complications that causes, you are correct.

Preemption does need to be disabled though and, IIRC,
in some cases, softirqs.

> My feeling is this zcache should be based on a memory pool abstraction
> that we can write to disk with a bio and working with "pages".

Possible I suppose. But unless you can teach bio to deal
with dynamically time-and-size varying devices, you are
not implementing the most important value of the tmem concept,
you are just re-implementing zram. And, as I said, Nitin
supports frontswap because it is better than zram for
exactly this (dynamicity) reason: https://lkml.org/lkml/2011/10/28/8

> I'm also not sure how you balance the pressure in the tmem pool, when
> you fail the allocation and swap to disk, or when you keep moving to
> compressed swap.

Just like all existing memory management code, zcache depends
on some heuristics, which can be improved as necessary over time.
Some of the metrics that feed into the heuristics are in
debugfs so they can be manipulated as zcache continue to
develop. (See zv_page_count_policy_percent for example...
yes this is still a bit primitive. And before you start up
about dynamic sizing, this is only a maximum.)

For Xen, there is a "memmax" for each guest, and Xen tmem disallows
a guest from using a page for swap (tmem calls it a "persistent" pool)
if it has reached its memmax. Thus unless a tmem-enabled guest
is "giving", it can never expect to "get".

For KVM, you can overcommit in the host, so you could choose a
different heuristic... if you are willing to accept host swapping
(which I think is evil :-)

> > This is a known problem: zcache is currently not very
> > good for high-response RT environments because it currently
> > compresses a page of data with interrupts disabled, which
> > takes (IIRC) about 20000 cycles. (I suspect though, without proof,
> > that this is not the worst irq-disabled path in the kernel.)
>
> That's certainly more than the irq latency so it's probably something
> the rt folks don't want and yes they should keep it in mind not to use
> frontswap+zcache in embedded RT environments.

Well, you have yet to convince me that an extra copy is
so damning, especially on a modern many-core CPU where it
can be done in 256 cycles and especially when the cache-pollution
for the copy is necessary for the subsequent compression anyway.

But for now, yes, don't turn on zcache in embedded RT.

> Besides there was no benchmark comparing zram performance to zcache
> performance so latency aside we miss a lot of info.

Think of zcache as zram PLUS dynamicity PLUS ability to dynamically
trade off memory utilization against compressed page cache.

> And what is the exact reason of the local_irq_save for doing it
> zerocopy?

(Answered above I think? If not, let me know.)

> Would I'd like is a mechanism where you:
>
> 1) add swapcache to zcache (with fallback to swap immediately if zcache
> allocation fails)

Current swap code pre-selects the swap device several layers
higher in the call chain, so this requires fairly major surgery
on the swap subsystem... and the long bug-tail that implies.

> 2) when some threshold is hit or zcache allocation fails, we write the
> compressed data in a compact way to swap (freeing zcache memory),
> or swapcache directly to swap if no zcache is present

Has efficient writing (and reading) of smaller-than-page chunks
through blkio every been implemented? I know compression can be
done "behind the curtain" of many I/O devices, but am unaware
that the same functionality exists in the kernel. If it doesn't
exist, this requires fairly major surgery on the blkio subsystem.
If it does exist, I doubt the swap subsystem is capable of using
it without major surgery.

> 3) newly added swapcache is added to zcache (old zcache was written to
> swap device compressed and freed)
>
> Once we already did the compression it's silly to write to disk the
> uncompressed data. Ok initially it's ok because compacting the stuff
> on disk is super tricky but we want a design that will allow writing
> the zcache to disk and add new swapcache to zcache, instead of the
> current way of swapping the new swapcache to disk uncompressed and not
> being able to writeout the compressed zcache.
>
> If nobody called zcache_get and uncompressed it, it means it's
> probably less likely to be used than the newly added swapcache that
> wants to be compressed.

Yeah, I agree that sounds like a cool high-level design for a
swap subsystem rewrite. Problem is it doesn't replace the dynamicity
to do what frontswap does for virtualization and multiple physical
machines (RAMster). Just not as flexible.

And do you really want to rewrite the swap subsystem anyway
when a handful of frontswap hooks do the same thing (and more)?

> I'm afraid adding frontswap in this form will still get stuck us in
> the wrong model and most of it will have to be dropped and rewritten
> to do just the above 3 points I described to do proper swap
> compression.

This is a red herring. I translate this as "your handful of hooks
might interfere with some major effort that I've barely begun to
design". And even if you DO code that major effort... the
frontswap hooks are almost trivial and clearly separated from
most of the core swap code... how do you know those hooks will
interfere with your grand plan anyway?

Do I have to quote Linus's statement from the KS2011 minutes
again? :-)

> The final swap design must also include the pre-swapout from Avi by
> writing data to swapcache in advance and relaying on the dirty bit to
> rewrite it. And the pre-swapin as well (original idea from Con). The
> pre-swapout would need to stop before compressing. The pre-swapin
> should stop before decompressing.

IIUC, you're talking about improvements to host-swapping here.
That is (IMHO) putting lipstick on a pig. And, in any case, you
are talking about significant swap subsystm changes that only help
a single user, KVM. You seem to be already measuring non-existent
KVM patches by a different/easier standard than you are applying
to a simple frontswap patchset that's been public for nearly
three years.

> I mean I see an huge potential for improvement in the swap space, just
> I guess most are busy with more pressing issues, like James said most
> data centers don't use swap, desktop is irrelevant and android (as
> relevant as data center) don't use swap.

Yep. I agree that it is unlikely to get done. But James' data
centers are running cgroups, not Xen, not KVM. And there is
a solution proposed that exists today for Xen, and that
KVM can at least attempt if not heavily leverage.

> But your improvement to frontswap don't look the right direction if
> you really want to improve swap for the long term. It may be better
> than nothing but I don't see it going the way it should go and I
> prefer to remove the tmem dependency on zcache all together. Zcache
> alone would be way more interesting.

There is no tmem dependency on zcache. Feel free to rewrite
zcache entirely. It still needs the hooks in the frontswap
patch, or something at least very similar.

> And tmem_put must be fixed to take a page, that cast to char * of a
> page, to avoid crashing on highmem is not allowed.
>
> Of course I didn't have the time to read 100% of the code so please
> correct me again if I misunderstood something.

Then feel free to rewrite that code.. or wait until it gets
fixed. I agree that it's unlikely that zcache will be promoted
out of staging with that hack. That's all still unrelated to
merging frontswap.
>
> > This is the "fix highmem" bug fix from Seth Jennings. The file
> > tmem.c in zcache is an attempt to separate out the core tmem
> > functionality and data structures so that it can (eventually)
> > be in the lib/ directory and be used by multiple backends.
> > (RAMster uses tmem.c unchanged.) The code in tmem.c reflects
> > my "highmem-blindness" in that a single pointer is assumed to
> > be able to address the "PAMPD" (as opposed to a struct page *
> > and an offset, necessary for a 32-bit highmem system). Seth
> > cleverly discovered this ugly two-line fix that (at least for now)
> > avoided major mods to tmem.c.
>
> Well you need to do the major mods, it's not ok to do that cast,
> passing pages is correct instead. Let's fix the tmem_put API before
> people can use it wrong. Maybe then I'll dislike passing through tmem
> less? Dunno.

Zcache doesn't need to pass through tmem.c. RAMster is using tmem.c
but isn't even in staging yet.

> The whole logic deciding the size of the frontswap zcache is going to
> be messy.

It's not messy, and is entirely dynamic. Finding the ideal
heuristics for the maximum size, and when and how much to
decompress pages back out of zcache back into the swap cache,
I agree, is messy and will take some time.

Still not sure how this is related to the proposed frontswap
patch now (which just provides some mechanism for the heuristics
to drive).

> But to do the real swapout you should not pull the memory
> out of frontswap zcache, you should write it to disk compacted and
> compressed compared to how it was inserted in frontswap... That would
> be the ideal.

Agreed, that would be cool... and very difficult to implement.

> > The selfballooning code in drivers/xen calls frontswap_shrink
> > to pull swap pages out of the Xen hypervisor when memory pressure
> > is reduced. Frontswap_shrink is not yet called from zcache.
>
> So I wonder how zcache is dealing with the dynamic size. Or has it a
> fixed size? How do you pull pages out of zcache to max out the real
> RAM availability?

Dynamic. Pulled out with frontswap_shrink, see above.

> > Note, however, that unlike swap-disks, compressed pages in
> > frontswap CAN be silently moved to another "device". This is
> > the foundation of RAMster, which moves those compressed pages
> > to the RAM of another machine. The device _could_ be some
> > special type of real-swap-disk, I suppose.
>
> Yeah you can do ramster with frontswap+zcache but not writing the
> zcache to disk into the swap device. Writing to disk doesn't require
> new allocations. Migrating to other node does. And you must deal with
> OOM conditions there. Or it'll deadlock. So the basic should be to
> write compressed data to disk (which at least can be done reliably for
> swapcache, unlike ramster which has the same issues of nfs swapping
> and nbd swapping and iscsi sapping) before wondering if to send it to
> another node.

I guess you are missing the key magic for RAMster, or really
for tmem. Because everything in tmem is entirely dynamic (e.g.
any attempt to put a page can be rejected), the "remote" machine
has complete control over how many pages to accept from whom,
and can manage its own needs as higher priority. Think of
a machine in RAMster as a KVM/Xen "host" for a bunch of
virtual-machines-that-are-really-physical-machines. And it
is all peer-to-peer, so each machine can act as a host when
necessary. None of this is possible through anything that
exists today in the swap subsystem or blkio subsystem.
And RAMster runs on the same cleancache and frontswap hooks
as Xen and zcache and, potentially, KVM.

Yeah, the heuristics may be even harder for RAMster. But
the first response to this thread (from Christoph) said
that this stuff isn't sexy. Personally I can't think of
anything sexier than the first CROSS-MACHINE memory management
subsystem in a mainstream OS. Again... NO additional core
VM changes.

> > Yes, this is a good example of the most important feature of
> > tmem/frontswap: Every frontswap_put can be rejected for whatever reason
> > the tmem backend chooses, entirely dynamically. Not only is it true
> > that hardware can't handle this well, but the Linux block I/O subsystem
> > can't handle it either. I've suggested in the frontswap documentation
> > that this is also a key to allowing "mixed RAM + phase-change RAM"
> > systems to be useful.
>
> Yes what is not clear is how the size of the zcache is choosen.

Is that answered clearly now?

> > Also I think this is also why many linux vm/vfs/fs/bio developers
> > "don't like it much" (where "it" is cleancache or frontswap).
> > They are not used to losing control of data to some other
> > non-kernel-controlled entity and not used to being told "NO"
> > when they are trying to move data somewhere. IOW, they are
> > control freaks and tmem is out of their control so it must
> > be defeated ;-)
>
> Either tmem works on something that is a core MM structure and is
> compatible with all bios and operations we can want to do on memory,
> or I've an hard time to think it's a good thing in trying to make the
> memory it handles not-kernel-controlled.
>
> This non-kernel-controlled approach to me looks like exactly a
> requirement coming from Xen, not really something useful.

C'mon Andrea. You're an extremely creative guy and you are
disappointing me.

Think RAMster. Think a version of RAMster with a "memory server"
(where the RAM expandability is in one server in a rack). Think
fast SSDs that can be attached to one machine and shared by other
machines. Think phase-change (or other future limited-write-cycle)
RAM without a separate processor counting how many times a cell
has been written. This WAS all about Xen a year or two ago.
I haven't written a line of Xen in over a year because I am
excited about the FULL value of tmem.

> There is no reason why a kernel abstraction should stay away from
> using kernel data structures like "struct page" just to cast it back
> from char * to struct page * when it needs to handle highmem in
> zcache. Something seriously wrong is going on there in API terms so
> you can start by fixing that bit.

Yep, let's fix that problem in zcache. That is a stupid
coding error by me and irrelevant to frontswap and the bigger
transcendent memory picture.

> > I hope the earlier explanation about frontswap_shrink helps.
> > It's also good to note that the only other successful Linux
> > implementation of swap compression is zram, and zram's
> > creator fully supports frontswap (https://lkml.org/lkml/2011/10/28/8)
> >
> > So where are we now? Are you now supportive of merging
> > frontswap? If not, can you suggest any concrete steps
> > that will gain your support?
>
> My problem is this is like zram, like mentioned it only solves the
> compression. There is no way it can store the compressed data on
> disk. And this is way more complex than zram, and it only makes the
> pooling size not fixed at swapon time... so very very small gain and
> huge complexity added (again compared to zram). zram in fact required
> absolutely zero changes to the VM. So it's hard to see how this is
> overall better than zram. If we deal with that amount of complexity we
> should at least be a little better than zram at runtime, while this is
> same.

Zram required exactly ONE change to the VM, and Nitin placed it
there AFTER he looked at how frontswap worked. Then he was forced
down the "gotta do it as a device" path which lost a lot of the
value. Then, when he wanted to do compression on page cache, he
found that the cleancache interface was perfect for it. Why
does everyone keep telling me to "do it like zram" when the author
of zram has seen the light? Did I mention Nitin's support for
frontswap already? https://lkml.org/lkml/2011/10/28/8

So, I repeat, where are we now? Have I sufficiently answered
your concerns and questions? Or are you going to go start
coding to prove me wrong with a swap subsystem rewrite? :-)

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/