Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview

From: Avi Kivity
Date: Fri Apr 23 2010 - 10:54:18 EST


On 04/23/2010 05:43 PM, Dan Magenheimer wrote:

>> Perhaps I misunderstood. Isn't frontswap in front of the normal swap
>> device? So we do have double swapping, first to frontswap (which is in
>> memory, yes, but still a nonzero cost), then the normal swap device.
>> The io subsystem is loaded with writes; you only save the reads.
>> Better to swap to the hypervisor, and make it responsible for
>> committing to disk on overcommit or keeping in RAM when memory is
>> available. This way we avoid the write to disk if memory is in fact
>> available (or at least defer it until later). This way you avoid both
>> reads and writes if memory is available.
>
> Each page is either in frontswap OR on the normal swap device,
> never both. So, yes, both reads and writes are avoided if memory
> is available and there is no write issued to the io subsystem if
> memory is available. The is_memory_available decision is determined
> by the hypervisor dynamically for each page when the guest attempts
> a "frontswap_put". So, yes, you are indeed "swapping to the
> hypervisor" but, at least in the case of Xen, the hypervisor
> never swaps any memory to disk so there is never double swapping.
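
A minimal sketch of the exclusive placement described above: each swapped-out page goes either to frontswap or to the swap device, never both, and the swap device sees a write only when the hypervisor rejects the put. The names toy_hypervisor, toy_frontswap_put and swap_out_page are hypothetical, and the free-page counter merely stands in for the hypervisor's dynamic per-page decision; this is not the API from the frontswap patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Where a swapped-out page ended up: exactly one of the two. */
enum page_location { IN_FRONTSWAP, ON_SWAP_DEVICE };

/* Hypothetical hypervisor model: it accepts a put only while it still
 * has free pages. The real decision is made by the hypervisor per page. */
struct toy_hypervisor {
    size_t free_pages;
};

/* Counts writes to the guest's swap device (the io subsystem). */
static size_t disk_writes;

static bool toy_frontswap_put(struct toy_hypervisor *hv)
{
    if (hv->free_pages == 0)
        return false;           /* rejected: guest must use its swap device */
    hv->free_pages--;
    return true;                /* accepted: page now lives in hypervisor RAM */
}

static void toy_swap_device_write(void)
{
    disk_writes++;
}

/* Swap out one page: try frontswap first and fall back to the swap
 * device only if the put is rejected, so a page is never written twice. */
static enum page_location swap_out_page(struct toy_hypervisor *hv)
{
    if (toy_frontswap_put(hv))
        return IN_FRONTSWAP;
    toy_swap_device_write();
    return ON_SWAP_DEVICE;
}
```

With two free pages in the toy hypervisor, the first two pages land in frontswap with no disk write, and only the third reaches the swap device.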

I see. So why not implement this as an ordinary swap device, with a higher priority than the disk device? This way we reuse an API and keep things asynchronous, instead of introducing a special-purpose API.
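
The ordinary-swap-device alternative relies on swap priorities: Linux fills the highest-priority swap area first and falls back to lower-priority areas once it is full (priorities can be set with, for example, `swapon -p`). The following is a hypothetical userspace model of that selection, not kernel code. The area names and slot counts are invented for illustration, and ties are broken by lowest index here, whereas the kernel round-robins among areas of equal priority.

```c
#include <stddef.h>

/* Toy model of a swap area: a priority and a number of free slots. */
struct swap_area {
    const char *name;   /* illustrative label only */
    int priority;       /* higher value is used first */
    size_t free_slots;
};

/* Return the index of the area the next page goes to, or -1 if all are
 * full. Picks the highest-priority area that still has room and consumes
 * one slot from it; equal priorities resolve to the lowest index. */
static int pick_swap_area(struct swap_area *areas, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (areas[i].free_slots == 0)
            continue;
        if (best < 0 || areas[i].priority > areas[best].priority)
            best = (int)i;
    }
    if (best >= 0)
        areas[best].free_slots--;
    return best;
}
```

For example, with a "hypervisor" area at priority 10 holding one slot and a "disk" area at priority -1, the first page goes to the hypervisor area and the second falls through to disk.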

Doesn't this commit the hypervisor to retain this memory? If so, isn't it simpler to give the page to the guest (so now it doesn't need to swap at all)?

What about live migration? Do you live migrate frontswap pages?

>>> If I understand correctly, SSDs work much more efficiently when
>>> writing 64KB blocks. So much more efficiently in fact that waiting
>>> to collect 16 4KB pages (by first copying them to fill a 64KB buffer)
>>> will be faster than page-at-a-time DMA'ing them. If so, the
>>> frontswap interface, backed by an asynchronous "buffering layer"
>>> which collects 16 pages before writing to the SSD, may work
>>> very nicely. Again this is still just speculation... I was
>>> only pointing out that zero-copy DMA may not always be the best
>>> solution.
>>
>> The guest can easily (and should) issue 64k dmas using scatter/gather.
>> No need for copying.
>
> In many cases, this is true. For the swap subsystem, it may not always
> be true, though I see recent signs that it may be headed in that
> direction.
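
The "buffering layer" idea quoted above can be sketched as a staging buffer that copies pages in and issues one 64 KB device write per 16 pages. This is a hypothetical illustration: batch_buffer, batch_put_page and batch_flush are invented names, and the device write is replaced by a counter.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES  4096
#define BATCH_PAGES 16                          /* 16 * 4 KB = 64 KB */
#define BATCH_BYTES (PAGE_BYTES * BATCH_PAGES)

/* Hypothetical staging buffer: pages are copied in and written to the
 * device only once 16 of them have accumulated. */
struct batch_buffer {
    unsigned char data[BATCH_BYTES];
    size_t npages;          /* pages currently staged */
    size_t device_writes;   /* 64 KB writes issued so far */
};

/* Stand-in for the device write; a real layer would submit I/O of
 * npages * PAGE_BYTES bytes from data here. Also usable to push out a
 * partial batch. */
static void batch_flush(struct batch_buffer *b)
{
    if (b->npages == 0)
        return;
    b->device_writes++;
    b->npages = 0;
}

/* Copy one page into the staging buffer, flushing when it is full. */
static void batch_put_page(struct batch_buffer *b, const unsigned char *page)
{
    memcpy(b->data + b->npages * PAGE_BYTES, page, PAGE_BYTES);
    if (++b->npages == BATCH_PAGES)
        batch_flush(b);
}
```

The memcpy into the staging buffer is exactly the copy that scatter/gather avoids; whether batching this way pays off depends on the device.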

I think it will be true in an overwhelming number of cases. Flash is new enough that most devices support scatter/gather.
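
The scatter/gather alternative builds the same 64 KB request without copying, by handing the device a list of page addresses. In userspace the closest analogue is POSIX writev(), which gathers several separate buffers into one write; in the kernel the equivalent would be a single block request carrying multiple page segments. The sketch below uses writev() only to illustrate the idea, and write_pages_gathered is a hypothetical helper.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define PAGE_BYTES  4096
#define BATCH_PAGES 16                  /* 16 * 4 KB = 64 KB per request */

/* Write up to BATCH_PAGES separate page buffers with one writev() call.
 * The iovec array records only where each page lives; no page data is
 * copied into a staging buffer. Returns bytes written, or -1 on error
 * (including when more than BATCH_PAGES pages are passed). */
static ssize_t write_pages_gathered(int fd, unsigned char *pages[], int npages)
{
    struct iovec iov[BATCH_PAGES];

    if (npages < 0 || npages > BATCH_PAGES)
        return -1;
    for (int i = 0; i < npages; i++) {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = PAGE_BYTES;
    }
    return writev(fd, iov, npages);
}
```

The page buffers need not be contiguous in memory; only the iovec array describes them, which is why no copy is needed.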

> In any case, unless you see this SSD discussion as
> critical to the proposed acceptance of the frontswap patchset,
> let's table it until there's some prototyping done.

It isn't particularly related.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.