Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

From: David Hildenbrand
Date: Mon Feb 22 2021 - 10:33:03 EST


On 22.02.21 15:02, Michal Hocko wrote:
On Mon 22-02-21 14:22:37, David Hildenbrand wrote:
Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we
want.

OK, then I must have misread your requirements. Maybe I just got lost in
all the combinations you have listed.

Another special case could be dax/pmem I think. You might want to fault it
in readable/writable but not perform an actual read/write unless really
required.

QEMU phrases this as "don't cause wear on the storage backing".

Sorry for being dense here but I still do not follow. If you do not want
to read then what do you want to populate from? Only map if it is in the

In the context of VMs it's usually rather a mean to preallocate backend storage - which would also happen on read access. See below on case 4).

page cache?

Let's try to untangle my thoughts regarding VMs. We could have as backend storage for our VM:

1) Anonymous memory
2) hugetlbfs (private/shared)
3) tmpfs/shmem (private/shared)
4) Ordinary files (shared)
5) DAX/PMEM (shared)

Excluding special cases (hypervisor upgrades with 2) and 3) ), we expect to have pre-existing content in files only in 4) and 5). 4) and 5) might be used as NVDIMM backend for a guest, or as DIMM backend.

The first access of our VM to memory could be
a) Write: the usual case when exposed as RAM/DIMM to out guest.
b) Read: possible case when exposed as an NVDIMM to our guest (we don't
know). But eventually, we might write to (parts of) NVDIMMs later.

We "preallocate"/"populate" memory of our VM so that
- We know we have sufficient backend storage (esp. hugetlbfs, shmem,
files) - so we don't randomly crash the VM. My most important use
case.
- We avoid page faults (including page zeroing!) at runtime. Especially
relevant for RT workloads.

With 1), 2), and 3) we want to have pages faulted in writable - we expect that our guest will write to that memory. MADV_POPULATE would do that only for 1), and MAP_PRIVATE of 2). For the shared parts, we would want MADV_POPULATE_WRITE semantics.

With 5), we already had complaints that preallcoation in QEMU takes a long time - because we end up actually reading/writing slow PMEM (libvirt now disables preallcoation for that reason, which makes sense). However, MADV_POPULATE_WRITE would help to prefault without actually reading/writing pmem - if we want to avoid any minor faults.

With 4), I think we primarily prealloc/prefault to make sure we have sufficient backend storage. fallocate() might do a better job just for the allocation. But if there is sufficient RAM it might make sense to prefault all guest RAM at least readable - then we only have a minor fault when the VM writes to it and might avoid having to go to disk. Prefaulting everything writable means that we *have to* write back all guest RAM even if the guest never accessed it. So I think there are cases where MADV_POPULATE_READ (current MADV_POPULATE) semantics could make sense.


--
Thanks,

David / dhildenb