Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior
From: Prakash Sangappa
Date: Fri May 10 2024 - 12:29:03 EST
> On May 8, 2024, at 10:28 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.05.24 19:00, Prakash Sangappa wrote:
>>> On May 7, 2024, at 5:00 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>
>>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>>> file size does not get updated in the mmap call.
>>>> The current behavior is that hugetlbfs file size will get extended by a
>>>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
>>>> not normal filesystem behavior.
>>>> There seem to have been very little discussion about this. There was a
>>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>>> extending because of the hugetlb page reservations. Looks like this was
>>>> not merged.
>>>> It appears there is no correlation between file size and hugetlb page
>>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>>> not extended even though hugetlb pages are reserved.
>>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>>> even though hugetlb pages are not reserved.
>>>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
>>>> hugeltbfs file is mmapped, and it only covers the file's offset,length
>>>> range specified in the mmap call.
>>>> Issue:
>>>> Some applications would prefer to manage hugetlb page allocations explicity
>>>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
>>>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
>>>> using fallocate(2) and release the pages by truncating the file size. Any stray
>>>> access beyond file size is expected to generate a signal. This does not
>>>> work properly due to current behavior which extends file size in mmap call.
>>>
>>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered.
>
> I'd assume that most applications that mmap() hugetlb files need to
> special-case hugetlb because of the different logical page size
> granularity already. But yes, it's all unfortunate.
Will run this by out application/Database team regarding implementing workarounds.
>
>> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.
>
> The issue is, as you write, that it's existing behavior and changing it
> could cause harm to other apps that rely on that. But I do wonder if really
> anybody relies on that ...
>
> Let's explore the history:
>
> The current VM_WRITE check was added in:
>
> commit b6174df5eec9cdfd598c03d6d0807e344e109213
> Author: Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>
> Date: Mon Jul 10 04:44:49 2006 -0700
>
> [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area
> Sometimes, applications need below call to be successful although
> "/mnt/hugepages/file1" doesn't exist.
> fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
> *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);
> As for regular pages (or files), above call does work, but as for huge
> pages, above call would fail because hugetlbfs_file_mmap would fail if
> (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
> This capability on huge page is useful on ia64 when the process wants to
> protect one area on region 4, so other threads couldn't read/write this
> area. A famous JVM (Java Virtual Machine) implementation on IA64 needs the
> capability.
>
> But it was only moved.
>
> Before that patch:
> * mmap(PROT_WRITE) would have failed if the file size would be exceeded
> * mmap(PROT_READ/PROT_NONE) would have extended the file
>
> After that patch
> * mmap(PROT_WRITE) will extend the file
> * mmap(PROT_READ/PROT_NONE) do not extend the file
>
> The code before that predates git times.
>
> Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
> around all hugetlbfs quirks.
>
> I suggest either
>
> (a) Document it, along with the workaround
At least needs documentation.
> (b) Change it an cross fingers.
>
>
> In QEMU source code is a very interesting comment:
>
> * ftruncate is not supported by hugetlbfs in older
> * hosts, so don't bother bailing out on errors.
> * If anything goes wrong with it under other filesystems,
> * mmap will fail.
>
> So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
> was added?
>
> QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a
> smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
> a rare case to happen, though, and likely also undesired here: we want it to behave just
> like ordinary files!
Ideally yes.
Thanks for your feedback.
-Prakash.
>
> --
> Cheers,
>
> David / dhildenb