Re: [PATCH] hugetlbfs: dirty pages as they are added to pagecache

From: Michal Hocko
Date: Tue Oct 23 2018 - 13:41:55 EST


On Tue 23-10-18 10:30:44, Mike Kravetz wrote:
> On 10/23/18 12:43 AM, Michal Hocko wrote:
> > On Wed 17-10-18 21:10:22, Mike Kravetz wrote:
> >> Some test systems were experiencing negative huge page reserve
> >> counts and incorrect file block counts. This was traced to
> >> /proc/sys/vm/drop_caches removing clean pages from hugetlbfs
> >> file pagecaches. When code outside of hugetlbfs explicitly removes
> >> the pages, the appropriate accounting is not performed.
> >>
> >> This can be recreated as follows:
> >> fallocate -l 2M /dev/hugepages/foo
> >> echo 1 > /proc/sys/vm/drop_caches
> >> fallocate -l 2M /dev/hugepages/foo
> >> grep -i huge /proc/meminfo
> >> AnonHugePages: 0 kB
> >> ShmemHugePages: 0 kB
> >> HugePages_Total: 2048
> >> HugePages_Free: 2047
> >> HugePages_Rsvd: 18446744073709551615
> >> HugePages_Surp: 0
> >> Hugepagesize: 2048 kB
> >> Hugetlb: 4194304 kB
> >> ls -lsh /dev/hugepages/foo
> >> 4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
> >>
> >> To address this issue, dirty pages as they are added to pagecache.
> >> The problem is most easily triggered with fallocate as shown above,
> >> but read-faulted pages are affected as well: they will eventually be
> >> marked dirty, yet there is a window where they are clean and can be
> >> removed by code such as drop_caches. So, just dirty them all as they
> >> are added to the pagecache.
> >>
> >> In addition, it makes little sense to even try to drop hugetlbfs
> >> pagecache pages, so skip hugetlbfs superblocks in the drop_caches
> >> code.
> >>
> >> Fixes: 70c3547e36f5 ("hugetlbfs: add hugetlbfs_fallocate()")
> >> Cc: stable@xxxxxxxxxxxxxxx
> >> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> >
> > I do agree with others that the HUGETLBFS_MAGIC check in
> > drop_pagecache_sb is wrong in principle. I am not even sure we want to
> > special-case memory-backed filesystems. What if we ever implement
> > MADV_FREE on fs? Should those pages be dropped? My first take would be
> > yes.
>
> Ok, I have removed that hard-coded check. Implementing MADV_FREE on
> hugetlbfs would take some work, but it could be done.
>
> > Acked-by: Michal Hocko <mhocko@xxxxxxxx> for the set_page_dirty
> > part.
> >
> > Although I am wondering why you have covered only the fallocate path
> > with the Fixes tag. In other words, do we need the same treatment for
> > the page fault path? We do not set the dirty bit on the page there
> > either; we rely on the dirty bit in the pte, and only for writable
> > mappings. I have a hard time seeing how we have been safe there. So
> > maybe it is your Fixes: tag which is not entirely correct, or I am
> > simply missing the fault path.
>
> No, you are not missing anything. In the commit log I mentioned that this
> also applies to the fault path. The change takes care of them both.
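>
> For completeness, the read-fault window can be exercised with a minimal
> userspace sketch along these lines (hypothetical and untested, assuming
> a hugetlbfs mount at /dev/hugepages and a 2MB huge page size):
>
>   #include <fcntl.h>
>   #include <sys/mman.h>
>   #include <unistd.h>
>
>   #define SZ (2UL * 1024 * 1024)          /* one 2MB huge page */
>
>   int main(void)
>   {
>           int fd = open("/dev/hugepages/foo", O_CREAT | O_RDWR, 0644);
>           char *p;
>
>           if (fd < 0 || ftruncate(fd, SZ) < 0)
>                   return 1;
>           p = mmap(NULL, SZ, PROT_READ, MAP_SHARED, fd, 0);
>           if (p == MAP_FAILED)
>                   return 1;
>
>           /* read fault: a clean page is inserted into the pagecache */
>           *(volatile char *)p;
>
>           /* while we sleep, "echo 1 > /proc/sys/vm/drop_caches" from
>            * another shell drops that clean page with no hugetlb
>            * accounting (before this patch) */
>           pause();
>           return 0;
>   }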
>
> I was struggling with what to put in the Fixes tag. As mentioned, this
> problem also exists in the fault path. Since 3.16 is the oldest stable
> release, I went back and used the commit closest to the add_to_page_cache
> code there. However, that seems kind of random. Is there a better way to
> say the patch applies to all stable releases?

OK, good, I was afraid I was missing something, well except for not
reading the changelog properly. I would go with

Cc: stable # all kernels with hugetlb

> Here is updated patch without the drop_caches change and updated fixes tag.
>
> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>
> hugetlbfs: dirty pages as they are added to pagecache
>
> Some test systems were experiencing negative huge page reserve
> counts and incorrect file block counts. This was traced to
> /proc/sys/vm/drop_caches removing clean pages from hugetlbfs
> file pagecaches. When code outside of hugetlbfs explicitly removes the
> pages, the appropriate accounting is not performed.
>
> This can be recreated as follows:
> fallocate -l 2M /dev/hugepages/foo
> echo 1 > /proc/sys/vm/drop_caches
> fallocate -l 2M /dev/hugepages/foo
> grep -i huge /proc/meminfo
> AnonHugePages: 0 kB
> ShmemHugePages: 0 kB
> HugePages_Total: 2048
> HugePages_Free: 2047
> HugePages_Rsvd: 18446744073709551615
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> Hugetlb: 4194304 kB
> ls -lsh /dev/hugepages/foo
> 4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
>
> To address this issue, dirty pages as they are added to pagecache.
> The problem is most easily triggered with fallocate as shown above,
> but read-faulted pages are affected as well: they will eventually be
> marked dirty, yet there is a window where they are clean and can be
> removed by code such as drop_caches. So, just dirty them all as they
> are added to the pagecache.
>
> Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>

Acked-by: Michal Hocko <mhocko@xxxxxxxx>

> ---
> mm/hugetlb.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5c390f5a5207..7b5c0ad9a6bd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3690,6 +3690,12 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>  		return err;
>  	ClearPagePrivate(page);
>  
> +	/*
> +	 * set page dirty so that it will not be removed from cache/file
> +	 * by non-hugetlbfs specific code paths.
> +	 */
> +	set_page_dirty(page);
> +
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks += blocks_per_huge_page(h);
>  	spin_unlock(&inode->i_lock);
> --
> 2.17.2
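
FWIW, assuming the fix behaves as intended (I have not re-run the
reproducer myself), the counters should now stay sane after the same
sequence, e.g.:

fallocate -l 2M /dev/hugepages/foo
echo 1 > /proc/sys/vm/drop_caches
fallocate -l 2M /dev/hugepages/foo
grep Rsvd /proc/meminfo
HugePages_Rsvd: 0
ls -lsh /dev/hugepages/foo
2.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo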

--
Michal Hocko
SUSE Labs