Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling

From: Aneesh Kumar K.V
Date: Wed Mar 07 2012 - 23:18:00 EST


On Wed, 7 Mar 2012 20:28:39 +0800, Hillf Danton <dhillf@xxxxxxxxx> wrote:
> On Wed, Mar 7, 2012 at 12:48 PM, David Gibson
> <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
> > general quota handling code, and they don't much resemble its behaviour.
> > Rather than being about maintaining limits on on-disk block usage by
> > particular users, they are instead about maintaining limits on in-memory
> > page usage (including anonymous MAP_PRIVATE copied-on-write pages)
> > associated with a particular hugetlbfs filesystem instance.
> >
> > Worse, they work by having callbacks to the hugetlbfs filesystem code from
> > the low-level page handling code, in particular from free_huge_page().
> > This is a layering violation in itself, but more importantly, if the kernel
> > does a get_user_pages() on hugepages (which can happen from KVM amongst
> > others), then the free_huge_page() can be delayed until after the
> > associated inode has already been freed.  If an unmount occurs at the
> > wrong time, even the hugetlbfs superblock where the "quota" limits are
> > stored may have been freed.
> >
> > Andrew Barry proposed a patch to fix this by having hugepages, instead of
> > storing a pointer to their address_space and reaching the superblock from
> > there, store pointers directly to the superblock, bumping the reference
> > count as appropriate to avoid it being freed.  Andrew Morton rejected
> > that version, however, on the grounds that it made the existing
> > layering violation worse.
> >
> > This is a reworked version of Andrew's patch, which removes the extra, and
> > some of the existing, layering violation.  It works by introducing the
> > concept of a hugepage "subpool" at the lower hugepage mm layer - that is
> > a finite logical pool of hugepages to allocate from.  hugetlbfs now creates
> > a subpool for each filesystem instance with a page limit set, and a pointer
> > to the subpool gets added to each allocated hugepage, instead of the
> > address_space pointer used now.  The subpool has its own lifetime and is
> > only freed once all pages in it _and_ all other references to it (i.e.
> > superblocks) are gone.
> >
> > subpools are optional - a NULL subpool pointer is taken by the code to mean
> > that no subpool limits are in effect.
> >
> > Previous discussion of this bug can be found in: "Fix refcounting in
> > hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
> > http://marc.info/?l=linux-mm&m=126928970510627&w=1
> >
> > v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
> > alloc_huge_page() - since it already takes the vma, it is not necessary.
> >
> > Signed-off-by: Andrew Barry <abarry@xxxxxxxx>
> > Signed-off-by: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Minchan Kim <minchan.kim@xxxxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Hillf Danton <dhillf@xxxxxxxxx>
> > ---
> >  fs/hugetlbfs/inode.c    |   54 +++++++-----------
> >  include/linux/hugetlb.h |   14 ++++--
> >  mm/hugetlb.c            |  135 +++++++++++++++++++++++++++++++++++++---------
> >  3 files changed, 139 insertions(+), 64 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index bb0e366..74c6ba2 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -626,9 +626,15 @@ static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> >  		spin_lock(&sbinfo->stat_lock);
> >  		/* If no limits set, just report 0 for max/free/used
> >  		 * blocks, like simple_statfs() */
> > -		if (sbinfo->max_blocks >= 0) {
> > -			buf->f_blocks = sbinfo->max_blocks;
> > -			buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
> > +		if (sbinfo->spool) {
> > +			long free_pages;
> > +
> > +			spin_lock(&sbinfo->spool->lock);
> > +			buf->f_blocks = sbinfo->spool->max_hpages;
> > +			free_pages = sbinfo->spool->max_hpages
> > +				- sbinfo->spool->used_hpages;
> > +			buf->f_bavail = buf->f_bfree = free_pages;
> > +			spin_unlock(&sbinfo->spool->lock);
> >  			buf->f_files = sbinfo->max_inodes;
> >  			buf->f_ffree = sbinfo->free_inodes;
> >  		}
> > @@ -644,6 +650,10 @@ static void hugetlbfs_put_super(struct super_block *sb)
> >
> >  	if (sbi) {
> >  		sb->s_fs_info = NULL;
> > +
> > +		if (sbi->spool)
> > +			hugepage_put_subpool(sbi->spool);
> > +
> >  		kfree(sbi);
> >  	}
> >  }
> > @@ -874,10 +884,14 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> >  	sb->s_fs_info = sbinfo;
> >  	sbinfo->hstate = config.hstate;
> >  	spin_lock_init(&sbinfo->stat_lock);
> > -	sbinfo->max_blocks = config.nr_blocks;
> > -	sbinfo->free_blocks = config.nr_blocks;
> >  	sbinfo->max_inodes = config.nr_inodes;
> >  	sbinfo->free_inodes = config.nr_inodes;
> > +	sbinfo->spool = NULL;
> > +	if (config.nr_blocks != -1) {
> > +		sbinfo->spool = hugepage_new_subpool(config.nr_blocks);
> > +		if (!sbinfo->spool)
> > +			goto out_free;
> > +	}
> >  	sb->s_maxbytes = MAX_LFS_FILESIZE;
> >  	sb->s_blocksize = huge_page_size(config.hstate);
> >  	sb->s_blocksize_bits = huge_page_shift(config.hstate);
> > @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent)
> >  	sb->s_root = root;
> >  	return 0;
> >  out_free:
> > +	if (sbinfo->spool)
> > +		kfree(sbinfo->spool);
> >  	kfree(sbinfo);
> >  	return -ENOMEM;
> >  }
> >
> > -int hugetlb_get_quota(struct address_space *mapping, long delta)
> > -{
> > -	int ret = 0;
> > -	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> > -
> > -	if (sbinfo->free_blocks > -1) {
> > -		spin_lock(&sbinfo->stat_lock);
> > -		if (sbinfo->free_blocks - delta >= 0)
> > -			sbinfo->free_blocks -= delta;
> > -		else
> > -			ret = -ENOMEM;
> > -		spin_unlock(&sbinfo->stat_lock);
> > -	}
> > -
> > -	return ret;
> > -}
> > -
> > -void hugetlb_put_quota(struct address_space *mapping, long delta)
> > -{
> > -	struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);
> > -
> > -	if (sbinfo->free_blocks > -1) {
> > -		spin_lock(&sbinfo->stat_lock);
> > -		sbinfo->free_blocks += delta;
> > -		spin_unlock(&sbinfo->stat_lock);
> > -	}
> > -}
> > -
> >  static struct dentry *hugetlbfs_mount(struct file_system_type *fs_type,
> >  	int flags, const char *dev_name, void *data)
> >  {
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 7adc492..cf01817 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -14,6 +14,15 @@ struct user_struct;
> >  #include <linux/shm.h>
> >  #include <asm/tlbflush.h>
> >
> > +struct hugepage_subpool {
> > +	spinlock_t lock;
> > +	long count;
> > +	long max_hpages, used_hpages;
> > +};
> > +
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks);
> > +void hugepage_put_subpool(struct hugepage_subpool *spool);
> > +
> >  int PageHuge(struct page *page);
> >
> >  void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
> > @@ -129,12 +138,11 @@ enum {
> >
> >  #ifdef CONFIG_HUGETLBFS
> >  struct hugetlbfs_sb_info {
> > -	long	max_blocks;   /* blocks allowed */
> > -	long	free_blocks;  /* blocks free */
> >  	long	max_inodes;   /* inodes allowed */
> >  	long	free_inodes;  /* inodes free */
> >  	spinlock_t	stat_lock;
> >  	struct hstate *hstate;
> > +	struct hugepage_subpool *spool;
> >  };
> >
> >  static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
> > @@ -146,8 +154,6 @@ extern const struct file_operations hugetlbfs_file_operations;
> >  extern const struct vm_operations_struct hugetlb_vm_ops;
> >  struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
> >  				struct user_struct **user, int creat_flags);
> > -int hugetlb_get_quota(struct address_space *mapping, long delta);
> > -void hugetlb_put_quota(struct address_space *mapping, long delta);
> >
> >  static inline int is_file_hugepages(struct file *file)
> >  {
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5f34bd8..36b38b3a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size;
> >   */
> >  static DEFINE_SPINLOCK(hugetlb_lock);
> >
> > +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
> > +{
> > +	bool free = (spool->count == 0) && (spool->used_hpages == 0);
> > +
> > +	spin_unlock(&spool->lock);
> > +
> > +	/* If no pages are used, and no other handles to the subpool
> > +	 * remain, free the subpool */
> > +	if (free)
> > +		kfree(spool);
> > +}
> > +
> > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks)
> > +{
> > +	struct hugepage_subpool *spool;
> > +
> > +	spool = kmalloc(sizeof(*spool), GFP_KERNEL);
> > +	if (!spool)
> > +		return NULL;
> > +
> > +	spin_lock_init(&spool->lock);
> > +	spool->count = 1;
> > +	spool->max_hpages = nr_blocks;
> > +	spool->used_hpages = 0;
> > +
> > +	return spool;
> > +}
> > +
> > +void hugepage_put_subpool(struct hugepage_subpool *spool)
> > +{
> > +	spin_lock(&spool->lock);
> > +	BUG_ON(!spool->count);
> > +	spool->count--;
> > +	unlock_or_release_subpool(spool);
> > +}
> > +
> > +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool,
> > +				      long delta)
> > +{
> > +	int ret = 0;
> > +
> > +	if (!spool)
> > +		return 0;
> > +
> > +	spin_lock(&spool->lock);
> > +	if ((spool->used_hpages + delta) <= spool->max_hpages) {
> > +		spool->used_hpages += delta;
> > +	} else {
> > +		ret = -ENOMEM;
> > +	}
> > +	spin_unlock(&spool->lock);
> > +
> > +	return ret;
> > +}
> > +
> > +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool,
> > +				       long delta)
> > +{
> > +	if (!spool)
> > +		return;
> > +
> > +	spin_lock(&spool->lock);
> > +	spool->used_hpages -= delta;
> > +	/* If hugetlbfs_put_super couldn't free spool due to
> > +	 * an outstanding quota reference, free it now. */
> > +	unlock_or_release_subpool(spool);
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
> > +{
> > +	return HUGETLBFS_SB(inode->i_sb)->spool;
> > +}
> > +
> > +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > +{
> > +	return subpool_inode(vma->vm_file->f_dentry->d_inode);
> > +}
> > +
> >  /*
> >   * Region tracking -- allows tracking of reservations and instantiated pages
> >   *                    across the pages in a mapping.
> > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page)
> >  	 */
> >  	struct hstate *h = page_hstate(page);
> >  	int nid = page_to_nid(page);
> > -	struct address_space *mapping;
> > +	struct hugepage_subpool *spool =
> > +		(struct hugepage_subpool *)page_private(page);
> >
> > -	mapping = (struct address_space *) page_private(page);
> >  	set_page_private(page, 0);
> >  	page->mapping = NULL;
> >  	BUG_ON(page_count(page));
> > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page)
> >  		enqueue_huge_page(h, page);
> >  	}
> >  	spin_unlock(&hugetlb_lock);
> > -	if (mapping)
> > -		hugetlb_put_quota(mapping, 1);
> > +	hugepage_subpool_put_pages(spool, 1);
>
> Like current code, quota is handed back *unconditionally*, but ...


We end up doing a get_quota for every allocated page. The get_quota
happens either during mmap(), if MAP_NORESERVE is not specified, or
during alloc_huge_page(), if we haven't done a quota reservation during
mmap() for that range. Have you found any part of the code where we miss
that?
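
A rough user-space model of the two charging paths described above may
help. It is a sketch, not the kernel code: struct toy_subpool and all
toy_* helpers are hypothetical names made up for illustration, there is
no locking, and handing back an unused reservation at unmap time is left
out. The assumption being illustrated is that each page is charged
exactly once, so the unconditional put at free time balances.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page-limited pool; not the kernel structure. */
struct toy_subpool {
	long max_hpages;	/* limit on pages charged to this pool */
	long used_hpages;	/* pages currently charged */
};

/* Charge delta pages. Returns 0 on success, -1 if the limit would be
 * exceeded. A NULL pool means "no limit", as in the patch. */
static int toy_get_pages(struct toy_subpool *sp, long delta)
{
	if (!sp)
		return 0;
	if (sp->used_hpages + delta > sp->max_hpages)
		return -1;
	sp->used_hpages += delta;
	return 0;
}

/* Hand delta pages back to the pool. */
static void toy_put_pages(struct toy_subpool *sp, long delta)
{
	if (!sp)
		return;
	assert(sp->used_hpages >= delta);
	sp->used_hpages -= delta;
}

/* Path 1: mmap() without MAP_NORESERVE charges the whole range up front. */
static int toy_mmap(struct toy_subpool *sp, long npages, bool noreserve)
{
	return noreserve ? 0 : toy_get_pages(sp, npages);
}

/* Path 2: allocation at fault time charges one page, but only if the
 * range was not already charged at mmap() time. */
static int toy_alloc_page(struct toy_subpool *sp, bool reserved_at_mmap)
{
	return reserved_at_mmap ? 0 : toy_get_pages(sp, 1);
}

/* Freeing a page always hands one page of quota back. */
static void toy_free_page(struct toy_subpool *sp)
{
	toy_put_pages(sp, 1);
}
```

In this model a reserved mapping of N pages that faults in and later
frees all N pages ends with used_hpages back at its starting value, and
so does a MAP_NORESERVE mapping that charges page by page.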

-aneesh
