Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

From: David Hildenbrand
Date: Tue Aug 21 2018 - 09:17:19 EST


> add_device_memory is in charge of

I wouldn't use the onlining/offlining terminology here. That rather
applies to memory that is exposed to the rest of the system (e.g. to the
buddy allocator, with underlying memory block devices). What we have
here is rather a pure setup/teardown of that device memory.

>
> a) calling either arch_add_memory() or add_pages(), depending on whether
> we want a linear mapping
> b) onlining the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() with ZONE_DEVICE as the zone, to
> expand zone/pgdat spanned pages and initialize its pages
>
> del_device_memory, on the other hand, is in charge of
>
> a) offlining the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
> whether we need to tear down the linear mapping or not
>
> These two functions are called from:
>
> add_device_memory:
> - devm_memremap_pages()
> - hmm_devmem_pages_create()
>
> del_device_memory:
> - devm_memremap_pages_release()
> - hmm_devmem_release()
>
> I think that this will get easier as soon as [1] gets merged.
>
> Finally, shrink_pages() is moved to offline_pages(), so now all
> page/zone handling is taken care of in the online_pages()/offline_pages()
> stage.
>
> [1] https://lkml.org/lkml/2018/6/19/110
>

[...]

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f90fa077b0c4..d04338ffabec 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct page *page = pfn_to_page(start_pfn);
> - struct zone *zone;
> int ret;
>
> - /* With altmap the first mapped page is offset from @start */
> - if (altmap)
> - page += vmem_altmap_offset(altmap);
> - zone = page_zone(page);
> - ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
> + ret = __remove_pages(nid, start_pfn, nr_pages, altmap);

These changes certainly look nice.

[...]

> index 7a832b844f24..95df37686196 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -119,6 +119,8 @@ static void devm_memremap_pages_release(void *data)
> struct dev_pagemap *pgmap = data;
> struct device *dev = pgmap->dev;
> struct resource *res = &pgmap->res;
> + struct vmem_altmap *altmap = pgmap->altmap_valid ?
> + &pgmap->altmap : NULL;
> resource_size_t align_start, align_size;
> unsigned long pfn;
> int nid;
> @@ -138,8 +140,7 @@ static void devm_memremap_pages_release(void *data)
> nid = dev_to_node(dev);
>
> mem_hotplug_begin();

I would really like to see mem_hotplug_begin()/mem_hotplug_done() also
being moved inside add_device_memory()/del_device_memory(), just like
for add_memory()/remove_memory().

I wonder if the kasan_* calls actually require this lock, or if they
could also somehow be moved inside add_device_memory()/del_device_memory().
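
Something like this (only a sketch of the idea based on the code in this
patch, untested, and assuming the kasan_* calls really don't need the
lock):

	int del_device_memory(int nid, unsigned long start, unsigned long size,
			      struct vmem_altmap *altmap, bool mapping)
	{
		unsigned long start_pfn = PHYS_PFN(start);
		unsigned long nr_pages = size >> PAGE_SHIFT;
		struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
		int ret;

		/* take the hotplug lock here instead of in all callers */
		mem_hotplug_begin();

		offline_mem_sections(start_pfn, start_pfn + nr_pages);
		shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);

		if (mapping)
			ret = arch_remove_memory(nid, start, size, altmap);
		else
			ret = __remove_pages(nid, start_pfn, nr_pages, altmap);

		mem_hotplug_done();

		return ret;
	}

devm_memremap_pages_release()/hmm_devmem_release() would then not have
to touch the lock at all.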

> - arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
> - &pgmap->altmap : NULL);
> + del_device_memory(nid, align_start, align_size, altmap, true);
> kasan_remove_zero_shadow(__va(align_start), align_size);
> mem_hotplug_done();
>
> @@ -175,7 +176,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> {
> resource_size_t align_start, align_size, align_end;
> struct vmem_altmap *altmap = pgmap->altmap_valid ?
> - &pgmap->altmap : NULL;
> + &pgmap->altmap : NULL;
> struct resource *res = &pgmap->res;
> unsigned long pfn, pgoff, order;
> pgprot_t pgprot = PAGE_KERNEL;
> @@ -249,11 +250,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
> goto err_kasan;
> }
>
> - error = arch_add_memory(nid, align_start, align_size, altmap, false);
> - if (!error)
> - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> - align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, altmap);
> + error = add_device_memory(nid, align_start, align_size, altmap, true);
> +
> mem_hotplug_done();
> if (error)
> goto err_add_memory;
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 30e1bc68503b..8e68b5ca67ca 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1262,6 +1262,22 @@ int release_mem_region_adjustable(struct resource *parent,
> continue;
> }
>
> + /*
> + * All memory regions added from memory-hotplug path
> + * have the flag IORESOURCE_SYSTEM_RAM.
> + * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
> + * If the resource does not have this flag, we know that
> + * we are dealing with a resource coming from HMM/devm.
> + * HMM/devm use another mechanism to add/release a resource.
> + * This goes via devm_request_mem_region/devm_release_mem_region.
> + * HMM/devm take care to release their resources when they want, so
> + * if we are dealing with them, let us just back off nicely
> + */

Maybe shorten that a bit:

"HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
devm_request_mem_region/devm_release_mem_region to add/release a
resource. Just back off here."

> + if (!(res->flags & IORESOURCE_SYSRAM)) {
> + ret = 0;
> + break;
> + }
> +
> if (!(res->flags & IORESOURCE_MEM))
> break;
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 21787e480b4a..23ce7fbdb651 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -996,6 +996,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
> struct zone *zone;
> struct page *page;
> int nid;
> + bool mapping;
>
> if (percpu_ref_tryget_live(&devmem->ref)) {
> dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -1012,12 +1013,14 @@ static void hmm_devmem_release(struct device *dev, void *data)
>
> mem_hotplug_begin();
> if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> - __remove_pages(zone, start_pfn, npages, NULL);
> + mapping = false;
> else
> - arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
> - npages << PAGE_SHIFT, NULL);
> - mem_hotplug_done();
> + mapping = true;
>
> + del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
> + NULL,
> + mapping);
> + mem_hotplug_done();
> hmm_devmem_radix_release(resource);
> }
>
> @@ -1027,6 +1030,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> struct device *device = devmem->device;
> int ret, nid, is_ram;
> unsigned long pfn;
> + bool mapping;
>
> align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
> align_size = ALIGN(devmem->resource->start +
> @@ -1085,7 +1089,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> if (nid < 0)
> nid = numa_mem_id();
>
> - mem_hotplug_begin();
> /*
> * For device private memory we call add_pages() as we only need to
> * allocate and initialize struct page for the device memory. More-
> @@ -1096,21 +1099,18 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
> * For device public memory, which is accesible by the CPU, we do
> * want the linear mapping and thus use arch_add_memory().
> */
> + mem_hotplug_begin();
> if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> - ret = arch_add_memory(nid, align_start, align_size, NULL,
> - false);
> + mapping = true;
> else
> - ret = add_pages(nid, align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, NULL, false);
> - if (ret) {
> - mem_hotplug_done();
> - goto error_add_memory;
> - }
> - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> - align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, NULL);
> + mapping = false;
> +
> + ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
> mem_hotplug_done();
>
> + if (ret)
> + goto error_add_memory;
> +
> for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
> struct page *page = pfn_to_page(pfn);
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9bd629944c91..60b67f09956e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -320,12 +320,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
> unsigned long start_pfn,
> unsigned long end_pfn)
> {
> - struct mem_section *ms;
> -
> for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
> - ms = __pfn_to_section(start_pfn);
> + struct mem_section *ms = __pfn_to_section(start_pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (unlikely(pfn_to_nid(start_pfn) != nid))
> @@ -345,15 +343,14 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> unsigned long start_pfn,
> unsigned long end_pfn)
> {
> - struct mem_section *ms;
> unsigned long pfn;
>
> /* pfn is the end pfn of a memory section. */
> pfn = end_pfn - 1;
> for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
> - ms = __pfn_to_section(pfn);
> + struct mem_section *ms = __pfn_to_section(pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (unlikely(pfn_to_nid(pfn) != nid))
> @@ -415,7 +412,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
> ms = __pfn_to_section(pfn);
>
> - if (unlikely(!valid_section(ms)))
> + if (unlikely(!online_section(ms)))
> continue;
>
> if (page_zone(pfn_to_page(pfn)) != zone)
> @@ -502,23 +499,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
> pgdat->node_spanned_pages = 0;
> }
>
> -static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +static void shrink_pages(struct zone *zone, unsigned long start_pfn,
> + unsigned long end_pfn,
> + unsigned long offlined_pages)
> {
> struct pglist_data *pgdat = zone->zone_pgdat;
> int nr_pages = PAGES_PER_SECTION;
> unsigned long flags;
> + unsigned long pfn;
>
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> - shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + zone->present_pages -= offlined_pages;
> +
> + clear_zone_contiguous(zone);
> + pgdat_resize_lock(pgdat, &flags);
> +
> + for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
> + shrink_zone_span(zone, pfn, pfn + nr_pages);
> + shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
> + }
> + pgdat->node_present_pages -= offlined_pages;
> +
> + pgdat_resize_unlock(pgdat, &flags);
> + set_zone_contiguous(zone);
> }
>
> -static int __remove_section(struct zone *zone, struct mem_section *ms,
> +static int __remove_section(int nid, struct mem_section *ms,
> unsigned long map_offset, struct vmem_altmap *altmap)
> {
> - unsigned long start_pfn;
> - int scn_nr;
> int ret = -EINVAL;
>
> if (!valid_section(ms))
> @@ -528,17 +535,13 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
> if (ret)
> return ret;
>
> - scn_nr = __section_nr(ms);
> - start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
> - __remove_zone(zone, start_pfn);
> -
> - sparse_remove_one_section(zone, ms, map_offset, altmap);
> + sparse_remove_one_section(nid, ms, map_offset, altmap);
> return 0;
> }
>
> /**
> * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> + * @nid: nid from which pages belong to
> * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
> * @nr_pages: number of pages to remove (must be multiple of section size)
> * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -548,35 +551,27 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
> * sure that pages are marked reserved and zones are adjust properly by
> * calling offline_pages().
> */
> -int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> +int __remove_pages(int nid, unsigned long phys_start_pfn,
> unsigned long nr_pages, struct vmem_altmap *altmap)
> {
> unsigned long i;
> unsigned long map_offset = 0;
> int sections_to_remove, ret = 0;
> + resource_size_t start, size;
>
> - /* In the ZONE_DEVICE case device driver owns the memory region */
> - if (is_dev_zone(zone)) {
> - if (altmap)
> - map_offset = vmem_altmap_offset(altmap);
> - } else {
> - resource_size_t start, size;
> -
> - start = phys_start_pfn << PAGE_SHIFT;
> - size = nr_pages * PAGE_SIZE;
> + start = phys_start_pfn << PAGE_SHIFT;
> + size = nr_pages * PAGE_SIZE;
>
> - ret = release_mem_region_adjustable(&iomem_resource, start,
> - size);
> - if (ret) {
> - resource_size_t endres = start + size - 1;
> + if (altmap)
> + map_offset = vmem_altmap_offset(altmap);
>
> - pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> - &start, &endres, ret);
> - }
> + ret = release_mem_region_adjustable(&iomem_resource, start, size);
> + if (ret) {
> + resource_size_t endres = start + size - 1;
> + pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> + &start, &endres, ret);
> }
>
> - clear_zone_contiguous(zone);
> -
> /*
> * We can only remove entire sections
> */
> @@ -587,15 +582,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> for (i = 0; i < sections_to_remove; i++) {
> unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
>
> - ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
> - altmap);
> + ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
> + altmap);
> map_offset = 0;
> if (ret)
> break;
> }
>
> - set_zone_contiguous(zone);
> -
> return ret;
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -1595,7 +1588,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
> unsigned long pfn, nr_pages;
> long offlined_pages;
> int ret, node;
> - unsigned long flags;
> unsigned long valid_start, valid_end;
> struct zone *zone;
> struct memory_notify arg;
> @@ -1665,11 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> /* removal success */
> adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> - zone->present_pages -= offlined_pages;
>
> - pgdat_resize_lock(zone->zone_pgdat, &flags);
> - zone->zone_pgdat->node_present_pages -= offlined_pages;
> - pgdat_resize_unlock(zone->zone_pgdat, &flags);
> + /* Here we will shrink zone/node's spanned/present_pages */
> + shrink_pages(zone, valid_start, valid_end, offlined_pages);
>
> init_per_zone_wmark_min();
>
> @@ -1902,4 +1892,57 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> mem_hotplug_done();
> }
> EXPORT_SYMBOL_GPL(remove_memory);
> +
> +static int __del_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + int ret;
> + unsigned long start_pfn = PHYS_PFN(start);
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> + struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> + offline_mem_sections(start_pfn, start_pfn + nr_pages);
> + shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
> +
> + if (mapping)
> + ret = arch_remove_memory(nid, start, size, altmap);
> + else
> + ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
> +
> + return ret;
> +}
> +
> +int del_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + return __del_device_memory(nid, start, size, altmap, mapping);
> +}
> #endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +static int __add_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + int ret;
> + unsigned long start_pfn = PHYS_PFN(start);
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> +
> + if (mapping)
> + ret = arch_add_memory(nid, start, size, altmap, false);
> + else
> + ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
> +
> + if (!ret) {
> + struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> + online_mem_sections(start_pfn, start_pfn + nr_pages);
> + move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
> + }
> +
> + return ret;
> +}
> +
> +int add_device_memory(int nid, unsigned long start, unsigned long size,
> + struct vmem_altmap *altmap, bool mapping)
> +{
> + return __add_device_memory(nid, start, size, altmap, mapping);
> +}

Any reason for these indirections?
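
Both __add_device_memory() and __del_device_memory() seem to have a
single caller each, so the wrappers could simply go away, e.g. (sketch,
same body as in this patch, just without the indirection):

	int add_device_memory(int nid, unsigned long start, unsigned long size,
			      struct vmem_altmap *altmap, bool mapping)
	{
		unsigned long start_pfn = PHYS_PFN(start);
		unsigned long nr_pages = size >> PAGE_SHIFT;
		int ret;

		if (mapping)
			ret = arch_add_memory(nid, start, size, altmap, false);
		else
			ret = add_pages(nid, start_pfn, nr_pages, altmap, false);

		if (!ret) {
			struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

			online_mem_sections(start_pfn, start_pfn + nr_pages);
			move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
		}

		return ret;
	}

(and the same for del_device_memory()).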

> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..016020bd20b5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -766,12 +766,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
> free_map_bootmem(memmap);
> }


I guess for readability, this patch could be split up into several
patches, e.g. the factoring out of add_device_memory()/del_device_memory(),
the release_mem_region_adjustable() change, ...

--

Thanks,

David / dhildenb