Re: [RFC][PATCH 3/3] a big contig memory allocator

From: Bob Liu
Date: Thu Oct 28 2010 - 23:55:26 EST


On Tue, Oct 26, 2010 at 6:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Add a function to allocate contiguous memory larger than MAX_ORDER.
> The main difference from the usual page allocator is that this uses
> the memory offline technique (isolate pages, then migrate the remaining pages).
>
> I think this is not a 100% solution because we can't avoid fragmentation,
> but we have the kernelcore= boot option and can create a MOVABLE zone. That
> helps us allocate a contiguous range on demand.
>
> The new function is
>
>  alloc_contig_pages(base, end, nr_pages, alignment)
>
> This function will allocate contiguous pages of nr_pages from the range
> [base, end). If [base, end) is bigger than nr_pages, some pfn which
> meets the alignment will be allocated. If alignment is smaller than
> MAX_ORDER, it will be raised to MAX_ORDER.
>
> __alloc_contig_pages() has many more arguments.
>
> Some drivers allocate contig pages from bootmem or by hiding some memory
> from the kernel at boot. But if contig pages are necessary only in some
> situations, the kernelcore= boot option plus page migration is an option.
>
> Note: I'm not 100% sure whether the __GFP_HARDWALL check is required.
>
>
> Changelog: 2010-10-26
>  - support gfp_t
>  - support zonelist/nodemask
>  - support [base, end)
>  - support alignment
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
>  include/linux/page-isolation.h |   15 ++
>  mm/page_alloc.c                |   29 ++++
>  mm/page_isolation.c            |  239 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 283 insertions(+)
>
> Index: mmotm-1024/mm/page_isolation.c
> ===================================================================
> --- mmotm-1024.orig/mm/page_isolation.c
> +++ mmotm-1024/mm/page_isolation.c
> @@ -5,6 +5,7 @@
>  #include <linux/mm.h>
>  #include <linux/page-isolation.h>
>  #include <linux/pageblock-flags.h>
> +#include <linux/swap.h>
>  #include <linux/memcontrol.h>
>  #include <linux/migrate.h>
>  #include <linux/memory_hotplug.h>
> @@ -398,3 +399,241 @@ retry:
>  	}
>  	return 0;
>  }
> +
> +/*
> + * Comparing user specified [user_start, user_end) with physical memory layout
> + * [phys_start, phys_end). If no intersection of length nr_pages, return 1.
> + * If there is an intersection, return 0 and fill range in [*start, *end)
> + */
> +static int
> +__calc_search_range(unsigned long user_start, unsigned long user_end,
> +		unsigned long nr_pages,
> +		unsigned long phys_start, unsigned long phys_end,
> +		unsigned long *start, unsigned long *end)
> +{
> +	if ((user_start >= phys_end) || (user_end <= phys_start))
> +		return 1;
> +	if (user_start <= phys_start) {
> +		*start = phys_start;
> +		*end = min(user_end, phys_end);
> +	} else {
> +		*start = user_start;
> +		*end = min(user_end, phys_end);
> +	}
> +	if (*end - *start < nr_pages)
> +		return 1;
> +	return 0;
> +}
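
Both branches above assign the same *end, so the helper reduces to a plain
interval intersection. Here is a minimal userspace sketch of that reading, not
kernel code: range_fits(), max_ul() and min_ul() are hypothetical names, and
the return value is inverted compared with the patch (true means the overlap
can hold nr_pages).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's max()/min() macros. */
static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Intersect the caller's [user_start, user_end) with the zone's
 * [phys_start, phys_end). Returns true and fills [*start, *end) when
 * the overlap holds at least nr_pages; returns false otherwise.
 */
static bool range_fits(unsigned long user_start, unsigned long user_end,
		       unsigned long nr_pages,
		       unsigned long phys_start, unsigned long phys_end,
		       unsigned long *start, unsigned long *end)
{
	/* No overlap at all. */
	if (user_start >= phys_end || user_end <= phys_start)
		return false;

	*start = max_ul(user_start, phys_start);
	*end = min_ul(user_end, phys_end);

	return *end - *start >= nr_pages;
}
```

With a zone spanning [50, 200), a request for 10 pages in [0, 100) is trimmed
to [50, 100), while a request in [60, 65) is rejected as too small.
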
> +
> +
> +/**
> + * __alloc_contig_pages - allocate contiguous physical pages
> + * @base: the lowest pfn which caller wants.
> + * @end:  the highest pfn which caller wants.
> + * @nr_pages: the length of a chunk of pages to be allocated.
> + * @align_order: alignment of start address of returned chunk in order.
> + *   Returned page's order will be aligned to (1 << align_order). If smaller
> + *   than MAX_ORDER, it's raised to MAX_ORDER.
> + * @node: allocate near memory to the node, If -1, current node is used.
> + * @gfpflag: used to specify what zone the memory should be from.
> + * @nodemask: allocate memory within the nodemask.
> + *
> + * Searches the memory range [base, end) and allocates physically contiguous
> + * pages. If end - base is larger than nr_pages, a chunk in [base, end) will
> + * be allocated.
> + *
> + * This returns a page of the beginning of contiguous block. At failure, NULL
> + * is returned.
> + *
> + * Limitation: at allocation, nr_pages may be increased to be aligned to
> + * MAX_ORDER before searching a range. So, even if there is a large enough
> + * chunk for nr_pages, the allocation may still fail. Extra tail pages of the
> + * allocated chunk are returned to the buddy allocator before returning to
> + * the caller.
> + */
> +
> +#define MIGRATION_RETRY	(5)
> +struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
> +			unsigned long nr_pages, int align_order,
> +			int node, gfp_t gfpflag, nodemask_t *mask)
> +{
> +	unsigned long found, aligned_pages, start;
> +	struct page *ret = NULL;
> +	int migration_failed;
> +	bool no_search = false;
> +	unsigned long align_mask;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	struct zonelist *zonelist;
> +	enum zone_type highzone_idx = gfp_zone(gfpflag);
> +	unsigned long zone_start, zone_end, rs, re, pos;
> +
> +	if (node == -1)
> +		node = numa_node_id();
> +
> +	/* check unsupported flags */
> +	if (gfpflag & __GFP_NORETRY)
> +		return NULL;
> +	if ((gfpflag & (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)) !=
> +		(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL))
> +		return NULL;
> +
> +	if (gfpflag & __GFP_THISNODE)
> +		zonelist = &NODE_DATA(node)->node_zonelists[1];
> +	else
> +		zonelist = &NODE_DATA(node)->node_zonelists[0];
> +	/*
> +	 * Base/nr_page/end should be aligned to MAX_ORDER
> +	 */
> +	found = 0;
> +
> +	if (align_order < MAX_ORDER)
> +		align_order = MAX_ORDER;
> +
> +	align_mask = (1 << align_order) - 1;
> +	if (end - base == nr_pages)
> +		no_search = true;

no_search is set here but never used?
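
The exact-fit case is tested again directly a few lines below
(end - base == nr_pages), so the flag looks redundant. A rough userspace
sketch of the argument checks without it follows. Everything here is an
assumption for illustration only: prepare_window(), struct search_window,
ALIGN_UP() and MODEL_MAX_ORDER (set to 11, a common default) are made-up
names, and the shifts use 1UL rather than the patch's plain 1.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* assumption: stand-in for the kernel's MAX_ORDER */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

struct search_window {
	unsigned long base;		/* aligned start of the search */
	unsigned long aligned_pages;	/* nr_pages rounded up to 1 << MODEL_MAX_ORDER */
	unsigned long tail_pages;	/* pages handed back after allocation */
};

/* Models the checks done before the zone walk; returns false where the patch returns NULL. */
static bool prepare_window(unsigned long base, unsigned long end,
			   unsigned long nr_pages, int align_order,
			   struct search_window *w)
{
	unsigned long align;

	if (align_order < MODEL_MAX_ORDER)
		align_order = MODEL_MAX_ORDER;
	align = 1UL << align_order;

	/* Exact fit: nothing to search, so base must already be aligned. */
	if (end - base == nr_pages && (base & (align - 1)))
		return false;

	w->aligned_pages = ALIGN_UP(nr_pages, 1UL << MODEL_MAX_ORDER);
	w->base = ALIGN_UP(base, align);
	if (end <= w->base || end - w->base < w->aligned_pages)
		return false;

	w->tail_pages = w->aligned_pages - nr_pages;
	return true;
}
```

For example, a request for 3000 pages is rounded up to 4096 in this model, so
1096 tail pages would be given back to the buddy allocator afterwards.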

> +	/*
> +	 * We allocate MAX_ORDER aligned pages and cut tail pages later.
> +	 */
> +	aligned_pages = ALIGN(nr_pages, (1 << MAX_ORDER));
> +	/*
> +	 * If end - base == nr_pages, we can't search range. base must be
> +	 * aligned.
> +	 */
> +	if ((end - base == nr_pages) && (base & align_mask))
> +		return NULL;
> +
> +	base = ALIGN(base, (1 << align_order));
> +	if ((end <= base) || (end - base < aligned_pages))
> +		return NULL;
> +
> +	/*
> +	 * searching contig memory range within [pos, end).
> +	 * pos is updated at migration failure to find next chunk in zone.
> +	 * pos is reset to the base at searching next zone.
> +	 * (see for_each_zone_zonelist_nodemask in mmzone.h)
> +	 *
> +	 * Note: we cannot assume zones/nodes are in linear memory layout.
> +	 */
> +	z = first_zones_zonelist(zonelist, highzone_idx, mask, &zone);
> +	pos = base;
> +retry:
> +	if (!zone)
> +		return NULL;
> +
> +	zone_start = ALIGN(zone->zone_start_pfn, 1 << align_order);
> +	zone_end = zone->zone_start_pfn + zone->spanned_pages;
> +
> +	/* check [pos, end) is in this zone. */
> +	if ((pos >= end) ||
> +	     (__calc_search_range(pos, end, aligned_pages,
> +			zone_start, zone_end, &rs, &re))) {
> +next_zone:
> +		/* go to the next zone */
> +		z = next_zones_zonelist(++z, highzone_idx, mask, &zone);
> +		/* reset the pos */
> +		pos = base;
> +		goto retry;
> +	}
> +	/* [pos, end) is trimmed to [rs, re) in this zone. */
> +	pos = rs;
> +
> +	found = find_contig_block(rs, re, aligned_pages, align_order, zone);
> +	if (!found)
> +		goto next_zone;
> +
> +	/*
> +	 * OK, here, we have contiguous pageblock marked as "isolated"
> +	 * try migration.
> +	 */
> +	drain_all_pages();
> +	lru_add_drain_all();
> +
> +	/*
> +	 * scan_lru_pages() finds the next PG_lru page in the range
> +	 * scan_lru_pages() returns 0 when it reaches the end.
> +	 */
> +	migration_failed = 0;
> +	rs = found;
> +	re = found + aligned_pages;
> +	for (rs = scan_lru_pages(rs, re);
> +	     rs && rs < re;
> +	     rs = scan_lru_pages(rs, re)) {
> +		if (do_migrate_range(rs, re)) {
> +			/* it's better to try another block ? */
> +			if (++migration_failed >= MIGRATION_RETRY)
> +				break;
> +			/* take a rest and synchronize LRU etc. */
> +			drain_all_pages();
> +			lru_add_drain_all();
> +		} else /* reset migration_failure counter */
> +			migration_failed = 0;
> +	}
> +
> +	if (!migration_failed) {
> +		drain_all_pages();
> +		lru_add_drain_all();
> +	}
> +	/* Check all pages are isolated */
> +	if (test_pages_isolated(found, found + aligned_pages)) {
> +		undo_isolate_page_range(found, aligned_pages);
> +		/*
> +		 * We failed at [found...found+aligned_pages) migration.
> +		 * "rs" is the last pfn scan_lru_pages() found that the page
> +		 * is LRU page. Update pos and try next chunk.
> +		 */
> +		pos = ALIGN(rs + 1, (1 << align_order));
> +		goto retry; /* goto next chunk */
> +	}
> +	/*
> +	 * OK, here, [found...found+pages) memory are isolated.
> +	 * All pages in the range will be moved into the list with
> +	 * page_count(page)=1.
> +	 */
> +	ret = pfn_to_page(found);
> +	alloc_contig_freed_pages(found, found + aligned_pages, gfpflag);
> +	/* unset ISOLATE */
> +	undo_isolate_page_range(found, aligned_pages);
> +	/* Free unnecessary pages in tail */
> +	for (start = found + nr_pages; start < found + aligned_pages; start++)
> +		__free_page(pfn_to_page(start));
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL_GPL(__alloc_contig_pages);
> +
> +void free_contig_pages(struct page *page, int nr_pages)
> +{
> +	int i;
> +	for (i = 0; i < nr_pages; i++)
> +		__free_page(page + i);
> +}
> +EXPORT_SYMBOL_GPL(free_contig_pages);
> +
> +/*
> + * Allocated pages will not be MOVABLE but MOVABLE zone is a suitable
> + * for allocating big chunk. So, using ZONE_MOVABLE is a default.
> + */
> +
> +struct page *alloc_contig_pages(unsigned long base, unsigned long end,
> +			unsigned long nr_pages, int align_order)
> +{
> +	return __alloc_contig_pages(base, end, nr_pages, align_order, -1,
> +				GFP_KERNEL | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages);
> +
> +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order)
> +{
> +	return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, -1,
> +				GFP_KERNEL | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages_host);
> +
> +struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
> +				int align_order)
> +{
> +	return __alloc_contig_pages(0, max_pfn, nr_pages, align_order, nid,
> +			GFP_KERNEL | __GFP_THISNODE | __GFP_MOVABLE, NULL);
> +}
> +EXPORT_SYMBOL_GPL(alloc_contig_pages_node);
> Index: mmotm-1024/include/linux/page-isolation.h
> ===================================================================
> --- mmotm-1024.orig/include/linux/page-isolation.h
> +++ mmotm-1024/include/linux/page-isolation.h
> @@ -32,6 +32,8 @@ test_pages_isolated(unsigned long start_
>   */
>  extern int set_migratetype_isolate(struct page *page);
>  extern void unset_migratetype_isolate(struct page *page);
> +extern void alloc_contig_freed_pages(unsigned long pfn,
> +		unsigned long pages, gfp_t flag);
>
>  /*
>   * For migration.
> @@ -41,4 +43,17 @@ int test_pages_in_a_zone(unsigned long s
>  unsigned long scan_lru_pages(unsigned long start, unsigned long end);
>  int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn);
>
> +/*
> + * For large alloc.
> + */
> +struct page *__alloc_contig_pages(unsigned long base, unsigned long end,
> +				unsigned long nr_pages, int align_order,
> +				int node, gfp_t flag, nodemask_t *mask);
> +struct page *alloc_contig_pages(unsigned long base, unsigned long end,
> +				unsigned long nr_pages, int align_order);
> +struct page *alloc_contig_pages_host(unsigned long nr_pages, int align_order);
> +struct page *alloc_contig_pages_node(int nid, unsigned long nr_pages,
> +		int align_order);
> +void free_contig_pages(struct page *page, int nr_pages);
> +
>  #endif
> Index: mmotm-1024/mm/page_alloc.c
> ===================================================================
> --- mmotm-1024.orig/mm/page_alloc.c
> +++ mmotm-1024/mm/page_alloc.c
> @@ -5430,6 +5430,35 @@ out:
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  }
>
> +
> +void alloc_contig_freed_pages(unsigned long pfn, unsigned long end, gfp_t flag)
> +{
> +	struct page *page;
> +	struct zone *zone;
> +	int order;
> +	unsigned long start = pfn;
> +
> +	zone = page_zone(pfn_to_page(pfn));
> +	spin_lock_irq(&zone->lock);
> +	while (pfn < end) {
> +		VM_BUG_ON(!pfn_valid(pfn));
> +		page = pfn_to_page(pfn);
> +		VM_BUG_ON(page_count(page));
> +		VM_BUG_ON(!PageBuddy(page));
> +		list_del(&page->lru);
> +		order = page_order(page);
> +		zone->free_area[order].nr_free--;
> +		rmv_page_order(page);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +		pfn += 1 << order;
> +	}
> +	spin_unlock_irq(&zone->lock);
> +
> +	/* After this, pages in the range can be freed one by one */
> +	for (pfn = start; pfn < end; pfn++)
> +		prep_new_page(pfn_to_page(pfn), 0, flag);
> +}
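
To check my reading of the loop above: it visits only the head page of each
free buddy block and then jumps ahead by the block size. A toy userspace model
of that stride is below. It is purely illustrative: order_of[] and
count_free_blocks() are invented names standing in for page_order() and the
free-list removal, and the assert stands in for VM_BUG_ON(!PageBuddy(page)).

```c
#include <assert.h>

/*
 * order_of[pfn] is the order of the free block whose first page is pfn,
 * or -1 if pfn is not the head of a free block. Returns the number of
 * blocks that would be unlinked from the free lists in [pfn, end).
 */
static unsigned long count_free_blocks(const int *order_of,
				       unsigned long pfn, unsigned long end)
{
	unsigned long blocks = 0;

	while (pfn < end) {
		int order = order_of[pfn];

		assert(order >= 0);	/* every stop must be a block head */
		blocks++;
		pfn += 1UL << order;	/* skip the rest of this block */
	}
	return blocks;
}
```

A 16-page range made of blocks of order 3, 2, 1, 0 and 0 is covered in five
steps, which is why the range has to consist entirely of free buddy blocks
before this is called.
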
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  /*
>   * All pages in the range must be isolated before calling this.
>
--
Thanks,
--Bob