[PATCH 1/1] numa, mm, memory-hotplug: Do not allocate pagetables on the local node when MEMORY_HOTREMOVE is enabled.

From: Tang Chen
Date: Thu May 16 2013 - 07:47:44 EST


The following patch set allocates pagetables on the local node:
https://lkml.org/lkml/2013/4/11/829

Doing this will break memory hot-remove.

Before removing memory, the kernel offlines it. If offlining fails,
the memory cannot be removed. Pagetables are in use by the kernel, so
the pages holding them cannot be offlined, and therefore the memory
containing them cannot be removed.

Of course, we could free the pagetable pages, because the pagetables of
the to-be-removed memory are no longer needed. But offlining memory
does not mean removing it. If users only want to offline memory, the
pagetables should not be freed.

The minimum unit of memory online/offline is a block. By default, one
block contains one section, which is 128MB by default. It is possible
that half of a block holds pagetables and the other half is movable
memory.
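
To make the sizes concrete, here is a small standalone sketch of the
arithmetic. The EX_* constants are assumptions for illustration only: they
mirror the usual x86_64 values (128MB sections, 4KB pages) but are not taken
from kernel headers, and the ex_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration (typical x86_64 defaults). */
#define EX_SECTION_SIZE_BITS 27 /* 2^27 bytes = 128MB per section */
#define EX_PAGE_SHIFT        12 /* 2^12 bytes = 4KB per page */

/* Size in bytes of a memory block made of the given number of sections. */
static uint64_t ex_block_bytes(unsigned int sections_per_block)
{
	assert(sections_per_block > 0);
	return (uint64_t)sections_per_block << EX_SECTION_SIZE_BITS;
}

/* Number of pages in such a block. */
static uint64_t ex_block_pages(unsigned int sections_per_block)
{
	return ex_block_bytes(sections_per_block) >> EX_PAGE_SHIFT;
}
```

With one section per block, that is 32768 pages per block, so a single
pagetable page is enough to make the whole 128MB block non-offlinable.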

When we offline this kind of block, its status is uncertain. We cannot
simply free the pagetables in it because they may be used by other
online blocks. But during memory hot-remove, failing to offline a
block breaks the hot-remove logic.
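
The situation can be modelled with a toy userspace sketch. This is not
kernel code: the page kinds and ex_block_can_offline() are hypothetical and
only illustrate that one pagetable page is enough to keep a block from
being offlined, while movable pages alone are not a problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page classification, for illustration only. */
enum ex_page_kind {
	EX_PAGE_FREE,      /* unused page */
	EX_PAGE_MOVABLE,   /* user page that can be migrated away */
	EX_PAGE_PAGETABLE  /* kernel pagetable page, cannot be migrated */
};

/*
 * A block can be offlined only if every page in it is free or movable.
 * Returns false as soon as a pagetable page is found.
 */
static bool ex_block_can_offline(const enum ex_page_kind *pages, size_t n)
{
	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i] == EX_PAGE_PAGETABLE)
			return false;
	}
	return true;
}
```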


To fix this, there are three possible solutions:

1. Reserve the whole block (128MB) so that nothing else can use the rest
of the block, and skip such blocks when offlining memory.
When all the other blocks are offlined, free the pagetables and remove
all the memory.

But this wastes some memory. 128MB is rather a lot to waste.


2. Keep this block online. Although the offline operation fails, it is
OK to remove the memory.

But the offline operation will always fail. And generally speaking,
offlining can fail for many reasons, so it is difficult to detect
whether it is OK to remove the memory. So we don't suggest this way.


3. Migrate user pages and mark this block offline. Offlining memory won't
stop the kernel from using the pagetables stored in it, so it will be OK.

But this changes the semantics of "offline". I'm not sure if we can do
it this way.


So until this problem is fixed, I think we should not allocate pagetables
on the local node when CONFIG_MEMORY_HOTREMOVE is enabled, and we should
restore that behavior once we have agreed on a direction and fixed the
problem.
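
As a rough standalone sketch of the resulting range selection (not the
kernel code itself): the struct, the function name and the avoid_local_node
flag are hypothetical, with the flag standing in for the compile-time
config check, and a false return standing in for the panic() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical [min_pfn, max_pfn) range of already-mapped memory. */
struct ex_pfn_range {
	unsigned long min_pfn;
	unsigned long max_pfn;
};

/*
 * Pick the range to search for pagetable pages.
 * Uses the node-local range only when that is allowed and non-empty,
 * otherwise falls back to the low range. Returns false when the low
 * range is empty too (the kernel would panic at this point).
 */
static bool ex_pick_pgt_range(bool avoid_local_node,
			      struct ex_pfn_range local,
			      struct ex_pfn_range low,
			      struct ex_pfn_range *out)
{
	assert(out != NULL);
	if (!avoid_local_node && local.max_pfn > local.min_pfn) {
		*out = local;
		return true;
	}
	if (low.min_pfn >= low.max_pfn)
		return false;
	*out = low;
	return true;
}
```

With avoid_local_node set, the local range is never used, so pagetables
always come from the low range regardless of which node is being mapped.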

This patch is based on
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

Other solutions to this problem are welcome.


Signed-off-by: Tang Chen <tangchen@xxxxxxxxxxxxxx>
---
arch/x86/mm/init.c | 27 ++++++++++++++++-----------
1 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8d0007a..8cd8a2d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -55,18 +55,23 @@ __ref void *alloc_low_pages(unsigned int num)

 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
 		unsigned long ret;
-		if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+#ifndef CONFIG_MEMORY_HOTPLUG
+		if (local_max_pfn_mapped > local_min_pfn_mapped) {
+			ret = memblock_find_in_range(
+					local_min_pfn_mapped << PAGE_SHIFT,
+					local_max_pfn_mapped << PAGE_SHIFT,
+					PAGE_SIZE * num , PAGE_SIZE);
+		} else
+#endif
+		{
 			if (low_min_pfn_mapped >= low_max_pfn_mapped)
 				panic("alloc_low_page: ran out of memory");
 			ret = memblock_find_in_range(
 					low_min_pfn_mapped << PAGE_SHIFT,
 					low_max_pfn_mapped << PAGE_SHIFT,
 					PAGE_SIZE * num , PAGE_SIZE);
-		} else
-			ret = memblock_find_in_range(
-					local_min_pfn_mapped << PAGE_SHIFT,
-					local_max_pfn_mapped << PAGE_SHIFT,
-					PAGE_SIZE * num , PAGE_SIZE);
+		}
+
 		if (!ret)
 			panic("alloc_low_page: can not alloc memory");
 		memblock_reserve(ret, PAGE_SIZE * num);
@@ -443,6 +448,11 @@ void __init init_mem_mapping(unsigned long begin, unsigned long end)
 		if (new_mapped_ram_size > mapped_ram_size)
 			step_size <<= STEP_SIZE_SHIFT;
 		mapped_ram_size += new_mapped_ram_size;
+
+		if (is_low) {
+			low_min_pfn_mapped = local_min_pfn_mapped;
+			low_max_pfn_mapped = local_max_pfn_mapped;
+		}
 	}

if (real_end < end) {
@@ -450,11 +460,6 @@ void __init init_mem_mapping(unsigned long begin, unsigned long end)
 		if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
 			local_max_pfn_mapped = end >> PAGE_SHIFT;
 	}
-
-	if (is_low) {
-		low_min_pfn_mapped = local_min_pfn_mapped;
-		low_max_pfn_mapped = local_max_pfn_mapped;
-	}
 }

#ifndef CONFIG_NUMA
--
1.7.1
