Re: [patch] mm: node-setup agnostic free_bootmem()

From: Johannes Weiner
Date: Wed Apr 30 2008 - 13:52:38 EST


Hi,

"Yinghai Lu" <yhlu.kernel@xxxxxxxxx> writes:

> On Wed, Apr 30, 2008 at 3:50 AM, Johannes Weiner <hannes@xxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> "Yinghai Lu" <yhlu.kernel@xxxxxxxxx> writes:
>>
>> > On Mon, Apr 28, 2008 at 12:11 PM, Yinghai Lu <yhlu.kernel@xxxxxxxxx> wrote:
>> >>
>> >> On Mon, Apr 28, 2008 at 9:54 AM, Johannes Weiner <hannes@xxxxxxxxxxxx> wrote:
>> >> > Hi Yinghai,
>> >> >
>> >> >
>> >> >
>> >> > "Yinghai Lu" <yhlu.kernel@xxxxxxxxx> writes:
>> >> >
>> >> > > On Sun, Apr 27, 2008 at 5:40 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>> >> > >>
>> >> > >> * Johannes Weiner <hannes@xxxxxxxxxxxx> wrote:
>> >> > >>
>> >> > >> > > so i very much agree that your changes are cleaner, i just wanted to
>> >> > >> > > have one that has all the fixes included.
>> >> > >> >
>> >> > >> > I had planned this to be another patch because there are more then one
>> >> > >> > boundary check I wanted to tighten. I can merge them though if you
>> >> > >> > like.
>> >> > >>
>> >> > >> no, better to have them in separate patches.
>> >> > >>
>> >> > >> > > Would you like to post a patch against current -git or should i
>> >> > >> > > extract the cleaner reserve_bootmem() from your previous patch?
>> >> > >> >
>> >> > >> > I just moved and have only sporadic internet access and free time
>> >> > >> > slots available. Would be nice if you could do it!
>> >> > >>
>> >> > >> sure, find the merged patch below, against latest -git, boot-tested on
>> >> > >> x86. Is this what you had in mind?
>> >> > >>
>> >> > >> Ingo
>> >> > >>
>> >> > >> ---------------->
>> >> > >> Subject: mm: node-setup agnostic free_bootmem()
>> >> > >> From: Johannes Weiner <hannes@xxxxxxxxxxxx>
>> >> > >> Date: Wed, 16 Apr 2008 13:36:31 +0200
>> >> > >>
>> >> > >> Make free_bootmem() look up the node holding the specified address
>> >> > >> range which lets it work transparently on single-node and multi-node
>> >> > >> configurations.
>> >> > >>
>> >> > >> If the address range exceeds the node range, it well be marked free
>> >> > >> across node boundaries, too.
>> >> > >>
>> >> > >> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxxx>
>> >> > >> CC: Andi Kleen <andi@xxxxxxxxxxxxxx>
>> >> > >> CC: Yinghai Lu <yhlu.kernel@xxxxxxxxx>
>> >> > >> CC: Yasunori Goto <y-goto@xxxxxxxxxxxxxx>
>> >> > >> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> >> > >> CC: Christoph Lameter <clameter@xxxxxxx>
>> >> > >> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> >> > >> Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
>> >> > >> ---
>> >> > >> mm/bootmem.c | 27 +++++++++++++++++++++++++--
>> >> > >> 1 file changed, 25 insertions(+), 2 deletions(-)
>> >> > >>
>> >> > >> Index: linux-x86.q/mm/bootmem.c
>> >> > >> ===================================================================
>> >> > >> --- linux-x86.q.orig/mm/bootmem.c
>> >> > >> +++ linux-x86.q/mm/bootmem.c
>> >> > >> @@ -493,8 +493,31 @@ int __init reserve_bootmem(unsigned long
>> >> > >> void __init free_bootmem(unsigned long addr, unsigned long size)
>> >> > >> {
>> >> > >> bootmem_data_t *bdata;
>> >> > >> - list_for_each_entry(bdata, &bdata_list, list)
>> >> > >> - free_bootmem_core(bdata, addr, size);
>> >> > >> + unsigned long pos = addr;
>> >> > >> + unsigned long partsize = size;
>> >> > >> +
>> >> > >> + list_for_each_entry(bdata, &bdata_list, list) {
>> >> > >> + unsigned long remainder = 0;
>> >> > >> +
>> >> > >> + if (pos < bdata->node_boot_start)
>> >> > >> + continue;
>> >> > >> +
>> >> > >> + if (PFN_DOWN(pos + partsize) > bdata->node_low_pfn) {
>> >> > >> + remainder = PFN_DOWN(pos + partsize) - bdata->node_low_pfn;
>> >> > >> + partsize -= remainder;
>> >> > >> + }
>> >> > >> +
>> >> > >> + free_bootmem_core(bdata, pos, partsize);
>> >> > >> +
>> >> > >> + if (!remainder)
>> >> > >> + return;
>> >> > >> +
>> >> > >> + pos = PFN_PHYS(bdata->node_low_pfn + 1);
>> >> > >> + }
>> >> > >> + printk(KERN_ERR "free_bootmem: request: addr=%lx, size=%lx, "
>> >> > >> + "state: pos=%lx, partsize=%lx\n", addr, size,
>> >> > >> + pos, partsize);
>> >> > >> + BUG();
>> >> > >> }
>> >> > >>
>> >> > >> unsigned long __init free_all_bootmem(void)
>> >> > >>
>> >> > >
>> >> > > it will not work with cross nodes.
>> >> > >
>> >> > > for example: node 0: 0-2g, 4-6g, node1: 2-4g, 6-8g.
>> >> > > and if ramdisk sit cross 2G boundary. you will only free the range
>> >> > > before 2g.
>> >> >
>> >> > Yes, you stated that several times but this is not a technical argument:
>> >> > These setups are afaik not yet supported by the kernel at all. And you
>> >> > could not explain the node layout with the patch that implements support
>> >> > for these configurations.
>> >>
>> >> I looked at Suresh's patch, and it still only has one bdata for one node.
>> >
>> > Suresh's patch already in the Linus tree.
>> > commit 6ec6e0d9f2fd7cb6ca6bc3bfab5ae7b5cdd8c36f
>> > Author: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
>> > Date: Tue Mar 25 10:14:35 2008 -0700
>> >
>> > srat, x86: add support for nodes spanning other nodes
>> >
>> > For example, If the physical address layout on a two node system with 8 GB
>> > memory is something like:
>> > node 0: 0-2GB, 4-6GB
>> > node 1: 2-4GB, 6-8GB
>> >
>> > Current kernels fail to boot/detect this NUMA topology.
>> >
>> > ACPI SRAT tables can expose such a topology which needs to be supported.
>> >
>> > Signed-off-by: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
>> > Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
>> > Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>
>> Okay, so we have one bdata for node 0 and one for node 1. Does that mean
>> that both have overlapping pfn ranges?
>>
>> [1 ||||| ]
>> [2 ||||| ]
>>
>> Like this? How are the ||||| represented in the bootmem maps of each bdata?
>
> Yes.

Okay. So they share the same PFNs. Now imagine the following scenario:

node0: 0-2GB, 4-6GB
node1: 2-4GB, 6-8GB

/* Marks the range on node0 and node1 */
free_bootmem(1.5G, 2G);

/* Frees all bootmem on both nodes */
free_all_bootmem_node(NODE_DATA(0));
free_all_bootmem_node(NODE_DATA(1));

Aren't the same page descriptors send to __free_bootmem_pages() twice?

Hannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/