Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982

From: Peter Zijlstra
Date: Thu Jul 14 2011 - 09:16:42 EST


On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:

> I also printed out the cpu spans as we walk through build_sched_groups:
>
> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
>
> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
>
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.

Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the node itself plus its 15 nearest nodes
(SD_NODES_PER_DOMAIN in total), without any consideration for overlap
with the spans already generated for other nodes.
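
To see why that goes wrong, here's a toy userspace model of that
selection (not the actual sched_domain_node_span() code; the 4-node
distance table and the shrunken SD_NODES_PER_DOMAIN are made up for
illustration). Each node greedily pulls in its nearest neighbours
without ever looking at the spans already built for the other nodes:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES		4
#define SD_NODES_PER_DOMAIN	2	/* 16 in the kernel */

/* hypothetical distance table: 4 nodes in a ring */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 22, 16 },
	{ 16, 10, 16, 22 },
	{ 22, 16, 10, 16 },
	{ 16, 22, 16, 10 },
};

/* greedily add the nearest not-yet-spanned node until the span
 * holds SD_NODES_PER_DOMAIN nodes; note that nothing in here
 * consults the spans generated for the other nodes */
static void node_span(int node, bool span[NR_NODES])
{
	for (int i = 0; i < NR_NODES; i++)
		span[i] = false;
	span[node] = true;

	for (int used = 1; used < SD_NODES_PER_DOMAIN; used++) {
		int best = -1;

		for (int i = 0; i < NR_NODES; i++)
			if (!span[i] && (best < 0 ||
					 dist[node][i] < dist[node][best]))
				best = i;
		if (best < 0)
			break;
		span[best] = true;	/* may sit in another node's span too */
	}
}

int main(void)
{
	bool span[NR_NODES];

	for (int n = 0; n < NR_NODES; n++) {
		node_span(n, span);
		printf("node %d span:", n);
		for (int i = 0; i < NR_NODES; i++)
			if (span[i])
				printf(" %d", i);
		printf("\n");
	}
	return 0;
}

With that ring table, nodes 0 and 1 both end up with span {0,1},
which is exactly the duplicate-group situation Anton is seeing.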

Now that used to work, because we used to simply allocate a new
sched_group for every span instead of reusing an existing one.

The thing is, we want to track state that is unique to a group of
cpus, so duplicating the group means duplicating that state, which
is iffy.

Otoh, making these masks non-overlapping is probably sub-optimal from a
NUMA pov.

Looking at a slightly simpler set-up (4-socket AMD Magny-Cours):

$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10

We can translate that into groups (each node, then the node plus
everything at distance <= 16, then all nodes) like

{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...

and we can easily see there's overlap in the NUMA layout itself as
well.
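
For what it's worth, that middle level falls straight out of the
distance table; a quick userspace sketch (assuming distance <= 16,
the first hop beyond local, is the grouping threshold) reproduces
the groups above:

#include <stdio.h>

#define NR_NODES 8

/* the distance table quoted above */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 22, 16, 22, 16, 22 },
	{ 16, 10, 22, 16, 22, 16, 22, 16 },
	{ 16, 22, 10, 16, 16, 22, 16, 22 },
	{ 22, 16, 16, 10, 22, 16, 22, 16 },
	{ 16, 22, 16, 22, 10, 16, 16, 22 },
	{ 22, 16, 22, 16, 16, 10, 22, 16 },
	{ 16, 22, 16, 22, 16, 22, 10, 16 },
	{ 22, 16, 22, 16, 22, 16, 16, 10 },
};

int main(void)
{
	/* group each node with everything at distance <= 16;
	 * the top level then simply spans all nodes */
	for (int n = 0; n < NR_NODES; n++) {
		printf("{%d} ->", n);
		for (int i = 0; i < NR_NODES; i++)
			if (dist[n][i] <= 16)
				printf(" %d", i);
		printf("\n");
	}
	return 0;
}

That prints {0,1,2,4,6} for node 0 and {0,1,3,5,7} for node 1, so
the overlap is visible immediately.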

This seems to suggest we need to separate the unique state from the
sched_group.
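
Something like the following might be the shape of it (purely a
userspace sketch; the names sg_state / sg_state_get() /
sg_state_put() are invented here, and in the kernel this would be an
atomic_t refcount rather than C11 atomics): split the group-wide
state into its own refcounted object, so overlapping groups covering
the same cpus share a single instance instead of each carrying a
copy.

#include <stdatomic.h>
#include <stdlib.h>

/* hypothetical: the state we want exactly once per cpu span */
struct sg_state {
	atomic_int	refcount;
	unsigned long	power;		/* group-wide state lives here */
	/* cpumask, load-balance stats, ... */
};

/* hypothetical: the group shrinks to linkage plus a pointer */
struct sched_group_sketch {
	struct sched_group_sketch *next;
	struct sg_state *state;		/* shared, not duplicated */
};

static struct sg_state *sg_state_alloc(void)
{
	struct sg_state *s = calloc(1, sizeof(*s));

	if (s)
		atomic_init(&s->refcount, 1);
	return s;
}

static struct sg_state *sg_state_get(struct sg_state *s)
{
	atomic_fetch_add(&s->refcount, 1);
	return s;
}

static void sg_state_put(struct sg_state *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		free(s);
}

int main(void)
{
	struct sg_state *s = sg_state_alloc();
	struct sched_group_sketch a = { .next = NULL, .state = s };
	struct sched_group_sketch b = { .next = NULL,
					.state = sg_state_get(s) };

	a.state->power = 1024;		/* b.state->power sees 1024 too */

	sg_state_put(a.state);
	sg_state_put(b.state);
	return 0;
}

That keeps one copy of the state per distinct span, at the cost of a
pointer and a refcount per group.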

Now all I need is a way to not consume gobs of memory.. /me goes prod