[BUG] sched: big numa dynamic sched domain memory corruption

From: Paul Jackson
Date: Mon Jul 31 2006 - 03:05:36 EST


We have hit a bug in the dynamic sched domain setup. It causes random
memory writes via a stale pointer.

I don't entirely understand the code yet, so my description of this
bug may be flawed. I'll do the best I can. Thanks to Jack Steiner
for figuring out what we know so far.

The three systems we are testing on have 128, 224 and 256 CPUs.
They are single core, ia64 SN2 Itanium systems configured with:

    CONFIG_NUMA      - enabled
    CONFIG_SCHED_MC  - disabled
    CONFIG_SCHED_SMT - disabled

They are running approximately the 2.6.16.* kernel found in SLES10.

We first noticed the problem due to the memory clobbering, and
had to crank up the slab debug code a notch to backtrack to the
apparent original cause. The bug does not cause an immediate
crash or kernel complaint.

In sum, it appears that the large array sched_group_allnodes is
freed by arch_destroy_sched_domains() when someone redefines
the cpu_exclusive portion of the cpuset configuration, but some
of the sd->groups are left pointing into the freed array, causing
the assignment:

    sd->groups->cpu_power = power;

to write via a stale sd->groups pointer.

The build_sched_domains() code only rebuilds the sd->groups pointer
to the current sched_group_allnodes array for those cpus that are
in the specified cpu_map. The remaining cpus seem to be left with
stale sd->groups pointers.

The above summary may be too inaccurate to be helpful.

I'll step through the failing scenario in more detail, and hopefully
with fewer inaccuracies.


During system boot, the initial call to build_sched_domains()
sets up the all-encompassing sched_group_allnodes and the smaller
child domains and groups. So far, all is well. Part of
this initialization includes allocating a large array called
sched_group_allnodes, and for each cpu in the system, initializing
its sd->groups->cpu_power element in the sched_group_allnodes
array.

After boot, we run some commands that create a child cpuset,
with, for this example, cpus 4-8, marked cpu_exclusive.

This calls arch_destroy_sched_domains(), which frees
sched_group_allnodes.

Then this calls build_sched_domains() with a mask including
*all-but* cpus 4-8 (in this example). That call allocates a new
sched_group_allnodes and in the first for_each_cpu_mask() loop,
initializes the sched domain, including sd->groups, for *all-but*
cpus 4-8. The sd->groups for 4-8 are still pointing back at
the now freed original sched_group_allnodes array.

Then we call build_sched_domains() again, with a mask for just
cpus 4-8. It executes the line:

    sd->groups->cpu_power = power;

with a stale sd->groups pointer, clobbering the already freed
memory that used to be in the sched_group_allnodes array. For our
situation, we are in the "#ifdef CONFIG_NUMA" variant of this line.
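
The sequence above can be modelled in user space. What follows is only
a rough sketch, not the kernel code: NR_CPUS, destroy_domains(),
build_domains(), count_stale() and simulate() are hypothetical
stand-ins, and a generation counter takes the place of the real
sd->groups pointers so that the model never touches freed memory.
The setup_allnodes flag stands for the size test in
build_sched_domains() passing or failing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 16	/* hypothetical small machine, for illustration only */

struct group { unsigned long cpu_power; };

static struct group *allnodes;		/* models sched_group_allnodes */
static unsigned int allnodes_gen;	/* bumped on every new allocation */
/* Which allocation each cpu's sd->groups points into; 0 means unset. */
static unsigned int groups_gen[NR_CPUS];

/* Models arch_destroy_sched_domains(): frees the array, but leaves
 * every per-cpu reference exactly as it was. */
static void destroy_domains(void)
{
	free(allnodes);
	allnodes = NULL;
}

/* Models the first loop of build_sched_domains(). Only cpus in cpu_map
 * are re-pointed at the current array, and only if the size test
 * (setup_allnodes) passes. */
static int build_domains(const bool *cpu_map, bool setup_allnodes)
{
	int cpu;

	assert(cpu_map != NULL);
	if (!setup_allnodes)
		return 0;
	if (!allnodes) {
		allnodes = calloc(NR_CPUS, sizeof(*allnodes));
		if (!allnodes)
			return -1;
		allnodes_gen++;
	}
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_map[cpu])
			groups_gen[cpu] = allnodes_gen;
	return 0;
}

/* Counts cpus in cpu_map whose reference predates the current array. */
static int count_stale(const bool *cpu_map)
{
	int cpu, stale = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_map[cpu] && groups_gen[cpu] != allnodes_gen)
			stale++;
	return stale;
}

/* Replays boot, destroy, rebuild for all-but 4-8, rebuild for 4-8.
 * Returns how many of cpus 4-8 would write cpu_power through a stale
 * reference, or -1 if allocation fails. */
int simulate(bool always_setup)
{
	bool all[NR_CPUS], rest[NR_CPUS], excl[NR_CPUS];
	int cpu, stale;

	destroy_domains();
	allnodes_gen = 0;
	memset(groups_gen, 0, sizeof(groups_gen));
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		all[cpu] = true;
		excl[cpu] = (cpu >= 4 && cpu <= 8);
		rest[cpu] = !excl[cpu];
	}

	if (build_domains(all, true) < 0)		/* boot */
		return -1;
	destroy_domains();				/* cpuset change */
	if (build_domains(rest, true) < 0)		/* all-but 4-8 */
		return -1;
	if (build_domains(excl, always_setup) < 0)	/* just 4-8 */
		return -1;

	stale = count_stale(excl);
	destroy_domains();
	return stale;
}
```

In this model, simulate(false) leaves all five of cpus 4-8 referring
to the freed allocation, which is the state in which the real code
writes cpu_power through a stale pointer. simulate(true) corresponds
to always running the allnodes setup, so none are stale.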


Aha - while writing the above, I had an idea for a possible fix.

The following patch seems to fix this problem, at least for the
above CONFIG on one of the test systems, though I have no particular
confidence that it is a good patch.

The idea of the patch is to -always- execute the code conditioned by
the "if (... > SD_NODES_PER_DOMAIN*...) {" test on big systems, even
if we happen to be calling build_sched_domains() with a small cpu_map.
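
To make the difference between the two tests concrete, here is a
small stand-alone sketch. It is not the kernel code: the helper names
allnodes_test_old() and allnodes_test_new() are made up for
illustration, and the value 16 for SD_NODES_PER_DOMAIN is an
assumption that should be checked against the tree in use.

```c
#include <assert.h>
#include <stdbool.h>

#define SD_NODES_PER_DOMAIN 16	/* assumed value, check the tree in use */

/* Original test: both sides depend on the cpu_map that was passed in.
 * cpus_in_map      models cpus_weight(*cpu_map)
 * node_cpus_in_map models cpus_weight(nodemask) after the cpus_and() */
bool allnodes_test_old(int cpus_in_map, int node_cpus_in_map)
{
	assert(cpus_in_map >= 0 && node_cpus_in_map >= 0);
	return cpus_in_map > SD_NODES_PER_DOMAIN * node_cpus_in_map;
}

/* Patched test: depends only on the size of the machine.
 * cpus_online   models cpus_weight(cpu_online_map)
 * cpus_per_node models cpus_weight(nodemask) before the cpus_and() */
bool allnodes_test_new(int cpus_online, int cpus_per_node)
{
	assert(cpus_online >= 0 && cpus_per_node >= 0);
	return cpus_online > SD_NODES_PER_DOMAIN * cpus_per_node;
}
```

Take the 256 CPU system and suppose, purely as an assumption for this
example, that each node has 2 cpus. With a cpu_map of cpus 4-8,
allnodes_test_old(5, 2) is false because 5 is not greater than 32, so
the allnodes setup is skipped for those cpus. allnodes_test_new(256, 2)
is true regardless of which cpu_map was passed in.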

---

 kernel/sched.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux.orig/kernel/sched.c	2006-07-30 23:42:12.182958555 -0700
+++ linux/kernel/sched.c	2006-07-30 23:45:12.513282355 -0700
@@ -5675,12 +5675,13 @@ void build_sched_domains(const cpumask_t
 		int group;
 		struct sched_domain *sd = NULL, *p;
 		cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
+		int cpus_per_node = cpus_weight(nodemask);
 
 		cpus_and(nodemask, nodemask, *cpu_map);
 
 #ifdef CONFIG_NUMA
-		if (cpus_weight(*cpu_map)
-				> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
+		if (cpus_weight(cpu_online_map)
+				> SD_NODES_PER_DOMAIN*cpus_per_node) {
 			if (!sched_group_allnodes) {
 				sched_group_allnodes
 					= kmalloc(sizeof(struct sched_group)


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401