[RFC PATCH v2 cpuset] Don't pass offlined cpus topartition_sched_domains()

From: Li Zhong
Date: Tue Apr 09 2013 - 05:59:48 EST


Hi Zefan,

I did some test today, enabling cpuset and online/offline the cpus. It
caused NULL address dereferencing in get_group(). After adding some
debugging code, it seems that it is caused by using some offlined cpus
as the parameter to partition_sched_domains(). More details below:

===================
following is printed in build_sched_domain():
@@ -6476,6 +6486,14 @@ struct sched_domain *build_sched_domain(struct
sched_domain_topology_level *tl,
return child;

cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+ if (cpumask_empty(sched_domain_span(sd)))
+ printk("lz empty domain %p, mask %pS\n", sd, tl->mask);
+ char cpulist[128];
+ cpumask_scnprintf(cpulist, sizeof(cpulist), cpu_map);
+ printk("lz cpu_map %s\n", cpulist);
+ cpumask_scnprintf(cpulist, sizeof(cpulist), tl->mask(cpu));
+ printk("lz tl->mask(%d) %pS %s\n",cpu, tl->mask, cpulist);
+
if (child) {
sd->level = child->level + 1;
sched_domain_level_max = max(sched_domain_level_max,
sd->level);

[ 232.714502] lz empty domain c0000001f73c0000, mask cpu_smt_mask
+0x0/0x10
[ 232.714515] lz cpu_map 1-2,7,13-15
[ 232.714521] lz tl->mask(1) cpu_smt_mask+0x0/0x10
[ 232.714529] lz cpu_map 1-2,7,13-15
[ 232.714536] lz tl->mask(1) cpu_cpu_mask+0x0/0x10 2,7,13-15
[ 232.714544] lz cpu_map 1-2,7,13-15
[ 232.714551] lz tl->mask(1) sd_numa_mask+0x0/0x10 2,7,13-15
[ 232.714558] lz cpu_map 1-2,7,13-15
[ 232.714565] lz tl->mask(2) cpu_smt_mask+0x0/0x10 2
===================

It seems that one cpu (#1 of the above) could be taken offline in
cpuset_hotplug_workfn(), after top_cpuset's allowed cpus changed and the
propagated work function finished. But generate_sched_domains(), which
uses the cpuset information, still considers this cpu (#1) valid, and
passes it down in the cpumask to partition_sched_domains(). Then we have
an empty sd as the above.

With the empty domain, in get_group(),
if (child)
cpu = cpumask_first(sched_domain_span(child));

this might cause the cpu to be returned a value >= nr_cpu_ids, then
cause bad dereference with the wrong cpu value in the later code.

This following code tries to fix the issue by anding the cpu_active_mask
at the end of generate_sched_domains() for each partition. This might
result the partiton (set of domains) not the best one, but we know
another rebuild (caused by the cpu offline in the middle of
cpuset_hotplug_workfn()) is on the way.

Also it seems that generate_sched_domains() needs be called together
with partition_sched_domains(), under the hotplug lock. Or similarly,
one cpu might be taken offline while doing generate_sched_domains(), and
cause the above issue.

Or maybe we could put the whole cpuset_hotplug_workfn() into the hotplug
lock?

Thanks, Zhong

---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4f9dfe4..aed8436 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -754,6 +754,12 @@ done:
*/
if (doms == NULL)
ndoms = 1;
+ else {
+ for (nslot = 0; nslot < ndoms; nslot++) {
+ struct cpumask *dp = doms[nslot];
+ cpumask_and(dp, dp, cpu_active_mask);
+ }
+ }

*domains = doms;
*attributes = dattr;
@@ -2222,17 +2228,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
flush_workqueue(cpuset_propagate_hotplug_wq);

/* rebuild sched domains if cpus_allowed has changed */
- if (cpus_updated) {
- struct sched_domain_attr *attr;
- cpumask_var_t *doms;
- int ndoms;
-
- mutex_lock(&cpuset_mutex);
- ndoms = generate_sched_domains(&doms, &attr);
- mutex_unlock(&cpuset_mutex);
-
- partition_sched_domains(ndoms, doms, attr);
- }
+ if (cpus_updated)
+ rebuild_sched_domains();
}

void cpuset_update_active_cpus(bool cpu_online)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/