divide by zero bug in find_busiest_group

From: Chetan Ahuja
Date: Wed Aug 25 2010 - 21:17:19 EST


This has been filed as a bug in the kernel bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=16991),
but visibility on bugzilla seems low (and the bugzilla server seems to
get overly "stressed" during certain parts of the day), so here's my
"summary" of the discussion so far. If nothing else, it will get
indexed by search engines.

We've seen a divide-by-zero crash in the function update_sg_lb_stats
(inlined into find_busiest_group) at the following location:

/usr/src/linux/kernel/sched.c:3769
                *balance = 0;
                return;
        }

        /* Adjust by relative CPU power of the group */
        sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) /
                                                group->cpu_power;
    aff5:       48 c1 e0 0a             shl    $0xa,%rax
    aff9:       48 f7 f6                div    %rsi

Apparently group->cpu_power can be zero under some conditions.
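
To make the failure mode concrete, here is a minimal userspace sketch
of the same arithmetic with an explicit check in front of the divide.
It is not the kernel's code and not a proposed patch: the helper name
scaled_avg_load and the LOAD_SCALE macros are made up for illustration,
and the scale of 1 << 10 is taken from the "shl $0xa" in the
disassembly above.

```c
#include <stdint.h>

/* Matches the shift by 10 seen in the disassembly (shl $0xa). */
#define LOAD_SCALE_SHIFT 10
#define LOAD_SCALE (UINT64_C(1) << LOAD_SCALE_SHIFT)

/*
 * Hypothetical helper, not a kernel function: scale group_load by
 * LOAD_SCALE and divide by cpu_power, refusing to execute the divide
 * when cpu_power is zero.
 * Returns 1 and stores the result on success, 0 if cpu_power is zero
 * (in which case *avg_load is left untouched).
 */
static int scaled_avg_load(uint64_t group_load, uint64_t cpu_power,
                           uint64_t *avg_load)
{
    if (cpu_power == 0)
        return 0;   /* caller decides how to recover */
    *avg_load = (group_load * LOAD_SCALE) / cpu_power;
    return 1;
}
```

A check like this only hides the symptom. It does not explain how the
zero got there, which is the open question in the rest of this mail.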

I noticed (what I thought was) a race condition between cpu_power
being initialized to zero (in the build_xxx_groups functions in
sched.c) and its use as the denominator in the
find_busiest/idlest_group functions. PeterZ replied that there's a
safe codepath from the build_*_groups functions to the crash location
which guarantees a non-zero value. I did express concern that, in the
absence of explicit synchronization/mem-barriers, we're at the mercy
of the compiler and hardware doing us favors (by not re-ordering
instructions in an adverse way) for that guarantee. But I don't think
we got hit by the initial zeroes, because all the crashes I saw
happened after many months of uptime.
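
For readers unfamiliar with the ordering concern, the following is a
rough userspace sketch of "initialize, then publish" done with explicit
ordering. It uses standard C11 atomics rather than the kernel's own
barrier primitives, and struct group, publish_group and read_power are
hypothetical names, not anything in sched.c.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduler group; not the kernel struct. */
struct group {
    uint64_t cpu_power;
};

/* Pointer through which readers find the group; NULL until published. */
static _Atomic(struct group *) published_group;

/* Writer: finish initialization, then publish with release ordering. */
static void publish_group(struct group *g, uint64_t power)
{
    g->cpu_power = power;   /* set to its real value before publication */
    atomic_store_explicit(&published_group, g, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store above, so a
 * reader that sees the pointer also sees the initialized cpu_power.
 * Returns 1 and stores the value, or 0 if nothing is published yet.
 */
static int read_power(uint64_t *power)
{
    struct group *g =
        atomic_load_explicit(&published_group, memory_order_acquire);
    if (g == NULL)
        return 0;
    *power = g->cpu_power;
    return 1;
}
```

Without the release/acquire pairing, nothing in the language stops the
store to cpu_power from becoming visible after the pointer does, which
is the kind of "favor" referred to above.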

There's also another place where group->cpu_power gets updated
without any synchronization: the update_cpu_power function. Though the
only way this could result in a bad value for cpu_power is by core A
reading an in-transit value of a non-atomically-updated 64 bit value
from core B :-). Unlikely? Very!! Should we make that update
explicitly atomic? Would be prudent.
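
As an illustration of what "explicitly atomic" could look like, here is
a small userspace sketch using C11 atomics. It is an assumption-laden
example, not the kernel's API: shared_power, set_power and
load_per_power are invented names, and the kernel would use its own
primitives instead of <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared value standing in for group->cpu_power. */
static _Atomic uint64_t shared_power = 1024;

/* Writer side: one indivisible 64-bit store, so no torn value. */
static void set_power(uint64_t power)
{
    atomic_store_explicit(&shared_power, power, memory_order_relaxed);
}

/*
 * Reader side: load once into a local so the zero check and the
 * division both see the same snapshot.
 * Returns 0 if the snapshot is zero instead of dividing.
 */
static uint64_t load_per_power(uint64_t load)
{
    uint64_t power =
        atomic_load_explicit(&shared_power, memory_order_relaxed);
    if (power == 0)
        return 0;
    return load / power;
}
```

Relaxed ordering is enough here because the only property wanted is
that the 64-bit value is read and written in one piece. Taking a single
snapshot also matters: checking the shared variable and then re-reading
it for the divide would reopen the window.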

We do need more ideas on how the zero could have gotten there. The
two paths I mentioned above don't provide that warm, fuzzy feeling
yet.

Thanks
Chetan

P.S.

a) Kernel version: 2.6.32 release from kernel.org, though a similar
divide-by-zero has been reported as recently as 2.6.35 in an Ubuntu
distribution kernel here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/615135
b) Hardware: 8 core Nehalem (Intel E5520). /proc/cpuinfo shows 16
"hyperthreaded" cores.

Some relevant CONFIG settings:
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
.
.
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y

CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_SCHED_TRACER is not set