Re: [PATCH] CPU hotplug, debug: Detect imbalance between get_online_cpus()and put_online_cpus()

From: Srivatsa S. Bhat
Date: Thu Oct 04 2012 - 02:16:49 EST


On 10/04/2012 02:43 AM, Andrew Morton wrote:
> On Wed, 03 Oct 2012 18:23:09 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx> wrote:
>
>> The synchronization between CPU hotplug readers and writers is achieved by
>> means of refcounting, safe-guarded by the cpu_hotplug.lock.
>>
>> get_online_cpus() increments the refcount, whereas put_online_cpus() decrements
>> it. If we ever hit an imbalance between the two, we end up compromising the
>> guarantees of the hotplug synchronization i.e, for example, an extra call to
>> put_online_cpus() can end up allowing a hotplug reader to execute concurrently with
>> a hotplug writer. So, add a BUG_ON() in put_online_cpus() to detect such cases
>> where the refcount can go negative.
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
>> ---
>>
>> kernel/cpu.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index f560598..00d29bc 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -80,6 +80,7 @@ void put_online_cpus(void)
>> if (cpu_hotplug.active_writer == current)
>> return;
>> mutex_lock(&cpu_hotplug.lock);
>> + BUG_ON(cpu_hotplug.refcount == 0);
>> if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
>> wake_up_process(cpu_hotplug.active_writer);
>> mutex_unlock(&cpu_hotplug.lock);
>
> I think calling BUG() here is a bit harsh. We should only do that if
> there's a risk to proceeding: a risk of data loss, a reduced ability to
> analyse the underlying bug, etc.
>
> But a cpu-hotplug locking imbalance is a really really really minor
> problem! So how about we emit a warning then try to fix things up?

That would be better indeed, thanks!

> This should increase the chance that the machine will keep running and
> so will increase the chance that a user will be able to report the bug
> to us.
>

Yep, sounds good.

>
> --- a/kernel/cpu.c~cpu-hotplug-debug-detect-imbalance-between-get_online_cpus-and-put_online_cpus-fix
> +++ a/kernel/cpu.c
> @@ -80,9 +80,12 @@ void put_online_cpus(void)
> if (cpu_hotplug.active_writer == current)
> return;
> mutex_lock(&cpu_hotplug.lock);
> - BUG_ON(cpu_hotplug.refcount == 0);
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> + if (!--cpu_hotplug.refcount) {

This won't catch it. We'll enter this 'if' condition only when cpu_hotplug.refcount was
decremented to zero. We'll miss out the case when it went negative (which we intended to detect).

> + if (WARN_ON(cpu_hotplug.refcount == -1))
> + cpu_hotplug.refcount++; /* try to fix things up */
> + if (unlikely(cpu_hotplug.active_writer))
> + wake_up_process(cpu_hotplug.active_writer);
> + }
> mutex_unlock(&cpu_hotplug.lock);
>
> }

So how about something like below:

------------------------------------------------------>

From: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
Subject: [PATCH] CPU hotplug, debug: Detect imbalance between get_online_cpus() and put_online_cpus()

The synchronization between CPU hotplug readers and writers is achieved by
means of refcounting, safe-guarded by the cpu_hotplug.lock.

get_online_cpus() increments the refcount, whereas put_online_cpus() decrements
it. If we ever hit an imbalance between the two, we end up compromising the
guarantees of the hotplug synchronization i.e, for example, an extra call to
put_online_cpus() can end up allowing a hotplug reader to execute concurrently with
a hotplug writer. So, add a WARN_ON() in put_online_cpus() to detect such cases
where the refcount can go negative, and also attempt to fix it up, so that we can
continue to run.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
---

kernel/cpu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index f560598..42bd331 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -80,6 +80,10 @@ void put_online_cpus(void)
if (cpu_hotplug.active_writer == current)
return;
mutex_lock(&cpu_hotplug.lock);
+
+ if (WARN_ON(!cpu_hotplug.refcount))
+ cpu_hotplug.refcount++; /* try to fix things up */
+
if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
wake_up_process(cpu_hotplug.active_writer);
mutex_unlock(&cpu_hotplug.lock);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/