Re: + generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd.patchadded to -mm tree

From: Xiao Guangrong
Date: Fri Sep 11 2009 - 03:45:56 EST




Peter Zijlstra wrote:
> On Thu, 2009-07-30 at 17:30 -0700, akpm@xxxxxxxxxxxxxxxxxxxx wrote:
>
>> ------------------------------------------------------
>> Subject: generic-ipi: fix the race between generic_smp_call_function_*() and hotplug_cfd()
>> From: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxx>
>>
>> There is a race between generic_smp_call_function_*() and hotplug_cfd() in
>> many cases, see below examples:
>>
>> 1: hotplug_cfd() can free cfd->cpumask, the system will crash if the
>> cpu's cfd still in the call_function list:
>>
>>
>> CPU A: CPU B
>>
>> smp_call_function_many() ......
>> cpu_down() ......
>> hotplug_cfd() -> ......
>> free_cpumask_var(cfd->cpumask) (receive function IPI interrupte)
>> /* read cfd->cpumask */
>> generic_smp_call_function_interrupt() ->
>> cpumask_test_and_clear_cpu(cpu, data->cpumask)
>>
>> CRASH!!!
>>
>> 2: It's not handle call_function list when cpu down, It's will lead to
>> dead-wait if other path is waiting this cpu to execute function
>>
>> CPU A: CPU B
>>
>> smp_call_function_many(wait=0)
>> ...... CPU B down
>> smp_call_function_many() --> (cpu down before recevie function
>> csd_lock(&data->csd); IPI interrupte)
>>
>> DEAD-WAIT!!!!
>>
>> So, CPU A will dead-wait in csd_lock(), the same as
>> smp_call_function_single()
>
> On re-reading this, I'm wondering if 2 is a real case.
>
> I'm thinking it should never happen since you're supposed to do things
> like get_online_cpus() around stuff like this, but then again, I doubt
> we actually do.
>

I think that get_online_cpus() and stop_machine() can't avoid this
race,

1: For the first example in my patch's changlog:

CPU A: CPU B

smp_call_function_many(wait=0) ......
cpu_down() ......
hotplug_cfd() -> ......
free_cpumask_var(cfd->cpumask) (receive function IPI interrupte)
/* read cfd->cpumask */
generic_smp_call_function_interrupt() ->
cpumask_test_and_clear_cpu(cpu, data->cpumask)

CRASH!!!

CPU A call smp_call_function_many(wait=0) that want CPU B to call
a specific function, after smp_call_function_many() return, we let
CPU A offline immediately. Unfortunately, if CPU B receives this
IPI interrupt after CPU A down, it will crash like above description.

2: For the second example in my patch's changlog:

If CPU B is dying, like below:

_cpu_down()
{
......

/* We suppose that have below sequences:
* before call __stop_machine(), CPU B is online (in cpu_online_mask),
* in this time, CPU A call smp_call_function_many(wait=0) and want
* CPU B to call a specific function, after CPU A finish it, CPU B
* go to __stop_machine() and disable it's interrupt
* (suppose CPU B not receive IPI interrupt in this time now)
*/
err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
......
}

Now, CPU B is down, but it's not handle CPU A's request, it cause that
can't clean the CSD_FLAG_LOCK flag of CPU A's cfd_data, if CPU A
call smp_call_function_many() next time. it will block in
csd_lock() -> csd_lock_wait(data) forever.

>> Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxx>
>> Cc: Jens Axboe <jens.axboe@xxxxxxxxxx>
>> Cc: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
>> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> ---
>>
>> kernel/smp.c | 38 ++++++++++++++++++++++++++++----------
>> 1 file changed, 28 insertions(+), 10 deletions(-)
>>
>> diff -puN kernel/smp.c~generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd kernel/smp.c
>> --- a/kernel/smp.c~generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd
>> +++ a/kernel/smp.c
>> @@ -113,14 +113,10 @@ void generic_exec_single(int cpu, struct
>> csd_lock_wait(data);
>> }
>>
>> -/*
>> - * Invoked by arch to handle an IPI for call function. Must be called with
>> - * interrupts disabled.
>> - */
>> -void generic_smp_call_function_interrupt(void)
>> +static void
>> +__generic_smp_call_function_interrupt(int cpu, int run_callbacks)
>> {
>> struct call_function_data *data;
>> - int cpu = smp_processor_id();
>>
>> /*
>> * Ensure entry is visible on call_function_queue after we have
>
> Also, if this is the last version, we're still not using run_callbacks
> for anything..
>

It's not the last version and fixed in another patch, see below URL please:
http://marc.info/?l=linux-mm-commits&m=124900028228350&w=2

Thanks,
Xiao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/