Re: +generic-ipi-fix-the-race-between-generic_smp_call_function_-and-hotplug_cfd.patchadded to -mm tree

From: Suresh Siddha
Date: Fri Sep 11 2009 - 15:09:43 EST


On Fri, 2009-09-11 at 00:45 -0700, Xiao Guangrong wrote:
> I think that get_online_cpus() and stop_machine() can't avoid this
> race,
>
> 1: For the first example in my patch's changlog:
>
> CPU A: CPU B
>
> smp_call_function_many(wait=0) ......
> cpu_down() ......
> hotplug_cfd() -> ......
> free_cpumask_var(cfd->cpumask) (receive function IPI interrupte)
> /* read cfd->cpumask */
> generic_smp_call_function_interrupt() ->
> cpumask_test_and_clear_cpu(cpu, data->cpumask)
>
> CRASH!!!
>
> CPU A call smp_call_function_many(wait=0) that want CPU B to call
> a specific function, after smp_call_function_many() return, we let
> CPU A offline immediately. Unfortunately, if CPU B receives this
> IPI interrupt after CPU A down, it will crash like above description.

How can cpu B receive the IPI interrupt after cpu A is down?

As part of the cpu A going down, we first do the stop machine. i.e.,
schedule the stop machine worker threads on each cpu. So, by the time
all the worker threads on all the cpu's get scheduled and synchronized,
ipi on B should get delivered.

>
> 2: For the second example in my patch's changlog:
>
> If CPU B is dying, like below:
>
> _cpu_down()
> {
> ......
>
> /* We suppose that have below sequences:
> * before call __stop_machine(), CPU B is online (in cpu_online_mask),
> * in this time, CPU A call smp_call_function_many(wait=0) and want
> * CPU B to call a specific function, after CPU A finish it, CPU B
> * go to __stop_machine() and disable it's interrupt
> * (suppose CPU B not receive IPI interrupt in this time now)
> */
> err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
> ......
> }
>
> Now, CPU B is down, but it's not handle CPU A's request, it cause that
> can't clean the CSD_FLAG_LOCK flag of CPU A's cfd_data, if CPU A
> call smp_call_function_many() next time. it will block in
> csd_lock() -> csd_lock_wait(data) forever.

Here also, by the time the stop machine threads are scheduled on cpu B,
cpu B should service that IPI.

smp_call_function with wait=0 will still ensure that IPI's are
registered at the destination cpu. But smp_call_function call will
return before the interrupt handler is run and completed.

So, by the time we schedule stop machine threads and when they are all
online and get synchronized (after this point only we disable interrupts
by moving to STOPMACHINE_DISABLE_IRQ state), we should have serviced the
pending smp call function IPI's.

I am still not convinced about the need for this patch. Am I missing
something? Please elaborate the need with more sequence of steps.

thanks,
suresh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/