Re: blk_mq_freeze_queue hang and possible race in percpu-refcount

From: David Chen
Date: Wed Mar 14 2018 - 13:42:39 EST


Hi Tejun,

Thanks, I see that I missed the RCU grace period.
I'll try forcing atomic mode with PERCPU_REF_INIT_ATOMIC,
though so far I haven't been able to reproduce the hang.

Thanks,
David


2018-03-14 8:43 GMT-07:00 Tejun Heo <tj@xxxxxxxxxx>:
> Hello, David.
>
> On Tue, Mar 13, 2018 at 03:50:47PM -0700, David Chen wrote:
>> ====
>> CPU A                             CPU B
>> -----                             -----
>> percpu_ref_kill()                 percpu_ref_tryget_live()
>>                                   {
>>                                     if (__ref_is_percpu())
>>   set __PERCPU_REF_DEAD;
>>   __percpu_ref_switch_mode();
>>     ^ sum up current percpu_count
>>                                       this_cpu_inc(*percpu_count);
>>                                         <- this increment got leaked.
>>
>> ====
>>
>> So if CPU B later does percpu_ref_put, it will cause ref->count
>> to drop to -1, thus causing the above hung task issue.
>>
>> Do you think this theory is correct, or am I missing something?
>> Please tell me what you think.
>
> Switching to atomic mode does something like the following.
>
> 1. Mark the refcnt so that __ref_is_percpu() is false.
>
> 2. Wait for an RCU grace period so that everyone, including any
> percpu_ref_tryget_live() which has seen __ref_is_percpu() return
> true, is done with its operation.
>
> 3. Now that it knows nobody is operating on the assumption that the
> counter is in percpu mode, it adds up all the percpu counters.
>
> So, provided there aren't some silly bugs, what you described
> shouldn't happen. Can you force the refcnt into atomic mode w/
> PERCPU_REF_INIT_ATOMIC and see whether the problem persists?
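
The ordering in steps 1-3 can be sketched in userspace as a toy,
single-threaded replay. This is not kernel code and only a rough model
under stated assumptions: struct toy_ref and every toy_* function are
made-up names, the RCU grace period is modelled simply as "let the
in-flight get finish before summing", and all other details of the real
percpu-refcount implementation are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 2

/* Toy stand-in for a percpu refcount; NOT the kernel's struct percpu_ref. */
struct toy_ref {
	long percpu_count[TOY_NR_CPUS];	/* per-CPU deltas, used in percpu mode */
	long atomic_count;		/* shared counter, used in atomic mode */
	bool is_percpu;			/* stands in for __ref_is_percpu() */
};

/* First half of a get: sample the mode before touching any counter. */
static bool toy_get_begin(const struct toy_ref *ref)
{
	return ref->is_percpu;
}

/* Second half of a get: act on the mode sampled earlier, even if stale. */
static void toy_get_finish(struct toy_ref *ref, int cpu, bool saw_percpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (saw_percpu)
		ref->percpu_count[cpu]++;
	else
		ref->atomic_count++;
}

/* A put that uses whatever mode is current at the time of the call. */
static void toy_put(struct toy_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (ref->is_percpu)
		ref->percpu_count[cpu]--;
	else
		ref->atomic_count--;
}

/* Step 3: fold every per-CPU delta into the shared counter. */
static void toy_sum_percpu(struct toy_ref *ref)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
}

/*
 * Replay one get/put pair on CPU 1 racing with the switch on CPU 0.
 * Returns the shared counter after every reference has been dropped.
 */
long toy_replay(bool wait_for_readers)
{
	/* The owner holds one initial reference. */
	struct toy_ref ref = { .atomic_count = 1, .is_percpu = true };

	bool saw_percpu = toy_get_begin(&ref);	/* CPU 1 sees percpu mode */
	ref.is_percpu = false;			/* step 1 on CPU 0 */

	if (wait_for_readers) {
		toy_get_finish(&ref, 1, saw_percpu);	/* step 2: reader completes */
		toy_sum_percpu(&ref);			/* step 3: sum sees its increment */
	} else {
		toy_sum_percpu(&ref);			/* summed too early */
		toy_get_finish(&ref, 1, saw_percpu);	/* increment misses the sum */
	}

	toy_put(&ref, 1);	/* CPU 1 drops its reference (atomic mode now) */
	toy_put(&ref, 0);	/* owner drops the initial reference */
	return ref.atomic_count;
}
```

In this model toy_replay(true) ends at 0, while toy_replay(false) ends
at -1, the value from the scenario quoted above. The only difference
between the two paths is whether the in-flight get is allowed to finish
before the per-CPU counters are summed.
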
>
> Thanks.
>
> --
> tejun