Re: [PATCH] leds: trigger: fix potential deadlock with libata

From: Hans de Goede
Date: Sun Mar 07 2021 - 14:05:47 EST


Hi,

On 3/7/21 5:13 PM, Pavel Machek wrote:
> Hi!
>
>>> --- a/drivers/leds/led-triggers.c
>>> +++ b/drivers/leds/led-triggers.c
>>> @@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig,
>>> enum led_brightness brightness)
>>> {
>>> struct led_classdev *led_cdev;
>>> + unsigned long flags;
>>>
>>> if (!trig)
>>> return;
>>>
>>> - read_lock(&trig->leddev_list_lock);
>>> + read_lock_irqsave(&trig->leddev_list_lock, flags);
>>> list_for_each_entry(led_cdev, &trig->led_cdevs, trig_list)
>>> led_set_brightness(led_cdev, brightness);
>>> - read_unlock(&trig->leddev_list_lock);
>>> + read_unlock_irqrestore(&trig->leddev_list_lock, flags);
>>> }
>>> EXPORT_SYMBOL_GPL(led_trigger_event);
>>
>> meanwhile this patch hit v5.10.x stable and caused a performance
>> degradation on our use case:
>>
>> It's an embedded ARM system, 4x Cortex A53, with an SPI attached CAN
>> controller. CAN stands for Controller Area Network and here used to
>> connect to some automotive equipment. Over CAN an ISOTP (a CAN-specific
>> Transport Protocol) transfer is running. With this patch, we see CAN
>> frames delayed for ~6ms, the usual gap between CAN frames is 240µs.
>>
>> Reverting this patch, restores the old performance.
>>
>> What is the best way to solve this dilemma? Identify the critical path
>> in our use case? Is there a way we can get around the irqsave in
>> led_trigger_event()?
>
> Hans was pushing for this patch, perhaps he has some ideas...

I was not pushing for this particular fix, I was asking about a fix
for the lockdep identified potential deadlock.

And you replied that this was already fixed in your for-next branch
when I asked, so all in all, other than reporting the potential deadlock
(after it was already fixed) I have very little to do with this patch.

With all that said, I must say that I'm surprised that switching from
read_lock() to read_lock_irqsave() causes such a hefty penalty, so I
wonder what is really going on here. Using the irqsave version disables
interrupts, but AFAIK only on the current core and only for the duration
of the led_set_brightness() call(s).

Is the system perhaps pinning IRQs to a specific CPU in combination with
a led_set_brightness() somehow taking much longer than it should?

Note that led_set_brightness() calls are not allowed to block. If a driver
needs to block, it should use the brightness_set_blocking callback in its
led_classdev struct, not the regular brightness_set callback, in which case
the LED-core will defer the actual setting of the LED to a workqueue.

So one thing which might be worthwhile to check is if any of the LED
drivers on the system in question are using the brightness_set callback,
where they should be using the blocking one.
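
To make the distinction between the two callbacks concrete, here is a
simplified userspace model of the dispatch described above. It is only a
sketch and not the actual LED-core code: struct model_led, the model_*
functions and the two example "drivers" are hypothetical names made up
for illustration, and a plain flag stands in for the workqueue.

```c
/*
 * Simplified userspace model of how the LED core picks a callback.
 * NOT kernel code: every name below is hypothetical.
 */
#include <stdbool.h>
#include <stddef.h>

struct model_led {
	/* Must not sleep; called directly, possibly with IRQs off. */
	void (*brightness_set)(struct model_led *led, int value);
	/* May sleep; only ever called from the deferred worker. */
	int (*brightness_set_blocking)(struct model_led *led, int value);
	int hw_value;       /* what the "hardware" currently shows */
	int delayed_value;  /* value waiting for the worker */
	bool work_pending;  /* stands in for a queued work item */
};

/* Models led_set_brightness(): fast path if possible, else defer. */
static void model_led_set_brightness(struct model_led *led, int value)
{
	if (led->brightness_set) {
		led->brightness_set(led, value);
		return;
	}
	/* No non-blocking callback: remember the value for later. */
	led->delayed_value = value;
	led->work_pending = true;
}

/* Models the work handler, which runs later in process context. */
static void model_led_run_work(struct model_led *led)
{
	if (!led->work_pending)
		return;
	led->work_pending = false;
	if (led->brightness_set_blocking)
		led->brightness_set_blocking(led, led->delayed_value);
}

/* Hypothetical fast driver, e.g. a single register write. */
static void gpio_like_set(struct model_led *led, int value)
{
	led->hw_value = value;
}

/* Hypothetical slow driver, e.g. a bus transfer that may sleep. */
static int i2c_like_set_blocking(struct model_led *led, int value)
{
	led->hw_value = value;
	return 0;
}
```

In this model a LED using gpio_like_set is updated inside the caller's
context, so its runtime counts against the time spent with interrupts
disabled, while a LED using i2c_like_set_blocking only gets a pending
work item there. A driver which does a slow bus transfer from
brightness_set would put that whole transfer inside the irqsave section.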

Regards,

Hans