Re: [PATCH] clockevents/drivers/cs5535: improve resilience to spurious interrupts

From: David Kozub
Date: Fri Oct 20 2017 - 10:59:55 EST


On Fri, 20 Oct 2017, Daniel Lezcano wrote:

On 20/10/2017 09:49, David Kozub wrote:
On Fri, 20 Oct 2017, Daniel Lezcano wrote:

On 20/10/2017 00:25, Thomas Gleixner wrote:
On Fri, 20 Oct 2017, Daniel Lezcano wrote:

On 19/10/2017 22:57, David Kozub wrote:
This solves a BUG on ALIX 2c3 where mfgpt_tick is called before
clockevents_config_and_register returns. This caused mfgpt_tick to
call a
null function pointer.

Thanks to Daniel Lezcano and Thomas Gleixner for helping me analyze
this
and suggesting a solution.

Suggested-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Signed-off-by: David Kozub <zub@xxxxxxxxxxxxxxxxxx>
---

Thank for sending this fix.

Can you check if the commit 8f9327cbb is the one introducing
the
regression ? So we can add the proper tags and propagate the fix to
stable.

No it's not.

-       if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
+       if (clockevent_state_shutdown(&cs5535_clockevent))

This particular problem of the missing detached state check has been
there
forever and went unnoticed for whatever reason.

The detached condition was artificially caught by the initialized
variable:

-static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;

The patch 8f9327cbb removes the variable, so very likely this is where
the problem appeared.

I will try to test that. But I won't have access to the device till
Sunday evening. I've had big trouble trying to run kernels > 4.1-rc5 on
the device and if I'm looking correctly the commit was introduced in
4.3-rc1. But I'll try to figure something out.

David,

thanks again for taking the time to report and investigate this issue.
Usually people just give up and drop the legacy hardware without letting
us know the kernel is broken with it. So don't spend too much time with
it, just check if the commit before works, if not, just add in the log
the kernel version you noticed the breakage.

In case you are interested in doing some debugging to narrow down the
offending commit and you know the versions working and not working, you
can try the command git-bisect. It will use a dichotomy approach to find
out the culprit.

Hi Daniel,

I originally didn't see this issue simply because 4.1-rc7 would not output anything on the serial on the device. For kernels >= 4.1-rc7 the kernel either reboots immediatelly or freezes, in both cases without giving any output in the serial. I tried to find the cause of this with git bisect but in the end I was none the wiser. I also suspected my build environment.

Eventually I found out the following:
* 4.1-rc6 boots OK
* 4.1-rc7 restarts immediatelly, but if I revert f18c34e48 it works but I have no idea why would this commit break anything, the commit looks OK
* 4.1 - works if I revert f18c34e48
* e75c73ad6 feeezes even after reverting f18c34e48
* 4.2-rc1 - reboots, even after reverting f18c34e48

Only recently I tried a current kernel and I was more lucky: it gave me some useful output on the serial.

To verify 8f9327cbb I was thinking I could take it and apply it onto 4.1-rc6 (if that was possible), but as Thomas suggested it might not be worth it, then I forget about that.

Thanks for all the help to you and to Thomas.

Best regards,
David