Re: kernel panics with 4.14.X versions

From: Pavlos Parissis
Date: Wed Apr 18 2018 - 04:32:33 EST


On 17/04/2018 02:12 PM, Jan Kara wrote:
> On Tue 17-04-18 01:31:24, Pavlos Parissis wrote:
>> On 16/04/2018 04:40 PM, Jan Kara wrote:
>
> <snip>
>
>>> How easily can you hit this?
>>
>> Very easily, I only need to wait 1-2 days for a crash to occur.
>
> I wouldn't call that very easily but opinions may differ :). Anyway it's
> good (at least for debugging) that it's reproducible.
>

Unfortunately, I can't reproduce it on demand, so waiting 1-2 days is the only option I have.

>>> Are you able to run debug kernels
>>
>> Well, I was under the impression I do as I have:
>> grep -E 'DEBUG_KERNEL|DEBUG_INFO' /boot/config-4.14.32-1.el7.x86_64
>> CONFIG_DEBUG_INFO=y
>> # CONFIG_DEBUG_INFO_REDUCED is not set
>> # CONFIG_DEBUG_INFO_SPLIT is not set
>> # CONFIG_DEBUG_INFO_DWARF4 is not set
>> CONFIG_DEBUG_KERNEL=y
>>
>> Do you think that my kernel doesn't produce a proper crash dump?
>> I have a production cluster where I can run any kernel we need, so if I need
>> to compile again with different settings I can certainly do that.
>
> OK, good. So please try running 4.16 as you mention below to verify whether
> this is just a -stable regression or also a problem in the current upstream
> kernel. Based on your results with 4.16 I'll prepare a debug patch for you to
> apply on top of 4.14.32 so that we can debug this further.
>
>>> / inspect
>>> crash dumps when the issue occurs?
>>
>> I can't do that as the server isn't responsive and I can only power cycle it.
>
> Well, kernel crash dumps work in that situation as well - when the kernel
> panics, it will kexec into a new kernel and dump memory of the old kernel
> to disk. It can then be investigated with the 'crash' utility. But
> obviously you don't have this set up and don't have experience with this so
> let's go via a standard 'debug patch' route.
>
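For reference, whether a machine is set up for this can be checked roughly as
follows. This is only a minimal sketch: it assumes the kernel exposes
/sys/kernel/kexec_crash_loaded and that memory for the capture kernel is
reserved with crashkernel= on the boot command line. The function name
kdump_status is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a crash (kdump) capture kernel is reserved and loaded.
# Both paths are parameters so the check can be pointed at other files.

kdump_status() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}

    # crashkernel= on the boot command line reserves memory for the capture kernel
    if grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "crashkernel: reserved"
    else
        echo "crashkernel: not reserved"
    fi

    # kexec_crash_loaded reads 1 once a capture kernel has been loaded
    if [ "$(cat "$loaded_file" 2>/dev/null)" = "1" ]; then
        echo "capture kernel: loaded"
    else
        echo "capture kernel: not loaded"
    fi
}

kdump_status "$@"
```

If either check reports "not", a panic will not produce a dump, and the kdump
service (kexec-tools on CentOS 7, as far as I know) would need to be configured
first.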
>>> Also testing with the latest mainline
>>> kernel (4.16) would be welcome whether this isn't just an issue with the
>>> backport of fsnotify fixes from Miklos.
>>
>> I can try the kernel-ml-4.16.2 from elrepo (we use CentOS 7).
>
> Yes, that would be good.
>

I have a production server running 4.16.2 and no kernel crash dumps yet.
Let's wait another day before we say anything.

Cheers,
Pavlos
