Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11

From: Patrick McLean
Date: Thu Nov 09 2017 - 20:59:13 EST


On 2017-11-09 12:04 PM, Linus Torvalds wrote:
> On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean <chutzpah@xxxxxxxxxx> wrote:
>>
>> We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and
>> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as
>> CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
>
> It might be worth just verifying without RANDSTRUCT in particular.
>
> And most obviously: if there is some module or part of the kernel that
> got compiled with a different seed for the randstruct hashing, that
> will break in nasty nasty ways. Your out-of-kernel module is the
> obvious suspect for something like that, but honestly, it could be
> some missing build dependency, or simply a missing special case in the
> plugin itself a missing __no_randomize_layout or any number of things.
>

We will check our fork against the in-kernel cp210x driver to make sure
we didn't miss anything, but it seems odd we would be hitting the issue
so consistently in the NFS code path, rather than somewhere in USB,
serial, or GPIO paths.
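
For reference, here is a minimal userspace sketch of the seed-mismatch
failure mode described above. It is only an illustration: the struct
and function names are hypothetical, the two layouts are shuffled by
hand rather than by the plugin, and none of it is taken from the kernel
or from our driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout as seen by code built with one (hypothetical) seed. */
struct obj_seed_a {
	uint32_t refcount;
	uint32_t flags;
	uint32_t state;
};

/* Same logical struct, fields reordered as another seed might do. */
struct obj_seed_b {
	uint32_t state;
	uint32_t refcount;
	uint32_t flags;
};

static_assert(sizeof(struct obj_seed_a) == sizeof(struct obj_seed_b),
	      "both views must cover the same bytes");

/* "Kernel" side: reads refcount using the layout the object was built with. */
uint32_t kernel_read_refcount(const void *obj)
{
	struct obj_seed_a view;

	memcpy(&view, obj, sizeof(view));
	return view.refcount;
}

/* "Module" side: reads the same bytes through the other layout. */
uint32_t module_read_refcount(const void *obj)
{
	struct obj_seed_b view;

	memcpy(&view, obj, sizeof(view));
	return view.refcount;
}
```

With an object initialized through struct obj_seed_a as
{ .refcount = 1, .flags = 0xdead, .state = 7 }, the "kernel" reader
returns 1 while the "module" reader returns the flags value, since its
refcount sits at a different offset. Nothing fails at build time, which
is why a mismatch like this would only show up as corruption at runtime.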

> So since you seem to be able to reproduce this _reasonably_ easily,
> it's definitely worth checking that it still reproduces even without
> the gcc plugins.

I haven't been able to reproduce it with RANDSTRUCT disabled (and
structleak enabled). I will keep trying for a little while longer, but
the evidence so far points to RANDSTRUCT.

Something must have changed since 4.13.8 to trigger this though. This
did not crop up at all until we tried 4.13.11, where we saw it pretty
quickly. We have a pretty large number of machines running 4.13.6 with
RANDSTRUCT enabled and running the same workload with many more
clients, and have not seen this bug at all.