Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11

From: Patrick McLean
Date: Thu Nov 09 2017 - 20:47:42 EST




On 2017-11-09 11:51 AM, Patrick McLean wrote:
> On 2017-11-09 11:37 AM, Al Viro wrote:
>> On Wed, Nov 08, 2017 at 06:40:22PM -0800, Linus Torvalds wrote:
>>
>>>> Here is the BUG we are getting:
>>>>> [ 58.962528] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
>>>>> [ 58.963918] IP: vfs_statfs+0x73/0xb0
>>>
>>> The code disassembles to
>>
>>> 2a:* 48 8b b7 30 02 00 00 mov 0x230(%rdi),%rsi <-- trapping instruction
>>
>>> that matters (and that traps) but I'm almost certain that it's the
>>> "mnt->mnt_sb->s_flags" loading that is part of calculate_f_flags()
>>> when it then does
>>>
>>> flags_by_sb(mnt->mnt_sb->s_flags);
>>>
>>> and I think mnt->mnt_sb is NULL. We know it's not 'mnt' itself that is
>>> NULL, because we wouldn't have gotten this far if it was.
>>>
>>
>> All instances of struct dentry are created by __d_alloc()[*], which assigns
>> ->d_sb (never to be modified afterwards) *and* dereferences the pointer
>> it has stored in ->d_sb before the created struct dentry becomes visible
>> to anyone else. No struct dentry should ever be observed with NULL ->d_sb;
>> the only way to get that is memory corruption or looking at freed instance
>> after its memory has been reused for something else and zeroed.
>>
>> In other words, we should never observe a struct mount with NULL ->mnt.mnt_sb -
>> not without memory corruption or looking at freed instance.
>>
>> The pointer in that case should've come from exp->ex_path.mnt, exp being
>> the argument of nfsd4_encode_fattr(). Sure, it might have been a dangling
>> reference. However, it looks a lot more like a memory corruptor *OR*
>> miscompiled kernel.
>>
>> What kind of load do the reproducer boxen have and how fast does that
>> bug trigger? Would it be possible to slap something like
>> if (unlikely(!exp->exp_path.mnt->mnt_sb)) {
>> struct mount *m = real_mount(exp->exp_path.mnt);
>> printk(KERN_ERR "mnt: %p\n", exp->exp_path.mnt);
>> printk(KERN_ERR "name: [%s]\n", m->mnt_devname);
>> printk(KERN_ERR "ns: [%p]\n", m->mnt_ns);
>> printk(KERN_ERR "parent: [%p]\n", m->mnt_parent);
>> WARN_ON(1);
>> err = -EINVAL;
>> goto out_nfserr;
>> }
>> in the beginning of nfsd4_encode_fattr() (with include of ../mount.h added
>> in fs/nfsd/nfs4xdr.c) and see what will it catch?
>>
>> Both with and without randomized structs, if possible - I might be barking
>> at the wrong tree, but IMO the very first step in localizing that crap is
>> to find out whether it's toolchain-related or not.
>

That condition did not seem to trigger, and I am getting a slightly
different crash message (GPF rather than null pointer dereference). Here
is the dump from the latest crash (with CONFIG_GCC_PLUGIN_STRUCTLEAK,
CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL and CONFIG_GCC_PLUGIN_RANDSTRUCT
all enabled).

> [ 36.834232] general protection fault: 0000 [#1] SMP
> [ 36.835168] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable_raw iptable_nat nf_nat_ipv4 nf_nat gkuart(O) usbserial x86_pkg_temp_thermal ie31200_edac tpm_tis ipmi_ssif tpm_tis_core ext4 mbcache jbd2 e1000e crc32c_intel
> [ 36.839120] CPU: 1 PID: 3969 Comm: nfsd Tainted: G O 4.14.0-rc8-git-kratos-1-00053-gd93d4ce103fd-dirty #1
> [ 36.840883] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013
> [ 36.841892] task: ffff88040a0b1c80 task.stack: ffffc900027bc000
> [ 36.842887] RIP: 0010:vfs_statfs+0x73/0xb0
> [ 36.843728] RSP: 0018:ffffc900027bfb30 EFLAGS: 00010202
> [ 36.844687] RAX: 0000000000000000 RBX: ffffc900027bfbf8 RCX: 000000000000180d
> [ 36.845891] RDX: 000000000000080d RSI: 0000000000000020 RDI: e2006d6574737973
> [ 36.847075] RBP: ffffc900027bfbc8 R08: 0000000000000000 R09: 00000000000000ff
> [ 36.848175] R10: 000000000038be3a R11: ffff88040b687578 R12: 0000000000000000
> [ 36.849260] R13: ffff88040d7dc400 R14: ffff88040d38b000 R15: ffffc900027bfbf8
> [ 36.850347] FS: 0000000000000000(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
> [ 36.851891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 36.852873] CR2: 00007f049228edc0 CR3: 0000000001e0a004 CR4: 00000000001606e0
> [ 36.853942] Call Trace:
> [ 36.854667] nfsd4_encode_fattr+0x34e/0x23b0
> [ 36.855578] ? ext4_get_acl+0x1b2/0x260 [ext4]
> [ 36.856485] ? get_acl+0x7a/0xf0
> [ 36.857266] ? generic_permission+0x125/0x1a0
> [ 36.858150] nfsd4_encode_getattr+0x25/0x30
> [ 36.859002] nfsd4_encode_operation+0x98/0x1a0
> [ 36.859889] nfsd4_proc_compound+0x3eb/0x5c0
> [ 36.860736] nfsd_dispatch+0xa8/0x230
> [ 36.861538] svc_process_common+0x347/0x640
> [ 36.862383] svc_process+0x100/0x1b0
> [ 36.863204] nfsd+0xe0/0x150
> [ 36.863984] kthread+0xfc/0x130
> [ 36.864781] ? nfsd_destroy+0x50/0x50
> [ 36.865624] ? kthread_create_on_node+0x40/0x40
> [ 36.866529] ? do_group_exit+0x3a/0xb0
> [ 36.867362] ret_from_fork+0x25/0x30
> [ 36.868188] Code: d1 83 c9 08 40 f6 c6 04 0f 45 d1 89 d1 80 cd 04 40 f6 c6 08 0f 45 d1 89 d1 80 cd 08 40 f6 c6 10 0f 45 d1 89 d1 80 cd 10 83 e6 20 <48> 8b b7 b0 05 00 00 0f 45 d1 83 ca 20 89 f1 83 e1 10 89 cf 83
> [ 36.871101] RIP: vfs_statfs+0x73/0xb0 RSP: ffffc900027bfb30
> [ 36.872059] ---[ end trace 603ac898c4e2d616 ]---

I haven't been able to reproduce it with CONFIG_GCC_PLUGIN_RANDSTRUCT
disabled, so it seems like it must be a bug there. It's odd that it just
surfaced recently though, we have been using that since it was added.

> The reproducer boxen are not under particularly heavy load, they are
> serving NFS to 1 or 2 clients (which are essentially embedded devices).
> When the bug triggers, it usually triggers pretty fast and reliably, but
> it seems to only trigger on some subset of bootups. Once it fails to
> trigger, we seem to have to reboot to get it to trigger.
>
> I should be able to have some results with that added in a few hours.
> It's weirdly unreliable to reproduce this.
>
> We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and
> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as
> CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.
>