Re: Still a pretty bad time on 5.4.6 with fuse_request_end.

From: Michael Stapelberg
Date: Wed Feb 12 2020 - 02:59:02 EST


Sure: here's a backtrace and req->args printed:

(gdb) bt full
#0 0xffffffff82000ff3 in __x86_indirect_thunk_r8 () at
arch/x86/lib/retpoline.S:40
No locals.
#1 0xffffffff8137ec68 in fuse_request_end (fc=0xffff88813a57be00,
req=0xffff88813a5a3770) at fs/fuse/dev.c:329
fiq = 0xffff88813a57be48
async = true
#2 0xffffffff81382d88 in fuse_dev_do_write (fud=0xffff88813a57be00,
cs=0xffffc90002fafa00, nbytes=4294967294) at fs/fuse/dev.c:1912
err = 0
fc = 0xffff88813a57be00
fpq = 0xffff888132815f48
req = 0xffff88813a5a3770
oh = {
len = 16,
error = -2,
unique = 66942
}
#3 0xffffffff81382e69 in fuse_dev_write (iocb=0xffffc900008cbe48,
from=0xffffc900008cbe20) at fs/fuse/dev.c:1934
cs = {
write = 0,
req = 0xffff88813a5a3770,
iter = 0xffffc900008cbe20,
pipebufs = 0x0 <fixed_percpu_data>,
currbuf = 0x0 <fixed_percpu_data>,
pipe = 0x0 <fixed_percpu_data>,
nr_segs = 0,
pg = 0x0 <fixed_percpu_data>,
len = 0,
offset = 24,
move_pages = 0
}
fud = 0xffff888132815f40
#4 0xffffffff8120122e in call_write_iter (file=<optimized out>,
iter=<optimized out>, kio=<optimized out>) at
./include/linux/fs.h:1901
No locals.
#5 new_sync_write (filp=0xffff888119886b00, buf=<optimized out>,
len=<optimized out>, ppos=0xffffc900008cbee8) at fs/read_write.c:483
iov = {
iov_base = 0xc0008ec008,
iov_len = 16
}
kiocb = {
ki_filp = 0xffff888119886b00,
ki_pos = 0,
ki_complete = 0x0 <fixed_percpu_data>,
private = 0x0 <fixed_percpu_data>,
ki_flags = 0,
ki_hint = 0,
ki_ioprio = 0,
ki_cookie = 0
}
iter = {
type = 5,
iov_offset = 0,
count = 0,
{
iov = 0xffffc900008cbe20,
kvec = 0xffffc900008cbe20,
bvec = 0xffffc900008cbe20,
pipe = 0xffffc900008cbe20
},
{
nr_segs = 0,
{
head = 0,
start_head = 0
}
}
}
ret = <optimized out>
#6 0xffffffff812012e4 in __vfs_write (file=<optimized out>,
p=<optimized out>, count=<optimized out>, pos=<optimized out>) at
fs/read_write.c:496
No locals.
#7 0xffffffff81203f04 in vfs_write (pos=<optimized out>, count=16,
buf=<optimized out>, file=<optimized out>) at fs/read_write.c:558
ret = 16
ret = <optimized out>
#8 vfs_write (file=0xffff888119886b00, buf=0xc0008ec008 "\020",
count=16, pos=0xffffc900008cbee8) at fs/read_write.c:542
ret = 16
#9 0xffffffff812041b2 in ksys_write (fd=<optimized out>,
buf=0xc0008ec008 "\020", count=16) at fs/read_write.c:611
pos = 0
ppos = <optimized out>
f = <optimized out>
ret = 824643076104
#10 0xffffffff81204245 in __do_sys_write (count=<optimized out>,
buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:623
No locals.
#11 __se_sys_write (count=<optimized out>, buf=<optimized out>,
fd=<optimized out>) at fs/read_write.c:620
ret = <optimized out>
ret = <optimized out>
#12 __x64_sys_write (regs=<optimized out>) at fs/read_write.c:620
No locals.
#13 0xffffffff810028a8 in do_syscall_64 (nr=<optimized out>,
regs=0xffffc900008cbf58) at arch/x86/entry/common.c:294
ti = <optimized out>
#14 0xffffffff81e0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:175
No locals.
#15 0x0000000000000000 in ?? ()
No symbol table info available.

(gdb) p *req->args
$5 = {
nodeid = 18446683600620026424,
opcode = 2167928246,
in_numargs = 65535,
out_numargs = 65535,
force = false,
noreply = false,
nocreds = false,
in_pages = false,
out_pages = false,
out_argvar = true,
page_zeroing = true,
page_replace = false,
in_args = {{
size = 978828800,
value = 0x2fafce0
}, {
size = 978992728,
value = 0xffffffff8138efaa <fuse_alloc_forget+26>
}, {
size = 50002688,
value = 0xffffffff8138635f <fuse_lookup_name+255>
}},
out_args = {{
size = 570,
value = 0xffffc90002fafb10
}, {
size = 6876,
value = 0x3000000001adc
}},
end = 0x1000100000001
}

Independently, as a separate test, I have also modified the source like this:

bool async;
bool async_early = req->args->end; /* read before FR_FINISHED is set */

if (test_and_set_bit(FR_FINISHED, &req->flags))
goto put_request;

async = req->args->end; /* original read, after the bit is set */

…and printed the value of async and async_early. async is true,
async_early is false.

Perhaps some other routine is modifying the request, and checking
req->args->end early enough masks that bug?

On Tue, Feb 11, 2020 at 11:55 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Sun, Feb 9, 2020 at 9:09 AM <michael+lkml@xxxxxxxxxxxxx> wrote:
> >
> > From: Michael Stapelberg <michael+lkml@xxxxxxxxxxxxx>
> >
> > Hey,
> >
> > I recently ran into this, too. The symptom for me is that processes using the
> > affected FUSE file system hang indefinitely, sync(2) system calls hang
> > indefinitely, and even triggering an abort via echo 1 >
> > /sys/fs/fuse/connections/*/abort does not get the file system unstuck (there is
> > always 1 request still pending). Only removing power will get the machine
> > unstuck.
> >
> > I'm triggering this when building packages for https://distr1.org/, which uses a
> > FUSE daemon (written in Go using the jacobsa/fuse package) to provide package
> > contents.
> >
> > I bisected the issue to commit
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2b319d1f6f92a4ced9897678113d176ee16ae85d
> >
> > With that commit, I run into a kernel oops within ~1 minute after starting my
> > batch build. With the commit before, I can batch build for many minutes without
> > issues.
>
> Pretty weird. I'm not seeing how this could change behavior, as the
> args->end value is not changed after being initialized, and so moving
> the test later should not make a difference.
>
> Could you print out the complete contents of req->args?
>
> Thanks,
> Miklos