Re: Regression in 5.1.20: Reading long directory fails

From: Jason L Tibbitts III
Date: Thu Aug 22 2019 - 15:55:38 EST


I now have another user reporting the same failure of readdir on a long
directory which showed up in 5.1.20 and was traced to
3536b79ba75ba44b9ac1a9f1634f2e833bbb735c. I'm not sure what to do to
get more traction besides reposting and adding some addresses to the CC
list. If there is any information I can provide which might help to get
to the bottom of this, please let me know.

To recap:

5.1.20 introduced a regression reading some large directories. In this
case, the directory should have 7800 files or so in it:

[root@ld00 ~]# ls -l ~dblecher|wc -l
ls: reading directory '/home/dblecher': Input/output error
1844
[root@ld00 ~]# cat /proc/version
Linux version 5.1.20-300.fc30.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)) #1 SMP Fri Jul 26 15:03:11 UTC 2019

(The server is a CentOS 7 machine running kernel 3.10.0-957.12.2.el7.x86_64.)

Building a kernel which reverts commit
3536b79ba75ba44b9ac1a9f1634f2e833bbb735c:
  Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
fixes the issue, but of course that commit was itself fixing a real
problem (a memory leak), so I'm not sure what to do.

I can trivially reproduce this by simply trying to list the problematic
directories, but I'm not sure how to construct such a directory; simply
creating 10000 files doesn't cause the problem for me. I am willing to
test patches and can build my own kernels, and I'm happy to provide any
debugging information you might require. Unfortunately I don't know
enough to dig in and figure out for myself what's going wrong.
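
For anyone who wants to check a directory without going through ls, here
is a minimal sketch in Python that makes one pass over a directory and
reports how many entries were returned before any error. The function
name count_entries and the output format are made up for illustration;
it is not a diagnosis of the bug, only a way to capture the errno and
the point of failure.

```python
import errno
import os
import sys


def count_entries(path):
    """Make one pass over `path` and return (entries_read, error).

    `error` is None if the end of the directory was reached cleanly,
    otherwise it is the OSError raised while opening or iterating
    the directory (for example EIO).
    """
    seen = 0
    try:
        with os.scandir(path) as it:
            for _ in it:
                seen += 1
    except OSError as exc:
        return seen, exc
    return seen, None


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    n, err = count_entries(target)
    if err is None:
        print(f"{target}: {n} entries, no error")
    else:
        name = errno.errorcode.get(err.errno, "?")
        print(f"{target}: failed after {n} entries: {name} ({err.strerror})")
        sys.exit(1)
```

The count will not match "ls -l | wc -l" exactly, since ls -l prints a
"total" line and skips dotfiles, while os.scandir includes dotfiles and
omits "." and "..".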

I did file https://bugzilla.redhat.com/show_bug.cgi?id=1740954 just to
have this in a bug tracker somewhere. I'm happy to file one somewhere
else if that would help.

- J<