[RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

From: Abhijith Das
Date: Fri Jul 25 2014 - 13:37:30 EST


Hi all,

The topic of a readdirplus-like syscall had come up for discussion at last year's
LSF/MM collab summit. I wrote a couple of syscalls, along with GFS2 implementations,
that retrieve a directory's entries as well as stat() info for the individual inodes.
I'm presenting these patches and some early test results on a single-node GFS2
filesystem.

1. dirreadahead() - This patchset is very simple compared to the xgetdents() system
call below and scales very well for large directories in GFS2. dirreadahead() is
designed to be called prior to getdents+stat operations. In its current form, it
only speeds up stat() operations by caching the relevant inodes. Support can be
added in the future to cache extended attribute blocks as well. This works by first
collecting all the inode numbers of the directory's entries (subject to a numeric
or memory cap). This list is sorted by inode disk block order and passed on to
workqueues to perform lookups on the inodes asynchronously to bring them into the
cache.
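
To make the intended usage concrete, here is a minimal userspace sketch of the call
pattern. The syscall number and the (fd, offset, count) prototype below are
assumptions for illustration only, not the patchset's final ABI; the point is simply
that the readahead call is issued before the usual getdents+stat pass.

/*
 * Minimal usage sketch only - NOT the final ABI. __NR_dirreadahead and
 * the (fd, offset, count) prototype are assumed here for illustration;
 * a kernel with the patchset applied defines the real syscall number.
 */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_dirreadahead
#define __NR_dirreadahead -1	/* placeholder; a patched kernel provides the number */
#endif

/* Assumed wrapper: ask the fs to start pulling the directory's inodes
 * into the inode cache, up to 'count' entries starting at *offset. */
static long dirreadahead(int fd, loff_t *offset, unsigned int count)
{
	return syscall(__NR_dirreadahead, fd, offset, count);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	int fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return 1;

	/* Fire off the asynchronous inode lookups; failure (e.g. ENOSYS on
	 * an unpatched kernel) is harmless since this is only a hint. */
	loff_t off = 0;
	dirreadahead(fd, &off, 10000);

	/* Normal getdents+stat pass; the stat()s should now mostly hit
	 * inodes that the readahead already brought into the cache. */
	DIR *dir = fdopendir(fd);
	if (!dir)
		return 1;
	struct dirent *d;
	while ((d = readdir(dir)) != NULL) {
		struct stat st;
		fstatat(dirfd(dir), d->d_name, &st, AT_SYMLINK_NOFOLLOW);
	}
	closedir(dir);
	return 0;
}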

2. xgetdents() - I had posted a version of this patchset some time last year and
it is largely unchanged - I just ported it to the latest upstream kernel. It
allows the user to request a combination of entries, stat and xattrs (keys/values)
for a directory. The stat portion is based on David Howells' xstat patchset, which
he also posted last year. I've included the relevant VFS bits in my patchset.
xgetdents() in GFS2 works in two phases. In the first phase, it collects all the
dirents by reading the directory in question. In phase two, it reads in inode
blocks and xattr blocks (if requested) for each entry after sorting the disk
accesses in block order. All of the intermediate data is stored in a buffer backed
by a vector of pages and is eventually transferred to the user-supplied buffer.
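
As a rough illustration of the caller-visible idea: the flag names, record layout
and wrapper notion below are invented for this sketch (the real definitions live in
the xgetdents and xstat patches); they only show the idea of requesting dirents,
stat data and xattrs in one call and walking variable-length records back out of
the buffer.

/*
 * Rough sketch of the caller-visible idea only. The flag names and
 * record layout below are assumptions made for illustration; the
 * actual ABI is defined by the xgetdents/xstat patches themselves.
 */
#include <stdint.h>
#include <stdio.h>

/* What the caller asks for, as a combinable bitmask (assumed names). */
#define XGD_WANT_DIRENTS 0x1	/* names + inode numbers */
#define XGD_WANT_STAT    0x2	/* xstat-style attributes per entry */
#define XGD_WANT_XATTRS  0x4	/* xattr keys/values per entry */

/* One variable-length record per directory entry (assumed layout):
 * header, then the name, then optional stat and xattr payloads. */
struct xgd_record {
	uint64_t ino;      /* inode number */
	uint32_t reclen;   /* total size of this record, for walking the buffer */
	uint32_t present;  /* which requested pieces actually came back */
	char     name[];   /* NUL-terminated entry name, payload follows */
};

int main(void)
{
	unsigned int request = XGD_WANT_DIRENTS | XGD_WANT_STAT | XGD_WANT_XATTRS;

	/* A real caller would pass 'request' plus a large user buffer to
	 * xgetdents() and then iterate the returned records, advancing by
	 * rec->reclen until the returned byte count is consumed. */
	printf("request mask: %#x, header size: %zu bytes\n",
	       request, sizeof(struct xgd_record));
	return 0;
}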

Both syscalls perform significantly better than a simple getdents+stat with a cold
cache. The main advantage lies in being able to sort the disk accesses for a batch
of inodes in advance, instead of seeking all over the disk to read one inode at a
time.
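
The ordering benefit is easy to approximate from userspace: collect every entry
first, sort by inode number (used here as a stand-in for disk block order), and only
then stat them. The sketch below does just that; it is neither syscall, only a
demonstration of why sorting a batch of lookups beats stat'ing entries in directory
order on a cold cache.

/*
 * Userspace approximation of the ordering trick: read the whole
 * directory, sort entries by inode number (a rough proxy for disk
 * block order), then stat them in that order.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

struct ent {
	ino_t ino;
	char name[256];
};

static int by_ino(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *dir = opendir(path);
	if (!dir)
		return 1;

	/* Phase 1: collect every entry's name and inode number. */
	size_t n = 0, cap = 1024;
	struct ent *ents = malloc(cap * sizeof(*ents));
	if (!ents)
		return 1;
	struct dirent *d;
	while ((d = readdir(dir)) != NULL) {
		if (n == cap)
			ents = realloc(ents, (cap *= 2) * sizeof(*ents));
		ents[n].ino = d->d_ino;
		snprintf(ents[n].name, sizeof(ents[n].name), "%s", d->d_name);
		n++;
	}

	/* Phase 2: sort by inode number, then stat in that order so the
	 * disk reads are roughly sequential instead of random. */
	qsort(ents, n, sizeof(*ents), by_ino);
	for (size_t i = 0; i < n; i++) {
		struct stat st;
		fstatat(dirfd(dir), ents[i].name, &st, AT_SYMLINK_NOFOLLOW);
	}

	printf("statted %zu entries in inode order\n", n);
	free(ents);
	closedir(dir);
	return 0;
}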

This graph (https://www.dropbox.com/s/fwi1ovu7mzlrwuq/speed-graph.png) shows the
time taken to get directory entries and their respective stat info by 3 different
sets of syscalls:

1) getdents+stat ('ls -l', basically) - Solid blue line
2) xgetdents with various buffer sizes and num_entries limits - Dotted lines
E.g. v16384 d10000 means a limit of 16384 pages for the scratch buffer and
a maximum of 10000 entries to collect at a time.
3) dirreadahead+getdents+stat with various num_entries limits - Dash-dot lines
E.g. d10000 implies that it would fire off a max of 10000 inode lookups during
each syscall.

numfiles:                        10000     50000    100000    500000
--------------------------------------------------------------------
getdents+stat                     1.4s      220s      514s     2441s
xgetdents                         1.2s       43s       75s     1710s
dirreadahead+getdents+stat        1.1s        5s       68s      391s

Here is a seekwatcher graph from a test run on a directory of 50000 files.
(https://www.dropbox.com/s/fma8d4jzh7365lh/50000-combined.png) The comparison is
between getdents+stat and xgetdents. The first set of plots is of getdents+stat,
followed by xgetdents() with steadily increasing buffer sizes (256 to 262144) and
num_entries (100 to 1000000) limits. One can see the effect of ordering the disk
reads in the Disk IO portion of the graphs and the corresponding effect on seeks,
throughput and overall time taken.

This second seekwatcher graph similarly shows the dirreadahead()+getdents()+stat()
syscall-combo for a 500000-file directory with increasing num_entries (100 to
1000000) limits versus getdents+stat.
(https://www.dropbox.com/s/rrhvamu99th3eae/500000-ra_combined_new.png)
The corresponding getdents+stat baseline for this run is at the top of the series
of graphs.

I'm posting these two patchsets shortly for comments.

Cheers!
--Abhi

Red Hat Filesystems