Re: [PATCH v5 00/10] VFS hot tracking

From: Al Viro
Date: Mon Sep 23 2013 - 20:21:05 EST


On Tue, Sep 17, 2013 at 06:17:45AM +0800, zwu.kernel@xxxxxxxxx wrote:
> From: Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx>
>
> This patchset introduces a hot tracking function in the
> VFS layer, which keeps track of real disk I/O in memory.
> With it, you can easily learn more details about disk I/O
> and detect where the disk I/O hot spots are. A specific FS
> can also make use of it for accurate defragmentation, hot
> relocation support, etc.
>
> Now it's time to send out V5 for external review; any
> comments or ideas are appreciated. Thanks.

FWIW, the most fundamental problem I see with this is that the data
you are collecting is very sensitive to VM pressure details. The
hotspots wrt accesses (i.e. the stuff accessed all the time) will
not generate a lot of IO - they'll just sit in cache and look
very cold to your code. The stuff that is accessed very rarely
will also look cold. The ones that look hot will be those that
get in and out of cache often; IOW, the ones that are borderline -
a bit less memory pressure and they would've stayed in cache. I
would expect that to vary a great deal from run to run, which would
make its use for decisions like SSD vs. HD rather dubious...

Have you collected that data on some real tasks under varying
memory-pressure conditions? How stable are the results?

Another question: do you have perf profiles for a system with that
stuff enabled, on boxen with many CPUs? You are using a lot of
spinlocks there; how much contention and cacheline ping-pong are
you getting on root->t_lock, for example? Ditto for cacheline
ping-pong on root->hot_cnt, while we are at it...

Comments re code:

* don't export the stuff until it's used by a module. And as
a general policy, please, do not use EXPORT_SYMBOL_GPL in fs/*.
Either don't export at all, or pick a sane API that would not
expose the guts of your code (== wouldn't require you to look
at the users to decide what will and will not break on changes
in your code) and export that. As far as I'm concerned,
all variants of EXPORT_SYMBOL are greatly overused and
EXPORT_SYMBOL_GPL is an exercise in masturbation...
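
To illustrate what a sane exported API could look like (a sketch only;
everything below except the hot_freqs_update name is hypothetical, not
taken from the patchset): keep the item/tree internals private to
fs/hot_tracking.c and export one opaque entry point:

    /* fs/hot_tracking.c -- hypothetical narrow interface */
    struct hot_info;	/* definition never leaves this file */

    /* the one call filesystems get; all tree manipulation stays inside */
    void hot_freqs_update(struct inode *inode, loff_t start,
                          size_t len, int rw)
    {
        /* ... internal lookup/update, invisible to users ... */
    }
    EXPORT_SYMBOL(hot_freqs_update);

With that, changes to the internals can't break modules, so you never
have to audit the users.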

* hot_inode_item_lookup() is a couple of functions smashed together;
split it, please, and lose the "alloc" argument.
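
A minimal sketch of the split (prototypes are assumptions based on the
names in question, not the actual code):

    /* pure lookup: find an existing item or return NULL; never allocates */
    struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
                                                 u64 ino);

    /* separate allocate-and-insert path for the callers that need it */
    struct hot_inode_item *hot_inode_item_create(struct hot_info *root,
                                                 u64 ino);

Callers that only want to look something up then stop paying for, and
reading around, the allocation logic.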

* hot_inode_item_unlink() is used when the last link is killed
by unlink(2), but not when it's killed by successful rename(2).
Why?

* what happens when one opens a file, unlinks it, and starts doing
IO? hot_freqs_update() will be called, re-creating the inode item
that unlink has killed...
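
A sketch of one possible guard in the update path (assuming the check
can be made where the item would be re-created; the exact locking is
the patchset's problem to get right):

    /* don't resurrect tracking state for an unlinked-but-open file */
    if (inode->i_nlink == 0)
        return;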

* for pity's sake, use inlines - the same hot_freqs_update() on a filesystem
that doesn't have this stuff enabled will *still* branch pretty far
out of line, only to return after checking that ->s_hot_root is NULL.
Incidentally, we still have about twenty spare bits in inode flags;
use one (S_TEMP_COLLECTED, or something like that) instead of that
check. Checking it is considerably cheaper than checking ->i_sb->s_hot_root.
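
Something along these lines (a sketch; S_TEMP_COLLECTED is the flag
suggested above, not an existing one, and __hot_freqs_update() is a
hypothetical out-of-line worker with an assumed signature):

    /* include/linux/hot_tracking.h */
    void __hot_freqs_update(struct inode *inode, loff_t start,
                            size_t len, int rw);

    static inline void hot_freqs_update(struct inode *inode, loff_t start,
                                        size_t len, int rw)
    {
        /* one test on a flag word we already have in hand, instead
         * of chasing ->i_sb->s_hot_root out of line */
        if (unlikely(inode->i_flags & S_TEMP_COLLECTED))
            __hot_freqs_update(inode, start, len, rw);
    }

That way the common (not-tracked) case costs a single predictable branch.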

* hot_bit_shift() is bloody annoying. Why does true mean "up" and
false mean "down", and why is that easier to memorize than just using
explicit << and >>?
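
I.e., instead of (call made up for illustration):

    counter = hot_bit_shift(counter, 3, true);	/* which way was true? */

just write:

    counter <<= 3;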

* file->f_dentry->d_inode is spelled file_inode(file), TYVM (so is
file->f_path.dentry->d_inode, actually).
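
I.e.:

    struct inode *inode = file_inode(file);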

* #ifdef __KERNEL__ in include/linux/* is wrong; use include/uapi/linux/*
for the bits userland needs to see.
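
I.e., split the header (paths assumed from the patchset's naming): the
ioctl numbers and structs userland needs go in
include/uapi/linux/hot_tracking.h with no __KERNEL__ guard at all, and
include/linux/hot_tracking.h keeps the kernel-only parts:

    /* include/linux/hot_tracking.h -- kernel-only bits */
    #include <uapi/linux/hot_tracking.h>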

* checks for ->i_nlink belong under ->i_mutex. As it is, two unlink(2)
calls killing the last two links to the same file can very well _both_
call hot_inode_item_unlink(), with obvious unpleasant results.
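
I.e., something like this in the unlink path (a sketch; the i_mutex in
question is the victim inode's, which vfs_unlink() already holds there):

    /* after drop_nlink(), while ->i_mutex is still held: only the
     * unlink that took the count 1 -> 0 sees zero here */
    if (inode->i_nlink == 0)
        hot_inode_item_unlink(inode);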