Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

From: Bernd Schubert
Date: Thu May 05 2022 - 11:13:17 EST




On 5/5/22 14:54, Vivek Goyal wrote:
> On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
>> Here are the numbers I took last time. These were taken on tmpfs to
>> actually see the effect of reduced calls. On local file systems it
>> might not be that much visible. But we have observed that on systems
>> where we have thousands of clients hammering the metadata servers, it
>> helps a lot (We did not take numbers yet as we are required to change
>> a lot of our client code but would be doing it later on).
>>
>> Note that for a change in performance number due to the new version of
>> these patches, we have just refactored the code and functionality has
>> remained the same since then.
>>
>> here is the link to the performance numbers
>> https://lore.kernel.org/linux-fsdevel/20220322121212.5087-1-dharamhans87@xxxxxxxxx/
>
> There is a lot going in that table. Trying to understand it.
>
> - Why care about No-Flush. I mean that's independent of these changes,
>   right? I am assuming this means that upon file close do not send
>   a flush to fuse server. Not sure how bringing No-Flush into the
>   mix is helpful here.


It basically removes another call from kernel to user space. The more calls there are in total, the lower the resulting percentage gain for atomic-open.



> - What is "Patched Libfuse"? I am assuming that these are changes
>   needed in libfuse to support atomic create + atomic open. Similarly
>   assuming "Patched FuseK" means patched kernel with your changes.

Yes, I did that to ensure there is no regression from the patches when the other side is not patched.


> If this is correct, I would probably only be interested in
> looking at "Patched Libfuse + Patched FuseK" numbers to figure out
> what's the effect of your changes w.r.t vanilla kernel + libfuse.
> Am I understanding it right?

Yes.


> - I am wondering why do we measure "Sequential" and "Random" patterns.
>   These optimizations are primarily for file creation + file opening
>   and I/O pattern should not matter.

bonnie++ does this automatically, and it is just convenient to take the bonnie++ csv values and paste them into a table.

In our HPC world mdtest is more common, but it has MPI as a requirement, which makes it harder to run. Reproducing the values with bonnie++ should be rather easy for you.

The only issue with bonnie++ is that it does not run multi-threaded by default, and the old 3rd-party perl scripts I had, which let it run with multiple processes and summed up the values, don't work anymore with recent perl versions. I need to find some time to fix that.
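Until those scripts are fixed, something along these lines is a crude stand-in (entirely a sketch; the mount point, process count, and file counts are placeholders):

```shell
# Sketch: run 4 bonnie++ processes in parallel, each in its own
# subdirectory of the fuse mount. -s 0 skips the throughput tests;
# -n takes files(x1024):max-size:min-size:dirs. The per-process
# Files/s values from the CSV output still have to be summed by hand.
for i in 1 2 3 4; do
    mkdir -p /mnt/fuse-test/p$i
    bonnie++ -d /mnt/fuse-test/p$i -s 0 -n 16:1:1:16 -m "proc$i" &
done
wait
```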



> - Also wondering why performance of Read/s improves. Assuming once
>   file has been opened, I think your optimizations get out of the
>   way (no create, no open) and we are just going through data path of
>   reading file data and no lookups happening. If that's the case, why
>   do Read/s numbers show an improvement.

That is how bonnie++ works. It creates the files, closes them (which causes the flush) and then re-opens them for stat and read - atomic open comes into the picture here. Also, read() is skipped entirely when the files are empty - which is why one should use something like 1B files.
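For reference, a single bonnie++ invocation along these lines should reproduce the metadata-only numbers with 1-byte files (the mount point and label are placeholders):

```shell
# -s 0 skips the block-I/O throughput tests; -n 16:1:1:16 creates
# 16*1024 files of exactly 1 byte each (max=min=1) spread over 16
# directories, so the re-open/read phase actually issues read() calls
# instead of skipping empty files.
bonnie++ -d /mnt/fuse-test -s 0 -n 16:1:1:16 -m atomic-open-test
```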

If you have another metadata benchmark - please let us know.


> - Why do we measure "Patched Libfuse". It shows performance regression
>   of 4-5% in table 0B, Sequential workload. That sounds bad. So without
>   any optimization kicking in, it has a performance cost.

Yes, I'm not sure yet. There is not much code that has changed on the libfuse side.
However, the table needs to be redone with a fixed libfuse - limiting the number of threads caused permanent libfuse thread creation and destruction:

https://github.com/libfuse/libfuse/pull/652

The numbers in the table are also with passthrough_ll, which has its own issue due to its linear inode search. passthrough_hp uses a C++ map and avoids that. I noticed that too late, after I had started to investigate why there are regressions...

The table also made me investigate/profile all the fuse operations, which resulted in my waitq question. Please see that thread for more details: https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/

Regarding atomic-open/create with avoiding lookup/revalidate: our primary goal is to reduce network calls. A file system that handles it locally only reduces the number of fuse kernel/user-space crossings. A network file system that fully supports it needs to do the atomic open (or, in old terms, lookup-intent-open) on the server side of the network and needs to transfer attributes together with the open result.

Lustre does this, although I cannot easily point you to the right code. It all started almost two decades ago:
https://groups.google.com/g/lucky.linux.fsdevel/c/iYNFIIrkJ1s


BeeGFS does this as well
https://git.beegfs.io/pub/v7/-/blob/master/client_module/source/filesystem/FhgfsOpsInode.c
See for example FhgfsOps_atomicOpen() and FhgfsOps_createIntent().

(FhGFS is the old name when I was still involved in the project.)

Off the top of my head I'm not sure whether NFS does this over the wire; maybe v4 does.


Thanks,
Bernd