Re: fuse scalability part 1

From: Ashish Samant
Date: Thu Sep 24 2015 - 15:18:22 EST

On 05/18/2015 08:13 AM, Miklos Szeredi wrote:

> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each of those having their own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection. This
> means, that instead of having to read/write requests/answers on a single fuse
> device fd, the fuse daemon can have multiple distinct file descriptors open.
> Each of those can be used to receive requests and send answers, currently the
> only constraint is that a request must be answered on the same fd as it was read
> from.
>
> This can be extended further to allow binding a device clone to a specific CPU
> or NUMA node.
>
> Patchset is available here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for "clone_fd" option:
>
> git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos

Resending the numbers as attachments because my email client messes up the formatting of the message. Sorry for the noise.

We did some performance testing both without and with these patches (the latter with the -o clone_fd option specified). We ran two types of tests:

1. Throughput test: We ran parallel dd tests reading from and writing to a FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The performance here is almost equal to that of the per-NUMA patches we submitted a while back. Please find results attached.

2. Spinlock access time test: We also ran some tests within the kernel to measure the time spent accessing the spinlocks per request in both cases. As can be seen, the time per request spent accessing the spinlocks in the kernel code over the lifetime of the request is roughly 30X to 85X lower in the 2nd case (with patchset) at 16 or more parallel dd processes, and about 4X and 25X lower at 4 and 8 processes. Please find results attached.

Thanks,
Ashish

1) Writes to single mount

dd processes   throughput           throughput
in parallel    (without patchset)   (with patchset)

4              633 Mb/s             606 Mb/s
8              583.2 Mb/s           561.6 Mb/s
16             436 Mb/s             640.6 Mb/s
32             500.5 Mb/s           718.1 Mb/s
64             440.7 Mb/s           1276.8 Mb/s
128            526.2 Mb/s           2343.4 Mb/s

2) Reading from single mount

dd processes   throughput           throughput
in parallel    (without patchset)   (with patchset)

4              1171 Mb/s            1059 Mb/s
8              1626 Mb/s            1677 Mb/s
16             1014 Mb/s            2240.6 Mb/s
32             807.6 Mb/s           2512.9 Mb/s
64             985.8 Mb/s           2870.3 Mb/s
128            1355 Mb/s            2996.5 Mb/s
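
The scaling in the two throughput tables is easier to compare as a ratio of patched to unpatched throughput. The following is only a small Python sketch that recomputes those ratios from the numbers above; the names (writes, reads, gain) are made up for illustration and are not part of the test setup.

```python
# Throughput in Mb/s as (without patchset, with patchset), keyed by the
# number of parallel dd processes. Values are copied from the two tables.
writes = {4: (633.0, 606.0), 8: (583.2, 561.6), 16: (436.0, 640.6),
          32: (500.5, 718.1), 64: (440.7, 1276.8), 128: (526.2, 2343.4)}
reads = {4: (1171.0, 1059.0), 8: (1626.0, 1677.0), 16: (1014.0, 2240.6),
         32: (807.6, 2512.9), 64: (985.8, 2870.3), 128: (1355.0, 2996.5)}


def gain(table):
    """Return {processes: throughput with patchset / throughput without}."""
    return {n: patched / base for n, (base, patched) in table.items()}


write_gain = gain(writes)
read_gain = gain(reads)

for n in writes:
    print(f"{n:>3} procs  write x{write_gain[n]:.2f}  read x{read_gain[n]:.2f}")
```

A ratio below 1 means the patched kernel was slower in that row. In these numbers that happens only at 4 and 8 processes for writes and at 4 processes for reads.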

3) Time spent accessing spinlocks per request

dd processes   Time/req             Time/req
in parallel    (without patchset)   (with patchset)

4              0.025 ms             0.00685 ms
8              0.174 ms             0.0071 ms
16             0.9825 ms            0.0115 ms
32             2.4965 ms            0.0315 ms
64             4.8335 ms            0.071 ms
128            5.972 ms             0.1812 ms
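
The per-request improvement can be recomputed directly from the times above. This is a minimal Python sketch, assuming the improvement is simply the unpatched time divided by the patched time; the names (spinlock_ms, ratios) are illustrative only.

```python
# Time per request spent accessing spinlocks, in ms, as
# (without patchset, with patchset), keyed by parallel dd processes.
# Values are copied from the table above.
spinlock_ms = {
    4: (0.025, 0.00685),
    8: (0.174, 0.0071),
    16: (0.9825, 0.0115),
    32: (2.4965, 0.0315),
    64: (4.8335, 0.071),
    128: (5.972, 0.1812),
}

# Ratio > 1 means less time per request with the patchset.
ratios = {n: base / patched for n, (base, patched) in spinlock_ms.items()}

for n, r in ratios.items():
    print(f"{n:>3} procs: {r:.1f}x less spinlock time per request")
```

With these numbers the smallest ratio is at 4 processes and the largest at 16 processes.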
