Re: [PATCH v2 0/5] fast poll multishot mode

From: Hao Xu
Date: Sat May 07 2022 - 12:01:33 EST


On 2022/5/7 11:08 AM, Jens Axboe wrote:
On 5/6/22 8:33 PM, Jens Axboe wrote:
On 5/6/22 5:26 PM, Jens Axboe wrote:
On 5/6/22 4:23 PM, Jens Axboe wrote:
On 5/6/22 1:00 AM, Hao Xu wrote:
Let fast poll support multishot mode; currently accept is added as its
first consumer.
theoretical analysis:
1) when connections come in fast
- singleshot:
    add accept sqe (userspace) --> accept inline
            ^                           |
            |---------------------------|
- multishot:
    add accept sqe (userspace) --> accept inline
                                     ^     |
                                     |--*--|

we accept repeatedly at the * point until we get EAGAIN

2) when connections come in at low pressure
similar to 1): we avoid a lot of userspace-kernel context switches and
useless vfs_poll() calls
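
The "accept repeatedly until EAGAIN" step in the analysis above can be
illustrated in plain userspace. The following is only a rough analogy, not
io_uring code and not the patchset's implementation: it waits for the
listener to become readable once, then drains every pending connection
until the nonblocking accept reports EAGAIN. The names drain_accept and
demo are made up for this sketch.

```python
import select
import socket


def drain_accept(listener):
    """Accept every pending connection; stop when accept would block."""
    accepted = []
    while True:
        try:
            conn, _addr = listener.accept()
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK: the accept queue is empty for now.
            return accepted
        accepted.append(conn)


def demo(n_clients=8, timeout=2.0):
    """Connect n_clients to a loopback listener and drain them in batches."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(n_clients)
    listener.setblocking(False)

    clients = [socket.create_connection(listener.getsockname())
               for _ in range(n_clients)]

    total = 0
    rounds = 0
    while total < n_clients:
        # One readiness wait per batch, instead of one per connection.
        ready, _, _ = select.select([listener], [], [], timeout)
        if not ready:
            break
        batch = drain_accept(listener)
        total += len(batch)
        rounds += 1
        for conn in batch:
            conn.close()

    for c in clients:
        c.close()
    listener.close()
    return total, rounds


if __name__ == "__main__":
    print(demo())
```

When connections arrive in a burst, the number of readiness waits (rounds)
is usually much smaller than the number of accepted connections, which is
the saving the analysis describes.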


tests:
I ran some tests, which go like this:

server          client (multiple)
accept          connect
read            write
write           read
close           close

Basically, start a number of clients (on the same machine as the server)
that connect to the server and write some data to it. The server writes
the data back to the client once it has received it, then closes the
connection after the write returns. The client then reads the data and
closes the connection. Here I tested 10000 clients connecting to one
server with a data size of 128 bytes. Each client has its own goroutine,
so they all reach the server within a short time.
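
For readers who want to reproduce the shape of this test without the
original harness, here is a minimal sketch of the same
accept/read/write/close flow. It is an assumption-laden stand-in: it uses
Python threads rather than goroutines, a plain blocking accept loop rather
than io_uring, and a small default client count. The names run_echo_test,
serve, client and recv_exact are hypothetical.

```python
import socket
import threading

PAYLOAD_SIZE = 128  # bytes per client, as in the test description


def recv_exact(sock, n):
    """Read exactly n bytes, or fewer if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf


def serve(listener, n_conns):
    """Server side: accept, read, write the data back, close."""
    for _ in range(n_conns):
        conn, _addr = listener.accept()
        with conn:
            conn.sendall(recv_exact(conn, PAYLOAD_SIZE))


def client(addr, results, idx):
    """Client side: connect, write, read the echo, close."""
    payload = bytes([idx % 256]) * PAYLOAD_SIZE
    with socket.create_connection(addr) as s:
        s.sendall(payload)
        results[idx] = recv_exact(s, PAYLOAD_SIZE) == payload


def run_echo_test(n_clients=50):
    """Return how many clients got their payload echoed back intact."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(n_clients)
    addr = listener.getsockname()

    results = [False] * n_clients
    server = threading.Thread(target=serve, args=(listener, n_clients))
    server.start()
    workers = [threading.Thread(target=client, args=(addr, results, i))
               for i in range(n_clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    server.join()
    listener.close()
    return sum(results)


if __name__ == "__main__":
    print(run_echo_test())
```
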
I ran the test 20 times before and after this patchset. Time spent (unit:
cycles, i.e. the return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 ≈ 1.71%
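
The averages and the relative difference can be recomputed directly from
the 20 samples listed above. This is just a checking aid; the helper names
mean and relative_improvement are made up for this sketch.

```python
# clock() values copied from the before/after lists above.
before = [
    1930136, 1940725, 1907981, 1947601, 1923812, 1928226, 1911087,
    1905897, 1941075, 1934374, 1906614, 1912504, 1949110, 1908790,
    1909951, 1941672, 1969525, 1934984, 1934226, 1914385,
]
after = [
    1858905, 1917104, 1895455, 1963963, 1892706, 1889208, 1874175,
    1904753, 1874112, 1874985, 1882706, 1884642, 1864694, 1906508,
    1916150, 1924250, 1869060, 1889506, 1871324, 1940803,
]


def mean(samples):
    """Arithmetic mean of the samples."""
    return sum(samples) / len(samples)


def relative_improvement(baseline, patched):
    """Fraction by which the patched mean is lower than the baseline mean."""
    return (mean(baseline) - mean(patched)) / mean(baseline)


if __name__ == "__main__":
    print(mean(before), mean(after))
    print(f"{relative_improvement(before, after):.2%}")
```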


A liburing test is here:
https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

Wish I had seen that, I wrote my own! But maybe that's good, you tend to
find other issues through that.

Anyway, works for me in testing, and I can see this being a nice win for
accept intensive workloads. I pushed a bunch of cleanup patches that
should just get folded in. Can you fold them into your patches and
address the other feedback, and post a v3? I pushed the test branch
here:

https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

Quick benchmark here, accepting 10k connections:

Stock kernel
real 0m0.728s
user 0m0.009s
sys 0m0.192s

Patched
real 0m0.684s
user 0m0.018s
sys 0m0.102s

Looks like a nice win for a highly synthetic benchmark. Nothing
scientific, was just curious.

One more thought on this - how is it supposed to work with
accept-direct? One idea would be to make the slot index increase
incrementally. But we need a good story for that. If multishot is
exclusive to non-direct files, it's a lot less interesting, as
accept-direct is a really nice win for lots of files. If we can combine
the two, even better.

Running some quick testing, on an actual test box (previous numbers were
from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
   usr time    9.32 millis  150.00 micros    9.17 millis
   sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
   usr time   11.86 millis  159.00 micros   11.71 millis
   sys time   93.71 millis   70.00 micros   93.64 millis

That's ~19 usec to accept a connection, pretty decent. Using singleshot
with fixed files shaves about 8% off, ending at around 200 msec.
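
The per-connection figure follows from dividing the wall-clock time by the
10000 accepts. A small sketch of that arithmetic, using the timings quoted
above (the function names are made up for illustration):

```python
def per_accept_usec(elapsed_ms, accepts):
    """Average wall-clock microseconds spent per accepted connection."""
    return elapsed_ms * 1000.0 / accepts


def relative_saving(baseline_ms, improved_ms):
    """Fraction of the baseline time saved by the improved run."""
    return (baseline_ms - improved_ms) / baseline_ms


if __name__ == "__main__":
    # multishot + fixed files: 189.04 ms for 10000 accepts
    print(per_accept_usec(189.04, 10000))
    # singleshot + normal files: 216.10 ms for 10000 accepts
    print(per_accept_usec(216.10, 10000))
    # singleshot + fixed files at roughly 200 ms vs. 216.10 ms
    print(relative_saving(216.10, 200.0))
```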

I think we can get away with using fixed files and multishot, attaching

I'm not following. Do you mean we shouldn't do multishot + fixed files,
or that we should use multishot + fixed files to improve the result?

the quick patch I did below to test it. We need something better than

Sorry Jens, I didn't see the quick patch. Is there anything I
misunderstood?

this, otherwise once the space fills up, we'll likely end up with a
sparse space and the naive approach of just incrementing the next slot
won't work at all.
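
To make the sparse-space concern concrete, here is a toy model of a
fixed-size slot table. It is purely illustrative and says nothing about how
the kernel does or should allocate fixed file slots; NaiveSlots and
ScanningSlots are hypothetical names. The naive allocator only ever
increments the next index, so it reports a full table even after slots have
been freed, while a scanning allocator can reuse the holes.

```python
class NaiveSlots:
    """Hands out slots by incrementing an index; never reuses freed slots."""

    def __init__(self, size):
        self.table = [None] * size
        self.next = 0

    def alloc(self, item):
        if self.next >= len(self.table):
            return -1  # looks full, even if earlier slots were freed
        slot = self.next
        self.table[slot] = item
        self.next += 1
        return slot

    def free(self, slot):
        self.table[slot] = None


class ScanningSlots:
    """Scans from a hint for the next empty slot, wrapping around."""

    def __init__(self, size):
        self.table = [None] * size
        self.hint = 0

    def alloc(self, item):
        n = len(self.table)
        for offset in range(n):
            slot = (self.hint + offset) % n
            if self.table[slot] is None:
                self.table[slot] = item
                self.hint = (slot + 1) % n
                return slot
        return -1  # genuinely full

    def free(self, slot):
        self.table[slot] = None


def fill_free_realloc(slots, size, hole):
    """Fill the table, free one slot, then try to allocate once more."""
    for i in range(size):
        slots.alloc(f"conn{i}")
    slots.free(hole)
    return slots.alloc("late-conn")


if __name__ == "__main__":
    print(fill_free_realloc(NaiveSlots(4), 4, 1))
    print(fill_free_realloc(ScanningSlots(4), 4, 1))
```

The linear scan is the simplest possible fix and costs O(n) in the worst
case; a bitmap or free list would be the usual next step if that mattered.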