On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
On 2024/4/28 23:49, Cong Wang wrote:
On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
This patch set is the second part of the new version of [1] (the first
part can be found at [2]). The changes in this version are listed
at the end.
- Background
SMC-D is now used on IBM z with the ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on non-s390 architectures through a software-implemented Emulated-ISM device,
that is, the loopback-ism device here, to accelerate inter-process or
inter-container communication within the same OS instance.
Just FYI:
Cilium has implemented this kind of shortcut with sockmap and sockops.
In fact, for intra-OS case, it is _very_ simple. The core code is less
than 50 lines. Please take a look here:
https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
similar eBPF things for the inter-OS (aka VM) case.
More importantly, even LD_PRELOAD is not needed for this eBPF approach.
:)
Thanks.
Hi, Cong. Thank you very much for the information. I learned about sockmap
before and from my perspective smcd loopback and sockmap each have their own
pros and cons.
The pro of smcd loopback is that it uses the standard negotiation process
defined by RFC 7609. This CLC handshake helps smc correctly determine
whether the TCP connection should be upgraded no matter what middleware the
connection passes through, e.g. NAT. So we don't need to spend extra effort
checking whether the connection should be shortcut, unlike the various policy
checks done by bpf_sock_ops_ipv4() in sockmap. And since the handshake
automatically selects different underlay devices for different scenarios
(loopback-ism for intra-OS, ISM for inter-VM on IBM z and RDMA for inter-VM
on different hosts), various scenarios can be covered by one smc protocol stack.
The con of smcd loopback is also related to the CLC handshake: the extra
handshake round may cause smc to perform worse than TCP in short-lived
connection scenarios. So we basically use smc upgrade in long-lived connection
scenarios and are exploring IPPROTO_SMC[1] to provide lossless fallback in
adverse cases.
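To make the trade-off above concrete, here is a minimal userspace sketch, assuming an application that opts into SMC only for sockets it expects to be long-lived. The helper name stream_socket_prefer_smc() is hypothetical, and SMCPROTO_SMC is defined locally in case the system headers do not provide it. This only shows explicit AF_SMC socket creation with a plain TCP fallback when the address family is unavailable; it is not the IPPROTO_SMC mechanism or the in-kernel CLC fallback discussed in this thread.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef AF_SMC
#define AF_SMC 43       /* Linux address family number for SMC sockets */
#endif
#ifndef SMCPROTO_SMC
#define SMCPROTO_SMC 0  /* SMC over IPv4; defined here for the sketch */
#endif

/*
 * Hypothetical helper: try an SMC stream socket first and fall back to
 * plain TCP if the kernel has no AF_SMC support. *used_smc reports which
 * path was taken. Returns the fd, or -1 if both attempts fail.
 */
static int stream_socket_prefer_smc(int *used_smc)
{
    int fd = socket(AF_SMC, SOCK_STREAM, SMCPROTO_SMC);

    if (fd >= 0) {
        *used_smc = 1;
        return fd;
    }

    /* AF_SMC unavailable (e.g. module not loaded): use ordinary TCP. */
    *used_smc = 0;
    return socket(AF_INET, SOCK_STREAM, 0);
}
```

A caller would use the returned fd with connect()/listen() as usual; short-lived connections can simply keep calling socket(AF_INET, ...) directly and avoid the extra handshake round.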
You don't have to bother with RFCs, since you could define your own TCP
options. And the eBPF approach could also use TCP options whenever
needed. Cilium probably does not use them only because the intra-OS case
is too simple to bother with TCP options, as everything can be shared via a
shared sockmap.
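As a rough illustration of the "define your own TCP option" idea, the sketch below builds and detects a private option in a raw TCP options area. It uses the experimental option kind 254 (RFC 4727); the magic value, the option length and both function names are hypothetical and invented for this example, not taken from Cilium, SMC or any existing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL 0   /* end of option list */
#define TCPOPT_NOP 1   /* one-byte padding */
#define TCPOPT_EXP 254 /* experimental option kind (RFC 4727) */

/* Hypothetical 4-byte magic ("SHM1"), invented for this sketch. */
#define SHORTCUT_MAGIC  0x53484d31u
#define SHORTCUT_OPTLEN 6 /* kind + length + 4-byte magic */

/* Write the option into buf and return the number of bytes used. */
static size_t put_shortcut_option(uint8_t *buf, size_t cap)
{
    assert(cap >= SHORTCUT_OPTLEN);
    buf[0] = TCPOPT_EXP;
    buf[1] = SHORTCUT_OPTLEN;
    buf[2] = (uint8_t)(SHORTCUT_MAGIC >> 24);
    buf[3] = (uint8_t)(SHORTCUT_MAGIC >> 16);
    buf[4] = (uint8_t)(SHORTCUT_MAGIC >> 8);
    buf[5] = (uint8_t)SHORTCUT_MAGIC;
    return SHORTCUT_OPTLEN;
}

/* Walk a TCP options area; return 1 if the shortcut option is present. */
static int has_shortcut_option(const uint8_t *opts, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t kind = opts[i];

        if (kind == TCPOPT_EOL)
            break;
        if (kind == TCPOPT_NOP) {
            i++;
            continue;
        }
        if (i + 1 >= len)
            break; /* truncated: no length byte */

        uint8_t olen = opts[i + 1];
        if (olen < 2 || olen > len - i)
            break; /* malformed length */

        if (kind == TCPOPT_EXP && olen == SHORTCUT_OPTLEN) {
            uint32_t magic = ((uint32_t)opts[i + 2] << 24) |
                             ((uint32_t)opts[i + 3] << 16) |
                             ((uint32_t)opts[i + 4] << 8) |
                             (uint32_t)opts[i + 5];
            if (magic == SHORTCUT_MAGIC)
                return 1;
        }
        i += olen; /* skip to the next option */
    }
    return 0;
}
```

In an eBPF setting the same walk would have to be done with verifier-friendly bounded loops, or through whatever TCP header option helpers the program type offers; the plain C above only shows the wire format and the matching logic.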
In reality, the setup is not that complex. In many cases we already know
whether we have VM or container (or mixed) setup before we develop (as
a part of requirement gathering). And they rarely change.
Taking one step back, the discovery of VM or container or loopback cases
could be done via TCP options too, to deal with complex cases like
Kata Containers. There is no reason to bother with RFCs, except maybe for the
RDMA case.
In fact, this is an advantage to me. We don't need to argue with anyone
on our own TCP option or eBPF code, we don't even have to share our own
eBPF code here.
Yes, it is expected to be used with the SMC/MPTCP modules.
And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
hook[2] with IPPROTO_SMC, to enhance the usability.
That is wrong IMHO, because basically it just overwrites kernel modules
with eBPF, which is not how eBPF is supposed to be used. IOW, you could not use
it at all without the SMC/MPTCP modules.
BTW, this approach does not work for kernel sockets, because you only
hook __sys_socket().
In fact the purpose of this is mainly to transparently upgrade applications'
Of course, for sockmap or sockops, they could be used independently for
any other purposes. I hope now you could see the flexibility of eBPF
over kernel modules.
Yes, I agree with the pros of the eBPF way, like the flexibility you mentioned.
Thanks.