Re: [RFC PATCH 05/27] containers: Open a socket inside a container

From: Eric W. Biederman
Date: Mon Sep 30 2019 - 06:03:36 EST


Alun Evans <alun@xxxxxxxxxxxxx> writes:

> On Fri 27 Sep '19 at 07:46 ebiederm@xxxxxxxxxxxx (Eric W. Biederman) wrote:
>>
>> Alun Evans <alun@xxxxxxxxxxxxx> writes:
>>
>>> Hi Eric,
>>>
>>>
>>> On Tue, 19 Feb 2019, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>>>
>>>> David Howells <dhowells@xxxxxxxxxx> writes:
>>>>
>>>> > Provide a system call to open a socket inside of a container, using that
>>>> > container's network namespace. This allows netlink to be used to manage
>>>> > the container.
>>>> >
>>>> > fd = container_socket(int container_fd,
>>>> > int domain, int type, int protocol);
>>>> >
>>>>
>>>> Nacked-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>>>>
>>>> Use a namespace file descriptor if you need this. So far we have not
>>>> added this system call as it is just a performance optimization. And it
>>>> has been too niche to matter.
>>>>
>>>> If that has changed, we can add this separately from everything else
>>>> you are doing here.
>>>
>>> I think I've found the niche.
>>>
>>>
>>> I'm trying to use network namespaces from Go.
>>
>> Yes. Go sucks for this.
>
> Haha... Neither confirm nor deny.
>
>>> Since setns is thread
>>> specific, I'm forced to use this pattern:
>>>
>>> runtime.LockOSThread()
>>> defer runtime.UnlockOSThread()
>>> ...
>>> err = netns.Set(newns)
>>>
>>>
>>> This is only safe recently:
>>> https://github.com/vishvananda/netns/issues/17#issuecomment-367325770
>>>
>>> - but is still less than ideal performance wise, as it locks out other
>>> socket operations.
>>>
>>> The socketat() / socketns() would be ideal:
>>>
>>> https://lwn.net/Articles/406684/
>>> https://lwn.net/Articles/407495/
>>> https://lkml.org/lkml/2011/10/3/220
>>>
>>>
>>> One thing that is interesting, the LockOSThread works pretty well for
>>> receiving, since I can wrap it around the socket()/bind()/listen() at
>>> startup. Then accept() can run outside of the lock.
>>>
>>> It's creating new outbound tcp connections via socket()/connect() pairs
>>> that is the issue.
>>
>> As I understand it you should be able to write socketat in go something like:
>>
>> runtime.LockOSThread()
>> err = netns.Set(newns);
>> fd = socket(...);
>> err = netns.Set(defaultns);
>> runtime.UnlockOSThread()
>
> Yeah, this is currently what I'm having to do. It's painful because,
> due to the Go runtime's model of a single OS netpoller thread, locking
> the OS thread to the current goroutine blocks out the other goroutines
> doing network I/O.

Just to be clear: only the setns and socket calls need to block out
thread switching, and all of those should currently be quite fast.

Hmm. So this is a global Go lock, and not simply a matter of locking the
current goroutine onto its current kernel thread? Yes, that does sound
quite painful.

It would be very nice if Go could provide an idiom where a series of
calls could be fixed to a single kernel thread.

>> I have no real objections to a kernel system call doing that. It has
>> just never risen to the level where it was necessary to optimize
>> userspace yet.
>
> Would you be able to accept the patch from this thread with the
> container API?
>
> fd = container_socket(int container_fd,
> int domain, int type, int protocol);
>
> I think that seems more coherent with the rest of the container world
> than a follow-up of https://lkml.org/lkml/2011/10/3/220 :
>

Given that container_socket implies the need to create a namespace of
namespaces. No.

Given that container_socket can't be used in iptools, because it has
a different concept of container. No.

Given that no one has ever proposed solving the entire migration story
when they have wanted to define a container, and thus all of this
implies breaking CRIU. No.

> int socketns(int netns_fd, int domain, int type, int protocol)
>

Yes please.

I suspect that in the current world, where system calls are much more
expensive (because of mitigations for speculative execution bugs), a
little bit of timing could make a reasonable case even for non-Go
runtimes.

To that end I would like to see performance numbers from at least a
micro benchmark in C, just so we can quantify the improvement.

Eric