Re: [PATCH v4] pidns: introduce syscall translate_pid

From: Andy Lutomirski
Date: Tue Oct 17 2017 - 18:40:44 EST


On Tue, Oct 17, 2017 at 3:35 PM, prakash sangappa
<prakash.sangappa@xxxxxxxxxx> wrote:
>
> On 10/17/2017 3:02 PM, Andy Lutomirski wrote:
>>
>> On Tue, Oct 17, 2017 at 8:38 AM, Prakash Sangappa
>> <prakash.sangappa@xxxxxxxxxx> wrote:
>>>
>>>
>>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>>>
>>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
>>>> <prakash.sangappa@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>>>
>>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>>>>>>> <khlebnikov@xxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>>
>>>>>>>>>>> This syscall converts a pid from the source pid-ns into a pid in the
>>>>>>>>>>> target pid-ns. If the pid is unreachable from the target pid-ns, it
>>>>>>>>>>> returns zero.
>>>>>>>>>>>
>>>>>>>>>>> Pid-namespaces are referred to by file descriptors opened on the proc
>>>>>>>>>>> files /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. A negative
>>>>>>>>>>> argument refers to the current pid namespace, same as the file
>>>>>>>>>>> /proc/self/ns/pid.
>>>>>>>>>>>
>>>>>>>>>>> The kernel exposes virtual pids in /proc/[pid]/status:NSpid, but backward
>>>>>>>>>>> translation requires scanning all tasks. Pids can also be translated by
>>>>>>>>>>> sending them through a unix socket between namespaces, but this method is
>>>>>>>>>>> slow and insecure because the other side is exposed inside the pid
>>>>>>>>>>> namespace.
>>>>>>>>
>>>>>>>> Andrew asked why we might need this.
>>>>>>>>
>>>>>>>> Such conversion is required for interaction between processes across
>>>>>>>> pid-namespaces, for example to identify a process in a container by its
>>>>>>>> pid file when looking from outside.
>>>>>>>>
>>>>>>>> Two years ago I solved this in a project of mine with monstrous code
>>>>>>>> which forks a couple of times just to convert a pid; luckily for me,
>>>>>>>> performance wasn't important.
>>>>>>>
>>>>>>> That's a single user who needed this a single time, and found a
>>>>>>> userspace-based solution anyway. This is not exactly compelling!
>>>>>>>
>>>>>>> Is there a stronger case to be made? How does this change benefit our
>>>>>>> users? Sell it to us!
>>>>>>
>>>>>> Oracle Database is planning to use pid namespaces for sandboxing database
>>>>>> instances, and it needs an API similar to translate_pid to effectively
>>>>>> translate process IDs from other pid namespaces. Prakash (cc'd on this
>>>>>> mail) can provide more details on this use case.
>>>>>
>>>>>
>>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces and
>>>>> needs a direct method of converting pids of processes in the pid namespace
>>>>> hierarchy. In this use case, multiple nested PID namespaces will be used.
>>>>> The currently available mechanisms are not very efficient for this use
>>>>> case. For example, as Konstantin described, using /proc/<pid>/status would
>>>>> require the application to scan the status files of all pids to determine
>>>>> the pid of a given process in a child namespace.
>>>>>
>>>>> Using the SCM_CREDENTIALS socket message is another way, but it would
>>>>> require every process starting inside a pid namespace to send this
>>>>> message, and the receiving process in the target namespace would have to
>>>>> save the converted pid and reference it. This mechanism becomes cumbersome,
>>>>> especially if the application has to deal with multiple nested pid
>>>>> namespaces. Also, the Database needs to be able to convert a thread's
>>>>> global pid (gettid()). Passing the thread's pid (gettid()) in an
>>>>> SCM_CREDENTIALS message requires CAP_SYS_ADMIN, which is an issue.
>>>>>
>>>>> So having a direct method, like the API that Konstantin is proposing, will
>>>>> work best for the Database, since the pid of a process in any of the nested
>>>>> pid namespaces can be converted as and when required. I think with the
>>>>> proposed API, the application should be able to convert the pid of a
>>>>> process or the tid (gettid()) of a thread as well.
>>>>>
>>>> Can you explain what Oracle's database is planning to do with this
>>>> information?
>>>
>>>
>>> The Database uses the PID to programmatically find out whether the
>>> process/thread is alive (kill with signal 0), to send signals to processes
>>> requesting that they dump status/debug information, and to kill the
>>> processes in case of a shutdown abort of the instance.
>>
>> What I'm wondering is: how does the caller of kill() end up
>> controlling a task whose pid it doesn't know in its own namespace?
>
>
> I was generally describing how the DB would use the PID of a process. The
> above description was for the case when no namespaces are used.
>
> With namespaces, the DB would convert the PIDs of processes inside its
> child namespaces to PIDs in its own namespace and use those PIDs to issue
> kill().

Seems vaguely sensible.

If I were designing this type of system, I'd have a manager process in
each namespace running as PID 1, though -- PID 1 is special and needs
to understand what's going on anyway. Then PID 1 would do the kill()
calls and wouldn't need translate_pid().


>
> -Prakash.
>
>>
>>> -Prakash.
>>>
>>>
>