Time Namespaces: CLONE_NEWTIME and clone3()?

From: Michael Kerrisk
Date: Mon Feb 17 2020 - 09:21:10 EST


Hello Dmitry, Andrei,

Is the CLONE_NEWTIME flag intended to be usable with clone3()? The
mail quoted below implies (in my reading) that this should be possible
once clone3() is available, which it is by now. (See also [1].)

If the answer is yes, CLONE_NEWTIME should be usable with clone3(),
then I have a bug report and a question.

I successfully used CLONE_NEWTIME with unshare(). But if I try to use
CLONE_NEWSIGNAL with clone3(), it errors out with EINVAL, because of
the following check in clone3_args_valid():

/*
* - make the CLONE_DETACHED bit reuseable for clone3
* - make the CSIGNAL bits reuseable for clone3
*/
if (kargs->flags & (CLONE_DETACHED | CSIGNAL))
return false;

The problem is that CLONE_NEWTIME matches one of the bits in the
CSIGNAL mask. If the intention is to allow CLONE_NEWTIME with
clone3(), then either the bit needs to be redefined, or the error
checking in clone3_args_valid() needs to be reworked.

And my question: if it is intended that CLONE_NEWTIME should be
usable with clone3(), how should that work? What I mean is,
clone3(CLONE_NEWTIME) creates a child process in a new time namespace,
but, as I understand it, the /proc/PID/timens_offsets must be defined
before the first process is created in or joins (setns()) the new
namespace. What am I missing?

Thanks,

Michael

[1] The message for commit 769071ac9f20b6a447410c7eaa55d1a5233ef40c,
implies rather more strongly that clone3() should be able to use
CLONE_NEWCTIME, but perhaps that is a result of Thomas's fix-up:

[[
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.

[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
]]

On Tue, Nov 12, 2019 at 2:31 AM Dmitry Safonov <dima@xxxxxxxxxx> wrote:
>
> From: Andrei Vagin <avagin@xxxxxxxxxx>
>
> Time Namespace isolates clock values.
>
> The kernel provides access to several clocks CLOCK_REALTIME,
> CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
>
> CLOCK_REALTIME
> System-wide clock that measures real (i.e., wall-clock) time.
>
> CLOCK_MONOTONIC
> Clock that cannot be set and represents monotonic time since
> some unspecified starting point.
>
> CLOCK_BOOTTIME
> Identical to CLOCK_MONOTONIC, except it also includes any time
> that the system is suspended.
>
> For many users, the time namespace means the ability to changes date and
> time in a container (CLOCK_REALTIME).
>
> But in a context of the checkpoint/restore functionality, monotonic and
> bootime clocks become interesting. Both clocks are monotonic with
> unspecified staring points. These clocks are widely used to measure time
> slices and set timers. After restoring or migrating processes, we have to
> guarantee that they never go backward. In an ideal case, the behavior of
> these clocks should be the same as for a case when a whole system is
> suspended. All this means that we need to be able to set CLOCK_MONOTONIC
> and CLOCK_BOOTTIME clocks, what can be done by adding per-namespace
> offsets for clocks.
>
> A time namespace is similar to a pid namespace in a way how it is
> created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
> but doesn't set it to the current process. Then all children of
> the process will be born in the new time namespace, or a process can
> use the setns() system call to join a namespace.
>
> This scheme allows setting clock offsets for a namespace, before any
> processes appear in it.
>
> All available clone flags have been used, so CLONE_NEWTIME uses the
> highest bit of CSIGNAL. It means that we can use it with the unshare()
> system call only. Rith now, this works for us, because time namespace
> offsets can be set only when a new time namespace is not populated. In a
> future, we will have the clone3() system call [1] which will allow to use
> the CSIGNAL mask for clone flags.
>
> [1]: httmps://lkml.kernel.org/r/20190604160944.4058-1-christian@xxxxxxxxxx
>
> Link: https://criu.org/Time_namespace
> Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
> Signed-off-by: Andrei Vagin <avagin@xxxxxxxxx>
> Co-developed-by: Dmitry Safonov <dima@xxxxxxxxxx>
> Signed-off-by: Dmitry Safonov <dima@xxxxxxxxxx>


--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/