[RFC] new open flag O_NOSTD

From: Eric Blake
Date: Mon Aug 24 2009 - 08:22:12 EST

Add a new flag, O_NOSTD, to at least open and pipe2 (and an alternate
spelling SOCK_NOSTD for socket, socketpair, accept4), with the following
semantics:

If the flag is specified and the function is successful, the returned fd
(both fds for the pipe2 case) will be at least 3, regardless of whether
the standard file descriptors 0, 1, or 2 are currently closed.

GNU Coreutils tries hard to protect itself from whatever weird environment
may be thrown at it. One example is if the user runs:

cp a b 2>&-

If cp encounters an error, it prints a message to stderr, then regardless
of whether the message was successfully printed, cp guarantees a non-zero
exit status. In the case where fd 2 starts life closed, however, a naive
implementation could end up opening a destination file for writing as fd
2, then encounter an error, such that the first use of stderr to print an
error message will incorrectly modify the contents of a completely
unrelated file. Therefore, the best approach for cp to take is to ensure
that command-line arguments never occupy fd 0, 1, or 2, no matter what the
cp process inherited from its parent.
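
The hazard is easy to reproduce. The following is only a minimal sketch,
not coreutils code: the helper name fd_chosen_without_stderr is invented
for illustration. It closes fd 2 the way the shell's 2>&- would, opens a
file, reports which descriptor the kernel picked, and restores the real
stderr afterwards.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo helper (not part of coreutils): report which
   descriptor open() hands out while fd 2 is closed, then restore fd 2.
   Returns -1 on failure.  */
int
fd_chosen_without_stderr (const char *path)
{
  /* Park the real stderr on a descriptor >= 3 so it can be restored.  */
  int saved = fcntl (STDERR_FILENO, F_DUPFD, 3);
  if (saved == -1)
    return -1;
  close (STDERR_FILENO);        /* same effect as the shell's 2>&- */

  /* open() always returns the lowest free descriptor.  */
  int fd = open (path, O_WRONLY);
  if (fd != -1)
    close (fd);

  dup2 (saved, STDERR_FILENO);  /* put the real stderr back */
  close (saved);
  return fd;
}
```

Assuming fd 0 and 1 are open, calling this on any writable file should
report 2, which is exactly the slot a later fprintf(stderr, ...) would
write to.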

Of course, if cp were installed set-user-ID or set-group-ID, then the OS
could guarantee that cp would never start life with fd 0, 1, or 2 closed;
but cp should not normally be installed with these permissions, and POSIX
does not permit the OS to arbitrarily open these fds if these permissions
are not present.

One option is for cp to manually guarantee that fd 0, 1, and 2 are opened
prior to parsing command line options. At one point, coreutils even used
this approach, via a function stdopen.

However, this has a couple of drawbacks. It costs several syscalls at
startup, even in the common case of all three std descriptors being
provided by the parent process. It also ties up otherwise unused open
file descriptors (perhaps the user intentionally closed some of the std
fds in order to provide room for allowing more simultaneously open files
without hitting EMFILE limits).
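
The up-front approach described above might look roughly like the
following. This is a sketch under assumptions, not the actual stdopen
source: the name stdopen_sketch, the use of /dev/null, and the choice of
open modes are illustrative, and the code assumes a single-threaded
process.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch: make sure fds 0, 1 and 2 are all open.
   Returns 0 on success, -1 with errno set on failure.  */
int
stdopen_sketch (void)
{
  for (int fd = 0; fd <= 2; fd++)
    {
      /* F_GETFD fails with EBADF only when fd is not open.  */
      if (fcntl (fd, F_GETFD) != -1 || errno != EBADF)
        continue;

      /* open() returns the lowest free descriptor, and every slot
         below fd is known to be occupied, so this lands on fd.
         Assumption: use the "wrong" access mode so that a later read
         of stdin or write to stdout/stderr fails visibly.  */
      int new_fd = open ("/dev/null", fd == 0 ? O_WRONLY : O_RDONLY);
      if (new_fd == -1)
        return -1;
      if (new_fd != fd)
        {
          close (new_fd);
          errno = EBADF;
          return -1;
        }
    }
  return 0;
}
```

Even when all three descriptors are already open, this costs three
fcntl() calls per program start, which is the overhead the paragraph
above refers to.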

Another option is what cp currently uses, which guarantees that any
function call that creates a new fd is wrapped by a *_safer variant, which
guarantees that the result will never collide with the standard
descriptors. In the common case, the original open() returns 3 or larger,
so the wrapper has no additional work to perform. But if the user started
cp with fd 0, 1, or 2 closed, then the current implementation of the
open_safer wrapper notices that the fd returned by the underlying open()
is in the wrong range, and follows up with fcntl(fd,F_DUPFD,3) and
close(fd), such that the overall result is again safely out of the std fd
range.
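
A simplified sketch of such a wrapper follows. It is not the real
open_safer, and the name open_safer_sketch and its fixed three-argument
signature are assumptions made to keep the example short.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of an open() wrapper that never returns
   fd 0, 1, or 2.  Returns the new fd, or -1 with errno set.  */
int
open_safer_sketch (const char *path, int flags, mode_t mode)
{
  int fd = open (path, flags, mode);

  /* Common case: fd >= 3 (or -1), nothing more to do.  */
  if (0 <= fd && fd <= 2)
    {
      /* Slow path: duplicate to the lowest free descriptor >= 3,
         then release the low one.  Preserve errno from fcntl in
         case the duplication itself failed.  */
      int safe_fd = fcntl (fd, F_DUPFD, 3);
      int saved_errno = errno;
      close (fd);
      errno = saved_errno;
      fd = safe_fd;
    }
  return fd;
}
```

With all three standard descriptors open, the cost is the single open()
call; the two extra syscalls are paid only when a standard descriptor
was closed.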

Notice that with coreutils' current approach, the common case (all std
descriptors provided by the parent) uses the minimal number of syscalls.
However, in the corner case of starting life with a standard descriptor
closed, the additional fcntl(F_DUPFD)/close() calls cause a noticeable
slowdown when copying large hierarchies (especially when compared with
the stdopen approach, which only suffers an up-front syscall penalty).
And while coreutils does not keep fd 0, 1, or 2 tied open on a useless
file all the time, it still puts pressure on these descriptors during
the window of the open_safer wrapper, so it has not completely
eliminated the EMFILE problem. Also, the coreutils approach works well
for a single-threaded application, but it needs modifications to use the
recently added POSIX 2008 open(O_CLOEXEC) and fcntl(F_DUPFD_CLOEXEC)
flags if it is to avoid leaking a temporary fd 0, 1, or 2 into a child
process created by a fork/exec in another thread during the time that
the first thread is calling open_safer.
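
One possible shape of that multithreaded modification is sketched below.
It is an illustration only: the name open_safer_cloexec_sketch is
invented, and the sketch makes the simplifying assumption that the
caller always wants the final descriptor to be close-on-exec.

```c
#ifndef _POSIX_C_SOURCE
# define _POSIX_C_SOURCE 200809L  /* for O_CLOEXEC, F_DUPFD_CLOEXEC */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: like an open_safer wrapper, but the temporary
   low descriptor is close-on-exec from the moment it exists, so an
   exec in a child forked by another thread cannot inherit it.  */
int
open_safer_cloexec_sketch (const char *path, int flags, mode_t mode)
{
  int fd = open (path, flags | O_CLOEXEC, mode);
  if (0 <= fd && fd <= 2)
    {
      /* The duplicate is created close-on-exec atomically as well.  */
      int safe_fd = fcntl (fd, F_DUPFD_CLOEXEC, 3);
      int saved_errno = errno;
      close (fd);
      errno = saved_errno;
      fd = safe_fd;
    }
  return fd;
}
```

This closes the leak across exec, but the temporary low descriptor still
exists briefly, so the extra syscalls and the EMFILE pressure remain.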

Therefore it makes sense to move this functionality into the kernel, via
the addition of a new open() flag that informs the kernel that a
successful fd-creation syscall must behave as if fd 0, 1, and 2 were
already open. The idea is not new, since fcntl(fd, F_DUPFD, 3) already
does just this. Then, on kernels where this is available, coreutils can
alter its open_safer function to pass the new flag to the underlying
open() syscall, and avoid having to use fcntl/close to sanitize any
returned fd, so that the number of syscalls is the same regardless of
whether the parent process started cp with stderr open or closed. It
also solves the EMFILE and multithreading fd leak issues, since a
temporary fd 0, 1, or 2 is never opened in the first place.
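
How a wrapper might adopt the flag can be sketched as follows. O_NOSTD
is only the flag proposed in this mail and is not defined by any
existing header, so the code is guarded by #ifdef and currently compiles
to the fallback path; the name open_nostd_sketch is likewise
hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: request the proposed behavior when the headers
   offer it, and keep the userspace fixup as a safety net.  */
int
open_nostd_sketch (const char *path, int flags, mode_t mode)
{
#ifdef O_NOSTD
  /* Proposed flag: ask the kernel to skip fds 0, 1, and 2.  */
  flags |= O_NOSTD;
#endif
  int fd = open (path, flags, mode);

  /* On a kernel that honors the flag this branch is never taken, so
     the cost is one syscall.  The check itself is free, and it covers
     kernels that do not implement the flag.  */
  if (0 <= fd && fd <= 2)
    {
      int safe_fd = fcntl (fd, F_DUPFD, 3);
      int saved_errno = errno;
      close (fd);
      errno = saved_errno;
      fd = safe_fd;
    }
  return fd;
}
```

Keeping the range check is a design choice in this sketch: it lets the
same binary run on kernels with and without the flag, at no cost in
syscalls when the flag works.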

The name proposed in this mail is O_NOSTD (implying that a successful
result will not be any of the standard file descriptors); other ideas
mentioned on the bug-gnulib list were O_SAFER, O_NONSTD, O_NOSTDFD.


Don't work too hard, make some time for fun as well!

Eric Blake ebb9@xxxxxxx