Returning a pidfd allows a process to keep a race-free handle for a
child process, otherwise, the caller will need to either use pidfd_open
(which still might be subject to TOCTOU) or keep the old racy interface
base on pid_t.
To correct use pifd_spawn, the kernel must support not only returning
the pidfd with clone/clone3 but also waitid (P_PIDFD) (added on Linux
5.4). If kernel does not support the waitid, pidfd return ENOSYS.
It avoids the need to racy workarounds, such as reading the procfs
fdinfo to get the pid to use along with other wait interfaces.
These interfaces are similar to the posix_spawn and posix_spawnp, with
the only difference being it returns a process file descriptor (int)
instead of a process ID (pid_t). Their prototypes are:
int pidfd_spawn (int *restrict pidfd,
const char *restrict file,
const posix_spawn_file_actions_t *restrict facts,
const posix_spawnattr_t *restrict attrp,
char *const argv[restrict],
char *const envp[restrict])
int pidfd_spawnp (int *restrict pidfd,
const char *restrict path,
const posix_spawn_file_actions_t *restrict facts,
const posix_spawnattr_t *restrict attrp,
char *const argv[restrict_arr],
char *const envp[restrict_arr]);
A new symbol is used instead of a posix_spawn extension to avoid
possible issues with language bindings that might track the return
argument lifetime. Although on Linux pid_t and int are interchangeable,
POSIX only states that pid_t should be a signed integer.
Both symbols reuse the posix_spawn posix_spawn_file_actions_t and
posix_spawnattr_t, to void rehash posix_spawn API or add a new one. It
also means that both interfaces support the same attribute and file
actions, and a new flag or file action on posix_spawn is also added
automatically for pidfd_spawn.
Also, using posix_spawn plumbing allows the reusing of most of the
current testing with some changes:
- waitid is used instead of waitpid since it is a more generic
interface.
- tst-posix_spawn-setsid.c is adapted to take into consideration that
the caller can check for session id directly. The test now spawns
itself and writes the session id as a file instead.
- tst-spawn3.c need to know where pidfd_spawn is used so it keeps an
extra file description unused.
Checked on x86_64-linux-gnu on Linux 4.15 (no CLONE_PIDFD or waitid
support), Linux 5.4 (full support), and Linux 6.2.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Add an optimization to avoid calling clone3 when glibc detects that
there is no kernel support. It also adds __ASSUME_CLONE3, which allows
skipping this optimization and issuing the clone3 syscall directly.
It does not handle the the small window between 5.3 and 5.5 for
posix_spawn (CLONE_CLEAR_SIGHAND was added in 5.5).
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The clone3 flag resets all signal handlers of the child not set to
SIG_IGN to SIG_DFL. It allows to skip most of the sigaction calls
to setup child signal handling, where previously a posix_spawn
had to issue 2 times NSIG sigaction calls (one to obtain the current
disposition and another to set either SIG_DFL or SIG_IGN).
With POSIX_SPAWN_SETSIGDEF the child will setup the signal for the case
where the disposition is SIG_IGN.
The code must handle the fallback where clone3 is not available. This is
done by splitting __clone_internal_fallback from __clone_internal.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
All internal callers of __clone3 should provide an already aligned
stack. Removing the stack alignment in __clone3 is a net gain: it
simplifies the internal function contract (mask/unmask signals) along
with the arch-specific code.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Different than kernel, clone3 returns EINVAL for NULL struct
clone_args or function pointer. This is similar to clone
interface that return EINVAL for NULL function argument.
It also clean up the Linux clone3.h interface, since it not
currently exported.
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The clone3 system call (since Linux 5.3) provides a superset of the
functionality of clone and clone2. It also provides a number of API
improvements, including the ability to specify the size of the child's
stack area which can be used by kernel to compute the shadow stack size
when allocating the shadow stack. Add:
extern int __clone_internal (struct clone_args *__cl_args,
int (*__func) (void *__arg), void *__arg);
to provide an abstract interface for clone, clone2 and clone3.
1. Simplify stack management for thread creation by passing both stack
base and size to create_thread.
2. Consolidate clone vs clone2 differences into a single file.
3. Call __clone3 if HAVE_CLONE3_WAPPER is defined. If __clone3 returns
-1 with ENOSYS, fall back to clone or clone2.
4. Use only __clone_internal to clone a thread. Since the stack size
argument for create_thread is now unconditional, always pass stack size
to create_thread.
5. Enable the public clone3 wrapper in the future after it has been
added to all targets.
NB: Sandbox will return ENOSYS on clone3 in both Chromium:
The following revision refers to this bug:
218438259d
commit 218438259dd795456f0a48f67cbe5b4e520db88b
Author: Matthew Denton <mpdenton@chromium.org>
Date: Thu Jun 03 20:06:13 2021
Linux sandbox: return ENOSYS for clone3
Because clone3 uses a pointer argument rather than a flags argument, we
cannot examine the contents with seccomp, which is essential to
preventing sandboxed processes from starting other processes. So, we
won't be able to support clone3 in Chromium. This CL modifies the
BPF policy to return ENOSYS for clone3 so glibc always uses the fallback
to clone.
Bug: 1213452
Change-Id: I7c7c585a319e0264eac5b1ebee1a45be2d782303
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2936184
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Commit-Queue: Matthew Denton <mpdenton@chromium.org>
Cr-Commit-Position: refs/heads/master@{#888980}
[modify] https://crrev.com/218438259dd795456f0a48f67cbe5b4e520db88b/sandbox/linux/seccomp-bpf-helpers/baseline_policy.cc
and Firefox:
https://hg.mozilla.org/integration/autoland/rev/ecb4011a0c76
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>