As discussed in libc-alpha [1] current clone with CLONE_VM (without
CLONE_THREAD set) will reset the pthread pid/tid fields to -1. The
issue is since memory is shared between the parent and child it will
clobber parent's cached pid/tid leading to internal inconsistencies
if the value is not restored.
And even it is restored it may lead to racy conditions when between
set/restore a thread might invoke pthread function that validate the
pthread with INVALID_TD_P/INVALID_NOT_TERMINATED_TD_P and thus get
wrong results.
As stated in BZ19957, previously reports of this behaviour was close
with EWONTFIX due the fact usage of clone outside glibc is tricky
since glibc requires consistent internal pthread, while using clone
directly may not provide it. However since now posix_spawn uses
clone (CLONE_VM) to fixes various issues related to previous vfork
usage this issue requires fixing.
The vfork implementation also does something similar, but instead
it negates and restores only the *pid* field and functions that
might access its value know to handle such case (getpid, raise
and pthread ones that uses INVALID_TD_P/INVALID_NOT_TERMINATED_TD_P
macros that check only *tid* field). Also vfork does not call
__clone directly, instead calling either __NR_vfork or __NR_clone
directly.
So this patch removes this clone behavior by avoiding setting
the pthread pid/tid field for CLONE_VM. There is no need to
check for CLONE_THREAD, since the minimum supported kernel in all
architecture implies that CLONE_VM must be used with CLONE_THREAD,
otherwise clone returns EINVAL.
Instead of current approach of:
int clone(int (*fn)(void *), void *child_stack, int flags, ...)
[...]
if (flags & CLONE_THREAD)
goto do_syscall;
pid_t new_value;
if (flags & CLONE_VM)
new_value = -1;
else
new_value = getpid ();
THREAD_SETMEM (THREAD_SELF, pid, new_value);
THREAD_SETMEM (THREAD_SELF, tid, new_value);
do_syscall:
[...]
The new approach uses:
int clone(int (*fn)(void *), void *child_stack, int flags, ...)
[...]
if (flags & CLONE_VM)
goto do_syscall;
pid_t new_value = getpid ();
THREAD_SETMEM (THREAD_SELF, pid, new_value);
THREAD_SETMEM (THREAD_SELF, tid, new_value);
do_syscall:
[...]
It also removes the linux tst-getpid2.c test which expects the previous
behavior and instead add another clone test.
Tested on x86_64, i686, x32, powerpc64le, aarch64, armhf, s390, and
s390x. I also did limited check on mips32 and sparc64 (using the new
added test).
I also got reviews from both m68k, hppa, and tile. So I presume for
these architecture the patch works.
The fixes for alpha, microblaze, sh, ia64, and nio2 have not been
tested.
[1] https://sourceware.org/ml/libc-alpha/2016-04/msg00307.html
* sysdeps/unix/sysv/linux/Makefile [$(subdir) == nptl] (test): Remove
tst-getpid2.
(test): Add tst-clone2.
* sysdeps/unix/sysv/linux/tst-clone2.c: New file.
* sysdeps/unix/sysv/linux/aarch64/clone.S (__clone): Do not change
pid/tid fields for CLONE_VM.
* sysdeps/unix/sysv/linux/arm/clone.S: Likewise.
* sysdeps/unix/sysv/linux/i386/clone.S: Likewise.
* sysdeps/unix/sysv/linux/mips/clone.S: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc32/clone.S: Likewise.
* sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S: Likewise.
* sysdeps/unix/sysv/linux/s390/s390-32/clone.S: Likewise.
* sysdeps/unix/sysv/linux/s390/s390-64/clone.S: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc32/clone.S: Likewise.
* sysdeps/unix/sysv/linux/sparc/sparc64/clone.S: Likewise.
* sysdeps/unix/sysv/linux/x86_64/clone.S: Likewise.
* sysdeps/unix/sysv/linux/tst-getpid2.c: Remove file.
The personality system call, starting with linux kernel commit
v2.6.29-6609-g11d06b2a1e5658f448a308aa3beb97bacd64a940, always
successfully changes the personality if requested. The syscall
wrapper, however, still can return an error in the following cases:
- the value returned by the system call looks like an error
due to architecture limitations of 32-bit kernels;
- a personality greater than 0xffffffff is passed to the system call,
and the 64-bit kernel does not have commit
v2.6.35-rc1-372-g485d527686850d68a0e9006dd9904f19f122485e
that would truncate this value to unsigned int;
- on sparc64, the value returned by the system call looks like an error
due to sparc64 kernel sign extension bug.
The solution is three-fold:
- move generic syscalls.list personality entry to generic 64-bit
syscalls.list file;
- for each 32-bit architecture that use negated errno semantics,
add a NOERRNO personality entry to their syscalls.list file;
- for sparc64 and 32-bit architectures that use dedicated registers
to flag syscall errors, add a wrapper around personality syscall;
if the system call return value is flagged as an error, this wrapper
returns the negated "would be errno" value, otherwise it returns
the system call return value; on sparc64, it also truncates the
personality argument to unsigned int before passing it to the kernel.
[BZ #19408]
* sysdeps/unix/sysv/linux/personality.c: New file.
* sysdeps/unix/sysv/linux/sparc/sparc64/personality.c: Likewise.
* sysdeps/unix/sysv/linux/tst-personality.c: Likewise.
* sysdeps/unix/sysv/linux/Makefile [$(subdir) == misc]
(sysdep_routines): Add personality.
(tests): Add tst-personality.
* sysdeps/unix/sysv/linux/syscalls.list (personality): Move ...
* sysdeps/unix/sysv/linux/wordsize-64/syscalls.list: ... here.
* sysdeps/unix/sysv/linux/arm/syscalls.list (personality): New entry.
* sysdeps/unix/sysv/linux/hppa/syscalls.list (personality): Likewise.
* sysdeps/unix/sysv/linux/i386/syscalls.list (personality): Likewise.
* sysdeps/unix/sysv/linux/m68k/syscalls.list (personality): Likewise.
* sysdeps/unix/sysv/linux/microblaze/syscalls.list (personality):
Likewise.
* sysdeps/unix/sysv/linux/mips/mips64/n32/syscalls.list (personality):
Likewise.
* sysdeps/unix/sysv/linux/s390/s390-32/syscalls.list (personality):
Likewise.
* sysdeps/unix/sysv/linux/sh/syscalls.list (personality): Likewise.
* sysdeps/unix/sysv/linux/x86_64/x32/syscalls.list (personality):
Likewise.
Various Linux kernel syscalls have become obsolete over time.
Specifically, the following are obsolete in all kernel versions
supported by glibc, are not present for architectures more recently
added to the kernel, and as such, the wrapper functions for them
should be compat symbols, not in static libc and not available for new
links with shared libc.
* bdflush: in Linux 2.6, does nothing if present.
* create_module get_kernel_syms query_module: Linux 2.4 module
interface, syscalls not present in Linux 2.6.
* uselib: part of the mechanism for loading a.out shared libraries,
irrelevant with ELF.
This patch adds support for syscalls.list to list syscall aliases of
the form NAME@VERSION:OBSOLETED, with SHLIB_COMPAT conditionals being
generated for such aliases. Those five syscalls are then made into
compat symbols (obsoleted in glibc 2.23, so future ports won't have
these symbols at all), with the header <sys/kdaemon.h> declaring
bdflush being removed. When we move to 3.2 as minimum kernel version,
the same can be done for nfsservctl (removed in Linux 3.1) as well.
Tested for x86_64 and x86 (testsuite, as well as checking that the
symbols in question indeed become compat symbols, that they are indeed
omitted from static libc, and that the generated SHLIB_COMPAT
conditionals look right).
[BZ #18472]
* sysdeps/unix/Makefile ($(objpfx)stub-syscalls.c): Handle entries
for the form NAME@VERSION:OBSOLETED and generate SHLIB_COMPAT
conditionals for them.
* sysdeps/unix/make-syscalls.sh (emit_weak_aliases): Likewise.
* sysdeps/unix/sysv/linux/sys/kdaemon.h: Remove file.
* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Remove
sys/kdaemon.h.
* sysdeps/unix/sysv/linux/syscalls.list (bdflush): Make into
compat-only syscall, obsoleted in glibc 2.23.
(create_module): Likewise.
(get_kernel_syms): Likewise.
(query_module): Likewise.
(uselib): Likewise.
* manual/sysinfo.texi (System Parameters): Do not mention bdflush.
With current kernel versions, the check does not reliably detect that
unavailable CPUs are requested, for these reasons:
(1) The kernel will silently ignore non-allowed CPUs, that is, CPUs
which are physically present but disallowed for the thread
based on system configuration.
(2) Similarly, CPU bits which lack an online CPU (possible CPUs)
are ignored.
(3) The existing probing code assumes that the CPU mask size is a
power of two and at least 1024. Neither has it to be a power
of two, nor is the minimum possible value 1024, so the value
determined is often too large. This means that the CPU set
size check in glibc accepts CPU bits beyond the actual hard
system limit.
(4) Future kernel versions may not even have a fixed CPU set size.
After the removal of the probing code, the kernel still returns
EINVAL if no CPU in the requested set remains which can run the
thread after the affinity change.
Applications which care about the exact affinity mask will have
to query it using sched_getaffinity after setting it. Due to the
effects described above, this commit does not change this.
The new tests supersede tst-getcpu, which is removed. This
addresses bug 19164 because the new tests allocate CPU sets
dynamically.
* nptl/check-cpuset.h: Remove.
* nptl/pthread_attr_setaffinity.c (__pthread_attr_setaffinity_new):
Remove CPU set size check.
* nptl/pthread_setattr_default_np.c (pthread_setattr_default_np):
Likewise.
* sysdeps/unix/sysv/linux/check-cpuset.h: Remove.
* sysdeps/unix/sysv/linux/pthread_setaffinity.c
(__kernel_cpumask_size, __determine_cpumask_size): Remove.
(__pthread_setaffinity_new): Remove CPU set size check.
* sysdeps/unix/sysv/linux/sched_setaffinity.c
(__kernel_cpumask_size): Remove.
(__sched_setaffinity_new): Remove CPU set size check.
* manual/threads.texi (Default Thread Attributes): Remove stale
reference to check_cpuset_attr, determine_cpumask_size in comment.
* sysdeps/unix/sysv/linux/Makefile [$(subdir) == posix] (tests):
Remove tst-getcpu. Add tst-affinity, tst-affinity-pid.
[$(subdir) == nptl] (tests): Add tst-thread-affinity-pthread,
tst-thread-affinity-pthread2, tst-thread-affinity-sched.
* sysdeps/unix/sysv/linux/tst-affinity.c: New file.
* sysdeps/unix/sysv/linux/tst-affinity-pid.c: New file.
* sysdeps/unix/sysv/linux/tst-skeleton-affinity.c: New skeleton test file.
* sysdeps/unix/sysv/linux/tst-thread-affinity-sched.c: New file.
* sysdeps/unix/sysv/linux/tst-thread-affinity-pthread.c: New file.
* sysdeps/unix/sysv/linux/tst-thread-affinity-pthread2.c: New file.
* sysdeps/unix/sysv/linux/tst-thread-skeleton-affinity.c: New
skeleton test file.
* sysdeps/unix/sysv/linux/tst-getcpu.c: Remove. Superseded by
tst-affinity-pid.
The recvmsg system calls for netlink sockets have been particularly
prone to picking up unrelated data after a file descriptor race
(where the descriptor is closed and reopened concurrently in a
multi-threaded process, as the result of a file descriptor
management issue elsewhere). This commit adds additional error
checking and aborts the process if a datagram of unexpected length
(without the netlink header) is received, or an error code which
cannot happen due to the way the netlink socket is used.
[BZ #12926]
Terminate process on invalid netlink response.
* sysdeps/unix/sysv/linux/netlinkaccess.h
(__netlink_assert_response): Declare.
* sysdeps/unix/sysv/linux/netlink_assert_response.c: New file.
* sysdeps/unix/sysv/linux/Makefile [$(subdir) == inet]
(sysdep_routines): Add netlink_assert_response.
* sysdeps/unix/sysv/linux/check_native.c (__check_native): Call
__netlink_assert_response.
* sysdeps/unix/sysv/linux/check_pf.c (make_request): Likewise.
* sysdeps/unix/sysv/linux/ifaddrs.c (__netlink_request): Likewise.
* sysdeps/unix/sysv/linux/Versions (GLIBC_PRIVATE): Add
__netlink_assert_response.
Since _dl_x86_64_save_sse and _dl_x86_64_restore_sse are removed now,
we don't need to run tst-getpid2 with LD_BIND_NOW=1.
[BZ #11214]
* sysdeps/unix/sysv/linux/Makefile (tst-getpid2-ENV): Removed.
This patch removes the socket.S implementation for all ports and replace
it by a C implementation using socketcall. For ports that implement
the syscall directly, there is no change.
The patch idea is to simplify the socket function implementation that
uses the socketcall to be based on C implemetation instead of a pseudo
assembly implementation with arch specific parts. The patch then remove
the assembly implementatation for the ports which uses socketcall
(i386, microblaze, mips, powerpc, sparc, m68k, s390 and sh).
I have cross-build GLIBC for afore-mentioned ports and tested on both
i386 and ppc32 without regressions.
The current scheme to identify which module a translation unit is
built in depends on defining multiple macros IS_IN_* and also defining
NOT_IN_libc if we're building a non-libc module. In addition, there
is an IN_LIB macro that does effectively the same thing, but for
different modules (notably the systemtap probes). This macro scheme
unifies both ideas to use just one macro IN_MODULE and assign it a
value depending on the module it is being built into. If the module
is not defined, it defaults to MODULE_libc.
Patches that follow will replace uses of IS_IN_* variables with the
IS_IN() macro. libc-symbols.h has been converted already to give an
example of how such a transition will look.
Verified that there are no relevant binary changes. One source change
that will crop up repeatedly is that of nscd_stat, since it uses the
build timestamp as a constant in its logic.
* Makeconfig (in-module): Get value of libof set for the
translation unit.
(CPPFLAGS): Use $(in-module).
* Makerules: Don't suffix routine names for nonlib.
* include/libc-modules.h: New file.
* include/libc-symbols.h: Include libc-modules.h
(IS_IN): New macro to replace IS_IN_* macros.
* elf/Makefile: Set libof-* for each routine.
* elf/rtld-Rules: Likewise.
* extra-modules.mk: Likewise.
* iconv/Makefile: Likewise.
* iconvdata/Makefile: Likewise.
* locale/Makefile: Likewise.
* malloc/Makefile: Likewise.
* nss/Makefile: Likewise.
* sysdeps/gnu/Makefile: Likewise.
* sysdeps/ieee754/ldbl-opt/Makefile: Likewise.
* sysdeps/unix/sysv/linux/Makefile: Likewise.
* sysdeps/s390/s390-64/Makefile: Likewise.
* nscd/Makefile: Set libof-* for each routine. Set CFLAGS and
CPPFLAGS for nscd instead of nonlib.
The generated header is compiled with `-ffreestanding' to avoid any
circular dependencies against the installed implementation headers.
Such a dependency would require the implementation header to be
installed before the generated header could be built (See bug 15711).
In current practice the generated header dependencies do not include
any of the implementation headers removed by the use of `-ffreestanding'.
---
2013-07-15 Carlos O'Donell <carlos@redhat.com>
[BZ #15711]
* sysdeps/unix/sysv/linux/Makefile ($(objpfx)bits/syscall%h):
Avoid system header dependency with -ffreestanding.
($(objpfx)bits/syscall%d): Likewise.
Whenever getaddrinfo needed network interface information it used the
netlink interface to read the information every single time. The
problem is that this information can change at any time.
The patch implements monitoring of the network interfaces through
nscd. If no change is detected the previously read information can
be reused (which is the norm). This timestamp information is also
made available to other processes using the shared memory segment
between nscd and those processes.
nscd can clear caches when certain files change. The list of files
was hardcoded so far and worked for nss_files and nss_dns and those
modules which need no monitoring. nss_db, for instance, has its
own set of files to monitor. Now the NSS modules themselves can
request that certain files are monitored.
The getcwd syscall (so far?) can only handle path up to one page
in size. There is no limit about directory hierarchy depth, though,
and the POSIX getcwd is supposed to handle this. In that case fall
back to the generic getcwd.
Additionally, optimize the generic getcwd to use openat when possible
to change the asymptotic performance from O(N^2) to O(n).
The ntp_gettime implementation of NTP exports the tai field the kernel
now produces. This requires an ABI change since the ntptimeval structure
changed. Upstream kept the same name, there is nothing to do. This
patch changes the ntptimeval structure but keeps the old ntp_gettime
definition. A new ntp_gettimex function which is transparently invoked
through the old name is introduced. This has the advantage that even
object files can remain compatible. This wouldn't be the case if
symbol versioning would be used to overload the name ntp_gettime.
* sysdeps/unix/sysv/linux/fallocate64.c: New file.
* sysdeps/unix/sysv/linux/wordsize-64/fallocate64.c: New file.
* sysdeps/unix/sysv/linux/wordsize-64/fallocate.c: New file.
* sysdeps/unix/sysv/linux/i386/fallocate64.c: New file.
* sysdeps/unix/sysv/linux/i386/fallocate.c: New file.
* sysdeps/unix/sysv/linux/Versions: Export fallocate and fallocate64
for GLIBC_2.10.
* sysdeps/unix/sysv/linux/Makefile [subdir=io] (sysdep_routines):
Add fallocate and fallocate64.
* io/fcntl.h: Pretty printing.
fields.
* nscd/connections.c (inotify_fd): New variable.
(nscd_init): Try to open an inotify descriptor.
If successful, watch files for databases using inotify instead of
having prune threads stat the files.
(nscd_run_prune): Recognize clear_cache flag being set and call
prune_cache appropriately.
(main_loop_poll): Add inotify descriptor to wait set and handle the
reported changes.
(main_loop_epoll): Likewise.
* nscd/cache.c (prune_cache): Don't stat files for databases if
inotify is used.
* sysdeps/unix/sysv/linux/Makefile [subdir=nscd]
(CFLAGS-connections.c): Add -DHAVE_INOTIFY.