nptl: Invert the mmap/mprotect logic on allocated stacks (BZ#18988)
Current allocate_stack logic for create stacks is to first mmap all
the required memory with the desirable memory and then mprotect the
guard area with PROT_NONE if required. Although it works as expected,
it pessimizes the allocation because it requires the kernel to actually
increase commit charge (it counts against the available physical/swap
memory available for the system).
The only issue is to actually check this change since side-effects are
really Linux specific and to actually account them it would require a
kernel specific tests to parse the system wide information. On the kernel
I checked /proc/self/statm does not show any meaningful difference for
vmm and/or rss before and after thread creation. I could only see
really meaningful information checking on system wide /proc/meminfo
between thread creation: MemFree, MemAvailable, and Committed_AS shows
large difference without the patch. I think trying to use these
kind of information on a testcase is fragile.
The BZ#18988 reports shows that the commit pages are easily seen with
mlockall (MCL_FUTURE) (with lock all pages that become mapped in the
process) however a more straighfoward testcase shows that pthread_create
could be faster using this patch:
--
static const int inner_count = 256;
static const int outer_count = 128;
static
void *thread1(void *arg)
{
return NULL;
}
static
void *sleeper(void *arg)
{
pthread_t ts[inner_count];
for (int i = 0; i < inner_count; i++)
pthread_create (&ts[i], &a, thread1, NULL);
for (int i = 0; i < inner_count; i++)
pthread_join (ts[i], NULL);
return NULL;
}
int main(void)
{
pthread_attr_init(&a);
pthread_attr_setguardsize(&a, 1<<20);
pthread_attr_setstacksize(&a, 1134592);
pthread_t ts[outer_count];
for (int i = 0; i < outer_count; i++)
pthread_create(&ts[i], &a, sleeper, NULL);
for (int i = 0; i < outer_count; i++)
pthread_join(ts[i], NULL);
assert(r == 0);
}
return 0;
}
--
On x86_64 (4.4.0-45-generic, gcc 5.4.0) running the small benchtests
I see:
$ time ./test
real 0m3.647s
user 0m0.080s
sys 0m11.836s
While with the patch I see:
$ time ./test
real 0m0.696s
user 0m0.040s
sys 0m1.152s
So I added a pthread_create benchtest (thread_create) which check
the thread creation latency. As for the simple benchtests, I saw
improvements in thread creation on all architectures I tested the
change.
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
arm-linux-gnueabihf, powerpc64le-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu.
[BZ #18988]
* benchtests/thread_create-inputs: New file.
* benchtests/thread_create-source.c: Likewise.
* support/xpthread_attr_setguardsize.c: Likewise.
* support/Makefile (libsupport-routines): Add
xpthread_attr_setguardsize object.
* support/xthread.h: Add xpthread_attr_setguardsize prototype.
* benchtests/Makefile (bench-pthread): Add thread_create.
* nptl/allocatestack.c (allocate_stack): Call mmap with PROT_NONE and
then mprotect the required area.
2017-01-31 20:01:59 +00:00
|
|
|
/* Measure pthread_create thread creation with different stack
|
|
|
|
and guard sizes.
|
|
|
|
|
2018-01-01 00:32:25 +00:00
|
|
|
Copyright (C) 2017-2018 Free Software Foundation, Inc.
|
nptl: Invert the mmap/mprotect logic on allocated stacks (BZ#18988)
Current allocate_stack logic for create stacks is to first mmap all
the required memory with the desirable memory and then mprotect the
guard area with PROT_NONE if required. Although it works as expected,
it pessimizes the allocation because it requires the kernel to actually
increase commit charge (it counts against the available physical/swap
memory available for the system).
The only issue is to actually check this change since side-effects are
really Linux specific and to actually account them it would require a
kernel specific tests to parse the system wide information. On the kernel
I checked /proc/self/statm does not show any meaningful difference for
vmm and/or rss before and after thread creation. I could only see
really meaningful information checking on system wide /proc/meminfo
between thread creation: MemFree, MemAvailable, and Committed_AS shows
large difference without the patch. I think trying to use these
kind of information on a testcase is fragile.
The BZ#18988 reports shows that the commit pages are easily seen with
mlockall (MCL_FUTURE) (with lock all pages that become mapped in the
process) however a more straighfoward testcase shows that pthread_create
could be faster using this patch:
--
static const int inner_count = 256;
static const int outer_count = 128;
static
void *thread1(void *arg)
{
return NULL;
}
static
void *sleeper(void *arg)
{
pthread_t ts[inner_count];
for (int i = 0; i < inner_count; i++)
pthread_create (&ts[i], &a, thread1, NULL);
for (int i = 0; i < inner_count; i++)
pthread_join (ts[i], NULL);
return NULL;
}
int main(void)
{
pthread_attr_init(&a);
pthread_attr_setguardsize(&a, 1<<20);
pthread_attr_setstacksize(&a, 1134592);
pthread_t ts[outer_count];
for (int i = 0; i < outer_count; i++)
pthread_create(&ts[i], &a, sleeper, NULL);
for (int i = 0; i < outer_count; i++)
pthread_join(ts[i], NULL);
assert(r == 0);
}
return 0;
}
--
On x86_64 (4.4.0-45-generic, gcc 5.4.0) running the small benchtests
I see:
$ time ./test
real 0m3.647s
user 0m0.080s
sys 0m11.836s
While with the patch I see:
$ time ./test
real 0m0.696s
user 0m0.040s
sys 0m1.152s
So I added a pthread_create benchtest (thread_create) which check
the thread creation latency. As for the simple benchtests, I saw
improvements in thread creation on all architectures I tested the
change.
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
arm-linux-gnueabihf, powerpc64le-linux-gnu, sparc64-linux-gnu,
and sparcv9-linux-gnu.
[BZ #18988]
* benchtests/thread_create-inputs: New file.
* benchtests/thread_create-source.c: Likewise.
* support/xpthread_attr_setguardsize.c: Likewise.
* support/Makefile (libsupport-routines): Add
xpthread_attr_setguardsize object.
* support/xthread.h: Add xpthread_attr_setguardsize prototype.
* benchtests/Makefile (bench-pthread): Add thread_create.
* nptl/allocatestack.c (allocate_stack): Call mmap with PROT_NONE and
then mprotect the required area.
2017-01-31 20:01:59 +00:00
|
|
|
This file is part of the GNU C Library.
|
|
|
|
|
|
|
|
The GNU C Library is free software; you can redistribute it and/or
|
|
|
|
modify it under the terms of the GNU Lesser General Public
|
|
|
|
License as published by the Free Software Foundation; either
|
|
|
|
version 2.1 of the License, or (at your option) any later version.
|
|
|
|
|
|
|
|
The GNU C Library is distributed in the hope that it will be useful,
|
|
|
|
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
Lesser General Public License for more details.
|
|
|
|
|
|
|
|
You should have received a copy of the GNU Lesser General Public
|
|
|
|
License along with the GNU C Library; if not, see
|
|
|
|
<http://www.gnu.org/licenses/>. */
|
|
|
|
|
|
|
|
#include <stdio.h>
|
|
|
|
#include <unistd.h>
|
|
|
|
#include <support/xthread.h>
|
|
|
|
|
|
|
|
static size_t pgsize;
|
|
|
|
|
|
|
|
static void
|
|
|
|
thread_create_init (void)
|
|
|
|
{
|
|
|
|
pgsize = sysconf (_SC_PAGESIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *
|
|
|
|
thread_dummy (void *arg)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
thread_create (int nthreads, size_t stacksize, size_t guardsize)
|
|
|
|
{
|
|
|
|
pthread_attr_t attr;
|
|
|
|
xpthread_attr_init (&attr);
|
|
|
|
|
|
|
|
stacksize = stacksize * pgsize;
|
|
|
|
guardsize = guardsize * pgsize;
|
|
|
|
|
|
|
|
xpthread_attr_setstacksize (&attr, stacksize);
|
|
|
|
xpthread_attr_setguardsize (&attr, guardsize);
|
|
|
|
|
|
|
|
pthread_t ts[nthreads];
|
|
|
|
|
|
|
|
for (int i = 0; i < nthreads; i++)
|
|
|
|
ts[i] = xpthread_create (&attr, thread_dummy, NULL);
|
|
|
|
|
|
|
|
for (int i = 0; i < nthreads; i++)
|
|
|
|
xpthread_join (ts[i]);
|
|
|
|
}
|