Commit Graph

493 Commits

Author SHA1 Message Date
H.J. Lu
c328d0152d x86_64/multiarch: Sort sysdep_routines and put one entry per line 2022-02-05 16:42:17 -08:00
H.J. Lu
0e0199a9e0 x86-64: Fix strcmp-evex.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit 8418eb3ff4
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:39 2022 -0600

    x86: Optimize strcmp-evex.S
2022-02-04 11:11:08 -08:00
H.J. Lu
c15efd011c x86-64: Fix strcmp-avx2.S
Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit b77b06e0e2
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Mon Jan 10 15:35:38 2022 -0600

    x86: Optimize strcmp-avx2.S
2022-02-04 11:09:10 -08:00
Noah Goldstein
8418eb3ff4 x86: Optimize strcmp-evex.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases as well are nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-02-03 16:41:41 -06:00
Noah Goldstein
b77b06e0e2 x86: Optimize strcmp-avx2.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-02-03 16:41:38 -06:00
Noah Goldstein
7e08db3359 x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-01-10 20:31:57 -06:00
Noah Goldstein
ddf0992cf5 x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-01-10 20:31:46 -06:00
Paul Eggert
581c785bf3 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.

I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah.  I don't
know why I run into these diagnostics whereas others evidently do not.

remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2022-01-01 11:40:24 -08:00
Noah Goldstein
cca457f9c5 x86: Optimize L(less_vec) case in memcmpeq-evex.S
No bug.
Optimizations are twofold.

1) Replace page cross and 0/1 checks with masked load instructions in
   L(less_vec). In applications this reduces branch-misses in the
   hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.

Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range.  From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-12-27 03:18:58 -06:00
Noah Goldstein
abddd61de0 x86: Optimize L(less_vec) case in memcmp-evex-movbe.S
No bug.
Optimizations are twofold.

1) Replace page cross and 0/1 checks with masked load instructions in
   L(less_vec). In applications this reduces branch-misses in the
   hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.

Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range.  From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-12-27 03:17:59 -06:00
Noah Goldstein
4df1fa6ddc x86-64: Use notl in EVEX strcmp [BZ #28646]
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
2021-12-03 21:14:11 -08:00
Noah Goldstein
2f9062d717 x86: Shrink memcmp-sse4.S code size
No bug.

This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.

memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes

The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.

See https://research.google/pubs/pub48320/ for more details.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-11-10 20:12:10 -06:00
Noah Goldstein
a6b7502ec0 x86: Optimize memmove-vec-unaligned-erms.S
No bug.

The optimizations are as follows:

1) Always align entry to 64 bytes. This makes behavior more
   predictable and makes other frontend optimizations easier.

2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
   significant benefits in the case that:
        0 < (dst - src) < [256, 512]

3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
   improvement and for FSRM [-10%, 25%].

In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-11-06 16:18:03 -05:00
Fangrui Song
6720d36b66 x86-64: Replace movzx with movzbl
Clang cannot assemble movzx in the AT&T dialect mode.

../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
 movzx (%rsi), %ecx
               ^~~~

Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-11-02 20:59:52 -07:00
H.J. Lu
14dbbf46a0 x86-64: Remove Prefer_AVX2_STRCMP
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp.  When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL.  EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
2021-11-01 07:53:04 -07:00
H.J. Lu
c46e9afb2d x86-64: Improve EVEX strcmp with masked load
In strcmp-evex.S, to compare 2 32-byte strings, replace

        VMOVU   (%rdi, %rdx), %YMM0
        VMOVU   (%rsi, %rdx), %YMM1
        /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
        VPCMP   $4, %YMM0, %YMM1, %k0
        VPCMP   $0, %YMMZERO, %YMM0, %k1
        VPCMP   $0, %YMMZERO, %YMM1, %k2
        /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
        kord    %k1, %k2, %k1
        /* Each bit in K1 represents a NULL or a mismatch.  */
        kord    %k0, %k1, %k1
        kmovd   %k1, %ecx
        testl   %ecx, %ecx
        jne     L(last_vector)

with

        VMOVU   (%rdi, %rdx), %YMM0
        VPTESTM %YMM0, %YMM0, %k2
        /* Each bit cleared in K1 represents a mismatch or a null CHAR
           in YMM0 and 32 bytes at (%rsi, %rdx).  */
        VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
        kmovd   %k1, %ecx
        incl    %ecx
        jne     L(last_vector)

It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.

Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
2021-11-01 07:52:56 -07:00
Noah Goldstein
9b7cfab180 x86_64: Add evex optimized __memcmpeq in memcmpeq-evex.S
No bug. This commit adds new optimized __memcmpeq implementation for
evex.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
2021-10-27 13:03:46 -05:00
Noah Goldstein
b4ed69ba16 x86_64: Add avx2 optimized __memcmpeq in memcmpeq-avx2.S
No bug. This commit adds new optimized __memcmpeq implementation for
avx2.

The primary optimizations are:

1) skipping the logic to find the difference of the first mismatched
byte.

2) not updating src/dst addresses as the non-equals logic does not
need to be reused by different areas.
2021-10-27 13:03:46 -05:00
Noah Goldstein
cf4fd28ea4 x86_64: Add support for __memcmpeq using sse2, avx2, and evex
No bug. This commit adds support for __memcmpeq to be implemented
seperately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.
2021-10-27 13:03:46 -05:00
Noah Goldstein
9894127d20 String: Add hidden defs for __memcmpeq() to enable internal usage
No bug.

This commit adds hidden defs for all declarations of __memcmpeq. This
enables usage of __memcmpeq without the PLT for usage internal to
GLIBC.
2021-10-26 16:51:29 -05:00
Noah Goldstein
44829b3ddb String: Add support for __memcmpeq() ABI on all targets
No bug.

This commit adds support for __memcmpeq() as a new ABI for all
targets. In this commit __memcmpeq() is implemented only as an alias
to the corresponding targets memcmp() implementation. __memcmpeq() is
added as a new symbol starting with GLIBC_2.35 and defined in string.h
with comments explaining its behavior. Basic tests that it is callable
and works where added in string/tester.c

As discussed in the proposal "Add new ABI '__memcmpeq()' to libc"
__memcmpeq() is essentially a reserved namespace for bcmp(). The means
is shares the same specifications as memcmp() except the return value
for non-equal byte sequences is any non-zero value. This is less
strict than memcmp()'s return value specification and can be better
optimized when a boolean return is all that is needed.

__memcmpeq() is meant to only be called by compilers if they can prove
that the return value of a memcmp() call is only used for its boolean
value.

All tests in string/tester.c passed. As well build succeeds on
x86_64-linux-gnu target.
2021-10-26 16:51:29 -05:00
Noah Goldstein
bad852b61b x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.

it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.

It costs 2-extra bytes but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-10-23 13:02:42 -05:00
Noah Goldstein
e59ced2384 x86: Optimize memset-vec-unaligned-erms.S
No bug.

Optimization are

1. change control flow for L(more_2x_vec) to fall through to loop and
   jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
   size and saves jumps for length > 4x VEC_SIZE.

2. For EVEX/AVX512 move L(less_vec) closer to entry.

3. Avoid complex address mode for length > 2x VEC_SIZE

4. Slightly better aligning code for the loop from the perspective of
   code size and uops.

5. Align targets so they make full use of their fetch block and if
   possible cache line.

6. Try and reduce total number of icache lines that will need to be
   pulled in for a given length.

7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
   jumping to the stosb target in the sse2 code section will almost
   certainly be to a new page. The new version does increase code size
   marginally by duplicating the target but should get better iTLB
   behavior as a result.

test-memset, test-wmemset, and test-bzero are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-10-12 13:38:02 -05:00
Noah Goldstein
1bd8b8d58f x86: Optimize memcmp-evex-movbe.S for frontend behavior and size
No bug.

The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
   the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
   code size. Smaller / Less hot basic blocks are used for this.

Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.

In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.

The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.

test-memcmp and test-wmemcmp are both passing.
2021-10-12 12:02:12 -05:00
Siddhesh Poyarekar
30891f35fa Remove "Contributed by" lines
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date.  Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.

Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions.  These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.

The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively.  These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:

https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dc
https://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-09-03 22:06:44 +05:30
H.J. Lu
3c8b9879ca x86-64: Use testl to check __x86_string_control
Use testl, instead of andl, to check __x86_string_control to avoid
updating __x86_string_control.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2021-08-30 10:39:53 -07:00
H.J. Lu
91cc803d27 x86-64: Add Avoid_Short_Distance_REP_MOVSB
commit 3ec5d83d2a
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sat Jan 25 14:19:40 2020 -0800

    x86-64: Avoid rep movsb with short distance [BZ #27130]

introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM).  Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM.  bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improves for the following data size:

                                  before    after    Improvement
length=4127, align1=3, align2=0:  479.38    349.25      27%
length=4223, align1=9, align2=5:  405.62    333.25      18%
length=8223, align1=3, align2=0:  786.12    496.38      37%
length=8319, align1=9, align2=5:  727.50    501.38      31%
length=16415, align1=3, align2=0: 1436.88   840.00      41%
length=16511, align1=9, align2=5: 1375.50   836.38      39%
length=32799, align1=3, align2=0: 2890.00   1860.12     36%
length=32895, align1=9, align2=5: 2891.38   1931.88     33%
2021-07-28 13:23:57 -07:00
Noah Goldstein
0679442def x86: Remove wcsnlen-sse4_1 from wcslen ifunc-impl-list [BZ #28064]
The following commit

commit 6f573a27b6
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 23 01:19:34 2021 -0400

    x86-64: Add wcslen optimize for sse4.1

Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did
not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit
fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to the ifunc
implementation list.

Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-07-08 18:55:43 -04:00
Noah Goldstein
08cbcd4dbc x86: Remove unnecessary overflow check from wcsnlen-sse4_1.S
No bug. The way wcsnlen will check if near the end of maxlen
is the following macro:

	mov	%r11, %rsi;	\
	subq	%rax, %rsi;	\
	andq	$-64, %rax;	\
	testq	$-64, %rsi;	\
	je	L(strnlen_ret)

Which words independently of s + maxlen overflowing. So the
second overflow check is unnecissary for correctness and
just extra overhead in the common no overflow case.

test-strlen.c, test-wcslen.c, test-strnlen.c and test-wcsnlen.c are
all passing

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-06-24 19:14:19 -04:00
Noah Goldstein
a775a7a3eb x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]
This commit fixes the bug mentioned in the previous commit.

The previous implementations of wmemchr in these files relied
on maxlen * sizeof(wchar_t) which was not guranteed by the standard.

The new overflow tests added in the previous commit now
pass (As well as all the other tests).

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-06-23 14:13:06 -04:00
Noah Goldstein
645a158978 x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]
This commit fixes the bug mentioned in the previous commit.

The previous implementations of wmemchr in these files relied
on n * sizeof(wchar_t) which was not guranteed by the standard.

The new overflow tests added in the previous commit now
pass (As well as all the other tests).

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-06-23 14:13:03 -04:00
Noah Goldstein
6f573a27b6 x86-64: Add wcslen optimize for sse4.1
No bug. This comment adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-06-23 14:12:36 -04:00
H.J. Lu
a0db678071 x86-64: Move strlen.S to multiarch/strlen-vec.S
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
2021-06-23 10:24:35 -07:00
Noah Goldstein
1b992204f6 x86: Improve memmove-vec-unaligned-erms.S
This patch changes the condition for copy 4x VEC so that if length is
exactly equal to 4 * VEC_SIZE it will use the 4x VEC case instead of
8x VEC case.

Results For Skylake memcpy-avx2-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 0   , 9.137   , 6.873   , New , 75.22
128 , 7   , 0   , 12.933  , 7.732   , New , 59.79
128 , 0   , 7   , 11.852  , 6.76    , New , 57.04
128 , 7   , 7   , 12.587  , 6.808   , New , 54.09

Results For Icelake memcpy-evex-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 0   , 9.963   , 5.416   , New , 54.36
128 , 7   , 0   , 16.467  , 8.061   , New , 48.95
128 , 0   , 7   , 14.388  , 7.644   , New , 53.13
128 , 7   , 7   , 14.546  , 7.642   , New , 52.54

Results For Tigerlake memcpy-evex-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 0   , 8.979   , 4.95    , New , 55.13
128 , 7   , 0   , 14.245  , 7.122   , New , 50.0
128 , 0   , 7   , 12.668  , 6.675   , New , 52.69
128 , 7   , 7   , 13.042  , 6.802   , New , 52.15

Results For Skylake memmove-avx2-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 32  , 6.181   , 5.691   , New , 92.07
128 , 32  , 0   , 6.165   , 5.752   , New , 93.3
128 , 0   , 7   , 13.923  , 9.37    , New , 67.3
128 , 7   , 0   , 12.049  , 10.182  , New , 84.5

Results For Icelake memmove-evex-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 32  , 5.479   , 4.889   , New , 89.23
128 , 32  , 0   , 5.127   , 4.911   , New , 95.79
128 , 0   , 7   , 18.885  , 13.547  , New , 71.73
128 , 7   , 0   , 15.565  , 14.436  , New , 92.75

Results For Tigerlake memmove-evex-erms
size, al1 , al2 , Cur T   , New T   , Win , New / Cur
128 , 0   , 32  , 5.275   , 4.815   , New , 91.28
128 , 32  , 0   , 5.376   , 4.565   , New , 84.91
128 , 0   , 7   , 19.426  , 14.273  , New , 73.47
128 , 7   , 0   , 15.924  , 14.951  , New , 93.89

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-05-23 22:50:49 -04:00
Noah Goldstein
6abf27980a x86: Improve memset-vec-unaligned-erms.S
No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64
instead of 128. Either alignment will perform equally well in a loop
and 128 just increases the odds of having to do an extra iteration
which can be significant overhead for small values. 2) Align some
targets and the loop. 3) Remove an ALU from the alignment process. 4)
Reorder the last 4x VEC so that they are stored after the loop. 5)
Move the condition for leq 8x VEC to before the alignment
process. test-memset and test-wmemset are both passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-20 17:28:33 -04:00
Noah Goldstein
4ad473e97a x86: Optimize memcmp-evex-movbe.S
No bug. This commit optimizes memcmp-evex.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, removing some unnecissary ALU instructions from the main
loop, and most importantly replacing the heavy use of vpcmp + kand
logic with vpxor + vptern. test-memcmp and test-wmemcmp are both
passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-18 22:57:51 -04:00
Noah Goldstein
16d12015c5 x86: Optimize memcmp-avx2-movbe.S
No bug. This commit optimizes memcmp-avx2.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, and removing some unnecissary ALU instructions from the
main loop. test-memcmp and test-wmemcmp are both passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-18 22:57:44 -04:00
Noah Goldstein
104c7b1967 x86: Add EVEX optimized memchr family not safe for RTM
No bug.

This commit adds a new implementation for EVEX memchr that is not safe
for RTM because it uses vzeroupper. The benefit is that by using
ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is
faster than the RTM safe version which cannot use vpcmpeq because
there is no EVEX encoding for the instruction. All parts of the
implementation aside from the 4x loop are the same for the two
versions and the optimization is only relevant for large sizes.

Tigerlake:
size  , algn  , Pos   , Cur T , New T , Win     , Dif
512   , 6     , 192   , 9.2   , 9.04  , no-RTM  , 0.16
512   , 7     , 224   , 9.19  , 8.98  , no-RTM  , 0.21
2048  , 0     , 256   , 10.74 , 10.54 , no-RTM  , 0.2
2048  , 0     , 512   , 14.81 , 14.87 , RTM     , 0.06
2048  , 0     , 1024  , 22.97 , 22.57 , no-RTM  , 0.4
2048  , 0     , 2048  , 37.49 , 34.51 , no-RTM  , 2.98   <--

Icelake:
size  , algn  , Pos   , Cur T , New T , Win     , Dif
512   , 6     , 192   , 7.6   , 7.3   , no-RTM  , 0.3
512   , 7     , 224   , 7.63  , 7.27  , no-RTM  , 0.36
2048  , 0     , 256   , 8.48  , 8.38  , no-RTM  , 0.1
2048  , 0     , 512   , 11.57 , 11.42 , no-RTM  , 0.15
2048  , 0     , 1024  , 17.92 , 17.38 , no-RTM  , 0.54
2048  , 0     , 2048  , 30.37 , 27.34 , no-RTM  , 3.03   <--

test-memchr, test-wmemchr, and test-rawmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-08 16:26:30 -04:00
Alice Xu
6ea916adfa x86-64: Fix an unknown vector operation in memchr-evex.S
An unknown vector operation occurred in commit 2a76821c30. Fixed it
by using "ymm{k1}{z}" but not "ymm {k1} {z}".

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-07 19:03:21 -07:00
Adhemerval Zanella
db373e4c57 Remove architecture specific sched_cpucount optimizations
And replace the generic algorithm with the Brian Kernighan's one.
GCC optimize it with popcnt if the architecture supports, so there
is no need to add the extra POPCNT define to enable it.

This is really a micro-optimization that only adds complexity:
recent ABIs already support it (x86-64-v2 or power64le) and it
simplifies the code for internal usage, since i686 does not allow an
internal iFUNC call.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
2021-05-07 13:35:29 -03:00
Noah Goldstein
2a76821c30 x86: Optimize memchr-evex.S
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-03 21:18:03 -04:00
Noah Goldstein
acfd088a19 x86: Optimize memchr-avx2.S
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
asaving a few instructions the in loop return loop. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-03 21:17:21 -04:00
Noah Goldstein
7f3e7c262c x86: Optimize strchr-evex.S
No bug. This commit optimizes strchr-evex.S. The optimizations are
mostly small things such as save an ALU in the alignment process,
saving a few instructions in the loop return. The one significant
change is saving 2 instructions in the 4x loop. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-25 10:04:39 -07:00
Noah Goldstein
ccabe7971f x86: Optimize strchr-avx2.S
No bug. This commit optimizes strchr-avx2.S. The optimizations are all
small things such as save an ALU in the alignment process, saving a
few instructions in the loop return, saving some bytes in the main
loop, and increasing the ILP in the return cases. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-25 10:04:31 -07:00
Noah Goldstein
aaa23c3507 x86: Optimize strlen-avx2.S
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-19 18:03:49 -07:00
Noah Goldstein
4ba6558684 x86: Optimize strlen-evex.S
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-19 18:03:49 -07:00
Noah Goldstein
f53790272c x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S
No bug. This commit adds optimized cased for less_vec memset case that
uses the avx512vl/avx512bw mask store avoiding the excessive
branches. test-memset and test-wmemset are passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-19 15:08:04 -07:00
H.J. Lu
83c5b36822 x86-64: Require BMI2 for strchr-avx2.S
Since strchr-avx2.S updated by

commit 1f745ecc21
Author: noah <goldstein.w.n@gmail.com>
Date:   Wed Feb 3 00:38:59 2021 -0500

    x86-64: Refactor and improve performance of strchr-avx2.S

uses sarx:

c4 e2 72 f7 c0       	sarx   %ecx,%eax,%eax

for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and
ifunc-avx2.h.
2021-04-19 11:01:45 -07:00
H.J. Lu
55bf411b45 x86-64: Require BMI2 for __strlen_evex and __strnlen_evex
Since __strlen_evex and __strnlen_evex added by

commit 1fd8c163a8
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 5 06:24:52 2021 -0800

    x86-64: Add ifunc-avx2.h functions with 256-bit EVEX

use sarx:

c4 e2 6a f7 c0       	sarx   %edx,%eax,%eax

require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c.
ifunc-avx2.h already requires BMI2 for EVEX implementation.
2021-04-19 07:51:33 -07:00
noah
1a8605b6cd x86: Update large memcpy case in memmove-vec-unaligned-erms.S
No Bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the affects of false memory aliasing
when destination and source have a close 4k alignment and 2) In most
cases and for most DRAM units is a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
2021-04-16 10:06:56 -07:00
Sunil K Pandey
595c22ecd8 x86-64: Fix ifdef indentation in strlen-evex.S
Fix some indentations of ifdef in file strlen-evex.S which are off by 1
and confusing to read.
2021-04-01 16:13:33 -07:00
H.J. Lu
e4fda46310 x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
4e2d8f3527 x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
7ebba91361 x86-64: Add AVX optimized string/memory functions for RTM
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with

	xtest
	jz	1f
	vzeroall
	ret
1:
	vzeroupper
	ret

at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
2021-03-29 07:40:17 -07:00
H.J. Lu
91264fe357 x86-64: Add memcmp family functions with 256-bit EVEX
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
1b968b6b9b x86-64: Add memset family functions with 256-bit EVEX
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
63ad43566f x86-64: Add memmove family functions with 256-bit EVEX
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
525bc2a32c x86-64: Add strcpy family functions with 256-bit EVEX
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
1fd8c163a8 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.

For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
2021-03-29 07:40:17 -07:00
noah
1f745ecc21 x86-64: Refactor and improve performance of strchr-avx2.S
No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.
2021-02-08 11:21:33 -08:00
Sajan Karumanchi
6e02b3e932 x86: Adding an upper bound for Enhanced REP MOVSB.
In the process of optimizing memcpy for AMD machines, we have found the
vector move operations are outperforming enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'.
As per large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines and applicable from Zen3 architecture
supporting the ERMS feature.
For architectures other than AMD, it is the computed value of
non-temporal threshold parameter.

Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
2021-02-02 12:42:15 +01:00
H.J. Lu
3ec5d83d2a x86-64: Avoid rep movsb with short distance [BZ #27130]
When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow.  This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:

	cmpl	$63, %ecx
	// Don't use "rep movsb" if ECX <= 63
	jbe	L(Don't use rep movsb")
	Use "rep movsb"

Benchtests data with bench-memcpy, bench-memcpy-large, bench-memcpy-random
and bench-memcpy-walk on Skylake, Ice Lake and Tiger Lake show that its
performance impact is within noise range as "rep movsb" is only used for
data size >= 4KB.
2021-01-04 07:58:57 -08:00
Paul Eggert
2b778ceb40 Update copyright dates with scripts/update-copyrights
I used these shell commands:

../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
2021-01-02 12:17:34 -08:00
H.J. Lu
9620398097 x86: Install <sys/platform/x86.h> [BZ #26124]
Install <sys/platform/x86.h> so that programmers can do

 #if __has_include(<sys/platform/x86.h>)
 #include <sys/platform/x86.h>
 #endif
 ...

   if (CPU_FEATURE_USABLE (SSE2))
 ...
   if (CPU_FEATURE_USABLE (AVX2))
 ...

<sys/platform/x86.h> exports only:

enum
{
  COMMON_CPUID_INDEX_1 = 0,
  COMMON_CPUID_INDEX_7,
  COMMON_CPUID_INDEX_80000001,
  COMMON_CPUID_INDEX_D_ECX_1,
  COMMON_CPUID_INDEX_80000007,
  COMMON_CPUID_INDEX_80000008,
  COMMON_CPUID_INDEX_7_ECX_1,
  /* Keep the following line at the end.  */
  COMMON_CPUID_INDEX_MAX
};

struct cpuid_features
{
  struct cpuid_registers cpuid;
  struct cpuid_registers usable;
};

struct cpu_features
{
  struct cpu_features_basic basic;
  struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
};

/* Get a pointer to the CPU features structure.  */
extern const struct cpu_features *__x86_get_cpu_features
  (unsigned int max) __attribute__ ((const));

Since all feature checks are done through macros, programs compiled with
a newer <sys/platform/x86.h> are compatible with the older glibc binaries
as long as the layout of struct cpu_features is identical.  The features
array can be expanded with backward binary compatibility for both .o and
.so files.  When COMMON_CPUID_INDEX_MAX is increased to support new
processor features, __x86_get_cpu_features in the older glibc binaries
returns NULL and HAS_CPU_FEATURE/CPU_FEATURE_USABLE return false on the
new processor feature.  No new symbol version is neeeded.

Both CPU_FEATURE_USABLE and HAS_CPU_FEATURE are provided.  HAS_CPU_FEATURE
can be used to identify processor features.

Note: Although GCC has __builtin_cpu_supports, it only supports a subset
of <sys/platform/x86.h> and it is equivalent to CPU_FEATURE_USABLE.  It
doesn't support HAS_CPU_FEATURE.
2020-09-11 17:20:52 -07:00
H.J. Lu
107e6a3c22 x86: Support usable check for all CPU features
Support usable check for all CPU features with the following changes:

1. Change struct cpu_features to

struct cpuid_features
{
  struct cpuid_registers cpuid;
  struct cpuid_registers usable;
};

struct cpu_features
{
  struct cpu_features_basic basic;
  struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
  unsigned int preferred[PREFERRED_FEATURE_INDEX_MAX];
...
};

so that there is a usable bit for each cpuid bit.
2. After the cpuid bits have been initialized, copy the known bits to the
usable bits.  EAX/EBX from INDEX_1 and EAX from INDEX_7 aren't used for
CPU feature detection.
3. Clear the usable bits which require OS support.
4. If the feature is supported by OS, copy its cpuid bit to its usable
bit.
5. Replace HAS_CPU_FEATURE and CPU_FEATURES_CPU_P with CPU_FEATURE_USABLE
and CPU_FEATURE_USABLE_P to check if a feature is usable.
6. Add DEPR_FPU_CS_DS for INDEX_7_EBX_13.
7. Unset MPX feature since it has been deprecated.

The results are

1. If the feature is known and doesn't requre OS support, its usable bit
is copied from the cpuid bit.
2. Otherwise, its usable bit is copied from the cpuid bit only if the
feature is known to supported by OS.
3. CPU_FEATURE_USABLE/CPU_FEATURE_USABLE_P are used to check if the
feature can be used.
4. HAS_CPU_FEATURE/CPU_FEATURE_CPU_P are used to check if CPU supports
the feature.
2020-07-13 06:05:16 -07:00
H.J. Lu
3f4b61a0b8 x86: Add thresholds for "rep movsb/stosb" to tunables
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2020-07-06 11:48:42 -07:00
Sunil K Pandey
75870237ff Fix avx2 strncmp offset compare condition check [BZ #25933]
strcmp-avx2.S: In avx2 strncmp function, strings are compared in
chunks of 4 vector size(i.e. 32x4=128 byte for avx2). After first 4
vector size comparison, code must check whether it already passed
the given offset. This patch implement avx2 offset check condition
for strncmp function, if both string compare same for first 4 vector
size.
2020-06-17 07:07:38 -07:00
H.J. Lu
a35a59036e x86_64: Use %xmmN with vpxor to clear a vector register
Since "vpxor %xmmN, %xmmN, %xmmN" clears the whole vector register, use
%xmmN, instead of %ymmN, with vpxor to clear a vector register.
2020-06-17 05:44:02 -07:00
H.J. Lu
55c7bcc71b x86-64: Use RDX_LP on __x86_shared_non_temporal_threshold [BZ #25966]
Since __x86_shared_non_temporal_threshold is defined as

long int __x86_shared_non_temporal_threshold;

and long int is 4 bytes for x32, use RDX_LP to compare against
__x86_shared_non_temporal_threshold in assembly code.
2020-05-09 12:28:15 -07:00
Joseph Myers
d614a75396 Update copyright dates with scripts/update-copyrights. 2020-01-01 00:14:33 +00:00
Paul Eggert
5a82c74822 Prefer https to http for gnu.org and fsf.org URLs
Also, change sources.redhat.com to sourceware.org.
This patch was automatically generated by running the following shell
script, which uses GNU sed, and which avoids modifying files imported
from upstream:

sed -ri '
  s,(http|ftp)(://(.*\.)?(gnu|fsf|sourceware)\.org($|[^.]|\.[^a-z])),https\2,g
  s,(http|ftp)(://(.*\.)?)sources\.redhat\.com($|[^.]|\.[^a-z]),https\2sourceware.org\4,g
' \
  $(find $(git ls-files) -prune -type f \
      ! -name '*.po' \
      ! -name 'ChangeLog*' \
      ! -path COPYING ! -path COPYING.LIB \
      ! -path manual/fdl-1.3.texi ! -path manual/lgpl-2.1.texi \
      ! -path manual/texinfo.tex ! -path scripts/config.guess \
      ! -path scripts/config.sub ! -path scripts/install-sh \
      ! -path scripts/mkinstalldirs ! -path scripts/move-if-change \
      ! -path INSTALL ! -path  locale/programs/charmap-kw.h \
      ! -path po/libc.pot ! -path sysdeps/gnu/errlist.c \
      ! '(' -name configure \
            -execdir test -f configure.ac -o -f configure.in ';' ')' \
      ! '(' -name preconfigure \
            -execdir test -f preconfigure.ac ';' ')' \
      -print)

and then by running 'make dist-prepare' to regenerate files built
from the altered files, and then executing the following to cleanup:

  chmod a+x sysdeps/unix/sysv/linux/riscv/configure
  # Omit irrelevant whitespace and comment-only changes,
  # perhaps from a slightly-different Autoconf version.
  git checkout -f \
    sysdeps/csky/configure \
    sysdeps/hppa/configure \
    sysdeps/riscv/configure \
    sysdeps/unix/sysv/linux/csky/configure
  # Omit changes that caused a pre-commit check to fail like this:
  # remote: *** error: sysdeps/powerpc/powerpc64/ppc-mcount.S: trailing lines
  git checkout -f \
    sysdeps/powerpc/powerpc64/ppc-mcount.S \
    sysdeps/unix/sysv/linux/s390/s390-64/syscall.S
  # Omit change that caused a pre-commit check to fail like this:
  # remote: *** error: sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S: last line does not end in newline
  git checkout -f sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S
2019-09-07 02:43:31 -07:00
Adhemerval Zanella
81a1443941 wcsmbs: optimize wcscat
This patch rewrites wcscat using wcslen and wcscpy.  This is similar to
the optimization done on strcat by 6e46de42fe.

The strcpy changes are mainly to add the internal alias to avoid PLT
calls.

Checked on x86_64-linux-gnu and a build against the affected
architectures.

	* include/wchar.h (__wcscpy): New prototype.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c
	(__wcscpy): Route internal symbol to generic implementation.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c (wcscpy):
	Add internal __wcscpy alias.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy.c (wcscpy): Likewise.
	* sysdeps/s390/wcscpy.c (wcscpy): Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.c (wcscpy): Likewise.
	* wcsmbs/wcscpy.c (wcscpy): Add
	* sysdeps/x86_64/multiarch/wcscpy-c.c (WCSCPY): Adjust macro to
	use generic implementation.
	* wcsmbs/wcscat.c (wcscat): Rewrite using wcslen and wcscpy.
2019-02-27 10:00:37 -03:00
Andreas Schwab
65f7767a91 Fix handling of collating elements in fnmatch (bug 17396, bug 16976)
This fixes the same bug in fnmatch that was fixed by commit 7e2f0d2d77 for
regexp matching.  As a side effect it also removes the use of an unbound
VLA.
2019-02-04 15:45:02 +01:00
H.J. Lu
5165de69c0 x86-64 strnlen/wcsnlen: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes strnlen/wcsnlen for x32.  Tested on x86-64 and x32.  On
x86-64, libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/strlen-avx2.S: Use RSI_LP for length.
	Clear the upper 32 bits of RSI register.
	* sysdeps/x86_64/strlen.S: Use RSI_LP for length.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strnlen
	and tst-size_t-wcsnlen.
	* sysdeps/x86_64/x32/tst-size_t-strnlen.c: New file.
	* sysdeps/x86_64/x32/tst-size_t-wcsnlen.c: Likewise.
2019-01-21 11:36:47 -08:00
H.J. Lu
c7c54f65b0 x86-64 strncpy: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes strncpy for x32.  Tested on x86-64 and x32.  On x86-64,
libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/strcpy-avx2.S: Use RDX_LP for length.
	* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strncpy.
	* sysdeps/x86_64/x32/tst-size_t-strncpy.c: New file.
2019-01-21 11:35:34 -08:00
H.J. Lu
ee915088a0 x86-64 strncmp family: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes the strncmp family for x32.  Tested on x86-64 and x32.
On x86-64, libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/strcmp-avx2.S: Use RDX_LP for length.
	* sysdeps/x86_64/multiarch/strcmp-sse42.S: Likewise.
	* sysdeps/x86_64/strcmp.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-strncasecmp,
	tst-size_t-strncmp and tst-size_t-wcsncmp.
	* sysdeps/x86_64/x32/tst-size_t-strncasecmp.c: New file.
	* sysdeps/x86_64/x32/tst-size_t-strncmp.c: Likewise.
	* sysdeps/x86_64/x32/tst-size_t-wcsncmp.c: Likewise.
2019-01-21 11:34:04 -08:00
H.J. Lu
82d0b4a4d7 x86-64 memset/wmemset: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes memset/wmemset for x32.  Tested on x86-64 and x32.  On
x86-64, libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: Use
	RDX_LP for length.  Clear the upper 32 bits of RDX register.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-wmemset.
	* sysdeps/x86_64/x32/tst-size_t-memset.c: New file.
	* sysdeps/x86_64/x32/tst-size_t-wmemset.c: Likewise.
2019-01-21 11:32:37 -08:00
H.J. Lu
ecd8b842cf x86-64 memrchr: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes memrchr for x32.  Tested on x86-64 and x32.  On x86-64,
libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/memrchr.S: Use RDX_LP for length.
	* sysdeps/x86_64/multiarch/memrchr-avx2.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memrchr.
	* sysdeps/x86_64/x32/tst-size_t-memrchr.c: New file.
2019-01-21 11:30:12 -08:00
H.J. Lu
231c56760c x86-64 memcpy: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes memcpy for x32.  Tested on x86-64 and x32.  On x86-64,
libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Use RDX_LP for
	length.  Clear the upper 32 bits of RDX register.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memcpy.
	tst-size_t-wmemchr.
	* sysdeps/x86_64/x32/tst-size_t-memcpy.c: New file.
2019-01-21 11:27:36 -08:00
H.J. Lu
b304fc201d x86-64 memcmp/wmemcmp: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes memcmp/wmemcmp for x32.  Tested on x86-64 and x32.  On
x86-64, libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S: Use RDX_LP for
	length.  Clear the upper 32 bits of RDX register.
	* sysdeps/x86_64/multiarch/memcmp-sse4.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-ssse3.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memcmp and
	tst-size_t-wmemcmp.
	* sysdeps/x86_64/x32/tst-size_t-memcmp.c: New file.
	* sysdeps/x86_64/x32/tst-size_t-wmemcmp.c: Likewise.
2019-01-21 11:26:07 -08:00
H.J. Lu
97700a34f3 x86-64 memchr/wmemchr: Properly handle the length parameter [BZ# 24097]
On x32, the size_t parameter may be passed in the lower 32 bits of a
64-bit register with the non-zero upper 32 bits.  The string/memory
functions written in assembly can only use the lower 32 bits of a
64-bit register as length or must clear the upper 32 bits before using
the full 64-bit register for length.

This pach fixes memchr/wmemchr for x32.  Tested on x86-64 and x32.  On
x86-64, libc.so is the same with and withou the fix.

	[BZ# 24097]
	CVE-2019-6488
	* sysdeps/x86_64/memchr.S: Use RDX_LP for length.  Clear the
	upper 32 bits of RDX register.
	* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
	* sysdeps/x86_64/x32/Makefile (tests): Add tst-size_t-memchr and
	tst-size_t-wmemchr.
	* sysdeps/x86_64/x32/test-size_t.h: New file.
	* sysdeps/x86_64/x32/tst-size_t-memchr.c: Likewise.
	* sysdeps/x86_64/x32/tst-size_t-wmemchr.c: Likewise.
2019-01-21 11:24:13 -08:00
Leonardo Sandoval
1a153e47fc x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
It uses vector comparison as much as possible. In general, the larger the
source string, the greater performance gain observed, reaching speedups of
1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat,
strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
preferred and AVX unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
	stpcpy-avx2 and stpncpy-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c:
	(__libc_ifunc_impl_list): Add tests for __strcat_avx2,
	__strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
	and __stpncpy_avx2.
	* sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
	ifunc-strcpy.h}: rename header for a more generic name.
	* sysdeps/x86_64/multiarch/ifunc-strcpy.h:
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
	AVX unaligned load is fast and vzeroupper is preferred.
	* sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
	* sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise
2019-01-14 09:43:38 -06:00
Joseph Myers
04277e02d7 Update copyright dates with scripts/update-copyrights.
* All files with FSF copyright notices: Update copyright dates
	using scripts/update-copyrights.
	* locale/programs/charmap-kw.h: Regenerated.
	* locale/programs/locfile-kw.h: Likewise.
2019-01-01 00:11:28 +00:00
H.J. Lu
c22e4c2a14 x86: Extend CPUID support in struct cpu_features
Extend CPUID support for all feature bits from CPUID.  Add a new macro,
CPU_FEATURE_USABLE, which can be used to check if a feature is usable at
run-time, instead of HAS_CPU_FEATURE and HAS_ARCH_FEATURE.

Add COMMON_CPUID_INDEX_D_ECX_1, COMMON_CPUID_INDEX_80000007 and
COMMON_CPUID_INDEX_80000008 to check CPU feature bits in them.

Tested on i686 and x86-64 as well as using build-many-glibcs.py with
x86 targets.

	* sysdeps/x86/cacheinfo.c (intel_check_word): Updated for
	cpu_features_basic.
	(__cache_sysconf): Likewise.
	(init_cacheinfo): Likewise.
	* sysdeps/x86/cpu-features.c (get_extended_indeces): Also
	populate COMMON_CPUID_INDEX_80000007 and
	COMMON_CPUID_INDEX_80000008.
	(get_common_indices): Also populate COMMON_CPUID_INDEX_D_ECX_1.
	Use CPU_FEATURES_CPU_P (cpu_features, XSAVEC) to check if
	XSAVEC is available.  Set the bit_arch_XXX_Usable bits.
	(init_cpu_features): Use _Static_assert on
	index_arch_Fast_Unaligned_Load.
	__get_cpuid_registers and __get_arch_feature.  Updated for
	cpu_features_basic.  Set stepping in cpu_features.
	* sysdeps/x86/cpu-features.h: (FEATURE_INDEX_1): Changed to enum.
	(FEATURE_INDEX_2): New.
	(FEATURE_INDEX_MAX): Changed to enum.
	(COMMON_CPUID_INDEX_D_ECX_1): New.
	(COMMON_CPUID_INDEX_80000007): Likewise.
	(COMMON_CPUID_INDEX_80000008): Likewise.
	(cpuid_registers): Likewise.
	(cpu_features_basic): Likewise.
	(CPU_FEATURE_USABLE): Likewise.
	(bit_arch_XXX_Usable): Likewise.
	(cpu_features): Use cpuid_registers and cpu_features_basic.
	(bit_arch_XXX): Reweritten.
	(bit_cpu_XXX): Likewise.
	(index_cpu_XXX): Likewise.
	(reg_XXX): Likewise.
	* sysdeps/x86/tst-get-cpu-features.c: Include <stdio.h> and
	<support/check.h>.
	(CHECK_CPU_FEATURE): New.
	(CHECK_CPU_FEATURE_USABLE): Likewise.
	(cpu_kinds): Likewise.
	(do_test): Print vendor, family, model and stepping.  Check
	HAS_CPU_FEATURE and CPU_FEATURE_USABLE.
	(TEST_FUNCTION): Removed.
	Include <support/test-driver.c> instead of
	"../../test-skeleton.c".
	* sysdeps/x86_64/multiarch/sched_cpucount.c (__sched_cpucount):
	Check POPCNT instead of POPCOUNT.
	* sysdeps/x86_64/multiarch/test-multiarch.c (do_test): Likewise.
2018-12-03 05:54:56 -08:00
Martin Sebor
1626a1cfcd Add support for GCC 9 attribute copy.
GCC 9 has gained an enhancement to help detect attribute mismatches
between alias declarations and their targets.  It consists of a new
warning, -Wattribute-alias, an enhancement to an existing warning,
-Wmissing-attributes, and a new attribute called copy.

The purpose of the warnings is to help identify either possible bugs
(an alias declared with more restrictive attributes than its target
promises) or optimization or diagnostic opportunities (an alias target
missing some attributes that it could be declared with that might
benefit analysis and code generation).  The purpose of the new
attribute is to easily apply (almost) the same set of attributes
to one declaration as those already present on another.

As expected (and intended) the enhancement triggers warnings for
many alias declarations in Glibc code.  This change, tested on
x86_64-linux, avoids all instances of the new warnings by making
use of the attribute where appropriate.  To fully benefit from
the enhancement Glibc will need to be compiled with
 -Wattribute-alias=2 and remaining warnings reviewed and dealt with
(there are a couple of thousand but most should be straightforward
to deal with).

ChangeLog:

	* include/libc-symbols.h (__attribute_copy__): Define macro unless
	it's already defined.
	(_strong_alias): Use __attribute_copy__.
	(_weak_alias,  __hidden_ver1,  __hidden_nolink2): Same.
	* misc/sys/cdefs.h (__attribute_copy__): New macro.
	* sysdeps/x86_64/multiarch/memchr.c (memchr): Use __attribute_copy__.
	* sysdeps/x86_64/multiarch/memcmp.c (memcmp): Same.
	* sysdeps/x86_64/multiarch/mempcpy.c (mempcpy): Same.
	* sysdeps/x86_64/multiarch/memset.c (memset): Same.
	* sysdeps/x86_64/multiarch/stpcpy.c (stpcpy): Same.
	* sysdeps/x86_64/multiarch/strcat.c (strcat): Same.
	* sysdeps/x86_64/multiarch/strchr.c (strchr): Same.
	* sysdeps/x86_64/multiarch/strcmp.c (strcmp): Same.
	* sysdeps/x86_64/multiarch/strcpy.c (strcpy): Same.
	* sysdeps/x86_64/multiarch/strcspn.c (strcspn): Same.
	* sysdeps/x86_64/multiarch/strlen.c (strlen): Same.
	* sysdeps/x86_64/multiarch/strncmp.c (strncmp): Same.
	* sysdeps/x86_64/multiarch/strncpy.c (strncpy): Same.
	* sysdeps/x86_64/multiarch/strnlen.c (strnlen): Same.
	* sysdeps/x86_64/multiarch/strpbrk.c (strpbrk): Same.
	* sysdeps/x86_64/multiarch/strrchr.c (strrchr): Same.
	* sysdeps/x86_64/multiarch/strspn.c (strspn): Same.
2018-11-09 17:24:12 -07:00
H.J. Lu
430388d5dc x86: Don't include <init-arch.h> in assembly codes
There is no need to include <init-arch.h> in assembly codes since all
x86 IFUNC selector functions are written in C.  Tested on i686 and
x86-64.  There is no code change in libc.so, ld.so and libmvec.so.

	* sysdeps/i386/i686/multiarch/bzero-ia32.S: Don't include
	<init-arch.h>.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core-avx2.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
2018-08-03 08:05:00 -07:00
H.J. Lu
e2d40a8822 x86-64: Use _CET_NOTRACK in memcmp-sse4.S
* sysdeps/x86_64/multiarch/memcmp-sse4.S (BRANCH_TO_JMPTBL_ENTRY):
	Add _CET_NOTRACK before indirect jump to jump table.
2018-07-18 08:07:32 -07:00
H.J. Lu
03aaf49b68 x86-64: Use _CET_NOTRACK in memcpy-ssse3.S
* sysdeps/x86_64/multiarch/memcpy-ssse3.S
	(BRANCH_TO_JMPTBL_ENTRY): Add _CET_NOTRACK before indirect jump
	to jump table.
	(MEMCPY): Likewise.
2018-07-18 06:39:46 -07:00
H.J. Lu
811e9e52b2 x86-64: Use _CET_NOTRACK in memcpy-ssse3-back.S
* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S
	(BRANCH_TO_JMPTBL_ENTRY): Add _CET_NOTRACK before indirect jump
	to jump table.
	(MEMCPY): Likewise.
2018-07-18 06:38:23 -07:00
H.J. Lu
8817df4265 x86-64: Use _CET_NOTRACK in strcmp-sse42.S
* sysdeps/x86_64/multiarch/strcmp-sse42.S (STRCMP_SSE42): Add
	_CET_NOTRACK before indirect jump to jump table.
2018-07-18 06:37:09 -07:00
H.J. Lu
921595d151 x86-64: Use _CET_NOTRACK in strcpy-sse2-unaligned.S
* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
	(BRANCH_TO_JMPTBL_ENTRY): Add _CET_NOTRACK before indirect jump
	to jump table.
2018-07-18 06:33:06 -07:00
H.J. Lu
5efc6777ad x86-64: Add _CET_ENDBR to STRCMP_SSE42
Add _CET_ENDBR to STRCMP_SSE42, which is called indirectly, to support
IBT.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

	* sysdeps/x86_64/multiarch/strcmp-sse42.S (STRCMP_SSE42): Add
	_CET_ENDBR.
2018-07-17 16:08:47 -07:00
Florian Weimer
e826574c98 x86: Make strncmp usable from rtld
Due to the way the conditions were written, the rtld build of strncmp
ended up with no definition of the strncmp symbol at all: The
implementations were renamed for use within an IFUNC resolver, but the
IFUNC resolver itself was missing (because rtld does not use IFUNCs).

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2018-06-12 15:00:33 +02:00
Leonardo Sandoval
1457016337 x86-64: Optimize strcmp/wcscmp and strncmp/wcsncmp with AVX2
Optimize x86-64 strcmp/wcscmp and strncmp/wcsncmp with AVX2. It uses vector
comparison as much as possible. Peak performance observed on a SkyLake
machine: 9x, 3x, 2.5x and 5.5x for strcmp, strncmp, wcscmp and wcsncmp,
respectively. The larger the comparison length, the more benefit using
avx2 functions, except on the strcmp, where peak is observed at length
== 32 bytes. Select AVX2 strcmp/wcscmp on AVX2 machines where vzeroupper
is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-avx2, strncmp-avx2, wcscmp-avx2, wcscmp-sse2, wcsncmp-avx2 and
	wcsncmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strcmp_avx2,
	__strncmp_avx2,	__wcscmp_avx2, __wcsncmp_avx2, __wcscmp_sse2
	and __wcsncmp_sse2.
	* sysdeps/x86_64/multiarch/strcmp.c (OPTIMIZE (avx2)):
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
	AVX unaligned load is fast and vzeroupper is preferred.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strncmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp-sse2.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp.c: Likewise.
	* sysdeps/x86_64/wcscmp.S (__wcscmp): Add alias only if __wcscmp
	is undefined.
2018-06-01 16:32:43 -05:00
H.J. Lu
727b38df05 x86-64: Skip zero length in __mem[pcpy|move|set]_erms
This patch skips zero length in __mempcpy_erms, __memmove_erms and
__memset_erms.

Tested on x86-64.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__mempcpy_erms): Skip zero length.
	(__memmove_erms): Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__memset_erms): Likewise.
2018-05-23 11:25:42 -07:00
Andreas Schwab
9aaaab7c6e Don't write beyond destination in __mempcpy_avx512_no_vzeroupper (bug 23196)
When compiled as mempcpy, the return value is the end of the destination
buffer, thus it cannot be used to refer to the start of it.
2018-05-23 09:50:57 +02:00
H.J. Lu
e28e9b1ec4 x86-64: Check Prefer_FSRM in ifunc-memmove.h
Although the REP MOVSB implementations of memmove, memcpy and mempcpy
aren't used by the current processors, this patch adds Prefer_FSRM
check in ifunc-memmove.h so that they can be used in the future.

	* sysdeps/x86/cpu-features.h (bit_arch_Prefer_FSRM): New.
	(index_arch_Prefer_FSRM): Likewise.
	* sysdeps/x86/cpu-tunables.c (TUNABLE_CALLBACK (set_hwcaps)):
	Also check Prefer_FSRM.
	* sysdeps/x86_64/multiarch/ifunc-memmove.h (IFUNC_SELECTOR):
	Also return OPTIMIZE (erms) for Prefer_FSRM.
2018-05-21 16:54:59 -07:00
Leonardo Sandoval
e4ebc1380d x86-64: remove duplicate line on PREFETCH_ONE_SET macro
Tested on 64-bit AVX machine

       * sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
       (PREFETCH_ONE_SET): Remove duplicate line
2018-05-17 09:09:13 -05:00
H.J. Lu
04958880e0 x86-64: Use IFUNC strncat inside libc.so
Unlike i386, we can call hidden IFUNC functions inside libc.so since
x86-64 PLT is always PIC.

Tested on x86-64.

	* sysdeps/x86_64/multiarch/strncat-c.c (STRNCAT_PRIMARY): Removed.
	Include <string/strncat.c>.
	* sysdeps/x86_64/multiarch/strncat.c (__strncat): New strong
	alias.
	(__GI___strncat): New hidden alias.
2018-05-16 09:04:35 -07:00
H.J. Lu
e322ec3282 x86-64: Remove the unnecessary testl in strlen-avx2.S
Since the result of testl is never used, this patch removes it.

Tested on 64-bit AVX2 machine.

	* sysdeps/x86_64/multiarch/strlen-avx2.S (STRLEN): Remove the
	unnecessary testl.
2018-05-14 03:41:35 -07:00
H.J. Lu
50d7d351b5 x86-64/memset: Mark the debugger symbol as hidden
When MEMSET_SYMBOL (__memset, erms) is provided for debugger, mark it
as hidden so that it will be local to the library.

	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(MEMSET_SYMBOL (__memset, erms)): Mark the debugger symbol as
	hidden.
2018-05-07 11:01:48 -07:00
Joseph Myers
688903eb3e Update copyright dates with scripts/update-copyrights.
* All files with FSF copyright notices: Update copyright dates
	using scripts/update-copyrights.
	* locale/programs/charmap-kw.h: Regenerated.
	* locale/programs/locfile-kw.h: Likewise.
2018-01-01 00:32:25 +00:00
Adhemerval Zanella
65687ac76c Remove remaining _HAVE_STRING_ARCH_* definitions (BZ #18858)
Since the removal of bits/string.h, _HAVE_STRING_ARCH_* are no
longer used.  This patch removes the unused macros from i686
and x86_64 sysdeps folder.

Checked on x86_64-linux-gnu and i686-linux-gnu.

	* sysdeps/i386/i686/multiarch/strncpy.c (_HAVE_STRING_ARCH_strncpy):
	Remove define.
	* sysdeps/x86_64/multiarch/stpcpy.c (_HAVE_STRING_ARCH_stpcpy):
	Likewise.
	* sysdeps/x86_64/multiarch/strcspn.c (_HAVE_STRING_ARCH_strcspn):
	Likewise.
	* sysdeps/x86_64/multiarch/strncat.c (_HAVE_STRING_ARCH_strncat):
	Likewise.
	* sysdeps/x86_64/multiarch/strncpy.c (_HAVE_STRING_ARCH_strncpy):
	Likewise.
	* sysdeps/x86_64/multiarch/strpbrk.c (_HAVE_STRING_ARCH_strpbrk):
	Likewise.
	* sysdeps/x86_64/multiarch/strspn.c (_HAVE_STRING_ARCH_strspn):
	Likewise.
2017-09-06 14:35:23 -03:00
H.J. Lu
7550717ed7 Mark internal SSE2 functions with attribute_hidden [BZ #18822]
Mark internal SSE2 functions with attribute_hidden to allow direct
access within libc.so and libc.a without using GOT nor PLT.

	[BZ #18822]
	* sysdeps/x86_64/multiarch/strcspn-c.c (STRCSPN_SSE2): Add
	attribute_hidden.
	(__strspn_sse2): Likewise.
2017-08-19 16:46:53 -07:00
H.J. Lu
fc11ff8d0a x86-64: Use IFUNC memcpy and mempcpy in libc.a
Since apply_irel is called before memcpy and mempcpy are called, we
can use IFUNC memcpy and mempcpy in libc.a.

	* sysdeps/x86_64/memmove.S (MEMCPY_SYMBOL): Don't check SHARED.
	(MEMPCPY_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test memcpy and mempcpy in libc.a.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Also include
	in libc.a.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memcpy.c: Also include in libc.a.
	(__hidden_ver1): Don't use in libc.a.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(__mempcpy): Don't create a weak alias in libc.a.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Support
	libc.a.
	* sysdeps/x86_64/multiarch/mempcpy.c: Also include in libc.a.
	(__hidden_ver1): Don't use in libc.a.
2017-08-04 12:27:18 -07:00
H.J. Lu
7a499756ab x86-64: Test memmove_chk and memset_chk only in libc.so [BZ #21741]
Since there are no multiarch versions of memmove_chk and memset_chk,
test multiarch versions of memmove_chk and memset_chk only in libc.so.

	[BZ #21741]
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test memmove_chk and memset_chk only
	in libc.so.
2017-07-10 04:44:38 -07:00
H.J. Lu
58d021c836 x86-64: Update comments in IFUNC selectors
* sysdeps/x86_64/multiarch/memcmp.c: Update comments.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
2017-07-09 11:43:20 -07:00
H.J. Lu
4df54c89bb x86-64: Update comments in ifunc-impl-list.c
All x86-64 IFUNC selectors are written in C now.  Update comments to
reflect it.

	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Update comments.
2017-07-09 11:38:37 -07:00
H.J. Lu
e94c310357 x86-64: Optimize memcmp-avx2-movbe.S for short difference
Check the first 32 bytes before checking size when size >= 32 bytes
to avoid unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.

On Haswell, the new version is as fast as the previous one.  On Skylake,
the new version is a little bit faster.

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
	the first 32 bytes before checking size when size >= 32 bytes.
	Replace vpmovmskb/subl/jnz with vptest/jnc.
2017-06-27 07:55:00 -07:00
H.J. Lu
049816c3be x86-64: Optimize L(between_2_3) in memcmp-avx2-movbe.S
Turn

	movzbl	-1(%rdi, %rdx), %edi
	movzbl	-1(%rsi, %rdx), %esi
	orl	%edi, %eax
	orl	%esi, %ecx

into

	movb	-1(%rdi, %rdx), %al
	movb	-1(%rsi, %rdx), %cl

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (between_2_3):
	Replace movzbl and orl with movb.
2017-06-23 12:46:12 -07:00
Florian Weimer
bc0382ae90 x86-64: Fix comment typo in memcmp-avx2-movbe.S 2017-06-23 19:00:58 +02:00
Florian Weimer
3ec7c02cc3 x86-64: memcmp-avx2-movbe.S needs saturating subtraction [BZ #21662]
This code:

L(between_2_3):
	/* Load as big endian with overlapping loads and bswap to avoid
	   branches.  */
	movzwl	-2(%rdi, %rdx), %eax
	movzwl	-2(%rsi, %rdx), %ecx
	shll	$16, %eax
	shll	$16, %ecx
	movzwl	(%rdi), %edi
	movzwl	(%rsi), %esi
	orl	%edi, %eax
	orl	%esi, %ecx
	bswap	%eax
	bswap	%ecx
	subl	%ecx, %eax
	ret

needs a saturating subtract because the full register is used.
With this commit, only the lower 24 bits of the register are used,
so a regular subtraction suffices.

The test case change adds coverage for these kinds of bugs.
2017-06-23 17:24:40 +02:00
H.J. Lu
11ffcacb64 x86-64: Implement strcmp family IFUNC selectors in C
Implement strcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
	strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
	strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
	* sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
	* sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
	(STRCMP_SSE42): New.  Defined to __strcmp_sse42 if not defined.
	[USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
	"locale-defines.h".
	(UPDATE_STRNCMP_COUNTER): New.
	(SECTION): Likewise.
	(GLABEL): Likewise.
	(LABEL): Likewise.
	* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
	for libc.a.
2017-06-21 12:11:06 -07:00
Zack Weinberg
c0b23001a8 Fix fallout from bits/string.h removal.
Remove one more string inline that was defined directly in string.h;
in the absence of the rest of the inlines, it broke the build.

Like other ifunc shims for these functions,
x86_64/multiarch/{mem,st}pcpy.c need to define __NO_STRING_INLINES and
NO_MEMPCPY_STPCPY_REDIRECT.

	* string/string.h (__mempcpy_inline): Delete.
	* sysdeps/x86_64/multiarch/mempcpy.c
	* sysdeps/x86_64/multiarch/stpcpy.c:
	Define NO_MEMPCPY_STPCPY_REDIRECT and __NO_STRING_INLINES
	before including string.h.
2017-06-20 09:39:08 -04:00
Zack Weinberg
09a596cc2c Remove bits/string.h.
These machine-dependent inline string functions have never been on by
default, and even if they were a good idea at the time they were
introduced, they haven't really been touched in ten to fifteen years
and probably aren't a good idea on current-gen processors.  Current
thinking is that this class of optimization is best left to the
compiler.

	* bits/string.h, string/bits/string.h
	* sysdeps/aarch64/bits/string.h
	* sysdeps/m68k/m680x0/m68020/bits/string.h
	* sysdeps/s390/bits/string.h, sysdeps/sparc/bits/string.h
	* sysdeps/x86/bits/string.h: Delete file.

	* string/string.h: Don't include bits/string.h.
	* string/bits/string3.h: Rename to bits/string_fortified.h.
	No need to undef various symbols that the removed headers
	might have defined as macros.
	* string/Makefile (headers): Remove bits/string.h, change
	bits/string3.h to bits/string_fortified.h.
	* string/string-inlines.c: Update commentary.  Remove definitions
	of various macros that nothing looks at anymore.  Don't directly
	include bits/string.h. Set _STRING_INLINE_unaligned here, based on
	compiler-predefined macros.
	* string/strncat.c: If STRNCAT is not defined, or STRNCAT_PRIMARY
	_is_ defined, provide internal hidden alias __strncat.
	* include/string.h: Declare internal hidden alias __strncat.
	Only forward __stpcpy to __builtin_stpcpy if __NO_STRING_INLINES is
	not defined.
	* include/bits/string3.h: Rename to bits/string_fortified.h,
	update to match above.

	* sysdeps/i386/string-inlines.c: Define compat symbols for
	everything formerly defined by sysdeps/x86/bits/string.h.
	Make existing definitions into compat symbols as well.
	Remove some no-longer-necessary messing around with macros.

	* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c
	* sysdeps/s390/multiarch/mempcpy.c
	No need to define _HAVE_STRING_ARCH_mempcpy.
	Do define __NO_STRING_INLINES and NO_MEMPCPY_STPCPY_REDIRECT.

	* sysdeps/i386/i686/multiarch/strncat-c.c
	* sysdeps/s390/multiarch/strncat-c.c
	* sysdeps/x86_64/multiarch/strncat-c.c
	Define STRNCAT_PRIMARY.  Don't change definition of libc_hidden_def.
2017-06-20 08:21:24 -04:00
Siddhesh Poyarekar
629ebc873a Fix typo when undefining weak_alias
The macro directive #undef was miswritten as #undefine.

	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Fix typo.
2017-06-19 14:56:40 +05:30
H.J. Lu
70fe2eb794 x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in C
Implement strcspn/strpbrk/strspn IFUNC selectors in C

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcspn-sse2, strpbrk-sse2 and strspn-sse2.
	* sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed.
	Check USE_AS_STRPBRK instead of STRPBRK_P.
	* sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
	* sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
	* sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
	* sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Removed.
	* sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
	and "#endif".
2017-06-15 08:59:05 -07:00
H.J. Lu
9f4254b8bd x86-64: Implement wcscpy IFUNC selector in C
* sysdeps/x86_64/multiarch/wcscpy.S: Removed.
	* sysdeps/x86_64/multiarch/wcscpy.c: New file.
2017-06-15 08:57:52 -07:00
H.J. Lu
9ed0aa15d3 x86-64: Implement strcat family IFUNC selectors in C
Implement strcat family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcat family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-sse2.
	* sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
	* sysdeps/x86_64/multiarch/strcat.c: Likewise.
	* sysdeps/x86_64/multiarch/strncat.c: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Removed.
	* sysdeps/x86_64/multiarch/strncat.S: Likewise.
2017-06-15 08:56:59 -07:00
H.J. Lu
b91a52d0d7 x86-64: Implement memcmp family IFUNC selectors in C
Implement memcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
	* sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Removed.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2017-06-15 08:49:57 -07:00
H.J. Lu
93e46f8773 x86-64: Implement memset family IFUNC selectors in C
Implement memset family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memset functions within libc.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Erich Elsen  <eriche@google.com>

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, and memset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add test for __memset_chk_erms.
	Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Removed.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__memset_chk_erms): New function.
2017-06-15 08:33:35 -07:00
H.J. Lu
5c3e322d3b x86-64: Implement memmove family IFUNC selectors in C
Implement memmove family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for memmove family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memcpy_chk-nonshared,
	mempcpy_chk-nonshared and memmove_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memmove_chk_erms,
	__memcpy_chk_erms and __mempcpy_chk_erms.  Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
	* sysdeps/x86_64/multiarch/memcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Removed.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__mempcpy_chk_erms): New function.
	(__memmove_chk_erms): Likewise.
	(__memcpy_chk_erms): New alias.
2017-06-14 12:11:10 -07:00
H.J. Lu
5a103908c0 x86-64: Implement strcpy family IFUNC selectors in C
Implement strcpy family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Use IFUNC internally reduces the icache footprint since
libc.so and other codes in the process use the same implementations.
This patch uses IFUNC for strcpy family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcpy-sse2 and stpcpy-sse2.
	* sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
	* sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.S: Removed.
	* sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
	(libc_hidden_def): Always defined as empty.
	* sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def):
	Always Defined as empty.
2017-06-12 09:06:09 -07:00
H.J. Lu
6b6710e55b x86-64: Correct comments in ifunc-impl-list.c
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Correct comments.
2017-06-09 05:53:45 -07:00
H.J. Lu
d2538b9156 x86-64: Optimize strrchr/wcsrchr with AVX2
Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector
instructions.  It is as fast as SSE2 version for small data sizes
and up to 1X faster for large data sizes on Haswell.  Select AVX2
version on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strrchr_avx2,
	__strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
	* sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
2017-06-09 05:45:52 -07:00
H.J. Lu
5ac7aa1d7c x86-64: Optimize memrchr with AVX2
Optimize memrchr with AVX2 to search 32 bytes with a single vector
compare instruction.  It is as fast as SSE2 memrchr for small data
sizes and up to 1X faster for large data sizes on Haswell.  Select
AVX2 memrchr on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memrchr-sse2 and memrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memrchr_avx2 and
	__memrchr_sse2.
	* sysdeps/x86_64/multiarch/memrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memrchr.c: Likewise.
2017-06-09 05:44:41 -07:00
H.J. Lu
8fe57365bf x86-64: Optimize strchr/strchrnul/wcschr with AVX2
Optimize strchr/strchrnul/wcschr with AVX2 to search 32 bytes with vector
instructions.  It is as fast as SSE2 versions for size <= 16 bytes and up
to 1X faster for or size > 16 bytes on Haswell.  Select AVX2 version on
AVX2 machines where vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strchr-sse2, strchrnul-sse2, strchr-avx2, strchrnul-avx2,
	wcschr-sse2 and wcschr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strchr_avx2,
	__strchrnul_avx2, __strchrnul_sse2, __wcschr_avx2 and
	__wcschr_sse2.
	* sysdeps/x86_64/multiarch/strchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strchrnul.c: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcschr.c: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Removed.
2017-06-09 05:42:29 -07:00
H.J. Lu
dc485ceb2a x86-64: Optimize strlen/strnlen/wcslen/wcsnlen with AVX2
Optimize strlen/strnlen/wcslen/wcsnlen with AVX2 to check 32 bytes with
a single vector compare instruction.  It is as fast as SSE2 versions for
size <= 16 bytes and up to 1X faster for or size > 16 bytes on Haswell.
Select AVX2 version on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strlen-sse2, strnlen-sse2, strlen-avx2, strnlen-avx2,
	wcslen-sse2, wcslen-avx2 and wcsnlen-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strlen_avx2,
	__strlen_sse2, __strnlen_avx2, __strnlen_sse2, __wcslen_avx2,
	__wcslen_sse2 and __wcsnlen_avx2.
	* sysdeps/x86_64/multiarch/strlen-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strlen.c: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strnlen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcslen.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c (OPTIMIZE (avx2)): New.
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX2 machines where
	vzeroupper is preferred and AVX unaligned load is fast.
2017-06-09 05:18:18 -07:00
H.J. Lu
2f5d20ac99 x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2
SSE2 memchr is extended to support wmemchr.  AVX2 memchr/rawmemchr/wmemchr
are added to search 32 bytes with a single vector compare instruction.
AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr
for small sizes and up to 1.5X faster for larger sizes on Haswell and
Skylake.  Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/memchr.S (MEMCHR): New.  Depending on if
	USE_AS_WMEMCHR is defined.
	(PCMPEQ): Likewise.
	(memchr): Renamed to ...
	(MEMCHR): This.  Support wmemchr if USE_AS_WMEMCHR is defined.
	Replace pcmpeqb with PCMPEQ.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2,
	wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c.
	* sysdeps/x86_64/multiarch/ifunc-avx2.h: New file.
	* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2,
	__rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and
	__wmemchr_sse2.
2017-06-09 05:13:31 -07:00
H.J. Lu
5e1122827a x86-64: Rename wmemset.h to ifunc-wmemset.h
No code changes.

	* sysdeps/x86_64/multiarch/wmemset.c: Include ifunc-wmemset.h
	instead of wmemset.h.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemset.h: Renamed to ...
	* sysdeps/x86_64/multiarch/ifunc-wmemset.h: This.
2017-06-07 14:48:34 -07:00
H.J. Lu
2e87c7d158 x86-64: Fold ifunc-sse4_1.h into wcsnlen.c
Since ifunc-sse4_1.h is included only by wcsnlen.c, we can fold it
into wcsnlen.c.  No code changes in wcsnlen.o.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>

	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: Removed and folded
	into ...
	* sysdeps/x86_64/multiarch/wcsnlen.c: Here.  Don't include
	ifunc-sse4_1.h.
2017-06-07 09:04:40 -07:00
H.J. Lu
d4cc385c6e x86-64: Move wcsnlen.S to multiarch/wcsnlen-sse4_1.S
Since wcsnlen.S uses pminud which is the part of SSE4.1, move wcsnlen.S
to multiarch/wcsnlen-sse4_1.S.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wcsnlen-sse4_1 and wcsnlen-c.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __wcsnlen_sse4_1 and
	__wcsnlen_sse2.
	* sysdeps/x86_64/multiarch/ifunc-sse4_1.h: New file.
	* sysdeps/x86_64/multiarch/wcsnlen-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsnlen.c: Likewise.
	* sysdeps/x86_64/wcsnlen.S: Removed.
2017-06-06 06:12:32 -07:00
H.J. Lu
935971ba6b x86-64: Optimize memcmp/wmemcmp with AVX2 and MOVBE
Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

Key features:

1. For size from 2 to 7 bytes, load as big endian with movbe and bswap
   to avoid branches.
2. Use overlapping compare to avoid branch.
3. Use vector compare when size >= 4 bytes for memcmp or size >= 8
   bytes for wmemcmp.
4. If size is 8 * VEC_SIZE or less, unroll the loop.
5. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
8. Use 8 vector compares when size is 8 * VEC_SIZE or less.

	* sysdeps/x86/cpu-features.h (index_cpu_MOVBE): New.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-avx2 and wmemcmp-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memcmp_avx2 and __wmemcmp_avx2.
	* sysdeps/x86_64/multiarch/memcmp-avx2.S: New file.
	* sysdeps/x86_64/multiarch/wmemcmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Use __memcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Use __wmemcmp_avx2 on AVX
	2 machines if AVX unaligned load is fast and vzeroupper is
	preferred.
2017-06-05 12:52:55 -07:00
H.J. Lu
ef9c4cb6c7 x86-64: Optimize wmemset with SSE2/AVX2/AVX512
The difference between memset and wmemset is byte vs int.  Add stubs
to SSE2/AVX2/AVX512 memset for wmemset with updated constant and size:

SSE2 wmemset:
	shl    $0x2,%rdx
	movd   %esi,%xmm0
	mov    %rdi,%rax
	pshufd $0x0,%xmm0,%xmm0
	jmp	entry_from_wmemset

SSE2 memset:
	movd   %esi,%xmm0
	mov    %rdi,%rax
	punpcklbw %xmm0,%xmm0
	punpcklwd %xmm0,%xmm0
	pshufd $0x0,%xmm0,%xmm0
entry_from_wmemset:

Since the ERMS versions of wmemset requires "rep stosl" instead of
"rep stosb", only the vector store stubs of SSE2/AVX2/AVX512 wmemset
are added.  The SSE2 wmemset is about 3X faster and the AVX2 wmemset
is about 6X faster on Haswell.

	* include/wchar.h (__wmemset_chk): New.
	* sysdeps/x86_64/memset.S (VDUP_TO_VEC0_AND_SET_RETURN): Renamed
	to MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_CHK_SYMBOL): Likewise.
	(WMEMSET_SYMBOL): Likewise.
	(__wmemset): Add hidden definition.
	(wmemset): Add weak hidden definition.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	wmemset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add __wmemset_sse2_unaligned,
	__wmemset_avx2_unaligned, __wmemset_avx512_unaligned,
	__wmemset_chk_sse2_unaligned, __wmemset_chk_avx2_unaligned
	and __wmemset_chk_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
	(VDUP_TO_VEC0_AND_SET_RETURN): Renamed to ...
	(MEMSET_VDUP_TO_VEC0_AND_SET_RETURN): This.
	(WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN): New.
	(WMEMSET_SYMBOL): Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Updated.
	(WMEMSET_CHK_SYMBOL): New.
	(WMEMSET_CHK_SYMBOL (__wmemset_chk, unaligned)): Likewise.
	(WMEMSET_SYMBOL (__wmemset, unaligned)): Likewise.
	* sysdeps/x86_64/multiarch/memset.S (WMEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Also define __GI_wmemset and
	__GI___wmemset.
	(weak_alias): New.
	* sysdeps/x86_64/multiarch/wmemset.c: New file.
	* sysdeps/x86_64/multiarch/wmemset.h: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemset_chk.c: Likewise.
	* sysdeps/x86_64/wmemset.c: Likewise.
	* sysdeps/x86_64/wmemset_chk.c: Likewise.
2017-06-05 11:09:59 -07:00
H.J. Lu
a7fbedff76 Correct comments in x86_64/multiarch/memcmp.S
* sysdeps/x86_64/multiarch/memcmp.S (__GI_memcmp): Correct
	comments.
2017-05-18 14:02:02 -07:00
Zack Weinberg
7c3018f9e4 Suppress internal declarations for most of the testsuite.
This patch adds a new build module called 'testsuite'.
IS_IN (testsuite) implies _ISOMAC, as do IS_IN_build and __cplusplus
(which means several ad-hoc tests for __cplusplus can go away).
libc-symbols.h now suppresses almost all of *itself* when _ISOMAC is
defined; in particular, _ISOMAC mode does not get config.h
automatically anymore.

There are still quite a few tests that need to see internal gunk of
one variety or another.  For them, we now have 'tests-internal' and
'test-internal-extras'; files in this category will still be compiled
with MODULE_NAME=nonlib, and everything proceeds as it always has.
The bulk of this patch is moving tests from 'tests' to
'tests-internal'.  There is also 'tests-static-internal', which has
the same effect on files in 'tests-static', and 'modules-names-tests',
which has the *inverse* effect on files in 'modules-names' (it's
inverted because most of the things in modules-names are *not* tests).
For both of these, the file must appear in *both* the new variable and
the old one.

There is also now a special case for when libc-symbols.h is included
without MODULE_NAME being defined at all.  (This happens during the
creation of libc-modules.h, and also when preprocessing Versions
files.)  When this happens, IS_IN is set to be always false and
_ISOMAC is *not* defined, which was the status quo, but now it's
explicit.

The remaining changes to C source files in this patch seemed likely to
cause problems in the absence of the main change.  They should be
relatively self-explanatory.  In a few cases I duplicated a definition
from an internal header rather than move the test to tests-internal;
this was a judgement call each time and I'm happy to change those
however reviewers feel is more appropriate.

	* Makerules: New subdir configuration variables 'tests-internal'
	and 'test-internal-extras'.  Test files in these categories will
	still be compiled with MODULE_NAME=nonlib.  Test files in the
	existing categories (tests, xtests, test-srcs, test-extras) are
	now compiled with MODULE_NAME=testsuite.
	New subdir configuration variable 'modules-names-tests'.  Files
	which are in both 'modules-names' and 'modules-names-tests' will
	be compiled with MODULE_NAME=testsuite instead of
	MODULE_NAME=extramodules.
	(gen-as-const-headers): Move to tests-internal.
	(do-tests-clean, common-mostlyclean): Support tests-internal.
	* Makeconfig (built-modules): Add testsuite.
	* Makefile: Change libof-check-installed-headers-c and
	libof-check-installed-headers-cxx to 'testsuite'.
	* Rules: Likewise.  Support tests-internal.
	* benchtests/strcoll-inputs/filelist#en_US.UTF-8:
	Remove extra-modules.mk.

	* config.h.in: Don't check for __OPTIMIZE__ or __FAST_MATH__ here.
	* include/libc-symbols.h: Move definitions of _GNU_SOURCE,
	PASTE_NAME, PASTE_NAME1, IN_MODULE, IS_IN, and IS_IN_LIB to the
	very top of the file and rationalize their order.
	If MODULE_NAME is not defined at all, define IS_IN to always be
	false, and don't define _ISOMAC.
	If any of IS_IN (testsuite), IS_IN_build, or __cplusplus are
	true, define _ISOMAC and suppress everything else in this file,
	starting with the inclusion of config.h.
	Do check for inappropriate definitions of __OPTIMIZE__ and
	__FAST_MATH__ here, but only if _ISOMAC is not defined.
        Correct some out-of-date commentary.

	* include/math.h: If _ISOMAC is defined, undefine NO_LONG_DOUBLE
	and _Mlong_double_ before including math.h.
	* include/string.h: If _ISOMAC is defined, don't expose
	_STRING_ARCH_unaligned. Move a comment to a more appropriate
	location.

	* include/errno.h, include/stdio.h, include/stdlib.h, include/string.h
	* include/time.h, include/unistd.h, include/wchar.h: No need to
	check __cplusplus nor use __BEGIN_DECLS/__END_DECLS.

	* misc/sys/cdefs.h (__NTHNL): New macro.
	* sysdeps/m68k/m680x0/fpu/bits/mathinline.h
	(__m81_defun): Use __NTHNL to avoid errors with GCC 6.

	* elf/tst-env-setuid-tunables.c: Include config.h with _LIBC
	defined, for HAVE_TUNABLES.
	* inet/tst-checks-posix.c: No need to define _ISOMAC.
	* intl/tst-gettext2.c: Provide own definition of N_.
	* math/test-signgam-finite-c99.c: No need to define _ISOMAC.
	* math/test-signgam-main.c: No need to define _ISOMAC.
	* stdlib/tst-strtod.c: Convert to test-driver. Split locale_test to...
	* stdlib/tst-strtod1i.c: ...this new file.
	* stdlib/tst-strtod5.c: Convert to test-driver and add copyright notice.
        Split tests of __strtod_internal to...
	* stdlib/tst-strtod5i.c: ...this new file.
	* string/test-string.h: Include stdint.h. Duplicate definition of
	inhibit_loop_to_libcall here (from libc-symbols.h).
	* string/test-strstr.c: Provide dummy definition of
	libc_hidden_builtin_def when including strstr.c.
	* sysdeps/ia64/fpu/libm-symbols.h: Suppress entire file in _ISOMAC
	mode; no need to test __STRICT_ANSI__ nor __cplusplus as well.
	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.

	* elf/Makefile: Move tst-ptrguard1-static, tst-stackguard1-static,
	tst-tls1-static, tst-tls2-static, tst-tls3-static, loadtest,
	unload, unload2, circleload1, neededtest, neededtest2,
	neededtest3, neededtest4, tst-tls1, tst-tls2, tst-tls3,
	tst-tls6, tst-tls7, tst-tls8, tst-dlmopen2, tst-ptrguard1,
	tst-stackguard1, tst-_dl_addr_inside_object, and all of the
	ifunc tests to tests-internal.
	Don't add $(modules-names) to test-extras.
	* inet/Makefile: Move tst-inet6_scopeid_pton to tests-internal.
	Add tst-deadline to tests-static-internal.
	* malloc/Makefile: Move tst-mallocstate and tst-scratch_buffer to
	tests-internal.
	* misc/Makefile: Move tst-atomic and tst-atomic-long to tests-internal.
	* nptl/Makefile: Move tst-typesizes, tst-rwlock19, tst-sem11,
	tst-sem12, tst-sem13, tst-barrier5, tst-signal7, tst-tls3,
	tst-tls3-malloc, tst-tls5, tst-stackguard1, tst-sem11-static,
	tst-sem12-static, and tst-stackguard1-static to tests-internal.
        Link tests-internal with libpthread also.
	Don't add $(modules-names) to test-extras.
	* nss/Makefile: Move tst-field to tests-internal.
	* posix/Makefile: Move bug-regex5, bug-regex20, bug-regex33,
	tst-rfc3484, tst-rfc3484-2, and tst-rfc3484-3 to tests-internal.
	* stdlib/Makefile: Move tst-strtod1i, tst-strtod3, tst-strtod4,
	tst-strtod5i, tst-tls-atexit, and tst-tls-atexit-nodelete to
	tests-internal.
        * sunrpc/Makefile: Move tst-svc_register to tests-internal.
	* sysdeps/powerpc/Makefile: Move test-get_hwcap and
	test-get_hwcap-static to tests-internal.
	* sysdeps/unix/sysv/linux/Makefile: Move tst-setgetname to
	tests-internal.
	* sysdeps/x86_64/fpu/Makefile: Add all libmvec test modules to
	modules-names-tests.
2017-05-11 19:27:59 -04:00
H.J. Lu
4cb334c4d6 x86: Use AVX2 memcpy/memset on Skylake server [BZ #21396]
On Skylake server, AVX512 load/store instructions in memcpy/memset may
lead to lower CPU turbo frequency in certain situations.  Use of AVX2
in memcpy/memset has been observed to have improved overall performance
in many workloads due to the higher frequency.

Since AVX512ER is unique to Xeon Phi, this patch sets Prefer_No_AVX512
if AVX512ER isn't available so that AVX2 versions of memcpy/memset are
used on Skylake server.

	[BZ #21396]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Prefer_No_AVX512 if AVX512ER isn't available.
	* sysdeps/x86/cpu-features.h (bit_arch_Prefer_No_AVX512): New.
	(index_arch_Prefer_No_AVX512): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Don't use
	AVX512 version if Prefer_No_AVX512 is set.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S (__memmove_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk):
	Likewise.
	* sysdeps/x86_64/multiarch/memset.S (memset): Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk):
	Likewise.
2017-04-18 14:01:45 -07:00
Joseph Myers
f7a51347a4 Revert header inclusion changes that break math/ testing on x86_64.
Revert:
	2017-02-16  Zack Weinberg  <zackw@panix.com>

	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.
2017-02-17 17:08:17 +00:00
Zack Weinberg
ceaa98897c Add missing header files throughout the testsuite.
* crypt/md5.h: Test _LIBC with #if defined, not #if.
	* dirent/opendir-tst1.c: Include sys/stat.h.
	* dirent/tst-fdopendir.c: Include sys/stat.h.
	* dirent/tst-fdopendir2.c: Include stdlib.h.
	* dirent/tst-scandir.c: Include stdbool.h.
	* elf/tst-auditmod1.c: Include link.h and stddef.h.
	* elf/tst-tls15.c: Include stdlib.h.
	* elf/tst-tls16.c: Include stdlib.h.
	* elf/tst-tls17.c: Include stdlib.h.
	* elf/tst-tls18.c: Include stdlib.h.
	* iconv/tst-iconv6.c: Include endian.h.
	* iconvdata/bug-iconv11.c: Include limits.h.
	* io/test-utime.c: Include stdint.h.
	* io/tst-faccessat.c: Include sys/stat.h.
	* io/tst-fchmodat.c: Include sys/stat.h.
	* io/tst-fchownat.c: Include sys/stat.h.
	* io/tst-fstatat.c: Include sys/stat.h.
	* io/tst-futimesat.c: Include sys/stat.h.
	* io/tst-linkat.c: Include sys/stat.h.
	* io/tst-mkdirat.c: Include sys/stat.h and stdbool.h.
	* io/tst-mkfifoat.c: Include sys/stat.h and stdbool.h.
	* io/tst-mknodat.c: Include sys/stat.h and stdbool.h.
	* io/tst-openat.c: Include stdbool.h.
	* io/tst-readlinkat.c: Include sys/stat.h.
	* io/tst-renameat.c: Include sys/stat.h.
	* io/tst-symlinkat.c: Include sys/stat.h.
	* io/tst-unlinkat.c: Include stdbool.h.
	* libio/bug-memstream1.c: Include stdlib.h.
	* libio/bug-wmemstream1.c: Include stdlib.h.
	* libio/tst-fwrite-error.c: Include stdlib.h.
	* libio/tst-memstream1.c: Include stdlib.h.
	* libio/tst-memstream2.c: Include stdlib.h.
	* libio/tst-memstream3.c: Include stdlib.h.
	* malloc/tst-interpose-aux.c: Include stdint.h.
	* misc/tst-preadvwritev-common.c: Include sys/stat.h.
	* nptl/tst-basic7.c: Include limits.h.
	* nptl/tst-cancel25.c: Include pthread.h, not pthreadP.h.
	* nptl/tst-cancel4.c: Include stddef.h, limits.h, and sys/stat.h.
	* nptl/tst-cancel4_1.c: Include stddef.h.
	* nptl/tst-cancel4_2.c: Include stddef.h.
	* nptl/tst-cond16.c: Include limits.h.
	Use sysconf(_SC_PAGESIZE) instead of __getpagesize.
	* nptl/tst-cond18.c: Include limits.h.
	Use sysconf(_SC_PAGESIZE) instead of __getpagesize.
	* nptl/tst-cond4.c: Include stdint.h.
	* nptl/tst-cond6.c: Include stdint.h.
	* nptl/tst-stack2.c: Include limits.h.
	* nptl/tst-stackguard1.c: Include stddef.h.
	* nptl/tst-tls4.c: Include stdint.h. Don't include tls.h.
	* nptl/tst-tls4moda.c: Include stddef.h.
	Don't include stdio.h, unistd.h, or tls.h.
	* nptl/tst-tls4modb.c: Include stddef.h.
	Don't include stdio.h, unistd.h, or tls.h.
	* nptl/tst-tls5.h: Include stddef.h. Don't include stdlib.h or tls.h.
	* posix/tst-getaddrinfo2.c: Include stdio.h.
	* posix/tst-getaddrinfo5.c: Include stdio.h.
	* posix/tst-pathconf.c: Include sys/stat.h.
	* posix/tst-posix_fadvise-common.c: Include stdint.h.
	* posix/tst-preadwrite-common.c: Include sys/stat.h.
	* posix/tst-regex.c: Include stdint.h.
	Don't include spawn.h or spawn_int.h.
	* posix/tst-regexloc.c: Don't include spawn.h or spawn_int.h.
	* posix/tst-vfork3.c: Include sys/stat.h.
	* resolv/tst-bug18665-tcp.c: Include stdlib.h.
	* resolv/tst-res_hconf_reorder.c: Include stdlib.h.
	* resolv/tst-resolv-search.c: Include stdlib.h.
	* stdio-common/tst-fmemopen2.c: Include stdint.h.
	* stdio-common/tst-vfprintf-width-prec.c: Include stdlib.h.
	* stdlib/test-canon.c: Include sys/stat.h.
	* stdlib/tst-tls-atexit.c: Include stdbool.h.
	* string/test-memchr.c: Include stdint.h.
	* string/tst-cmp.c: Include stdint.h.
	* sysdeps/pthread/tst-timer.c: Include stdint.h.
	* sysdeps/unix/sysv/linux/tst-sync_file_range.c: Include stdint.h.
	* sysdeps/wordsize-64/tst-writev.c: Include limits.h and stdint.h.
	* sysdeps/x86_64/fpu/math-tests-arch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/multiarch/test-multiarch.h: Include cpu-features.h.
	Don't include init-arch.h.
	* sysdeps/x86_64/tst-auditmod10b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod3b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod4b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod5b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod6b.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod6c.c: Include link.h and stddef.h.
	* sysdeps/x86_64/tst-auditmod7b.c: Include link.h and stddef.h.
	* time/clocktest.c: Include stdint.h.
	* time/tst-posixtz.c: Include stdint.h.
	* timezone/tst-timezone.c: Include stdint.h.
2017-02-16 17:33:18 -05:00
H.J. Lu
02b78ff749 Add VZEROUPPER to memset-vec-unaligned-erms.S [BZ #21081]
Since memset-vec-unaligned-erms.S has VDUP_TO_VEC0_AND_SET_RETURN at
function entry, memset optimized for AVX2 and AVX512 will always use
ymm/zmm register. VZEROUPPER should be placed before ret in

L(stosb):
        movq    %rdx, %rcx
        movzbl  %sil, %eax
        movq    %rdi, %rdx
        rep stosb
        movq    %rdx, %rax
        ret

since it can be reached from

L(stosb_more_2x_vec):
        cmpq    $REP_STOSB_THRESHOLD, %rdx
        ja      L(stosb)

	[BZ #21081]
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(L(stosb)): Add VZEROUPPER before ret.
2017-01-30 10:59:31 -08:00
Adhemerval Zanella
8dad72997a Fix x86 strncat optimized implementation for large sizes
Similar to BZ#19387, BZ#21014, and BZ#20971, both x86 sse2 strncat
optimized assembly implementations do not handle the size overflow
correctly.

The x86_64 one is in fact an issue with strcpy-sse2-unaligned, but
that is triggered also with strncat optimized implementation.

This patch uses a similar strategy used on 3daef2c8ee, where
saturared math is used for overflow case.

Checked on x86_64-linux-gnu and i686-linux-gnu.  It fixes BZ #19390.

	[BZ #19390]
	* string/test-strncat.c (test_main): Add tests with SIZE_MAX as
	maximum string size.
	* sysdeps/i386/i686/multiarch/strcat-sse2.S (STRCAT): Avoid overflow
	in pointer addition.
	* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S (STRCPY):
	Likewise.
2017-01-03 14:24:53 -02:00
Joseph Myers
bfff8b1bec Update copyright dates with scripts/update-copyrights. 2017-01-01 00:14:16 +00:00
H.J. Lu
f43cb35c9b Require binutils 2.24 to build x86-64 glibc [BZ #20139]
If assembler doesn't support AVX512DQ, _dl_runtime_resolve_avx is used
to save the first 8 vector registers, which only saves the lower 256
bits of vector register, for lazy binding.  When it is called on AVX512
platform, the upper 256 bits of ZMM registers are clobbered.  Parameters
passed in ZMM registers will be wrong when the function is called the
first time.  This patch requires binutils 2.24, whose assembler can store
and load ZMM registers, to build x86-64 glibc.  Since mathvec library
needs assembler support for AVX512DQ,  we disable mathvec if assembler
doesn't support AVX512DQ.

	[BZ #20139]
	* config.h.in (HAVE_AVX512_ASM_SUPPORT): Renamed to ...
	(HAVE_AVX512DQ_ASM_SUPPORT): This.
	* sysdeps/x86_64/configure.ac: Require assembler from binutils
	2.24 or above.
	(HAVE_AVX512_ASM_SUPPORT): Removed.
	(HAVE_AVX512DQ_ASM_SUPPORT): New.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/dl-trampoline.S: Make HAVE_AVX512_ASM_SUPPORT
	check unconditional.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_cos8_core_avx512.S: Check
	HAVE_AVX512DQ_ASM_SUPPORT instead of HAVE_AVX512_ASM_SUPPORT.
	* sysdeps/x86_64/fpu/multiarch/svml_d_exp8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_log8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_pow8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sin8_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_cosf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_expf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_logf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_powf16_core_avx512.S:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sincosf16_core_avx51:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/svml_s_sinf16_core_avx512.S:
	Likewise.
2016-07-01 06:03:05 -07:00
H.J. Lu
13efa86ece Check Prefer_ERMS in memmove/memcpy/mempcpy/memset
Although the Enhanced REP MOVSB/STOSB (ERMS) implementations of memmove,
memcpy, mempcpy and memset aren't used by the current processors, this
patch adds Prefer_ERMS check in memmove, memcpy, mempcpy and memset so
that they can be used in the future.

	* sysdeps/x86/cpu-features.h (bit_arch_Prefer_ERMS): New.
	(index_arch_Prefer_ERMS): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Return
	__memcpy_erms for Prefer_ERMS.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__memmove_erms): Enabled for libc.a.
	* ysdeps/x86_64/multiarch/memmove.S (__libc_memmove): Return
	__memmove_erms or Prefer_ERMS.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Return
	__mempcpy_erms for Prefer_ERMS.
	* sysdeps/x86_64/multiarch/memset.S (memset): Return
	__memset_erms for Prefer_ERMS.
2016-06-30 07:58:11 -07:00
H.J. Lu
c867597bff X86-64: Remove previous default/SSE2/AVX2 memcpy/memmove
Since the new SSE2/AVX2 memcpy/memmove are faster than the previous ones,
we can remove the previous SSE2/AVX2 memcpy/memmove and replace them with
the new ones.

No change in IFUNC selection if SSE2 and AVX2 memcpy/memmove weren't used
before.  If SSE2 or AVX2 memcpy/memmove were used, the new SSE2 or AVX2
memcpy/memmove optimized with Enhanced REP MOVSB will be used for
processors with ERMS.  The new AVX512 memcpy/memmove will be used for
processors with AVX512 which prefer vzeroupper.

Since the new SSE2 memcpy/memmove are faster than the previous default
memcpy/memmove used in libc.a and ld.so, we also remove the previous
default memcpy/memmove and make them the default memcpy/memmove, except
that non-temporal store isn't used in ld.so.

Together, it reduces the size of libc.so by about 6 KB and the size of
ld.so by about 2 KB.

	[BZ #19776]
	* sysdeps/x86_64/memcpy.S: Make it dummy.
	* sysdeps/x86_64/mempcpy.S: Likewise.
	* sysdeps/x86_64/memmove.S: New file.
	* sysdeps/x86_64/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/memmove.c: Removed.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-sse2-unaligned, memmove-avx-unaligned,
	memcpy-avx-unaligned and memmove-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Replace
	__memmove_chk_avx512_unaligned_2 with
	__memmove_chk_avx512_unaligned.  Remove
	__memmove_chk_avx_unaligned_2.  Replace
	__memmove_chk_sse2_unaligned_2 with
	__memmove_chk_sse2_unaligned.  Remove __memmove_chk_sse2 and
	__memmove_avx_unaligned_2.  Replace __memmove_avx512_unaligned_2
	with __memmove_avx512_unaligned.  Replace
	__memmove_sse2_unaligned_2 with __memmove_sse2_unaligned.
	Remove __memmove_sse2.  Replace __memcpy_chk_avx512_unaligned_2
	with __memcpy_chk_avx512_unaligned.  Remove
	__memcpy_chk_avx_unaligned_2.  Replace
	__memcpy_chk_sse2_unaligned_2 with __memcpy_chk_sse2_unaligned.
	Remove __memcpy_chk_sse2.  Remove __memcpy_avx_unaligned_2.
	Replace __memcpy_avx512_unaligned_2 with
	__memcpy_avx512_unaligned.  Remove __memcpy_sse2_unaligned_2
	and __memcpy_sse2.  Replace __mempcpy_chk_avx512_unaligned_2
	with __mempcpy_chk_avx512_unaligned.  Remove
	__mempcpy_chk_avx_unaligned_2.  Replace
	__mempcpy_chk_sse2_unaligned_2 with
	__mempcpy_chk_sse2_unaligned.  Remove __mempcpy_chk_sse2.
	Replace __mempcpy_avx512_unaligned_2 with
	__mempcpy_avx512_unaligned.  Remove __mempcpy_avx_unaligned_2.
	Replace __mempcpy_sse2_unaligned_2 with
	__mempcpy_sse2_unaligned.  Remove __mempcpy_sse2.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Support
	__memcpy_avx512_unaligned_erms and __memcpy_avx512_unaligned.
	Use __memcpy_avx_unaligned_erms and __memcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __memcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../memcpy.S.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Support
	__memcpy_chk_avx512_unaligned_erms and
	__memcpy_chk_avx512_unaligned.  Use
	__memcpy_chk_avx_unaligned_erms and
	__memcpy_chk_sse2_unaligned_erms if if processor has ERMS.
	Default to __memcpy_chk_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	Change function suffix from unaligned_2 to unaligned.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Support
	__mempcpy_avx512_unaligned_erms and __mempcpy_avx512_unaligned.
	Use __mempcpy_avx_unaligned_erms and __mempcpy_sse2_unaligned_erms
	if processor has ERMS.  Default to __mempcpy_sse2_unaligned.
	(ENTRY): Removed.
	(END): Likewise.
	(ENTRY_CHK): Likewise.
	(libc_hidden_builtin_def): Likewise.
	Don't include ../mempcpy.S.
	(mempcpy): New.  Add a weak alias.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Support
	__mempcpy_chk_avx512_unaligned_erms and
	__mempcpy_chk_avx512_unaligned.  Use
	__mempcpy_chk_avx_unaligned_erms and
	__mempcpy_chk_sse2_unaligned_erms if if processor has ERMS.
	Default to __mempcpy_chk_sse2_unaligned.
2016-06-08 13:58:08 -07:00
H.J. Lu
5e8c5bb1ac X86-64: Remove the previous SSE2/AVX2 memsets
Since the new SSE2/AVX2 memsets are faster than the previous ones, we
can remove the previous SSE2/AVX2 memsets and replace them with the
new ones.  This reduces the size of libc.so by about 900 bytes.

No change in IFUNC selection if SSE2 and AVX2 memsets weren't used
before.  If SSE2 or AVX2 memset was used, the new SSE2 or AVX2 memset
optimized with Enhanced REP STOSB will be used for processors with
ERMS.  The new AVX512 memset will be used for processors with AVX512
which prefer vzeroupper.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Folded
	into ...
	* sysdeps/x86_64/memset.S: This.
	(__bzero): Removed.
	(__memset_tail): Likewise.
	(__memset_chk): Likewise.
	(memset): Likewise.
	(MEMSET_CHK_SYMBOL): New. Define only if MEMSET_SYMBOL isn't
	defined.
	(MEMSET_SYMBOL): Define only if MEMSET_SYMBOL isn't defined.
	* sysdeps/x86_64/multiarch/memset-avx2.S: Removed.
	(__memset_zero_constant_len_parameter): Check SHARED instead of
	PIC.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memset-avx2 and memset-sse2-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Remove __memset_chk_sse2,
	__memset_chk_avx2, __memset_sse2 and __memset_avx2_unaligned.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__bzero): Enabled.
	* sysdeps/x86_64/multiarch/memset.S (memset): Replace
	__memset_sse2 and __memset_avx2 with __memset_sse2_unaligned
	and __memset_avx2_unaligned.  Use __memset_sse2_unaligned_erms
	or __memset_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	(memset): Removed.
	(__memset_chk): Likewise.
	(MEMSET_SYMBOL): New.
	(libc_hidden_builtin_def): Replace __memset_sse2 with
	__memset_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memset_chk.S (__memset_chk): Replace
	__memset_chk_sse2 and __memset_chk_avx2 with
	__memset_chk_sse2_unaligned and __memset_chk_avx2_unaligned_erms.
	Use __memset_chk_sse2_unaligned_erms or
	__memset_chk_avx2_unaligned_erms if processor has ERMS.  Support
	__memset_chk_avx512_unaligned_erms and
	__memset_chk_avx512_unaligned.
2016-06-08 13:56:14 -07:00
H.J. Lu
3f61232ab3 Fix a typo in comments in memmove-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Fix
	a typo in comments.
2016-06-06 16:03:21 -07:00
H.J. Lu
eb2c88c7c8 Remove alignments on jump targets in memset
X86-64 memset-vec-unaligned-erms.S aligns many jump targets, which
increases code sizes, but not necessarily improve performance.  As
memset benchtest data of align vs no align on various Intel and AMD
processors

https://sourceware.org/bugzilla/attachment.cgi?id=9277

shows that aligning jump targets isn't necessary.

	[BZ #20115]
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S (__memset):
	Remove alignments on jump targets.
2016-05-19 08:49:55 -07:00
H.J. Lu
2a1f15b1a9 Remove x86 ifunc-defines.sym and rtld-global-offsets.sym
Merge x86 ifunc-defines.sym with x86 cpu-features-offsets.sym.  Remove
x86 ifunc-defines.sym and rtld-global-offsets.sym.  No code changes on
i686 and x86-64.

	* sysdeps/i386/i686/multiarch/Makefile (gen-as-const-headers):
	Remove ifunc-defines.sym.
	* sysdeps/x86_64/multiarch/Makefile (gen-as-const-headers):
	Likewise.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym: Removed.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Likewise.
	* sysdeps/x86/Makefile (gen-as-const-headers): Remove
	rtld-global-offsets.sym.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym: Merged with ...
	* sysdeps/x86/cpu-features-offsets.sym: This.
	* sysdeps/x86/cpu-features.h: Include <cpu-features-offsets.h>
	instead of <ifunc-defines.h> and <rtld-global-offsets.h>.
2016-05-11 05:51:39 -07:00
H.J. Lu
a057f5f8cd X86-64: Use non-temporal store in memcpy on large data
The large memcpy micro benchmark in glibc shows that there is a
regression with large data on Haswell machine.  non-temporal store in
memcpy on large data can improve performance significantly.  This
patch adds a threshold to use non temporal store which is 6 times of
shared cache size.  When size is above the threshold, non temporal
store will be used, but avoid non-temporal store if there is overlap
between destination and source since destination may be in cache when
source is loaded.

For size below 8 vector register width, we load all data into registers
and store them together.  Only forward and backward loops, which move 4
vector registers at a time, are used to support overlapping addresses.
For forward loop, we load the last 4 vector register width of data and
the first vector register width of data into vector registers before the
loop and store them after the loop.  For backward loop, we load the first
4 vector register width of data and the last vector register width of
data into vector registers before the loop and store them after the loop.

	[BZ #19928]
	* sysdeps/x86_64/cacheinfo.c (__x86_shared_non_temporal_threshold):
	New.
	(init_cacheinfo): Set __x86_shared_non_temporal_threshold to 6
	times of shared cache size.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S
	(VMOVNT): New.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
	(VMOVNT): Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S
	(VMOVNT): Likewise.
	(VMOVU): Changed to movups for smaller code sizes.
	(VMOVA): Changed to movaps for smaller code sizes.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: Update
	comments.
	(PREFETCH): New.
	(PREFETCH_SIZE): Likewise.
	(PREFETCHED_LOAD_SIZE): Likewise.
	(PREFETCH_ONE_SET): Likewise.
	Rewrite to use forward and backward loops, which move 4 vector
	registers at a time, to support overlapping addresses and use
	non temporal store if size is above the threshold and there is
	no overlap between destination and source.
2016-04-12 08:10:47 -07:00
H.J. Lu
a7d1c51482 X86-64: Prepare memmove-vec-unaligned-erms.S
Prepare memmove-vec-unaligned-erms.S to make the SSE2 version as the
default memcpy, mempcpy and memmove.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(MEMCPY_SYMBOL): New.
	(MEMPCPY_SYMBOL): Likewise.
	(MEMMOVE_CHK_SYMBOL): Likewise.
	Replace MEMMOVE_SYMBOL with MEMMOVE_CHK_SYMBOL on __mempcpy_chk
	symbols.  Replace MEMMOVE_SYMBOL with MEMPCPY_SYMBOL on
	__mempcpy symbols.  Provide alias for __memcpy_chk in libc.a.
	Provide alias for memcpy in libc.a and ld.so.
2016-04-06 10:19:16 -07:00
H.J. Lu
4af1bb06c5 X86-64: Prepare memset-vec-unaligned-erms.S
Prepare memset-vec-unaligned-erms.S to make the SSE2 version as the
default memset.

	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(MEMSET_CHK_SYMBOL): New.  Define if not defined.
	(__bzero): Check VEC_SIZE == 16 instead of USE_MULTIARCH.
	Disabled fro now.
	Replace MEMSET_SYMBOL with MEMSET_CHK_SYMBOL on __memset_chk
	symbols.  Properly check USE_MULTIARCH on __memset symbols.
2016-04-06 09:10:35 -07:00
H.J. Lu
ec0cac9a1f Force 32-bit displacement in memset-vec-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: Force
	32-bit displacement to avoid long nop between instructions.
2016-04-05 05:21:19 -07:00
H.J. Lu
696ac77484 Add a comment in memset-sse2-unaligned-erms.S
* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Add
	a comment on VMOVU and VMOVA.
2016-04-05 05:19:18 -07:00
H.J. Lu
5cd7af016d Don't put SSE2/AVX/AVX512 memmove/memset in ld.so
Since memmove and memset in ld.so don't use IFUNC, don't put SSE2, AVX
and AVX512 memmove and memset in ld.so.

	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: Skip
	if not in libc.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
2016-04-03 14:35:38 -07:00
H.J. Lu
ea2785e96f Fix memmove-vec-unaligned-erms.S
__mempcpy_erms and __memmove_erms can't be placed between __memmove_chk
and __memmove it breaks __memmove_chk.

Don't check source == destination first since it is less common.

	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	(__mempcpy_erms, __memmove_erms): Moved before __mempcpy_chk
	with unaligned_erms.
	(__memmove_erms): Skip if source == destination.
	(__memmove_unaligned_erms): Don't check source == destination
	first.
2016-04-03 12:38:25 -07:00
H.J. Lu
830566307f Add x86-64 memset with unaligned store and rep stosb
Implement x86-64 memset with unaligned store and rep movsb.  Support
16-byte, 32-byte and 64-byte vector register sizes.  A single file
provides 2 implementations of memset, one with rep stosb and the other
without rep stosb.  They share the same codes when size is between 2
times of vector register size and REP_STOSB_THRESHOLD which defaults
to 2KB.

Key features:

1. Use overlapping store to avoid branch.
2. For size <= 4 times of vector register size, fully unroll the loop.
3. For size > 4 times of vector register size, store 4 times of vector
register size at a time.

	[BZ #19881]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, memset-avx2-unaligned-erms and
	memset-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memset_chk_sse2_unaligned,
	__memset_chk_sse2_unaligned_erms, __memset_chk_avx2_unaligned,
	__memset_chk_avx2_unaligned_erms, __memset_chk_avx512_unaligned,
	__memset_chk_avx512_unaligned_erms, __memset_sse2_unaligned,
	__memset_sse2_unaligned_erms, __memset_erms,
	__memset_avx2_unaligned, __memset_avx2_unaligned_erms,
	__memset_avx512_unaligned_erms and __memset_avx512_unaligned.
	* sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S:
	Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:
	Likewise.
2016-03-31 10:06:07 -07:00
H.J. Lu
88b57b8ed4 Add x86-64 memmove with unaligned load/store and rep movsb
Implement x86-64 memmove with unaligned load/store and rep movsb.
Support 16-byte, 32-byte and 64-byte vector register sizes.  When
size <= 8 times of vector register size, there is no check for
address overlap bewteen source and destination.  Since overhead for
overlap check is small when size > 8 times of vector register size,
memcpy is an alias of memmove.

A single file provides 2 implementations of memmove, one with rep movsb
and the other without rep movsb.  They share the same codes when size is
between 2 times of vector register size and REP_MOVSB_THRESHOLD which
is 2KB for 16-byte vector register size and scaled up by large vector
register size.

Key features:

1. Use overlapping load and store to avoid branch.
2. For size <= 8 times of vector register size, load  all sources into
registers and store them together.
3. If there is no address overlap bewteen source and destination, copy
from both ends with 4 times of vector register size at a time.
4. If address of destination > address of source, backward copy 8 times
of vector register size at a time.
5. Otherwise, forward copy 8 times of vector register size at a time.
6. Use rep movsb only for forward copy.  Avoid slow backward rep movsb
by fallbacking to backward copy 8 times of vector register size at a
time.
7. Skip when address of destination == address of source.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memmove-avx-unaligned-erms and
	memmove-avx512-unaligned-erms.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test
	__memmove_chk_avx512_unaligned_2,
	__memmove_chk_avx512_unaligned_erms,
	__memmove_chk_avx_unaligned_2, __memmove_chk_avx_unaligned_erms,
	__memmove_chk_sse2_unaligned_2,
	__memmove_chk_sse2_unaligned_erms, __memmove_avx_unaligned_2,
	__memmove_avx_unaligned_erms, __memmove_avx512_unaligned_2,
	__memmove_avx512_unaligned_erms, __memmove_erms,
	__memmove_sse2_unaligned_2, __memmove_sse2_unaligned_erms,
	__memcpy_chk_avx512_unaligned_2,
	__memcpy_chk_avx512_unaligned_erms,
	__memcpy_chk_avx_unaligned_2, __memcpy_chk_avx_unaligned_erms,
	__memcpy_chk_sse2_unaligned_2, __memcpy_chk_sse2_unaligned_erms,
	__memcpy_avx_unaligned_2, __memcpy_avx_unaligned_erms,
	__memcpy_avx512_unaligned_2, __memcpy_avx512_unaligned_erms,
	__memcpy_sse2_unaligned_2, __memcpy_sse2_unaligned_erms,
	__memcpy_erms, __mempcpy_chk_avx512_unaligned_2,
	__mempcpy_chk_avx512_unaligned_erms,
	__mempcpy_chk_avx_unaligned_2, __mempcpy_chk_avx_unaligned_erms,
	__mempcpy_chk_sse2_unaligned_2, __mempcpy_chk_sse2_unaligned_erms,
	__mempcpy_avx512_unaligned_2, __mempcpy_avx512_unaligned_erms,
	__mempcpy_avx_unaligned_2, __mempcpy_avx_unaligned_erms,
	__mempcpy_sse2_unaligned_2, __mempcpy_sse2_unaligned_erms and
	__mempcpy_erms.
	* sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S: New
	file.
	* sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S:
	Likwise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S:
	Likwise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:
	Likwise.
2016-03-31 10:04:40 -07:00
H.J. Lu
064f01b10b Make __memcpy_avx512_no_vzeroupper an alias
Since x86-64 memcpy-avx512-no-vzeroupper.S implements memmove, make
__memcpy_avx512_no_vzeroupper an alias of __memmove_avx512_no_vzeroupper
to reduce code size of libc.so.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	memcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: Renamed
	to ...
	* sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: This.
	(MEMCPY): Don't define.
	(MEMCPY_CHK): Likewise.
	(MEMPCPY): Likewise.
	(MEMPCPY_CHK): Likewise.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMPCPY_CHK): Renamed to ...
	(__mempcpy_chk_avx512_no_vzeroupper): This.
	(MEMCPY_CHK): Renamed to ...
	(__memmove_chk_avx512_no_vzeroupper): This.
	(MEMCPY): Renamed to ...
	(__memmove_avx512_no_vzeroupper): This.
	(__memcpy_avx512_no_vzeroupper): New alias.
	(__memcpy_chk_avx512_no_vzeroupper): Likewise.
2016-03-28 13:16:22 -07:00
H.J. Lu
c365e615f7 Implement x86-64 multiarch mempcpy in memcpy
Implement x86-64 multiarch mempcpy in memcpy to share most of code.  It
reduces code size of libc.so.

	[BZ #18858]
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Remove
	mempcpy-ssse3, mempcpy-ssse3-back, mempcpy-avx-unaligned
	and mempcpy-avx512-no-vzeroupper.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMPCPY_CHK):
	New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S
	(MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S (MEMPCPY_CHK): New.
	(MEMPCPY): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-avx-unaligned.S: Removed.
	* sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S:
	Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy-ssse3.S: Likewise.
2016-03-28 13:13:51 -07:00
H.J. Lu
e41b395523 [x86] Add a feature bit: Fast_Unaligned_Copy
On AMD processors, memcpy optimized with unaligned SSE load is
slower than emcpy optimized with aligned SSSE3 while other string
functions are faster with unaligned SSE load.  A feature bit,
Fast_Unaligned_Copy, is added to select memcpy optimized with
unaligned SSE load.

	[BZ #19583]
	* sysdeps/x86/cpu-features.c (init_cpu_features): Set
	Fast_Unaligned_Copy with Fast_Unaligned_Load for Intel
	processors.  Set Fast_Copy_Backward for AMD Excavator
	processors.
	* sysdeps/x86/cpu-features.h (bit_arch_Fast_Unaligned_Copy):
	New.
	(index_arch_Fast_Unaligned_Copy): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check
	Fast_Unaligned_Copy instead of Fast_Unaligned_Load.
2016-03-28 04:40:03 -07:00
H.J. Lu
3c9a4cd16c Don't set %rcx twice before "rep movsb"
* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S (MEMCPY):
	Don't set %rcx twice before "rep movsb".
2016-03-22 08:36:16 -07:00
H.J. Lu
fee9eb6200 Group AVX512 functions in .text.avx512 section
* sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S:
	Replace .text with .text.avx512.
	* sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S:
	Likewise.
2016-03-06 16:48:11 -08:00
H.J. Lu
14a1d7cc4c x86-64: Fix memcpy IFUNC selection
Chek Fast_Unaligned_Load, instead of Slow_BSF, and also check for
Fast_Copy_Backward to enable __memcpy_ssse3_back.  Existing selection
order is updated with following selection order:

1. __memcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_sse2 if SSSE3 isn't available.
4. __memcpy_ssse3_back if Fast_Copy_Backward bit it set.
5. __memcpy_ssse3

	[BZ #18880]
	* sysdeps/x86_64/multiarch/memcpy.S: Check Fast_Unaligned_Load,
	instead of Slow_BSF, and also check for Fast_Copy_Backward to
	enable __memcpy_ssse3_back.
2016-03-04 08:39:07 -08:00
Andrew Senkevich
df782dc690 Fixed build with assembler w/o AVX-512 support.
* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Fixed build with
    assembler not supporting AVX-512.
2016-01-19 14:34:53 +03:00
Andrew Senkevich
214a44f394 Fixed typos in __memcpy_chk.
* sysdeps/x86_64/multiarch/memcpy_chk.S: Fixed typos.
2016-01-16 14:42:26 +03:00
Andrew Senkevich
72276d6e88 Added memcpy/memmove family optimized with AVX512 for KNL hardware.
Added AVX512 implementations of memcpy, mempcpy, memmove, memcpy_chk,
mempcpy_chk, memmove_chk.
It shows average improvement more than 30% over AVX versions on KNL
hardware (performance results in the thread
<https://sourceware.org/ml/libc-alpha/2016-01/msg00258.html>).

    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new files.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memcpy-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/mempcpy-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S: Likewise.
    * sysdeps/x86_64/multiarch/memcpy.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
    * sysdeps/x86_64/multiarch/memmove.c: Likewise.
    * sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
    * sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
2016-01-16 00:49:45 +03:00
Joseph Myers
f7a9f785e5 Update copyright dates with scripts/update-copyrights. 2016-01-04 16:05:18 +00:00
Andrew Senkevich
83d776f979 Added memset optimized with AVX512 for KNL hardware.
It shows improvement up to 28% over AVX2 memset (performance results
attached at <https://sourceware.org/ml/libc-alpha/2015-12/msg00052.html>).

    * sysdeps/x86_64/multiarch/memset-avx512-no-vzeroupper.S: New file.
    * sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Added new file.
    * sysdeps/x86_64/multiarch/ifunc-impl-list.c: Added new tests.
    * sysdeps/x86_64/multiarch/memset.S: Added new IFUNC branch.
    * sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
    * sysdeps/x86/cpu-features.h (bit_Prefer_No_VZEROUPPER,
    index_Prefer_No_VZEROUPPER): New.
    * sysdeps/x86/cpu-features.c (init_cpu_features): Set the
    Prefer_No_VZEROUPPER for Knights Landing.
2015-12-19 02:47:28 +03:00
Joseph Myers
c871b9b096 Remove -mavx2 configure tests.
There are configure tests for the -mavx2 compiler option.  AVX2
support was added in GCC 4.7, so these tests are now obsolete; this
patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by the patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_avx2): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* config.h.in (HAVE_AVX2_SUPPORT): Remove #undef.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 unconditionally instead of conditionally on
	[$(config-cflags-avx2) = yes].
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list) [HAVE_AVX2_SUPPORT]: Make code
	unconditional.
	* sysdeps/x86_64/multiarch/memset.S [HAVE_AVX2_SUPPORT]: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S
	[IS_IN (libc) && SHARED && HAVE_AVX2_SUPPORT]: Change conditional
	to [IS_IN (libc) && SHARED].
2015-10-28 13:29:03 +00:00
Joseph Myers
1b12cd7f4d Remove configure tests for AVX support.
GCC added support for -mavx and -msse2avx in version 4.4.  Thus the
configure tests for this support are obsolete, and this patch removes
them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by this patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_avx): Remove configure
	test.
	(libc_cv_cc_sse2avx): Likewise.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/i686/multiarch/Makefile
	[$(subdir)$(config-cflags-avx) = mathyes]: Change conditional to
	[$(subdir) = math].
	* sysdeps/i386/i686/multiarch/s_fma-fma.c [HAVE_AVX_SUPPORT]: Make
	code unconditional.
	* sysdeps/i386/i686/multiarch/s_fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/s_fmaf-fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_avx): Remove configure
	test.
	(libc_cv_cc_sse2avx): Likewise.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/Makefile [$(config-cflags-avx) = yes]: Make code
	unconditional.
	* sysdeps/x86_64/dl-trampoline.h (_dl_runtime_profile)
	[HAVE_AVX_SUPPORT || HAVE_AVX512_ASM_SUPPORT]: Make code
	unconditional.
	(_dl_runtime_profile)
	[!(HAVE_AVX_SUPPORT || HAVE_AVX512_ASM_SUPPORT)]: Remove
	conditional code.
	* sysdeps/x86_64/fpu/multiarch/Makefile
	[$(config-cflags-sse2avx) = yes]: Make code unconditional.
	* sysdeps/x86_64/fpu/multiarch/e_atan2.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_log.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_atan.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fma.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fmaf.c [HAVE_AVX_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sin.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_tan.c
	[HAVE_FMA4_SUPPORT || HAVE_AVX_SUPPORT]: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S [HAVE_AVX_SUPPORT]: Likewise.
	* config.h.in (HAVE_AVX_SUPPORT): Remove #undef.
	(HAVE_SSE2AVX_SUPPORT): Likewise.
2015-10-08 15:59:32 +00:00
Joseph Myers
3b7aa5bf59 Remove configure tests for SSE4 support.
GCC added support for -msse4 in version 4.3.  Thus the configure tests
for it are obsolete, and this patch removes them.

Tested for x86_64 and x86 (testsuite, and that installed stripped
shared libraries are unchanged by this patch).

	* sysdeps/i386/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/i386/i686/multiarch/Makefile
	[$(config-cflags-sse4) = yes]: Make code unconditional.
	* sysdeps/i386/i686/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/i386/i686/multiarch/strspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/configure.ac (libc_cv_cc_sse4): Remove configure
	test.
	* sysdeps/x86_64/configure: Regenerated.
	* sysdeps/x86_64/multiarch/Makefile [$(config-cflags-sse4) = yes]:
	Make code unconditional.
	* sysdeps/x86_64/multiarch/strcspn.S [HAVE_SSE4_SUPPORT]:
	Likewise.
	* sysdeps/x86_64/multiarch/strspn.S [HAVE_SSE4_SUPPORT]: Likewise.
	* config.h.in (HAVE_SSE4_SUPPORT): Remove #undef.
2015-10-06 20:47:40 +00:00
H.J. Lu
2339c6f4bd Remove x86-64 rtld-xxx.c and rtld-xxx.S
Since ld.so preserves vector registers now, we can use the regular,
non-ifunc string and memory functions in ld.so.

	* sysdeps/x86_64/rtld-memcmp.c: Removed.
	* sysdeps/x86_64/rtld-memset.S: Likewise.
	* sysdeps/x86_64/rtld-strchr.S: Likewise.
	* sysdeps/x86_64/rtld-strlen.S: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memset.S: Likewise.
2015-08-25 08:50:06 -07:00
H.J. Lu
daa4db69fc Remove the unused IFUNC files
sysdeps/i386/i686/multiarch/strcasestr-c.c became unused after

commit 1818483b15
Author: Andreas Schwab <schwab@suse.de>
Date:   Wed Dec 18 11:53:27 2013 +1000

    Remove use of SSE4.2 functions for strstr on i686

which contains

-sysdep_routines += strcspn-c strpbrk-c strspn-c strstr-c strcasestr-c
+sysdep_routines += strcspn-c strpbrk-c strspn-c

sysdeps/x86_64/multiarch/strcasestr.c became useless after

t 584b18eb4d
Author: Ondřej Bílka <neleai@seznam.cz>
Date:   Sat Dec 14 19:33:56 2013 +0100

    Add strstr with unaligned loads. Fixes bug 12100.

which changes sysdeps/x86_64/multiarch/strcasestr.c to

libc_ifunc (__strcasestr, __strcasestr_sse2);

This patch removes these file.

	* i386/i686/multiarch/strcasestr-c.c: Removed.
	* x86_64/multiarch/strcasestr.c: Likewise.
	* x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Remove strcasestr.
2015-08-20 12:47:20 -07:00
H.J. Lu
1ae6c72dc1 Move x86_64 init-arch.h to sysdeps/x86/init-arch.h
Move sysdeps/x86_64/multiarch/init-arch.h to sysdeps/x86/init-arch.h
which can be used for both i386 and x86_64.

	* sysdeps/i386/i686/multiarch/init-arch.h: Removed.
	* sysdeps/unix/sysv/linux/x86/init-arch.h: Likewise.
	* sysdeps/x86_64/cacheinfo.c: Include <init-arch.h> instead
	of "multiarch/init-arch.h".
	* sysdeps/x86_64/multiarch/init-arch.h: Renamed to ...
	* sysdeps/x86/init-arch.h: This.
2015-08-20 04:29:23 -07:00
H.J. Lu
0b5395f052 Update x86_64 multiarch functions for <cpu-features.h>
This patch updates x86_64 multiarch functions to use the newly defined
HAS_CPU_FEATURE, HAS_ARCH_FEATURE and LOAD_RTLD_GLOBAL_RO_RDX from
<cpu-features.h>.

	* sysdeps/x86_64/fpu/multiarch/e_asin.c: Replace HAS_XXX with
	HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX).
	* sysdeps/x86_64/fpu/multiarch/e_atan2.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_exp.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_log.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/e_pow.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_atan.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fma.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_fmaf.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sin.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_tan.c: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_ceil.S: Use
	LOAD_RTLD_GLOBAL_RO_RDX and HAS_CPU_FEATURE (SSE4_1).
	* sysdeps/x86_64/fpu/multiarch/s_ceilf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floor.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_floorf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_nearbyint.S : Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_nearbyintf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_rintf.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_rintf.S : Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c: Likewise.
	* sysdeps/x86_64/multiarch/sched_cpucount.c: Likewise.
	* sysdeps/x86_64/multiarch/strstr.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/test-multiarch.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Remove __init_cpu_features
	call.  Add LOAD_RTLD_GLOBAL_RO_RDX.  Replace HAS_XXX with
	HAS_CPU_FEATURE/HAS_ARCH_FEATURE (XXX).
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2015-08-13 03:41:30 -07:00
H.J. Lu
e2e4f56056 Add _dl_x86_cpu_features to rtld_global
This patch adds _dl_x86_cpu_features to rtld_global in x86 ld.so
and initializes it early before __libc_start_main is called so that
cpu_features is always available when it is used and we can avoid
calling __init_cpu_features in IFUNC selectors.

	* sysdeps/i386/dl-machine.h: Include <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/i386/dl-procinfo.c (_dl_x86_cpu_features): New.
	* sysdeps/i386/i686/cacheinfo.c
	(DISABLE_PREFERRED_MEMORY_INSTRUCTION): Removed.
	* sysdeps/i386/i686/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/i386/i686/multiarch/Versions: Removed.
	* sysdeps/i386/i686/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/i386/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/unix/sysv/linux/x86/Makefile
	(libpthread-sysdep_routines): Remove init-arch.
	* sysdeps/unix/sysv/linux/x86_64/dl-procinfo.c: Include
	<sysdeps/x86_64/dl-procinfo.c> instead of
	sysdeps/generic/dl-procinfo.c>.
	* sysdeps/x86/Makefile [$(subdir) == csu] (gen-as-const-headers):
	Add cpu-features-offsets.sym and rtld-global-offsets.sym.
	[$(subdir) == elf] (sysdep-dl-routines): Add dl-get-cpu-features.
	[$(subdir) == elf] (tests): Add tst-get-cpu-features.
	[$(subdir) == elf] (tests-static): Add
	tst-get-cpu-features-static.
	* sysdeps/x86/Versions: New file.
	* sysdeps/x86/cpu-features-offsets.sym: Likewise.
	* sysdeps/x86/cpu-features.c: Likewise.
	* sysdeps/x86/cpu-features.h: Likewise.
	* sysdeps/x86/dl-get-cpu-features.c: Likewise.
	* sysdeps/x86/libc-start.c: Likewise.
	* sysdeps/x86/rtld-global-offsets.sym: Likewise.
	* sysdeps/x86/tst-get-cpu-features-static.c: Likewise.
	* sysdeps/x86/tst-get-cpu-features.c: Likewise.
	* sysdeps/x86_64/dl-procinfo.c: Likewise.
	* sysdeps/x86_64/cacheinfo.c (__cpuid_count): Removed.
	Assume USE_MULTIARCH is defined and don't check it.
	(is_intel): Replace __cpu_features with GLRO(dl_x86_cpu_features).
	(is_amd): Likewise.
	(max_cpuid): Likewise.
	(intel_check_word): Likewise.
	(__cache_sysconf): Don't call __init_cpu_features.
	(__x86_preferred_memory_instruction): Removed.
	(init_cacheinfo): Don't call __init_cpu_features. Replace
	__cpu_features with GLRO(dl_x86_cpu_features).
	* sysdeps/x86_64/dl-machine.h: <cpu-features.c>.
	(dl_platform_init): Call init_cpu_features.
	* sysdeps/x86_64/ldsodefs.h: Include <cpu-features.h>.
	* sysdeps/x86_64/multiarch/Makefile (aux): Remove init-arch.
	* sysdeps/x86_64/multiarch/Versions: Removed.
	* sysdeps/x86_64/multiarch/cacheinfo.c: Likewise.
	* sysdeps/x86_64/multiarch/init-arch.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-defines.sym (KIND_OFFSET):
	Removed.
	* sysdeps/x86_64/multiarch/init-arch.h: Rewrite.
2015-08-13 03:41:22 -07:00
H.J. Lu
3b8d2eb7f8 Compile {memcpy,strcmp}-sse2-unaligned.S only for libc
{memcpy,strcmp}-sse2-unaligned.S aren't needed in ld.so.

	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Compile
	only for libc.
	* sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S: Likewise.
2015-08-05 08:28:37 -07:00
Andrew Senkevich
5fe2a126d1 This patch adds detection of availability for AVX512F and AVX512DQ ISAs.
* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX512F_Usable,
    bit_AVX512DQ_Usable, bit_Opmask_state, bit_ZMM0_15_state,
    bit_ZMM16_31_state): New macro.
    * sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
    Check and set bit_AVX512F_Usable, bit_AVX512DQ_Usable.
2015-06-08 14:07:59 +03:00
Ondřej Bílka
0f4840be25 Use strspn/strcspn/strpbrk ifunc in internal calls.
To make a strtok faster and improve performance in general we need to do one
additional change.

A comment:

/* It doesn't make sense to send libc-internal strcspn calls through a PLT.
   The speedup we get from using SSE4.2 instruction is likely eaten away
   by the indirect call in the PLT.  */

Does not make sense at all because nobody bothered to check it. Gap
between these implementations is quite big, when haystack is empty a
sse2 is around 40 cycles slower because it needs to populate a lookup
table and difference only increases with size. That is much bigger than
plt slowdown which is few cycles.

Even benchtest show a gap which also may be reverse by branch
misprediction but my internal benchmark shown.

 simple_strspn stupid_strspn __strspn_sse42  __strspn_sse2
Length    0, alignment  0, acc len  6:  18.6562 35.2344 17.0469 61.6719
Length    6, alignment  0, acc len  6:  59.5469 72.5781 16.4219 73.625

This patch also handles strpbrk which is implemented by including a
x86_64/multiarch/strcspn.S file.

	* sysdeps/x86_64/multiarch/strspn.S: Remove plt indirection.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
2015-05-12 20:18:51 +02:00
H.J. Lu
5f3d0b78e0 Use AVX unaligned memcpy only if AVX2 is available
memcpy with unaligned 256-bit AVX register loads/stores are slow on older
processorsl like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
and sets it only when AVX2 is available.

	[BZ #17801]
	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
	New.
	(index_AVX_Fast_Unaligned_Load): Likewise.
	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
2015-01-30 15:37:58 -08:00
H.J. Lu
972af9e8dd Also treat model numbers 0x5a/0x5d as Silvermont 2015-01-23 18:52:45 -08:00
H.J. Lu
ede0236c86 Treat model numbers 0x4a/0x4d as Silvermont
* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Treat model numbers 0x4a/0x4d as Intel Silvermont architecture.
2015-01-23 18:08:10 -08:00
Joseph Myers
b168057aaa Update copyright dates with scripts/update-copyrights. 2015-01-02 16:29:47 +00:00
Siddhesh Poyarekar
4f41c682f3 Remove NOT_IN_libc
Replace with !IS_IN (libc).  This completes the transition from
the IS_IN/NOT_IN macros to the IN_MODULE macro set.

The generated code is unchanged on x86_64.

	* stdlib/isomac.c (fmt): Replace NOT_IN_libc with IN_MODULE.
	(get_null_defines): Adjust.
	* sunrpc/Makefile: Adjust comment.
	* Makerules (CPPFLAGS-nonlib): Remove NOT_IN_libc.
	* elf/Makefile (CPPFLAGS-sotruss-lib): Likewise.
	(CFLAGS-interp.c): Likewise.
	(CFLAGS-ldconfig.c): Likewise.
	(CPPFLAGS-.os): Likewise.
	* elf/rtld-Rules (rtld-CPPFLAGS): Likewise.
	* extra-lib.mk (CPPFLAGS-$(lib)): Likewise.
	* extra-modules.mk (extra-modules.mk): Likewise.
	* iconv/Makefile (CPPFLAGS-iconvprogs): Likewise.
	* locale/Makefile (CPPFLAGS-locale_programs): Likewise.
	* malloc/Makefile (CPPFLAGS-memusagestat): Likewise.
	* nscd/Makefile (CPPFLAGS-nscd): Likewise.
	* nss/Makefile (CPPFLAGS-nss_test1): Likewise.
	* stdlib/Makefile (CFLAGS-tst-putenvmod.c): Likewise.
	* sysdeps/gnu/Makefile ($(objpfx)errlist-compat.c): Likewise.
	* sysdeps/unix/sysv/linux/Makefile (CPPFLAGS-lddlibc4): Likewise.
	* iconvdata/Makefile (CPPFLAGS): Likewise.
	(cpp-srcs-left): Add libof for all iconvdata routines.
	* bits/stdio-lock.h: Replace NOT_IN_libc with IS_IN.
	* include/assert.h: Likewise.
	* include/ctype.h: Likewise.
	* include/errno.h: Likewise.
	* include/libc-symbols.h: Likewise.
	* include/math.h: Likewise.
	* include/netdb.h: Likewise.
	* include/resolv.h: Likewise.
	* include/stdio.h: Likewise.
	* include/stdlib.h: Likewise.
	* include/string.h: Likewise.
	* include/sys/stat.h: Likewise.
	* include/wctype.h: Likewise.
	* intl/l10nflist.c: Likewise.
	* libidn/idn-stub.c: Likewise.
	* libio/libioP.h: Likewise.
	* nptl/libc_multiple_threads.c: Likewise.
	* nptl/pthreadP.h: Likewise.
	* posix/regex_internal.h: Likewise.
	* resolv/res_hconf.c: Likewise.
	* sysdeps/arm/armv7/multiarch/memcpy.S: Likewise.
	* sysdeps/arm/memmove.S: Likewise.
	* sysdeps/arm/sysdep.h: Likewise.
	* sysdeps/generic/_itoa.h: Likewise.
	* sysdeps/generic/symbol-hacks.h: Likewise.
	* sysdeps/gnu/errlist.awk: Likewise.
	* sysdeps/gnu/errlist.c: Likewise.
	* sysdeps/i386/i586/memcpy.S: Likewise.
	* sysdeps/i386/i586/memset.S: Likewise.
	* sysdeps/i386/i686/memcpy.S: Likewise.
	* sysdeps/i386/i686/memmove.S: Likewise.
	* sysdeps/i386/i686/mempcpy.S: Likewise.
	* sysdeps/i386/i686/memset.S: Likewise.
	* sysdeps/i386/i686/multiarch/bcopy.S: Likewise.
	* sysdeps/i386/i686/multiarch/bzero.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp-sse4.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy-ssse3-rep.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/memmove.S: Likewise.
	* sysdeps/i386/i686/multiarch/memmove_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/mempcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset-sse2-rep.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset.S: Likewise.
	* sysdeps/i386/i686/multiarch/memset_chk.S: Likewise.
	* sysdeps/i386/i686/multiarch/rawmemchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcat.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp-sse4.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/strcspn.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strlen.S: Likewise.
	* sysdeps/i386/i686/multiarch/strnlen.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr-sse2-bsf.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/strrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/strspn.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcschr.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscmp-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscmp.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy-ssse3.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcscpy.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcslen.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/i386/i686/multiarch/wcsrchr.S: Likewise.
	* sysdeps/i386/i686/multiarch/wmemcmp-c.c: Likewise.
	* sysdeps/i386/i686/multiarch/wmemcmp.S: Likewise.
	* sysdeps/ia64/fpu/libm-symbols.h: Likewise.
	* sysdeps/nptl/bits/libc-lock.h: Likewise.
	* sysdeps/nptl/bits/libc-lockP.h: Likewise.
	* sysdeps/nptl/bits/stdio-lock.h: Likewise.
	* sysdeps/posix/closedir.c: Likewise.
	* sysdeps/posix/opendir.c: Likewise.
	* sysdeps/posix/readdir.c: Likewise.
	* sysdeps/posix/rewinddir.c: Likewise.
	* sysdeps/powerpc/novmx-sigjmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/bsd-_setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/fpu/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/fpu/setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/bzero.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcpy-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memcpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memmove.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/mempcpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memrchr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memrchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memset-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/memset.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/rawmemchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strcasecmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strchrnul.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strlen-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strlen.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncase.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncase_l.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncmp-ppc32.S: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strncmp.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/strnlen.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr-ppc32.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc32/power4/multiarch/wordcopy.c: Likewise.
	* sysdeps/powerpc/powerpc32/power6/memset.S: Likewise.
	* sysdeps/powerpc/powerpc32/setjmp.S: Likewise.
	* sysdeps/powerpc/powerpc64/__longjmp.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/bzero.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcmp-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memmove-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memmove.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/mempcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memset-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/memset.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/stpncpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcat.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strchrnul.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcmp-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcpy-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strcspn.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strlen-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strlen.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncase.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncase_l.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncat.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncmp-ppc64.S: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncmp.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncpy-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strncpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strnlen.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strpbrk.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strrchr-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strspn-ppc64.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/strspn.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcschr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcscpy.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wcsrchr.c: Likewise.
	* sysdeps/powerpc/powerpc64/multiarch/wordcopy.c: Likewise.
	* sysdeps/powerpc/powerpc64/setjmp.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/ifunc-resolve.c: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcmp.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memcpy.S: Likewise.
	* sysdeps/s390/s390-32/multiarch/memset.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/ifunc-resolve.c: Likewise.
	* sysdeps/s390/s390-64/multiarch/memcmp.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/memcpy.S: Likewise.
	* sysdeps/s390/s390-64/multiarch/memset.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara1.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara2.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-niagara4.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy-ultra3.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memcpy.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset-niagara1.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset-niagara4.S: Likewise.
	* sysdeps/sparc/sparc64/multiarch/memset.S: Likewise.
	* sysdeps/unix/alpha/sysdep.S: Likewise.
	* sysdeps/unix/alpha/sysdep.h: Likewise.
	* sysdeps/unix/make-syscalls.sh: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/alpha/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/alpha/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/arm/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/arm/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/getpid.c: Likewise.
	* sysdeps/unix/sysv/linux/hppa/nptl/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/hppa/nptl/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/i486/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/i386/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/i386/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/ia64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/lowlevellock-futex.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/bits/m68k-vdso.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/m68k/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/microblaze/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/mips/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/not-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/longjmp_chk.c: Likewise.
	* sysdeps/unix/sysv/linux/s390/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep.S: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/sh/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/sh/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/sh/vfork.S: Likewise.
	* sysdeps/unix/sysv/linux/sparc/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/brk.S: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/sysdep.h: Likewise.
	* sysdeps/unix/sysv/linux/tile/waitpid.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/lowlevellock.h: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/sysdep-cancel.h: Likewise.
	* sysdeps/unix/sysv/linux/x86_64/sysdep.h: Likewise.
	* sysdeps/wordsize-32/symbol-hacks.h: Likewise.
	* sysdeps/x86_64/memcpy.S: Likewise.
	* sysdeps/x86_64/memmove.c: Likewise.
	* sysdeps/x86_64/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/init-arch.h: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-sse4.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr-sse2-no-bsf.S: Likewise.
	* sysdeps/x86_64/multiarch/strchr.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy-ssse3.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscpy.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp-c.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
	* sysdeps/x86_64/strcmp.S: Likewise.
2014-11-24 15:03:45 +05:30
Andreas Schwab
3574f2fdf3 Fix misdetected Slow_SSE4_2 cpu feature bit (bug 17501) 2014-10-27 10:44:28 +01:00
Ling Ma
05f3633da4 Improve 64bit memcpy performance for Haswell CPU with AVX instruction
In this patch we take advantage of HSW memory bandwidth, manage to
reduce miss branch prediction by avoiding using branch instructions and
force destination to be aligned with avx instruction.

The CPU2006 403.gcc benchmark indicates this patch improves performance
from 2% to 10%.
2014-07-30 08:02:35 -07:00
H.J. Lu
f2fef657d8 Enable AVX2 optimized memset only if -mavx2 works
* config.h.in (HAVE_AVX2_SUPPORT): New #undef.
	* sysdeps/i386/configure.ac: Set HAVE_AVX2_SUPPORT and
	config-cflags-avx2.
	* sysdeps/x86_64/configure.ac: Likewise.
	* sysdeps/i386/configure: Regenerated.
	* sysdeps/x86_64/configure: Likewise.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-avx2 only if config-cflags-avx2 is yes.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Tests for memset_chk and memset only if HAVE_AVX2_SUPPORT is
	defined.
	* sysdeps/x86_64/multiarch/memset.S: Define multiple versions
	only if HAVE_AVX2_SUPPORT is defined.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
2014-07-14 07:58:27 -07:00
H.J. Lu
d92d8f8a42 Add ifunc tests for x86_64 memset_chk and memset
This patch adds ifunc tests for x86_64 memset_chk and memset.  It also
defines HAS_AVX2 with AVX2_Usable since AVX2 may not be usable even if
processor has AVX2.

	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Add tests for memset_chk and memset.

	* sysdeps/x86_64/multiarch/init-arch.h (HAS_AVX2): Defined
	with AVX2_Usable.
2014-06-20 14:52:29 -07:00
H.J. Lu
224c6c51c5 Remove sysdeps/x86_64/multiarch/rtld-strlen.S
Since there is no sysdeps/x86_64/multiarch/strlen.S,
sysdeps/x86_64/rtld-strlen.S will be used.

	* sysdeps/x86_64/multiarch/rtld-strlen.S: Removed.
2014-06-20 08:10:07 -07:00
Ling Ma
5c74e47cd6 Add x86_64 memset optimized for AVX2
In this patch we take advantage of HSW memory bandwidth, manage to
reduce miss branch prediction by avoiding using branch instructions and
force destination to be aligned with avx & avx2 instruction.

The CPU2006 403.gcc benchmark indicates this patch improves performance
from 26% to 59%.

	* sysdeps/x86_64/multiarch/Makefile: Add memset-avx2.
	* sysdeps/x86_64/multiarch/memset-avx2.S: New file.
	* sysdeps/x86_64/multiarch/memset.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/rtld-memset.S: Likewise.
2014-06-19 15:14:08 -07:00
Carlos O'Donell
8f1df5cf9d Fix -Wundef warning for FEATURE_INDEX_1.
Define FEATURE_INDEX_1 and FEATURE_INDEX_MAX as macros
for use by both assembly and C code. This fixes the
-Wundef error for cases where FEATURE_INDEX_1 was not
defined but used the correct value of 0 for an undefined
macro.
2014-05-03 00:25:21 -04:00
Sihai Yao
f9281df995 Detect if AVX2 is usable
This patch checks and sets bit_AVX2_Usable in __cpu_features.feature.

	* sysdeps/x86_64/multiarch/ifunc-defines.sym (COMMON_CPUID_INDEX_7):
	New.
	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
	Check and set bit_AVX2_Usable.
	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX2_Usable): New
	macro.
	(bit_AVX2): Likewise.
	(index_AVX2_Usable): Likewise.
	(CPUID_AVX2): Likewise.
	(HAS_AVX2): Likewise.
2014-04-17 08:00:21 -07:00
Allan McRae
d4697bc93d Update copyright notices with scripts/update-copyrights 2014-01-01 22:00:23 +10:00
Allan McRae
6f8e37ebf8 Update file name in x86_64 ifunc list
File name update missed in commit 584b18eb.
2013-12-16 13:00:39 +10:00
Ondřej Bílka
584b18eb4d Add strstr with unaligned loads. Fixes bug 12100.
A sse42 version of strstr used pcmpistr instruction which is quite
ineffective. A faster way is look for pairs of characters which is uses
sse2, is faster than pcmpistr and for real strings a pairs we look for
are relatively rare.

For linear time complexity we use buy or rent technique which switches
to two-way algorithm when superlinear behaviour is detected.
2013-12-14 20:08:13 +01:00
Ondřej Bílka
e7044ea76b Use p2align instead ALIGN 2013-10-08 15:46:48 +02:00
Ondřej Bílka
dc1a95c730 Faster strrchr. 2013-09-26 19:23:01 +02:00
Ondřej Bílka
5905e7b3e2 Faster strchr implementation. 2013-09-11 17:07:38 +02:00
Ondřej Bílka
8f02859f17 Add unaligned strcmp. 2013-09-03 16:27:10 +02:00