Commit Graph

105 Commits

Author SHA1 Message Date
Adhemerval Zanella
721314c980 x86_64: Remove avx512 strstr implementation
As indicated in a recent thread, this is a simple brute-force
algorithm that checks the whole needle at a matching character pair
(and does so 1 byte at a time after the first 64 bytes of the needle).
It also never skips ahead, so it can end up re-checking the whole
needle at every haystack position, which the generic implementation
avoids.
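
To make the cost concrete, here is a scalar sketch of the behavior
described above (a hedged illustration, not the removed assembly): the
search never skips ahead, so for a repetitive needle and haystack the
whole needle is re-checked at nearly every position, which is the
quadratic worst case the generic two-way implementation avoids.

#include <string.h>

/* Hedged sketch of the brute-force behavior, not the removed code.  */
static const char *
strstr_bruteforce_sketch (const char *hay, const char *ned)
{
  size_t nlen = strlen (ned);
  if (nlen == 0)
    return hay;
  for (const char *p = hay; *p != '\0'; p++)    /* never skips ahead */
    if (strncmp (p, ned, nlen) == 0)            /* re-checks the whole needle */
      return p;
  return NULL;
}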

As indicated by Wilco, a 4x larger needle and 16x larger haystack gives
a clear 65x slowdown for both basic_strstr and __strstr_avx512:

  "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512",
"__strstr_sse2_unaligned", "__strstr_generic"],

    {
     "len_haystack": 65536,
     "len_needle": 1024,
     "align_haystack": 0,
     "align_needle": 0,
     "fail": 1,
     "desc": "Difficult bruteforce needle",
     "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2]
    },
    {
     "len_haystack": 1048576,
     "len_needle": 4096,
     "align_haystack": 0,
     "align_needle": 0,
     "fail": 1,
     "desc": "Difficult bruteforce needle",
     "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9]
    }

PS: I don't have an AVX512-capable machine to verify this issue, but
    skimming through the code it does seem to follow what Wilco has
    described.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2024-03-27 13:48:16 -03:00
H.J. Lu
ef7f4b1fef Apply the Makefile sorting fix
Apply the Makefile sorting fix generated by sort-makefile-lines.py.
2024-02-15 11:19:56 -08:00
Noah Goldstein
52cf11004e x86: Add avx2 optimized functions for the wchar_t strcpy family
Implemented:
    wcscat-avx2  (+ 744 bytes)
    wcscpy-avx2  (+ 539 bytes)
    wcpcpy-avx2  (+ 577 bytes)
    wcsncpy-avx2 (+1108 bytes)
    wcpncpy-avx2 (+1214 bytes)
    wcsncat-avx2 (+1085 bytes)

Performance Changes:
    Times are from N = 10 runs of the benchmark suite and are reported
    as the geometric mean of all ratios of New Implementation / Best Old
    Implementation. The Best Old Implementation was the existing version
    at the highest supported ISA level.

    wcscat-avx2     -> 0.975
    wcscpy-avx2     -> 0.591
    wcpcpy-avx2     -> 0.698
    wcsncpy-avx2    -> 0.730
    wcpncpy-avx2    -> 0.711
    wcsncat-avx2    -> 0.954

Code Size Changes:
    This change increases the size of libc.so by ~5.5kb. For
    reference, the patch optimizing the normal strcpy family functions
    decreases libc.so by ~5.2kb.

Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
2022-11-08 19:22:33 -08:00
Noah Goldstein
64b8b6516b x86: Add evex optimized functions for the wchar_t strcpy family
Implemented:
    wcscat-evex  (+ 905 bytes)
    wcscpy-evex  (+ 674 bytes)
    wcpcpy-evex  (+ 709 bytes)
    wcsncpy-evex (+1358 bytes)
    wcpncpy-evex (+1467 bytes)
    wcsncat-evex (+1213 bytes)

Performance Changes:
    Times are from N = 10 runs of the benchmark suite and are reported
    as the geometric mean of all ratios of New Implementation / Best Old
    Implementation. The Best Old Implementation was the existing version
    at the highest supported ISA level.

    wcscat-evex     -> 0.991
    wcscpy-evex     -> 0.587
    wcpcpy-evex     -> 0.695
    wcsncpy-evex    -> 0.719
    wcpncpy-evex    -> 0.694
    wcsncat-evex    -> 0.979

Code Size Changes:
    This change increases the size of libc.so by ~6.3kb. For
    reference, the patch optimizing the normal strcpy family functions
    decreases libc.so by ~5.7kb.

Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.
2022-11-08 19:22:33 -08:00
Sunil K Pandey
faaf733f49 x86_64: Implement evex512 version of strrchr and wcsrchr
Changes from v1:
  Use the vec API for registers.
  Replace VPCMP with VPCMPEQ.
  Restructure and remove 1 unconditional jump.
  Change the page-cross logic to use sall.

This patch implements the following evex512 versions of string functions.
The evex512 version takes up to 30% fewer cycles compared to evex,
depending on length and alignment.

- strrchr function using 512 bit vectors.
- wcsrchr function using 512 bit vectors.

Code size data:

strrchr-evex.o		879 byte
strrchr-evex512.o	601 byte (-32%)

wcsrchr-evex.o		882 byte
wcsrchr-evex512.o	572 byte (-35%)

Placeholder function, not used by any processor at the moment.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-11-03 15:51:52 -07:00
Sunil K Pandey
59e501f204 x86_64: Implement evex512 version of strchrnul, strchr and wcschr
This patch implements the following evex512 versions of string functions.
The evex512 version takes up to 30% fewer cycles compared to evex,
depending on length and alignment.

- strchrnul function using 512 bit vectors.
- strchr function using 512 bit vectors.
- wcschr function using 512 bit vectors.

Code size data:

strchrnul-evex.o	599 byte
strchrnul-evex512.o	569 byte (-5%)

strchr-evex.o		639 byte
strchr-evex512.o	595 byte (-7%)

wcschr-evex.o		644 byte
wcschr-evex512.o	607 byte (-6%)

Placeholder function, not used by any processor at the moment.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-10-25 22:39:35 -07:00
Sunil K Pandey
451c6e5854 x86_64: Implement evex512 version of memchr, rawmemchr and wmemchr
This patch implements the following evex512 versions of string functions.
The evex512 version takes up to 30% fewer cycles compared to evex,
depending on length and alignment.

- memchr function using 512 bit vectors.
- rawmemchr function using 512 bit vectors.
- wmemchr function using 512 bit vectors.

Code size data:

memchr-evex.o		762 byte
memchr-evex512.o	576 byte (-24%)

rawmemchr-evex.o	461 byte
rawmemchr-evex512.o	412 byte (-11%)

wmemchr-evex.o		794 byte
wmemchr-evex512.o	552 byte (-30%)

Placeholder function, not used by any processor at the moment.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-10-18 13:26:33 -07:00
Noah Goldstein
192979ee35 x86: Add support to build wcscpy with explicit ISA level
1. Add ISA level build guards to different implementations.
    - wcscpy-ssse3.S is used as ISA level 2/3/4.
    - wcscpy-generic.c is only used at ISA level 1 and will
      only build if compiled with ISA level == 1. Otherwise
      there is no reason to include it as we will always use
      wcscpy-ssse3.S.

2. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guaranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-16 03:07:59 -07:00
Noah Goldstein
ceabdcd130 x86: Add support to build strcmp/strlen/strchr with explicit ISA level
1. Add default ISA level selection in non-multiarch/rtld
   implementations.

2. Add ISA level build guards to different implementations.
    - E.g., strcmp-avx2.S, which is ISA level 3, will only build if
      the compiled ISA level is <= 3. Otherwise there is no reason to
      include it as we will always use one of the ISA level 4
      implementations (strcmp-evex.S); see the sketch after this list.

3. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guaranteed replacement) to be skipped.
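
A minimal sketch of the build-guard idea, using a hypothetical
COMPILED_ISA_LEVEL macro rather than glibc's real isa-level headers: when
the configured minimum ISA level already guarantees that a better version
will be selected, the lower-level implementation file compiles to an
empty object.

/* Hypothetical stand-in for the build's minimum ISA level (the real
   value comes from the glibc configure flags).  */
#define COMPILED_ISA_LEVEL 3

#if COMPILED_ISA_LEVEL <= 3
/* The AVX2 (ISA level 3) body is only emitted when an ISA level 4
   version is not already guaranteed to win at run time.  */
int strcmp_avx2_would_be_built = 1;
#endif

#if COMPILED_ISA_LEVEL == 1
/* A generic fallback is only needed when nothing better is guaranteed.  */
int strcmp_generic_would_be_built = 1;
#endif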

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-16 03:07:59 -07:00
Noah Goldstein
9c38deec96 x86: Remove generic strncat, strncpy, and stpncpy implementations
These functions all have optimized versions:
__strncat_sse2_unaligned, __strncpy_sse2_unaligned, and
__stpncpy_sse2_unaligned, which are faster than their respective generic
implementations.  Since the sse2 versions can run on baseline x86_64,
we should use these as the baseline implementation and can remove the
generic implementations.

Geometric mean of N=20 runs of the entire benchmark suite on:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake)

__strncat_sse2_unaligned / __strncat_generic: .944
__strncpy_sse2_unaligned / __strncpy_generic: .726
__stpncpy_sse2_unaligned / __stpncpy_generic: .650

Tested build with and without multiarch and full check with multiarch.
2022-07-12 11:44:12 -07:00
H.J. Lu
ec9013727d x86-64: Remove redundant strcspn-generic/strpbrk-generic/strspn-generic
Remove redundant strcspn-generic, strpbrk-generic and strspn-generic
from sysdep_routines in sysdeps/x86_64/multiarch/Makefile added by

commit c69f960b01
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jul 3 21:28:07 2022 -0700

    x86: Add support for building str{c|p}{brk|spn} with explicit ISA level

since they have been added to sysdep_routines in sysdeps/x86_64/Makefile.
2022-07-08 16:06:04 -07:00
Noah Goldstein
21925f6473 x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file
The primary memmove_{impl}_unaligned_erms implementations don't
interact with this function. Putting them in the same file both
wastes space and unnecessarily bloats a hot code section.
2022-06-29 19:47:52 -07:00
Noah Goldstein
4a3f29e7e4 x86: Move and slightly improve memset_erms
Implementation wise:
    1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
       use the L(stosb) label that was previously defined.

    2. Don't give the hot path (the fall-through) to the zero-size case.

Code positioning wise:

Move memset_{chk}_erms to its own file.  Leaving it in between the
memset_{impl}_unaligned implementations both adds unnecessary complexity
file and wastes space in a relatively hot cache section.
2022-06-29 19:47:52 -07:00
Noah Goldstein
c22eb807b0 x86: Rename generic functions with unique postfix for clarity
No functions are changed. It just renames generic implementations from
'{func}_sse2' to '{func}_generic'. This is just because the postfix
"_sse2" was overloaded and was used for files that had hand-optimized
sse2 assembly implementations and files that just redirected back
to the generic implementation.

Full xcheck passed on x86_64.
2022-06-16 20:17:45 -07:00
Raghuveer Devulapalli
5082a287d5 x86_64: Add strstr function with 512-bit EVEX
Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

(1) We spend a few cycles at the beginning to peek into the needle. We
locate an edge in the needle (the first occurrence of 2 consecutive
distinct characters) and also store the first 64 bytes into a zmm register.

(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary, which can cause a seg fault.

(3) If an edge is found in the haystack, we first compare the first
64 bytes of the needle (already stored in a zmm register) before
proceeding with a full string compare performed byte by byte.
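
As a rough scalar illustration of steps (1)-(3) (a hedged sketch of the
description above, not the vectorized code; the real routine does the
pair search and the 64-byte prefix check with 512-bit compares, one
haystack cache line at a time):

#include <stddef.h>
#include <string.h>

static const char *
strstr_edge_sketch (const char *hay, const char *ned)
{
  size_t nlen = strlen (ned);
  if (nlen == 0)
    return hay;
  if (nlen == 1)
    return strchr (hay, ned[0]);

  /* (1) Find the "edge": the first pair of consecutive distinct chars.  */
  size_t e = 0;
  while (ned[e + 1] != '\0' && ned[e] == ned[e + 1])
    e++;
  if (ned[e + 1] == '\0')
    {
      /* Degenerate needle (one repeated character): no edge exists, so
         fall back to checking every occurrence of that character.  */
      for (const char *p = hay; (p = strchr (p, ned[0])) != NULL; p++)
        if (strncmp (p, ned, nlen) == 0)
          return p;
      return NULL;
    }

  /* (2) Scan the haystack for the edge pair...  */
  for (const char *p = hay; (p = strchr (p, ned[e])) != NULL; p++)
    {
      if ((size_t) (p - hay) < e || p[1] != ned[e + 1])
        continue;
      /* (3) ...and only then run the full needle comparison.  */
      if (strncmp (p - e, ned, nlen) == 0)
        return p - e;
    }
  return NULL;
}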

Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

Geometric mean of all benchmarks: new / old =  0.66

Difficult skiptable(0) : new / old =  0.02
Difficult skiptable(1) : new / old =  0.01
Difficult 2-way : new / old =  0.25
Difficult testing first 2 : new / old =  1.26
Difficult skiptable(0) : new / old =  0.05
Difficult skiptable(1) : new / old =  0.06
Difficult 2-way : new / old =  0.26
Difficult testing first 2 : new / old =  1.05
Difficult skiptable(0) : new / old =  0.42
Difficult skiptable(1) : new / old =  0.24
Difficult 2-way : new / old =  0.21
Difficult testing first 2 : new / old =  1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-06-06 19:46:55 -07:00
Sunil K Pandey
9c66efb86f x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen
This patch implements the following evex512 versions of string functions.
The perf gain for the evex512 version is up to 50% compared to evex,
depending on length and alignment.

Placeholder function, not used by any processor at the moment.

- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-05-26 13:11:36 -07:00
Adhemerval Zanella
9403b71ae9 x86_64: Remove bzero optimization
Both bzero and bcopy are marked as legacy in POSIX.1-2001 and were
removed in POSIX.1-2008, although the prototypes are still defined for
_GNU_SOURCE or _DEFAULT_SOURCE.

GCC also replaces bcopy with memmove and bzero with memset in the default
configuration (to actually get a bzero libc call, the code needs to omit
the string.h inclusion and be built with -fno-builtin), so it is highly
unlikely that programs are actually calling the libc bzero symbol.
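
For illustration (a hedged example of the point above, not glibc code),
the two calls below are equivalent, and with default options GCC lowers
both to a memset, so the libc bzero entry point is effectively
unreachable from ordinary builds:

#include <string.h>
#include <strings.h>

void
clear_buffer (char *buf, size_t len)
{
  bzero (buf, len);       /* legacy interface (POSIX.1-2001, removed in 2008) */
  memset (buf, 0, len);   /* what compilers and new code use instead */
}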

On a recent Linux distro (Ubuntu 22.04), there are no bzero calls
from the installed binaries.

  $ cat count_bstring.sh
  #!/bin/bash

  files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
  total=0
  for file in $files; do
    symbols=`objdump -R "$file" 2>&1`
    if [ $? -eq 0 ]; then
      ncalls=`echo "$symbols" | grep -w "$1" | wc -l`
      ((total=total+ncalls))
      if [ $ncalls -gt 0 ]; then
        echo "$file: $ncalls"
      fi
    fi
  done
  echo "TOTAL=$total"
  $ ./count_bstring.sh bzero
  TOTAL=0

Checked on x86_64-linux-gnu.
2022-05-16 09:36:06 -03:00
Noah Goldstein
7cbc03d030 x86: Remove memcmp-sse4.S
The code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d717
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions from preferring SSE2: Size = 1 and
Size = 65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot, based on
profiles of GCC11 and Python3, than sizes [4, 8] (which are made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in the range [65, 96] are now unrolled 2x
and B) smaller values <= 16 are now given a hotter path. By
contrast, the SSE4 version has a branch for Size = 80. The unrolled
version gets better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
 128,      4,      8,   0,             0.858
 128,      4,      8,   1,             0.879
 128,      4,      8,  -1,             0.888

As well, outside of microbenchmark environments, which are not fully
predictable, the branch will have a real cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:42 -05:00
Noah Goldstein
8804157ad9 x86: Optimize memcmp SSE2 in memcmp.S
The new code saves size (-303 bytes) and has significantly better
performance.

geometric_mean(N=20) of page cross cases New / Original: 0.634
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-15 13:08:35 -05:00
Noah Goldstein
26b2478322 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3
The goal is to remove most SSSE3 functions, as SSE4, AVX2, and EVEX are
generally preferable. memcpy/memmove is one exception where avoiding
unaligned loads with `palignr` is important for some targets.

This commit replaces memmove-ssse3 with a better optimized, lower
code footprint version. It also aliases memcpy to memmove.

Aside from this function, all other SSSE3 functions should be safe to
remove.

The performance is not changed drastically, although it shows overall
improvements without any major regressions or gains.

bench-memcpy geometric_mean(N=50) New / Original: 0.957

bench-memcpy-random geometric_mean(N=50) New / Original: 0.912

bench-memcpy-large geometric_mean(N=50) New / Original: 0.892

Benchmarks were run on a Zhaoxin KX-6840 @ 2000MHz. See the attached
numbers for all results.

More importantly, this saves 7246 bytes of code size in memmove and an
additional 10741 bytes by reusing the memmove code for memcpy (17987
bytes saved in total), as well as an additional 896 bytes of rodata for
the jump table entries.
2022-04-14 23:21:42 -05:00
Noah Goldstein
d85916e30a x86: Remove mem{move|cpy}-ssse3-back
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
41bfe224e5 x86: Remove str{p}{n}cpy-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:42 -05:00
Noah Goldstein
e084ccd37e x86: Remove str{n}cat-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c024a0b9cf x86: Remove str{n}{case}cmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
c725f408db x86: Remove {w}memcmp-ssse3
With SSE2, SSE4.1, AVX2, and EVEX versions very few targets prefer
SSSE3. As a result it is no longer worth it to keep the SSSE3
versions given the code size cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-04-14 23:21:41 -05:00
Noah Goldstein
305769b2a1 x86: Remove AVX str{n}casecmp
The rationale is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. The AVX2 version covers the majority of targets that previously
   preferred it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all, saving the code size is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:51 -05:00
Noah Goldstein
84e7c46df4 x86: Add EVEX optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:50 -05:00
Noah Goldstein
bbf8122234 x86: Add AVX2 optimized str{n}casecmp
geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2022-03-25 13:16:43 -05:00
H.J. Lu
3d9f171bfb x86-64: Optimize bzero
memset with zero as the value to set is by far the majority case (99%+
for Python3 and GCC).

bzero can be slightly more optimized for this case by using a zero-idiom
xor for broadcasting the set value to a register (vector or GPR).
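
A hedged intrinsics illustration of that difference (not the glibc code;
build with -mavx2): a general memset has to broadcast its byte value into
a vector register, while a zero fill can use the zero idiom, which
compilers emit as a single dependency-breaking vpxor (or equivalent).

#include <immintrin.h>
#include <stdio.h>

static void
fill32 (unsigned char *p, int c)
{
  __m256i v = _mm256_set1_epi8 ((char) c);   /* broadcast of c */
  _mm256_storeu_si256 ((__m256i *) p, v);
}

static void
zero32 (unsigned char *p)
{
  __m256i v = _mm256_setzero_si256 ();       /* zero idiom: vpxor */
  _mm256_storeu_si256 ((__m256i *) p, v);
}

int
main (void)
{
  unsigned char buf[32];
  fill32 (buf, 0xab);
  zero32 (buf);
  printf ("%d\n", buf[0]);   /* prints 0 */
  return 0;
}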

Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-02-08 15:58:56 -08:00
H.J. Lu
c328d0152d x86_64/multiarch: Sort sysdep_routines and put one entry per line
2022-02-05 16:42:17 -08:00
Noah Goldstein
cf4fd28ea4 x86_64: Add support for __memcmpeq using sse2, avx2, and evex
No bug. This commit adds support for __memcmpeq to be implemented
separately from memcmp. Support is added for versions optimized with
sse2, avx2, and evex.
2021-10-27 13:03:46 -05:00
Noah Goldstein
6f573a27b6 x86-64: Add wcslen optimize for sse4.1
No bug. This commit adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-06-23 14:12:36 -04:00
Noah Goldstein
104c7b1967 x86: Add EVEX optimized memchr family not safe for RTM
No bug.

This commit adds a new implementation for EVEX memchr that is not safe
for RTM because it uses vzeroupper. The benefit is that by using
ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop, which is
faster than the RTM-safe version, which cannot use vpcmpeq because
there is no EVEX encoding for the instruction. All parts of the
implementation aside from the 4x loop are the same for the two
versions, and the optimization is only relevant for large sizes.

Tigerlake:
size  , algn  , Pos   , Cur T , New T , Win     , Dif
512   , 6     , 192   , 9.2   , 9.04  , no-RTM  , 0.16
512   , 7     , 224   , 9.19  , 8.98  , no-RTM  , 0.21
2048  , 0     , 256   , 10.74 , 10.54 , no-RTM  , 0.2
2048  , 0     , 512   , 14.81 , 14.87 , RTM     , 0.06
2048  , 0     , 1024  , 22.97 , 22.57 , no-RTM  , 0.4
2048  , 0     , 2048  , 37.49 , 34.51 , no-RTM  , 2.98   <--

Icelake:
size  , algn  , Pos   , Cur T , New T , Win     , Dif
512   , 6     , 192   , 7.6   , 7.3   , no-RTM  , 0.3
512   , 7     , 224   , 7.63  , 7.27  , no-RTM  , 0.36
2048  , 0     , 256   , 8.48  , 8.38  , no-RTM  , 0.1
2048  , 0     , 512   , 11.57 , 11.42 , no-RTM  , 0.15
2048  , 0     , 1024  , 17.92 , 17.38 , no-RTM  , 0.54
2048  , 0     , 2048  , 30.37 , 27.34 , no-RTM  , 3.03   <--

test-memchr, test-wmemchr, and test-rawmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2021-05-08 16:26:30 -04:00
H.J. Lu
7ebba91361 x86-64: Add AVX optimized string/memory functions for RTM
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with

	xtest
	jz	1f
	vzeroall
	ret
1:
	vzeroupper
	ret

at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
2021-03-29 07:40:17 -07:00
H.J. Lu
91264fe357 x86-64: Add memcmp family functions with 256-bit EVEX
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
1b968b6b9b x86-64: Add memset family functions with 256-bit EVEX
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
63ad43566f x86-64: Add memmove family functions with 256-bit EVEX
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
525bc2a32c x86-64: Add strcpy family functions with 256-bit EVEX
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
2021-03-29 07:40:17 -07:00
H.J. Lu
1fd8c163a8 x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.

For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
2021-03-29 07:40:17 -07:00
H.J. Lu
9620398097 x86: Install <sys/platform/x86.h> [BZ #26124]
Install <sys/platform/x86.h> so that programmers can do

 #if __has_include(<sys/platform/x86.h>)
 #include <sys/platform/x86.h>
 #endif
 ...

   if (CPU_FEATURE_USABLE (SSE2))
 ...
   if (CPU_FEATURE_USABLE (AVX2))
 ...

<sys/platform/x86.h> exports only:

enum
{
  COMMON_CPUID_INDEX_1 = 0,
  COMMON_CPUID_INDEX_7,
  COMMON_CPUID_INDEX_80000001,
  COMMON_CPUID_INDEX_D_ECX_1,
  COMMON_CPUID_INDEX_80000007,
  COMMON_CPUID_INDEX_80000008,
  COMMON_CPUID_INDEX_7_ECX_1,
  /* Keep the following line at the end.  */
  COMMON_CPUID_INDEX_MAX
};

struct cpuid_features
{
  struct cpuid_registers cpuid;
  struct cpuid_registers usable;
};

struct cpu_features
{
  struct cpu_features_basic basic;
  struct cpuid_features features[COMMON_CPUID_INDEX_MAX];
};

/* Get a pointer to the CPU features structure.  */
extern const struct cpu_features *__x86_get_cpu_features
  (unsigned int max) __attribute__ ((const));

Since all feature checks are done through macros, programs compiled with
a newer <sys/platform/x86.h> are compatible with the older glibc binaries
as long as the layout of struct cpu_features is identical.  The features
array can be expanded with backward binary compatibility for both .o and
.so files.  When COMMON_CPUID_INDEX_MAX is increased to support new
processor features, __x86_get_cpu_features in the older glibc binaries
returns NULL and HAS_CPU_FEATURE/CPU_FEATURE_USABLE return false for the
new processor feature.  No new symbol version is needed.

Both CPU_FEATURE_USABLE and HAS_CPU_FEATURE are provided.  HAS_CPU_FEATURE
can be used to identify processor features.

Note: Although GCC has __builtin_cpu_supports, it only supports a subset
of <sys/platform/x86.h> and it is equivalent to CPU_FEATURE_USABLE.  It
doesn't support HAS_CPU_FEATURE.
2020-09-11 17:20:52 -07:00
Leonardo Sandoval
1a153e47fc x86-64: Optimize strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2
Optimize x86-64 strcat/strncat, strcpy/strncpy and stpcpy/stpncpy with AVX2.
It uses vector comparison as much as possible. In general, the larger the
source string, the greater performance gain observed, reaching speedups of
1.6x compared to SSE2 unaligned routines. Select AVX2 strcat/strncat,
strcpy/strncpy and stpcpy/stpncpy on AVX2 machines where vzeroupper is
preferred and AVX unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-avx2, strncat-avx2, strcpy-avx2, strncpy-avx2,
	stpcpy-avx2 and stpncpy-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c:
	(__libc_ifunc_impl_list): Add tests for __strcat_avx2,
	__strncat_avx2, __strcpy_avx2, __strncpy_avx2, __stpcpy_avx2
	and __stpncpy_avx2.
	* sysdeps/x86_64/multiarch/{ifunc-unaligned-ssse3.h =>
	ifunc-strcpy.h}: rename header for a more generic name.
	* sysdeps/x86_64/multiarch/ifunc-strcpy.h:
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
	AVX unaligned load is fast and vzeroupper is preferred.
	* sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
	* sysdeps/x86_64/multiarch/stpncpy-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strcat-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strncat-avx2.S: Likewise
	* sysdeps/x86_64/multiarch/strncpy-avx2.S: Likewise
2019-01-14 09:43:38 -06:00
Leonardo Sandoval
1457016337 x86-64: Optimize strcmp/wcscmp and strncmp/wcsncmp with AVX2
Optimize x86-64 strcmp/wcscmp and strncmp/wcsncmp with AVX2. It uses vector
comparison as much as possible. Peak performance observed on a SkyLake
machine: 9x, 3x, 2.5x and 5.5x for strcmp, strncmp, wcscmp and wcsncmp,
respectively. The larger the comparison length, the more benefit from using
the avx2 functions, except for strcmp, where the peak is observed at length
== 32 bytes. Select AVX2 strcmp/wcscmp on AVX2 machines where vzeroupper
is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if the machine doesn't support TZCNT.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-avx2, strncmp-avx2, wcscmp-avx2, wcscmp-sse2, wcsncmp-avx2 and
	wcsncmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strcmp_avx2,
	__strncmp_avx2,	__wcscmp_avx2, __wcsncmp_avx2, __wcscmp_sse2
	and __wcsncmp_sse2.
	* sysdeps/x86_64/multiarch/strcmp.c (OPTIMIZE (avx2)):
	(IFUNC_SELECTOR): Return OPTIMIZE (avx2) on AVX 2 machines if
	AVX unaligned load is fast and vzeroupper is preferred.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strncmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcscmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp-sse2.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsncmp.c: Likewise.
	* sysdeps/x86_64/wcscmp.S (__wcscmp): Add alias only if __wcscmp
	is undefined.
2018-06-01 16:32:43 -05:00
H.J. Lu
11ffcacb64 x86-64: Implement strcmp family IFUNC selectors in C
Implement strcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for strcmp family functions within libc.
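
For readers unfamiliar with the mechanism, here is a minimal, hedged
sketch of a GNU IFUNC selector written in C (GCC on ELF; hypothetical
names, with __builtin_cpu_supports standing in for glibc's internal CPU
feature checks): the resolver runs once at relocation time and returns
the implementation the symbol will bind to.

#include <stdio.h>
#include <string.h>

static int
my_strcmp_sse2 (const char *a, const char *b)
{
  return strcmp (a, b);      /* stand-in for the SSE2 body */
}

static int
my_strcmp_avx2 (const char *a, const char *b)
{
  return strcmp (a, b);      /* stand-in for the AVX2 body */
}

static int (*resolve_my_strcmp (void)) (const char *, const char *)
{
  __builtin_cpu_init ();     /* required before CPU checks in a resolver */
  return __builtin_cpu_supports ("avx2") ? my_strcmp_avx2 : my_strcmp_sse2;
}

int my_strcmp (const char *, const char *)
  __attribute__ ((ifunc ("resolve_my_strcmp")));

int
main (void)
{
  printf ("%d\n", my_strcmp ("abc", "abc"));   /* prints 0 */
  return 0;
}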

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcmp-sse2, strcmp-sse4_2, strncmp-sse2, strncmp-sse4_2,
	strcasecmp_l-sse2, strcasecmp_l-sse4_2, strcasecmp_l-avx,
	strncase_l-sse2, strncase_l-sse4_2 and strncase_l-avx.
	* sysdeps/x86_64/multiarch/ifunc-strcasecmp.h: New file.
	* sysdeps/x86_64/multiarch/strcasecmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-avx.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase.c: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-avx.S : Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.c: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp-sse4_2.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.c: Likewise.
	* sysdeps/x86_64/multiarch/strcasecmp_l.S: Removed.
	* sysdeps/x86_64/multiarch/strcmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strncase_l.S: Likewise.
	* sysdeps/x86_64/multiarch/strncmp.S: Likewise.
	* sysdeps/x86_64/multiarch/strcmp-sse42.S: Include <sysdep.h>.
	(STRCMP_SSE42): New.  Defined to __strcmp_sse42 if not defined.
	[USE_AS_STRCASECMP_L || USE_AS_STRNCASECMP_L]: Include
	"locale-defines.h".
	(UPDATE_STRNCMP_COUNTER): New.
	(SECTION): Likewise.
	(GLABEL): Likewise.
	(LABEL): Likewise.
	* sysdeps/x86_64/multiarch/strncmp-ssse3.S: Rewrite and enable
	for libc.a.
2017-06-21 12:11:06 -07:00
H.J. Lu
70fe2eb794 x86-64: Implement strcspn/strpbrk/strspn IFUNC selectors in C
Implement strcspn/strpbrk/strspn IFUNC selectors in C

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for strcspn/strpbrk/strspn functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcspn-sse2, strpbrk-sse2 and strspn-sse2.
	* sysdeps/x86_64/strcspn.S (STRPBRK_P): Removed.
	Check USE_AS_STRPBRK instead of STRPBRK_P.
	* sysdeps/x86_64/strpbrk.S (USE_AS_STRPBRK): New.
	* sysdeps/x86_64/multiarch/ifunc-sse4_2.h: New file.
	* sysdeps/x86_64/multiarch/strcspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk.c: Likewise.
	* sysdeps/x86_64/multiarch/strspn-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.c: Likewise.
	* sysdeps/x86_64/multiarch/strcspn.S: Removed.
	* sysdeps/x86_64/multiarch/strpbrk.S: Likewise.
	* sysdeps/x86_64/multiarch/strspn.S: Likewise.
	* sysdeps/x86_64/multiarch/strpbrk-c.c: Remove "#ifdef SHARED"
	and "#endif".
2017-06-15 08:59:05 -07:00
H.J. Lu
9ed0aa15d3 x86-64: Implement strcat family IFUNC selectors in C
Implement strcat family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for strcat family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcat-sse2.
	* sysdeps/x86_64/multiarch/strcat-sse2.S: New file.
	* sysdeps/x86_64/multiarch/strcat.c: Likewise.
	* sysdeps/x86_64/multiarch/strncat.c: Likewise.
	* sysdeps/x86_64/multiarch/strcat.S: Removed.
	* sysdeps/x86_64/multiarch/strncat.S: Likewise.
2017-06-15 08:56:59 -07:00
H.J. Lu
b91a52d0d7 x86-64: Implement memcmp family IFUNC selectors in C
Implement memcmp family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for memcmp family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memcmp-sse2.
	* sysdeps/x86_64/multiarch/ifunc-memcmp.h: New file.
	* sysdeps/x86_64/multiarch/memcmp-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemcmp.c: Likewise.
	* sysdeps/x86_64/multiarch/memcmp.S: Removed.
	* sysdeps/x86_64/multiarch/wmemcmp.S: Likewise.
2017-06-15 08:49:57 -07:00
H.J. Lu
93e46f8773 x86-64: Implement memset family IFUNC selectors in C
Implement memset family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for memset functions within libc.

2017-06-07  H.J. Lu  <hongjiu.lu@intel.com>
	    Erich Elsen  <eriche@google.com>

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memset-sse2-unaligned-erms, and memset_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add test for __memset_chk_erms.
	Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memset.h: New file.
	* sysdeps/x86_64/multiarch/memset-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memset.c: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memset_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memset.S: Removed.
	* sysdeps/x86_64/multiarch/memset_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
	(__memset_chk_erms): New function.
2017-06-15 08:33:35 -07:00
H.J. Lu
5c3e322d3b x86-64: Implement memmove family IFUNC selectors in C
Implement memmove family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for memmove family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memmove-sse2-unaligned-erms, memcpy_chk-nonshared,
	mempcpy_chk-nonshared and memmove_chk-nonshared.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __memmove_chk_erms,
	__memcpy_chk_erms and __mempcpy_chk_erms.  Update comments.
	* sysdeps/x86_64/multiarch/ifunc-memmove.h: New file.
	* sysdeps/x86_64/multiarch/memcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove-sse2-unaligned-erms.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.c: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk-nonshared.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.c: Likewise.
	* sysdeps/x86_64/multiarch/memcpy.S: Removed.
	* sysdeps/x86_64/multiarch/memcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/mempcpy_chk.S: Likewise.
	* sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
	(__mempcpy_chk_erms): New function.
	(__memmove_chk_erms): Likewise.
	(__memcpy_chk_erms): New alias.
2017-06-14 12:11:10 -07:00
H.J. Lu
5a103908c0 x86-64: Implement strcpy family IFUNC selectors in C
Implement strcpy family IFUNC selectors in C.

All internal calls within libc.so can use IFUNC on x86-64 since unlike
x86, x86-64 supports PC-relative addressing to access the GOT entry so
that it can call via PLT without using an extra register.  For libc.a,
we can't use IFUNC for functions which are called before IFUNC has been
initialized.  Using IFUNC internally reduces the icache footprint since
libc.so and other code in the process use the same implementations.
This patch uses IFUNC for strcpy family functions within libc.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strcpy-sse2 and stpcpy-sse2.
	* sysdeps/x86_64/multiarch/ifunc-unaligned-ssse3.h: New file.
	* sysdeps/x86_64/multiarch/stpcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strcpy-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.c: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.c: Likewise.
	* sysdeps/x86_64/multiarch/stpcpy.S: Removed.
	* sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strcpy.S: Likewise.
	* sysdeps/x86_64/multiarch/strncpy.S: Likewise.
	* sysdeps/x86_64/multiarch/stpncpy-c.c (weak_alias): New.
	(libc_hidden_def): Always defined as empty.
	* sysdeps/x86_64/multiarch/strncpy-c.c (libc_hidden_builtin_def):
	Always defined as empty.
2017-06-12 09:06:09 -07:00
H.J. Lu
d2538b9156 x86-64: Optimize strrchr/wcsrchr with AVX2
Optimize strrchr/wcsrchr with AVX2 to check 32 bytes with vector
instructions.  It is as fast as the SSE2 version for small data sizes
and up to 1X faster for large data sizes on Haswell.  Select AVX2
version on AVX2 machines where vzeroupper is preferred and AVX
unaligned load is fast.

	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	strrchr-sse2, strrchr-avx2, wcsrchr-sse2 and wcsrchr-avx2.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Add tests for __strrchr_avx2,
	__strrchr_sse2, __wcsrchr_avx2 and __wcsrchr_sse2.
	* sysdeps/x86_64/multiarch/strrchr-avx2.S: New file.
	* sysdeps/x86_64/multiarch/strrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/strrchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wcsrchr.c: Likewise.
2017-06-09 05:45:52 -07:00