Commit Graph

2 Commits

Author SHA1 Message Date
Noah Goldstein
3edda6a0f0 x86: Add support for compiling {raw|w}memchr with high ISA level
1. Refactor files so that all implementations for in the multiarch
   directory.
    - Essentially moved sse2 {raw|w}memchr.S implementation to
      multiarch/{raw|w}memchr-sse2.S

    - The non-multiarch {raw|w}memchr.S file now only includes one of
      the implementations in the multiarch directory based on the
      compiled ISA level (only used for non-multiarch builds.
      Otherwise we go through the ifunc selector).

2. Add ISA level build guards to different implementations.
    - I.e memchr-avx2.S which is ISA level 3 will only build if
      compiled ISA level <= 3. Otherwise there is no reason to include
      it as we will always use one of the ISA level 4
      implementations (memchr-evex{-rtm}.S).

3. Add new multiarch/rtld-{raw}memchr.S that just include the
   non-multiarch {raw}memchr.S which will in turn select the best
   implementation based on the compiled ISA level.

4. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.
    - Guranteed replacement essentially means that for any ISA level
      build there must be a function that the baseline of the ISA
      supports. So for {raw|w}memchr.S since there is not ISA level 2
      function, the ISA level 2 build still includes the ISA level
      1 (sse2) function. Once we reach the ISA level 3 build, however,
      {raw|w}memchr-avx2{-rtm}.S will always be sufficient so the ISA
      level 1 implementation ({raw|w}memchr-sse2.S) will not be built.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-06-22 19:41:35 -07:00
H.J. Lu
2f5d20ac99 x86-64: Optimize memchr/rawmemchr/wmemchr with SSE2/AVX2
SSE2 memchr is extended to support wmemchr.  AVX2 memchr/rawmemchr/wmemchr
are added to search 32 bytes with a single vector compare instruction.
AVX2 memchr/rawmemchr/wmemchr are as fast as SSE2 memchr/rawmemchr/wmemchr
for small sizes and up to 1.5X faster for larger sizes on Haswell and
Skylake.  Select AVX2 memchr/rawmemchr/wmemchr on AVX2 machines where
vzeroupper is preferred and AVX unaligned load is fast.

NB: It uses TZCNT instead of BSF since TZCNT produces the same result
as BSF for non-zero input.  TZCNT is faster than BSF and is executed
as BSF if machine doesn't support TZCNT.

	* sysdeps/x86_64/memchr.S (MEMCHR): New.  Depending on if
	USE_AS_WMEMCHR is defined.
	(PCMPEQ): Likewise.
	(memchr): Renamed to ...
	(MEMCHR): This.  Support wmemchr if USE_AS_WMEMCHR is defined.
	Replace pcmpeqb with PCMPEQ.
	* sysdeps/x86_64/multiarch/Makefile (sysdep_routines): Add
	memchr-sse2, rawmemchr-sse2, memchr-avx2, rawmemchr-avx2,
	wmemchr-sse4_1, wmemchr-avx2 and wmemchr-c.
	* sysdeps/x86_64/multiarch/ifunc-avx2.h: New file.
	* sysdeps/x86_64/multiarch/memchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/memchr.c: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/rawmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-avx2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr-sse2.S: Likewise.
	* sysdeps/x86_64/multiarch/wmemchr.c: Likewise.
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memchr_avx2, __memchr_sse2,
	__rawmemchr_avx2, __rawmemchr_sse2, __wmemchr_avx2 and
	__wmemchr_sse2.
2017-06-09 05:13:31 -07:00