same 16->8 bit packing trick for SSE2/SSE4.1

It's funny how now that I'm on a machine that doesn't support AVX2,
it's suddenly important for me that pack() is optimized for SSE!

This is basically the same as this morning's AVX2 change, just without
any of the weird AVX2 pack ordering issues.  It replaces something like

    movdqa     2300(%rip), %xmm0
    pshufb     %xmm0, %xmm3
    pshufb     %xmm0, %xmm2
    punpcklqdq %xmm3, %xmm2
    (This is SSE4.1; the SSE2 version is worse.)

with

    psrlw    $8, %xmm3
    psrlw    $8, %xmm2
    packuswb %xmm3, %xmm2
    (SSE2 and SSE4.1 both.)

It's always nice not to have to load a shuffle mask from memory.
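
For reference, here's roughly the same idea written with SSE intrinsics
rather than as generated assembly.  This is just an illustrative sketch
(the helper name pack_hi_bytes is made up, and it assumes each 16-bit lane
carries its payload in the high byte, which is why the compiler emits the
psrlw $8 before the pack); the actual change in this CL is the pack()
branch in the diff below.

    // Illustrative only: shift each 16-bit lane's high byte down, then
    // pack two vectors of 16-bit lanes into one vector of 8-bit lanes.
    // Values are <= 255 after the shift, so packuswb's saturation never kicks in.
    #include <emmintrin.h>   // SSE2

    static inline __m128i pack_hi_bytes(__m128i a, __m128i b) {
        a = _mm_srli_epi16(a, 8);        // psrlw  $8, a
        b = _mm_srli_epi16(b, 8);        // psrlw  $8, b
        return _mm_packus_epi16(a, b);   // packuswb: a -> low 8 bytes, b -> high 8 bytes
    }

No shuffle mask, no memory load.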

Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622
Reviewed-on: https://skia-review.googlesource.com/30583
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
Commit:    29dbae9acf
Parent:    d0677bc44f
Author:    Mike Klein <mtklein@chromium.org>
Date:      2017-08-03 12:18:24 -04:00
Committer: Skia Commit-Bot

3 changed files with 777 additions and 1026 deletions

(Diffs for the other two changed files suppressed because they are too large.)

@@ -65,6 +65,11 @@ SI U8x4 pack(U16x4 v) {
     auto _02 = _mm256_permute2x128_si256(lo,hi, 0x20),
          _13 = _mm256_permute2x128_si256(lo,hi, 0x31);
     return _mm256_packus_epi16(_02, _13);
+#elif defined(__SSE2__)
+    static_assert(sizeof(v) == 32, "");
+    auto lo = unaligned_load<__m128i>((char*)&v + 0),
+         hi = unaligned_load<__m128i>((char*)&v + 16);
+    return _mm_packus_epi16(lo,hi);
 #else
     return __builtin_convertvector(v, U8x4);
 #endif