/*
 * Copyright 2015 Google Inc.
 *
 * Use of this source code is governed by a BSD-style license that can be
 * found in the LICENSE file.
 */

#ifndef SkNx_sse_DEFINED
#define SkNx_sse_DEFINED

#include "include/core/SkTypes.h"

#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
    #include <smmintrin.h>
#elif SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    #include <tmmintrin.h>
#else
    #include <emmintrin.h>
#endif
// This file may assume <= SSE2, but must check SK_CPU_SSE_LEVEL for anything more recent.
// If you do, make sure this is in a static inline function... anywhere else risks violating ODR.
namespace { // NOLINT(google-build-namespaces)
// Emulate _mm_floor_ps() with SSE2:
//   - roundtrip through integers via truncation
//   - subtract 1 if that's too big (possible for negative values).
// This restricts the domain of our inputs to a maximum somewhere around 2^31.
// Seems plenty big.
AI static __m128 emulate_mm_floor_ps(__m128 v) {
    __m128 roundtrip = _mm_cvtepi32_ps(_mm_cvttps_epi32(v));
    __m128 too_big = _mm_cmpgt_ps(roundtrip, v);
    return _mm_sub_ps(roundtrip, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
}
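// Worked example (added purely for illustration, not part of the original header):
// for an input of {1.7f, -1.7f, 2.0f, -0.5f} the truncating roundtrip gives
// {1, -1, 2, 0}; only the two negative lanes compare greater than their inputs,
// so 1 is subtracted from those lanes, yielding {1, -2, 2, -1} -- the same answer
// _mm_floor_ps() gives.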
template <>
class SkNx<2, float> {
public:
    AI SkNx(const __m128& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(float val) : fVec(_mm_set1_ps(val)) {}
    AI static SkNx Load(const void* ptr) {
        return _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)ptr));
    }
    AI SkNx(float a, float b) : fVec(_mm_setr_ps(a,b,0,0)) {}

    AI void store(void* ptr) const { _mm_storel_pi((__m64*)ptr, fVec); }

    AI static void Load2(const void* ptr, SkNx* x, SkNx* y) {
        const float* m = (const float*)ptr;
        *x = SkNx{m[0], m[2]};
        *y = SkNx{m[1], m[3]};
    }

    AI static void Store2(void* dst, const SkNx& a, const SkNx& b) {
        auto vals = _mm_unpacklo_ps(a.fVec, b.fVec);
        _mm_storeu_ps((float*)dst, vals);
    }

    AI static void Store3(void* dst, const SkNx& a, const SkNx& b, const SkNx& c) {
        auto lo = _mm_setr_ps(a[0], b[0], c[0], a[1]),
             hi = _mm_setr_ps(b[1], c[1], 0, 0);
        _mm_storeu_ps((float*)dst, lo);
        _mm_storel_pi(((__m64*)dst) + 2, hi);
    }

    AI static void Store4(void* dst, const SkNx& a, const SkNx& b, const SkNx& c, const SkNx& d) {
        auto lo = _mm_setr_ps(a[0], b[0], c[0], d[0]),
             hi = _mm_setr_ps(a[1], b[1], c[1], d[1]);
        _mm_storeu_ps((float*)dst, lo);
        _mm_storeu_ps(((float*)dst) + 4, hi);
    }
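    // Layout note (added): Load2 reads, and Store2/Store3/Store4 write, interleaved
    // points. For example, Store3(dst, a, b, c) above writes a[0] b[0] c[0] a[1] b[1] c[1]
    // -- two xyz triples -- as one full 128-bit store plus a 64-bit store for the tail.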
    AI SkNx operator - () const { return _mm_xor_ps(_mm_set1_ps(-0.0f), fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_ps(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_ps(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return _mm_mul_ps(fVec, o.fVec); }
    AI SkNx operator / (const SkNx& o) const { return _mm_div_ps(fVec, o.fVec); }

    AI SkNx operator == (const SkNx& o) const { return _mm_cmpeq_ps (fVec, o.fVec); }
    AI SkNx operator != (const SkNx& o) const { return _mm_cmpneq_ps(fVec, o.fVec); }
    AI SkNx operator < (const SkNx& o) const { return _mm_cmplt_ps (fVec, o.fVec); }
    AI SkNx operator > (const SkNx& o) const { return _mm_cmpgt_ps (fVec, o.fVec); }
    AI SkNx operator <= (const SkNx& o) const { return _mm_cmple_ps (fVec, o.fVec); }
    AI SkNx operator >= (const SkNx& o) const { return _mm_cmpge_ps (fVec, o.fVec); }

    AI static SkNx Min(const SkNx& l, const SkNx& r) { return _mm_min_ps(l.fVec, r.fVec); }
    AI static SkNx Max(const SkNx& l, const SkNx& r) { return _mm_max_ps(l.fVec, r.fVec); }

    AI SkNx abs() const { return _mm_andnot_ps(_mm_set1_ps(-0.0f), fVec); }
    AI SkNx floor() const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_floor_ps(fVec);
    #else
        return emulate_mm_floor_ps(fVec);
    #endif
    }

    AI SkNx sqrt() const { return _mm_sqrt_ps (fVec); }

    AI float operator[](int k) const {
        SkASSERT(0 <= k && k < 2);
        union { __m128 v; float fs[4]; } pun = {fVec};
        return pun.fs[k&1];
    }

    AI bool allTrue() const { return 0b11 == (_mm_movemask_ps(fVec) & 0b11); }
    AI bool anyTrue() const { return 0b00 != (_mm_movemask_ps(fVec) & 0b11); }
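    // Note: _mm_movemask_ps() reports all four lanes of the register, so the result
    // is masked with 0b11 to ignore the two upper lanes, which don't belong to an Sk2f.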
    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_ps(e.fVec, t.fVec, fVec);
    #else
        return _mm_or_ps(_mm_and_ps (fVec, t.fVec),
                         _mm_andnot_ps(fVec, e.fVec));
    #endif
    }

    __m128 fVec;
};
template <>
class SkNx<4, float> {
public:
    AI SkNx(const __m128& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(float val) : fVec( _mm_set1_ps(val) ) {}
    AI SkNx(float a, float b, float c, float d) : fVec(_mm_setr_ps(a,b,c,d)) {}

    AI static SkNx Load(const void* ptr) { return _mm_loadu_ps((const float*)ptr); }
    AI void store(void* ptr) const { _mm_storeu_ps((float*)ptr, fVec); }

    AI static void Load2(const void* ptr, SkNx* x, SkNx* y) {
        SkNx lo = SkNx::Load((const float*)ptr+0),
             hi = SkNx::Load((const float*)ptr+4);
        *x = SkNx{lo[0], lo[2], hi[0], hi[2]};
        *y = SkNx{lo[1], lo[3], hi[1], hi[3]};
    }

    AI static void Load4(const void* ptr, SkNx* r, SkNx* g, SkNx* b, SkNx* a) {
        __m128 v0 = _mm_loadu_ps(((float*)ptr) + 0),
               v1 = _mm_loadu_ps(((float*)ptr) + 4),
               v2 = _mm_loadu_ps(((float*)ptr) + 8),
               v3 = _mm_loadu_ps(((float*)ptr) + 12);
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        *r = v0;
        *g = v1;
        *b = v2;
        *a = v3;
    }
    AI static void Store4(void* dst, const SkNx& r, const SkNx& g, const SkNx& b, const SkNx& a) {
        __m128 v0 = r.fVec,
               v1 = g.fVec,
               v2 = b.fVec,
               v3 = a.fVec;
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        _mm_storeu_ps(((float*) dst) + 0, v0);
        _mm_storeu_ps(((float*) dst) + 4, v1);
        _mm_storeu_ps(((float*) dst) + 8, v2);
        _mm_storeu_ps(((float*) dst) + 12, v3);
    }
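    // Note: Load4/Store4 convert between interleaved RGBA pixels in memory
    // (array-of-structs) and one register per channel (struct-of-arrays) by running
    // a 4x4 transpose with _MM_TRANSPOSE4_PS.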
    AI SkNx operator - () const { return _mm_xor_ps(_mm_set1_ps(-0.0f), fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_ps(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_ps(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return _mm_mul_ps(fVec, o.fVec); }
    AI SkNx operator / (const SkNx& o) const { return _mm_div_ps(fVec, o.fVec); }

    AI SkNx operator == (const SkNx& o) const { return _mm_cmpeq_ps (fVec, o.fVec); }
    AI SkNx operator != (const SkNx& o) const { return _mm_cmpneq_ps(fVec, o.fVec); }
    AI SkNx operator < (const SkNx& o) const { return _mm_cmplt_ps (fVec, o.fVec); }
    AI SkNx operator > (const SkNx& o) const { return _mm_cmpgt_ps (fVec, o.fVec); }
    AI SkNx operator <= (const SkNx& o) const { return _mm_cmple_ps (fVec, o.fVec); }
    AI SkNx operator >= (const SkNx& o) const { return _mm_cmpge_ps (fVec, o.fVec); }

    AI static SkNx Min(const SkNx& l, const SkNx& r) { return _mm_min_ps(l.fVec, r.fVec); }
    AI static SkNx Max(const SkNx& l, const SkNx& r) { return _mm_max_ps(l.fVec, r.fVec); }

    AI SkNx abs() const { return _mm_andnot_ps(_mm_set1_ps(-0.0f), fVec); }
    AI SkNx floor() const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_floor_ps(fVec);
    #else
        return emulate_mm_floor_ps(fVec);
    #endif
    }

    AI SkNx sqrt() const { return _mm_sqrt_ps (fVec); }

    AI float operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128 v; float fs[4]; } pun = {fVec};
        return pun.fs[k&3];
    }

    AI float min() const {
        SkNx min = Min(*this, _mm_shuffle_ps(fVec, fVec, _MM_SHUFFLE(2,3,0,1)));
        min = Min(min, _mm_shuffle_ps(min.fVec, min.fVec, _MM_SHUFFLE(0,1,2,3)));
        return min[0];
    }

    AI float max() const {
        SkNx max = Max(*this, _mm_shuffle_ps(fVec, fVec, _MM_SHUFFLE(2,3,0,1)));
        max = Max(max, _mm_shuffle_ps(max.fVec, max.fVec, _MM_SHUFFLE(0,1,2,3)));
        return max[0];
    }
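    // Note: min()/max() reduce horizontally in two shuffle+compare steps -- first each
    // lane against its neighbor, then the two survivors against each other -- which
    // leaves the overall result in every lane; lane 0 is returned.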
    AI bool allTrue() const { return 0b1111 == _mm_movemask_ps(fVec); }
    AI bool anyTrue() const { return 0b0000 != _mm_movemask_ps(fVec); }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_ps(e.fVec, t.fVec, fVec);
    #else
        return _mm_or_ps(_mm_and_ps (fVec, t.fVec),
                         _mm_andnot_ps(fVec, e.fVec));
    #endif
    }
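    // Usage sketch (added, not from the original header): a comparison yields an
    // all-ones/all-zeros mask per lane, so something like (x < 0.0f).thenElse(-x, x)
    // evaluates both branches and then picks per lane, giving an element-wise absolute
    // value without any branching.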
    __m128 fVec;
};

AI static __m128i mullo32(__m128i a, __m128i b) {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
    return _mm_mullo_epi32(a, b);
#else
    __m128i mul20 = _mm_mul_epu32(a, b),
            mul31 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(mul20, _MM_SHUFFLE(0,0,2,0)),
                              _mm_shuffle_epi32(mul31, _MM_SHUFFLE(0,0,2,0)));
#endif
}
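// Note: the pre-SSE4.1 path multiplies the even lanes (0,2) and the odd lanes (1,3)
// as 64-bit products with _mm_mul_epu32, then re-packs the low 32 bits of each product
// back into lane order -- the same result _mm_mullo_epi32 produces directly.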
template <>
class SkNx<4, int32_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(int32_t val) : fVec(_mm_set1_epi32(val)) {}
    AI static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    AI SkNx(int32_t a, int32_t b, int32_t c, int32_t d) : fVec(_mm_setr_epi32(a,b,c,d)) {}

    AI void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi32(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi32(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return mullo32(fVec, o.fVec); }

    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    AI SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }
    AI SkNx operator ^ (const SkNx& o) const { return _mm_xor_si128(fVec, o.fVec); }

    AI SkNx operator << (int bits) const { return _mm_slli_epi32(fVec, bits); }
    AI SkNx operator >> (int bits) const { return _mm_srai_epi32(fVec, bits); }

    AI SkNx operator == (const SkNx& o) const { return _mm_cmpeq_epi32 (fVec, o.fVec); }
    AI SkNx operator < (const SkNx& o) const { return _mm_cmplt_epi32 (fVec, o.fVec); }
    AI SkNx operator > (const SkNx& o) const { return _mm_cmpgt_epi32 (fVec, o.fVec); }

    AI int32_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; int32_t is[4]; } pun = {fVec};
        return pun.is[k&3];
    }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_epi8(e.fVec, t.fVec, fVec);
    #else
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    #endif
    }

    AI SkNx abs() const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
        return _mm_abs_epi32(fVec);
    #else
        SkNx mask = (*this) >> 31;
        return (mask ^ (*this)) - mask;
    #endif
    }
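    // Note: the pre-SSSE3 abs() uses the two's-complement identity: mask is 0 for
    // non-negative lanes and all ones for negative lanes, so (mask ^ x) - mask leaves
    // non-negative lanes unchanged and computes (~x) + 1 == -x for negative ones.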
    AI static SkNx Min(const SkNx& x, const SkNx& y) {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_min_epi32(x.fVec, y.fVec);
    #else
        return (x < y).thenElse(x, y);
    #endif
    }

    AI static SkNx Max(const SkNx& x, const SkNx& y) {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_max_epi32(x.fVec, y.fVec);
    #else
        return (x > y).thenElse(x, y);
    #endif
    }

    __m128i fVec;
};

template <>
class SkNx<2, uint32_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint32_t val) : fVec(_mm_set1_epi32((int)val)) {}
    AI static SkNx Load(const void* ptr) { return _mm_loadl_epi64((const __m128i*)ptr); }
    AI SkNx(uint32_t a, uint32_t b) : fVec(_mm_setr_epi32((int)a,(int)b,0,0)) {}

    AI void store(void* ptr) const { _mm_storel_epi64((__m128i*)ptr, fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi32(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi32(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return mullo32(fVec, o.fVec); }

    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    AI SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }
    AI SkNx operator ^ (const SkNx& o) const { return _mm_xor_si128(fVec, o.fVec); }

    AI SkNx operator << (int bits) const { return _mm_slli_epi32(fVec, bits); }
    AI SkNx operator >> (int bits) const { return _mm_srli_epi32(fVec, bits); }

    AI SkNx operator == (const SkNx& o) const { return _mm_cmpeq_epi32 (fVec, o.fVec); }
    AI SkNx operator != (const SkNx& o) const { return (*this == o) ^ 0xffffffff; }
    // operator < and > take a little extra fiddling to make work for unsigned ints.

    AI uint32_t operator[](int k) const {
        SkASSERT(0 <= k && k < 2);
        union { __m128i v; uint32_t us[4]; } pun = {fVec};
        return pun.us[k&1];
    }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
    #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_epi8(e.fVec, t.fVec, fVec);
    #else
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    #endif
    }

    AI bool allTrue() const { return 0xff == (_mm_movemask_epi8(fVec) & 0xff); }
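    // Note: _mm_movemask_epi8() yields one bit per byte; masking with 0xff keeps just
    // the eight bytes that hold the two uint32_t lanes, ignoring the upper half of the
    // register.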
|
|
|
|
|
|
|
|
__m128i fVec;
|
|
|
|
};

template <>
class SkNx<4, uint32_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint32_t val) : fVec(_mm_set1_epi32((int)val)) {}
    AI static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    AI SkNx(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
        : fVec(_mm_setr_epi32((int)a,(int)b,(int)c,(int)d)) {}

    AI void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi32(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi32(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return mullo32(fVec, o.fVec); }

    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    AI SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }
    AI SkNx operator ^ (const SkNx& o) const { return _mm_xor_si128(fVec, o.fVec); }

    AI SkNx operator << (int bits) const { return _mm_slli_epi32(fVec, bits); }
    AI SkNx operator >> (int bits) const { return _mm_srli_epi32(fVec, bits); }

    AI SkNx operator == (const SkNx& o) const { return _mm_cmpeq_epi32 (fVec, o.fVec); }
    AI SkNx operator != (const SkNx& o) const { return (*this == o) ^ 0xffffffff; }
    // operator < and > take a little extra fiddling to make work for unsigned ints.
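    // A sketch of the usual workaround (illustration only, not part of the original API):
    // flip the sign bit of both operands so the signed compare orders them as unsigned,
    // just as Sk16b's operator < further below does with 0x80 per byte.
    //     __m128i flip = _mm_set1_epi32((int)0x80000000);
    //     return _mm_cmplt_epi32(_mm_xor_si128(flip, fVec), _mm_xor_si128(flip, o.fVec));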

    AI uint32_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; uint32_t us[4]; } pun = {fVec};
        return pun.us[k&3];
    }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_epi8(e.fVec, t.fVec, fVec);
#else
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
#endif
    }

    AI SkNx mulHi(SkNx m) const {
        SkNx v20{_mm_mul_epu32(m.fVec, fVec)};
        SkNx v31{_mm_mul_epu32(_mm_srli_si128(m.fVec, 4), _mm_srli_si128(fVec, 4))};

        return SkNx{v20[1], v31[1], v20[3], v31[3]};
    }
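    // mulHi returns the upper 32 bits of each 64-bit unsigned product. Quick check
    // (illustration only): a lane of 0x80000000 multiplied by 4 is 0x2'00000000, so the
    // corresponding result lane is 2. The two _mm_mul_epu32 calls cover the even and odd
    // lanes; the final shuffle gathers their high halves back into lane order.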

    __m128i fVec;
};

template <>
class SkNx<4, uint16_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint16_t val) : fVec(_mm_set1_epi16((short)val)) {}
    AI SkNx(uint16_t a, uint16_t b, uint16_t c, uint16_t d)
        : fVec(_mm_setr_epi16((short)a,(short)b,(short)c,(short)d,0,0,0,0)) {}

    AI static SkNx Load(const void* ptr) { return _mm_loadl_epi64((const __m128i*)ptr); }
    AI void store(void* ptr) const { _mm_storel_epi64((__m128i*)ptr, fVec); }

    AI static void Load4(const void* ptr, SkNx* r, SkNx* g, SkNx* b, SkNx* a) {
        __m128i lo = _mm_loadu_si128(((__m128i*)ptr) + 0),
                hi = _mm_loadu_si128(((__m128i*)ptr) + 1);
        __m128i even = _mm_unpacklo_epi16(lo, hi),   // r0 r2 g0 g2 b0 b2 a0 a2
                 odd = _mm_unpackhi_epi16(lo, hi);   // r1 r3 ...
        __m128i rg = _mm_unpacklo_epi16(even, odd),  // r0 r1 r2 r3 g0 g1 g2 g3
                ba = _mm_unpackhi_epi16(even, odd);  // b0 b1 ...  a0 a1 ...
        *r = rg;
        *g = _mm_srli_si128(rg, 8);
        *b = ba;
        *a = _mm_srli_si128(ba, 8);
    }
    AI static void Load3(const void* ptr, SkNx* r, SkNx* g, SkNx* b) {
        // The idea here is to get 4 vectors that are R G B _ _ _ _ _.
        // The second load is at a funny location to make sure we don't read past
        // the bounds of memory.  This is fine, we just need to shift it a little bit.
        const uint8_t* ptr8 = (const uint8_t*) ptr;
        __m128i rgb0 = _mm_loadu_si128((const __m128i*) (ptr8 + 0));
        __m128i rgb1 = _mm_srli_si128(rgb0, 3*2);
        __m128i rgb2 = _mm_srli_si128(_mm_loadu_si128((const __m128i*) (ptr8 + 4*2)), 2*2);
        __m128i rgb3 = _mm_srli_si128(rgb2, 3*2);

        __m128i rrggbb01 = _mm_unpacklo_epi16(rgb0, rgb1);
        __m128i rrggbb23 = _mm_unpacklo_epi16(rgb2, rgb3);
        *r = _mm_unpacklo_epi32(rrggbb01, rrggbb23);
        *g = _mm_srli_si128(r->fVec, 4*2);
        *b = _mm_unpackhi_epi32(rrggbb01, rrggbb23);
    }
    AI static void Store4(void* dst, const SkNx& r, const SkNx& g, const SkNx& b, const SkNx& a) {
        __m128i rg = _mm_unpacklo_epi16(r.fVec, g.fVec);
        __m128i ba = _mm_unpacklo_epi16(b.fVec, a.fVec);
        __m128i lo = _mm_unpacklo_epi32(rg, ba);
        __m128i hi = _mm_unpackhi_epi32(rg, ba);
        _mm_storeu_si128(((__m128i*) dst) + 0, lo);
        _mm_storeu_si128(((__m128i*) dst) + 1, hi);
    }
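    // Hypothetical usage sketch (names are illustrative, not from this file): de-interleave
    // four RGBA-16161616 pixels into planes, halve each channel, and re-interleave.
    //     Sk4h r,g,b,a;
    //     Sk4h::Load4(srcPixels, &r, &g, &b, &a);
    //     Sk4h::Store4(dstPixels, r >> 1, g >> 1, b >> 1, a >> 1);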

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi16(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi16(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return _mm_mullo_epi16(fVec, o.fVec); }
    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    AI SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }

    AI SkNx operator << (int bits) const { return _mm_slli_epi16(fVec, bits); }
    AI SkNx operator >> (int bits) const { return _mm_srli_epi16(fVec, bits); }

    AI uint16_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; uint16_t us[8]; } pun = {fVec};
        return pun.us[k&3];
    }

    __m128i fVec;
};

template <>
class SkNx<8, uint16_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint16_t val) : fVec(_mm_set1_epi16((short)val)) {}
    AI SkNx(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
            uint16_t e, uint16_t f, uint16_t g, uint16_t h)
        : fVec(_mm_setr_epi16((short)a,(short)b,(short)c,(short)d,
                              (short)e,(short)f,(short)g,(short)h)) {}

    AI static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    AI void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    AI static void Load4(const void* ptr, SkNx* r, SkNx* g, SkNx* b, SkNx* a) {
        __m128i _01 = _mm_loadu_si128(((__m128i*)ptr) + 0),
                _23 = _mm_loadu_si128(((__m128i*)ptr) + 1),
                _45 = _mm_loadu_si128(((__m128i*)ptr) + 2),
                _67 = _mm_loadu_si128(((__m128i*)ptr) + 3);

        __m128i _02 = _mm_unpacklo_epi16(_01, _23),  // r0 r2 g0 g2 b0 b2 a0 a2
                _13 = _mm_unpackhi_epi16(_01, _23),  // r1 r3 g1 g3 b1 b3 a1 a3
                _46 = _mm_unpacklo_epi16(_45, _67),
                _57 = _mm_unpackhi_epi16(_45, _67);

        __m128i rg0123 = _mm_unpacklo_epi16(_02, _13),  // r0 r1 r2 r3 g0 g1 g2 g3
                ba0123 = _mm_unpackhi_epi16(_02, _13),  // b0 b1 b2 b3 a0 a1 a2 a3
                rg4567 = _mm_unpacklo_epi16(_46, _57),
                ba4567 = _mm_unpackhi_epi16(_46, _57);

        *r = _mm_unpacklo_epi64(rg0123, rg4567);
        *g = _mm_unpackhi_epi64(rg0123, rg4567);
        *b = _mm_unpacklo_epi64(ba0123, ba4567);
        *a = _mm_unpackhi_epi64(ba0123, ba4567);
    }
    AI static void Load3(const void* ptr, SkNx* r, SkNx* g, SkNx* b) {
        const uint8_t* ptr8 = (const uint8_t*) ptr;
        __m128i rgb0 = _mm_loadu_si128((const __m128i*) (ptr8 + 0*2));
        __m128i rgb1 = _mm_srli_si128(rgb0, 3*2);
        __m128i rgb2 = _mm_loadu_si128((const __m128i*) (ptr8 + 6*2));
        __m128i rgb3 = _mm_srli_si128(rgb2, 3*2);
        __m128i rgb4 = _mm_loadu_si128((const __m128i*) (ptr8 + 12*2));
        __m128i rgb5 = _mm_srli_si128(rgb4, 3*2);
        __m128i rgb6 = _mm_srli_si128(_mm_loadu_si128((const __m128i*) (ptr8 + 16*2)), 2*2);
        __m128i rgb7 = _mm_srli_si128(rgb6, 3*2);

        __m128i rgb01 = _mm_unpacklo_epi16(rgb0, rgb1);
        __m128i rgb23 = _mm_unpacklo_epi16(rgb2, rgb3);
        __m128i rgb45 = _mm_unpacklo_epi16(rgb4, rgb5);
        __m128i rgb67 = _mm_unpacklo_epi16(rgb6, rgb7);

        __m128i rg03 = _mm_unpacklo_epi32(rgb01, rgb23);
        __m128i bx03 = _mm_unpackhi_epi32(rgb01, rgb23);
        __m128i rg47 = _mm_unpacklo_epi32(rgb45, rgb67);
        __m128i bx47 = _mm_unpackhi_epi32(rgb45, rgb67);

        *r = _mm_unpacklo_epi64(rg03, rg47);
        *g = _mm_unpackhi_epi64(rg03, rg47);
        *b = _mm_unpacklo_epi64(bx03, bx47);
    }
    AI static void Store4(void* ptr, const SkNx& r, const SkNx& g, const SkNx& b, const SkNx& a) {
        __m128i rg0123 = _mm_unpacklo_epi16(r.fVec, g.fVec),  // r0 g0 r1 g1 r2 g2 r3 g3
                rg4567 = _mm_unpackhi_epi16(r.fVec, g.fVec),  // r4 g4 r5 g5 r6 g6 r7 g7
                ba0123 = _mm_unpacklo_epi16(b.fVec, a.fVec),
                ba4567 = _mm_unpackhi_epi16(b.fVec, a.fVec);

        _mm_storeu_si128((__m128i*)ptr + 0, _mm_unpacklo_epi32(rg0123, ba0123));
        _mm_storeu_si128((__m128i*)ptr + 1, _mm_unpackhi_epi32(rg0123, ba0123));
        _mm_storeu_si128((__m128i*)ptr + 2, _mm_unpacklo_epi32(rg4567, ba4567));
        _mm_storeu_si128((__m128i*)ptr + 3, _mm_unpackhi_epi32(rg4567, ba4567));
    }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi16(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi16(fVec, o.fVec); }
    AI SkNx operator * (const SkNx& o) const { return _mm_mullo_epi16(fVec, o.fVec); }
    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    AI SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }

    AI SkNx operator << (int bits) const { return _mm_slli_epi16(fVec, bits); }
    AI SkNx operator >> (int bits) const { return _mm_srli_epi16(fVec, bits); }

    AI static SkNx Min(const SkNx& a, const SkNx& b) {
        // No unsigned _mm_min_epu16, so we'll shift into a space where we can use the
        // signed version, _mm_min_epi16, then shift back.
        const uint16_t top = 0x8000; // Keep this separate from _mm_set1_epi16 or MSVC will whine.
        const __m128i top_8x = _mm_set1_epi16((short)top);
        return _mm_add_epi8(top_8x, _mm_min_epi16(_mm_sub_epi8(a.fVec, top_8x),
                                                  _mm_sub_epi8(b.fVec, top_8x)));
    }
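    // Why the epi8 add/sub above is safe: the low byte of 0x8000 is zero, so the byte-wise
    // add/sub never carries or borrows across a 16-bit lane and behaves exactly like an
    // epi16 add/sub of 0x8000. Worked example (illustration only): a = 0x0001, b = 0xFFFF
    // bias to 0x8001 (-32767) and 0x7FFF (+32767); the signed min picks 0x8001; adding the
    // bias back gives 0x0001, the correct unsigned minimum.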

    AI SkNx mulHi(const SkNx& m) const {
        return _mm_mulhi_epu16(fVec, m.fVec);
    }
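    // mulHi keeps the high 16 bits of each 32-bit unsigned product, i.e. (x*m) >> 16 per lane.
    // Illustration only: a lane of 0x8000 times 0x0004 is 0x0002'0000, so the result lane is 2.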

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    }

    AI uint16_t operator[](int k) const {
        SkASSERT(0 <= k && k < 8);
        union { __m128i v; uint16_t us[8]; } pun = {fVec};
        return pun.us[k&7];
    }

    __m128i fVec;
};

template <>
class SkNx<4, uint8_t> {
public:
    AI SkNx() {}
    AI SkNx(const __m128i& vec) : fVec(vec) {}
    AI SkNx(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
        : fVec(_mm_setr_epi8((char)a,(char)b,(char)c,(char)d, 0,0,0,0, 0,0,0,0, 0,0,0,0)) {}

    AI static SkNx Load(const void* ptr) { return _mm_cvtsi32_si128(*(const int*)ptr); }
    AI void store(void* ptr) const { *(int*)ptr = _mm_cvtsi128_si32(fVec); }

    AI uint8_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; uint8_t us[16]; } pun = {fVec};
        return pun.us[k&3];
    }

    // TODO as needed

    __m128i fVec;
};

template <>
class SkNx<8, uint8_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint8_t val) : fVec(_mm_set1_epi8((char)val)) {}
    AI static SkNx Load(const void* ptr) { return _mm_loadl_epi64((const __m128i*)ptr); }
    AI SkNx(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
            uint8_t e, uint8_t f, uint8_t g, uint8_t h)
        : fVec(_mm_setr_epi8((char)a,(char)b,(char)c,(char)d,
                             (char)e,(char)f,(char)g,(char)h,
                             0,0,0,0, 0,0,0,0)) {}

    AI void store(void* ptr) const { _mm_storel_epi64((__m128i*)ptr, fVec); }

    AI SkNx saturatedAdd(const SkNx& o) const { return _mm_adds_epu8(fVec, o.fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi8(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi8(fVec, o.fVec); }

    AI static SkNx Min(const SkNx& a, const SkNx& b) { return _mm_min_epu8(a.fVec, b.fVec); }
    AI SkNx operator < (const SkNx& o) const {
        // There's no unsigned _mm_cmplt_epu8, so we flip the sign bits then use a signed compare.
        auto flip = _mm_set1_epi8(char(0x80));
        return _mm_cmplt_epi8(_mm_xor_si128(flip, fVec), _mm_xor_si128(flip, o.fVec));
    }

    AI uint8_t operator[](int k) const {
        SkASSERT(0 <= k && k < 8);
        union { __m128i v; uint8_t us[16]; } pun = {fVec};
        return pun.us[k&7];
    }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    }

    __m128i fVec;
};

template <>
class SkNx<16, uint8_t> {
public:
    AI SkNx(const __m128i& vec) : fVec(vec) {}

    AI SkNx() {}
    AI SkNx(uint8_t val) : fVec(_mm_set1_epi8((char)val)) {}
    AI static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    AI SkNx(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
            uint8_t e, uint8_t f, uint8_t g, uint8_t h,
            uint8_t i, uint8_t j, uint8_t k, uint8_t l,
            uint8_t m, uint8_t n, uint8_t o, uint8_t p)
        : fVec(_mm_setr_epi8((char)a,(char)b,(char)c,(char)d,
                             (char)e,(char)f,(char)g,(char)h,
                             (char)i,(char)j,(char)k,(char)l,
                             (char)m,(char)n,(char)o,(char)p)) {}

    AI void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    AI SkNx saturatedAdd(const SkNx& o) const { return _mm_adds_epu8(fVec, o.fVec); }

    AI SkNx operator + (const SkNx& o) const { return _mm_add_epi8(fVec, o.fVec); }
    AI SkNx operator - (const SkNx& o) const { return _mm_sub_epi8(fVec, o.fVec); }
    AI SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }

    AI static SkNx Min(const SkNx& a, const SkNx& b) { return _mm_min_epu8(a.fVec, b.fVec); }
    AI SkNx operator < (const SkNx& o) const {
        // There's no unsigned _mm_cmplt_epu8, so we flip the sign bits then use a signed compare.
        auto flip = _mm_set1_epi8(char(0x80));
        return _mm_cmplt_epi8(_mm_xor_si128(flip, fVec), _mm_xor_si128(flip, o.fVec));
    }
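    // Worked example of the flip trick (illustration only): comparing 1 < 255 flips the bytes
    // to 0x81 (-127) and 0x7F (+127); the signed compare says -127 < +127, so the lane mask is
    // all 1s, which is the right answer for the unsigned values too.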

    AI uint8_t operator[](int k) const {
        SkASSERT(0 <= k && k < 16);
        union { __m128i v; uint8_t us[16]; } pun = {fVec};
        return pun.us[k&15];
    }

    AI SkNx thenElse(const SkNx& t, const SkNx& e) const {
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    }
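    // thenElse() is a branchless per-byte select: where the mask byte is 0xFF it keeps t,
    // where it is 0x00 it keeps e. Hypothetical usage (names are illustrative only):
    //     Sk16b mask = srcAlpha < Sk16b(128);
    //     Sk16b out  = mask.thenElse(halved, original);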

    __m128i fVec;
};

template<> AI /*static*/ Sk4f SkNx_cast<float, int32_t>(const Sk4i& src) {
    return _mm_cvtepi32_ps(src.fVec);
}

template<> AI /*static*/ Sk4f SkNx_cast<float, uint32_t>(const Sk4u& src) {
    return SkNx_cast<float>(Sk4i::Load(&src));
}

template <> AI /*static*/ Sk4i SkNx_cast<int32_t, float>(const Sk4f& src) {
    return _mm_cvttps_epi32(src.fVec);
}

template<> AI /*static*/ Sk4h SkNx_cast<uint16_t, int32_t>(const Sk4i& src) {
#if 0 && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
    // TODO: This seems to be causing code generation problems.   Investigate?
    return _mm_packus_epi32(src.fVec);
#elif SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    // With SSSE3, we can just shuffle the low 2 bytes from each lane right into place.
    const int _ = ~0;
    return _mm_shuffle_epi8(src.fVec, _mm_setr_epi8(0,1, 4,5, 8,9, 12,13, _,_,_,_,_,_,_,_));
#else
    // With SSE2, we have to sign extend our input, making _mm_packs_epi32 do the pack we want.
    __m128i x = _mm_srai_epi32(_mm_slli_epi32(src.fVec, 16), 16);
    return _mm_packs_epi32(x,x);
#endif
}
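// A note on the SSE2 branch above (worked example, not from the original source): the pack
// instruction saturates *signed* 32-bit lanes, so a lane like 0x0000FFFF would clamp to 0x7FFF.
// Shifting left then arithmetic-right by 16 first turns it into 0xFFFFFFFF (-1), which
// _mm_packs_epi32 then reproduces exactly as the 16-bit pattern 0xFFFF we wanted.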

template<> AI /*static*/ Sk4h SkNx_cast<uint16_t, float>(const Sk4f& src) {
    return SkNx_cast<uint16_t>(SkNx_cast<int32_t>(src));
}

template<> AI /*static*/ Sk4b SkNx_cast<uint8_t, float>(const Sk4f& src) {
    auto _32 = _mm_cvttps_epi32(src.fVec);
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    const int _ = ~0;
    return _mm_shuffle_epi8(_32, _mm_setr_epi8(0,4,8,12, _,_,_,_, _,_,_,_, _,_,_,_));
#else
    auto _16 = _mm_packus_epi16(_32, _32);
    return _mm_packus_epi16(_16, _16);
#endif
}

template<> AI /*static*/ Sk4u SkNx_cast<uint32_t, uint8_t>(const Sk4b& src) {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    const int _ = ~0;
    return _mm_shuffle_epi8(src.fVec, _mm_setr_epi8(0,_,_,_, 1,_,_,_, 2,_,_,_, 3,_,_,_));
#else
    auto _16 = _mm_unpacklo_epi8(src.fVec, _mm_setzero_si128());
    return _mm_unpacklo_epi16(_16, _mm_setzero_si128());
#endif
}

template<> AI /*static*/ Sk4i SkNx_cast<int32_t, uint8_t>(const Sk4b& src) {
    return SkNx_cast<uint32_t>(src).fVec;
}

template<> AI /*static*/ Sk4f SkNx_cast<float, uint8_t>(const Sk4b& src) {
    return _mm_cvtepi32_ps(SkNx_cast<int32_t>(src).fVec);
}

template<> AI /*static*/ Sk4f SkNx_cast<float, uint16_t>(const Sk4h& src) {
    auto _32 = _mm_unpacklo_epi16(src.fVec, _mm_setzero_si128());
    return _mm_cvtepi32_ps(_32);
}

template<> AI /*static*/ Sk8b SkNx_cast<uint8_t, int32_t>(const Sk8i& src) {
    Sk4i lo, hi;
    SkNx_split(src, &lo, &hi);

    auto t = _mm_packs_epi32(lo.fVec, hi.fVec);
    return _mm_packus_epi16(t, t);
}

template<> AI /*static*/ Sk16b SkNx_cast<uint8_t, float>(const Sk16f& src) {
    Sk8f ab, cd;
    SkNx_split(src, &ab, &cd);

    Sk4f a,b,c,d;
    SkNx_split(ab, &a, &b);
    SkNx_split(cd, &c, &d);

    return _mm_packus_epi16(_mm_packus_epi16(_mm_cvttps_epi32(a.fVec),
                                             _mm_cvttps_epi32(b.fVec)),
                            _mm_packus_epi16(_mm_cvttps_epi32(c.fVec),
                                             _mm_cvttps_epi32(d.fVec)));
}

template<> AI /*static*/ Sk4h SkNx_cast<uint16_t, uint8_t>(const Sk4b& src) {
    return _mm_unpacklo_epi8(src.fVec, _mm_setzero_si128());
}

template<> AI /*static*/ Sk8h SkNx_cast<uint16_t, uint8_t>(const Sk8b& src) {
    return _mm_unpacklo_epi8(src.fVec, _mm_setzero_si128());
}

template<> AI /*static*/ Sk4b SkNx_cast<uint8_t, uint16_t>(const Sk4h& src) {
    return _mm_packus_epi16(src.fVec, src.fVec);
}

template<> AI /*static*/ Sk8b SkNx_cast<uint8_t, uint16_t>(const Sk8h& src) {
    return _mm_packus_epi16(src.fVec, src.fVec);
}

template<> AI /*static*/ Sk4i SkNx_cast<int32_t, uint16_t>(const Sk4h& src) {
    return _mm_unpacklo_epi16(src.fVec, _mm_setzero_si128());
}

template<> AI /*static*/ Sk4b SkNx_cast<uint8_t, int32_t>(const Sk4i& src) {
    return _mm_packus_epi16(_mm_packus_epi16(src.fVec, src.fVec), src.fVec);
}

template<> AI /*static*/ Sk4b SkNx_cast<uint8_t, uint32_t>(const Sk4u& src) {
    return _mm_packus_epi16(_mm_packus_epi16(src.fVec, src.fVec), src.fVec);
}

template<> AI /*static*/ Sk4i SkNx_cast<int32_t, uint32_t>(const Sk4u& src) {
    return src.fVec;
}

AI static Sk4i Sk4f_round(const Sk4f& x) {
    return _mm_cvtps_epi32(x.fVec);
}
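// Unlike SkNx_cast<int32_t, float> above, which truncates via _mm_cvttps_epi32, this rounds to
// the nearest integer (ties to even under the default MXCSR mode): 1.7f becomes 2 here but 1
// when cast, and 2.5f becomes 2 either way.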

} // namespace

#endif//SkNx_sse_DEFINED