/*
 * Copyright 2015 Google Inc.
 *
 * Use of this source code is governed by a BSD-style license that can be
 * found in the LICENSE file.
 */

#ifndef SkNx_sse_DEFINED
#define SkNx_sse_DEFINED

#include <immintrin.h>

// This file may assume <= SSE2, but must check SK_CPU_SSE_LEVEL for anything more recent.
// If you do, make sure this is in a static inline function... anywhere else risks violating ODR.

#define SKNX_IS_FAST
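
// For instance (an illustrative sketch, not part of this header): an SSSE3
// byte-shuffle helper stays ODR-safe by living in a static inline function and
// checking SK_CPU_SSE_LEVEL, with a plain SSE2 fallback. The helper name is
// hypothetical.
//
//     static inline __m128i example_bgra_to_rgba(__m128i px) {
//     #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
//         const __m128i swap = _mm_setr_epi8(2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15);
//         return _mm_shuffle_epi8(px, swap);                  // one pshufb
//     #else
//         __m128i rb = _mm_and_si128(px, _mm_set1_epi32(0x00FF00FF));
//         __m128i ga = _mm_andnot_si128(_mm_set1_epi32(0x00FF00FF), px);
//         return _mm_or_si128(ga, _mm_or_si128(_mm_srli_epi32(rb, 16),
//                                              _mm_slli_epi32(rb, 16)));
//     #endif
//     }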

template <>
class SkNx<2, float> {
public:
    SkNx(const __m128& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(float val) : fVec(_mm_set1_ps(val)) {}
    static SkNx Load(const void* ptr) {
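        // Load two floats (8 bytes) with an integer load, then reinterpret the bits as floats.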
        return _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)ptr));
    }
    SkNx(float a, float b) : fVec(_mm_setr_ps(a,b,0,0)) {}

    void store(void* ptr) const { _mm_storel_pi((__m64*)ptr, fVec); }

    SkNx operator + (const SkNx& o) const { return _mm_add_ps(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_ps(fVec, o.fVec); }
    SkNx operator * (const SkNx& o) const { return _mm_mul_ps(fVec, o.fVec); }
    SkNx operator / (const SkNx& o) const { return _mm_div_ps(fVec, o.fVec); }

    SkNx operator == (const SkNx& o) const { return _mm_cmpeq_ps (fVec, o.fVec); }
    SkNx operator != (const SkNx& o) const { return _mm_cmpneq_ps(fVec, o.fVec); }
    SkNx operator <  (const SkNx& o) const { return _mm_cmplt_ps (fVec, o.fVec); }
    SkNx operator >  (const SkNx& o) const { return _mm_cmpgt_ps (fVec, o.fVec); }
    SkNx operator <= (const SkNx& o) const { return _mm_cmple_ps (fVec, o.fVec); }
    SkNx operator >= (const SkNx& o) const { return _mm_cmpge_ps (fVec, o.fVec); }

    static SkNx Min(const SkNx& l, const SkNx& r) { return _mm_min_ps(l.fVec, r.fVec); }
    static SkNx Max(const SkNx& l, const SkNx& r) { return _mm_max_ps(l.fVec, r.fVec); }

    SkNx sqrt() const { return _mm_sqrt_ps (fVec); }
    SkNx rsqrt() const { return _mm_rsqrt_ps(fVec); }
    SkNx invert() const { return _mm_rcp_ps(fVec); }
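    // Note: rsqrt() and invert() use rsqrtps/rcpps, which are fast approximations
    // (good to roughly 12 bits), not IEEE-exact results.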

    float operator[](int k) const {
        SkASSERT(0 <= k && k < 2);
        union { __m128 v; float fs[4]; } pun = {fVec};
        return pun.fs[k&1];
    }
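
    // movemask takes the top bit of each of the 16 bytes; masking with 0xff keeps
    // only the low 8 bytes, i.e. the two float lanes of this vector.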
    bool allTrue() const { return 0xff == (_mm_movemask_epi8(_mm_castps_si128(fVec)) & 0xff); }
    bool anyTrue() const { return 0x00 != (_mm_movemask_epi8(_mm_castps_si128(fVec)) & 0xff); }

    __m128 fVec;
};

template <>
class SkNx<4, float> {
public:
    SkNx(const __m128& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(float val) : fVec( _mm_set1_ps(val) ) {}
    SkNx(float a, float b, float c, float d) : fVec(_mm_setr_ps(a,b,c,d)) {}

    static SkNx Load(const void* ptr) { return _mm_loadu_ps((const float*)ptr); }
    void store(void* ptr) const { _mm_storeu_ps((float*)ptr, fVec); }
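
    // Load4()/Store4() convert between four interleaved float quads in memory
    // (e.g. 4 RGBA pixels) and planar form, one register per channel;
    // _MM_TRANSPOSE4_PS does the 4x4 transpose.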
    static void Load4(const void* ptr, SkNx* r, SkNx* g, SkNx* b, SkNx* a) {
        __m128 v0 = _mm_loadu_ps(((float*)ptr) +  0),
               v1 = _mm_loadu_ps(((float*)ptr) +  4),
               v2 = _mm_loadu_ps(((float*)ptr) +  8),
               v3 = _mm_loadu_ps(((float*)ptr) + 12);
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        *r = v0;
        *g = v1;
        *b = v2;
        *a = v3;
    }
    static void Store4(void* dst, const SkNx& r, const SkNx& g, const SkNx& b, const SkNx& a) {
        __m128 v0 = r.fVec,
               v1 = g.fVec,
               v2 = b.fVec,
               v3 = a.fVec;
        _MM_TRANSPOSE4_PS(v0, v1, v2, v3);
        _mm_storeu_ps(((float*) dst) +  0, v0);
        _mm_storeu_ps(((float*) dst) +  4, v1);
        _mm_storeu_ps(((float*) dst) +  8, v2);
        _mm_storeu_ps(((float*) dst) + 12, v3);
    }
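
    // Usage sketch (illustrative only; 'src' and 'dst' are assumed to point at
    // 16 floats of interleaved RGBA):
    //
    //     Sk4f r,g,b,a;
    //     Sk4f::Load4(src, &r, &g, &b, &a);       // deinterleave 4 pixels
    //     Sk4f::Store4(dst, r*a, g*a, b*a, a);    // e.g. premultiply, reinterleave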

    SkNx operator + (const SkNx& o) const { return _mm_add_ps(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_ps(fVec, o.fVec); }
    SkNx operator * (const SkNx& o) const { return _mm_mul_ps(fVec, o.fVec); }
    SkNx operator / (const SkNx& o) const { return _mm_div_ps(fVec, o.fVec); }

    SkNx operator == (const SkNx& o) const { return _mm_cmpeq_ps (fVec, o.fVec); }
    SkNx operator != (const SkNx& o) const { return _mm_cmpneq_ps(fVec, o.fVec); }
    SkNx operator <  (const SkNx& o) const { return _mm_cmplt_ps (fVec, o.fVec); }
    SkNx operator >  (const SkNx& o) const { return _mm_cmpgt_ps (fVec, o.fVec); }
    SkNx operator <= (const SkNx& o) const { return _mm_cmple_ps (fVec, o.fVec); }
    SkNx operator >= (const SkNx& o) const { return _mm_cmpge_ps (fVec, o.fVec); }

    static SkNx Min(const SkNx& l, const SkNx& r) { return _mm_min_ps(l.fVec, r.fVec); }
    static SkNx Max(const SkNx& l, const SkNx& r) { return _mm_max_ps(l.fVec, r.fVec); }
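
    // -0.0f is just the sign bit, so andnot-ing it away gives the absolute value.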
    SkNx abs() const { return _mm_andnot_ps(_mm_set1_ps(-0.0f), fVec); }

    SkNx floor() const {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_floor_ps(fVec);
#else
        // Emulate _mm_floor_ps() with SSE2:
        //   - roundtrip through integers via truncation
        //   - subtract 1 if that's too big (possible for negative values).
        // This restricts the domain of our inputs to a maximum somewhere around 2^31.
        // Seems plenty big.
        __m128 roundtrip = _mm_cvtepi32_ps(_mm_cvttps_epi32(fVec));
        __m128 too_big = _mm_cmpgt_ps(roundtrip, fVec);
        return _mm_sub_ps(roundtrip, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
#endif
    }

    SkNx sqrt() const { return _mm_sqrt_ps (fVec); }
    SkNx rsqrt() const { return _mm_rsqrt_ps(fVec); }
    SkNx invert() const { return _mm_rcp_ps(fVec); }

    float operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128 v; float fs[4]; } pun = {fVec};
        return pun.fs[k&3];
    }

    bool allTrue() const { return 0xffff == _mm_movemask_epi8(_mm_castps_si128(fVec)); }
    bool anyTrue() const { return 0x0000 != _mm_movemask_epi8(_mm_castps_si128(fVec)); }

    SkNx thenElse(const SkNx& t, const SkNx& e) const {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_ps(e.fVec, t.fVec, fVec);
#else
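        // A comparison result is all-1s or all-0s per lane, so
        // (mask & t) | (~mask & e) picks t where true and e where false.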
        return _mm_or_ps(_mm_and_ps   (fVec, t.fVec),
                         _mm_andnot_ps(fVec, e.fVec));
#endif
    }
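
    // Usage sketch (illustrative): a branch-free clamp. Both sides are always
    // evaluated; the comparison mask only selects between them.
    //
    //     Sk4f v(0.5f, 1.5f, -2.0f, 3.0f);
    //     v = (v > Sk4f(1.0f)).thenElse(Sk4f(1.0f), v);   // { 0.5, 1, -2, 1 }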

    __m128 fVec;
};

template <>
class SkNx<4, int32_t> {
public:
    SkNx(const __m128i& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(int32_t val) : fVec(_mm_set1_epi32(val)) {}
    static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    SkNx(int32_t a, int32_t b, int32_t c, int32_t d) : fVec(_mm_setr_epi32(a,b,c,d)) {}

    void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    SkNx operator + (const SkNx& o) const { return _mm_add_epi32(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_epi32(fVec, o.fVec); }
    SkNx operator * (const SkNx& o) const {
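        // SSE2 has no packed 32x32->32 multiply. _mm_mul_epu32 multiplies lanes 0
        // and 2, so do those first, shift both inputs right by 4 bytes to multiply
        // lanes 1 and 3, then interleave the low 32 bits of each 64-bit product.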
        __m128i mul20 = _mm_mul_epu32(fVec, o.fVec),
                mul31 = _mm_mul_epu32(_mm_srli_si128(fVec, 4), _mm_srli_si128(o.fVec, 4));
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(mul20, _MM_SHUFFLE(0,0,2,0)),
                                  _mm_shuffle_epi32(mul31, _MM_SHUFFLE(0,0,2,0)));
    }

    SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }
    SkNx operator ^ (const SkNx& o) const { return _mm_xor_si128(fVec, o.fVec); }

    SkNx operator << (int bits) const { return _mm_slli_epi32(fVec, bits); }
    SkNx operator >> (int bits) const { return _mm_srai_epi32(fVec, bits); }

    SkNx operator == (const SkNx& o) const { return _mm_cmpeq_epi32 (fVec, o.fVec); }
    SkNx operator <  (const SkNx& o) const { return _mm_cmplt_epi32 (fVec, o.fVec); }
    SkNx operator >  (const SkNx& o) const { return _mm_cmpgt_epi32 (fVec, o.fVec); }

    int32_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; int32_t is[4]; } pun = {fVec};
        return pun.is[k&3];
    }

    SkNx thenElse(const SkNx& t, const SkNx& e) const {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_epi8(e.fVec, t.fVec, fVec);
#else
        return _mm_or_si128(_mm_and_si128   (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
#endif
    }

    __m128i fVec;
};

template <>
class SkNx<4, uint32_t> {
public:
    SkNx(const __m128i& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(uint32_t val) : fVec(_mm_set1_epi32(val)) {}
    static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    SkNx(uint32_t a, uint32_t b, uint32_t c, uint32_t d) : fVec(_mm_setr_epi32(a,b,c,d)) {}

    void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    SkNx operator + (const SkNx& o) const { return _mm_add_epi32(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_epi32(fVec, o.fVec); }

    // Not quite sure how to best do operator * in SSE2. We probably don't use it.
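    // (If it's ever needed: the even/odd _mm_mul_epu32 trick in SkNx<4, int32_t>
    // above would work here unchanged, since the low 32 bits of a 32x32-bit
    // product are the same for signed and unsigned operands.)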

    SkNx operator & (const SkNx& o) const { return _mm_and_si128(fVec, o.fVec); }
    SkNx operator | (const SkNx& o) const { return _mm_or_si128(fVec, o.fVec); }
    SkNx operator ^ (const SkNx& o) const { return _mm_xor_si128(fVec, o.fVec); }

    SkNx operator << (int bits) const { return _mm_slli_epi32(fVec, bits); }
    SkNx operator >> (int bits) const { return _mm_srli_epi32(fVec, bits); }

    SkNx operator == (const SkNx& o) const { return _mm_cmpeq_epi32 (fVec, o.fVec); }

    // operator < and > take a little extra fiddling to make work for unsigned ints.
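    // One standard approach (illustrative sketch, not part of this class): flip
    // the top bit of both sides so unsigned order maps onto signed order, then
    // compare signed:
    //
    //     SkNx operator < (const SkNx& o) const {
    //         __m128i top = _mm_set1_epi32((int)0x80000000);
    //         return _mm_cmplt_epi32(_mm_xor_si128(fVec, top),
    //                                _mm_xor_si128(o.fVec, top));
    //     }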

    uint32_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; uint32_t us[4]; } pun = {fVec};
        return pun.us[k&3];
    }

    SkNx thenElse(const SkNx& t, const SkNx& e) const {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
        return _mm_blendv_epi8(e.fVec, t.fVec, fVec);
#else
        return _mm_or_si128(_mm_and_si128   (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
#endif
    }

    __m128i fVec;
};

template <>
class SkNx<4, uint16_t> {
public:
    SkNx(const __m128i& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(uint16_t val) : fVec(_mm_set1_epi16(val)) {}
    SkNx(uint16_t a, uint16_t b, uint16_t c, uint16_t d) : fVec(_mm_setr_epi16(a,b,c,d,0,0,0,0)) {}

    static SkNx Load(const void* ptr) { return _mm_loadl_epi64((const __m128i*)ptr); }
    void store(void* ptr) const { _mm_storel_epi64((__m128i*)ptr, fVec); }
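
    // Load4 reads 4 pixels of 4 uint16_t channels each; two rounds of unpacking
    // sort the 16-bit lanes into channel-major (planar) order.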
    static void Load4(const void* ptr, SkNx* r, SkNx* g, SkNx* b, SkNx* a) {
        __m128i lo = _mm_loadu_si128(((__m128i*)ptr) + 0),
                hi = _mm_loadu_si128(((__m128i*)ptr) + 1);
        __m128i even = _mm_unpacklo_epi16(lo, hi),   // r0 r2 g0 g2 b0 b2 a0 a2
                 odd = _mm_unpackhi_epi16(lo, hi);   // r1 r3 ...
        __m128i rg = _mm_unpacklo_epi16(even, odd),  // r0 r1 r2 r3 g0 g1 g2 g3
                ba = _mm_unpackhi_epi16(even, odd);  // b0 b1 ... a0 a1 ...
        *r = rg;
        *g = _mm_srli_si128(rg, 8);
        *b = ba;
        *a = _mm_srli_si128(ba, 8);
    }
|
|
|
|
static void Store4(void* dst, const SkNx& r, const SkNx& g, const SkNx& b, const SkNx& a) {
|
|
|
|
__m128i rg = _mm_unpacklo_epi16(r.fVec, g.fVec);
|
|
|
|
__m128i ba = _mm_unpacklo_epi16(b.fVec, a.fVec);
|
|
|
|
__m128i lo = _mm_unpacklo_epi32(rg, ba);
|
|
|
|
__m128i hi = _mm_unpackhi_epi32(rg, ba);
|
|
|
|
_mm_storeu_si128(((__m128i*) dst) + 0, lo);
|
|
|
|
_mm_storeu_si128(((__m128i*) dst) + 1, hi);
|
|
|
|
}
|
|
|
|
|
2015-11-20 21:53:19 +00:00
|
|
|
SkNx operator + (const SkNx& o) const { return _mm_add_epi16(fVec, o.fVec); }
|
|
|
|
SkNx operator - (const SkNx& o) const { return _mm_sub_epi16(fVec, o.fVec); }
|
|
|
|
SkNx operator * (const SkNx& o) const { return _mm_mullo_epi16(fVec, o.fVec); }
|
2015-04-14 21:02:52 +00:00
|
|
|
|
2015-11-20 21:53:19 +00:00
|
|
|
SkNx operator << (int bits) const { return _mm_slli_epi16(fVec, bits); }
|
|
|
|
SkNx operator >> (int bits) const { return _mm_srli_epi16(fVec, bits); }
|
2015-04-14 21:02:52 +00:00
|
|
|
|
sknx refactoring
- trim unused specializations (Sk4i, Sk2d) and apis (SkNx_dup)
- expand apis a little
* v[0] == v.kth<0>()
* SkNx_shuffle can now convert to different-sized vectors, e.g. Sk2f <-> Sk4f
- remove anonymous namespace
I believe it's safe to remove the anonymous namespace right now.
We're worried about violating the One Definition Rule; the anonymous namespace protected us from that.
In Release builds, this is mostly moot, as everything tends to inline completely.
In Debug builds, violating the ODR is at worst an inconvenience, time spent trying to figure out why the bot is broken.
Now that we're building with SSE2/NEON everywhere, very few bots have even a chance about getting confused by two definitions of the same type or function. Where we do compile variants depending on, e.g., SSSE3, we do so in static inline functions. These are not subject to the ODR.
I plan to follow up with a tedious .kth<...>() -> [...] auto-replace.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1683543002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1683543002
2016-02-09 18:35:27 +00:00
|
|
|
uint16_t operator[](int k) const {
|
2015-04-14 21:02:52 +00:00
|
|
|
SkASSERT(0 <= k && k < 4);
|
sknx refactoring
- trim unused specializations (Sk4i, Sk2d) and apis (SkNx_dup)
- expand apis a little
* v[0] == v.kth<0>()
* SkNx_shuffle can now convert to different-sized vectors, e.g. Sk2f <-> Sk4f
- remove anonymous namespace
I believe it's safe to remove the anonymous namespace right now.
We're worried about violating the One Definition Rule; the anonymous namespace protected us from that.
In Release builds, this is mostly moot, as everything tends to inline completely.
In Debug builds, violating the ODR is at worst an inconvenience, time spent trying to figure out why the bot is broken.
Now that we're building with SSE2/NEON everywhere, very few bots have even a chance about getting confused by two definitions of the same type or function. Where we do compile variants depending on, e.g., SSSE3, we do so in static inline functions. These are not subject to the ODR.
I plan to follow up with a tedious .kth<...>() -> [...] auto-replace.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1683543002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1683543002
2016-02-09 18:35:27 +00:00
|
|
|
union { __m128i v; uint16_t us[8]; } pun = {fVec};
|
|
|
|
return pun.us[k&3];
|
2015-04-14 21:02:52 +00:00
|
|
|
}
|
2015-05-12 13:11:21 +00:00
|
|
|
|
2015-04-14 21:02:52 +00:00
|
|
|
__m128i fVec;
|
|
|
|
};
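
// Editor's usage sketch for Load4()/Store4() above (assumed example; the
// buffer and variable names are illustrative):
//
//     uint16_t rgba[16];                  // four interleaved RGBA pixels, 16 bits/channel
//     Sk4h r, g, b, a;
//     Sk4h::Load4(rgba, &r, &g, &b, &a);  // deinterleave into one plane per channel
//     Sk4h::Store4(rgba, r, g, b, a);     // interleave the planes back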

template <>
class SkNx<8, uint16_t> {
public:
    SkNx(const __m128i& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(uint16_t val) : fVec(_mm_set1_epi16(val)) {}
    static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    SkNx(uint16_t a, uint16_t b, uint16_t c, uint16_t d,
         uint16_t e, uint16_t f, uint16_t g, uint16_t h) : fVec(_mm_setr_epi16(a,b,c,d,e,f,g,h)) {}

    void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    SkNx operator + (const SkNx& o) const { return _mm_add_epi16(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_epi16(fVec, o.fVec); }
    SkNx operator * (const SkNx& o) const { return _mm_mullo_epi16(fVec, o.fVec); }

    SkNx operator << (int bits) const { return _mm_slli_epi16(fVec, bits); }
    SkNx operator >> (int bits) const { return _mm_srli_epi16(fVec, bits); }

    static SkNx Min(const SkNx& a, const SkNx& b) {
        // No unsigned _mm_min_epu16, so we'll shift into a space where we can use the
        // signed version, _mm_min_epi16, then shift back.  The byte-wise add/sub of
        // 0x8000 is safe: the bias's low byte is zero, so only each lane's sign bit
        // changes (it's equivalent to an XOR) and nothing carries across lanes.
        const uint16_t top = 0x8000; // Keep this separate from _mm_set1_epi16 or MSVC will whine.
        const __m128i top_8x = _mm_set1_epi16(top);
        return _mm_add_epi8(top_8x, _mm_min_epi16(_mm_sub_epi8(a.fVec, top_8x),
                                                  _mm_sub_epi8(b.fVec, top_8x)));
    }
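
    // Worked example of the bias trick above (editor's note): Min(Sk8h(1), Sk8h(0xFFFF)).
    // As signed int16, 0xFFFF is -1, so _mm_min_epi16 alone would wrongly pick it as
    // the smaller value. After the bias, 0x0001 -> 0x8001 (-32767) and 0xFFFF -> 0x7FFF
    // (+32767); the signed min picks 0x8001, and un-biasing gives back 0x0001.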

    SkNx thenElse(const SkNx& t, const SkNx& e) const {
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    }

    uint16_t operator[](int k) const {
        SkASSERT(0 <= k && k < 8);
        union { __m128i v; uint16_t us[8]; } pun = {fVec};
        return pun.us[k&7];
    }

    __m128i fVec;
};

template <>
class SkNx<4, uint8_t> {
public:
    SkNx() {}
    SkNx(const __m128i& vec) : fVec(vec) {}
    SkNx(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
        : fVec(_mm_setr_epi8(a,b,c,d, 0,0,0,0, 0,0,0,0, 0,0,0,0)) {}

    static SkNx Load(const void* ptr) { return _mm_cvtsi32_si128(*(const int*)ptr); }
    void store(void* ptr) const { *(int*)ptr = _mm_cvtsi128_si32(fVec); }

    uint8_t operator[](int k) const {
        SkASSERT(0 <= k && k < 4);
        union { __m128i v; uint8_t us[16]; } pun = {fVec};
        return pun.us[k&3];
    }

    // TODO as needed

    __m128i fVec;
};
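
// Editor's usage sketch (assumed example): Sk4b maps naturally onto one 8888
// pixel; Load()/store() move exactly four bytes through a 32-bit lane:
//
//     uint32_t pixel = 0x11223344;
//     Sk4b px = Sk4b::Load(&pixel);   // one byte per lane
//     uint8_t b0 = px[0];             // 0x44 on a little-endian machine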

template <>
class SkNx<16, uint8_t> {
public:
    SkNx(const __m128i& vec) : fVec(vec) {}

    SkNx() {}
    SkNx(uint8_t val) : fVec(_mm_set1_epi8(val)) {}
    static SkNx Load(const void* ptr) { return _mm_loadu_si128((const __m128i*)ptr); }
    SkNx(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
         uint8_t e, uint8_t f, uint8_t g, uint8_t h,
         uint8_t i, uint8_t j, uint8_t k, uint8_t l,
         uint8_t m, uint8_t n, uint8_t o, uint8_t p)
        : fVec(_mm_setr_epi8(a,b,c,d, e,f,g,h, i,j,k,l, m,n,o,p)) {}

    void store(void* ptr) const { _mm_storeu_si128((__m128i*)ptr, fVec); }

    SkNx saturatedAdd(const SkNx& o) const { return _mm_adds_epu8(fVec, o.fVec); }

    SkNx operator + (const SkNx& o) const { return _mm_add_epi8(fVec, o.fVec); }
    SkNx operator - (const SkNx& o) const { return _mm_sub_epi8(fVec, o.fVec); }

    static SkNx Min(const SkNx& a, const SkNx& b) { return _mm_min_epu8(a.fVec, b.fVec); }
    SkNx operator < (const SkNx& o) const {
        // There's no unsigned _mm_cmplt_epu8, so we flip the sign bits then use a signed compare.
        auto flip = _mm_set1_epi8(char(0x80));
        return _mm_cmplt_epi8(_mm_xor_si128(flip, fVec), _mm_xor_si128(flip, o.fVec));
    }

    uint8_t operator[](int k) const {
        SkASSERT(0 <= k && k < 16);
        union { __m128i v; uint8_t us[16]; } pun = {fVec};
        return pun.us[k&15];
    }

    SkNx thenElse(const SkNx& t, const SkNx& e) const {
        return _mm_or_si128(_mm_and_si128 (fVec, t.fVec),
                            _mm_andnot_si128(fVec, e.fVec));
    }

    __m128i fVec;
};
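
// Worked example (editor's note): SkNx<16, uint8_t>::saturatedAdd() clamps
// instead of wrapping. With a = Sk16b(200) and b = Sk16b(100), (a + b) wraps
// each lane to 44 (300 % 256), while a.saturatedAdd(b) pins every lane at 255,
// which is usually what 8-bit color math wants.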

template<> /*static*/ inline Sk4f SkNx_cast<float, int32_t>(const Sk4i& src) {
    return _mm_cvtepi32_ps(src.fVec);
}
template<> /*static*/ inline Sk4f SkNx_cast<float, uint32_t>(const Sk4u& src) {
    // Assumes the values fit in int32_t; this just reloads the bits as signed.
    return SkNx_cast<float>(Sk4i::Load(&src));
}

template <> /*static*/ inline Sk4i SkNx_cast<int32_t, float>(const Sk4f& src) {
    return _mm_cvttps_epi32(src.fVec);
}

template<> /*static*/ inline Sk4h SkNx_cast<uint16_t, int32_t>(const Sk4i& src) {
#if 0 && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41
    // TODO: This seems to be causing code generation problems.  Investigate?
    return _mm_packus_epi32(src.fVec, src.fVec);  // packus_epi32 takes two vectors; duplicate ours.
#elif SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    // With SSSE3, we can just shuffle the low 2 bytes from each lane right into place.
    const int _ = ~0;
    return _mm_shuffle_epi8(src.fVec, _mm_setr_epi8(0,1, 4,5, 8,9, 12,13, _,_,_,_,_,_,_,_));
#else
    // With SSE2, we have to sign extend our input, making _mm_packs_epi32 do the pack we want.
    __m128i x = _mm_srai_epi32(_mm_slli_epi32(src.fVec, 16), 16);
    return _mm_packs_epi32(x,x);
#endif
}
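
// Editor's note on the SSE2 branch above: slli/srai by 16 replaces each lane
// with its low 16 bits, sign-extended (e.g. 0x0001ABCD -> 0xFFFFABCD). Every
// lane then fits in int16, so _mm_packs_epi32 packs without saturating, and
// reinterpreted as uint16_t the low 16 bits come through bit-exact.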

template<> /*static*/ inline Sk4h SkNx_cast<uint16_t, float>(const Sk4f& src) {
    return SkNx_cast<uint16_t>(SkNx_cast<int32_t>(src));
}

template<> /*static*/ inline Sk4b SkNx_cast<uint8_t, float>(const Sk4f& src) {
    auto _32 = _mm_cvttps_epi32(src.fVec);
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    const int _ = ~0;
    return _mm_shuffle_epi8(_32, _mm_setr_epi8(0,4,8,12, _,_,_,_, _,_,_,_, _,_,_,_));
#else
    auto _16 = _mm_packus_epi16(_32, _32);
    return _mm_packus_epi16(_16, _16);
#endif
}

template<> /*static*/ inline Sk4f SkNx_cast<float, uint8_t>(const Sk4b& src) {
#if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSSE3
    const int _ = ~0;
    auto _32 = _mm_shuffle_epi8(src.fVec, _mm_setr_epi8(0,_,_,_, 1,_,_,_, 2,_,_,_, 3,_,_,_));
#else
    auto _16 = _mm_unpacklo_epi8(src.fVec, _mm_setzero_si128()),
         _32 = _mm_unpacklo_epi16(_16, _mm_setzero_si128());
#endif
    return _mm_cvtepi32_ps(_32);
}

template<> /*static*/ inline Sk4f SkNx_cast<float, uint16_t>(const Sk4h& src) {
    auto _32 = _mm_unpacklo_epi16(src.fVec, _mm_setzero_si128());
    return _mm_cvtepi32_ps(_32);
}

template<> /*static*/ inline Sk16b SkNx_cast<uint8_t, float>(const Sk16f& src) {
    Sk8f ab, cd;
    SkNx_split(src, &ab, &cd);

    Sk4f a,b,c,d;
    SkNx_split(ab, &a, &b);
    SkNx_split(cd, &c, &d);

    return _mm_packus_epi16(_mm_packus_epi16(_mm_cvttps_epi32(a.fVec),
                                             _mm_cvttps_epi32(b.fVec)),
                            _mm_packus_epi16(_mm_cvttps_epi32(c.fVec),
                                             _mm_cvttps_epi32(d.fVec)));
}
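
// Editor's note: there's no 16-wide float specialization in this file, so the
// cast above halves with SkNx_split() until it reaches the Sk4f quarters that
// are specialized here, then narrows with three _mm_packus_epi16 packs.
// A sketch of the lane flow, assuming the floats are already in byte range:
//
//     16 floats -> 4x Sk4f -> cvttps -> 4x4 int32 lanes -> packs -> 16 uint8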

template<> /*static*/ inline Sk4h SkNx_cast<uint16_t, uint8_t>(const Sk4b& src) {
    return _mm_unpacklo_epi8(src.fVec, _mm_setzero_si128());
}

template<> /*static*/ inline Sk4b SkNx_cast<uint8_t, uint16_t>(const Sk4h& src) {
    return _mm_packus_epi16(src.fVec, src.fVec);
}

template<> /*static*/ inline Sk4i SkNx_cast<int32_t, uint16_t>(const Sk4h& src) {
    return _mm_unpacklo_epi16(src.fVec, _mm_setzero_si128());
}

template<> /*static*/ inline Sk4b SkNx_cast<uint8_t, int32_t>(const Sk4i& src) {
    return _mm_packus_epi16(_mm_packus_epi16(src.fVec, src.fVec), src.fVec);
}

template<> /*static*/ inline Sk4i SkNx_cast<int32_t, uint32_t>(const Sk4u& src) {
    return src.fVec;
}

static inline Sk4i Sk4f_round(const Sk4f& x) {
    return _mm_cvtps_epi32(x.fVec);
}
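
// Editor's note: Sk4f_round() uses _mm_cvtps_epi32, which rounds to nearest
// (ties-to-even under the default MXCSR rounding mode), while the
// SkNx_cast<int32_t, float> above uses _mm_cvttps_epi32, which truncates
// toward zero:
//
//     Sk4f v(1.5f, 2.5f, -1.5f, 1.9f);
//     Sk4f_round(v);          // -> { 2, 2, -2, 2 }
//     SkNx_cast<int32_t>(v);  // -> { 1, 2, -1, 1 }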

#endif//SkNx_sse_DEFINED