e7f89fc257
__builtin_convertvector(..., U8x4) is producing a fairly long sequence of code to convert U16x4 to U8x4 on HSW: vextracti128 $0x1,%ymm2,%xmm3 vmovdqa 0x1848(%rip),%xmm4 vpshufb %xmm4,%xmm3,%xmm3 vpshufb %xmm4,%xmm2,%xmm2 vpunpcklqdq %xmm3,%xmm2,%xmm2 vextracti128 $0x1,%ymm0,%xmm3 vpshufb %xmm4,%xmm3,%xmm3 vpshufb %xmm4,%xmm0,%xmm0 vpunpcklqdq %xmm3,%xmm0,%xmm0 vinserti128 $0x1,%xmm2,%ymm0,%ymm0 We can do much better with _mm256_packus_epi16: vinserti128 $0x1,%xmm0,%ymm2,%ymm3 vperm2i128 $0x31,%ymm0,%ymm2,%ymm0 vpackuswb %ymm0,%ymm3,%ymm0 vpackuswb packs the values in a somewhat surprising order, which the first two instructions get us lined up for. This is a pretty noticeable speedup, 7-8% on some benchmarks. The same sort of change could be made for SSE2 and SSE4.1 also using _mm_packus_epi16, but the difference for that change is much less dramatic. Might as well stick to focusing on HSW. Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c Reviewed-on: https://skia-review.googlesource.com/30420 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org> |
||
---|---|---|
animations | ||
bench | ||
bin | ||
debugger | ||
dm | ||
docs | ||
example | ||
experimental | ||
fuzz | ||
gm | ||
gn | ||
include | ||
infra | ||
platform_tools | ||
resources | ||
samplecode | ||
site | ||
src | ||
tests | ||
third_party | ||
tools | ||
.clang-format | ||
.gitignore | ||
.gn | ||
AUTHORS | ||
BUILD.gn | ||
codereview.settings | ||
CONTRIBUTING | ||
CQ_COMMITTERS | ||
DEPS | ||
Doxyfile | ||
LICENSE | ||
PRESUBMIT.py | ||
public.bzl | ||
README | ||
README.chromium | ||
whitespace.txt |
Skia is a complete 2D graphic library for drawing Text, Geometries, and Images. See full details, and build instructions, at https://skia.org.