The whole point of memcpy32_sse2_unalign is that we don't align dst128
and src128, so it's not safe to cast them back to dst and src. Doing so
tells the compiler that dst/src are 128-bit aligned, and it then
autovectorizes the cleanup while-loop using that (false) knowledge,
emitting aligned SSE instructions.
This leads to crashes on memcpy32_sse2_unalign_10, which is small enough
that we actually get non-16-byte-aligned memory. The larger benches
could be crashing too, but their allocations are big enough that
they're probably always 16-byte aligned anyway.
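For illustration only, here is a minimal sketch of the safe pattern, not
the actual bench code: the function and helper names are hypothetical.
The idea is to advance the original uint32_t pointers alongside the
unaligned 128-bit loads and stores, so the cleanup loop never uses a
pointer derived from a __m128i*. The scalar fallback when SSE2 is
unavailable is an assumption added so the sketch builds anywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical stand-in for the benchmarked routine; not Skia's code.
static void copy32_unaligned(uint32_t* dst, const uint32_t* src, size_t count) {
#if defined(__SSE2__)
    // Unaligned loads/stores: nothing is assumed about dst/src alignment.
    while (count >= 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
        dst += 4;
        src += 4;
        count -= 4;
    }
#endif
    // Cleanup runs on the original uint32_t pointers, which were never
    // cast back from __m128i*, so no 16-byte alignment is implied here.
    while (count-- > 0) {
        *dst++ = *src++;
    }
}

// Hypothetical self-check: copies `count` words starting `skip` words past
// a 16-byte boundary and reports whether the destination matches.
// Requires skip + count <= 64.
static bool copies_correctly(size_t skip, size_t count) {
    alignas(16) uint32_t src[64];
    alignas(16) uint32_t dst[64] = {};
    for (size_t i = 0; i < 64; i++) {
        src[i] = static_cast<uint32_t>(i * 2654435761u);
    }
    copy32_unaligned(dst + skip, src + skip, count);
    return std::memcmp(dst + skip, src + skip, count * sizeof(uint32_t)) == 0;
}
```

Calling copies_correctly(1, 10) exercises a 10-word copy that starts
4 bytes past a 16-byte boundary, roughly the case the small bench hits.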
BUG=skia:2589
R=fmalita@chromium.org, mtklein@google.com
Author: mtklein@chromium.org
Review URL: https://codereview.chromium.org/291893008
git-svn-id: http://skia.googlecode.com/svn/trunk@14851 2bbb7eff-a529-9590-31e7-b0007b416f81