The implementation is almost the same 4-way-unrolled loop, but because
of the wider registers, we fill 128 bytes per loop. Unlike the SSE2
implementation, the AVX2 version uses unaligned stores and won't try to
align in the prologue, matching glibc's __memset_avx2 (also unaligned).
Change-Id: Iba4b5c183776497d8ee1fffd15637ccb2a7b83bc
Reviewed-by: Allan Sandfeld Jensen <allan.jensen@qt.io>