We can't make the -4 versions inline, since they are dispatched through ifuncs, so add vectorized versions instead. A test is included.
We can't use this flag for any code that may run outside the __builtin_cpu_supports() check, and Meson doesn't allow per-file cflags, so we have to split this code off into its own static library.
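
For illustration only, here is a minimal single-file sketch of the constraint described above. It is not the code from this change: the function names are hypothetical, SSE4.1 is an assumed feature (the message does not name the flag), and GCC/Clang's per-function target attribute stands in for the separate static library so the example fits in one file. The point it shows is that code built with the extra ISA enabled is only reachable through the __builtin_cpu_supports() check, while everything else stays baseline.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline path: built without any extra ISA flags, safe on every CPU. */
static void max_u32_generic(uint32_t *dst, const uint32_t *a,
                            const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h>

/* Assumption: SSE4.1 is the feature in question. The target attribute
 * enables it for this one function only; in a real build this function
 * would live in its own static library compiled with the flag. */
__attribute__((target("sse4.1")))
static void max_u32_sse41(uint32_t *dst, const uint32_t *a,
                          const uint32_t *b, size_t n)
{
    size_t i = 0;

    /* Four 32-bit lanes per iteration; _mm_max_epu32 is an SSE4.1 op. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_max_epu32(va, vb));
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
}
#endif

/* Hypothetical entry point: the only caller of the SSE4.1 code, and it
 * calls it strictly behind the runtime CPU check. */
void max_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.1")) {
        max_u32_sse41(dst, a, b, n);
        return;
    }
#endif
    max_u32_generic(dst, a, b, n);
}
```

If the flag were instead applied to the whole file, the compiler would be free to emit the extra instructions in max_u32() or the generic path too, which is exactly the code that runs before or outside the check.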