Customize memcmp.c for tile, using similar tricks from memcpy:
- replace MERGE macro with dblalign.
- replace memcmp_bytes function with revbytes.
- use __glibc_likely.
- use post-increment addressing.
The schedule is still not perfect: the compiler is not hoisting
code above the comparison branch, which could save a bundle or two.
memcmp speeds up by 30-40% on shorter aligned tests in benchtest,
with some tests with unaligned lengths taking a small performance hit.