Wrong copy algorithm for last bytes, not thread safety.
In some particular cases it uses the destination
memory beyond the string end for
16-byte load, puts changes into that part that is relevant
to destination string and writes whole 16-byte chunk into memory.
I have a test case where the memory beyond the string end contains
malloc/free data, that appear corrupted in case free() updates
it in between the 16-byte read and 16-byte write.