For arm this makes no difference--the result is bit-for-bit identical;
for thumb this results in smaller encodings. Perhaps it ought not and
this is in fact an assembler bug, but I also think it's clearer.
Factor out the sequence needed to call kuser_get_tls, as we can't
play subtract into pc games in thumb mode. Prepare for hard-tp,
pulling the save of LR into the macro.
There are several places in which we access negative offsets from
the thread-pointer, but thumb2 only supports positive offsets in
memory references.
Avoid duplicating the rather large macros in which these references
are embedded by abstracting out the operation.