Written from scratch rather than copied from GMP, because of the licensing difference (LGPL 2.1 vs GPL 3), but tested with the GMP test suite. Measured on Cortex-A15, this is 250% faster than the generic code and the same speed as GMP on that core; it is probably the same speed as GMP on other cores as well.