This more than halves this function's share of the total runtime
compared to the previous commit, from 8.36% to 4.02%. It is now most
likely limited by memory bandwidth on this specific board.
I tried to write an SSE2 version as well, but couldn’t find any
equivalent of the LD4/ST4 ARM instructions.
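For readers unfamiliar with these instructions, here is a scalar model
of what a 4-way de-interleaving load and interleaving store do. It is
only an illustration: the function names and the LANES constant are
made up for this sketch, and 8 lanes is assumed to match a 64-bit
vector of bytes. The real instructions do this in a single step, which
is the part that has no direct SSE2 counterpart.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 8 pixels per step, as with 64-bit
 * vectors of bytes. */
#define LANES 8

/* Scalar model of a 4-way de-interleaving load: split LANES packed
 * 4-channel pixels into one array per channel. */
static void
sketch_deinterleave4 (const uint8_t src[4 * LANES],
                      uint8_t       c0[LANES],
                      uint8_t       c1[LANES],
                      uint8_t       c2[LANES],
                      uint8_t       c3[LANES])
{
  for (size_t i = 0; i < LANES; i++)
    {
      c0[i] = src[4 * i + 0];
      c1[i] = src[4 * i + 1];
      c2[i] = src[4 * i + 2];
      c3[i] = src[4 * i + 3];
    }
}

/* Scalar model of the matching interleaving store: merge the four
 * per-channel arrays back into packed pixels. */
static void
sketch_interleave4 (uint8_t       dest[4 * LANES],
                    const uint8_t c0[LANES],
                    const uint8_t c1[LANES],
                    const uint8_t c2[LANES],
                    const uint8_t c3[LANES])
{
  for (size_t i = 0; i < LANES; i++)
    {
      dest[4 * i + 0] = c0[i];
      dest[4 * i + 1] = c1[i];
      dest[4 * i + 2] = c2[i];
      dest[4 * i + 3] = c3[i];
    }
}
```

Once the channels are separated, per-channel work (for example
multiplying each color channel by alpha) can operate on whole vectors
without any shuffling.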
On x86 on a Kaby Lake CPU, this reduces the function's share of the
total execution time (loading some PNGs using the cairo backend) from
6.63% to 3.20%.
On ARM on a Cortex-A7, on the same workload, it goes from 57% to
8.36%.
Make it use gdk_memory_texture_from_texture().
Also make gdk_memory_format_alpha() privately available so that we can
detect if an image contains an alpha channel.
Also, now make gdk_memory_convert() the only conversion function
and allow conversions between any two formats by going via a float[4].
This could be optimized via fast paths, but so far it isn't.
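As a rough illustration of the float[4] approach, here is a minimal
sketch. It is not the actual GDK code: SketchFormat, sketch_convert()
and the two example formats are hypothetical names invented for this
example, and only two non-premultiplied 8-bit formats are covered.
The idea is that each format only needs a "to float" and a "from
float" function, so N formats need 2N functions instead of N*N
pairwise converters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-format descriptor; an assumption of this sketch. */
typedef struct
{
  size_t bytes_per_pixel;
  void (* to_float)   (float rgba[4], const uint8_t *src);
  void (* from_float) (uint8_t *dest, const float rgba[4]);
} SketchFormat;

/* Clamp to [0, 1] and round to the nearest 8-bit value. */
static uint8_t
sketch_float_to_u8 (float v)
{
  if (v < 0.0f)
    v = 0.0f;
  if (v > 1.0f)
    v = 1.0f;
  return (uint8_t) (v * 255.0f + 0.5f);
}

static void
rgba8_to_float (float rgba[4], const uint8_t *src)
{
  for (size_t i = 0; i < 4; i++)
    rgba[i] = src[i] / 255.0f;
}

static void
rgba8_from_float (uint8_t *dest, const float rgba[4])
{
  for (size_t i = 0; i < 4; i++)
    dest[i] = sketch_float_to_u8 (rgba[i]);
}

/* BGRA stores blue first, so red and blue swap places in memory. */
static void
bgra8_to_float (float rgba[4], const uint8_t *src)
{
  rgba[0] = src[2] / 255.0f;
  rgba[1] = src[1] / 255.0f;
  rgba[2] = src[0] / 255.0f;
  rgba[3] = src[3] / 255.0f;
}

static void
bgra8_from_float (uint8_t *dest, const float rgba[4])
{
  dest[0] = sketch_float_to_u8 (rgba[2]);
  dest[1] = sketch_float_to_u8 (rgba[1]);
  dest[2] = sketch_float_to_u8 (rgba[0]);
  dest[3] = sketch_float_to_u8 (rgba[3]);
}

static const SketchFormat SKETCH_RGBA8 = { 4, rgba8_to_float, rgba8_from_float };
static const SketchFormat SKETCH_BGRA8 = { 4, bgra8_to_float, bgra8_from_float };

/* Convert n_pixels tightly packed pixels between any two formats by
 * going through a float[4] intermediate for every pixel. */
static void
sketch_convert (uint8_t            *dest,
                const SketchFormat *dest_format,
                const uint8_t      *src,
                const SketchFormat *src_format,
                size_t              n_pixels)
{
  float tmp[4];

  for (size_t i = 0; i < n_pixels; i++)
    {
      src_format->to_float (tmp, src + i * src_format->bytes_per_pixel);
      dest_format->from_float (dest + i * dest_format->bytes_per_pixel, tmp);
    }
}
```

Calling sketch_convert (dest, &SKETCH_BGRA8, src, &SKETCH_RGBA8, n)
swaps the red and blue bytes of every pixel. A fast path would skip
the float round trip for pairs like this one, at the cost of one
extra function per optimized pair.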