That way, all permutations are possible. Previously it was only useful
in the cairo renderer, which required rgba8 → premultiplied bgra8, while
the GL renderer required rgba8 → premultiplied rgba8. Now both are
available.
This was only useful when building for AArch32 without -mfpu=neon, on
AArch64 or with -mfpu=neon gcc is smart enough to do the auto-
vectorisation, leading to code almost as good as what I wrote in
1fdf5b7cf8.
This more than halves the total runtime of this function since the
previous commit, from 8.36% to 4.02%, and is most likely memory
bandwidth limited on this specific board now.
I tried to do a SSE2 version as well, but couldn’t find any equivalent
of the LD4/ST4 ARM instruction.
On x86 on a Kaby Lake CPU, this makes it go from 6.63% of the total
execution time (loading some PNGs using the cairo backend) down to
3.20%.
On ARM on a Cortex-A7, on the same workload, this makes it go from 57%
to 8.36%.
Make it use gdk_memory_texture_from_texture().
Also make gdk_memory_format_alpha() privately available so that we can
detect if an image contains an alpha channel.
Also, now make gdk_memory_convert() the only conversion functions
and allow conversions between any 2 formats by going via a float[4].
This could be optimized via fast-paths, but so far it isn't.