This will make it easier to handle HRTF data sets that have separate left and
right ear responses. Will need an mhr version update to take advantage of that.
This improves fading between HRIRs as sources pan around. In particular, it
improves the issue with individual coefficients having various rounding errors
in the stepping values, as well as issues with interpolating delay values.
It does this by doing two mixing passes for each source. First using the last
coefficients that fade to silence, and then again using the new coefficients
that fade from silence. When added together, it creates a linear fade from one
to the other. Additionally, the gain is applied separately so the individual
coefficients don't step with rounding errors. Although this does increase CPU
cost since it's doing two mixes per source, each mix is a bit cheaper now since
the stepping is simplified to a single gain value, and the overall quality is
improved.
Unsigned 32-bit offsets actually have some potential overhead on 64-bit targets
for pointer/array accesses due to rules on integer wrapping. No idea how much
impact it has in practice, but it's nice to be correct about it.
This should improve positional quality for relatively low cost. Full HRTF
rendering still only uses first-order since the only use of the dry buffer
there is for first-order content (B-Format buffers, effects).
It still fades between HRIRs when it changes, but now it selects the nearest
one instead of blending the nearest four. Due to the minimum-phase nature of
the HRIRs, interpolating between delays lead to some oddities which are
exasperated by the fading (and the fading is needed to avoid clicks and pops,
and smooth out changes).
Last time this attempted to average the HRIRs according to their contribution
to a given B-Format channel as if they were loudspeakers, as well as averaging
the HRIR delays. The latter part resulted in the loss of the ITD (inter-aural
time delay), a key component of HRTF.
This time, the HRIRs are averaged similar to above, except instead of averaging
the delays, they're applied to the resulting coefficients (for example, a delay
of 8 would apply the HRIR starting at the 8th sample of the target HRIR). This
does roughly double the IR length, as the largest delay is about 35 samples
while the filter is normally 32 samples. However, this is still smaller the
original data set IR (which was 256 samples), it also only needs to be applied
to 4 channels for first-order ambisonics, rather than the 8-channel cube. So
it's doing twice as much work per sample, but only working on half the number
of samples.
Additionally, since the resulting HRIRs no longer rely on an extra delay line,
a more efficient HRTF mixing function can be made that doesn't use one. Such a
function can also avoid the per-sample stepping parameters the original uses.
There were phase issues caused by applying HRTF directly to the B-Format
channels, since the HRIR delays were all averaged which removed the inter-aural
time-delay, which in turn removed significant spatial information.
It's possible to calculate HRTF coefficients for full third-order ambisonics
now, but it's still not possible to use them here without upmixing first-order
content.
This adds the ability to directly decode B-Format with HRTF, though only first-
order (WXYZ) for now. Second- and third-order would be easilly doable, however
we'd need to be able to up-mix first-order content (from the BFORMAT2D and
BFORMAT3D buffer formats) since it would be inappropriate to decode lower-order
content with a higher-order decoder.
The sound localization with virtual channel mixing was just too poor, so while
it's more costly to do per-source HRTF mixing, it's unavoidable if you want
good localization.
This is only partially reverted because having the virtual channel is still
beneficial, particularly with B-Format rendering and effect mixing which
otherwise skip HRTF processing. As before, the number of virtual channels can
potentially be customized, specifying more or less channels depending on the
system's needs.
This new method mixes sources normally into a 14-channel buffer with the
channels placed all around the listener. HRTF is then applied to the channels
given their positions and written to a 2-channel buffer, which gets written out
to the device.
This method has the benefit that HRTF processing becomes more scalable. The
costly HRTF filters are applied to the 14-channel buffer after the mix is done,
turning it into a post-process with a fixed overhead. Mixing sources is done
with normal non-HRTF methods, so increasing the number of playing sources only
incurs normal mixing costs.
Another benefit is that it improves B-Format playback since the soundfield gets
mixed into speakers covering all three dimensions, which then get filtered
based on their locations.
The main downside to this is that the spatial resolution of the HRTF dataset
does not play a big role anymore. However, the hope is that with ambisonics-
based panning, the perceptual position of panned sounds will still be good. It
is also an option to increase the number of virtual channels for systems that
can handle it, or maybe even decrease it for weaker systems.
At 0 distance from the listener, the sound is omni-directional. As the source
and listener become 'radius' units apart, the sound becomes more directional.
With HRTF, an omni-directional sound is handled using 0-delay, pass-through
filter coefficients, which is blended with the real delay and coefficients as
needed to become more directional.