Low bit rate sinusoidal codecs (and relatives like MBE) tend to sound different for male and female speakers. Males tend to have a reverberant sound, a little bit like they are underwater. You can hear this on some TV news reports delivered from the field via satellite phones.
A good example of this reverberation is the male 2400 bit/s AMBE samples over on the DVSI web site sample page (original, 2400 bit/s), especially the words “frog” and “weeds”, which contain long vowels. Listening through headphones you can hear the reverberation more clearly; it tends to be muffled by loudspeakers.
Over the weekend I performed some interesting experiments that show how our hearing (ear and brain) processes phase information, and how that processing differs for male and female speech. I think these experiments might lead to a better way of synthesising male speech when the bandwidth is not available to transmit phase information.
Phase and Sinusoidal Coding
Sinusoidal codecs model speech as a sum of harmonically related sine waves:

s(n) = Σ Am cos(m ω0 n + θm), summed over m = 1 … L

where ω0 is the fundamental (pitch) frequency, L is the number of harmonics, and Am and θm are the amplitude and phase of the m-th harmonic.
However, to reduce the bit rate, the phase of each sine wave is usually discarded, then regenerated at the decoder using some sort of rule-based approach. It is generally thought that preserving the phases is not important, although I would argue that the phase information is important for high quality speech. For low rate codecs, the choice of the rules used to synthesise the phases affects the quality of the synthesised speech.
An Experiment on Phase Perception
I wrote a simple Octave script called phase.m to test perception of phase. This generates a train of pulses using the sinusoidal model above with all of the amplitudes set to a constant. Any time domain signal can be expressed as a series of sinusoids, when we make all of the amplitudes equal we happen to get a train of pulses.
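phase.m is an Octave script; here is a rough Python/NumPy equivalent of the experiment for readers who want to try it. The sample rate, duration, and normalisation below are my own assumptions, not values taken from phase.m:

```python
import numpy as np

def pulse_train(f0_hz, phases, fs=8000, dur=0.5):
    """Sum-of-sinusoids synthesis with equal-amplitude harmonics of f0.
    With all the amplitudes equal, the result is a train of pulses."""
    n = np.arange(int(fs * dur))
    w0 = 2 * np.pi * f0_hz / fs
    s = np.zeros(len(n))
    for m in range(1, len(phases) + 1):
        s += np.cos(m * w0 * n + phases[m - 1])
    return s / len(phases)          # normalise to roughly unit peak

fs = 8000
rng = np.random.default_rng(0)

def num_harmonics(f0_hz):           # harmonics that fit below Nyquist
    return int((fs / 2) / f0_hz) - 1

# male-like 50 Hz pitch: zero phases vs totally random phases
L50 = num_harmonics(50)
male_zero = pulse_train(50, np.zeros(L50))
male_rand = pulse_train(50, rng.uniform(-np.pi, np.pi, L50))

# female-like 200 Hz pitch: same two phase models
L200 = num_harmonics(200)
female_zero = pulse_train(200, np.zeros(L200))
female_rand = pulse_train(200, rng.uniform(-np.pi, np.pi, L200))
```

Writing the four signals out as WAV files and listening to them reproduces the comparison described below; with zero phases all the harmonics reinforce once per pitch period, so the waveform has a much sharper peak than the random-phase version.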
I simulated the pitch of a male speaker (50 Hz) and a female speaker (200 Hz) by varying the fundamental frequency ω0. I then tried two different models for the phase – the first set all of the phases to 0, the second used totally random phases.
Here are plots of the waveforms produced – click on each image to hear what they sound like.
Male 50Hz Random Phase
Female 200Hz Random Phase
The female with random phases sounds nearly the same as the female with all phases set to zero, despite a big difference in the time domain waveforms. However the male samples sound very different – the effect of random phases is much more pronounced. This suggests that for low pitched speech we need to take special care with phase, whereas for high pitched speech the ear is less sensitive to phase.
I tried a few different pitches and found that (for my ear) about 125 Hz was the cross-over point where it became difficult to perceive large differences between random and non-random phases. This corresponds to a pitch period of 8 ms.
The phase of the speech harmonics affects the time domain waveform in the following way. When a bunch of the sinusoidal oscillators all have the same phase at the same time, they reinforce each other and a pulse is formed. With all of the phases θm set to zero this tends to happen just once per pitch period. However if the θm are not carefully chosen, there are other points in time where the oscillators come into the same phase, and several time domain pulses can form every pitch period.
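We can put a rough number on this effect by counting how many pulses form per pitch period under each phase model. The sketch below calls anything above half the global peak a “pulse” – that threshold is my own arbitrary choice, purely for illustration:

```python
import numpy as np

fs, f0, dur = 8000, 50.0, 0.5
period = int(fs / f0)                      # 160 samples = 20 ms
w0 = 2 * np.pi * f0 / fs
L = int((fs / 2) / f0) - 1                 # harmonics below Nyquist
n = np.arange(int(fs * dur))

def synth(phases):
    """Equal-amplitude sum of harmonics with the given phases."""
    return sum(np.cos(m * w0 * n + phases[m - 1])
               for m in range(1, L + 1)) / L

def pulses_per_period(s):
    """Count local maxima of |s| above half the global peak,
    averaged over the number of pitch periods."""
    a = np.abs(s)
    peaks = (a[1:-1] > a[:-2]) & (a[1:-1] >= a[2:]) & (a[1:-1] > 0.5 * a.max())
    return peaks.sum() / (len(s) / period)

zero_count = pulses_per_period(synth(np.zeros(L)))
rand_count = pulses_per_period(
    synth(np.random.default_rng(1).uniform(-np.pi, np.pi, L)))
```

With zero phases the count comes out at about one pulse per pitch period; with random phases several comparably sized pulses appear in every period, which is exactly the structure the ear seems to object to in low pitched speech.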
Why the Ear is Sensitive to Phase for Low Pitch Speakers
I dug up some books on the ear over the weekend to see what they had to say about the ear's perception of pitch and the subject of phase. My theory is that for low pitched speakers the ear is sensitive to pulses in the time domain. With male speakers there is a big gap between pitch pulses (e.g. 20 ms for a 50 Hz pitch). If the phases are such that other pulses can form between the major 20 ms pitch pulses, then the ear can detect them. For high pitched speakers the gap between pitch pulses is already small, and the ear can't resolve another pitch pulse between them.
There could be a physiological reason for this. The cochlea has a bunch of hair cells that vibrate back and forth in time with the sound entering the ear. As the hairs move past a certain point they “fire” a nerve impulse off to the brain. Think about a time domain impulse (say clicking your fingers) entering the ear – it will have broadband energy and cause each hair cell to fire. At a low pulse rate all of the nerve cells can fire with each click that enters the cochlea. However, as they are based on chemical processes, there is a limit to the rate at which they can fire – past a certain rate they can't fire on every click. It's like reloading a gun: you need a certain time before you can pull the trigger again. So past a certain pulse rate (pitch period) the ear can no longer resolve separate time domain pulses.
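The “reloading a gun” idea can be illustrated with a toy model. The 10 ms refractory time below is a made-up number for illustration, not a physiological measurement:

```python
def firing_times(click_times_ms, refractory_ms=10.0):
    """Toy nerve fibre: it fires on a click only if at least refractory_ms
    has elapsed since it last fired; otherwise the click goes unregistered."""
    fired, last = [], float("-inf")
    for t in click_times_ms:
        if t - last >= refractory_ms:
            fired.append(t)
            last = t
    return fired

# 50 Hz pitch: clicks 20 ms apart -> every click is resolved
male_clicks = [20.0 * k for k in range(10)]
# 200 Hz pitch: clicks 5 ms apart -> only every second click registers
female_clicks = [5.0 * k for k in range(40)]
```

In this toy model the fibre keeps up with the 50 Hz click train but can only follow every other click at 200 Hz, so an extra, misplaced pulse squeezed between the 200 Hz clicks would go unnoticed – consistent with the listening results above.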
Phase and the Open Source Low Rate Codec
So how does this apply to the low rate codec project? Well if we don’t transmit phases we need a rule based approach to reconstruct them at the decoder. The theories in this post suggest we should choose the phases so that for low pitched speakers only one major pulse can form per pitch period. I’ll test this theory over the next few weeks as I quantise the sinusoidal coder further.
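One candidate rule, sketched below, is to give every harmonic a linear phase θm = -m·ω0·n0 for some pulse position n0: then all the harmonics come into phase at n0 (and once per pitch period thereafter) and nowhere else. This is just one possible rule for illustration, not necessarily what the codec will end up using, and n0 here is an arbitrary choice:

```python
import numpy as np

fs, f0 = 8000, 50.0
w0 = 2 * np.pi * f0 / fs
L = int((fs / 2) / f0) - 1       # harmonics below Nyquist
n0 = 37                          # arbitrary pulse position (assumption)

n = np.arange(int(0.2 * fs))
# theta_m = -m*w0*n0 turns each term into cos(m*w0*(n - n0)), so every
# harmonic peaks together at n = n0 and exactly once per pitch period after
s = sum(np.cos(m * w0 * (n - n0)) for m in range(1, L + 1)) / L
```

The resulting waveform has a single dominant pulse per pitch period, which is the structure the theory says low pitched speech needs.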
Possible Improvements to AMBE
LPC Residual of Original
LPC Residual of 2400 bits/s AMBE
The LPC residual is the signal left over after LPC filtering. LPC filtering tends to remove the effects of the spectral envelope (effectively stripping off the amplitudes of the harmonics), ideally leaving a train of impulses, much like our test sequences above.
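For readers who haven't met LPC residuals, here is a self-contained sketch: estimate an all-pole filter from the signal and inverse filter to recover the excitation. The synthetic “speech”, resonance frequencies, pole radii, and LPC order below are arbitrary choices for illustration, not values from either codec:

```python
import numpy as np

fs, order = 8000, 10

# synthetic voiced 'speech': a 50 Hz pulse train coloured by two
# arbitrary resonances (not real formant measurements)
x = np.zeros(4000)
x[::160] = 1.0                                  # one pulse per 20 ms
poles = [0.97 * np.exp(2j * np.pi * 700 / fs),
         0.95 * np.exp(2j * np.pi * 1800 / fs)]
a_true = np.real(np.poly(poles + [p.conjugate() for p in poles]))

speech = np.zeros_like(x)
for i in range(len(speech)):                    # all-pole filter 1/A(z)
    acc = x[i]
    for k in range(1, len(a_true)):
        if i - k >= 0:
            acc -= a_true[k] * speech[i - k]
    speech[i] = acc

# autocorrelation-method LPC analysis: solve R a = r for the predictor
r = np.array([speech[: len(speech) - k] @ speech[k:] for k in range(order + 1)])
R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
a_lpc = np.linalg.solve(R, r[1:])

# inverse filter A(z) = 1 - sum a_k z^-k to obtain the residual
A = np.concatenate(([1.0], -a_lpc))
residual = np.convolve(speech, A)[: len(speech)]
```

The residual comes out as a train of sharp pulses at the pitch period – the extra, out-of-step pulses visible in the AMBE residual plot are what this sketch would not produce from cleanly phased speech.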
For voiced speech MBE uses a very similar sum-of-sinusoids model to our sinusoidal codec. Now the theory above suggests the reverberation is caused by misplaced pulses in the time domain, due to inappropriate selection of the sinusoid phases.
In the original signal you can see a relatively smooth train of impulses at the pitch period, with amplitudes of 2000-4000. However in the MBE coded signal you can see several large pulses that are out of step with the underlying pulse train. The theory I discussed above would suggest that these pulses are the cause of the reverberant speech – unwanted pitch pulses due to inappropriate selection of phases during synthesis.
The good news is that the male speech quality could possibly be improved without changing the bit stream, by choosing a better phase reconstruction algorithm at the decoder. This means a firmware upgrade could improve speech quality without affecting interoperability.