Open Source Low Rate Speech Codec Part 3 – Phase and Male Speech

Low bit rate sinusoidal codecs (and relatives like MBE) tend to sound different for male and female speakers. Male speech tends to have a reverberant quality, a little bit like the speaker is underwater. You can hear this on some TV news reports delivered from the field via satellite phone.

A good example of this reverberation is the male 2400 bit/s AMBE samples over on the DVSI web site sample page (original, 2400 bit/s), especially the words “frog” and “weeds”, which contain long vowels. Listening through headphones you can hear the reverberation more clearly; it tends to be muffled by loudspeakers.

Over the weekend I performed some interesting experiments that show how our hearing (ear and brain) processes phase information, and how that processing differs between male and female speech. I think these experiments might lead to a better way of synthesising male speech when the bandwidth is not available to transmit phase information.

Phase and Sinusoidal Coding

Sinusoidal codecs model speech as a sum of sine waves at harmonics of the pitch frequency:

s(n) = Σ Aₘ cos(m ω₀ n + θₘ), summed over the harmonics m = 1 … L

where Aₘ and θₘ are the amplitude and phase of the m-th harmonic, ω₀ is the fundamental (pitch) frequency in radians per sample, and L is the number of harmonics that fit beneath half the sampling rate.

However to reduce the bit rate the phases θₘ are usually discarded, then regenerated at the decoder using some sort of rule based approach. It is generally thought that preserving the phases is not important, although I would argue that the phase information is important for high quality speech. For low rate codecs the choice of the rules used to synthesise the phases affects the quality of the synthesised speech.
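To make this concrete, here is a minimal Octave sketch (my illustration, not codec2 or phase.m code) of that round trip: estimate the harmonic amplitudes Aₘ from the DFT of a voiced frame, throw away the measured phases, and resynthesise with a simple rule (all phases zero). The file name speech.wav, the 100Hz pitch and the 40ms frame size are assumptions for the example.

  % Sketch of sinusoidal analysis/synthesis with the measured phases thrown away.
  % Assumes speech.wav is 8 kHz mono voiced speech and that the pitch is known.
  [x, Fs] = audioread("speech.wav");
  f0 = 100;                          % assumed pitch of this frame (Hz)
  Wo = 2*pi*f0/Fs;                   % fundamental in radians/sample
  L  = floor(pi/Wo);                 % harmonics up to Fs/2

  N = 320;                           % 40 ms analysis frame at 8 kHz
  frame = x(1:N)' .* hanning(N)';
  X = fft(frame, 4096);              % zero padded DFT of the frame

  n = 0:N-1;
  y = zeros(1,N);
  for m = 1:L
    bin   = round(m*f0*4096/Fs) + 1; % DFT bin nearest the m-th harmonic
    Am    = abs(X(bin));             % keep the measured amplitude ...
    theta = 0;                       % ... but regenerate the phase by rule (zero here)
    y = y + Am*cos(m*Wo*n + theta);
  end
  y = y/max(abs(y));                 % resynthesised frame with rule-based phases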

An Experiment on Phase Perception

I wrote a simple Octave script called phase.m to test our perception of phase. It generates a train of pulses using the sinusoidal model above with all of the amplitudes set to a constant. Any time domain signal can be expressed as a sum of sinusoids; when we make all of the amplitudes equal we happen to get a train of pulses.

I simulated the pitch of a male speaker (50Hz) and a female speaker (200Hz) by varying ω₀. I then tried two different models for the phase – the first sets all of the phases θₘ to zero, the second uses totally random phases.
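For anyone who wants to reproduce the experiment, here is a minimal Octave sketch along the lines of phase.m (my reconstruction, not the exact script; the 8kHz sample rate, one second duration and output file names are my own choices):

  % Constant-amplitude harmonic series at two pitches, zero versus random phases.
  Fs = 8000;                      % sample rate (Hz)
  N  = Fs;                        % one second of samples
  n  = 0:N-1;

  pitches  = [50 200];            % "male" and "female" fundamentals (Hz)
  names    = {"male50", "female200"};
  suffixes = {"zero", "random"};

  for p = 1:length(pitches)
    Wo = 2*pi*pitches(p)/Fs;      % fundamental in radians/sample
    L  = floor(pi/Wo);            % harmonics up to Fs/2

    for random_phase = [0 1]
      if random_phase
        theta = 2*pi*rand(1,L);   % totally random phases
      else
        theta = zeros(1,L);       % all phases set to zero
      end

      s = zeros(1,N);
      for m = 1:L
        s = s + cos(m*Wo*n + theta(m));   % every harmonic has amplitude 1
      end
      s = s/max(abs(s));          % normalise to avoid clipping

      audiowrite(sprintf("%s_%s.wav", names{p}, suffixes{random_phase+1}), s, Fs);
    end
  end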

Here are plots of the waveforms produced – click on each image to hear what they sound like.

Male 50Hz

Male 50Hz Random Phase

Female 200Hz

Female 200Hz Random Phase

The female sample with random phases sounds nearly the same as the female sample with zero phases, despite a big difference in the time domain waveforms. However the male samples sound very different – the effect of random phases is much more pronounced. This suggests that for low pitched speech we need to take special care with phase, whereas for high pitched speech the ear is less sensitive to phase.

I tried a few different pitches and found that (for my ear) about 125Hz was the crossover point, where it became difficult to perceive large differences between random and non-random phases. This corresponds to a pitch period of 8ms.

The phase of the speech harmonics affects the time domain waveform in the following way. When a bunch of the sinusoidal oscillators all have the same phase at the same time, they reinforce and a pulse is formed. With all of the θₘ set to zero this tends to happen just once per pitch period. However if the θₘ are not carefully chosen there are other points in time where the oscillators come into alignment, and several time domain pulses can form every pitch period.

Why the Ear is Sensitive to Phase for Low Pitch Speakers

I dug up some books on the ear over the weekend [1] to see what they had to say about the ear's perception of pitch and the subject of phase. My theory is that for low pitched speakers the ear is sensitive to pulses in the time domain. With male speakers there is a big gap between pitch pulses (e.g. 20ms for a 50Hz pitch). If the phases are such that other pulses form between the major pitch pulses 20ms apart, the ear can detect them. For high pitched speakers the gap between pitch pulses is already small, and the ear can't resolve another pitch pulse between them.

There could be a physiological reason for this. The cochlea has a bunch of hair cells that vibrate back and forth in time with the sound entering the ear. As the hairs move past a certain point they “fire” a nerve impulse off to the brain. Think about a time domain impulse (say clicking your fingers) entering the ear – it has broadband energy and causes each hair cell to fire. At a low pulse rate all of the nerve cells can fire with each click that enters the cochlea. However, as they are based on chemical processes, there is a limit to the rate at which they can fire – past a certain rate they can't fire on every click. It's like reloading a gun: you need a certain time before you can pull the trigger again. So past a certain pulse rate (i.e. below a certain pitch period) the ear can no longer resolve separate time domain pulses.

Phase and the Open Source Low Rate Codec

So how does this apply to the low rate codec project? Well if we don’t transmit phases we need a rule based approach to reconstruct them at the decoder. The theories in this post suggest we should choose the phases so that for low pitched speakers only one major pulse can form per pitch period. I’ll test this theory over the next few weeks as I quantise the sinusoidal coder further.
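To illustrate what such a rule could look like (a sketch under my own assumptions, not the actual codec2 or AMBE algorithm), the simplest choice that guarantees a single pulse per pitch period is a linear phase ramp, i.e. the same pure delay applied to every harmonic:

  % A minimal rule-based phase model: a linear phase ramp so that the
  % oscillators only align once per pitch period, forming a single pulse.
  Fs = 8000;
  f0 = 50;                       % low pitched (male) example
  Wo = 2*pi*f0/Fs;
  L  = floor(pi/Wo);
  N  = Fs/5;                     % 200 ms of samples
  n  = 0:N-1;

  A  = ones(1,L);                % decoded harmonic amplitudes (constant for this test)
  n0 = 100;                      % chosen pulse position within the frame (samples)

  s = zeros(1,N);
  for m = 1:L
    theta = -m*Wo*n0;            % linear phase: all harmonics peak together at n = n0
    s = s + A(m)*cos(m*Wo*n + theta);
  end
  s = s/max(abs(s));

  audiowrite("rule_based_phase.wav", s, Fs);

In a real decoder the amplitudes Aₘ would come from the decoded bit stream, and the pulse position n0 would be advanced by one pitch period each frame so the pulses stay evenly spaced across frame boundaries.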

Possible Improvements to AMBE

Below are some plots of the LPC residual of the source (morig.wav) and 2400 bit/s AMBE coded (m2400.wav) speech for the word “frog”:

LPC Residual of Original

LPC Residual of 2400 bit/s AMBE

The LPC residual is the signal left over after LPC inverse filtering. LPC filtering tends to remove the effects of the spectral envelope (effectively stripping off the harmonic amplitudes Aₘ), ideally leaving a train of impulses, much like our test sequences above.
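Here is a rough Octave sketch of how plots like these can be generated (my reconstruction; the 10th order LPC, analysis length and single long analysis window are simplifications for illustration):

  % LPC analysis and inverse filtering to expose the excitation pulse train.
  [x, Fs] = audioread("morig.wav"); % or "m2400.wav" for the AMBE coded speech
  x = x(:)';                        % row vector, assumed 8 kHz mono

  p = 10;                           % LPC order
  frame = x(1:min(2*Fs, length(x)));% first couple of seconds for simplicity

  % autocorrelation method of LPC analysis
  R = zeros(1, p+1);
  for k = 0:p
    R(k+1) = sum(frame(1:end-k) .* frame(1+k:end));
  end
  a = [1, -(toeplitz(R(1:p)) \ R(2:p+1)')'];  % analysis filter A(z)

  residual = filter(a, 1, frame);   % inverse filtering strips the spectral envelope
  plot(residual); xlabel("sample"); ylabel("amplitude"); title("LPC residual");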

For voiced speech MBE uses a very similar sum-of-sinusoids model to our sinusoidal codec. Now the theory above suggests the reverberation is caused by misplaced pulses in the time domain, due to inappropriate selection of the sinusoid phases.

In the original signal you can see a relatively smooth train of impulses at the pitch period, with amplitudes of 2000-4000. However in the MBE coded signal you can see several large pulses that are out of step with the underlying pulse train. The theory I discussed above suggests that these pulses are the cause of the reverberant speech – unwanted pitch pulses due to inappropriate selection of phases during synthesis.

The good news is that the male speech quality could possibly be improved without changing the bit stream, by choosing a better phase reconstruction algorithm at the decoder. This means a firmware upgrade could improve speech quality without affecting interoperability.

Links

[1] Douglas O’Shaughnessy, “Speech Communication – Man and Machine”, Addison-Wesley 1990, page 146.
[2] Open Source Low Rate Codec Part 1
[3] Open Source Low Rate Codec Part 2
[4] Codec2 Web Page

11 thoughts on “Open Source Low Rate Speech Codec Part 3 – Phase and Male Speech”

  1. In your plots of generated waveforms for male and female speech, the second one is mislabeled.
    Your posts on speech codecs are interesting and I’m just about managing to follow with my college-level signal processing knowledge!

  2. Thanks Justyn for that correction. Don’t worry – I am just managing to follow the posts myself! Helps to write about this stuff as I go.

  3. Hi David,

    What would actually be nice, if you ever decided to transmit the phase, is to only transmit time deltas or something like that. That way, if there was a transmission error, the phase would still be consistent in time and the quality would be no worse than if the phase hadn’t been transmitted.

  4. Would it be possible to analyze how the decoder will reconstruct the waveform on the encoder side, and adjust the LPC quantization to avoid these spikes? I haven’t worked through the math, so I’m not sure how practical that would be. But it seems that any rule-based method for reconstructing the phases is sure to be wrong some of the time, and this would allow the encoder to compensate for it, again without changing the bitstream (or even requiring a firmware upgrade for receivers).

  5. The MBE codec doesn’t actually use LPC – I was just using LPC filtering to analyse the signal. In codecs like the one I am working on, which do use the LPC phase spectrum as part of the phase model, what you are suggesting might be possible.

  6. Just for reference, your pitches are a little low: 400Hz is pretty near the lower end of the normal female range (with some exceptions) and 100 Hz is the most common lower end for males (with some exceptions down to as low as 50Hz).

    You also might find better fidelity by modifying the generated waveform to include all the harmonics but produce a greatly reduced peak to average ratio. That will happen anyway in the (possibly very “cheap”) components following the generator. And you might as well preempt the issue so that what you generate is more likely to be what is heard.

    You also may have to pass along the phase information for the first octave or two above the base frequency of the voice. Of course, if you do you may find you are reinventing MELP and will run afoul of the patent.

    I’d also suggest a look at pitch modification technology. MELP seems to be fairly closely related to Fourier transform based pitch modification with a very sparse collection of Fourier components actually used.

    {^_^}

  7. Hi Joanne – The text I have in front of me (Ref [1] above) suggests average F0 values for males and females of 132 and 232 Hz, with a 2 octave range (e.g. males from 80 to 160Hz). Yes, 50Hz is on the low side for males, but one of my test samples hits this and lower (hts1a on the previous blog post).

    Yes reducing the peak-average ratio seems to be a big factor in natural sounding speech. Another way of looking at this is adjusting the harmonic phases to smear the onset of each glottal pulse a little in time.

    Yes, the low order phases appear more important than the high frequency ones, so transmitting them is one approach. However I think it’s possible to come up with a rule based approach that generates phase information that sounds close to the original.

    I am currently working on a rule based phase model that doesn’t need to transmit any phases – it uses the phase information from an LPC synthesis filter. The algorithm is explained in the phase.c source code. This generates speech that sounds quite close to speech synthesised with the original phases. It works well on most speakers.

    Re patents – MELP didn’t invent the idea of transmitting harmonic phases (that idea reaches back to the 1970’s), although I imagine they have patented some novel twist on the idea. I have some thoughts on codec patents on my codec2 page.

    Cheers,

    David

  8. Regarding the pitches, I’m thinking in terms of measured pitches using spectrograms. Art Bell has one of the lowest pitched voices around, and he only gets down to about 70Hz. And that’s not filtering hiding a lower pitch – otherwise that lower pitch would appear as he runs up an octave. He has a well modulated voice.

    As a side note, if they try, most humans discover their natural vocal range is about two octaves, without getting into vocal cord ruining falsetto or attempting the nice warm low pitched voice that left Walter Cronkite virtually speechless until he got some therapy for it. Of course, a head cold bass can get down well below 80 Hz, and even below 50Hz at times.

    Play with the spectrogram waterfall as in the typical PSK tool. WinPSK has a particularly nice tool with some degree of calibration across its span. It’s fun to watch the speech frequencies of some of the “dorks” who try to achieve the “Voice of God” tonality. They never really get it down to 50Hz without a subharmonic generator.

    (I knew a prominent speech therapist here in the LA area for awhile. He was fascinating to converse with.)

    {^_-}

  9. Steve, that’s below a Russian bass (basso profondo) range for singing. That is about the bottom of the voice range that projects at all well. The normal bass bottoms out around 80Hz for singing. For “Art Belling” it is possible to misuse the voice to get below these frequencies if you close-mic. But dipping below 50Hz on ham radio is something I’ve never seen. And I measure both by the lowest spectral line and by the frequency difference between spectral lines in the 300-400Hz range.

    And I do note a brain fart in the first message I posted. The bottom end of most good sounding female voices is maybe 150-200 Hz. I had it up twice that. (A real contralto can reach 130 Hz. But they tend to get husky voiced down there.)

    Note that I do like deep voices in men. So I pay attention to them. And even some men who fancy they have very deep sub-DC voices have, so far, never measured deeper than 65 Hz. This is monitoring with very narrow FFT bands so I can see nice spectral peaks. Voices do have a “low Q” quality to them which spreads their spectrum somewhat. I’m taking the pitch as the peak in the middle of that spread, not the bottom end of the low-Q tone.

    (My partner works on professional audio software as a sideline. I tend to work on professional video software more than audio. So I have a lot of equipment, test and production, at my disposal for making these kinds of tests.)

    {^_^}
