This post describes a post filter that significantly improves the quality of Codec 2.
For the past month I have been working on developing a high quality, higher bit rate version of Codec 2. The target is 4000 bit/s with speech quality similar to today's 8000 bit/s codecs. The application is VOIP, rather than HF digital radio.
Codec 2 represents speech as a sum of sine waves. In C code form:
for(m=1; m<=L; m++) s[n] += A[m]*cos(Wo*m*n + phi[m]);
A[m] is the amplitude of sinusoid m, phi[m] is its phase, Wo is the fundamental (pitch) frequency in radians per sample, and n is the sample index. L is the number of sine waves, which can range from about 10 to 80. To further complicate my life, L varies from frame to frame with the pitch of the speech.
The key to high quality is how to handle the phases. At 2400 bit/s and below, the phases are discarded and synthesised using a rule based approach at the decoder. Previous tests had shown that this produced a significant drop in quality compared to using the original phases. So I figured the key to a high quality version of Codec 2 was working out how to quantise and transmit the phases at a reasonable bit rate.
This is hard work - more "research" than development. I come up with a theory, write some code to test the idea (i.e. perform an experiment), then usually it fails. I (hopefully) learn a little bit from the failure and try again. When this happens for weeks on end it gets frustrating.
However I do think it's fascinating that an individual like me can "do science" in a home office with just a laptop, using the scientific process to find little bits of new knowledge that can then be disseminated through blogging and publishing the source code in open source form.
I have had a few wins, including developing some interesting vector quantisation techniques for the phases and even amplitudes. However it was a case of interesting science not producing useful engineering. The techniques partially worked but when combined the speech quality was too rough, or the bit rate too high. I'll write these up in a blog post some day.
LPC and Phase Model Interactions
Last week I returned to the phase model currently used at 2400 bit/s and below. I started looking at why the quality dropped off with this phase model.
Turns out the phase model alone actually produces reasonable quality speech, comparable with some of my better phase quantisation efforts of the last few weeks. The quality drop occurs when the phase model is combined with the LPC amplitude model. However the LPC amplitude model alone (i.e. with original phases) also sounds OK. So it was some sort of combination of the phase model and LPC amplitude modelling. Some sort of interaction.
Sidebar: the LPC amplitude model is used to represent the variable number of amplitudes, A[m] above, with 10 Linear Prediction Coefficients (LPCs). LPC is commonly used in other time domain codecs like G.729, Speex etc, and has nice properties for compressing and transmitting speech spectral information over a channel. This earlier post has some more information on LPCs and Codec 2.
So I started thinking about why the LPC and phase models would interact. I could hear distortion in one part of a sample so I zoomed in on that using my Octave scripts:
The red line is the original amplitudes A[m], the green line is the amplitudes we obtain through LPC modelling (all plotted on a dB scale). The green line is what the decoder uses to reconstruct the speech. The purple line is the difference, the error in the LPC modelling. Along the bottom axis is the frequency in Hz. Our ear can't hear some errors, for example the large error above 3500 Hz is at a low absolute level and probably knocked out by the D/A hardware.
We perceive speech based on "peaks" of high energy in the speech spectrum. The frame above is 20ms from the "a" vowel in the word "navy". Our ears, combined with our brains, interpret this spectrum as an "a". Energy in other regions of the spectrum can interfere with that - for example if, due to coding errors or background noise, energy turns up in the spectral valleys between the peaks, the speech is harder to understand or unpleasant.
Reducing LPC modelling errors
In the figure above you can see that the LPC model has made about a 10dB error in the spectral valley around 1000 Hz. I developed a theory that this error in the amplitude modelling, when combined with the phase model (which is also an approximation of the actual phases), causes the speech quality to drop. BTW this was just one of many theories I explored; the rest were disappointing duds. At the time you have no idea which idea will work.
I ran a quick experiment to test this idea. I artificially lowered the amplitude in the spectral valleys by 3dB, and raised the amplitude of the formants by 3dB. This actually sounded pretty good - a definite improvement. Yaaay, a rare win! As a permanent fix I implemented a post filter, something I had read about for CELP codecs but not tried with Codec 2 before. This plot shows how it works:
The red line is the LPC spectrum, which we use to model the amplitude samples A[m]. We process the LPC information to develop the LPC post filter, which is the blue line at the bottom. This filter enhances the spectral peaks, and suppresses the spectral valleys. The green dotted line shows the final LPC spectrum after post filtering. I also added a 3dB boost to the first 1000 Hz, as I thought the post filtered speech needed a bit more "bass". The discontinuity at 1000 Hz doesn't seem to introduce any problems (yet).
Here are some LPC post filtered samples, contrasted with the current 2400 bit/s codec and some reference 8000 bit/s codecs. The LPC post filtered samples aren't fully quantised and are running at a 10ms frame rate. Once quantised I anticipate the same quality at 4000 bit/s, and eventually at 2400 bit/s. The LPC post filter can also be applied to the lower bit rate modes. I'll clean up and release the code in the next few weeks.
|Codec 2 2400 bit/s||LPC Post Filter||8000 bit/s Codec|
I've focussed mainly on males as low pitched voices get distorted more by the phase models used in Codec 2. The mmt1 sample has periodic background noise; it's interesting to see how codecs handle this non-speech signal.
To my ears Codec 2 with the LPC Post Filter sounds pretty good through small loudspeakers, but some distortion is still evident through high quality headphones or loudspeakers. My target applications (digital radio and VOIP through telephone handsets) use small speakers. To my ear Codec 2 with the LPC Post Filter is comparable with the 8000 bit/s reference samples.
The LPC post filter uses no extra bits, so can be added to any of the Codec 2 modes with zero impact on the bit rate. There are some constants in the post filter. For Ham radio applications these could be hooked up to sliders on a GUI as "tone" controls. Try doing that (tweaking codec performance on the fly) with closed source codecs...
I should mention another post filter I developed a few years back to make Codec 2 work better with background noise. This messes with the phases directly rather than the amplitudes. I am calling the new post filter described in this post an "LPC post filter" but yes I am going to have to work out a decent nomenclature for this. It's getting confusing.
I was sitting at Adelaide Airport the other day having a coffee and doing a little work while waiting for my flight. I processed some samples, could hear a few differences, then did some more coding. A little while later I listened again and couldn't hear any differences. Huh? Then I looked around and realised the gate next to me had filled with people boarding their flight. The ambient background noise had increased from when I first sat down. This noise was masking the differences I was attempting to hear.
Take this wave file, copy it across to your smart phone (or a laptop). Now try playing it using your media player through the phone's little loudspeaker. Hear the difference? Now repeat in a noisy environment. You will notice that when there is background noise, it gets harder to tell the samples apart. The background noise tends to mask the little errors the codec makes.
I spend my days going slightly crazy listening to small differences between samples in quiet rooms. The idea is that if you remove all of the small differences, the overall codec will be better. However the type of loudspeaker and background noise also have a huge impact on perceived quality.