In this post I talk about the Codec 2 alpha release, problems with DSP algorithms, some bugs in the voicing estimator, and why speech codec development is tough.
V0.1 Alpha Release
About a month ago I released V0.1 of Codec 2. The response has been amazing. An early release wasn’t my idea – I was tempted to keep messing around with the codec algorithm. However Bruce Perens and others on the Codec 2 mailing list encouraged me to release early. At about the same time I listened to an early MELP simulation. A few samples convinced me that the quality of Codec 2 was already getting close to that of MELP.
So I had a busy few weeks of C coding to get the alpha code into releasable form. It was mainly refactoring, integration, and writing separate encoder and decoder programs. I wasn’t looking at DSP or codec issues. After 20 years of C this sort of coding is easy for me, relaxing even.
Soon after the alpha release came a flood of patches, PayPal and equipment donations, and the project was Slashdotted!
Just after the V0.1 release Bruce presented a cool talk at the 2010 ARRL and TAPR Digital Communications Conference. Here are some Codec 2 slides which explain the project and a little about the codec algorithm. There are some notes under each slide.
An important part of Codec 2 is making speech coding algorithms accessible to everyone, rather than locked up as “secret sauce” in binary blobs or patents. So please feel free to use these slides for presentations on Codec 2 at your local Linux group or Ham Radio club.
Some broad goals for the project are emerging:
- A toll quality codec at 2000 to 4000 bit/s. An open source, free codec that sounds as good as 8000 bit/s G.729 at a fraction of the bit rate.
- A communications quality codec at 1200 to 2400 bit/s. The speech quality should be roughly the same as xMBE and MELP at 1200 to 2400 bit/s.
- A digital radio “mode” for HF and VHF radio applications that combines Codec 2, FEC, and a modem. The target is better speech quality than Single Side Band (SSB) at equivalent SNR.
For the next few months I want to take another look at the codec algorithms, hunt down some bugs, and see if I can improve the quality. In particular I would like to work on voicing estimation and LSP quantisation.
Codec 2 uses a model-based algorithm. Rather than sending the original speech waveform, it fits the incoming speech to a model, then transmits the model parameters. Codec 2 models speech as the sum of many sine waves.
Model parameters include the pitch, the amplitude of each sine wave, and a binary flag called “voicing”.
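As a rough illustration of the sum-of-sine-waves idea, the sketch below builds one frame of speech from sine waves, s(n) = sum over m of A_m cos(m w0 n + phi_m). Placing the sine waves at harmonics of the pitch, the 8 kHz sample rate, the per-harmonic phases, and the function name are all assumptions made for the example; they are not taken from the Codec 2 source.

```python
import math

def synthesise_frame(pitch_hz, amplitudes, phases, n_samples, sample_rate=8000):
    """Sum of sine waves at harmonics of the pitch (illustrative only)."""
    # Fundamental frequency in radians per sample (sample rate is an assumption).
    w0 = 2.0 * math.pi * pitch_hz / sample_rate
    frame = []
    for n in range(n_samples):
        sample = 0.0
        # Harmonic m sits at m times the fundamental.
        for m, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
            sample += amp * math.cos(m * w0 * n + phase)
        frame.append(sample)
    return frame

# One 10 ms frame at 8 kHz: a 100 Hz pitch with three harmonics.
frame = synthesise_frame(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], 80)
```

A decoder built this way only needs the pitch and the amplitude of each sine wave to rebuild a frame, which is why sending those parameters takes far fewer bits than sending the waveform.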
Speech can be broadly separated into voiced sounds (vowels like “aaaaahhh”) and unvoiced sounds (consonants like “ssssss”). In Codec 2 the voicing estimator looks at the speech signal and makes a voiced/unvoiced decision every 10 ms. However, it makes some mistakes.
Mistakes are a common problem with DSP algorithms that process real world signals. You read the scientific papers full of fancy math and the algorithms all sound fantastic. But in practice, with real world signals, they all make mistakes. DSP algorithms are about 20% math, 10% coding and 70% perspiration as you grind through all the real world exceptions.
Echo cancellation is a great example of this principle. The adaptive filters used are described in many books and papers but the devil is in the real world detail. For the Oslec echo canceller we worked through the real world problems using an open source approach of collecting echo samples from alpha testers all around the world.
But back to my voicing estimator problem. Bill Cowley spotted a problem in the “sssh” part of “dissh” in the synthesised speech from the Codec 2 decoder. Here is a plot of the input (top) and Codec 2 output (bottom) waveforms for the “sssh” part of “dissh”:
The output “shh” signal is distorted. Listen to this sample which combines the original and Codec 2 processed samples of “dish”. See if you can hear a difference between the “shh” sounds. One of the problems in speech codec development is hearing small differences in speech samples. In this case the problem (at least to my ear) is more obvious on the plot above than by listening to the samples. It depends a lot on your speakers, the speech you are processing, and your subjective preference.
Are these subtle problems worth tracking down? I think so. Sometimes small problems become more obvious after further processing, or on other speech material. Finding out why these small errors occur leads to a better understanding of the algorithms involved.
To track down this problem I dumped the voicing estimator output to a text file and wrote an Octave script to visualise the voicing decisions:
The voicing estimator (based on the MBE algorithm, summarised at the end of this post) outputs a Signal to Noise Ratio (SNR) in dB. These are plotted as the green crosses along the top of the speech waveform. Voiced speech should have a high SNR, unvoiced speech a low SNR. I apply a threshold to this SNR to obtain the voicing decisions, which are plotted along the bottom. I have used a 4 dB threshold (red line) which sometimes declares unvoiced speech to be voiced (like the shhh in “dish”). This error is causing the spikes on the Codec 2 output waveform.
One alternative is using a higher threshold (green line) but this causes errors in the other direction – when I tested other samples some voiced speech was declared unvoiced. Like many DSP algorithms, the voicing estimation algorithm I am using is not perfect.
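The threshold trade-off can be sketched in a few lines. The SNR values below are made up purely for illustration (they are not measurements from Codec 2), and the 7 dB "higher threshold" is a hypothetical value, since the post does not state the one used for the green line.

```python
def voicing_decisions(snr_db_per_frame, threshold_db):
    """Declare a frame voiced when its SNR (dB) exceeds the threshold."""
    return [snr > threshold_db for snr in snr_db_per_frame]

# Made-up SNR values in dB, one per 10 ms frame (illustration only).
snrs = [12.0, 9.5, 5.0, 4.5, 1.0, 6.5, 11.0]

low = voicing_decisions(snrs, threshold_db=4.0)   # threshold from the post
high = voicing_decisions(snrs, threshold_db=7.0)  # hypothetical higher threshold

for snr, v_low, v_high in zip(snrs, low, high):
    print(f"{snr:5.1f} dB  4 dB: {'V' if v_low else 'U'}  7 dB: {'V' if v_high else 'U'}")
```

Frames with an SNR between the two thresholds flip from voiced to unvoiced when the threshold is raised. Whether that fixes an error or introduces one depends on what the speech in those frames really was, which is why no single threshold suits every sample.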
What to do? Well, I tried another voicing estimator (the auto-correlation function). This had problems with similar areas of speech to the MBE algorithm. Its output (and hence its errors) was correlated with the MBE voicing estimator.
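For readers unfamiliar with auto-correlation voicing measures, here is a minimal sketch of the general idea: a periodic (voiced) frame correlates strongly with a copy of itself delayed by one pitch period, while noise-like (unvoiced) speech does not. The function name, lag range, normalisation, and test signals are assumptions for illustration; this is not the estimator tried for Codec 2.

```python
import math
import random

def autocorr_voicing(frame, min_lag, max_lag):
    """Return (best_lag, peak): the largest autocorrelation in the lag range,
    normalised by the total frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return min_lag, 0.0
    best_lag, best = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        r /= energy
        if r > best:
            best_lag, best = lag, r
    return best_lag, best

# A sine wave with an 80 sample period stands in for voiced speech,
# and seeded Gaussian noise stands in for unvoiced speech.
voiced_like = [math.sin(2.0 * math.pi * n / 80.0) for n in range(320)]
rng = random.Random(1)
noise_like = [rng.gauss(0.0, 1.0) for _ in range(320)]

lag_v, peak_v = autocorr_voicing(voiced_like, 20, 160)
lag_n, peak_n = autocorr_voicing(noise_like, 20, 160)
print(lag_v, round(peak_v, 3), lag_n, round(peak_n, 3))
```

A threshold on the peak would then give a voiced/unvoiced decision. Real speech is much messier than these two test signals, which is where such a measure starts to make mistakes.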
My next attempt is to try some sort of post processing or tracking algorithm. For example the pitch estimate is usually quite stable during voiced speech but jumps around randomly during unvoiced speech. We might be able to use the pitch estimator output to determine if the voicing estimate is correct.
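One way such a post-processing step might look is sketched below: a voiced decision is kept only if the pitch estimates in neighbouring frames are close to each other. The window size, the 15% jump limit, the function names, and the pitch values are all hypothetical. This is an untested idea in code form, not something Codec 2 is known to do.

```python
def pitch_is_stable(pitch_track_hz, index, window=1, max_relative_jump=0.15):
    """True if consecutive pitch estimates near `index` differ by less than
    max_relative_jump (relative to the smaller of the two)."""
    lo = max(0, index - window)
    hi = min(len(pitch_track_hz) - 1, index + window)
    for i in range(lo, hi):
        a, b = pitch_track_hz[i], pitch_track_hz[i + 1]
        if abs(b - a) > max_relative_jump * min(a, b):
            return False
    return True

def refine_voicing(voiced_flags, pitch_track_hz):
    """Keep a voiced decision only where the pitch track is locally stable."""
    return [v and pitch_is_stable(pitch_track_hz, i)
            for i, v in enumerate(voiced_flags)]

# Made-up pitch track (Hz): steady at first, then jumping around.
pitch = [120.0, 122.0, 121.0, 123.0, 310.0, 95.0, 240.0]
flags = [True, True, True, True, True, True, False]
refined = refine_voicing(flags, pitch)
print(refined)
```

Note that the last steady frame (index 3) is also cleared, because its neighbour jumps. A real tracker would need to treat voiced/unvoiced boundaries more carefully than this.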
Testing Speech Codecs is Tough
Normally when we develop a program we have some way of testing it. For example, if I develop a DTMF decoder I can write a program to test the decoder under varying Signal to Noise Ratio (SNR) conditions. However, testing a speech codec is really hard. The ear is an imprecise instrument. I have spent hours listening for small differences between two speech samples processed in slightly different ways. Fatigue sets in after a while and everything sounds the same. People disagree over the same samples. Samples sound different depending on the headphones or loudspeaker used. Loudspeakers tend to hide small differences. Sometimes a profound difference on one day is inaudible the day after.
So visualising the operation of the codec can really help. For example the plots above helped visualise the operation of the voicing estimator and its effect on the output speech. They act like a software oscilloscope for DSP signals.
The MBE voicing estimation algorithm is summarised in section 3.6 of my thesis. For Codec 2 we compare the first 1 kHz to an all-voiced spectrum to obtain a single voiced/unvoiced decision. Also check out the function est_voicing_mbe in the Codec 2 source code.