I’ve been working on the Linear Predictive Coding (LPC) modeling used in the Codec 2 700 bit/s mode to see if I can improve the speech quality. Given this mode was developed in just a few days I felt it was time to revisit it for some tuning.
LPC fits a filter to the speech spectrum. We update the LPC model every 40ms for Codec 2 at 700 bit/s (10 or 20ms for the higher rate modes).
Speech codecs typically use a 10th order LPC model. This means the filter has 10 coefficients, and every 40ms we have to send them to the decoder over the channel. For the higher bit rate modes I use about 37 bits/frame for this information, which is the majority of the bit rate.
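To put that 37 bits/frame in perspective, here is a quick back-of-the-envelope check of the total bit budget per frame. It is plain arithmetic on the figures above; the function name is just something I made up for illustration.

```python
def bits_per_frame(bit_rate, frame_ms):
    """Total number of bits available in one frame at a given bit rate."""
    return bit_rate * frame_ms / 1000.0

# 700 bit/s with a 40ms frame update (figures from the text above)
budget_700 = bits_per_frame(700, 40)
print(budget_700)  # 28.0
```

So the entire 700 bit/s frame has fewer bits than the roughly 37 the higher rate modes spend on the LPC information alone, which is why shrinking the LPC model matters so much here.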
However I discovered I can get away with a 6th order model, if the input speech is filtered the right way. This has the potential to significantly reduce the bit rate.
Our ear perceives speech based on the frequency of peaks in the speech spectrum. When the peaks in the speech spectrum are indistinct, we have trouble understanding what is being said. The speech starts to sound muddy. With analog radio like SSB (or in a crowded room), the troughs between the peaks fill with noise as the SNR degrades, and eventually we can’t understand what’s being said.
The LPC model is pretty good at representing peaks in the speech spectrum. With a 10th order LPC model (p=10) you get 10 poles. Each pair of poles can represent one peak, so with p=10 you get up to 5 independent peaks, with p=6, just 3.
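As a small sketch of the "one pair of poles, one peak" idea, the Python below builds a p=4 all-pole filter from two pole pairs and looks for peaks in its spectrum. The peak frequencies (550 and 1100Hz), the pole radius of 0.95, and all of the function names are my own choices for illustration, not anything taken from Codec 2.

```python
import cmath
import math

FS = 8000  # sample rate in Hz, matching the 8kHz speech samples used in this post

def pole_pair(freq_hz, radius):
    """Coefficients [1, a1, a2] of a 2nd order section with one complex pole pair."""
    theta = 2.0 * math.pi * freq_hz / FS
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]

def poly_mul(p, q):
    """Multiply two polynomials in z^-1 (cascades two filter sections)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pc in enumerate(p):
        for j, qc in enumerate(q):
            out[i + j] += pc * qc
    return out

def envelope_db(a, freq_hz):
    """Magnitude in dB of the all-pole filter 1/A(z) at freq_hz."""
    w = 2.0 * math.pi * freq_hz / FS
    A = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
    return -20.0 * math.log10(abs(A))

def peak_freqs(a, step_hz=10):
    """Frequencies (on a coarse grid) where the envelope has a local maximum."""
    freqs = list(range(0, FS // 2 + 1, step_hz))
    env = [envelope_db(a, f) for f in freqs]
    return [freqs[i] for i in range(1, len(env) - 1)
            if env[i - 1] < env[i] > env[i + 1]]

# Two pole pairs gives a 4th order model (p=4) with two spectral peaks
a = poly_mul(pole_pair(550, 0.95), pole_pair(1100, 0.95))
peaks = peak_freqs(a)
print(len(a) - 1, peaks)
```

The peaks land close to, but not exactly on, the pole frequencies, because each pole pair tilts the spectrum a little under its neighbour. A third peak would need a third pole pair, which is the whole budget at p=6.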
I discovered that LPC has some problems if the speech spectrum has big differences between the low and high frequency energy. To find the LPC coefficients, we use an algorithm that minimises the mean square error. It tends to “throw poles” at the highest energy part of the signal (frequently near DC), while ignoring the still important, lower energy peaks at higher frequencies above 1000Hz. So there is a mismatch between the way LPC analysis works and how our ears perceive speech.
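For anyone who wants to see what "minimises the mean square error" looks like in practice, here is a minimal sketch of the textbook autocorrelation method with the Levinson-Durbin recursion. I'm assuming that's a fair stand-in for the analysis step; it is not lifted from the Codec 2 source, and the test signal (the impulse response of a single resonance at 1000Hz) is made up for the example.

```python
import math

def autocorrelate(x, order):
    """Autocorrelation values R[0..order] of one frame of samples."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[0..order] (a[0] = 1) that minimise the mean
    square prediction error e[n] = x[n] + sum(a[j] * x[n - j]), plus that error."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # error shrinks with each added pole
    return a, err

# Hypothetical test signal: impulse response of one pole pair at 1000Hz
fs = 8000
radius, peak_hz = 0.9, 1000.0
true_a1 = -2.0 * radius * math.cos(2.0 * math.pi * peak_hz / fs)
true_a2 = radius * radius
x = []
for n in range(200):
    s = 1.0 if n == 0 else 0.0
    if n >= 1:
        s -= true_a1 * x[n - 1]
    if n >= 2:
        s -= true_a2 * x[n - 2]
    x.append(s)

a, err = levinson_durbin(autocorrelate(x, 2), 2)
print(a, err)
```

With a signal this clean, the p=2 fit recovers the resonance almost exactly. Note the error being minimised is in linear energy terms, so a peak 40dB down contributes almost nothing to it. That is the mismatch with our ears, which still care about that peak.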
For example I found that samples like hts1a and ve9qrp code quite well, but cq_ref and kristoff struggle. The former have just 12dB between the LF and HF parts of the speech spectrum, the latter 40dB. This may be due to microphones, input filtering, or analog shaping.
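Here is one rough way to put a number on that LF/HF difference for a frame of speech. This is only a sketch: the 1000Hz split point and the function name are my assumptions, and the two-tone test signal is synthetic rather than one of the samples above.

```python
import cmath
import math

def lf_hf_gap_db(x, fs, split_hz):
    """Difference in dB between the energy below and above split_hz,
    using a plain DFT of one frame (the DC bin is skipped)."""
    n = len(x)
    lf = hf = 0.0
    for k in range(1, n // 2):
        xk = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        energy = abs(xk) ** 2
        if k * fs / n < split_hz:
            lf += energy
        else:
            hf += energy
    return 10.0 * math.log10(lf / hf)

# Synthetic frame: a strong 500Hz tone plus a 2000Hz tone 100 times smaller
# in amplitude. Both sit exactly on DFT bins for n = 256 at fs = 8000.
fs, n = 8000, 256
frame = [math.sin(2.0 * math.pi * 500 * i / fs)
         + 0.01 * math.sin(2.0 * math.pi * 2000 * i / fs) for i in range(n)]
gap = lf_hf_gap_db(frame, fs, 1000)
print(round(gap, 1))  # 40.0
```

A 100:1 amplitude ratio is a 40dB gap, about what cq_ref and kristoff show. On real speech you would want to average over many frames rather than trust a single one.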
Another problem with using an unconventionally low LPC order like p=6 is that the model “runs out of poles”. Some speech signals may have 4 or 5 peaks, so the poor LPC model gets all confused and tries to reach a compromise that just sounds bad.
I messed around with a bunch of band pass filters that I applied to the speech samples before LPC modeling. These filters whip the speech signal into a shape that the LPC model can work with. I ran various samples (hts1a, hts2a, cq_ref, ve9qrp_10s, kristoff, mmt1, morig, forig, x200_ext, vk5qi) through them to come up with the best compromise for the 700 bit/s mode.
Even though the latter sample is band limited, it is easier to understand as the LPC model is doing a better job of clearly representing those peaks.
After some experimentation with sox I settled on two different filter types: a sox “bandpass 1000 2000” worked on some, whereas on others with more low frequency content “bandpass 1500 2000” sounded better. Some helpful discussions with Glen VK1XX had suggested that a two band AGC was common in broadcast audio pre-processing, and might be useful here.
However, through a process of frustrated experimentation (I was stuck on cq_ref for a day) I found that a very sharp skirted filter between 300 and 2600Hz did a pretty good job. Like p=6 LPC, a 2600Hz cutoff is quite uncommon for speech coding, but SSB users will find it strangely familiar…
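To give a feel for what "very sharp skirted" means, here is a sketch of a 300 to 2600Hz band pass built as a Hamming-windowed sinc FIR filter. This is not how sox implements its sinc effect (as far as I know that uses a Kaiser window), and the 255 tap length and function names are arbitrary choices of mine.

```python
import math

def lowpass_taps(cutoff_hz, fs, num_taps):
    """Hamming-windowed sinc low pass FIR taps (num_taps should be odd)."""
    m = num_taps - 1
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        t = n - m / 2
        ideal = 2.0 * fc if t == 0 else math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / m)
        taps.append(ideal * window)
    return taps

def bandpass_taps(low_hz, high_hz, fs, num_taps):
    """Band pass formed as the difference of two low pass filters."""
    hi = lowpass_taps(high_hz, fs, num_taps)
    lo = lowpass_taps(low_hz, fs, num_taps)
    return [h - l for h, l in zip(hi, lo)]

def gain_db(taps, freq_hz, fs):
    """Filter gain in dB at a single frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(h * math.cos(w * n) for n, h in enumerate(taps))
    im = -sum(h * math.sin(w * n) for n, h in enumerate(taps))
    return 20.0 * math.log10(math.hypot(re, im))

fs = 8000
taps = bandpass_taps(300, 2600, fs, 255)
for f in (100, 1000, 3200):
    print(f, round(gain_db(taps, f, fs), 1))
```

With this many taps the skirts are only about 100Hz wide, so 100Hz and 3200Hz are well down in the stop band while 1000Hz passes essentially untouched.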
Note that the initial version of the 700 bit/s mode (currently in use in FreeDV 700) uses a different band pass filter design, chosen more or less at random on the day, that sounds like this with p=6 LPC. This filter now appears to be a bit too severe.
Here is a little chunk of speech from hts1a:
Below are the original (red) and p=6 LPC models (green line) without and with a sox “bandpass 1000 2000” filter applied. If the LPC model was perfect green and red would be superimposed. Open each image in a new browser tab then jump back and forth. See how the two peaks around 550 and 1100Hz are better defined with the bandpass filter? The error (purple) in the 500 – 1000 Hz region is much reduced, better defining the “twin peaks” for our long suffering ears.
Here are three spectrograms of me saying “D G R”. The dark lines represent the spectral peaks we use to perceive the speech. In the “no BPF” case you can see the spectral peaks between 2.2 and 2.3 seconds are all blurred together. That’s pretty much what it sounds like too – muddy and indistinct.
Note that compared to the original, the p=6 BPF spectrogram is missing the pitch fundamental (dark line near 0 Hz), and a high frequency peak at around 2.5kHz is indistinct. Turns out neither of these matter much for intelligibility – they just make the speech sound band limited.
OK, so over the last few weeks I’ve spent some time looking at the effects of microphone placement, and input filtering on p=6 LPC models. Now time to look at quantisation of the 700 mode parameters then try it again over the air and see if the speech quality is improved. To improve performance in the presence of bit errors I’d also like to get the trellis based decoding into a real world usable form. When the entire FreeDV 700 mode (codec, modem, error handling) is working OK compared to SSB, time to look at porting to the SM1000.
Command Line Magic
I’m working with the c2sim program, which lets me explore Codec 2 in a partially quantised or incomplete state. I pipe audio in and out between various sox stages.
Note these simulations sound a lot better than the final Codec 2 at 700 bit/s as nothing else is quantised/decimated, e.g. it’s all at a 10ms frame rate with original phases. It’s a convenient way to isolate the LPC modeling step with as much fidelity as we can.
If you want to sing along here are a couple of sample command lines. Feel free to ask me any questions:
sox -r 8000 -s -2 ../../raw/hts1a.raw -r 8000 -s -2 -t raw - bandpass 1000 2000 | ./c2sim - --lpc 6 --lpcpf -o - | play -t raw -r 8000 -s -2 -
sox -r 8000 -s -2 ../../raw/cq_ref.raw -r 8000 -s -2 -t raw - sinc 300 sinc -2600 | ./c2sim - --lpc 6 --lpcpf -o - | play -t raw -r 8000 -s -2 -