For the last few weeks I’ve been working on improving the quality of the 700 bit/s Codec 2 mode. It’s been back to “research and disappointment”, but I’m making progress, even uncovering some mysteries that have eluded me for some time.
I’ve just checked in a “FreeDV 700B” mode, and ported it to the FreeDV API and FreeDV GUI application. Here are some samples:
Samples (HF fast fading channel, SNR=2dB): 700, 700B, 1600, SSB
I think 700B is an incremental improvement on 700, but still way behind FreeDV 1600 on clean speech. Then again it’s half the bit rate of 1600 and doesn’t fall over in HF channels. On air reports suggest the difference in robustness between 1600 and 700 on real HF channels is even more marked than above. Must admit I never thought I’d say the FreeDV 1600 quality sounds good!
FreeDV 700B uses a wider Band Pass Filter (BPF) than 700, and a 3 stage Vector Quantiser (VQ) for the LSPs, rather than scalar quantisers. VQ tends to deliver better quality for a given bit rate as it takes into account correlations between LSPs. This lets us lower the bit rate for a given quality. The downside is that VQs tend to be more sensitive to bit errors, use more storage, and need more MIPs. As HF is a high bit error rate channel, I have shied away from VQ until now.
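To illustrate the multi-stage idea (a toy sketch only, not the actual Codec 2 codebooks or the mbest search): each stage quantises the residual error left by the stage before it, so three small codebooks can together cover the space that a single huge codebook would need far more storage and search time for. The codebook sizes and values below are made up for the example.

```python
import numpy as np

def msvq_encode(x, codebooks):
    """Multi-stage VQ: stage k quantises the residual left by stages 0..k-1.
    codebooks: list of (entries, dim) numpy arrays.  Returns the per-stage
    indexes (what would be transmitted) and the quantised vector."""
    indexes = []
    xq = np.zeros_like(x, dtype=float)
    target = np.asarray(x, dtype=float).copy()
    for cb in codebooks:
        # greedy full search: pick the entry closest (squared error) to the residual
        i = int(np.argmin(np.sum((cb - target) ** 2, axis=1)))
        indexes.append(i)
        xq += cb[i]
        target -= cb[i]          # residual handed to the next stage
    return indexes, xq

# toy 3 stage VQ, 2 bits (4 entries) per stage, so 6 bits total for a dim-2 vector
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=s, size=(4, 2)) for s in (1.0, 0.3, 0.1)]
x = np.array([0.5, -0.2])
idx, xq = msvq_encode(x, codebooks)
print(idx, float(np.linalg.norm(x - xq)))
```

The greedy stage-by-stage search above is the simplest approach; the mbest search mentioned later keeps several candidates alive per stage and usually finds a lower overall error.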
Cumulative Quality, Additive Distortion
Using the c2sim program I can gradually build up the complete codec (see below for command line examples):
- p=6 LPC Amplitude Modelling
- phase0 Phase Model
- Decimated to 40ms
- Fully Quantised 700B
As each processing step is added, the quality drops. There is also another drop in quality when you get bit errors over the channel. It’s possible to mix and match c2sim command line options to homebrew your own speech codec, or test the effect of a particular processing step. It’s also a very good idea to try different samples; vk5qi and ve9qrp tend to code nicely, tougher samples are hts1a, hts2a, cq_ref, and kristoff.
I have a model in my mind: “Overall Quality = 0BER quality – effect of bit errors”. Now 700B isn’t as robust as 700, since it uses VQ. However, as 700B starts from a higher baseline, when errors occur the two modes sound roughly the same. This was news to me, and a welcome development, as VQ is a powerful technique for high quality at a lower bit rate. A good reason to release early and often: push a prototype through to a testable state.
The model also seems to apply for the various processing and quantisation steps. The BPF p=6 LPC model causes quite a quality drop, so you have to be very careful with LSP quantisation to maintain quality after that step. With p=10 LPC, the quality starts off high, so it appears less sensitive to quantisation.
Quantiser Design and Obligatory Plots to break up Wall of Text
A quantiser takes a floating point number, and represents it with a finite number of bits. For example it might take the pitch in the range of 50 to 500 Hz, and convert it to a 7 bit integer. Your sound card takes a continuous analog voltage and converts it to a 16 bit number. We can then send those bits over a channel. Fewer bits is better, as it lowers your bit rate. The trade off is distortion: as the number of bits drops you start to introduce quantisation noise.
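A uniform scalar quantiser over that 50 to 500 Hz pitch range might look like this (a sketch only; the actual Codec 2 pitch quantiser may use a different range or a non-uniform scale):

```python
def quantise(x, x_min, x_max, bits):
    """Uniform scalar quantiser: map x in [x_min, x_max] to an integer index."""
    levels = 2 ** bits
    step = (x_max - x_min) / levels
    index = int((x - x_min) / step)
    return min(max(index, 0), levels - 1)   # clamp out-of-range inputs

def dequantise(index, x_min, x_max, bits):
    """Map an index back to the centre of its quantisation interval."""
    step = (x_max - x_min) / 2 ** bits
    return x_min + (index + 0.5) * step

# 7 bit pitch quantiser over 50..500 Hz: step is 450/128 ~ 3.5 Hz, so the
# worst-case quantisation error is about 1.76 Hz (half a step)
i = quantise(155.0, 50.0, 500.0, 7)
f = dequantise(i, 50.0, 500.0, 7)
print(i, f)
```

Fewer bits means a coarser step and hence more quantisation noise, which is the trade off described above.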
A quantiser can be implemented as a look up table; there are a bunch of those in codec2-dev/src/codebook, please take a look.
Quantiser design is one of those “nasty details” of codec development. It reminds me of fixed point DSP, or echo cancellation. Lots of tricks no one has really documented, and the theory never seems to quite work without experience-based tweaking. No standard engineering practice.
Anyhoo, I came up with a simple test to see if all quantiser indexes are being used. In this case it was a 3 bit quantiser for the LPC energy. This is effectively the volume or gain of the coded speech in the current frame. So I ran a couple of speech samples through c2sim, dumped the energy quantiser index, and tabulated the results:
bin Fa Fr% Fc
1 306 22.58% 306
2 164 12.10% 470
3 298 21.99% 768
4 329 24.28% 1097
5 199 14.69% 1296
6 9 0.66% 1305
7 1 0.07% 1306
ve9qrp_10s (normal volume)
bin Fa Fr% Fc
1 73 7.30% 73
2 68 6.80% 141
3 88 8.80% 229
4 240 24.00% 469
5 328 32.80% 797
6 132 13.20% 929
7 44 4.40% 973
This looks reasonable to me. The louder sample has a distribution skewed towards the higher energy bins, as expected. Although I note the 8th bin is never used. This means we are effectively “wasting” bits. So perhaps we could reduce the range of the quantiser, or it could be a bug. The speech sounds pretty similar with/without the energy quantiser applied in c2sim. This is good.
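The tabulations above take only a few lines of code (a sketch; the Fa counts below are borrowed from the first table as toy data, and the percentages here are plain Fa/total, which may differ slightly from whatever tool produced the tables):

```python
from collections import Counter

def tabulate(indexes, nbins):
    """Per bin: Fa (absolute count), Fr% (relative), Fc (cumulative count)."""
    counts = Counter(indexes)
    total = len(indexes)
    cum = 0
    rows = []
    for b in range(nbins):
        fa = counts.get(b, 0)
        cum += fa
        rows.append((b + 1, fa, 100.0 * fa / total, cum))
    return rows

# toy index stream with the same Fa counts as the first table above
indexes = [0]*306 + [1]*164 + [2]*298 + [3]*329 + [4]*199 + [5]*9 + [6]*1
for b, fa, fr, fc in tabulate(indexes, 8):
    print(f"{b:3d} {fa:5d} {fr:6.2f}% {fc:5d}")
```

A bin with Fa of zero, like the 8th bin here, is the “wasted bits” case discussed above.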
Here are some plots I generated in the last few weeks that illustrate quantiser design. My first attempt to improve 700 involved new scalar quantisers for the 6 LSPs (click for a larger image):
These are histograms of the indexes of each quantiser. A graphical form of the tabulations above, except for the 6 LSP quantisers rather than energy. Note how some indexes are hardly used? This may indicate wasted bits. Or not. It could be those few samples that hardly register on the histogram REALLY matter for the speech quality. Welcome to speech coding…..
Here is a single frame frozen in time:
The “A enc” and “A dec” lines are the LPC spectrum before and after quantisation. This is represented by 6 LSP frequencies that are the vertical lines plotted at the bottom. Notice how the two spectrums and LSP frequencies are slightly different? This is the effect of quantisation.
Here is another view of many frames, using a measure called Spectral Distortion (SD):
SD is the difference between the original and quantised LPC spectrums, averaged over all frames and measured in dB. The top plot shows a histogram of the SD for each frame. Most frames have a small SD but there is a long tail of outliers. Some of these matter to us, and some don’t. For example no one cares about a large SD when coding background noise.
The bottom plot shows how SD is distributed across frequency. The high SD at higher frequencies is intentional, as the ear is less sensitive there. The SD drops to zero after 2600 Hz due to the band pass filter roll off.
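One common way to compute SD is the RMS dB difference over frequency for each frame, then the average over frames (a sketch, assuming the original and quantised LPC spectra are already available in dB on a common frequency grid; the weighting used in the actual plots may differ):

```python
import numpy as np

def spectral_distortion(A_enc_dB, A_dec_dB):
    """Per-frame RMS difference (dB) between the original and quantised LPC
    spectra.  Inputs: (frames, freq_bins) arrays of spectrum magnitude in dB.
    Averaging the returned values over frames gives the usual SD figure."""
    per_frame = np.mean((A_enc_dB - A_dec_dB) ** 2, axis=1)
    return np.sqrt(per_frame)

# toy: two frames where the quantised spectrum is offset by 1 dB and 2 dB
enc = np.zeros((2, 64))
dec = enc + np.array([[1.0], [2.0]])
sd = spectral_distortion(enc, dec)
print(sd.mean())   # → 1.5
```

The per-frame values are what the histogram in the top plot is built from; the long tail of outliers corresponds to frames with a large RMS dB error.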
I’m pleased that with a few weeks’ work I could incrementally improve the codec quality. Lots more work could be done here; I had to skip over a bunch of ideas in order to get something usable early.
I was also very pleased that once I had the Codec 2 700B mode designed and tested in c2sim, I could quickly get it “on the air”. I “pushed” it through the FreeDV “stack” in a few hours. This involves adding the new mode to codec2.c as separate encoder and decoder functions, modifying c2enc/c2dec, modifying the FreeDV API (including freedv_tx/freedv_rx), testing it over a fading channel simulation with cohpsk_ch, and adding the new mode to the FreeDV GUI program. The speed of integration is very pleasing, and is a sign of a good set of tools, good design, and a well thought out and partitioned implementation.
OK, so back to work. After listening to a few samples on the 700/700B/1600 modes I had the brainstorm of trying p=10 LPC using the same VQ design as 700B. The “one small step leads to another” R&D technique that is the outcome of steady, consistent work. Initial results are very encouraging, as good as FreeDV 1600 for some samples. At half the bit rate. So I’ll hold off on a general release of 700B until I’ve had a chance to run this 700C candidate to ground.
Command Line Kung Fu
Here’s how I simulate operation over a HF channel:
~/codec2-dev/build_linux/src$ ./freedv_tx 700B ../../raw/ve9qrp_10s.raw - | ./cohpsk_ch - - -24 0 2 1 | ./freedv_rx 700B - - | play -t raw -r 8000 -s -2 -
Here’s how I use c2sim to “build” a fully quantised codec. I start with the basic sinusoidal model:
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
Let’s add 6th order LPC modelling:
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --bpfb --lpc 6 --lpcpf -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
Hmm, let’s compare to 10th order LPC (this doesn’t need the band pass filter as explained here):
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --lpc 10 --lpcpf -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
Unfortunately we can’t send the sinusoidal phases over the channel, so we replace them with the phase0 model.
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --phase0 --postfilter --bpfb --lpc 6 --lpcpf -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
Next step is to use an 18 bit VQ for the LSPs, decimate from 10ms to 40ms frames, and quantise the pitch and energy. Which gives us the fully quantised codec.
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --phase0 --postfilter --bpfb --lpc 6 --lpcpf --lspmel --lspmelvq --dec 4 --sq_pitch_e_low -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
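The decimation step can be sketched like this (a toy model, assuming the decoder simply interpolates the parameters linearly between the transmitted frames; Codec 2’s actual interpolation of Wo, energy, and LSPs is more involved):

```python
import numpy as np

def decimate_interpolate(params, dec=4):
    """Keep every dec-th frame's parameters (what goes over the channel) and
    linearly interpolate the missing frames at the decoder.
    params: (frames, n) array of per-frame parameters."""
    frames = params.shape[0]
    t_kept = np.arange(0, frames, dec)      # indexes of transmitted frames
    kept = params[t_kept]
    t_all = np.arange(frames)
    out = np.empty_like(params, dtype=float)
    for j in range(params.shape[1]):
        out[:, j] = np.interp(t_all, t_kept, kept[:, j])
    return out

# toy: a parameter that ramps linearly is recovered exactly by interpolation
p = np.arange(13.0).reshape(13, 1)
rec = decimate_interpolate(p, dec=4)
print(np.allclose(rec, p))   # → True
```

Decimating by 4 cuts the parameter bit rate by a factor of 4; the cost is that anything changing faster than the 40ms update rate is smeared by the interpolation.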
To make a real world usable codec we then need to split the signal processing into a separate encoder and decoder (via functions in codec2.c). However this won’t change the quality, the decoded speech will sound exactly the same.
If you are keen enough to try any of the above and have any questions please email the codec2-dev list, or post a comment below.
Last Few Weeks Progress
Jotting this down for my own record, so I don’t forget any key points. However feel free to ask any questions:
- I have been working with LSPs using a warped frequency (mel scale) axis that models the log frequency response of our ear.
- Tried a new set of scalar quantisers but was not happy with the quality.
- Studied effect of microphones on low bit rate speech coding. Systematically tracking down anything that can affect speech quality. Every little bit helps, and improves my understanding of the problems I need to solve.
- Finally worked out why pathological samples (cq_ref, kristoff, k6hx) don’t code well with LPC models.
- Explored p=6 LPC models, and found out just how important clear formant definition is for speech perception.
- Explored vector quantisation for p=6 LPC model, including Octave and C mbest search implementations.
- Engineered an improved 700 bit/s mode, implemented it in Octave and C, and integrated it into the FreeDV API and FreeDV GUI program.
- Extended my simulation and test software, c2sim, melvq.m
- Formed ideas on additive distortion in speech coding, and the importance of clear formant definition.
- Came up with an idea for a non-LPC model that preserves clear formants with less emphasis on factors that don’t affect intelligibility, such as HP/LP spectral slope and formant widths. LPC/LSP has some weaknesses and only indirectly preserves the attributes of the speech spectrum that matter, such as formant/anti-formant structure.