FreeDV 700D has shown digital voice can outperform SSB at low SNRs over HF multipath channels. With practice the speech quality is quite usable, but it’s not high quality and is sensitive to different speakers and microphones.
However at times it’s like magic! The modem signal is buried in all sorts of HF noise, getting hammered by fading – then one of my friends 800km away breaks the FreeDV squelch (at -2dB!) and it’s like they are in the room. No more hunched-over-the-radio listening to my neighbour’s plasma TV power supply.
There is a lot of promise in FreeDV, and it’s starting to come together.
So now I’d like to see what we can do at a higher SNR, where SSB is an armchair copy. What sort of quality can we develop at 10dB SNR with light fading?
The OFDM modem and LDPC Forward Error Correction (FEC) used for 700D are working well, so the idea is to develop a high bit rate/high(er) quality speech codec, and adapt the OFDM modem and LDPC code.
So over the last few months I have been jointly designing a new speech codec, OFDM modem, and LDPC forward error correction scheme called “FreeDV 2200”. The speech codec mode is called “Codec 2 2200”.
Coarse Quantisation of Spectral Magnitudes
I started with the design used for Codec 2 700C. The resampling step is the key: it converts the time varying number of harmonic amplitudes to a fixed number (K=20) of samples covering 100 to 4000 Hz. They are sampled using the “mel” scale, which means we take more finely spaced samples at low frequencies, with coarser steps at high frequencies. This matches the log frequency response of the ear. I arrived at K=20 by experiment.
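The mel-spaced resampling can be sketched as follows. This is my own illustration, assuming the common mel formula m = 2595·log10(1 + f/700); the actual Codec 2 implementation may use different constants and interpolation details, and the function names are mine.

```python
import numpy as np

def mel(f_hz):
    """Map frequency (Hz) to the mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_inv(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_sample_freqs(K=20, f_lo=100.0, f_hi=4000.0):
    """K sample frequencies: finely spaced at low f, coarse at high f."""
    m = np.linspace(mel(f_lo), mel(f_hi), K)
    return mel_inv(m)

def resample_amplitudes(harmonic_freqs_hz, harmonic_amps_db, K=20):
    """Map a time-varying number of harmonic amplitudes (dB) onto a
    fixed K-point mel-spaced grid by linear interpolation."""
    fk = mel_sample_freqs(K)
    return np.interp(fk, harmonic_freqs_hz, harmonic_amps_db)
```

Whatever the number of harmonics in a given frame (which varies with pitch), the output is always K=20 samples, which is what makes fixed-rate quantisation possible.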
This Mel-resampling is not perfect. Some samples, like hts1a, have narrow formants at higher frequencies that get undersampled by the Mel resampling, leading to some distortion. Linear Predictive Coding (LPC) (as used in many Codec 2 modes) does a better job for some of these samples. More work required here.
In Codec 2 700C, we vector quantise the K=20 Mel sample vectors. For this new Codec 2 mode, I experimented with directly quantising each sample. To my surprise, I found very little difference in speech quality with coarsely quantised 6dB steps. Furthermore, I found we can limit the rate of change between samples to a maximum of +/-12 dB. This allows each sample to be delta-coded in frequency using a step of 0, -6, +6, -12, or +12dB. Using a 2 or 3 bit Huffman coding approach it turns out we need 45-ish bits/frame for good quality speech.
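The coarse delta coding scheme can be sketched like this. It is a minimal illustration of the idea described above, assuming each delta is simply quantised to the closest allowed step; the function names are my own, not Codec 2’s.

```python
import numpy as np

# Allowed frequency-to-frequency changes in dB: the 5-symbol alphabet.
STEPS = np.array([0, -6, 6, -12, 12])

def delta_encode(samples_db):
    """Quantise each mel sample relative to the previous quantised one,
    choosing the closest allowed step. Returns symbol indices."""
    symbols, prev = [], 0.0
    for s in samples_db:
        i = int(np.argmin(np.abs(STEPS - (s - prev))))
        symbols.append(i)
        prev += STEPS[i]   # track the decoder's running sum
    return symbols

def delta_decode(symbols):
    """Rebuild the coarsely quantised samples from the symbol stream."""
    out, prev = [], 0.0
    for i in symbols:
        prev += STEPS[i]
        out.append(prev)
    return out
```

The decoded samples land on a 6dB grid, so the per-sample error is bounded as long as the input really does change by no more than +/-12 dB between samples.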
After some experimentation with bit errors, I ended up using a fixed bit allocation rather than Huffman coding. Huffman coding uses variable length symbols (in my case 2 and 3 bits). After a bit error you don’t know where the next variable length symbol in the string starts and ends. So a single bit error early in the bits describing the spectrum can corrupt the entire spectrum.
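The desynchronisation problem is easy to demonstrate with a toy prefix code. The code table below is my own invention for illustration (the actual Codec 2 2200 symbols and bitstream differ); the point is what a single flipped bit does to everything downstream.

```python
# Toy variable length (prefix) code for the 5 delta symbols:
# shorter codes for the steps assumed to be more common.
huff = {'0': '0', '+6': '10', '-6': '110', '+12': '1110', '-12': '1111'}
inv = {v: k for k, v in huff.items()}

def huff_decode(bits):
    """Decode a prefix-coded bit string back into symbols."""
    out, acc = [], ''
    for b in bits:
        acc += b
        if acc in inv:
            out.append(inv[acc])
            acc = ''
    return out

syms = ['0', '+6', '0', '-6', '+6', '0']
bits = ''.join(huff[s] for s in syms)
corrupt = '1' + bits[1:]   # flip just the first bit

print(huff_decode(bits))      # recovers the original symbols
print(huff_decode(corrupt))   # symbol boundaries shift; the rest is garbage
```

With a fixed bit allocation the same single bit error would corrupt exactly one symbol, which is why I chose it despite the small rate penalty.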
(Table: bit allocation per parameter, totalling 56 bits per frame.)
At a 25ms frame update rate this is 56/0.025 = 2240 bits/s.
Here are some samples that show the effect of the processing stages. During development it’s useful to listen to the effect each stage has on the codec speech quality, for example to diagnose problems with a sample that codes poorly.
| Processing Stage | Sample |
|---|---|
| Codec 2 Unquantised 12.5ms frames | Listen |
| Quantised to 6dB steps | Listen |
| Fully Quantised with 44 bits/25ms frame | Listen |
Here is a plot of 50 frames of coarsely quantised amplitude samples, the 6dB steps are quite apparent:
To test the prototype 2200 mode I performed listening tests using 15 samples. I have collected these samples over time as examples that tend to challenge speech codecs and in particular code poorly using FreeDV 1600 (i.e. Codec 2 1300). During my listening tests I use a small set of powered speakers and for each sample choose which codec algorithm I prefer, then average the results over the 15 samples.
Over the 15 samples I felt 2200 was just in front of 2400 (on average), and well in front of 1300. There were some exceptions of course – it would be useful to explore them some time. I was hoping that I could make 2200 significantly better than 2400, but ran out of time. However I am pleased that I have developed a new method of spectral quantisation for Codec 2 and come up with a fully quantised speech codec based on it at such an early stage.
Here are a few samples processed with various speech codecs. On some of them 2200 isn’t as good as 2400 or even 1300. I’m presenting some of the poorer samples on purpose – the poorly coded samples are the interesting ones worth investigating. Remember the aims of this work are significant improvements over (i) Codec 2 1300, and (ii) SSB at around 10dB SNR. See what you think.
I feel the coarse quantisation trick is a neat way to reduce the information in the speech spectrum and is worth exploring further. There must be more efficient, and higher quality ways to encode this information. I did try some vector quantisation but wasn’t happy with the results (I always seem to struggle with VQ). However I just don’t have the time to explore every R&D path this work presents.
Quantisers can be stored very efficiently, as the dynamic range of the spectral sample is low (a few bits/sample) compared to floating point numbers. This simplifies storage requirements, even for large VQs.
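A back-of-envelope calculation shows the scale of the saving. The VQ size and bit depth here are hypothetical, purely to illustrate the point above:

```python
# Hypothetical VQ: 4096 entries of K=20 spectral samples.
entries, K = 4096, 20

float_bytes = entries * K * 4         # stored as 32-bit floats
packed_bytes = entries * K * 3 // 8   # 3 bits/sample, bit-packed

print(float_bytes, packed_bytes)      # packed storage is ~10x smaller
```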
Machine learning techniques could also be applied, using some of the ideas in this post as “pre-processing” steps. However that’s a medium term project which I haven’t found time for yet.
In the next post, I’ll talk about the modem and LDPC code required to build FreeDV 2200.