Codec 2 2200

FreeDV 700D has shown digital voice can outperform SSB at low SNRs over HF multipath channels. With practice the speech quality is quite usable, but it’s not high quality and is sensitive to different speakers and microphones.

However at times it’s like magic! The modem signal is buried in all sorts of HF noise, getting hammered by fading – then one of my friends 800km away breaks the FreeDV squelch (at -2dB!) and it’s like they are in the room. No more hunched-over-the-radio listening to my neighbour’s plasma TV power supply.

There is a lot of promise in FreeDV, and it’s starting to come together.

So now I’d like to see what we can do at a higher SNR, where SSB is an armchair copy. What sort of quality can we develop at 10dB SNR with light fading?

The OFDM modem and LDPC Forward Error Correction (FEC) used for 700D are working well, so the idea is to develop a high bit rate/high(er) quality speech codec, and adapt the OFDM modem and LDPC code.

So over the last few months I have jointly designed a new speech codec, OFDM modem, and LDPC forward error correction scheme called “FreeDV 2200”. The speech codec mode is called “Codec 2 2200”.

Coarse Quantisation of Spectral Magnitudes

I started with the design used for Codec 2 700C. The resampling step is the key: it converts the time-varying number of harmonic amplitudes to a fixed number (K=20) of samples covering 100 to 4000 Hz. They are sampled using the “mel” scale, which means we take more finely spaced samples at low frequencies, with coarser steps at high frequencies. This matches the log frequency response of the ear. I arrived at K=20 by experiment.
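As a rough sketch of what the mel-spaced sample grid looks like (assuming the common 2595*log10(1 + f/700) mel formula, which may differ from the exact warping used in the Codec 2 source):

```python
import math

def hz_to_mel(f):
    # O'Shaughnessy mel scale: an assumption, Codec 2 may use a variant
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_sample_freqs(K=20, f_lo=100.0, f_hi=4000.0):
    # K equally spaced points on the mel axis between f_lo and f_hi,
    # mapped back to Hz: fine spacing at low frequencies, coarse at high
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    return [mel_to_hz(m_lo + k * (m_hi - m_lo) / (K - 1)) for k in range(K)]
```

The harmonic amplitudes of each frame are then interpolated at these K frequencies to give the fixed-length vector.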

This Mel-resampling is not perfect. Some samples, like hts1a, have narrow formants at higher frequencies that get undersampled by the Mel resampling, leading to some distortion. Linear Predictive Coding (LPC) (as used in many Codec 2 modes) does a better job for some of these samples. More work required here.

In Codec 2 700C, we vector quantise the K=20 Mel sample vectors. For this new Codec 2 mode, I experimented with directly quantising each sample. To my surprise, I found very little difference in speech quality with coarsely quantised 6dB steps. Furthermore, I found we can limit the rate of change between samples to a maximum of +/-12 dB. This allows each sample to be delta-coded in frequency using a step of 0, -6, +6, -12, or +12dB. Using a 2 or 3 bit Huffman coding approach it turns out we need 45-ish bits/frame for good quality speech.
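The delta-coding idea can be sketched as follows. This is a simplified illustration of the scheme described above, not the actual Codec 2 source; in particular, coding the first sample absolutely is my assumption:

```python
def delta_quantise(amps_dB, step=6, max_delta=12):
    # Coarsely quantise spectral amplitude samples (in dB) by delta
    # coding across frequency: the first sample is rounded to the
    # nearest 6 dB level, then each following sample is coded as
    # 0, +/-6 or +/-12 dB relative to the previous quantised sample.
    q = [step * round(amps_dB[0] / step)]
    deltas = []
    for a in amps_dB[1:]:
        d = step * round((a - q[-1]) / step)
        d = max(-max_delta, min(max_delta, d))  # limit rate of change
        deltas.append(d)
        q.append(q[-1] + d)
    return q, deltas
```

Each delta takes one of only five values, which is what makes the 2 or 3 bit symbols possible.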

After some experimentation with bit errors, I ended up using a fixed bit allocation rather than Huffman coding. Huffman coding uses variable length symbols (in my case 2 and 3 bits). After a bit error you don’t know where the next variable length symbol in the string starts and ends. So a single bit error early in the bits describing the spectrum can corrupt the entire spectrum.
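A toy example of the loss-of-sync problem. The table below is a plausible 2/3-bit prefix code I made up for the five delta symbols, not the actual code from this work:

```python
# A plausible 2/3-bit prefix code for the five delta symbols
# (illustrative only, not the table used in Codec 2 2200)
HUFF = {0: '00', 6: '01', -6: '10', 12: '110', -12: '111'}
DECODE = {v: k for k, v in HUFF.items()}

def huff_encode(deltas):
    return ''.join(HUFF[d] for d in deltas)

def huff_decode(bits):
    out, sym = [], ''
    for b in bits:
        sym += b
        if sym in DECODE:   # prefix code: emit as soon as a symbol matches
            out.append(DECODE[sym])
            sym = ''
    return out

deltas = [6, 0, 12, -6, 0, 6]
bits = huff_encode(deltas)
# Flip the very first bit: the decoder loses symbol sync, so several
# later deltas decode incorrectly, not just the first one.
corrupted = ('1' if bits[0] == '0' else '0') + bits[1:]
```

With a fixed 3 bits per delta the same single bit error would corrupt exactly one symbol, which is why the final design uses a fixed allocation.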

Parameter           Bits Per Frame
Spectral Magnitude  44
Energy              4
Pitch               7
Voicing             1
Total               56

At a 25ms frame update rate this is 56/0.025 = 2240 bits/s.
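The same arithmetic as a couple of lines of Python:

```python
bits_per_frame = 44 + 4 + 7 + 1   # magnitudes + energy + pitch + voicing
frame_rate_hz = 1 / 0.025         # 40 frames per second
bit_rate = bits_per_frame * frame_rate_hz
```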

Here are some samples that show the effect of the processing stages. During development it’s useful to listen to the effect each stage has on the codec speech quality, for example to diagnose problems with a sample that codes poorly.

Original                                  Listen
Codec 2 Unquantised 12.5ms frames         Listen
Synthesised phases                        Listen
K=20 Mel                                  Listen
Quantised to 6dB steps                    Listen
Fully Quantised with 44 bits/25ms frame   Listen

Here is a plot of 50 frames of coarsely quantised amplitude samples, the 6dB steps are quite apparent:

Listening Tests

To test the prototype 2200 mode I performed listening tests using 15 samples. I have collected these samples over time as examples that tend to challenge speech codecs, and in particular code poorly using FreeDV 1600 (i.e. Codec 2 1300). During my listening tests I use a small set of powered speakers and for each sample choose which codec algorithm I prefer, then average the results over the 15 samples.

Over the 15 samples I felt 2200 was just in front of 2400 (on average), and well in front of 1300. There were some exceptions of course – it would be useful to explore them some time. I was hoping that I could make 2200 significantly better than 2400, but ran out of time. However I am pleased that I have developed a new method of spectral quantisation for Codec 2 and come up with a fully quantised speech codec based on it at such an early stage.

Here are a few samples processed with various speech codecs. On some of them 2200 isn’t as good as 2400 or even 1300. I’m presenting some of the poorer samples on purpose – the poorly coded samples are the interesting ones worth investigating. Remember the aims of this work are significant improvements over (i) Codec 2 1300, and (ii) SSB at around 10dB SNR. See what you think.

Sample      1300    2400    2200    SSB 10dB
hts1a       Listen  Listen  Listen  Listen
ve9qrp      Listen  Listen  Listen  Listen
mmt1        Listen  Listen  Listen  Listen
vk5qi       Listen  Listen  Listen  Listen
vk5local 1  Listen  Listen  Listen  Listen
vk5local 2  Listen  Listen  Listen  Listen

Further Work

I feel the coarse quantisation trick is a neat way to reduce the information in the speech spectrum and is worth exploring further. There must be more efficient, and higher quality ways to encode this information. I did try some vector quantisation but wasn’t happy with the results (I always seem to struggle with VQ). However I just don’t have the time to explore every R&D path this work presents.

Quantisers can be stored very efficiently, as the dynamic range of the spectral sample is low (a few bits/sample) compared to floating point numbers. This simplifies storage requirements, even for large VQs.
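For example, with a hypothetical 4096-entry, K=20 VQ (these sizes are my assumptions for illustration), packing each sample into 4 bits instead of a 32-bit float gives an 8x saving:

```python
entries, K = 4096, 20                 # hypothetical VQ dimensions
float_bytes = entries * K * 4         # stored as 32-bit floats
bits_per_sample = 4                   # coarse 6 dB steps need only a few bits
packed_bytes = entries * K * bits_per_sample // 8
```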

Machine learning techniques could also be applied, using some of the ideas in this post as “pre-processing” steps. However that’s a medium term project which I haven’t found time for yet.

In the next post, I’ll talk about the modem and LDPC code required to build FreeDV 2200.

6 thoughts on “Codec 2 2200”

  1. cq_ref_quant.wav has some significant clipping (as viewed on Audacity). I could hear it, so took a look at the time domain display.

    1. Nice work Walter. Yes it would be great if you could blog on your work and publish the source code, and collaborate with me (if you like). The world needs more open source low bit rate speech codecs….

      – David

  2. Hi David,
    again, great work! The new HQ candidate is very promising. I did a survey among some OMs at my laptop on our latest field day, recording some German texts by different speakers and encoding them with 1300, 700C, 450 and 450PWB. They were impressed by the speech quality of 450PWB and rated it higher than 700C. (OK, you have to consider that I used a ~50€ headset that was suited to the VQ.)
    However, the opinion was that it would frighten inexperienced users and only higher quality would lead to acceptance. But as a fallback, 450PWB would still be great, if the SNR drops.

    Listening to your new 2200 mode, I found a bug in 2400: Look at the spectrogram of “vk5qi” and compare it with 2200. The fundamental wave (= w0) is completely missing. Did you forget index 0 when porting the Octave code to C? This should only be an issue of the decoder, not the encoder, since the pitch estimation works fine.

    Best regards – you will hear from us regarding the 450 mode in some time…
    Stefan

    1. Hi Stefan,

      Nice work on the subjective testing and I look forward to working with you and Thomas some more :-)

      Yes the LPC/LSP based modes like 2400 intentionally remove anything beneath 150 Hz, as the LPC model has problems modeling low frequencies, IIRC the derivative of LPC spectra has to be 0 (a horizontal line) at 0 Hz.

      – David
