Every time I start working on Deep Learning and Codec 2 I get sidetracked! This time, I started developing a reference codec that could be used to explore machine learning. However, the reference codec was sounding pretty good early in its development, so I have pushed it through to a fully quantised state. For lack of a better name it’s called Candidate D, as that’s where I am up to in a series of codec prototypes.
The previous Codec 2 2200 post described candidate C. That also evolved from a “quick effort” to develop a reference codec to explore my deep learning ideas.
Learning about Vector Quantisation
This time, I explored Vector Quantisation (VQ) of spectral magnitude samples. I feel my VQ skills are weak, so I did a bit of reading. I really enjoy learning, especially in areas I have been fooling around in for a while but never really understood. It’s a special feeling when the theory clicks into place with the practical.
So I have these vectors of K=40 spectral magnitude samples that I want to quantise. Forty dimensions is a bit much to handle, so to get a feel for the data I started out by plotting smaller 2 and 3 dimensional slices. Here are 2D and 3D scatter plots of adjacent samples in the vector:
The data is highly correlated, almost a straight line relationship. An example of a 2-bit, 2D vector quantiser for this data might be the points (0,0) (20,20) (30,30) (40,40). Consider representing the same data with two 1D (scalar) quantisers over the same 2 bit range (0,20,30,40). This would take 4 bits in total, and be wasteful as it would represent points that would never occur, such as (40,0).
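To make the bit counting concrete, here is a minimal Python sketch of that comparison. The codebook and scalar levels are the ones quoted in the text; the `nearest` helper and the `target` sample are made up for illustration and are not taken from the Codec 2 source.

```python
from itertools import product

# 2-bit, 2D VQ codebook from the text: four points along the diagonal.
vq_codebook = [(0, 0), (20, 20), (30, 30), (40, 40)]

# Two independent 2-bit scalar quantisers over the same levels give a
# 4 x 4 grid of possible (x, y) outputs, costing 4 bits in total.
scalar_levels = [0, 20, 30, 40]
scalar_grid = list(product(scalar_levels, repeat=2))


def nearest(codebook, x):
    """Return the codebook entry with minimum squared error to x."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, x)))


target = (22, 19)  # hypothetical pair of adjacent magnitude samples

vq_point = nearest(vq_codebook, target)      # costs 2 bits
scalar_point = nearest(scalar_grid, target)  # costs 4 bits

# Grid points that lie off the diagonal, where this data rarely lands.
off_diagonal = [p for p in scalar_grid if p[0] != p[1]]

print("VQ     (2 bits):", vq_point)
print("scalar (4 bits):", scalar_point)
print("off-diagonal grid points:", len(off_diagonal), "of", len(scalar_grid))
```

For this sample both quantisers land on the same point, but the pair of scalar quantisers spends twice the bits, and most of its grid sits away from the diagonal where the data lives.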
[1] helped me understand the relationship between covariance and VQ, using 2D vectors. For Candidate D I extended this to K=40 dimensions, the number of samples I am using for the spectral magnitudes. Then a (thirty year old!) paper [2] explained how the DCT relates to vector quantisation and the eigenvector/value rotations described in [1]. I vaguely remember snoring my way through eigen-thingies at math lectures in University!
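As a rough illustration of the rotation idea in 2D, the sketch below uses synthetic, strongly correlated pairs (an assumption standing in for the real spectral magnitude data). It rotates the data onto the principal axis of its covariance matrix, then compares that with a fixed 45 degree rotation, which is what the orthonormal 2-point DCT amounts to (up to a sign on the second output). All names here are hypothetical.

```python
import math
import random

random.seed(1)

# Synthetic stand-in for adjacent magnitude samples (dB): a shared
# component plus a little independent noise on each axis.
samples = []
for _ in range(2000):
    common = random.gauss(30.0, 10.0)
    samples.append((common + random.gauss(0.0, 2.0),
                    common + random.gauss(0.0, 2.0)))


def covariance(pairs):
    """Return (var_x, var_y, cov_xy) of a list of 2D points."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    cxx = sum((p[0] - mx) ** 2 for p in pairs) / n
    cyy = sum((p[1] - my) ** 2 for p in pairs) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in pairs) / n
    return cxx, cyy, cxy


def rotate(pairs, angle):
    """Project each point onto axes rotated by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * y, -s * x + c * y) for x, y in pairs]


cxx, cyy, cxy = covariance(samples)

# Angle of the principal eigenvector of the symmetric 2x2 covariance matrix.
theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)

# Eigenvector rotation: the two outputs become uncorrelated.
rxx, ryy, rxy = covariance(rotate(samples, theta))

# Fixed 45 degree rotation, equivalent to a 2-point orthonormal DCT
# apart from the sign of the second output.
dxx, dyy, dxy = covariance(rotate(samples, math.pi / 4))

print("principal axis angle (deg):", math.degrees(theta))
print("eigen rotation variances  :", rxx, ryy)
print("45 degree (DCT) variances :", dxx, dyy)
```

After either rotation nearly all of the variance sits in the first output, so each output can be quantised on its own without representing points that never occur.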
My VQ work to date has used minimum Mean Square Error (MSE) to train and match vectors. I have been uncomfortable with MSE matching for a while, as I have observed poor choices in matching vectors to speech. For example if the target vector falls off sharply at high frequencies (say a low pass filter at 3500 Hz), the VQ will try to select a vector that matches that fall off, and ignore smaller, more perceptually important features like formants.
VQs are often trained to minimise the average error. They tend to cluster VQ points closer to those samples that are more likely to occur. However I have found that beneath a certain threshold, we can’t hear the quantisation error. In Codec 2 it’s hard to hear any distortion when spectral magnitudes are quantised to 6 dB steps. This suggests that we are wasting bits with fine quantiser steps, and there may be better ways to design VQs, for example a uniform grid of points that covers a few standard deviations of data on the scatter plots above.
I like the idea of uniform quantisation across vector dimensions and the concepts I learnt during this work allowed me to do just that. The DCT effectively lets me use scalar quantisation of each vector element, so I can easily choose any quantiser shape I like.
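A minimal sketch of that idea: take the DCT of a rate K vector, then apply the same uniform scalar quantiser to every coefficient. The pure Python DCT, the synthetic envelope, and the step size of 6 are all assumptions for illustration; the actual Candidate D quantiser shapes are not given here.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    K = len(x)
    out = []
    for k in range(K):
        scale = math.sqrt(1.0 / K) if k == 0 else math.sqrt(2.0 / K)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / K)
                               for n in range(K)))
    return out


def idct(X):
    """Inverse of dct() above (orthonormal DCT-III)."""
    K = len(X)
    out = []
    for n in range(K):
        acc = math.sqrt(1.0 / K) * X[0]
        for k in range(1, K):
            acc += math.sqrt(2.0 / K) * X[k] * math.cos(math.pi * (n + 0.5) * k / K)
        out.append(acc)
    return out


def quantise_uniform(coeffs, step):
    """Map each coefficient to the index of the nearest multiple of step."""
    return [round(c / step) for c in coeffs]


def dequantise_uniform(indexes, step):
    return [i * step for i in indexes]


K = 40
STEP = 6.0  # assumed step size, applied in the DCT domain

# Hypothetical smooth spectral envelope in dB, not real speech data.
target_db = [30.0 + 15.0 * math.cos(2.0 * math.pi * n / K) - 0.3 * n
             for n in range(K)]

coeffs = dct(target_db)
indexes = quantise_uniform(coeffs, STEP)
decoded_db = idct(dequantise_uniform(indexes, STEP))

roundtrip_error = max(abs(a - b) for a, b in zip(target_db, idct(coeffs)))
rms_error_db = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(target_db, decoded_db)) / K)

print("non-zero indexes:", sum(1 for i in indexes if i != 0), "of", K)
print("RMS error (dB):", rms_error_db)
```

Because this DCT is orthonormal, the total squared error is the same before and after the inverse transform, so the RMS error in dB can never exceed half the step size. Individual samples can still be off by more than that.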
Candidate D uses a similar design and bit allocation to Candidate C. Candidate D uses K=40 resampling of the spectral magnitudes, to help preserve narrow high frequency formants that are present for low pitch speakers like hts1a. The DCT of the rate K vectors is computed, and quantised using a Huffman code.
There are not enough bits to quantise all of the coefficients, so we stop when we run out of bits, typically after 15 or 20 (out of a total of 40) DCT coefficients. On each frame the algorithm tries direct or differential quantisation, and chooses the method with the lowest error.
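The per-frame decision described above can be sketched as follows. This is only an illustration under stated assumptions: the variable length code (unary magnitude plus a sign bit) stands in for the real Huffman table, uncoded coefficients fall back to the prediction, the mode flag bit is left out of the budget, and all function names and numbers are hypothetical.

```python
def code_length_bits(index):
    """Cost of a made-up prefix code: '0' for zero, else unary magnitude + sign."""
    return 1 if index == 0 else abs(index) + 2


def encode_within_budget(values, step, budget_bits):
    """Quantise values in order, stopping when the next code would not fit."""
    indexes, used = [], 0
    for v in values:
        idx = round(v / step)
        cost = code_length_bits(idx)
        if used + cost > budget_bits:
            break
        indexes.append(idx)
        used += cost
    return indexes, used


def encode_frame(coeffs, prev_decoded, step, budget_bits):
    """Try direct and differential coding, keep whichever has lower error."""
    results = {}
    predictions = {"direct": [0.0] * len(coeffs),
                   "differential": prev_decoded}
    for mode, prediction in predictions.items():
        residual = [c - p for c, p in zip(coeffs, prediction)]
        indexes, used = encode_within_budget(residual, step, budget_bits)
        decoded = list(prediction)  # uncoded coefficients keep the prediction
        for k, idx in enumerate(indexes):
            decoded[k] += idx * step
        error = sum((c - d) ** 2 for c, d in zip(coeffs, decoded))
        results[mode] = {"error": error, "indexes": indexes,
                         "decoded": decoded, "bits": used}
    best = min(results, key=lambda m: results[m]["error"])
    return best, results[best]


STEP, BUDGET = 3.0, 12  # assumed step size and per-frame bit budget

prev = [12.0, -9.0, 6.0, 3.0, 0.0, 0.0, 0.0, 0.0]      # last decoded frame
similar = [13.0, -9.5, 6.5, 2.0, 0.5, 0.0, 0.0, 0.0]   # close to prev
changed = [-6.0, 9.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # very different

mode_a, frame_a = encode_frame(similar, prev, STEP, BUDGET)
mode_b, frame_b = encode_frame(changed, prev, STEP, BUDGET)

print(mode_a, frame_a["indexes"], frame_a["bits"], frame_a["error"])
print(mode_b, frame_b["indexes"], frame_b["bits"], frame_b["error"])
```

When the frame resembles the previous one, the differences are small and cheap to code, so differential wins. When the spectrum changes abruptly, the differences are large, the budget runs out early, and direct coding gives the lower error.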
I have a couple of small databases that I use for listening tests (about 15 samples in total). I feel Candidate D is better than Codec 2 1300, and also Codec 2 2400 for most (but not all) samples.
In particular, Candidate D handles samples with lots of low frequency energy better, e.g. cq_ref and kristoff in the table below.
For a high quality FreeDV mode I want to improve speech quality over FreeDV 1600 (which uses Codec 2 1300 plus some FEC bits), and provide better robustness to different speakers and recording conditions. As you can hear, there is a significant jump in quality between the 1300 bit/s codec and Candidate D. Implemented as a FreeDV mode, it would compare well with SSB at high SNRs.
There are many aspects of Candidate D that could be explored:
- Wideband audio, like the work from last year.
- Back to my original aim of exploring deep learning with Codec 2.
- Computing the DCT coefficients from the rate L (time varying) magnitude samples.
- Better time/freq quantisation using a 2D DCT rather than the simple difference in time scheme used for Candidate D.
- Porting to C and developing a real time FreeDV 2200 mode.
The current Candidate D 2200 codec is implemented in Octave, so porting to C is required before it is usable for real world applications, plus some more C to integrate with FreeDV.
If anyone would like to help, please let me know. It’s fairly straightforward C coding, as I have already done the DSP. You’ll learn a lot, and be part of the open source future of digital radio.
[1] A geometric interpretation of the covariance matrix. This really helped me understand what was going on with VQ in 2 dimensions, which can then be extended to larger dimensions.
[2] Vector Quantization in Speech Coding, Makhoul et al.
[3] Codec 2 Wideband, previous DCT based Codec 2 work.