This post describes how the masking model frequency/amplitude pairs are quantised.
This work is very new, and there are many different areas to pursue. However, I decided to "release early and often" – push a first pass right through the quantisation process. That way I can write about it, publish the results, and get some feedback. This post presents a first pass including samples at 700 and 1000 bit/s.
Quantisation takes a floating point value (a real number) and represents it with a small number of bits. In this case we have 4 frequencies and 4 amplitudes: eight numbers in total that we must send over the channel. If we sent the raw floating point values, that would be 8×32 = 256 bits/frame. With a 40 ms frame update rate that is 256/0.04 = 6400 bit/s. Too high. So we need to come up with efficient quantisers that minimise the number of bits flowing over the channel while keeping reasonable speech quality. Speech coding is the art of deciding what to throw away.
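The back-of-envelope arithmetic above can be sketched in a few lines (a Python illustration of the sums, not codec code):

```python
# Hypothetical sketch of the "raw floats" bit rate worked out above.
N_PARAMS = 4 + 4        # 4 frequencies + 4 amplitudes per frame
BITS_PER_FLOAT = 32     # IEEE 754 single precision
FRAME_PERIOD_S = 0.04   # 40 ms frame update rate

bits_per_frame = N_PARAMS * BITS_PER_FLOAT   # 256 bits/frame
bit_rate = bits_per_frame / FRAME_PERIOD_S   # ~6400 bit/s
print(bits_per_frame, bit_rate)
```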
There are a few tricks we can use. The dynamic range (maximum and minimum values) tends to be limited. A good way to look at the range is a histogram of each value. I ran samples of 10 speakers through the simulations and logged the frequencies and amplitudes to generate some histograms.
Here are the frequencies and differences between each frequency. The frequencies were first sorted into ascending order.
Here are the amplitudes, with the mean (frame energy) removed:
Voiced speech tends to have a “low pass” spectral slope – more energy at low frequencies than high frequencies. Unvoiced speech tends to be “high pass”. As discussed in the first post the ear is not very sensitive to fixed “filtering” of speech, ie the absolute value of the formant amplitudes. You can have a gentle band pass filter, some high pass or low pass filtering, and it all sounds fine.
So I reasoned I could fit a straight line to the amplitudes, like this:
The first plot is the time domain speech, it’s a frame of Mark saying “five”. The second plot shows the spectral amplitudes Am (red) and the mask (purple) we have fit to them. The mask is described by just four frequencies/amplitude points. The frequencies are labelled by the black crosses.
The last plot shows the four frequency/amplitude points (red), and a straight line fit to them (blue line). The error in the straight line fit to each red point is also shown.
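The straight line fit above can be sketched as a least-squares fit to the four frequency/amplitude points (my own illustration with made-up numbers, not the actual simulation code):

```python
# Sketch: least-squares straight-line fit to the four mask points, and the
# per-point fit errors that are quantised separately.
def line_fit(freqs_hz, amps_db):
    """Return (gradient, y_intercept) of the least-squares line."""
    n = len(freqs_hz)
    mx = sum(freqs_hz) / n
    my = sum(amps_db) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(freqs_hz, amps_db))
    sxx = sum((x - mx) ** 2 for x in freqs_hz)
    m = sxy / sxx
    return m, my - m * mx

# Made-up example frame with a voiced-like "low pass" spectral slope
freqs = [400.0, 1200.0, 2300.0, 3200.0]
amps  = [52.0, 48.0, 38.0, 30.0]

m, c = line_fit(freqs, amps)
errors = [a - (m * f + c) for f, a in zip(freqs, amps)]
print(m, c, errors)
```

Note the negative gradient for this voiced-like frame, matching the skew seen in the gradient histogram below.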
So keeping this straight line fit in mind, let's get back to the histograms. Here are the histograms of the amplitudes, followed by the histograms of the amplitude errors after the straight line fit:
Note how much narrower the second plot's histograms are compared to the first. This makes the values "easier" to represent with a small number of bits. In statistics, we would say the variance of these variables is smaller.
OK, but now we need to send the parameters that describe the straight line: the gradient and y-intercept. Here are their histograms:
Notice the mean of the gradient is skewed to negative values? Speech contains more voiced speech (vowels) than unvoiced speech (consonants), and voiced speech is "low pass" (a negative gradient).
The y-intercept is a fancy way of saying the “frame energy”. It goes up and down with the level of the speech.
Quantisation of Frequencies
The frequencies are found in random order by the AbyS algorithm. We can also transmit them in any order, and reconstruct the same spectral envelope at the decoder. For convenience we sort them into ascending order. This reduces the distance between each frequency sample, and lets us delta code the frequencies. You can see above that the histograms of the last 3 delta frequencies cover about the same range.
I used 3 bits for each frequency, giving a total of 12 bits/frame.
Quantisation of Amplitudes
I used the straight line fit method for the amplitudes: 3 bits for the gradient, and 3 bits per error in 5 dB steps over the range -15 to +15 dB. I assumed the y-intercept would require as many bits as the frame energy (5 bits/frame) used for the existing Codec 2 modes.
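A minimal sketch of the amplitude-error quantiser described above (my reading of the scheme, not the actual codec2 code): 5 dB steps over -15 to +15 dB give 7 levels, which fits in 3 bits.

```python
# Sketch: 3-bit uniform quantiser for the straight-line fit errors.
STEP_DB = 5.0
ERR_MIN, ERR_MAX = -15.0, 15.0

def quantise_error(err_db):
    """Clamp to range (quantiser overload), then round to the nearest 5 dB step."""
    err_db = max(ERR_MIN, min(ERR_MAX, err_db))
    index = round((err_db - ERR_MIN) / STEP_DB)   # 0..6, fits in 3 bits
    return index, ERR_MIN + index * STEP_DB

for e in (-20.0, -3.0, 2.4, 17.0):
    print(e, quantise_error(e))
```

Values outside ±15 dB are clamped, which is exactly the "quantiser overload" condition flagged in the ideas list below.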
The quantisation work leads us to a simulation of quantised 1000 and 700 bit/s codecs with the following bit allocations. In the 700 bit/s mode, we don’t transmit the straight line fit errors.
| Parameter | Bits/frame (High Rate) | Bits/frame (Low Rate) |
|---|---|---|
| Mask amp gradient | 3 | 3 |
| Mask amp errors | 12 | 0 |
| Frame period (s) | 0.04 | 0.04 |
Here are some samples of the first pass 700 and 1000 bit/s codecs. Also provided are the samples from Part 3 (unquantised AbyS), and Codec 2 700B and 1300 for comparison. The synthetic phase spectrum is derived from the decoded amplitude spectrum. "newamp" is the name for the simulations using the masking model.
| Sample | 700B | 1300 | newamp AbyS | newamp AbyS 700 | newamp AbyS 1000 |
|---|---|---|---|---|---|
I can notice some "tinkles", "running water" type sounds on the quantised newamp samples. This could be chunks of spectrum coming and going quickly. There is also some roughness on long vowels, like "five" in the vk5qi samples.
I feel the newamp 700 samples are better than 700B, which is the direction I want to be heading. Please tell me what you think.
These samples were generated with codec2-dev SVN revision 2716:

```
octave:49> newamp_batch("../build_linux/src/vk5qi","../build_linux/src/vk5qi_am.out", "../build_linux/src/vk5qi_aw.out")
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --amread vk5qi_am.out --awread vk5qi_aw.out --phase0 --postfilter -o - | play -t raw -r 8000 -s -2 -
```
A bunch of ideas come to mind:
- The roughness in long vowels could be frame to frame amplitude variations. It would be interesting to explore these errors, for example plotting them.
- Try a delta-time approach. Gentle evolution of the parameters from frame to frame – for example the slope of the straight line – might sound better than the current scheme, as there will be less frame-to-frame noise.
- Plotting trajectories of the parameters over time would give us some more insight into quantisation, and help us determine if we can use Trellis Decoding for additional robustness.
- It may be possible to weight certain errors in the AbyS loop, for example to steer it away from outlier amplitude points that have a poor straight line fit.
- Can we choose a set of amplitudes that fit exactly to a straight line but still sound OK? Can we modify the model in a way that doesn't affect the speech quality but helps us quantise to a compact set of bits?
- Look for quantiser overload – values outside of the quantiser range.
- Take another look at what the AbyS loop is doing, track down some problem frames.
- Try a higher or lower number (e.g. 3) of frequency/amplitude points.
- Frequencies don’t have to be on the harmonic amplitude mWo grid.
- Experiment with the shape of the masking functions.
- Try Vector Quantisation.
- The parameters are all quite orthogonal, which lets us modify them independently. For example we could move the amplitudes a little to aid quantisation, but keep the frequencies the same. Compressing the frame energy should leave intelligibility unchanged.
- Get something a little better than what we have here, put it on the air, and get some real tests over real conversations.
- Samples like vk5qi have a lot of low pass energy. We might be wasting bits coding that. The 4th frequency histogram has a mean of 3.2 kHz. This is very high, and it could be argued we don't need this point, or that it could be coded with low resolution.