Codec 2 Masking Model Part 4

This post describes how the masking model frequency/amplitude pairs are quantised.

This work is very new and there are many different areas to pursue. However, I decided to "release early and often" and push a first pass right through the quantisation process. That way I can write about it, publish the results, and get some feedback. This post presents that first pass, including samples at 700 and 1000 bit/s.

Histograms

Quantisation takes a floating point value (a real number) and represents it with a small number of bits. In this case we have 4 frequencies and 4 amplitudes: eight numbers in total that we must send over the channel. If we sent the floating point values that would be 8×32 = 256 bits/frame. With a 40ms frame update rate that is 256/0.04 = 6400 bit/s. Too high. So we need to come up with efficient quantisers that minimise the number of bits flowing over the channel while keeping a reasonable speech quality. Speech coding is the art of "what can I throw away?"
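As a sanity check, here is that arithmetic in Octave:

    % raw floating point cost: 8 parameters at 32 bits each
    bits_per_frame = 8*32;            % 256 bits/frame
    bit_rate = bits_per_frame/0.04    % 6400 bit/s at a 40ms frame rate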

There are a few tricks we can use. The dynamic range (the maximum and minimum values) tends to be limited. A good way to look at the range is a histogram of each value. I ran samples from 10 speakers through the simulations and logged the frequencies and amplitudes to generate some histograms.
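Generating the histograms in Octave is only a few lines. Here is a sketch, assuming the logged values are collected in a matrix mask_freqs with one row per frame (the variable name is made up for illustration):

    % histogram of each of the 4 mask frequencies, 50 bins each
    for i=1:4
      subplot(2,2,i);
      hist(mask_freqs(:,i), 50);
      title(sprintf("freq %d", i));
    end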

Here are the frequencies, and the differences between adjacent frequencies. The frequencies were first sorted into ascending order.

Here are the amplitudes, with the mean (frame energy) removed:

Voiced speech tends to have a "low pass" spectral slope – more energy at low frequencies than high frequencies. Unvoiced speech tends to be "high pass". As discussed in the first post, the ear is not very sensitive to fixed "filtering" of speech, i.e. the absolute values of the formant amplitudes. You can have a gentle band pass filter, some high pass or low pass filtering, and it all sounds fine.

So I reasoned I could fit a straight line to the amplitudes, like this:

The first plot is the time domain speech: a frame of Mark saying "five". The second plot shows the spectral amplitudes Am (red) and the mask (purple) we have fit to them. The mask is described by just four frequency/amplitude points, with the frequencies marked by black crosses.

The last plot shows the four frequency/amplitude points (red), and a straight line fit to them (blue line). The error in the straight line fit to each red point is also shown.
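In Octave the fit is a one-liner with polyfit. A minimal sketch, assuming row vectors mask_freq (Hz) and mask_amp (dB) holding the four points for one frame (names invented for illustration):

    p = polyfit(mask_freq, mask_amp, 1);  % p(1) = gradient, p(2) = y-intercept
    fit_line = polyval(p, mask_freq);     % straight line at the four frequencies
    errors = mask_amp - fit_line;         % fit error at each point (dB)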

So, keeping this straight line fit in mind, let's get back to the histograms. Here are the histograms of the amplitudes, followed by the histograms of the amplitude errors after the straight line fit:

Note how much narrower the second plot's histograms are compared to the first? That makes the values "easier" to represent with a small number of bits. In statistics, we would say the variance of these variables is smaller.

OK, but now we need to send the parameters that describe the straight line: the gradient and the y-intercept. Here are the histograms:

Notice the gradient histogram is skewed towards negative values? Speech contains more voiced speech (vowels) than unvoiced speech (consonants), and voiced speech is "low pass" (a negative gradient).

The y-intercept is a fancy way of saying the “frame energy”. It goes up and down with the level of the speech.

Quantisation of Frequencies

The frequencies are found in random order by the AbyS algorithm. We can also transmit them in any order, and reconstruct the same spectral envelope at the decoder. For convenience we sort them into ascending order. This reduces the distance between each frequency sample, and lets us delta code the frequencies. You can see above that the histograms of the last 3 delta frequencies cover about the same range.

I used 3 bits for each frequency, giving a total of 12 bits/frame.
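Here is a sketch of the sort/delta/quantise steps, assuming a uniform 3-bit quantiser and an assumed delta range [dmin, dmax] read off the histograms (both names invented for illustration, not the actual quantiser design):

    f = sort(mask_freq);              % ascending order
    d = [f(1) diff(f)];               % first freq, then the deltas
    step = (dmax - dmin)/(2^3 - 1);   % 8 levels across the assumed range
    ind = round((d - dmin)/step);     % 3-bit index for each frequency
    ind = max(0, min(2^3 - 1, ind));  % clamp to the quantiser range
    f_hat = cumsum(dmin + ind*step);  % reconstructed frequencies at the decoder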

Quantisation of Amplitudes

I used the straight line fit method for the amplitudes: 3 bits for the gradient, and 3 bits for each error in 5dB steps over the range -15 to 15 dB. I assumed the y-intercept would require the same number of bits as the frame energy (5 bits/frame) used for the existing Codec 2 modes.
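The error quantiser is particularly simple: 5 dB steps over -15 to 15 dB is just 7 levels, which fits in 3 bits. A sketch:

    e_hat = 5*round(errors/5);        % round to the nearest 5 dB step
    e_hat = max(-15, min(15, e_hat)); % clamp to the quantiser range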

Bit Allocation

The quantisation work leads us to a simulation of quantised 1000 and 700 bit/s codecs with the following bit allocations. In the 700 bit/s mode, we don’t transmit the straight line fit errors.

Parameter           Bits/frame (High Rate)   Bits/frame (Low Rate)
Pitch (Wo)          7                        7
Voicing             1                        1
Energy              5                        5
Mask freqs          12                       12
Mask amp gradient   3                        3
Mask amp errors     12                       0
Bits/frame          40                       28
Frame period (s)    0.04                     0.04
Bits/s              1000                     700
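The bit rates fall straight out of the frame period:

    bits_high = 7+1+5+12+3+12;  % 40 bits/frame
    bits_low  = 7+1+5+12+3;     % 28 bits/frame
    [bits_high bits_low]/0.04   % 1000 and 700 bit/s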

Samples

Here are some samples of the first pass 700 and 1000 bit/s codecs. Also provided are the samples from Part 3 (unquantised AbyS), and Codec 2 700B and 1300. The synthetic phase spectrum is derived from the decoded amplitude spectrum. "newamp" is the name for the simulations using the masking model.

Sample       700B    1300    newamp AbyS   newamp AbyS 700   newamp AbyS 1000
ve9qrp_10s   Listen  Listen  Listen        Listen            Listen
mmt1         Listen  Listen  Listen        Listen            Listen
vk5qi        Listen  Listen  Listen        Listen            Listen

I can notice some "tinkles" and "running water" type sounds on the quantised newamp samples. This could be chunks of spectrum coming and going quickly. There is also some roughness on long vowels, like "five" in the vk5qi samples.

I feel the newamp 700 samples are better than 700B, which is the direction I want to be heading. Please tell me what you think.

Command Lines
codec2-dev SVN revision 2716
octave:49> newamp_batch("../build_linux/src/vk5qi","../build_linux/src/vk5qi_am.out", "../build_linux/src/vk5qi_aw.out")
~/codec2-dev/build_linux/src$ ./c2sim ../../wav/vk5qi.wav --amread vk5qi_am.out --awread vk5qi_aw.out --phase0 --postfilter -o - | play -t raw -r 8000 -s -2 -

Further Work

A bunch of ideas that come to mind:

  • The roughness in long vowels could be frame to frame amplitude variations. It would be interesting to explore these errors, for example by plotting them.
  • Try a delta-time approach. A gentle evolution of the parameters, for example the slope of the straight line, might sound better than the current scheme, as there will be less frame to frame noise.
  • Plotting trajectories of the parameters over time would give us some more insight into quantisation, and help us determine if we can use Trellis Decoding for additional robustness.
  • It may be possible to weight certain errors in the AbyS loop, for example steer it away from outlier amplitude points that have a poor straight line fit.
  • Can we choose a set of amplitudes that fit exactly to a straight line but still sound OK? Can we modify the model in a way that doesn’t affect the speech quality but helps us quantise to a compact set of bits?
  • Look for quantiser overload – values outside of the quantiser range.
  • Take another look at what the AbyS loop is doing, track down some problem frames.
  • Try a higher or lower number (e.g. 3) of frequency/amplitude points.
  • Frequencies don’t have to be on the harmonic amplitude mWo grid.
  • Experiment with the shape of the masking functions.
  • Try Vector Quantisation.
  • The parameters are all quite orthogonal, which lets us modify them independently. For example we could move the amplitudes a little to aid quantisation, but keep the frequencies the same. Compressing the frame energy will leave intelligibility unchanged.
  • Get something a little better than what we have here and put it on the air and get some real tests over real conversations.
  • Samples like vk5qi have a lot of low pass energy. We might be wasting bits coding that. The 4th frequency histogram has a mean of 3.2kHz. This is very high, and it could be argued we don’t need this sample, or that it could be coded with low resolution.

Links

Codec 2 Masking Model Part 1
Codec 2 Masking Model Part 2
Codec 2 Masking Model Part 3

7 thoughts on “Codec 2 Masking Model Part 4”

  1. Have you checked what the errors look like with respect to a linear fit in numeric amplitude/frequency rather than log(amplitude)/frequency? Given the errors for the one frame you show, it looks like you might get better modeling.

      1. The ear works log(amp)/log(freq). What I’m asking is whether the linear fit is better if you’re fitting linear (amp/freq or log(amp)/log(freq)) rather than exponential (log(amp)/freq) models. It might allow for fewer bits being assigned to the amplitude fit errors.

  2. This is fantastic work. The new method sounds so much better than before. I find it far easier to understand.

  3. It’s how the ear hears. I always wondered, when listening to ham radio AM or SSB through static crashes and interference like lightning or neon lamps, when voice vowels or consonants are lost, how the mind replaces or fills in the lost information. My question: does a language with long vowels, like Italian or Spanish, get more information throughput over a noisy channel than a highly staccato language like Japanese, where I believe the static crashes wipe out the information contained in that language versus the long vowel sounds in Italian? For example, the English word “its” versus an Italian opera word sung out with several long vowels, like “Figaro” from the Barber of Seville. “Its” sounds like a static crash. The German language has those hard guttural sounds. Maybe those are harder to encode than Italian sounds? I.e. is your Codec 2 optimized for English or Spanish, but horrible for encoding Japanese and German sounds? I don’t know the answer to my question, but I did wonder whether, during high levels of QRM & QRN noise, Italian would be understood but Japanese not, because the static crashes cover over the meaning needed for a Japanese ear & mind to process the audio and recover the language details. The slowly changing sinusoidal harmonics in an Italian word are easy to process during static crashes, but Japanese would be difficult, because a static crash sounds so similar to the hard consonants in the language. Just wondered how difficult it is to understand Japanese during many static crashes.

    Glad that you have many years of education to reach that PhD level, David. Thank you for creating open source Codec2. Like reading your blog too.

    1. Thanks Fred. I enjoy writing the blog too. I am not sure about the robustness of different languages to Codec 2 and bit errors. My experience of Italian is it being spoken rather quickly. This is based on 25 years of experience – my former wife is of Italian stock and would regularly get quite animated when talking to me!

      I would guess that all languages send information at approximately the same bit rate, however some frequency domain “symbols” (phonemes) may indeed be more amenable to coding than others.

      Cheers,

      David

  4. The unquantised AbyS and AbyS 1000 samples are really very good. I struggle to understand DMR, Dstar etc. There’s always something missing that just makes copy really hard for me and that’s true of the 700B and 1300 samples here too. But the quality of these samples is almost as good as a quietening FM signal. Even the “tinkles” are preferable to the mushed up sound of the 700B and 1300 for me.

    Could this work be applied to the FreeDV 2400A?
