Codec 2 Masking Model Part 3

I’ve started working on this project again. It’s important as progress will feed into both the HF and VHF work. It’s also kind of cool as it’s very unlike what anyone else is doing out there in speech coding land where it’s all LPC/LSP.

In Part 1 I described how the spectral amplitudes can be modelled by masking curves. The next step is to (i) decimate the model to a small number of samples and (ii) quantise those samples using a modest number of bits/frame.

This post describes the progress I have made in decimating the masking model parameters, the top yellow box here:

Analysis By Synthesis

Back when I was just a wee slip of a speech coder, I worked on Code Excited Linear Prediction (CELP). These codecs use a technique called Analysis by Synthesis (AbyS). To choose the speech model parameters, a bunch of candidate sets are tried, the resulting speech is synthesised, and the results are evaluated. The set of parameters that minimises the difference between the input speech and the synthesised output speech is transmitted to the decoder.

Trying every possible set of parameters keeps the encoder DSPs rather busy, and just getting them to run in real time was quite a challenge at the time (the late 1980s, on 10 MIP DSPs).

Time goes by, and it’s now 30 years later. After a few dead ends, I’ve worked out a way to use AbyS to select the best 4 amplitude/frequency pairs to describe the speech spectrum. It works like this:

  1. In each frame there are L possible frequency positions, one at each harmonic frequency. Each frequency has a corresponding harmonic amplitude {Am}.
  2. At each harmonic position, I generate a masking function, and measure the error between that and the target spectral envelope.
  3. After all possible masking functions are evaluated, I choose the one that minimises the error to the target.
  4. The process is then repeated for the next stage, until we have used 4 masking functions in total. As each masking function is “fitted”, the total error gradually reduces.
  5. The output is four frequencies and four amplitudes. These must be sent to the decoder, where they can be used to generate a spectral envelope that approximates the original.
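The greedy search above can be sketched in a few lines. This is an illustration of the idea rather than the newamp.m code: the triangular `mask_fn`, the dB floor, and all names here are my own hypothetical stand-ins for the real psychoacoustic masking curves.

```python
import numpy as np

def mask_fn(freqs, centre, amp, slope_db_per_hz=0.01):
    # Hypothetical triangular masking function in dB: peaks at `centre`
    # with level `amp` and falls off linearly with frequency distance.
    return amp - slope_db_per_hz * np.abs(freqs - centre)

def abys_fit(harm_freqs, target_db, n_samples=4):
    """Greedily choose n_samples (freq, amp) pairs whose combined
    mask minimises the mean square error (dB) to the target envelope."""
    freqs = np.asarray(harm_freqs, float)
    target = np.asarray(target_db, float)
    model = np.full_like(target, -100.0)  # running model, starts far below target
    pairs = []
    for _ in range(n_samples):
        best_err, best = None, None
        # Try a mask centred at every harmonic position (steps 1-3)
        for f, a in zip(freqs, target):
            trial = np.maximum(model, mask_fn(freqs, f, a))
            err = np.mean((target - trial) ** 2)
            if best_err is None or err < best_err:
                best_err, best = err, (f, a, trial)
        f, a, model = best
        pairs.append((f, a))  # fit this mask, then repeat for the next stage (step 4)
    return pairs, model
```

Each outer iteration is one "stage": the mask that helps most is frozen in, then the search repeats with the remaining error, so the total error can only fall or stay flat as samples are added.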

The following plots show AbyS in action for frame 50 of hts1a:

The red line is the spectral envelope defined by the harmonic amplitudes {Am}. Magenta is the model the decoder uses based on 4 frequency/amplitude samples, and found using AbyS. The black crosses indicate the frequencies found using AbyS.

Here is a plot of the error (actually mean square error) for each mask position at each stage. As we add more samples to the model, the error against the target decreases. You can see a sharp dip in the first stage (top, blue curve) around 2500 Hz; that is the frequency chosen for the first mask sample. With the first sample fixed, we then search for the best position for the second sample (dark green curve), which occurs around 500 Hz.

Samples

Here are some samples from the AbyS model compared to the Codec 2 700B and 1300 modes. The AbyS frequency/amplitude pairs are unquantised, but other parameters (synthetic phase, pitch, voicing, energy, frame update rate) are the same as Codec 2 700B/1300.

Sample       700B     1300     newamp AbyS
ve9qrp_10s   Listen   Listen   Listen
mmt1         Listen   Listen   Listen
vk5qi        Listen   Listen   Listen

At 700 bit/s we have 28 bits/frame available. Assuming 7 bits for pitch, 1 for voicing, and 5 for frame energy, that leaves a budget of 15 bits/frame for the AbyS freq/amp pairs. At 1300 bit/s we have 52 bits/frame in total, with 39 bits/frame available for the AbyS freq/amp pairs.
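The frame budgets work out as a quick bit of arithmetic, assuming the 40 ms frame update rate used throughout:

```python
def bits_per_frame(bit_rate, frame_ms=40):
    # bits/frame = bit rate (bit/s) * frame period (s)
    return bit_rate * frame_ms // 1000

overhead = 7 + 1 + 5  # pitch + voicing + frame energy bits

print(bits_per_frame(700))              # 28 bits/frame at 700 bit/s
print(bits_per_frame(700) - overhead)   # 15 bits/frame for freq/amp pairs
print(bits_per_frame(1300))             # 52 bits/frame at 1300 bit/s
print(bits_per_frame(1300) - overhead)  # 39 bits/frame for freq/amp pairs
```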

My goal is to get 1300 bit/s quality at 700 bit/s using the AbyS masking model technique. That would significantly boost the quality at 700 bits/s and let us use the COHPSK modem that works really well on HF channels.

Command Lines

newamp.m was configured with decimation_in_time on and set to 4 (40 ms frame update rate, with interpolation at 10 ms intervals). This is the same frame update rate as the Codec 2 700B and 1300 modes. The phase0 model was enabled in c2sim to use synthetic phases and a single voicing bit, just like the Codec 2 modes. The synthetic phases were derived from an LPC model, but they can also be synthesised from any amplitude spectrum, such as the AbyS masking model.
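As a rough sketch of the decimation-in-time idea: the model parameters are sent every 40 ms, and the decoder reconstructs the intermediate 10 ms frames by interpolation. This assumes simple linear interpolation of dB amplitudes; the actual newamp.m/c2sim code also handles pitch, energy, and voicing and may differ in detail.

```python
import numpy as np

def interpolate_frames(amps_a_db, amps_b_db, decimation=4):
    """Reconstruct the intermediate 10 ms frames between two model
    frames sent 40 ms apart, by linear interpolation in dB."""
    a = np.asarray(amps_a_db, float)
    b = np.asarray(amps_b_db, float)
    weights = np.arange(decimation) / decimation  # 0, 0.25, 0.5, 0.75
    return [(1 - w) * a + w * b for w in weights]
```

The first returned frame is just the earlier model frame; the decoder synthesises speech every 10 ms while only receiving model updates every 40 ms.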

octave:20> newamp_batch("../build_linux/src/vk5qi")
$ ./c2sim ../../raw/vk5qi.raw --amread vk5qi_am.out --phase0 --postfilter -o - | sox -t raw -r 8000 -s -2 - ~/Desktop/abys/vk5qi.wav

Happy Birthday to Me

This is my 300th blog post in 10 years! Yayyyyyy. That’s about one rather detailed post every two weeks. I started with this one in April 2006 just after I hung up my trousers and departed the corporate world.

This blog currently gets visited by 3500 unique IPs/day although it regularly hits 5000/day. I type posts up in Emacs, then paste them into WordPress for final editing. I draw figures in LibreOffice Impress, and plots using GNU Octave.

I quite like writing, it gives me a chance to exercise the teacher inside me. Reporting on what I have done helps get it straight in my head. If I solve a problem I figure the solution might be useful for others.

I hope this blog has been useful for you too.

Links

Codec 2 Masking Model Part 1
Codec 2 Masking Model Part 2

2 thoughts on “Codec 2 Masking Model Part 3”

  1. Hi David,

    Well I am certainly MOST IMPRESSED with your ‘teaching’.

    Your electric car saga still has a lot of meaning for me :). Wish I lived next door.

    Recently, I was offered ‘free’ solar panels for my house. They MUST be making money. So I was pricing their panels and thinking about storage. It seems to me that with solar panels and two electric cars one is off grid. Now does that make $$ sense? Maybe not quite, but it MUST be getting close? Otherwise why would I be offered ‘free’ solar panel installation [they get the power].

    Just gotta stop being so lazy :)…. Start writing some code…. Join the party :) :). Contribute !!

    Glad you are on the web!

    Lots of fun :).

    John

  2. WOW, this sounds so much better to my ears than Codec2 and LPC in general.
    I know you target the lowest bit-rates, but please don’t neglect the possibility of making “hifi” quality speech at 2400/3200 bit/s.

    Sharing your process is really inspiring and makes it so much easier to understand complicated technology like this.
    Keep up the good work! :)

Comments are closed.