I’ve been making steady progress on my new ideas for amplitude quantisation for Codec 2. The goal is to increase speech quality, in particular for the very low bit rate 700 bit/s modes.
Here are the signal processing steps I’m working on:
The signal processing algorithms I have developed since Part 1 are coloured in blue. I still need to nail the yellow work. The white stuff has been around for years.
Actually I spent a few weeks on the yellow steps but wasn’t satisfied so looked for something a bit easier to do for a while. The progress has made me feel like I am getting somewhere, and pumped me up to hit the tough bits again. Sometimes we need to organise the engineering to suit our emotional needs. We need to see (or rather “feel”) constant progress. Research and Disappointment is hard!
Transformations and Sample Rate Changes
The goal of a codec is to reduce the bit rate, but still maintain some target speech quality. The “quality bar” varies with your application. For my current work low quality speech is OK, as I’m competing with analog HF SSB. Just getting the message through after a few tries is the lower bar, the upper bar being easy conversation over that nasty old HF channel.
While drawing the figure above I realised that a codec can be viewed as a bunch of processing steps that either (i) transform the speech signal or (ii) change the sample rate. An example of transforming is performing an FFT to convert the time domain speech signal into the frequency domain. We then decimate in the time and frequency domain to change the sample rate of the speech signal.
Lowering the sample rate is an effective way to lower the bit rate. This process is called decimation. In Codec 2 we start with a bunch of sinusoidal amplitudes that we update every 10ms (100Hz sampling rate). We then throw away 3 out of every 4 to give a sample rate of 25Hz. This means there are fewer samples to send every second, so the bit rate is reduced.
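Here is a rough sketch of that decimation step and its effect on the bit rate. The function names and the figure of 28 bits per frame are made up for illustration (chosen so the arithmetic lands on 700 bit/s); they are not taken from the Codec 2 source.

```python
def decimate(frames, factor=4):
    """Keep one frame out of every `factor`, discarding the rest."""
    return frames[::factor]


def bit_rate(frame_rate_hz, bits_per_frame):
    """Bits per second needed to send every frame."""
    return frame_rate_hz * bits_per_frame


# One second of placeholder amplitude frames at the 100 Hz update rate.
frames_100hz = list(range(100))
frames_25hz = decimate(frames_100hz, factor=4)

BITS_PER_FRAME = 28  # hypothetical figure, for illustration only

print(len(frames_25hz))               # 25
print(bit_rate(100, BITS_PER_FRAME))  # 2800
print(bit_rate(25, BITS_PER_FRAME))   # 700
```

With the same number of bits spent on each frame, sending a quarter of the frames costs a quarter of the bit rate.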
At the decoder we use interpolation to smoothly fill in the missing gaps, raising the sample rate back up to 100Hz. We eventually transform back to the time domain using an inverse FFT to play the signal out of the speaker. Speakers like time domain signals.
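As a minimal sketch of the decoder side, the snippet below fills in the missing frames by straight-line interpolation between the frames that were kept. Linear interpolation is my assumption here for illustration; the real decoder may use something smoother, and the `interpolate` name is hypothetical.

```python
def interpolate(decimated, factor=4):
    """Linearly interpolate between kept frames to restore the original rate."""
    out = []
    for a, b in zip(decimated, decimated[1:]):
        for k in range(factor):
            # k = 0 reproduces the kept frame, k = 1..factor-1 fill the gap
            out.append(a + (b - a) * k / factor)
    out.append(decimated[-1])  # the final kept frame has no successor
    return out


# Three frames at 25 Hz (e.g. one amplitude in dB) back up to 100 Hz.
print(interpolate([0.0, 4.0, 8.0]))
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```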
In the figure above we start with chunks of speech samples in the time domain, then transform into the frequency domain, where we fit a sinusoidal model, then a masking model.
The sinusoidal model takes us from a 512 point FFT to 20-80 amplitudes. It fits a sinusoidal speech model to the incoming signal. The number of sinusoidal amplitudes varies with the pitch of the incoming voice. It is time varying, which complicates our life if we desire a constant bit rate.
The masking model fits a smoothed envelope that represents the way we produce and hear speech. For example we don’t talk in whistles (unless you are R2D2) so there is no point wasting bits on being able to code very narrow bandwidth signals. The ear masks weak tones near strong ones so there is no point coding them either. The ear also has a log frequency and amplitude response so we take advantage of that too.
In this way the speech signal winds its way through the codec, being transformed this way and that, as we carve off samples until we get something that we can send over the channel.
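To see why the count moves with pitch, here is a small sketch. It assumes 8 kHz sampled speech with one sinusoid at each harmonic of the pitch up to 4 kHz; the `num_harmonics` helper is a hypothetical name for illustration.

```python
def num_harmonics(f0_hz, fs_hz=8000):
    """Number of pitch harmonics that fit below half the sample rate."""
    return int((fs_hz / 2) // f0_hz)


# Low pitched voices have many closely spaced harmonics,
# high pitched voices have fewer, widely spaced ones.
for f0 in (50, 100, 200):
    print(f0, num_harmonics(f0))
# 50 80
# 100 40
# 200 20
```

Under those assumptions a pitch range of 50 to 200 Hz gives 80 down to 20 amplitudes, so the vector we need to quantise changes length from frame to frame.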
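A quick sketch of what a log amplitude and log frequency response look like in practice. Amplitudes go to decibels, and frequencies are warped with the mel scale, which is one common approximation of how the ear spaces frequencies. The mel formula is my choice for illustration; the masking model itself may use a different warping.

```python
import math


def to_db(amplitude):
    """Linear amplitude to decibels (log amplitude scale)."""
    return 20.0 * math.log10(amplitude)


def hz_to_mel(f_hz):
    """Warp frequency in Hz to the mel scale (roughly log above 1 kHz)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# A 10x change in amplitude is a fixed 20 dB step.
print(to_db(1.0), to_db(10.0))  # 0.0 20.0

# The same 1 kHz step covers far fewer mels at high frequencies,
# so less resolution is needed up there.
low_step = hz_to_mel(1000) - hz_to_mel(0)
high_step = hz_to_mel(4000) - hz_to_mel(3000)
print(low_step > high_step)  # True
```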
Need to sort out those remaining yellow blocks, and come up with a fully quantised codec candidate.
An idea that occurred to me while drawing the diagram – can we estimate the mask directly from the FFT samples? We may not need the intermediate estimation of the sinusoidal amplitudes any more.
It may also be possible to analyse/synthesise using filters modeling the masks running in the time domain. For example, on the analysis side we could look at the energy at the output of a bunch of masking filters spaced closely enough that we can’t perceive the difference.
Writing stuff up on a blog is cool. It’s “the cardboard colleague” effect: the process of clearly articulating your work can lead to new ideas and bug fixes. It doesn’t matter who you articulate the problems to, just talking about them can lead to solutions.