Many speech codecs use Linear Predictive Coding (LPC) to model the short term speech spectrum. For very low bit rate codecs, most of the bit rate is allocated to this information.
While working on the 700 bit/s version of Codec 2 I hit a few problems with LPC and started thinking about alternatives based on the masking properties of the human ear. I’ve written Octave code to prototype these ideas.
I’ve spent about 2 weeks on this so far, so thought I better write it up. Helps me clarify my thoughts. This is hard work for me. Many of the steps below took several days of scratching on paper and procrastinating. The human mind can only hold so many pieces of information. So it’s like a puzzle with too many pieces missing. The trick is to find a way in, a simple step that gets you a working algorithm that is a little bit closer to your goal. Like evolution, each small change needs to be viable. You need to build a gentle ramp up Mount Improbable.
Problems with LPC
We perceive speech based on the position of peaks in the speech spectrum. These peaks are called formants. To clearly perceive speech the formants need to be distinct, e.g. two peaks with a low level (anti-formant) region between them.
LPC is not very good at modelling anti-formants, the space between formants. As an all-pole filter it can only explicitly model peaks in the speech spectrum. This can lead to unwanted energy in the anti-formants which makes speech muffled and hard to understand. The Codec 2 LPC postfilter improves the quality of the decoded speech by suppressing inter-formant energy.
LPC attempts to model spectral slope and other features of the speech spectrum which are not important for speech perception. For example “flat”, high pass or low pass filtered speech is equally easy for us to understand. We can pass speech through a Q=1 bandpass or notch filter and it will still sound OK. However LPC wastes bits on these features, and gets into trouble with large spectral slope.
LPC has trouble with high pitched speakers where it tends to model individual pitch harmonics rather than formants.
LPC is based on “designing” a filter to minimise mean square error rather than the properties of the human ear. For example it works on a linear frequency axis rather than the log frequency axis of the human ear. This means it tends to allocate bits evenly across frequency, whereas an allocation weighted towards low frequencies would be more sensible. LPC often produces large errors near DC, an important area of human speech perception.
LPC puts significant information into the bandwidth of filters or width of formants, however due to masking the ear is not very sensitive to formant bandwidth. What is more important is sharp definition of the formant and anti-formant regions.
So I started thinking about a spectral envelope model with these properties:
- Specifies the location of formants with just 3 or 4 frequencies. Focuses on good formant definition, not the bandwidth of formants.
- Doesn’t care much about the relative amplitude of formants (spectral slope). This can be coarsely quantised or just hard coded using, e.g. voiced speech has a natural low pass spectral slope.
- Works in the log amplitude and log frequency domains.
Auditory Masking
Auditory masking refers to the “capture effect” of the human ear, a bit like an FM receiver. If you hear a strong tone, then you can’t hear slightly weaker tones nearby. The weaker ones are masked. If you can’t hear these masked tones, there is no point sending them to the decoder. So we can save some bits. Masking is often used in (relatively) high bit rate audio codecs like MP3.
I found some Octave code for generating masking curves (Thanks Jon!), and went to work applying masking to Codec 2 amplitude modelling.
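The actual curves come from that code; just to give the flavour, a rough stand-in for a single tone (harmonic) masking curve in Octave might look like this, using a textbook Bark conversion and ballpark Schroeder-style spreading slopes (not the exact values used in the codec):

% rough single-tone masking curve (illustrative only, not the codec's actual curves)
% mask falls off at roughly 25 dB/Bark below the masker and 10 dB/Bark above it
bark = @(f) 13*atan(0.76*f/1000) + 3.5*atan((f/7500).^2);   % Hz -> Bark

f_mask = 500; L_mask = 60;              % example masker: 500 Hz harmonic at 60 dB
f  = 0:10:4000;                         % frequency grid, Hz
dz = bark(f) - bark(f_mask);            % distance from the masker in Barks
maskdB = L_mask + 25*dz.*(dz < 0) - 10*dz.*(dz >= 0);
plot(f, maskdB); xlabel('Frequency (Hz)'); ylabel('Mask level (dB)');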
Masking in Action
Here are some plots to show how it works. Let’s take a look at frame 83 from hts2a, a female speaker. First, 40ms of the input speech:
Now the same frame in the frequency domain:
The blue line is the speech spectrum, the red the amplitude samples {Am}, one for each harmonic. It’s these samples we would like to send to the decoder. The goal is to encode them efficiently. They form a spectral envelope that describes the speech being articulated.
OK so let’s look at the effect of masking. Here is the masking curve for a single harmonic (m=3, the highest one):
Masking theory says we can’t hear any harmonics beneath the level of this curve. This means we don’t need to send them over the channel and can save bits. Yayyyyyy.
Now let’s plot the masking curves for all harmonics:
Wow, that’s a bit busy and hard to understand. Instead, let’s just plot the top of all the masking curves (green):
Better. We can see that the entire masking curve is dominated by just a few harmonics. I’ve marked the frequencies of the harmonics that matter with black crosses. We can’t really hear the contribution from other harmonics. The two crosses near 1500Hz can probably be tossed away as they just describe the bottom of an anti-formant region. So that leaves us with just three samples to describe the entire speech spectrum. That’s very efficient, and worth investigating further.
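Here is a rough Octave sketch of how the composite mask and the black crosses fall out (toy envelope and the same ballpark spreading slopes as above, not the real codec data):

% composite mask and dominant harmonic selection (toy data, illustrative only)
Fs = 8000; F0 = 200;                       % example pitch, roughly female
L = floor(Fs/2/F0);                        % number of harmonics
f_harm = (1:L)*F0;                         % harmonic frequencies, Hz
AmdB = 60 + 10*cos(2*pi*f_harm/1500) - f_harm/400;   % toy spectral envelope, dB

bark = @(x) 13*atan(0.76*x/1000) + 3.5*atan((x/7500).^2);
f = 1:Fs/2;                                % 1 Hz grid for the mask

mask = -100*ones(1, length(f));            % running max over all harmonic masks
for m = 1:L
  dz = bark(f) - bark(f_harm(m));
  curve = AmdB(m) + 25*dz.*(dz < 0) - 10*dz.*(dz >= 0);
  mask = max(mask, curve);                 % top of all the masking curves (green)
end

dominant = find(AmdB >= mask(f_harm) - 0.1);   % harmonics that set the mask (black crosses)
plot(f, mask, 'g', f_harm, AmdB, 'r+', f_harm(dominant), AmdB(dominant), 'kx');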
Spectral Slope and Coding Quality
Some speech signals have a strong “low pass filter” slope between 0 and 4000 Hz. Others have a “flat” spectrum – the high frequencies are about the same level as low frequencies.
Notice how the high frequency harmonics spread their masking down to lower frequencies? Now imagine we bumped up the level of the high frequency harmonics, e.g. with a first order high pass filter. Their masks would then rise, masking more low frequency harmonics, e.g. those near 1500Hz in the example above. Which means we could toss the masked harmonics away, and not send them to the decoder. Neat. Only down side is the speech would sound a bit high pass filtered. That’s no problem as long as it’s intelligible. This is an analog HF radio SSB replacement, not Hi-Fi.
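For example, something like this (the coefficient is just a typical pre-emphasis value, nothing tuned, and the input is only a stand-in so the sketch runs):

% first order high pass (pre-emphasis) filter to flatten spectral slope
speech = randn(1, 8000);              % stand-in for 1 second of 8 kHz speech
b = [1 -0.95];                        % H(z) = 1 - 0.95*z^-1, coefficient is a typical value
speech_hp = filter(b, 1, speech);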
This also explains why “flat” samples (hts1a, ve9qrp) with relatively less spectral slope code well, whereas others (kristoff, cq_ref) with a strong spectral slope are harder to code. Flat speech has improved masking, leaving less perceptually important information to model and code.
This is consistent with what I have heard about other low bit rate codecs. They often employ pre-processing such as equalisation to make the speech signal code better.
Putting Masking to work
Speech compression is the art of throwing stuff away. So how can we use this masking model to compress the speech? What can we throw away? Well let’s start by assuming only the samples with the black crosses matter. This means we get to toss quite a bit of information away. This is good. We only have to transmit a subset of {Am}. How, I’m not sure yet. Never mind that for now. At the decoder, we need to synthesise the speech, just from the black crosses. Hopefully it won’t sound like crap. Let’s work on that for now, and see if we are getting anywhere.
Attempt 1: Let’s toss away any harmonics that have a smaller amplitude than the mask (Listen). Hmm, that sounds interesting! Apart from not being very good, I can hear a tinkling sound, like trickling water. I suspect (but haven’t proved) this is because harmonics are coming and going quickly as the masking model puts them above and below the mask. Little packets of sine waves. I’ve heard similar sounds on other codecs when they are nearing their limits.
Attempt 2: OK, so how about we set the amplitude of all harmonics to exactly the mask level (Listen)? Hmmm, sounds a bit artificial and muffled. Now I’ve learned that muffled means the formants are not well formed. Needs more difference between the formant and anti-formant regions. I guess this makes sense if all samples are exactly on the masking curve – we can just hear ALL of them. The LPC post filter I developed a few years ago increased the definition of formants, which had a big impact on speech quality. So let’s try….
Attempt 3: Rather than deleting any harmonics beneath the mask, let’s reduce their level a bit. That way we won’t get tinkling – harmonics will always be there rather than coming and going. We can use the mask instead of the LPC post filter to know which harmonics we need to attenuate (Listen).
That’s better! Close enough to using the original {Am} (Listen), however with lots of information removed.
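In terms of the harmonic amplitudes (all in dB, continuing the sketch above) the three attempts boil down to something like this; the attenuation in Attempt 3 is only a plausible value, “a bit” isn’t pinned down yet:

% the three attempts, reusing mask, f_harm, AmdB from the earlier sketch
maskdB = mask(f_harm);                           % composite mask at each harmonic
masked = AmdB < maskdB;                          % harmonics buried beneath the mask

Am1dB = AmdB; Am1dB(masked) = -100;              % Attempt 1: toss them away
Am2dB = maskdB;                                  % Attempt 2: pin everything to the mask
Am3dB = AmdB; Am3dB(masked) = Am3dB(masked) - 6; % Attempt 3: just attenuate them (value illustrative)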
For comparison here is Codec 2 700B (Listen) and Codec 2 1300 (aka FreeDV 1600 when we add FEC) (Listen). This is the best I’ve done with LPC/LSP to date.
The post filter algorithm is very simple. I set the harmonic magnitudes to the mask (green line), then boost only the non-masked harmonics (black crosses) by 6dB. Here is a plot of the original harmonics (red), and the version (green) I mangle with my model and send to the decoder for synthesis:
Here is a spectrogram (thanks Audacity) for Attempt 1, 2, and 3 for the first 1.6 seconds (“The navy attacked the big”). You can see the clearer formant representation with Attempt 3, compared to Attempt 2 (lower inter-formant energy), and the effect of the post filter (dark line in center of formants).
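In Octave terms, continuing the earlier sketches, that post filter is just a couple of lines:

% post filter: pin each harmonic to the composite mask, then lift the
% non-masked harmonics (black crosses) by 6 dB
Am_dB = mask(f_harm);                      % set magnitudes to the mask (green line)
Am_dB(dominant) = Am_dB(dominant) + 6;     % +6 dB on the black crosses
Am_ = 10 .^ (Am_dB/20);                    % back to linear amplitudes for synthesis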
Command Line Kung Fu
If you want to play along:
~/codec2-dev/build_linux/src$ ./c2sim ../../raw/kristoff.raw --dump kristoff
octave:49> newamp_batch("../build_linux/src/kristoff");
~/codec2-dev/build_linux/src$ ./c2sim ../../raw/kristoff.raw --amread kristoff_am.out -o - | play -t raw -r 8000 -e signed-integer -b 16 - -q
The “newamp_fbf” script lets you single step through frames.
Phases
To synthesise the speech at the decoder I also need to come up with a phase for each harmonic. Phase and speech is still a bit of a mystery to me. Not sure what to do here. In the zero phase model, I sampled the phase of the LPC synthesis filter. However I don’t have one of them any more.
Let’s think about what the LPC filter does with the phase. We know that at a resonance the phase shifts rapidly:
The sharper the resonance the faster it swings. This has the effect of dispersing the energy in the pitch pulse exciting the filter.
So with the masking model I could just choose the center of each resonance, and swing the phase about madly. I know where the center of each resonance is, as we found that with the masking model.
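As a first cut, something like this might do (just a sketch continuing the earlier ones, not what the codec does yet; the pole radius is a guess):

% sample the phase of a second order resonator placed at each black-cross
% frequency, much like the zero phase model samples the LPC synthesis filter
Fs = 8000; g = 0.97;                     % pole radius sets how fast the phase swings
centres = f_harm(dominant);              % resonance centres from the mask model
w = 2*pi*f_harm/Fs;                      % harmonic frequencies, rad/sample

phase = zeros(1, length(f_harm));
for k = 1:length(centres)
  wk = 2*pi*centres(k)/Fs;
  A = 1 - 2*g*cos(wk)*exp(-j*w) + g*g*exp(-j*2*w);   % A(e^jw) of one resonator
  phase = phase + angle(1 ./ A);         % accumulate phase of the cascaded resonators
end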
Next Steps
The core idea is to apply a masking model to the set of harmonic magnitudes {Am} and select just 3-4 samples of that set that define the mask. At the decoder we use the masking model and a simple post filter to reconstruct a set of {Am_} that we use to synthesise the decoded speech.
Still a few problems to solve, however I think this masking model holds some promise for high quality speech at low bit rates. As it’s completely different to conventional LPC/LSP I’m flying blind. However the pieces are falling into place.
I’m currently working on i) how to reduce the number of samples to a low number; ii) how to determine which ones we really need (e.g. discarding inter-formant samples); and iii) how to represent the amplitude of each sample with a low or zero number of bits. There are also some artifacts with background noise and chunks of spectrum coming and going.
I’m pretty sure the frequencies of the samples can be quantised coarsely, say 3 bits each using scalar quantisation, or perhaps 8 bits/frame using VQ. There will also be quite a bit of correlation between the amplitudes and frequencies of each sample.
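For example, a 3 bit scalar quantiser of each mask sample frequency could be as simple as this (log spaced grid, limits only illustrative; centres are the black-cross frequencies from the earlier sketches):

% 3 bit scalar quantisation of each mask sample frequency on a log grid
levels = 2^3;                                        % 3 bits -> 8 levels
fgrid  = logspace(log10(300), log10(3400), levels);  % Hz, log spaced (limits illustrative)

fq = zeros(size(centres)); ind = zeros(size(centres));
for k = 1:length(centres)
  [~, ind(k)] = min(abs(fgrid - centres(k)));        % nearest grid point
  fq(k) = fgrid(ind(k));                             % quantised frequency
end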
For voiced speech there will be a downwards (low pass) slope in the amplitudes, for unvoiced speech more energy at high frequencies. This suggests joint VQ of the sample frequencies and amplitudes might be useful.
The frequency and amplitude of the mask samples will be highly correlated in time (small frame to frame variations) so will have good robustness to bit errors if we apply trellis decoding techniques. Compared to LPC/LSP the bandwidth of formants is “hard coded” by the masking curves, so the dreaded R2D2 noises from LSPs getting too close together due to bit errors might be a thing of the past. I’ll explore robustness to bit errors when we get to the fully quantised stage.
I’m not really sure that you can effectively use curves for simultaneous masking in a vocoder like codec2. The way masking is typically used is to reduce the number of bits spent on the “waveform details”, which AFAIK codec2 doesn’t code at all.
One area you could look at is how you obtain the LP coefficients. There are many different algorithms you can use, including:
1) Straight Levinson-Durbin on the windowed auto-correlation
2) Burg method (like in Opus)
3) Compute the spectral envelope in any way you like and convert that to LPC (inverse FFT of the power spectral envelope gives you the auto-correlation)
On top of this, there are several “regularization” parameters to play with (see the sketch after this list):
1) Bandwidth expansion (https://en.wikipedia.org/wiki/Bandwidth_expansion)
2) Lag windowing (https://en.wikipedia.org/wiki/Lag_windowing)
3) Adding a “noise floor” to the first auto-correlation coefficient
4) Adding a pre-emphasis filter before the LPC computation
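A rough Octave sketch pulling a few of these together around a standard autocorrelation LPC fit (pre-emphasis, lag windowing, a noise floor on R(0), then bandwidth expansion); all values are typical textbook numbers, not tuned, and the input is just noise so it runs stand-alone:

% autocorrelation LPC with some of the usual regularisation knobs (illustrative values)
Fs = 8000; p = 10;
speech = randn(1, 160);                       % stand-in for one 20 ms frame

x = filter([1 -0.95], 1, speech);             % 4) pre-emphasis
n = length(x);
x = x .* (0.5 - 0.5*cos(2*pi*(0:n-1)/(n-1))); % Hann analysis window

r = zeros(1, p+1);                            % autocorrelation, lags 0..p
for k = 0:p
  r(k+1) = sum(x(1:n-k) .* x(1+k:n));
end
r(1) = r(1)*(1 + 1E-3);                       % 3) noise floor on R(0)
r = r .* exp(-0.5*(2*pi*20*(0:p)/Fs).^2);     % 2) Gaussian lag window, ~20 Hz

ak = toeplitz(r(1:p)) \ r(2:p+1)';            % normal equations (same answer as Levinson-Durbin)
a  = [1; -ak]';                               % A(z) = 1 - sum ak*z^-k
a  = a .* (0.994 .^ (0:p));                   % 1) bandwidth expansion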
Another thing you may want to explore is warped LPC, which gives you better resolution in low frequencies and worse resolution at high frequency. I’ve never played with that myself though.
Last random thought, what if you used different LSP codebooks (or different subsets of a larger codebook) depending on the pitch period you encode (e.g. male vs female)?
Re “I’m not really sure that you can effectively use curves for simultaneous masking in a vocoder like codec2.”:
I’ve done it! Just listen to the Attempt 3 sample above. This result is consistent across a range of samples I have tested.
In my latest models I am using just 3-4 samples of the masking curve to synthesise reasonable quality speech. I’m currently working on how to represent the amplitudes of those samples efficiently, for example fitting them to a straight line in the log amplitude domain.
Yes I agree there are many techniques we can use with LPC that will overcome some of its issues. The LPC/LSP paradigm has done a good job for me so far.
However I can’t help feeling that a lot of these tricks, like LSP weighting factors, pre-processing of speech to keep LPC happy etc, are hacks to support a model that has some inherent problems.
I also think it’s a case of “if you have a hammer everything is a nail”. The speech coding community has trouble thinking outside the LPC/LSP box. AMBE is a notable exception, IIRC they use VQ of de-correlated {delta-Am} samples in the log domain.
So rather than play with more LSP weighting factors etc I’m spending a few weeks trying something a bit different.
Cheers,
David
David, what I mean is that the effect you’re getting isn’t what you think. If you were transmitting (e.g.) phase for the sinusoids, then masking can tell you which phases will actually be audible. In this case, as far as I understand, you’re using the masking curve simply as a way to widen the formants in the LPC analysis. In some sense, it’s actually not so far from bandwidth expansion. To me, the actual thing you want to model here isn’t psychoacoustic masking, but the human perception of loudness. Unfortunately, it’s not well understood (compared to masking), so you might have to play around with it. In the end, the way you modify the spectrum may look somewhat similar to your masking calculations, or it may not.
Hi Jean-Marc,
There is no LPC analysis anymore. I’m taking the set of sinusoidal amplitudes {Am}, decimating them to about 4 samples, which are then used to create {Am_} using a masking model. So the entire spectrum is represented by the frequency and amplitude of 4 Am samples.
This is a much more direct and intuitive representation than LSPs. The amplitude and frequency of these samples can be quantised at different resolutions.
One problem with bandwidth expansion of LPCs is it raises the level of the anti-formant regions. So there is a trade off between formant width (which the ear is not very sensitive to), and good formant definition (i.e. deep anti-formants) that LPC struggles with.
The issues with LSPs being close together, peaky LPC poles etc is an artifact of LPC – it’s not something that occurs naturally in speech articulation or perception. We put a lot of effort into making LPC work.
The speech samples above use the original phases; I still need a way to generate phases for voiced speech at the decoder. Actually I have tried using phases derived from an LPC model and that sounds pretty good. However I don’t have the LPC model at the decoder at this stage, unless perhaps I use the {Am_} to construct it. However it does show a simple phase model will be adequate. Anyway, will look into that problem later.
Thanks!
David
Hi David,
Even when coding four A_m values, you may not be as far from LPC as you think. The way line spectral pairs work, you usually end up with one pair of poles around each formant, with their distance determining the amplitude of the formant (the closer the poles, the higher the amplitude). So when you’re coding four A_m values (amplitude and value of m for each), you’re coding 8 parameters that aren’t that far from representing the same information as 8 LSPs. It doesn’t mean that you shouldn’t keep track of formants directly (rather than LSPs), but there’s nothing that prevents you from using bits of both representations. In the end, it looks like what you really want is a representation for which the mean squared error between the quantized and unquantized versions is as close as possible to the actual perceived distortion.
On a different topic, you talked about problems with spectral valleys (anti-formants) not being deep enough. This is something that’s actually relatively easy to fix in many cases. Many speech codecs apply so called formant enhancement that mostly de-emphasises valleys. If you have a filter A(z), you can compute a post-filter A(z/g1)/A(z/g2) with g1<g2 (both are bandwidth expansion parameters) that you apply after your normal A(z).
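In Octave that post-filter is only a couple of lines (the g values are typical CELP-style choices and the inputs below are toy stand-ins):

% formant enhancement post-filter H(z) = A(z/g1)/A(z/g2), g1 < g2
a = [1 -1.3 0.7];                        % toy LPC coefficients, a(1) = 1
synth = randn(1, 160);                   % toy synthesised frame
g1 = 0.75; g2 = 0.9; p = length(a) - 1;

num = a .* (g1 .^ (0:p));                % A(z/g1)
den = a .* (g2 .^ (0:p));                % A(z/g2)
enhanced = filter(num, den, synth);      % deeper valleys, sharper formants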
Thanks Jean-Marc, yes I agree with your comparison with LPC/LSP. Naturally there are similarities – both schemes are aiming to represent the same information. By exploring the similarities and differences we can push the art forward.
Your last para is exactly what I did with the LPC post filter. It’s helped with my understanding of LPC and where it breaks down. It got me thinking – rather than apply all these hacks to make LPC perform – is there a better way?
So I’m investing a few weeks to explore these ideas. Last few days I have quantised the amplitudes of the samples at 8 bits/frame with fair results. I’m also experimenting with “analysis by synthesis” to locate the decimated {Am} samples, possibly including the amplitude quantisation in the A by S loop.
Think I’ll make a first pass at a fully quantised codec in the 700 bit/s range. Release early and often. That should highlight any problem areas, demonstrate the potential of this scheme (or conversely draw out any show stoppers), and if it works give us an incremental improvement on Codec 2 700B for FreeDV.
Hi David,
I think at these rates, it may be better to forget about LSPs vs {Am} and just quantize *spectral envelopes* directly. Why not just have a table of N different spectral envelopes and just code which one to use for each frame? Of course, part of that is just shifting the problem to the way you come up with this table and the metric you use to pick the closest, but at least you’re not artificially constraining yourself with limited models.
Yes something like direct VQ of the spectral envelopes would be interesting to try (again). Actually I’ve tried it a few times but couldn’t get good results. Should be possible in principle. xMBE uses a variation of that (or they did some time ago at least).
A couple of issues: with my codec the number of samples describing the envelope is time varying. One can interpolate to a fixed number of samples. Then the second problem is that the dimension of that vector is quite large (40-80), so storage and search complexity become an issue.
Another issue is perceptually irrelevant variations in the signal (e.g. spectral slope, HP/LP filtering) can mean big differences to spectral envelope matching in a MSE sense. So it’s not a great match to the ear.
I also had issues with “noise” – inter-formant VQ errors have a big effect on perceived speech quality, like CELP, as direct matching of spectral envelopes doesn’t take masking into account.
But sure, these issues could be “engineered out” in principle. For example doing a weighted match, like we match codebook entries with gain in CELP. Making the vector a function of a first order polynomial would allow us to match spectral slope:
y(f,i) = m*se(f,i) + c
where se(f,i) is the i-th spectral envelope vector from our codebook (a function of frequency), c our frame energy, and m the spectral slope. Try a bunch of se candidate vectors, solving for m and c with each one.
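The search would then be a least squares solve for m and c per candidate, something like this (all names and data here are illustrative):

% pick the codebook entry that best fits y = m*se + c in a least squares sense
K = 40; N = 16;
se = randn(K, N); target = randn(K, 1);      % toy envelopes, just so the sketch runs

best_e = Inf;
for i = 1:N
  X  = [se(:,i) ones(K,1)];                  % model: y = m*se(:,i) + c
  mc = X \ target;                           % least squares solve for [m; c]
  e  = sum((target - X*mc).^2);
  if e < best_e
    best_e = e; best_i = i; best_mc = mc;
  end
end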
Every model has its limitations I guess.
The masking model I am playing with has several nice properties: it sounds good early (well, good enough); the amplitudes and frequencies are orthogonal and can be coded independently; it uses a small number of samples; and it addresses many of the limitations of LPC and other schemes.
Hi David,
Good to see that you are trying new approaches – the 3rd sample does sound pretty good. But you haven’t talked much about determining the phases of the selected harmonics. I assume this is an advantage of the LPC approach, and with a small subset of harmonics you will need to combine them in a manner that avoids discontinuities at the frame boundaries? Maybe there is a neat solution for picking the set of phases to achieve this(?).
In any case, whether it is LPC or a subset of the harmonics, I like the idea of adapting to the specific speaker. This could be done via different VQ codebooks as mentioned above, or just by adjusting the quantisation schemes for male/female, young/old etc speakers. This could include other parameters like F0 as well. Obviously the encoder would need to transmit a small amount of extra side information at regular intervals to identify the speaker type for the decoder.
Hi Bill,
As per the comment with Jean-Marc above, yep, still working on how to determine the phase spectra for voiced {Am_}. Attempt 3 uses the full set of L {Am_}, I’ve just tweaked the level of a few (postfiltering). Using the original phases for synthesis of {Am_} at this stage, but I have tried some synthetic phase spectra with comparable results.
The issue with real time adaption is of course the effect of bit errors. I’d like to try real time adaption with the Trellis Decoding scheme, even with say one error in 100 the transition probabilities would still be about the same. I’m a bit nervous about using real time adaption with other quantisers, as I can’t work out a way to make it robust to bit errors.
Sending a male/female bit (for example) to select a set of codebooks could also be covered by an extra bit in the VQ.
Cheers,
David
Well the 3rd attempt distorts the sound, but it distorts it in a good way, making it more audible (than the original). The 700B sample had a hint of robotic sound to it which wasn’t present in the 3rd attempt. I would like to hear how this sounds with other samples; these are truly interesting results and could be a very big game changer when it comes to low bit rate voice coding. Maybe an automated script that auto generates samples every time you commit could be useful for comparison. This truly is amazing work and I will stay tuned for any updates. I would love to hear how this sounds under interference / bit loss conditions.
Attempt 1 sure sounds a lot like
http://www.mrc-cbu.cam.ac.uk/people/matt-davis/personal/sine-wave-speech/
… because, from my understanding, eliminating formants results in fewer sinusoids trying to recreate the original speech.
Very interesting examples on that site. Quite remarkable what we can adapt to. I’m told that non-HF radio people can’t understand SSB at SNRs < 6 dB.
This could be used for a novel low bit rate speech codec, as the excitation (pitch and voicing) bits could be removed (150 bit/s). The lack of speaker recognizability could be useful in some applications.