Low Order LPC and Bandpass Filtering

I’ve been working on the Linear Predictive Coding (LPC) modeling used in the Codec 2 700 bit/s mode to see if I can improve the speech quality. Given this mode was developed in just a few days I felt it was time to revisit it for some tuning.

LPC fits a filter to the speech spectrum. We update the LPC model every 40ms for Codec 2 at 700 bit/s (10 or 20ms for the higher rate modes).

Speech codecs typically use a 10th order LPC model. This means the filter has 10 coefficients, and every 40ms we have to send them to the decoder over the channel. For the higher bit rate modes I use about 37 bits/frame for this information, which is the majority of the bit rate.
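
For anyone who wants to play along outside of Codec 2, here is a minimal sketch of the autocorrelation method of LPC analysis in Python. The function name, window, and frame size are my choices for illustration, not what Codec 2 does internally:

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, p=10):
    """Fit a p'th order LPC model to one frame of speech.

    Autocorrelation method: solve the normal equations R a = r, where
    R is the symmetric Toeplitz autocorrelation matrix of the frame.
    """
    x = x * np.hanning(len(x))                   # analysis window
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = solve_toeplitz(r[:p], r[1:])             # predictor coefficients a_1..a_p
    return np.concatenate(([1.0], -a))           # A(z) = 1 - sum a_k z^-k

To try it on one of the samples (16 bit signed 8kHz raw files): x = np.fromfile("hts1a.raw", dtype="<i2").astype(float), then lpc(x[2000:2320]) models one 40ms (320 sample) frame.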

However I discovered I can get away with a 6th order model, if the input speech is filtered the right way. This has the potential to significantly reduce the bit rate.

The Ear

Our ear perceives speech based on the frequency of peaks in the speech spectrum. When the peaks in the speech spectrum are indistinct, we have trouble understanding what is being said. The speech starts to sound muddy. With analog radio like SSB (or in a crowded room), the troughs between the peaks fill with noise as the SNR degrades, and eventually we can’t understand what’s being said.

The LPC model is pretty good at representing peaks in the speech spectrum. With a 10th order LPC model (p=10) you get 10 poles. Each pair of poles can represent one peak, so with p=10 you get up to 5 independent peaks; with p=6, just 3.
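
You can see where a given model has put its peaks by factoring A(z) and converting the pole angles to Hz. A quick sketch, assuming the lpc() helper above:

import numpy as np

def peak_frequencies(a, fs=8000):
    """Frequencies (Hz) of the complex pole pairs of the LPC filter 1/A(z)."""
    poles = np.roots(a)                          # factor A(z)
    poles = poles[np.imag(poles) > 0]            # one pole from each conjugate pair
    return np.sort(np.angle(poles) * fs / (2 * np.pi))

With p=6 this returns at most 3 frequencies, which is the "runs out of poles" problem discussed below.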

I discovered that LPC has some problems if the speech spectrum has big differences between the low and high frequency energy. To find the LPC coefficients, we use an algorithm that minimises the mean square error. It tends to “throw poles” at the highest energy part of the signal (frequently near DC), while ignoring the still important, lower energy peaks above 1000Hz. So there is a mismatch between the way LPC analysis works and how our ears perceive speech.

For example I found that samples like hts1a and ve9qrp code quite well, but cq_ref and kristoff struggle. The former have just 12dB between the LF and HF parts of the speech spectrum, the latter 40dB. This may be due to microphones, input filtering, or analog shaping.
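
A crude way to put a number on this tilt is to compare the energy below and above a split frequency. This is my own back of the envelope metric (the 1000Hz split point is an assumption), so don't expect it to reproduce the exact 12dB and 40dB figures above, which may have been measured differently:

import numpy as np

def lf_hf_db(x, fs=8000, split=1000):
    """Low to high frequency energy ratio in dB, split at `split` Hz."""
    X = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return 10 * np.log10(np.sum(X[f < split]) / np.sum(X[f >= split]))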

Another problem with using an unconventionally low LPC order like p=6 is that the model “runs out of poles”. Some speech signals may have 4 or 5 peaks, so the poor LPC model gets all confused and tries to reach a compromise that just sounds bad.

My Experiments

I messed around with a bunch of band pass filters that I applied to the speech samples before LPC modeling. These filters whip the speech signal into a shape that the LPC model can work with. I ran various samples (hts1a, hts2a, cq_ref, ve9qrp_10s, kristoff, mmt1, morig, forig, x200_ext, vk5qi) through them to come up with the best compromise for the 700 bit/s mode.

Here is what p=6 LPC modeling sounds like with no band pass filter. Here is a sample of p=6 LPC modeling with a 300 to 2600Hz input band pass filter with very sharp edges.

Even though the latter sample is band limited, it is easier to understand as the LPC model is doing a better job of clearly representing those peaks.

Filter Implementation

After some experimentation with sox I settled on two different filter types: a sox “bandpass 1000 2000” worked on some, whereas on others with more low frequency content “bandpass 1500 2000” sounded better. Some helpful discussions with Glen VK1XX had suggested that a two band AGC was common in broadcast audio pre-processing, and might be useful here.

However through a process of frustrated experimentation (I was stuck on cq_ref for a day) I found that a very sharp-skirted filter between 300 and 2600Hz did a pretty good job. Like p=6 LPC, a 2600Hz cutoff is quite uncommon for speech coding, but SSB users will find it strangely familiar…
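
Here is roughly what that filter looks like designed in Python rather than sox. The tap count and window are my guesses at "very sharp skirted"; the actual filtering was done with sox's sinc effect, as per the command lines at the end of this post:

import numpy as np
from scipy.signal import firwin, lfilter

fs = 8000
# Long windowed-sinc FIR band pass: 511 taps at 8kHz gives skirts
# roughly 50Hz wide, much sharper than a biquad's.
h = firwin(511, [300, 2600], pass_zero=False, fs=fs)

def sharp_bpf(x):
    """300-2600Hz band pass filter, applied before LPC analysis."""
    return lfilter(h, 1.0, x)

The p=6 model of the filtered speech is then just lpc(sharp_bpf(x)[2000:2320], p=6) in terms of the earlier sketch.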

Note that for the initial version of the 700 bit/s mode (currently in use in FreeDV 700) I have a different band pass filter design, chosen more or less at random on the day, that sounds like this with p=6 LPC. This filter now appears to be a bit too severe.

Plots

Here is a little chunk of speech from hts1a:

Below are the original (red) and p=6 LPC model (green line) spectra without and with a sox “bandpass 1000 2000” filter applied. If the LPC model were perfect, the green and red lines would be superimposed. Open each image in a new browser tab then jump back and forth. See how the two peaks around 550 and 1100Hz are better defined with the bandpass filter? The error (purple) in the 500-1000Hz region is much reduced, better defining the “twin peaks” for our long suffering ears.
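
For reference, the green curves are the magnitude of 1/A(z) evaluated around the unit circle. Here is a sketch of how this sort of plot data can be generated (the gain alignment is a rough max match of my own, not how c2sim computes the LPC gain):

import numpy as np

def lpc_spectrum(x, a, nfft=512, fs=8000):
    """Original and LPC model magnitude spectra (dB) on the same frequency grid."""
    f = np.fft.rfftfreq(nfft, 1.0 / fs)
    orig = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    model = 1.0 / np.abs(np.fft.rfft(a, nfft))   # |1/A|, with a zero padded to nfft
    model *= orig.max() / model.max()            # rough gain alignment
    return f, 20 * np.log10(orig + 1e-10), 20 * np.log10(model + 1e-10)

The purple error curve is then just the difference between the two dB curves.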

Here are three spectrograms of me saying “D G R”. The dark lines represent the spectral peaks we use to perceive the speech. In the “no BPF” case you can see the spectral peaks between 2.2 and 2.3 seconds are all blurred together. That’s pretty much what it sounds like too – muddy and indistinct.

Note that compared to the original, the p=6 BPF spectrogram is missing the pitch fundamental (dark line near 0 Hz), and a high frequency peak at around 2.5kHz is indistinct. Turns out neither of these matter much for intelligibility – they just make the speech sound band limited.
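
The spectrograms are straightforward to reproduce if you want to inspect your own samples; the window length and overlap below are my choices, not the settings used for the images above:

import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram(x, fs=8000):
    """Power spectrogram in dB; the dark lines are the spectral peaks."""
    f, t, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
    return f, t, 10 * np.log10(S + 1e-10)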

Next Steps

OK, so over the last few weeks I’ve spent some time looking at the effects of microphone placement and input filtering on p=6 LPC models. Now it’s time to look at quantisation of the 700 mode parameters, then try it again over the air and see if the speech quality is improved. To improve performance in the presence of bit errors I’d also like to get the trellis based decoding into a real world usable form. When the entire FreeDV 700 mode (codec, modem, error handling) is working OK compared to SSB, it will be time to look at porting to the SM1000.

Command Line Magic

I’m working with the c2sim program, which lets me explore Codec 2 in a partially quantised or incomplete state. I pipe audio in and out between various sox stages.

Note these simulations sound a lot better than the final Codec 2 at 700 bit/s as nothing else is quantised/decimated, e.g. it’s all at a 10ms frame rate with original phases. It’s a convenient way to isolate the LPC modeling step with as much fidelity as we can.

If you want to sing along here are a couple of sample command lines. Feel free to ask me any questions:
# p=6 LPC with the sox bandpass biquad (centre frequency 1000Hz, width 2000Hz) applied first:
sox -r 8000 -s -2 ../../raw/hts1a.raw -r 8000 -s -2 -t raw - bandpass 1000 2000 | ./c2sim - --lpc 6 --lpcpf -o - | play -t raw -r 8000 -s -2 -

# p=6 LPC with the sharp-skirted filter (sox sinc high pass at 300Hz, then sinc low pass at 2600Hz):
sox -r 8000 -s -2 ../../raw/cq_ref.raw -r 8000 -s -2 -t raw - sinc 300 sinc -2600 | ./c2sim - --lpc 6 --lpcpf -o - | play -t raw -r 8000 -s -2 -

Reading Further

Open Source Low Rate Speech Codec Part 2
LPC Post Filter for Codec 2

16 thoughts on “Low Order LPC and Bandpass Filtering”

  1. Multi-band compressors, like those used to make AM broadcast stations sound loud, punchy, and “clear” (such as those from Orban, like their Optimod line: http://www.orban.com/products/radio/fm/6300/), might inspire some ideas about normalising voice audio for maximum intelligibility.

    I think they use something called “smart clipping” but the effect is extra psychological loudness. I have no idea if the idea of multi-band limiters has any patent problems but it’s pretty widespread in broadcasting.

    Keep up the great work!

    Peter
    vk2tpm

    1. Thanks Peter. Yes Glen VK1XX and Andrew VK5XFG also pointed me at some other AM preprocessing tricks such as all-pass filters to change the peak location – to avoid clipping, it suits some PAs to have a bigger peak in one direction than the other, or an equal peak in both directions.

  2. A nice demo David — the perceptual improvements from the 6th order LPC model + BPF are quite impressive! The ability of the LP model to represent the signal spectrum was noted in the classic 1975 paper by John Makhoul. He gave a frequency domain interpretation of the LP model as providing the best match between the signal spectrum and the LP model spectrum.

    BTW if you need a couple of bandpass filter responses to suit a wide range of speakers, that is probably not a big deal. If required, one or two bits could be allocated periodically to carry this ‘side information’.

    Best wishes with the next step of quantisation.
    73, Bill

    1. Hi Bill,

      Yes, getting it down to just a couple of filters would have been acceptable, but it would have meant developing a way of estimating which filter was ideal. These estimators always fall over under some conditions so it’s nice to avoid them where possible.

      I’d been trying to work out why some samples (cq_ref and kristoff) sounded poor when coded for quite some time (even at the higher rates) so developing this framework has been very useful.

      Thanks,

      David

  3. Hmm, wouldn’t “bandpass 1000 2000” be a center frequency of 1kHz and a width of 2kHz?

    Maybe “bandpass 1000 1000” for a 1kHz bandwidth centered on 1kHz.

    I tried the lpc 6 with the 1200 rate and it sounded like a broken speaker :-)

    1. Not sure if the c2sim “--rate 1200” and “--lpc 6” options will play well together, for example the LSP VQ is designed for 10th order LPC so you might have buffer overflow issues (rather than codec design issues).

  4. The intelligibility of the 300-2600 band pass filtered sample is dramatically better than the muddier, non band pass filtered example.

    People with sensorineural (i.e. noise induced) hearing loss, where the loss vs frequency curve slopes downwards and rightwards, i.e. increasing loss beyond ~2kHz (for example see http://www.cybersight.org/data/1/rec_imgs/6202_fig.%207.12.jpg), struggle to cope with understanding speech in noisy environments or groups of people. Their cochlea is low pass filtering the information their brain gets, and it is this higher frequency information that the brain seems to rely on to tease speech out of the noise.

    By band pass filtering the audio to 300-2600Hz, not running out of poles too soon on low frequency content, and thereby not swamping the ear with low frequency energy, you may well be preserving the ability of the codec to cater to the needs of the brain, which needs relatively more of the higher frequency information (rather than the lower frequencies, where a lot of the energy was being found and encoded by the algorithm in the non band pass filtered audio) to extract the speech effectively.

    Perhaps it is the improvement in the ratio of higher frequency energy to lower frequency energy that is responsible for the improvement in intelligibility. Maybe, beyond a certain point, more lower frequency energy relative to the higher frequency components supplied by the codec is hindering the ear and brain’s ability to extract speech, much like someone with hearing loss in a noisy room.

    It’s about time I put up a dedicated dipole cut to 14.236MHz!

    Regards,

    Erich.

    1. Hi Erich,

      Interesting, I hadn’t thought about the effects of hearing loss on speech perception. Given we have an open source codec, I wonder if we can come up with algorithms to make it work better for the hearing impaired.

      One idea is to frequency shift or warp the frequency axis such that all the perceptually important spectral information is shifted into the range of perception left to a given listener.

      Cheers,

      David

      1. When it comes to hearing aid selection, the goal appears to be achieving something resembling perceived loudness equalisation across the speech spectrum, the rationale being:

        “Intelligibility is assumed to be maximized when all bands of speech are perceived by the listener to have the same loudness; that is, when the goal of loudness equalization for speech bands has been achieved.”

        http://www.nal.gov.au/hearing-rehabilitation_tab_prescriptive-procedures-readmore.shtml

        It would be interesting to see to what extent codec intelligibility is a function of the low bit rate codec’s ability to achieve a degree of “equalisation of perceived loudness” across the speech band frequencies. John, below, has probably put the right label on the technique: pre-emphasis.

        Erich

      2. Some SSB users, especially the hearing impaired, will do just that – ‘off tune’ their SSB.

        Also, my grandmother used to deliberately tune her AM radio vestigial sideband so as to generate a little more HF and HF distortion (which added HF response).

  5. Hi David,

    I gotta wonder about 3dB/octave preemphasis prior to analysis, with the reverse on receive?

    John

    1. Hi John,

      You can easily try that out with sox preemphasis added to the command lines above. Run the samples I listed above and see how it goes. I think c2sim also has --pre --de options, as I experimented with preemphasis in the past while trying to work out why some samples (cq_ref and kristoff in particular) didn’t code well.

      – David

  6. Good stuff.
    David, if you need any really oddball fancy digital filters synthesised, I have some sophisticated packages, providing for arbitrary amplitude and delay response for FIR and IIR filters.

    I agree with Bill’s point about it being acceptable to have a few different prefilter shapes to suit different speakers, and/or their different equipment.

  7. An interesting anecdote from the declassified NSA archives:

    https://www.nsa.gov/public_info/_files/cryptologic_histories/cold_war_ii.pdf

    “For strategic systems, NSA developed two devices in the 1960s. The KY-9 was a narrow-band digital system using a vocoder, and it was the first speech system to use transistors. The advantage of the KY-9 was that it could be used on a standard Bell System 3 kHz-per-channel telephone system without modification. The disadvantages were many, however. It was big and heavy, encased in a safe that had to be unlocked every morning before the system could be activated. It was also expensive (over $40,000 per copy) and was a true “Donald Duck” system which required the users to speak slowly to be understood. Only about 260 sets were deployed, all to high-level users, mostly Air Force.”
