Open Source Low Rate Speech Codec Part 2

Since the last post I have ported a bunch of my old software from DOS across to Linux. Most of it ran with minor changes, although it could use some refactoring from a software engineering point of view.

The simulations and vector quantiser training sure run faster than they did back in 1995 – we must have 10 times the CPU power now. Back in “the day” speech coding was very limited by CPU power; I remember it was only in the mid-to-late 1980s that it became possible to code speech signals using CELP in real time. Until then the appropriate DSP chips didn’t exist. In 1990 my co-researchers gasped in amazement at my mega-beasty 33 MHz 486 with 8M of RAM – bought especially for the purpose of speech coding research.

Open Source in Academia

I have spent the last few years trying out the idea of open hardware for VOIP, which is a twist on the more commonly accepted idea of open software. Now that I am revisiting academic style research for this project I am wondering about the use of Open Source in academia. The common way to spread knowledge is via papers, for example conference or journal papers. I remember reading hundreds of papers and often struggling to understand the math that was used to communicate the details. It was also really hard to reproduce any of the claimed results, as the really important fine details of the algorithms were usually left out.

Even large steps in the math were often left out (like how equation C was derived from equation B) for the sake of brevity in a 4 page paper. When you struggle with math like I do, that makes many of these papers really hard to follow. I suspect I am not alone!

It’s a bit like someone in the Linux community saying “we built a multitasking operating system with innovation X” but leaving out the source. Math has its place, but I figure in many cases source code is an excellent alternative, and it can’t be beaten for capturing the finer details.

It would have saved me incredible amounts of work if the speech coding dudes I was following had published source. Surely the field would have advanced much faster as well. These days it’s common to distribute standardised algorithms like G.729 in source code form. However, what I am advocating is publishing the source for DSP algorithms while they are still in the research phase. Full marks to people like the CELT project who are doing this already.

Open Source Sinusoidal Codec

To the best of my knowledge the source code for this project is the only sinusoidal speech codec source code available on the Internet. Perhaps it will be useful to people performing research in the area of sinusoidal and MBE codecs (the Multi-Band Excitation or MBE codec is a close relative). Makes me wonder why I wasn’t sharing all this source code back in the 90s. I sure would have developed a better codec if I had.

There is also a k-means vector quantiser training program (vqtrain.c) that I dug out of the PhD time capsule, which might be useful for people designing scalar and vector quantisers.
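For readers who haven’t met it, the k-means (or Lloyd) iteration at the heart of a trainer like vqtrain.c is straightforward: assign each training vector to its nearest codebook entry, then move each entry to the centroid of the vectors assigned to it. Here is a minimal C sketch – function names and details are illustrative, not lifted from vqtrain.c:

    /* Minimal k-means (Lloyd) vector quantiser training sketch.
       Illustrative only -- vqtrain.c differs in its details. */
    #include <stdlib.h>
    #include <string.h>
    #include <float.h>

    static float sq_err(const float *a, const float *b, int k) {
        float e = 0.0f;
        for (int i = 0; i < k; i++) {
            float d = a[i] - b[i];
            e += d * d;
        }
        return e;
    }

    /* train[n*k]: n training vectors of dimension k; cb[m*k]: m codebook
       entries, pre-seeded (e.g. with randomly chosen training vectors). */
    void kmeans(const float *train, int n, int k, float *cb, int m, int iters) {
        float *acc = malloc(m * k * sizeof(float));
        int   *cnt = malloc(m * sizeof(int));

        for (int it = 0; it < iters; it++) {
            memset(acc, 0, m * k * sizeof(float));
            memset(cnt, 0, m * sizeof(int));

            /* assign each training vector to its nearest codebook entry */
            for (int v = 0; v < n; v++) {
                int best = 0; float beste = FLT_MAX;
                for (int c = 0; c < m; c++) {
                    float e = sq_err(&train[v*k], &cb[c*k], k);
                    if (e < beste) { beste = e; best = c; }
                }
                for (int i = 0; i < k; i++) acc[best*k + i] += train[v*k + i];
                cnt[best]++;
            }

            /* move each entry to the centroid of its cluster */
            for (int c = 0; c < m; c++)
                if (cnt[c])
                    for (int i = 0; i < k; i++)
                        cb[c*k + i] = acc[c*k + i] / cnt[c];
        }
        free(acc); free(cnt);
    }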

Software Oscilloscope

DSP algorithms are complex beasties. I like to develop tools for looking inside them, so I can see the gears turning and the signals flowing. This was really useful when developing Oslec: for problem signals I could dump the algorithm variables and analyse what was going on.

So I have written some scripts in GNU Octave to look inside the speech coding algorithm as it develops, for example plamp.m. A simple command line interface lets me single-step backwards and forwards through a file of speech that I am processing.

It’s a bit like an oscilloscope for software signal processing. As the sinusoidal codec (written in C) runs, it dumps many of the internal signals to text files, which are then loaded into Octave.
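The dump mechanism itself is trivial, which is part of its charm. Something along these lines (the function and file names are illustrative, not the actual codec source) writes one signal per file, one sample per line, in a format Octave’s load function reads directly:

    /* Sketch of dumping an internal signal to a text file that Octave
       can read with its load function. Names are illustrative. */
    #include <stdio.h>

    void dump_signal(const char *fname, const float *x, int n) {
        FILE *f = fopen(fname, "w");
        if (f == NULL) return;
        for (int i = 0; i < n; i++)
            fprintf(f, "%f\n", x[i]);   /* one sample per line */
        fclose(f);
    }

On the Octave side a single load and plot is then enough to inspect the signal.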

The plot below is a short chunk (about 40ms) of female speech:

The next plot shows the same speech in the frequency domain (green line) with a bunch of model parameters superimposed for comparison.

I also wrote a little shell script utility, menu.sh, for playing multiple files (press 1 for file1, 2 for file2, etc). This is useful for comparing speech samples coded by different algorithms.

Spectral Magnitude Quantisation

This is where I make your eyes glaze over. The technical bit. I do have some ideas on how to explain this in a less technical way (a good topic for a later post), but for now I am going to assume readers already understand speech coding.

The plot above shows the magnitude Am of each sinusoid as the continuous red line. It is this red line that we are trying to quantise and send over the channel. First we fit a p=10 LPC model using time domain analysis. This gives us a constant number of parameters (10 LPC coefficients) which we can transform to LSPs and quantise. The LSP frequencies are the little red lines along the top of the plot – note how they bracket the spectral peaks.
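For the curious, time domain LPC analysis along these lines is typically the autocorrelation method followed by the Levinson-Durbin recursion. Here is a C sketch, assuming the speech frame has already been windowed (e.g. with a Hamming window); the LPC-to-LSP conversion is a separate step not shown:

    /* Sketch of p=10 time domain LPC analysis: autocorrelation method
       plus Levinson-Durbin recursion. Illustrative, not the codec source. */
    #define P 10

    void autocorrelate(const float *s, int n, float *R) {
        for (int j = 0; j <= P; j++) {
            R[j] = 0.0f;
            for (int i = 0; i < n - j; i++)
                R[j] += s[i] * s[i + j];
        }
    }

    /* Solves for LPC coefficients a[1..P] given autocorrelations R[0..P].
       Returns the residual (prediction error) energy. */
    float levinson_durbin(const float *R, float *a) {
        float E = R[0];
        float a_prev[P + 1] = {0};

        for (int i = 1; i <= P; i++) {
            float k = R[i];                 /* reflection coefficient */
            for (int j = 1; j < i; j++)
                k -= a_prev[j] * R[i - j];
            k /= E;

            a[i] = k;
            for (int j = 1; j < i; j++)
                a[j] = a_prev[j] - k * a_prev[i - j];
            E *= (1.0f - k * k);

            for (int j = 1; j <= i; j++)
                a_prev[j] = a[j];
        }
        return E;
    }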

The LPC power spectrum after LSP quantisation is the purple line Pw. The decoded Amq values are determined by the RMS average of Pw over the m-th harmonic. Using the RMS average worked much better than directly sampling Pw at the harmonic centre, for reasons I don’t quite understand. I think it’s something to do with the error minimising properties of LPC modelling. The energy of the m-th harmonic and the energy under Pw are approximately the same; for example, here is a close up of the m=2 harmonic. The area under Pw (purple) is about the same as the area under the original speech spectrum (green).
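To make the RMS sampling concrete, here is a sketch of one way to do it, assuming Pw[] holds the LPC power spectrum on an N/2 point grid from 0 to pi and Wo is the fundamental in radians/sample. The names and the exact band edges are illustrative, not the actual codec source:

    /* Sketch: recover the quantised magnitude Amq of harmonic m as the
       RMS average of the LPC power spectrum Pw[] over that harmonic's
       band, nominally (m - 0.5)Wo .. (m + 0.5)Wo. */
    #include <math.h>

    float sample_amq(const float *Pw, int N, float Wo, int m) {
        int lo = (int)((m - 0.5f) * Wo * N / (2.0f * M_PI) + 0.5f);
        int hi = (int)((m + 0.5f) * Wo * N / (2.0f * M_PI) + 0.5f);
        float sum = 0.0f;

        if (hi <= lo) hi = lo + 1;          /* guard against empty bands */
        for (int b = lo; b < hi; b++)
            sum += Pw[b];

        return sqrtf(sum / (hi - lo));      /* RMS average over the band */
    }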

The LPC spectrum is higher at the edges of the m-th band. A low order LPC model can’t follow the spectral shape exactly, so it comes up with this approximation, which just happens to suit us. There is some more about this in [1] (Section 5.3) if you are really interested.

The cyan line at the bottom plots the error in magnitude quantisation for each sample. Note that I am using a Signal to Noise Ratio (SNR) objective measure – this weights the high energy harmonics more. These tend to be the low frequency harmonics, which the ear is more sensitive to. An alternative is Spectral Distortion (SD).
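The SNR measure itself is just a ratio of harmonic magnitude energy to quantisation error energy, something like:

    /* Sketch of an SNR objective measure over the L harmonic magnitudes
       Am[1..L] and their quantised versions Amq[1..L] (arrays sized L+1
       with index 0 unused). Because each harmonic contributes its own
       energy to the sums, the high energy (usually low frequency)
       harmonics dominate the measure. */
    #include <math.h>

    float snr_db(const float *Am, const float *Amq, int L) {
        float sig = 0.0f, noise = 0.0f;
        for (int m = 1; m <= L; m++) {
            float e = Am[m] - Amq[m];
            sig   += Am[m] * Am[m];
            noise += e * e;
        }
        return 10.0f * log10f(sig / noise);
    }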

As a first pass (a few hours of messing about) I have designed (using the k-means vector quantiser design algorithm) a simple split VQ for the LSPs. For LSPs 1-2, 2-3, 3-7, 8-10 I use 10/9/9/9 bit VQs for a total of 37 bits/frame. There is a little perceptual distortion using this VQ on the couple of test samples I tried, compared to LPC modelling alone. For CELP codecs 37 bits/frame is rather high, but in a sinusoidal coder the LSPs represent all of the spectral information – unlike CELP, which can also “correct” any errors in the spectrum using the excitation. So it’s critical to have good spectral quantisation in sinusoidal codecs.
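A split VQ search is just an independent full search of each sub-vector against its own codebook. The sketch below uses an unweighted squared error distance and one plausible disjoint grouping of the 10 LSPs into sub-vectors of 2, 2, 3, and 3, matching the 10/9/9/9 bit allocation above – the exact grouping and all names here are illustrative:

    /* Sketch of a split VQ search over 10 LSPs, quantised as four
       independent sub-vectors with 10/9/9/9 bits. Illustrative only. */
    #include <float.h>

    /* find the nearest codebook entry to lsp[0..k-1]; returns its index */
    int vq_search(const float *lsp, const float *cb, int m, int k) {
        int best = 0; float beste = FLT_MAX;
        for (int c = 0; c < m; c++) {
            float e = 0.0f;
            for (int i = 0; i < k; i++) {
                float d = lsp[i] - cb[c*k + i];
                e += d * d;
            }
            if (e < beste) { beste = e; best = c; }
        }
        return best;
    }

    void split_vq(const float *lsp, const float *cb1, const float *cb2,
                  const float *cb3, const float *cb4, int idx[4]) {
        idx[0] = vq_search(&lsp[0], cb1, 1 << 10, 2);  /* LSPs 1-2,  10 bits */
        idx[1] = vq_search(&lsp[2], cb2, 1 << 9,  2);  /* LSPs 3-4,   9 bits */
        idx[2] = vq_search(&lsp[4], cb3, 1 << 9,  3);  /* LSPs 5-7,   9 bits */
        idx[3] = vq_search(&lsp[7], cb4, 1 << 9,  3);  /* LSPs 8-10,  9 bits */
    }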

The upside with sinusoidal codecs is that the excitation is very compactly encoded compared to CELP, which uses about 80% of its bit rate for encoding the excitation.

Here are some speech samples:

Original: Male, Female
Sinusoidal: Male, Female
p=10 LPC: Male, Female
37 bit LSP: Male, Female

I allocated more bits to the low order LSPs, as I could hear that the decoded speech quality was more sensitive to errors there. Due to the higher energy at low frequencies the LSPs tend to be closer together, which makes them more sensitive to quantisation. I found that I could get away with coarser quantisation for the higher LSPs. The VQ search is not weighted in any way (for example, biased towards preserving closely spaced LSPs).

This is just a first step, and it doesn’t even take into account frame-to-frame correlations. The {Am} change only slowly from one frame to the next, so there is considerable coding gain available in frame-to-frame prediction of the LSPs. I’m pretty sure we can get much lower than 37 bits/frame (1850 bit/s at a 20ms frame rate). But it’s a good start.
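As a hint of where that coding gain comes from, here is a sketch of first order frame-to-frame prediction: quantise the residual after predicting each LSP from the previous frame’s quantised value. The predictor coefficient is an assumed illustrative value, not a trained one:

    /* Sketch of first order frame-to-frame LSP prediction. The residual
       has less variance than the raw LSPs when frames change slowly, so
       it can be vector quantised with fewer bits. */
    #define NLSP 10

    void lsp_predict(const float *lsp, const float *lsp_prev_q,
                     float *resid) {
        const float b = 0.75f;     /* assumed predictor coefficient */
        for (int i = 0; i < NLSP; i++)
            resid[i] = lsp[i] - b * lsp_prev_q[i];
    }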

Mailing List

The experience of developing Oslec taught me the importance of bouncing ideas off people and the power of open DSP development compared to trying to figure it all out by myself. Over this week I have had some great discussions with Jean-Marc that have helped me get back into speech coding mode. Blogging really helps too; it forces me to express my ideas clearly in written form.

To allow others to join in, and to preserve our discussions on the Internet, I have started a codec2 mailing list. I would appreciate anyone interested in the codec (e.g. hams interested in digital comms, DSP developers, people from proprietary codec companies :-) ) joining up and encouraging the development effort.

Links

[1] Open Source Low Rate Codec Part 1
[2] Techniques for Harmonic Sinusoidal Coding
[3] Open Source Low Rate Codec Part 3 – Phase and Male Speech
[4] Codec2 Web Page

Comments

  • We also started some experiments in sinusoidal coding at http://svn.xiph.org/trunk/ghost/

    It’s still very early in the experimental stages (and has mostly been on hold while we focus on CELT and Theora). It isn’t targeted specifically at speech, and it isn’t as far along as your codec (there’s currently no bitstream at all, just a core sinusoidal modeler), but you might find it interesting.

  • david

    That’s excellent Timothy – thanks for pointing out the link to Ghost.

  • The paper on the main sinusoidal estimation algorithm, “Low-Complexity Iterative Sinusoidal Parameter Estimation”, is here: http://people.xiph.org/~jm/papers/valin_sinusoids.pdf

    This may be a little easier to dive into than the code, though it is filled with math. But at least this way you have both.

  • Hi David,

I’m not clear what your audio samples represent. I think I understand original :-) , but I’m not clear about the total content of the bit stream for the other three sets?

    Steve

  • david

Good point. Here is some more explanation:

“Sinusoidal” is the original speech coded with an unquantised sinusoidal model, so all the model parameters are in floating point form – like LPC-10 before anything has been quantised.

    “p=10 LPC” means the spectral amplitude samples have been fitted to a 10th order LPC model, but this model is unquantised (floating point LPC coeffs).

    “37 bit LSP” is like p=10 LPC but now we have quantised the spectral magnitude information to 37 bits/frame. The other model parameters like pitch, LPC energy, and harmonic phases remain unquantised but will likely require far fewer bits than the spectral magnitudes (aka LSPs).

  • Frank

37 bits for the LSPs is too high for a 2.4 kbit/s speech coder. Typically 20-24 bits are enough to get about 1 dB of spectral distortion if multi-stage vector quantization (MSVQ) and predictive vector quantization are used. More than 10 bits are saved and could be used to quantize the spectrum or phase of the speech signal or excitation.

  • ZL4JY

Good work David, but I have to say that I hate the sound of these low rate codecs – even GSM gives me a headache.
In noise limited environments such as satellite, EME, and DX the use of a highly efficient CODEC can be justified, but most communications occur in environments where there is plenty of signal (although admittedly often plagued by propagation impairments).
My plea is for a quality CODEC. I am a fan of the Skype audio with its rich 8 kHz bandwidth. While Skype often suffers from its own ‘propagation’ faults (delays and packet loss) and some poor sound card implementations, on a good circuit it is a pleasure to use, allowing better appreciation of the subtle nuances of good conversation.
So while you toil on a good communications codec, perhaps there might be room for a conversation mode?
