Open Source Low Rate Speech Codec Part 1

August 21st, 2009

I have decided to start working on a free (as in speech) low bit rate speech codec. The initial target is 2400 bit/s communications quality speech. Communications quality means something between synthetic, robotic sounding speech and a mobile phone. The application is voice over low bandwidth digital radio, like VHF/HF radio channels and Ham radio, for example an open version of D-star.

Common proprietary closed codecs in this space are MELP and AMBE. Due to patents and the amount of confidential information surrounding these codecs I don’t think it is possible to make an open codec compatible with these closed codecs. It is however possible to develop an open source, free-as-in-speech codec with similar performance at similar bit rates.

A lot of development has gone into these codecs, so I won’t claim that we can make an open codec with the same speech quality immediately. Therefore as a first milestone I will set the modest aim of speech quality between LPC-10 and MELP/AMBE at 2400 bits/s.

This project has been simmering on the back burner for a while now, and a couple of factors have come together to prompt me into action:

  1. Last year Bruce Perens contacted Jean-Marc Valin (of Speex fame) and myself regarding the problem of closed, patented, proprietary voice codecs in the sub-5 kbit/s range. Bruce has summarised the problem of low bit rate codecs and a possible development approach on the codec2 site.
  2. I have been following a proposed IETF standardisation of a free-as-in-speech high quality (e.g. 64 kbit/s), low latency codec with great interest. To help the effort along a little I blogged on Royalty Free Codecs which got the mental gears back into codec mode - especially the benefits of royalty free codecs. Curiously, most of the comments on this post were from Hams talking about the problems of closed codecs in the low bandwidth digital radio space.
  3. I have recently been messing with Ham radio again after a break of 25 years, so have become interested in Ham Radio issues such as the use of proprietary codecs.
  4. In the 1990s I worked on low bit rate codecs so I have some of the necessary know-how. Where I am a little weak (e.g. vector quantisation) I have access to a bunch of very clever people in the open source community who are motivated to work on free software. So unlike my work of 10 years ago, I am not alone when I hit any tough bugs.
  5. Speex and the other open video and audio codecs have proven it’s possible to create a patent free, high quality codec. There is an important social theme behind these technical projects, which I discussed in the Royalty Free Codec post. A free codec helps a large number of people and promotes development and innovation. A closed codec helps a small number of people make money at the expense of stifled business and technical development for the majority.
  6. A low rate codec has applications in developing world communications which suffer from low bandwidth. For example 4 voice channels over a 14,400 bit/s dial-up modem, or digital voice over non line of sight radio links (n.b. for VOIP the overhead of the IP protocol would be prohibitive for such a low rate codec so alternative protocols may be required).
  7. My experience with developing Oslec has been very positive. Oslec is a free line echo canceller that was developed to solve a similar problem in the VOIP space - a lack of a high quality, patent free, open source DSP algorithm. Until Oslec arrived you had to pay for “hardware” echo cancellation (DSP chips with proprietary code) or pay software license fees on otherwise free and open source systems. Oslec is now included in many Asterisk and Linux distributions and even the Linux Kernel. Along the way the Oslec project has helped demolish a bunch of echo canceller FUD, similar to what I see surrounding codecs. One important part of the Oslec experience was the use of open source and community development techniques. The net result was access to a world wide “brains trust” and network of beta testers that resulted in swift development of effective DSP algorithms.
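
As a rough check on the dial-up and VOIP remarks in point 6 above, here is a small back-of-envelope sketch. The header sizes assume plain RTP over UDP over IPv4 with no header compression, and one 20ms codec frame per packet; both are assumptions made for illustration, not part of any codec design.

```python
# Back-of-envelope link budget for a 2400 bit/s codec.
# Assumptions (not from the codec itself): one 20 ms frame per packet,
# RTP over UDP over IPv4, no header compression.
CODEC_BIT_RATE = 2400          # bit/s, target rate of the codec
MODEM_BIT_RATE = 14_400        # bit/s, dial-up modem example
FRAME_MS = 20                  # assumed frame length in milliseconds
HEADER_BYTES = 20 + 8 + 12     # IPv4 + UDP + RTP fixed headers

# Four raw voice channels and what is left over for framing and signalling
four_channels = 4 * CODEC_BIT_RATE
spare = MODEM_BIT_RATE - four_channels

# Cost of wrapping each frame in IP/UDP/RTP headers
packets_per_second = 1000 // FRAME_MS
header_bit_rate = HEADER_BYTES * 8 * packets_per_second
voip_stream = CODEC_BIT_RATE + header_bit_rate

print(f"4 channels: {four_channels} bit/s, spare: {spare} bit/s")
print(f"header overhead per stream: {header_bit_rate} bit/s")
print(f"one VOIP stream: {voip_stream} bit/s")
```

Under these assumptions the headers alone cost several times the codec payload, which is why a single conventional VOIP stream would not fit on the same modem that can carry four raw channels.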

The Algorithm

The codec algorithm will be based around a generic sinusoidal coder I developed in the 1990s. To get started I am re-reading my Thesis, which was published about 10 years ago. A lot of the techniques I used pre-date that (1970s and 1980s technology) and much of my thesis work was original so it’s a good patent free starting point. The earliest paper I can find introducing sinusoidal coding is from 1984.

Here is the 1 minute explanation of sinusoidal coding. Below is a plot of the spectrum of a short segment (about 20ms) of female speech:

See how the speech spectrum is made up of peaks spaced by about 240Hz? Well 240Hz happens to be the pitch of the speech at this instant in time. Each peak can be thought of as a sine wave. A sinusoidal codec models the speech as a set of sine waves, each with its own frequency, phase, and amplitude. So instead of sending the speech waveform like a regular telephone, a sinusoidal encoder sends the sinusoid parameters over the channel to the decoder which then reconstructs the speech. The parameters change over time so we update them at regular intervals, like every 20ms. It turns out that if you do all of this right the speech at the decoder sounds pretty close to the original.
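
To make the decoder side of this concrete, here is a minimal sketch of rebuilding one frame from sinusoid parameters. It is not the codec's actual code. It assumes an 8 kHz sample rate, places every sine wave at an exact multiple of the pitch (a simplification of "each with its own frequency"), and uses made-up amplitudes and phases; the function name `synthesise_frame` is hypothetical.

```python
import math

def synthesise_frame(f0_hz, amplitudes, phases, sample_rate=8000, frame_ms=20):
    """Rebuild one frame as a sum of sine waves at multiples of the pitch f0_hz.

    amplitudes[m-1] and phases[m-1] belong to the m-th harmonic.
    The sample rate and frame length are assumed values.
    """
    n_samples = sample_rate * frame_ms // 1000
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = 0.0
        for m, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
            sample += amp * math.cos(2 * math.pi * m * f0_hz * t + phase)
        frame.append(sample)
    return frame

# Made-up parameters: 240 Hz pitch, harmonics up to half the sample rate,
# amplitudes falling off as 1/m, all phases zero.
f0 = 240.0
n_harmonics = int((8000 / 2) // f0)
amplitudes = [1.0 / m for m in range(1, n_harmonics + 1)]
phases = [0.0] * n_harmonics

frame = synthesise_frame(f0, amplitudes, phases)
print(n_harmonics, "harmonics,", len(frame), "samples in the frame")
```

A real encoder has the harder job of estimating the pitch, amplitudes and phases from the input speech; this sketch only shows what the decoder does with them once they arrive.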

There are a couple of tricks. The first is accurately estimating the parameters. For example the little dots near the centre of each harmonic in the above plot are our estimates of the amplitudes. Not all of them are 100% accurate. Another problem is estimating the frequency of the sinusoids. A rather big challenge is how to represent all of the model parameters (amplitudes, phases, frequencies) in a small number of bits. This is called quantisation.
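
To give a feel for how tight the bit budget is, the sketch below works out the bits available per 20ms frame at 2400 bit/s and quantises one parameter with a plain uniform scalar quantiser. This is the simplest possible scheme, chosen for illustration only, and is not the quantiser planned for the codec. The pitch range (50 to 400 Hz) and the 7 bit allocation are made-up values, and the function names are hypothetical.

```python
def bits_per_frame(bit_rate, frame_ms):
    """Number of whole bits available for each frame."""
    return bit_rate * frame_ms // 1000

def quantise(value, lo, hi, n_bits):
    """Uniform scalar quantiser: map a value in [lo, hi] to an integer index."""
    levels = (1 << n_bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)

def dequantise(index, lo, hi, n_bits):
    """Inverse mapping used at the decoder."""
    levels = (1 << n_bits) - 1
    return lo + index / levels * (hi - lo)

budget = bits_per_frame(2400, 20)

# Made-up example: send a 240 Hz pitch using 7 bits over a 50-400 Hz range
PITCH_LO, PITCH_HI, PITCH_BITS = 50.0, 400.0, 7
index = quantise(240.0, PITCH_LO, PITCH_HI, PITCH_BITS)
decoded = dequantise(index, PITCH_LO, PITCH_HI, PITCH_BITS)

print(f"bits per frame: {budget}")
print(f"pitch index: {index}, decoded pitch: {decoded:.2f} Hz")
print(f"bits left for amplitudes and phases: {budget - PITCH_BITS}")
```

With only a few dozen bits per frame, whatever is spent on the pitch comes straight out of what is left for all of the amplitudes and phases, which is why quantisation is the big challenge.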

The trick with real world DSP algorithms is in the detail. It’s never as simple as the basic mathematical model would suggest. The many real world factors are where the work lies.

For the brave there is a more detailed introduction to sinusoidal coding in Chapter 3 (Page 33) of my Thesis. It even has some equations and lots of rather intimidating Greek letters.

Unquantised Samples of Sinusoidal Coding

Here is a sample of original speech, and speech encoded and decoded using the sinusoidal model. You can hear that they sound fairly close to each other. However bear in mind that the 2nd sample is not quantised yet - for example all of the model parameters are floating point numbers. It’s going to sound a whole lot worse by the time we reduce it down to 2400 bits/s!

The Plan

First step is to sort through the code and convert it all from DOS Turbo C (man I liked that IDE!) to a modern gcc project. There is also some code to convert from Matlab to C. Gasp - when I look at some of the C functions I wrote they date back to 1990!

Next I need to work out the best way to quantise the various model parameters. As a first step I will try using the Speex vector LSP quantiser for the harmonic magnitudes, then figure out a first pass way of quantising the other model parameters.

Update - Source Code

SVN repository containing source code and instructions for running the unquantised codec on Linux/gcc (all one line):

$ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2 codec2

Links

Open Source Low Rate Codec Part 2 - Spectral Amplitudes
Open Source Low Rate Codec Part 3 - Phase and Male Speech
Codec2 Web Page

17 Responses to “Open Source Low Rate Speech Codec Part 1”

  1. Alexander Chemeris Says:

    Great thing to work on!
    As a VoIP developer I also follow all this open-source codecs effort and it seems now is a big point of change in the minds around the issue. Seems Speex (and Vorbis) made their job, cracking the ice. Have a nice hacking!

  2. Humberto Figuera Says:

    Nice one! :D You rock man!!!

  3. John Laur Says:

    Good luck and kudos to all working on this effort; this is a very important thing to have in the open source space. I spent a little time with Iridium phones recently (AMBE, I understand) and was quite surprised at the quality. Actually, as I had only ever heard LPC-10 it was really remarkable.

    I have a question though; if this quality is possible at 2400bps, has there been any open work done on 1200, 900, 300bps? How low can you actually go and have recognizable speech? Some googling revealed some samples as low as 600bps from DSP Innovations, but nothing lower that you could listen to –

  4. david Says:

    John - I am not aware of any open source codecs beneath Speex at 4 kbit/s. I am not sure what the lower limits are, but I did read somewhere that the actual information content of speech (source entropy) is around 50 bit/s, i.e. if all you were sending were bits representing the words.

    In general as the bit rate drops the intelligibility remains but the quality sounds more synthetic, this is OK for some applications.

    The nice thing about having an open source codec in this space is that people can experiment with different codecs for different bit rates. For example if your channel can tolerate a lot of delay and has a low bit error rate you can push your bit rate right down by exploiting correlation in adjacent frames and not using FEC.

    There are also some very interesting possibilities when the speech codec, FEC, and modulation scheme are combined. For example more important information in the coded speech (say the pitch) could be transmitted at higher power levels and less power allocated to less important information. This would help the speech quality degrade gradually as the channel conditions degrade. These sorts of combinations are difficult to achieve when the codec algorithm is locked up.

  5. Steve Underwood Says:

    The last time I looked (which was quite early on) Iridium used PSELP. I never found the details of what PSELP actually is, but I know it came from Motorola’s government systems group. I assume that means it’s derived from some military or security work. It’s something like 2.5kbps.

    LPC10 is much maligned, because most people have only ever heard a broken codec. There are numerous broken implementations of LPC10 around the web. Try the one in spandsp. While not exactly hi-fi, it isn’t nearly as bad as what most people have heard as LPC10 quality.

  6. david Says:

    Hi Steve,

    I just listened to the spandsp LPC-10 files, and I agree, in particular through a speaker they sound reasonable. More synthetic through headphones. Makes me wonder why Hams haven’t done more work with this vocoder, on a clear channel it would sound better than many SSB signals. Perhaps a full packaged solution like D-star is required, including FEC and a modem.

    The Wikipedia entry for Iridium says the service uses AMBE. IIRC when first proposed about 20 years ago Iridium was backed by Motorola which might explain the early use of PSELP.

    - David

  7. Steve Underwood Says:

    Gee, my memory is bad. When I tried Googling for PSELP, to see if there was any good material on it, I found an email from me on the speex mailing list saying PSELP was replaced in the Iridium system. :-\

    Although I don’t know the details of PSELP I know it’s not a million miles from MELP. It’s patented, and so not of any real interest here.

  8. Brian Says:

    This is excellent news. I’m not a fan of digital radio but I wish you well with your project.

  9. Tony Langdon Says:

    Good to see some work being done on open source low speed codecs. One question. Do you have any plans to go below 2400 bps, say down to 1200 bps (for example, for inclusion into programs such as FDMDV)? I feel 1200 bps is getting close to the sweet spot for HF DV systems. 2400 works for VHF, as D-STAR demonstrates.

    Anyway, good luck with the venture, looking forward to hearing the results. :)

  10. david Says:

    Hi Tony. Sure I think there is potential for a 1200 bit/s codec. One approach might be to exploit the high degree of correlation between adjacent frames of speech, for example buffer up 80ms of speech and just transmit the small frame-to-frame changes. This would introduce some delay but this is no big deal for simplex (push to talk) telephony.

    Another issue with bit rates is the overhead for Forward Error Correction (FEC). For example at 2400 bit/s you might want to add 800-1200 bit/s for FEC. It depends on how many bit errors you are likely to get over your given channel.

  11. Steve Underwood Says:

    Hi David,

    On what do you base the balance of 800-1200 FEC bits for 2400 bits of codec data? Most radio systems use more. GSM, for example, adds about 9k to 13k in full rate mode, or 5k to 6k in half rate mode.

  12. david Says:

    Just a gut feel - the low bit rate speech coding work I have been involved with in the past used roughly that amount of FEC. D-star uses 1200 bps FEC on top of 2400 bps voice. There are some HF Ham systems that use no FEC with 1200 bit/s voice.

    Appropriate application of FEC depends on the application, the channel, error distribution, modulation scheme, sensitivity of codec bits, correcting power of the FEC code plus a few other factors. There is also a lot that can be done with no FEC overhead for example packet repetition and tracking codec parameters like energy.

  13. Steve Underwood Says:

    My impression is that as time has passed the percentage of the available bandwidth assigned to FEC has increased, to emphasize voice consistency over maximum voice quality. I think part of the drive for that came from public perception of early GSM. People complained about the voice quality, and a lot of effort went into better codecs (e.g. the move from FR to EFR) as a knee jerk reaction. However, their complaints appear to have had little to do with the codec. People seem fairly happy with the original FR codec when used in an error free manner. It was the average quality, with real world BER that they didn’t like.

  14. John in ZL4 Says:

    WRT amateur radio, let’s not forget that most systems are still 25 kHz channelled with IF filters to handle 16F3. For this application rather than busting a boiler to do 2400 bps it might be better to work on improving the modem. I’m thinking GMSK but going from the G3RUH style to the AIS standard ITU-R M.1371-1, both of which seem to have proven to perform rather well in terrestrial and space applications. The original G3RUH design was a BT=0.5 design at 9,600 bps. At BT=0.3 16,000 bps is possible but needs a good randomizer i.e. K9NG or GRAPES or a bi-phase approach to avoid problems with receiver sync. But one needs to bear in mind adjacent channel performance (splatter). In summary, with data rates of at least 9,600 bps, and perhaps a little more, 4,800 bps for an amateur radio CODEC might be a better sounding option while leaving plenty for FEC – perhaps a concatenated code of a rate ¾ or 7/8 convolutional code followed by a suitable block code.

  15. Sven Vogel Says:

    This is a very cool and ambitious project - can’t wait for it :-)
    Think about Mesh Networking and an Open Source/Wifi community village phone provider…

  16. Ben Witvliet PA5BW 5R8DS Says:

    Hello David,

    Very good cause! I read in a TAPR bulletin about your effort, and cannot help putting some praising and encouraging words in here. Indeed commercial closed codecs have frustrated widespread use and innovation in Ham Radio and third world point-to-point radio telephony (a very similar environment).
    So be sure this work will make a difference once completed!

    Kindest regards,

    Ben

  17. david Says:

    Thanks Sven and Ben for your kind words of encouragement, it really helps motivate me on the project :-)
