Open Source Low Rate Speech Codec Part 1

I have decided to start working on a free (as in speech) low bit rate speech codec. The initial target is 2400 bit/s communications quality speech. Communications quality means something between synthetic, robotic-sounding speech and a mobile phone. The application is voice over low bandwidth digital radio, such as VHF/HF radio channels and Ham radio – for example, an open version of D-star.

Common proprietary, closed codecs in this space are MELP and AMBE. Due to patents and the amount of confidential information surrounding these codecs I don’t think it is possible to make an open codec compatible with them. It is, however, possible to develop an open source, free-as-in-speech codec with similar performance at similar bit rates.

A lot of development has gone into these codecs, so I won’t claim that we can make an open codec with the same speech quality immediately. Therefore, as a first milestone, I will set the modest aim of speech quality between LPC-10 and MELP/AMBE at 2400 bit/s.

This project has been simmering on the back burner for a while now, and a couple of factors have come together to prompt me into action:

  1. Last year Bruce Perens contacted Jean-Marc Valin (of Speex fame) and myself regarding the problem of closed, patented, proprietary voice codecs in the sub-5 kbit/s range. Bruce has summarised the problem of low bit rate codecs and a possible development approach on the codec2 site.
  2. I have been following a proposed IETF standardisation of a free-as-in-speech, high quality (e.g. 64 kbit/s), low latency codec with great interest. To help the effort along a little I blogged on Royalty Free Codecs, which got the mental gears back into codec mode – especially the benefits of royalty free codecs. Curiously, most of the comments on this post were from Hams talking about the problems of closed codecs in the low bandwidth digital radio space.
  3. I have recently been messing with Ham radio again after a break of 25 years, so have become interested in Ham Radio issues such as the use of proprietary codecs.
  4. In the 1990s I worked on low bit rate codecs, so I have some of the necessary know-how. Where I am a little weak (e.g. vector quantisation) I have access to a bunch of very clever people in the open source community who are motivated to work on free software. So unlike my work of 10 years ago, I am not alone when I hit any tough bugs.
  5. Speex and the other open video and audio codecs have proven it’s possible to create a patent free, high quality codec. There is an important social theme behind these technical projects, which I discussed in the Royalty Free Codec post. A free codec helps a large number of people and promotes development and innovation. A closed codec helps a small number of people make money at the expense of stifled business and technical development for the majority.
  6. A low rate codec has applications in developing world communications, which suffer from low bandwidth. For example, 4 voice channels over a 14,400 bit/s dial-up link, or digital voice over non-line-of-sight radio links (n.b. for VOIP the overhead of the IP protocol would be prohibitive for such a low rate codec, so alternative protocols may be required).
  7. My experience with developing Oslec has been very positive. Oslec is a free line echo canceller that was developed to solve a similar problem in the VOIP space – a lack of a high quality, patent free, open source DSP algorithm. Until Oslec arrived you had to pay for “hardware” echo cancellation (DSP chips with proprietary code) or pay software license fees on otherwise free and open source systems. Oslec is now included in many Asterisk and Linux distributions and even the Linux kernel. Along the way the Oslec project has helped demolish a bunch of echo canceller FUD, similar to what I see surrounding codecs. One important part of the Oslec experience was the use of open source and community development techniques. The net result was access to a worldwide “brains trust” and network of beta testers, which resulted in swift development of effective DSP algorithms.

The Algorithm

The codec algorithm will be based around a generic sinusoidal coder I developed in the 1990s. To get started I am re-reading my Thesis, which was published about 10 years ago. A lot of the techniques I used pre-date that (1970s and 1980s technology) and much of my thesis work was original, so it’s a good patent free starting point. The earliest paper I can find introducing sinusoidal coding is from 1984.

Here is the 1 minute explanation of sinusoidal coding. Below is a plot of the spectrum of a short segment (about 20 ms) of female speech:

See how the speech spectrum is made up of peaks spaced by about 240 Hz? Well, 240 Hz happens to be the pitch of the speech at this instant in time. Each peak can be thought of as a sine wave. A sinusoidal codec models the speech as a set of sine waves, each with its own frequency, phase, and amplitude. So instead of sending the speech waveform like a regular telephone, a sinusoidal encoder sends the sinusoid parameters over the channel to the decoder, which then reconstructs the speech. The parameters change over time so we update them at regular intervals, like every 20 ms. It turns out that if you do all of this right the speech at the decoder sounds pretty close to the original.
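To make that concrete, here is a minimal sketch in C of how a decoder could synthesise one frame from the sinusoid parameters. It assumes an 8 kHz sample rate and 20 ms frames; the structure and function names are just illustrative and are not taken from the actual codec source:

#include <math.h>

#define FS        8000          /* sample rate (Hz), typical for telephone speech */
#define FRAME_N   160           /* 20 ms frame at 8 kHz                           */
#define MAX_HARM  80            /* maximum number of harmonics                    */

/* Hypothetical per-frame sinusoidal model parameters */
struct sine_model {
    int   L;                    /* number of harmonics in this frame              */
    float Wo;                   /* fundamental (pitch) frequency, radians/sample  */
    float A[MAX_HARM + 1];      /* harmonic amplitudes, A[1]..A[L]                */
    float phi[MAX_HARM + 1];    /* harmonic phases (radians), phi[1]..phi[L]      */
};

/* Synthesise one frame as a sum of harmonically related sinusoids:
   s[n] = sum over l of A[l] * cos(l*Wo*n + phi[l])                                */
void synthesise_frame(const struct sine_model *m, float s[FRAME_N])
{
    for (int n = 0; n < FRAME_N; n++) {
        s[n] = 0.0f;
        for (int l = 1; l <= m->L; l++)
            s[n] += m->A[l] * cosf(l * m->Wo * n + m->phi[l]);
    }
}

A real decoder also has to smoothly interpolate the parameters across frame boundaries so the frames join without clicks, but the inner loop above is the essence of sinusoidal synthesis.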

There are a couple of tricks. The first is accurately estimating the parameters. For example, the little dots near the centre of each harmonic in the plot above are our estimates of the amplitudes. Not all of them are 100% accurate. Another problem is estimating the frequency of the sinusoids. A rather big challenge is how to represent all of the model parameters (amplitudes, phases, frequencies) in a small number of bits. This is called quantisation.
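As a rough illustration of the analysis side, below is a sketch (again illustrative, not the estimator used in my thesis) that computes the magnitude spectrum of a windowed 20 ms frame and then picks the biggest peak in each harmonic band around multiples of the pitch frequency:

#include <math.h>

#define PI       3.141592653589793f
#define FS       8000           /* sample rate (Hz)                    */
#define FRAME_N  160            /* 20 ms analysis frame                */
#define NDFT     512            /* DFT size (frame is zero padded)     */

/* Magnitude spectrum of a Hann windowed frame. A naive DFT is used
   here for clarity; a real implementation would use an FFT.           */
void mag_spectrum(const float s[FRAME_N], float mag[NDFT / 2])
{
    for (int k = 0; k < NDFT / 2; k++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < FRAME_N; n++) {
            float w = 0.5f - 0.5f * cosf(2.0f * PI * n / (FRAME_N - 1));
            re += w * s[n] * cosf(2.0f * PI * k * n / NDFT);
            im -= w * s[n] * sinf(2.0f * PI * k * n / NDFT);
        }
        mag[k] = sqrtf(re * re + im * im);
    }
}

/* Estimate the amplitude of harmonics 1..num_harmonics as the largest
   spectral magnitude within half a pitch spacing of l * f0.            */
void estimate_amplitudes(const float mag[NDFT / 2], float f0_hz,
                         int num_harmonics, float A[])
{
    float bins_per_harmonic = f0_hz * NDFT / FS;

    for (int l = 1; l <= num_harmonics; l++) {
        int lo = (int)((l - 0.5f) * bins_per_harmonic);
        int hi = (int)((l + 0.5f) * bins_per_harmonic);
        if (hi >= NDFT / 2)
            hi = NDFT / 2 - 1;

        A[l] = 0.0f;
        for (int k = lo; k <= hi; k++)
            if (mag[k] > A[l])
                A[l] = mag[k];
    }
}

Real estimators tend to be more careful, for example summing the energy across each harmonic band or fitting the analysis window shape to each peak, and the pitch itself has to be estimated first, which is a classic source of errors.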

The trick with real world DSP algorithms is in the detail. It’s never as simple as the basic mathematical model would suggest. The many real world factors are where the work lies.

For the brave there is a more detailed introduction to sinusoidal coding in Chapter 3 (Page 33) of my Thesis. It even has some equations and lots of rather intimidating Greek letters.

Unquantised Samples of Sinusoidal Coding

Here is a sample of original speech, and speech encoded and decoded using the sinusoidal model. You can hear that they sound fairly close to each other. However, bear in mind that the second sample is not quantised yet – for example, all of the model parameters are floating point numbers. It’s going to sound a whole lot worse by the time we reduce it down to 2400 bit/s!

The Plan

The first step is to sort through the code and convert it all from DOS Turbo C (man, I liked that IDE!) to a modern gcc project. There is also some code to convert from Matlab to C. Gasp – when I look at some of the C functions I wrote, they date back to 1990!

Next I need to work out the best way to quantise the various model parameters. As a first step I will try using the Speex vector LSP quantiser for the harmonic magnitudes, then figure out a first pass way of quantising the other model parameters.
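For readers who haven’t met vector quantisation before: the idea is to replace a whole vector of parameters (for example the LSPs describing the spectral envelope) with the index of its closest match in a pre-trained codebook, so only the index needs to be sent. A toy sketch of the encoder search, not the Speex quantiser itself and with made-up sizes, looks something like this:

#include <float.h>

#define VEC_DIM        10       /* e.g. 10 LSPs per frame (illustrative) */
#define CODEBOOK_SIZE  512      /* 512 entries => 9 bits per frame       */

/* Return the index of the codebook entry closest (in squared error)
   to the input vector x; only this index is sent over the channel.      */
int vq_encode(const float x[VEC_DIM],
              const float codebook[CODEBOOK_SIZE][VEC_DIM])
{
    int   best_index = 0;
    float best_dist  = FLT_MAX;

    for (int i = 0; i < CODEBOOK_SIZE; i++) {
        float dist = 0.0f;
        for (int j = 0; j < VEC_DIM; j++) {
            float e = x[j] - codebook[i][j];
            dist += e * e;
        }
        if (dist < best_dist) {
            best_dist  = dist;
            best_index = i;
        }
    }
    return best_index;
}

/* The decoder just looks up codebook[index] to recover an approximation
   of the original vector.                                                */

The codebook is trained offline on a large speech database, and practical quantisers use tricks such as split or multi-stage codebooks to keep the search time and storage manageable.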

Update – Source Code

SVN repository containing source code and instructions for running the unquantised codec on Linux/gcc:
$ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2 codec2

Links

Open Source Low Rate Codec Part 2 – Spectral Amplitudes
Open Source Low Rate Codec Part 3 – Phase and Male Speech
Codec2 Web Page

32 thoughts on “Open Source Low Rate Speech Codec Part 1”

  1. Great thing to work on!
    As a VoIP developer I also follow all this open-source codec effort, and it seems we are now at a big point of change in thinking around the issue. It seems Speex (and Vorbis) have done their job, cracking the ice. Happy hacking!

  2. Good luck and kudos to all working on this effort; this is a very important thing to have in the open source space. I spent a little time with Iridium phones recently (AMBE, I understand) and was quite surprised at the quality. Actually, as I had only ever heard LPC-10, it was really remarkable.

    I have a question though: if this quality is possible at 2400 bps, has there been any open work done on 1200, 900, or 300 bps? How low can you actually go and still have recognizable speech? Some googling revealed some samples as low as 600 bps from DSP Innovations, but nothing lower that you could actually listen to.

  3. John – I am not aware of any open source codecs beneath Speex at 4 kbit/s. I am not sure what the lower limits are, but I did read somewhere that the actual information content of speech (source entropy) is around 50 bit/s, i.e. if all you were sending were bits representing the words.

    In general, as the bit rate drops the intelligibility remains but the quality sounds more synthetic; this is OK for some applications.

    The nice thing about having an open source codec in this space is that people can experiment with different codecs for different bit rates. For example if your channel can tolerate a lot of delay and has a low bit error rate you can push your bit rate right down by exploiting correlation in adjacent frames and not using FEC.

    There are also some very interesting possibilities when the speech codec, FEC, and modulation scheme are combined. For example more important information in the coded speech (say the pitch) could be transmitted at higher power levels and less power allocated to less important information. This would help the speech quality degrade gradually as the channel conditions degrade. These sorts of combinations are difficult to achieve when the codec algorithm is locked up.

  4. The last time I looked (which was quite early on) Iridium used PSELP. I never found the details of what PSELP actually is, but I know it came from Motorola’s government systems group. I assume that means it’s derived from some military or security work. It’s something like 2.5 kbps.

    LPC10 is much maligned, because most people have only ever heard a broken codec. There are numerous broken implementations of LPC10 around the web. Try the one in spandsp. While not exactly hi-fi, it isn’t nearly as bad as what most people have heard as LPC10 quality.

  5. Hi Steve,

    I just listened to the spandsp LPC-10 files, and I agree; in particular, through a speaker they sound reasonable. More synthetic through headphones. Makes me wonder why Hams haven’t done more work with this vocoder; on a clear channel it would sound better than many SSB signals. Perhaps a fully packaged solution like D-star is required, including FEC and a modem.

    The Wikipedia entry for Iridium says the service uses AMBE. IIRC when first proposed about 20 years ago Iridium was backed by Motorola, which might explain the early use of PSELP.

    – David

  6. Gee, my memory is bad. When I tried Googling for PSELP, to see if there was any good material on it, I found an email from me on the Speex mailing list saying PSELP was replaced in the Iridium system. :-\

    Although I don’t know the details of PSELP I know it’s not a million miles from MELP. It’s patented, and so not of any real interest here.

  7. Good to see some work being done on open source low speed codecs. One question: do you have any plans to go below 2400 bps, say down to 1200 bps (for example, for inclusion into programs such as FDMDV)? I feel 1200 bps is getting close to the sweet spot for HF DV systems. 2400 works for VHF, as D-STAR demonstrates.

    Anyway, good luck with the venture, looking forward to hearing the results. 🙂

  8. Hi Tony. Sure, I think there is potential for a 1200 bit/s codec. One approach might be to exploit the high degree of correlation between adjacent frames of speech, for example buffer up 80 ms of speech and just transmit the small frame-to-frame changes. This would introduce some delay, but this is no big deal for simplex (push to talk) telephony.

    Another issue with bit rates is the overhead for Forward Error Correction (FEC). For example, at 2400 bit/s you might want to add 800-1200 bit/s for FEC. It depends on how many bit errors you are likely to get over your given channel.

  9. Hi David,

    On what do you base the balance of 800-1200 FEC bits for 2400 bits of codec data? Most radio systems use more. GSM, for example, adds about 9k to 13k in full rate mode, or 5k to 6k in half rate mode.

  10. Just a gut feel – the low bit rate speech coding work I have been involved with in the past used roughly that amount of FEC. D-star uses 1200 bps of FEC on top of 2400 bps voice. There are some HF Ham systems that use no FEC with 1200 bit/s voice.

    Appropriate application of FEC depends on the application, the channel, error distribution, modulation scheme, sensitivity of codec bits, correcting power of the FEC code plus a few other factors. There is also a lot that can be done with no FEC overhead for example packet repetition and tracking codec parameters like energy.

  11. My impression is that as time has passed the percentage of the available bandwidth assigned to FEC has increased, to emphasize voice consistency over maximum voice quality. I think part of the drive for that came from public perception of early GSM. People complained about the voice quality, and a lot of effort went into better codecs (e.g. the move from FR to EFR) as a knee-jerk reaction. However, their complaints appear to have had little to do with the codec. People seem fairly happy with the original FR codec when used in an error free manner. It was the average quality, with real world BER, that they didn’t like.

  12. WRT amateur radio, let’s not forget that most systems are still 25 kHz channelled, with IF filters to handle 16F3. For this application, rather than busting a boiler to do 2400 bps, it might be better to work on improving the modem. I’m thinking GMSK, but going from the G3RUH style to the AIS standard ITU-R M.1371-1, which seems to have proven to perform rather well in terrestrial and space applications. The original G3RUH design was a BT=0.5 design at 9,600 bps. At BT=0.3, 16,000 bps is possible but needs a good randomizer, i.e. K9NG or GRAPES, or a bi-phase approach to avoid problems with receiver sync. But one needs to bear in mind adjacent channel performance (splatter). In summary, with data rates of at least 9,600 bps available, and perhaps a little more, a 4,800 bps amateur radio CODEC might be a better sounding option while leaving plenty for FEC – perhaps a concatenated code, a rate 3/4 or 7/8 convolutional code followed by a suitable block code.

  13. This is a very cool and ambitious project – can’t wait for it 🙂
    Think about Mesh Networking and an Open Source/Wifi community village phone provider…

  14. Hello David,

    Very good cause! I read in a TAPR bulletin about your effort, and cannot help putting some praising and encouraging words in here. Indeed commercial closed codecs have frustrated widespread use and innovation in Ham Radio and third world point-to-point radio telephony (a very similar environment).
    So be sure this work will make a difference once completed!

    Kindest regards,

    Ben

  15. Thanks Sven and Ben for your kind words of encouragement, it really helps motivate me on the project 🙂

  16. David,
    The few left on HF digital voice (we’re on 20m almost every day) anxiously look forward to using your CODEC2. This handful of hardcore DV users will be available for any on-the-air testing feedback you may need.
    Thank you very much for taking on this project.
    73,
    Mel and the .236 DV users

  17. Thanks for your encouragement Mel, it means a lot. I’ll try and get an alpha version released in the next few months.

    Cheers,

    David

  18. Maybe a bit late, but I’ve just noticed at the very start you talk about a 2400 kbits/s codec (i.e. 2.4 megabits/s), I suspect that’s a typo given you then talk about 2400 bits/s. 🙂

  19. Hello all,

    I have followed the Codec2 effort for some time now, and I am impressed with the current quality as presented in the speech samples. The comparison with MELP gives equal or subjectively better performance, with one small negative: the female “ss” sounds have a slight crackle.

    I will try to set up a testing platform, to be able to make longer test sequences and try other speakers. I have been contemplating the possibility of porting the code to a Microchip PIC32, but since that only has integer mul/div, maybe another chip/DSP would be a better choice.

    Thanks also for the good explanation of the “innards”, I believe I actually get the picture!

    Keep up the good work.

    Marks Amateur Radio Klub

    through

    Gullik /SM6FBD

  20. Great work!

    But what I am missing in the milestones is UEP.
    If such codecs are intended for transmission over very noisy radio channels, forward error correction is mandatory! Optimising the needed redundancy (almost) always leads to Unequal Error Protection (UEP). Usually the 51 bits of a streamed 20 ms block are divided into bit groups A, B and C. The bits of group A need to be protected very strongly. The bits of group B are protected “normally” and the bits of group C do not need to be protected.

    BTW, AMBE seems to use only groups A and C.

    Please take this idea into account in your great development work.

    73 de Denis DL3OCK

    1. Thanks Denis. Yes, for a complete digital radio system I agree some form of forward error correction is a good idea, and unequal error protection is a good approach. However, I don’t think FEC should be part of the speech codec – there are some applications (like VOIP) that won’t need it. It’s a separate layer, like the modem. There are also techniques that involve no extra bits, like packet repetition.

      1. Hi David,

        generally I agree with you!

        But already in the codec design you can foresee the division of encoded bits for potential (but of course optional) UEP! I would always take this into account.

        An integrator of your codec in a noisy system (e.g. D-STAR) will be happy not to have to simply spend double the bits on protection. It is always better to achieve the same protection quality using less redundancy than in the “doubled” case.

        If you like I can provide you with the UEP scheme of the codec currently used in D-STAR via a PM.

        73 de Denis DL3OCK

        1. I would suggest that, since this is an open source project, it may be best to simply make it as easy as possible for others to develop patches for UEP support (whether by implementing FEC at this layer or providing some mechanism for lower layers to be aware of the bit groupings).

  21. Hi David, I was wondering: a problem always seems to be noise.
    Maybe it’s an idea to adjust the voice to be recorded by subtracting the noise you receive when no one is speaking. When I sometimes worked on improving recorded sound using GoldWave, I often copied empty zones where there was no speaking person active, and then used a trick to apply that as a filter and subtract it from the whole recording.
    The result: silences became really silent, and the voice became clearer.
    I think it’s an area where there is often improvement possible.

    Also, voice profiles (male/female and their usage of the sound spectrum) could be calculated; the more one speaks, the more you would know about the tone of his or her voice, which is unique for every person. I think your codec would benefit if it could detect that and optimize, or get near to it.

    Also, just a note: sentences often end a bit more quietly; if the codec is not real time, you might take advantage of that and use a slow fade-out.
    In real time that might be hard; maybe a small neural network could do such analysis.
    In general sentences have these slow ups and downs in volume (after hearing your samples, I missed those); it tended to hiss instead of fading out.

    OK, well, those are just ideas. I’m not a good coder, but I think it’s great you are doing this.

  22. I recently read a short note about your project in the German CQ DL. A quick walk through the blog showed me that this project could help with my own project. I’m very disappointed with D-STAR and would like to support a better solution for digital amateur radio. Therefore I am building a 23cm FM repeater which has the ability to open the analog signal path by a command and insert a codec in one or both directions. While this is still not a baseband approach, it should help me to find a practical solution for the codec hardware. However, the licence will include an experimental baseband use on a single channel. The idea is to use time multiplexing to establish full duplex communication without the need for HF duplexers. This is only to tell you my motivation.

    Back to your approach: I’m not really sure that the hardware needs to be a DSP or FPGA. I would like to find a solution to do this with a fast microcontroller. This might be an ARM or, even better, just an 8-bit AVR. Doing all the computation only in integer need not necessarily be a drawback, even when the clock speed is only 20 MHz. It depends of course on the complexity of the algorithm.

    This is actually my point of view as I found your work. Maybe your C implementation of codec2 will fit on an AVR. I would like to check this later. But my further interest would be to do it in much more optimized assembler. The codec must not only be intelligently designed but should also be coded smart and small. I’m sure there is a way to do that. But maybe after understanding your code I’ll be much more cautious 😉

    73 Tom, DC7GB

    1. Hi Tom,

      In future I will work on reducing the MIPS required for the codec. With a fixed point port I think it would run OK on a low end DSP or 200 MHz router CPU. I am not sure about an 8-bit microcontroller, as many DSP operations really need a machine that has a native word length of 16 bits.

  23. Hi David

    Many thanks for the work on the free codec from the Ham side.

    Although I am a D-STAR user and have two ID-880s – having experimented with P25 years ago – I too had many objections to AMBE, which is one of the excuses for equipment being priced so high.

    We have a success rate of less than 4% of hams who have to date obtained D-STAR equipment.

    In my experience the P25 mixed mode was a great advantage and probably one of the reasons P25 has given D-STAR a good counter-attack in the USA.

    My question/suggestion: we have many hams that will never go for D-STAR due to the expense.

    Can your codec not be combined onto the ALL STAR interface so we can one day have a mixed mode for migration at the individual’s own pace? Currently less than 1% have data going on D-STAR.

    An add-on board would be the way to go.

    I wonder if you have had any other suggestions like this.

    Kind regards

    From S.A

    Brad ZS5BG
