I have decided to start working on a free (as in speech) low bit rate speech codec. The initial target is 2400 bit/s communications quality speech. Communications quality means something between synthetic, robotic sounding speech and a mobile phone. The application is voice over low bandwidth digital radio, like VHF/HF radio channels and Ham radio, for example an open version of D-star.
Common proprietary closed codecs in this space are MELP and AMBE. Due to patents and the amount of confidential information surrounding these codecs I don’t think it is possible to make an open codec compatible with them. It is, however, possible to develop an open source, free-as-in-speech codec with similar performance at similar bit rates.
A lot of development has gone into these codecs, so I won’t claim that we can make an open codec with the same speech quality immediately. Therefore as a first milestone I will set the modest aim of speech quality between LPC-10 and MELP/AMBE at 2400 bits/s.
This project has been simmering on the back burner for a while now, and a couple of factors have come together to prompt me into action:
- Last year Bruce Perens contacted Jean-Marc Valin (of Speex fame) and myself regarding the problem of closed, patented, proprietary voice codecs in the sub-5 kbit/s range. Bruce has summarised the problem of low bit rate codecs and a possible development approach on the codec2 site.
- I have been following a proposed IETF standardisation of a free-as-in-speech high quality (e.g. 64 kbit/s), low latency codec with great interest. To help the effort along a little I blogged on Royalty Free Codecs which got the mental gears back into codec mode – especially the benefits of royalty free codecs. Curiously, most of the comments on this post were from Hams talking about the problems of closed codecs in the low bandwidth digital radio space.
- I have recently been messing with Ham radio again after a break of 25 years, so have become interested in Ham Radio issues such as the use of proprietary codecs.
- In the 1990’s I worked on low bit rate codecs so I have some of the necessary know-how. Where I am a little weak (e.g. vector quantisation) I have access to a bunch of very clever people in the open source community who are motivated to work on free software. So unlike my work of 10 years ago, I am not alone when I hit any tough bugs.
- Speex and the other open video and audio codecs have proven it’s possible to create a patent free, high quality codec. There is an important social theme behind these technical projects, which I discussed in the Royalty Free Codec post. A free codec helps a large number of people and promotes development and innovation. A closed codec helps a small number of people make money at the expense of stifled business and technical development for the majority.
- A low rate codec has applications in developing world communications which suffer from low bandwidth. For example 4 voice channels over a 14,400 baud dial up, or digital voice over non line of sight radio links (n.b. for VOIP the overhead of the IP protocol would be prohibitive for such a low rate codec so alternative protocols may be required).
- My experience with developing Oslec has been very positive. Oslec is a free line echo canceller that was developed to solve a similar problem in the VOIP space – a lack of a high quality, patent free, open source DSP algorithm. Until Oslec arrived you had to pay for “hardware” echo cancellation (DSP chips with proprietary code) or pay software license fees on otherwise free and open source systems. Oslec is now included in many Asterisk and Linux distributions and even the Linux Kernel. Along the way the Oslec project has helped demolish a bunch of echo canceller FUD, similar to what I see surrounding codecs. One important part of the Oslec experience was the use of Open source and community development techniques. The net result was access to a world wide “brains trust” and network of beta testers that resulted in swift development of effective DSP algorithms.
The codec algorithm will be based around a generic sinusoidal coder I developed in the 1990s. To get started I am re-reading my Thesis, which was published about 10 years ago. A lot of the techniques I used pre-date that (1970s and 1980s technology) and much of my thesis work was original, so it’s a good patent free starting point. The earliest paper I can find introducing sinusoidal coding is from 1984.
Here is the 1 minute explanation of sinusoidal coding. Below is a plot of the spectrum of a short segment (about 20ms) of female speech:
See how the speech spectrum is made up of peaks spaced by about 240Hz? Well 240Hz happens to be the pitch of the speech at this instant in time. Each peak can be thought of as a sine wave. A sinusoidal codec models the speech as a set of sine waves, each with its own frequency, phase, and amplitude. So instead of sending the speech waveform like a regular telephone, a sinusoidal encoder sends the sinusoid parameters over the channel to the decoder, which then reconstructs the speech. The parameters change over time so we update them at regular intervals, like every 20ms. It turns out that if you do all of this right, the speech at the decoder sounds pretty close to the original.
There are a couple of tricks. The first is accurately estimating the parameters. For example the little dots near the centre of each harmonic in the above plot are our estimates of the amplitudes. Not all of them are 100% accurate. Another problem is estimating the frequency of the sinusoids. A rather big challenge is how to represent all of the model parameters (amplitudes, phases, frequencies) in a small number of bits. This is called quantisation.
The trick with real world DSP algorithms is in the detail. It’s never as simple as the basic mathematical model would suggest. The many real world factors are where the work lies.
For the brave there is a more detailed introduction to sinusoidal coding in Chapter 3 (Page 33) of my Thesis. It even has some equations and lots of rather intimidating Greek letters.
Unquantised Samples of Sinusoidal Coding
Here is a sample of original speech, and speech encoded and decoded using the sinusoidal model. You can hear that they sound fairly close to each other. However, bear in mind that the second sample is not quantised yet – for example, all of the model parameters are floating point numbers. It’s going to sound a whole lot worse by the time we reduce it down to 2400 bits/s!
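A quick back-of-envelope calculation shows just how aggressive the quantisation has to be (assuming we keep the 20ms update rate):

```c
/* Bits available for one frame of model parameters at a given bit
   rate, assuming the parameters are updated every frame_ms
   milliseconds.  At 2400 bit/s and 20ms frames that is just 48 bits
   per frame to carry the pitch, voicing, and every harmonic
   amplitude and phase. */
int bits_per_frame(int bit_rate, float frame_ms)
{
    return (int)(bit_rate * frame_ms / 1000.0f + 0.5f);
}
```

Compare that to the unquantised model, where each of up to 80 harmonics carries floating point amplitude, phase, and frequency values, and you can see why quantisation dominates the design.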
The first step is to sort through the code and convert it all from DOS Turbo C (man I liked that IDE!) to a modern gcc project. There is also some code to convert from Matlab to C. Gasp – when I look at some of the C functions I wrote, they date back to 1990!
Next I need to work out the best way to quantise the various model parameters. As a first step I will try using the Speex vector LSP quantiser for the harmonic magnitudes, then figure out a first pass way of quantising the other model parameters.
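For readers unfamiliar with vector quantisation, the core idea is simple: instead of sending each parameter separately, a whole vector of parameters is matched against a trained codebook and only the index of the closest entry is transmitted. The Speex LSP quantiser is more sophisticated than this (multi-stage, with trained codebooks), but the basic search looks something like the following sketch, with names of my own choosing:

```c
#include <float.h>

/* Find the codebook entry closest (in squared Euclidean distance)
   to the input vector and return its index.  The codebook holds
   n_entries vectors of dim floats each, stored back to back; only
   the returned index needs to be sent over the channel, so a
   256-entry codebook costs just 8 bits per vector. */
int vq_search(const float *codebook, int n_entries, int dim,
              const float *vec)
{
    int best = 0;
    float best_dist = FLT_MAX;
    for (int i = 0; i < n_entries; i++) {
        float d = 0.0f;
        for (int j = 0; j < dim; j++) {
            float e = codebook[i * dim + j] - vec[j];
            d += e * e;
        }
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

The hard part is not the search but training good codebooks and choosing a distance measure that matches what the ear actually hears – which is where I will be leaning on people with more vector quantisation experience than me.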
Update – Source Code
SVN repository containing source code and instructions for running the unquantised codec on Linux/gcc (all one line):
$ svn co