I’m in the process of releasing a 1400 bit/s version of Codec 2. Through efficient quantisation of the LSPs I have reduced the bit rate from 2500 to 1400 bit/s with only a small change in quality. This bit rate makes the codec very useful for digital voice over HF radio channels.
Here are some samples:
|Codec 2 V0.1A 2550 bit/s||male||female|
|Codec 2 V0.2 1400 bit/s||male||female|
|GSM Full rate 13000 bit/s||male||female|
GSM full rate is what you might have been using on your mobile phone a few years ago. It’s a good example of “communications quality” speech. Compared to GSM, Codec 2 does a reasonable job at just 10% of the bit rate. There are some more samples on the Codec 2 page.
I think it’s possible to eventually push Codec 2 beneath 1000 bit/s with the same quality level. Improvements in the speech quality of Codec 2 at 1400 and 2500 bit/s are also possible with further algorithm development.
- At 1400 bit/s you can send 45 phone calls in the same bandwidth required for a standard 64 kbit/s phone channel.
- 1400 bit/s is 175 bytes/second.
- A 30 second voice mail can be stored in 5250 bytes.
- A 30 minute podcast can be stored in 308 kbytes.
- At 1400 bit/s Codec 2 uses 56 bit (7 byte) packets, sent every 40ms. If used for VOIP the RTP+UDP+IP overhead is 40 bytes/packet. So the payload is just 15 % of the total VOIP packet.
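The figures in the list above can be checked with a few lines of arithmetic. This is only a sketch of the sums: the variable names are mine, and I have assumed 1 kbyte = 1024 bytes for the podcast figure.

```python
# Back-of-envelope check of the 1400 bit/s figures quoted above.
BIT_RATE = 1400              # bit/s
PACKET_PERIOD_MS = 40        # one packet every 40 ms
RTP_UDP_IP_OVERHEAD = 40     # bytes of header per VOIP packet
BYTES_PER_KBYTE = 1024       # assumption: 1 kbyte = 1024 bytes

bytes_per_second = BIT_RATE // 8                        # 175
calls_per_64k_channel = 64000 // BIT_RATE               # whole calls that fit
voicemail_bytes = 30 * bytes_per_second                 # 30 second message
podcast_kbytes = 30 * 60 * bytes_per_second / BYTES_PER_KBYTE

packet_bits = BIT_RATE * PACKET_PERIOD_MS // 1000       # bits per 40 ms packet
packet_bytes = packet_bits // 8
payload_fraction = packet_bytes / (packet_bytes + RTP_UDP_IP_OVERHEAD)

print(bytes_per_second, calls_per_64k_channel, voicemail_bytes)
print(round(podcast_kbytes), packet_bits, packet_bytes)
print(f"payload is {100 * payload_fraction:.1f}% of each VOIP packet")
```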
Building a 1400 bit/s communications quality speech codec is a highlight of my career.
My interest in speech coding started in 1989 just after I graduated from engineering. I was working with a team of researchers on Mobilesat – one of the first satellite-based mobile phone services. We were working mainly on the modems for the Mobilesat system. Just at that point in history, it became possible to use digital speech rather than analog systems such as FM or SSB. During the 1980s breakthrough speech coding algorithms were developed that could deliver communications quality speech at less than 10 kbit/s. At the same time, the invention of DSP chips meant we could (just) run these complex algorithms in real time. Prior to that it took 30 minutes to process 3 seconds of speech on the PCs or workstations of the day.
Although I was meant to be working on sat-com modems, I was fascinated by speech coding; first the DSP hardware, then the challenges of real time implementation, then the speech coding algorithms themselves.
The CELP based speech codecs I built in the early 90s could deliver communications quality speech at 9600 bit/s, or (with a significant drop in quality) run at 4800 bit/s.
So managing to get about the same quality at 1400 bit/s is a nice personal achievement for me. Giving it away to the world is even cooler.
2500 to 1400 bit/s
Here is the bit allocation of Codec 2 running at 2500 bit/s:
|Parameter||Bits/frame|
|Spectral magnitudes (LSPs)||36|
|Voicing (updated each 10ms)||2|
|Fundamental Frequency (Wo)||7|
The Line Spectrum Pairs (LSPs) dominate the bit rate so I focused my attention there for a couple of weeks. Here is the bit allocation of Codec 2 running at 1400 bit/s:
|Parameter||Even frame (bits)||Odd (delta) frame (bits)||Total (bits)|
|Spectral magnitudes (LSPs)||25||7||32|
|Voicing (updated each 10ms)||2||2||4|
|Fundamental Frequency (Wo)||7||3||10|
A Graphical Explanation of LSPs
Rather than going into the math of LSPs let me explain graphically. Here is a plot of 10 LSPs over 400ms of male speech:
The LSPs were extracted from this section of male speech:
The segment is the word “force” from the male sample at the top of this post.
In our case there are 10 LSPs. They are spread over 0 to 3500Hz. Together they represent the spectrum of the speech signal at any given point in time. They evolve over time as the speech signal changes, so we have to keep sending updated versions every 20ms or so.
Speech coding is the art of “what can we throw away”. So the idea is to send each LSP frequency across the channel to the decoder with the minimum number of bits, but still maintain good speech quality.
Closely spaced LSPs represent peaks in the speech spectrum. Our ear is very sensitive to these peaks so we must take special care with closely spaced LSPs. In the example above, you can see LSPs 1 and 2 close together between frames 7 and 25, which coincides with a high energy vowel. You can also see LSPs 8 and 9, and 9 and 10, coming together during a consonant between frames 25 and 30. This indicates two peaks in the spectrum at high audio frequencies.
The ear is more sensitive to low frequencies, so it turns out we can use a coarser representation (fewer bits) for higher frequency LSPs.
There is another property that helps us. Perceptually important frames of voiced speech like vowels tend to change slowly. This suggests that coding the difference between frames will lead to coding efficiencies, as the frame-frame differences are very small.
Scalar and Vector Quantisation
The 2500 bit/s version of Codec 2 uses scalar quantisers. For example LSP 8 is “quantised” to one of 8 values represented by a table or array:
To quantise LSP 8, we find the closest value in the table, then send the index of that value. For LSP 8 this requires 3 bits/frame for the 8 possible values. The 2500 bit/s version of Codec 2 uses 36 bits total, with one table for each LSP. Because each LSP has its own quantiser table, it is known as scalar quantisation.
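Here is a minimal sketch of that table lookup. The table values are made up for illustration (they are not the real Codec 2 table for LSP 8), and the function names are hypothetical.

```python
# Scalar quantisation of one LSP: pick the nearest table entry, send its index.
# Hypothetical 8-entry table in Hz, NOT the actual Codec 2 values.
LSP8_TABLE_HZ = [2300, 2400, 2500, 2600, 2700, 2800, 2900, 3000]

def quantise_scalar(value_hz, table):
    """Return the index of the table entry closest to value_hz."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value_hz))

def dequantise_scalar(index, table):
    """Decoder side: look the index back up in the same table."""
    return table[index]

index = quantise_scalar(2568.0, LSP8_TABLE_HZ)
decoded_hz = dequantise_scalar(index, LSP8_TABLE_HZ)
bits_needed = (len(LSP8_TABLE_HZ) - 1).bit_length()   # 8 entries need 3 bits

print(index, decoded_hz, bits_needed)
```

Only the index crosses the channel. The decoder holds an identical copy of the table, so the quantisation error is whatever gap remains between the input and the nearest entry.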
A Vector Quantiser (VQ) can be more efficient as it quantises several values at once, which can be referred to with just one index:
|Index||LSP 1||LSP 2||LSP 3||LSP 4|
This example vector quantises LSPs 1 to 4 using a 12 bit (4096 entry) table. VQs can be very efficient, as they quantise several values at once and can take into account correlations in the input data. The trade off is that VQs tend to be noisy, as the table entries may not quite match all of the values in the input vector. Also I have found designing a VQ that works well to be quite a challenge.
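The search itself can be sketched as a nearest-neighbour lookup over the whole vector. I have assumed a plain squared-error distance here, and the tiny 4 entry codebook is invented for illustration; a 12 bit VQ would search 4096 entries the same way.

```python
# Vector quantisation: one index selects values for several LSPs at once.
# Hypothetical 4-entry (2 bit) codebook for LSPs 1-4 in Hz, for illustration only.
CODEBOOK_HZ = [
    (300, 500, 1100, 1500),
    (350, 600, 1200, 1700),
    (400, 700, 1400, 1900),
    (450, 900, 1600, 2100),
]

def squared_error(vec, entry):
    """Sum of squared differences between the input and a codebook entry."""
    return sum((v - e) ** 2 for v, e in zip(vec, entry))

def quantise_vector(vec, codebook):
    """Return the index of the codebook entry with the smallest squared error."""
    return min(range(len(codebook)), key=lambda i: squared_error(vec, codebook[i]))

lsps_hz = (360, 640, 1250, 1680)
index = quantise_vector(lsps_hz, CODEBOOK_HZ)
decoded_hz = CODEBOOK_HZ[index]
residual_hz = tuple(v - d for v, d in zip(lsps_hz, decoded_hz))

print(index, decoded_hz, residual_hz)
```

The residual shows the "noisy" side of the trade off: the best entry is close on every element but exact on none of them.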
From 2500 to 1400 bit/s
Here are some highlights of a couple of weeks of trial and error. I am not claiming any of this is particularly original, just new or important to me so worth logging here:
I developed a 25 bit/frame quantiser using scalar quantisers for LSPs 1-4, then a 12 bit (4096 entry) Vector Quantiser (VQ) for LSPs 5-10. I used VQ for the higher LSPs, as they are less sensitive to quantisation noise. I freely admit I don’t completely understand VQ, so there may be room for improvement here.
I found that I could “pre-quantise” or bandwidth expand the LSPs without any drop in quality. For example if LSPs 5-10 are quantised to 100Hz steps there is no perceptual difference in the decoded speech. This suggests that quantising the LSPs to any finer resolution is a waste of bits – we can’t hear the difference. I used this effect to design the 12 bit Vector Quantiser (VQ) for LSPs 5-10, that (for me at least) worked better than the same size VQ I designed with traditional minimum mean square error (MSE) training methods.
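As a rough illustration of the 100Hz step idea, this sketch snaps each LSP frequency to the nearest multiple of 100Hz. The function name and the sample frequencies are hypothetical, and this shows only the rounding step, not how the VQ itself was designed.

```python
import math

STEP_HZ = 100  # step size mentioned above for LSPs 5-10

def prequantise(lsps_hz, step_hz=STEP_HZ):
    """Round each LSP frequency to the nearest multiple of step_hz."""
    return [step_hz * math.floor(f / step_hz + 0.5) for f in lsps_hz]

# Hypothetical values for LSPs 5-10 in Hz.
lsps_5_to_10 = [1734.0, 2068.0, 2391.0, 2655.0, 2912.0, 3299.0]
snapped = prequantise(lsps_5_to_10)

print(snapped)
```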
Then I started playing with delta-time quantisation of LSPs. During high energy, strongly voiced speech, the LSPs change slowly from frame to frame. So I experimented with just transmitting this small change.
Perceptually important, closely spaced LSPs are really sensitive to quantisation noise. Small changes in closely spaced LSPs can have a big effect on the decoded speech quality. Fortunately, during this sort of speech the frame to frame changes are very small. So for coding delta changes in LSPs I designed a VQ codebook by hand with the properties I wanted. I took the approach of constraining the VQ codebook to very small changes.
Here are some of the delta-time codebook entries:
|Index||LSP 1||LSP 2||LSP 3||LSP 4|
This full codebook is here.
It’s a bit like counting in binary, except the base changes for each element of the vector. It’s probably not optimal, but it works. As there are 81 values in the codebook it can be transmitted with 7 bits (counting from zero, indices 81 to 127 are not used).
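One way to build a codebook of this shape is to enumerate every combination of a few small steps per LSP. This sketch assumes three choices per element (down, no change, up), which gives 3 x 3 x 3 x 3 = 81 entries for 4 LSPs. The step sizes are invented, so treat this as an illustration of the counting idea rather than the actual codebook.

```python
from itertools import product

# Hypothetical step sizes in Hz for LSPs 1-4, NOT the real codebook values.
STEPS_HZ = (25, 25, 50, 50)

def build_delta_codebook(steps_hz):
    """Enumerate every combination of (-step, 0, +step) across the elements."""
    choices = [(-s, 0, s) for s in steps_hz]
    return list(product(*choices))

codebook = build_delta_codebook(STEPS_HZ)
index_bits = (len(codebook) - 1).bit_length()   # bits needed to send an index

print(len(codebook), index_bits)
print(codebook[0], codebook[40], codebook[80])
```

The last element changes fastest as the index increases, like the least significant digit of a counter, and the middle entry is the "no change" vector.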
Another twist I discovered was that I could get away with updating just the lowest LSPs, 1-4, on the delta frames. On odd delta frames I just copy the previous values for LSPs 5-10. This means just 7 bits/frame are required on the odd frames for LSPs.
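At the decoder, an odd frame could be rebuilt roughly as follows. This is a sketch under my own assumptions: the delta is applied to the previous frame's decoded LSPs, and the function name and sample values are hypothetical.

```python
def decode_odd_frame_lsps(prev_lsps_hz, delta_hz):
    """Rebuild 10 LSPs for an odd (delta) frame.

    prev_lsps_hz: the 10 decoded LSPs from the previous frame, in Hz.
    delta_hz: the 4-element delta codebook entry selected by the 7 bit index.
    """
    if len(prev_lsps_hz) != 10 or len(delta_hz) != 4:
        raise ValueError("expected 10 previous LSPs and a 4-element delta")
    low = [p + d for p, d in zip(prev_lsps_hz[:4], delta_hz)]  # LSPs 1-4 updated
    high = list(prev_lsps_hz[4:])                              # LSPs 5-10 copied
    return low + high

# Hypothetical previous frame and delta entry.
prev = [310, 420, 1080, 1520, 1900, 2210, 2500, 2800, 3100, 3300]
delta = (25, 0, -50, 0)
current = decode_odd_frame_lsps(prev, delta)

print(current)
```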
A further reduction came from delta-coding the pitch (fundamental frequency), as it also changes slowly during perceptually important voiced speech frames. Just 3 bits on odd frames resulted in no loss of quality.
Over the next few weeks I will release a separate encoder/decoder version of 1400 bit/s Codec 2. At the moment you can run the same algorithm using the “c2sim” simulation program:
$ svn -r 306 co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2-dev
$ cd codec2-dev && ./configure && make && cd src
$ ./c2sim ../raw/hts1a.raw --1400 -o hts1a_1400.raw
$ ../script/playraw.sh hts1a_1400.raw
I’d really like to see Codec 2 combined with a modem and running over the HF bands. Some early experimentation to get real world user feedback and rapid development of an “open” digital voice mode would be great. This mode could be implemented as a Linux or Windows PC application that uses two sound cards to connect to an SSB radio and a headset.
A key issue to explore is robustness to bit errors. I favour unequal error protection modes, for example just a small amount of FEC that protects just a few key bits.
There are many areas where Codec 2 could be improved. The LSP quantisation could be developed further to improve the quality or lower the bit rate. I’d also like to work on the model used to synthesise phases at the decoder, and track down some issues with different speakers.