OK, so FreeDV 700 was released a few weeks ago and I’m working on some ideas to improve it, especially those annoying R2D2 noises caused by bit errors at low SNRs.
I’m trying some ideas to improve the speech quality without the use of Forward Error Correction (FEC).
Speech coding is the art of “what can I throw away?”. Speech codecs remove as much redundant information as they can. Hopefully, with what’s left, you can still understand the reconstructed speech.
However there is still a bit of left over redundancy. One sample of a model parameter can look a lot like the previous and next samples. If our codec quantisation were really clever, adjacent samples would look like noise: the previous and next samples would look nothing like the current one. They would be totally uncorrelated, and our codec bit rate would be minimised.
This leads to a couple of different approaches to the problem of sending coded speech over a channel with bit errors:
The first, conventional approach is to compress the speech as much as we can. This lowers the bit rate but makes the coded speech very susceptible to bit errors – one bit error might make a lot of speech sound bad. So we insert Forward Error Correction (FEC) bits, raising the overall bit rate (not so great), but protecting the delicate coded speech bits.
This is also a common approach for sending data over dodgy channels. For data, we cannot tolerate any bit errors, so we use FEC, which can correct every error (or die trying).
However speech is not like data. If we get a click or a pop in the decoded speech we don’t care much. As long as we can sorta make out what was said. Our “Brain FEC” will then work out what the message was.
Which leads us to another approach. If we leave a little redundancy in the coded speech, we can use that to help correct or at least smooth out the received speech. Remember that for speech, it doesn’t have to be perfect. Near enough is good enough. That can be exploited to get us gain over a system that uses FEC.
Turns out that in the Bit Error Rate (BER) range we are playing with (5-10%) it’s hard to get a good FEC code. Many of the short ones break – they introduce more errors than they correct. The really good ones are complex, with large block sizes (1000s of bits) that introduce unacceptable delay. For example at 700 bit/s, a 7000 bit FEC codeword is 10 seconds of coded speech. Oops. Not exactly push to talk. And don’t get me started on the memory, MIPS, implementation complexity, and modem synchronisation issues.
These ideas are not new, and I have been influenced by some guys I know who have worked in this area (Philip and Wade if you’re out there). But not influenced enough to actually look up and read their work yet, lol.
So the idea is to exploit the fact that each codec model parameter changes fairly slowly. Another way of looking at this is the probability of a big change is low. Take a look at the “trellis” diagram below, drawn for a parameter that is represented by a 2 bit “codeword”:
Let’s say we know our current received codeword at time n is 00. We happen to know it’s fairly likely (50%) that the next received bits at time n+1 will be 00. A 11, however, is very unlikely (0%), so if we receive a 11 right after a 00 there has very probably been an error, which we can correct.
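To make that concrete, here is a toy transition probability matrix for a 2 bit parameter. The numbers are made up for illustration – they are not taken from trellis.m:

```
% toy transition probabilities for a 2 bit parameter (made-up numbers)
% rows are the current codeword, columns the next codeword
%          next: 00    01    10    11
tp = [           0.50  0.25  0.25  0.00;    % current 00
                 0.25  0.50  0.25  0.00;    % current 01
                 0.00  0.25  0.50  0.25;    % current 10
                 0.00  0.00  0.25  0.75 ];  % current 11

printf("P(00 -> 00) = %4.2f\n", tp(1,1));   % fairly likely (0.50)
printf("P(00 -> 11) = %4.2f\n", tp(1,4));   % very unlikely (0.00)
```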
The model I am using works like this:
- We examine three received codewords: the previous, current, and next.
- Given a received codeword we can work out the probability of each possible transmitted codeword. For example we might BPSK modulate the two bit codeword 00 as -1 -1. However when we add noise the receiver will see -1.5 -0.25. So the receiver can then say, well … it’s most likely -1 -1 was sent, but it also could have been a -1 1, and maybe the noise messed up the last bit.
- So we work out the probability of each sequence of three codewords, given the probability of jumping from one codeword to the next. For example here is one possible “path”, 00-11-00:
total prob =
(prob a 00 was sent at time n-1) AND
(prob of a jump from 00 at time n-1 to 11 at time n) AND
(prob a 11 was sent at time n) AND
(prob of a jump from 11 at time n to 00 at time n+1) AND
(prob a 00 was sent at time n+1)
- All possible paths through the three received values are examined, and the most likely one is chosen (there is a code sketch of this search below).
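Here is a minimal Octave sketch of that search, assuming 2 bit codewords BPSK modulated as -1/+1 over AWGN, the toy tp matrix above, and a guessed noise variance. The variable names are mine – trellis.m doesn’t necessarily work this way:

```
% minimal sketch of the exhaustive trellis search over three received
% codewords; assumes 2 bit codewords BPSK modulated as -1/+1 over AWGN,
% and the toy tp matrix above
symbols = [-1 -1; -1 1; 1 -1; 1 1];   % modulated symbols for 00,01,10,11
sigma2  = 0.5;                        % assumed channel noise variance

% received soft symbols at times n-1, n, n+1 (one row per time)
rx = [-0.9 -1.2; -1.5 -0.25; -1.1 -0.8];

% log likelihood of each codeword at each time from the Gaussian PDF
% (constant terms dropped, they are common to every path)
ll = zeros(3,4);
for t = 1:3
  for c = 1:4
    ll(t,c) = -sum((rx(t,:) - symbols(c,:)).^2)/(2*sigma2);
  end
end

ltp = log(tp + 1E-12);                % log transition probabilities

% exhaustive search of all 4^3 = 64 paths for the most likely one
best = -Inf;
for a = 1:4                           % codeword at time n-1
  for b = 1:4                         % codeword at time n
    for c = 1:4                       % codeword at time n+1
      p = ll(1,a) + ltp(a,b) + ll(2,b) + ltp(b,c) + ll(3,c);
      if p > best
        best = p; decoded = b;        % keep the middle codeword
      end
    end
  end
end
printf("decoded middle codeword: %s\n", dec2bin(decoded-1, 2));
```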
The transition probabilities are pre-computed using a training database of coded speech, although it would also be possible to measure them on the fly, training up to each speaker.
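One plausible way to train them is to histogram consecutive quantiser indices from the training database and normalise each row, along these lines (train_db is a hypothetical vector of parameter indices, one per codec frame):

```
% estimate transition probabilities by counting consecutive quantiser
% indices in a training database (train_db is a hypothetical vector of
% values 0 .. nstates-1, one per codec frame)
function tp = train_tp(train_db, nstates)
  counts = zeros(nstates, nstates);
  for n = 1:length(train_db)-1
    counts(train_db(n)+1, train_db(n+1)+1) += 1;
  end
  tp = counts ./ (sum(counts, 2) + 1E-12);  % normalise rows to sum to 1
end
```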
I think this technique is called maximum likelihood decoding.
Demo and Walkthrough
To test this idea I wrote a GNU Octave simulation called trellis.m.
Here is a test run for a single trellis decode. The internal states are dumped for your viewing pleasure. You can see the probability calculations for each received codeword, the transition probabilities for each state, and the exhaustive search of all possible paths through the 3 received codewords. At the end it gets the right answer: the middle codeword is decoded as a 00.
For convenience the probability calculations are done in the log domain, so rather than multiplies we can use adds. A large negative “probability” means really unlikely, a positive one likely.
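For example, the product of probabilities along a path becomes a sum of logs:

```
% in the log domain a chain of multiplies becomes a chain of adds
p    = 0.5 * 0.1 * 0.3;
logp = log(0.5) + log(0.1) + log(0.3);
assert(abs(exp(logp) - p) < 1E-9);   % same answer, no underflow worries
```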
Here is a plot of 10 seconds of a 4 bit LSP parameter:
You can see some segments where it is relatively stable, and others where it’s bouncing around. This is a mesh plot of the transition probabilities, generated from a small training database:
It’s pretty close to an “eye” (identity) matrix. For example, if you are in state 10, it’s fairly likely the next state will be close by, and less likely you will jump to a remote state like 0 or 15.
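If you want to reproduce a similar plot from a matrix trained as sketched above (16 states for a 4 bit parameter), something like this should do it:

```
% mesh plot of a 16 state (4 bit) transition probability matrix
tp = train_tp(train_db, 16);   % train_tp/train_db as per the sketch above
mesh(0:15, 0:15, tp);
xlabel('next state'); ylabel('current state'); zlabel('probability');
```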
Here is a test run using data from several seconds of coded speech:
```
loading training database and generating tp .... done
loading test database .... done
Eb/No: 3.01 dB nerrors 28 29 BER: 0.03 0.03 std dev: 0.69 1.76
```
We decode using both trellis-based decoding and simple hard decision decoding. Note how the number of errors and the BER are the same? However the std dev (distance) between the transmitted and decoded codewords is much better for trellis-based decoding. This plot shows the decoder errors over 10 seconds of a 4 bit parameter:
See how the trellis decoding produces smaller errors?
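The std dev figure is just the spread of the error between the transmitted and decoded parameter values, something like:

```
% distortion measure: spread of the error between the transmitted and
% decoded parameter values (tx_indexes, rx_indexes are hypothetical
% vectors of 4 bit quantiser indices)
err = tx_indexes - rx_indexes;
printf("std dev: %4.2f\n", std(err));
```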
Not all bit errors are created equal. The trellis-based decoding favours small errors that have a smaller perceptual effect (we can’t hear them). Simple hard decision decoding has a random distribution of errors. Sometimes you get the Most Significant Bit (MSB) of the binary codeword flipped, which is bad news. You can see this effect above: with a 4 bit codeword, an MSB error means a jump of +/- 8. These large errors are far less likely with trellis decoding.
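To see why an MSB error is so much worse than an LSB error, try flipping single bits of a 4 bit codeword:

```
% flipping the MSB of a 4 bit codeword jumps the value by +/- 8,
% flipping the LSB by just +/- 1
code = 5;                                                 % binary 0101
printf("MSB flip: %2d -> %2d\n", code, bitxor(code, 8));  % 5 -> 13
printf("LSB flip: %2d -> %2d\n", code, bitxor(code, 1));  % 5 ->  4
```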
Here are some samples that compare trellis-based decoding to simple hard decision decoding, when applied to Codec2 at 700 bit/s on an AWGN channel using PSK. Only the 6 LSP parameters are tested (short term spectrum); no errors or correction are applied to the excitation parameters (voicing, pitch, energy).
| Eb/No (dB) | BER | Trellis | Simple (hard dec) |
| --- | --- | --- | --- |
At 3dB, the trellis-based decoding removes most of the effects of bit errors, and it sounds similar to the no-error reference. Compared to simple decoding, the bloops and washing machine noises have gone away. At 0dB Eb/No, the speech quality is improved, with some exceptions. Fast changes, like the “W” in double-you and the “B” in Bruce, become indistinct. This is because when the channel noise is high, the probability model favours slow changes in the parameters.
Still – getting any sort of speech at 8% bit error rates with no FEC is pretty cool.
These techniques could also be applied to FreeDV 1600, improving the speech quality with no additional overhead. Further work is required to extend these ideas to all the codec parameters, such as pitch, energy, and voicing.
I need to train the transition probabilities with a larger database, or make them train in real time using off-air data.
We could include other information in the model, like the relationship of adjacent LSPs, or how energy and pitch change slowly in strongly voiced speech.
Now 10% BER is an interesting, rarely explored area. The data guys start to sweat above 1E-6, and assume everyone else does too. At 10% BER FEC codes don’t work well; you need a really long block size or a low FEC rate. Modems struggle due to synchronisation issues. However at 10% the Eb/No versus BER curves start to get flat, so a few dB either way doesn’t change the BER much. This suggests small changes in intelligibility (not much of a threshold effect). Like analog.
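You can see the flattening with the textbook curve for ideal coherent BPSK over AWGN:

```
% ideal coherent BPSK over AWGN: BER = 0.5*erfc(sqrt(Eb/No))
ebno_dB = -2:0.5:8;
ber = 0.5*erfc(sqrt(10.^(ebno_dB/10)));
semilogy(ebno_dB, ber); grid on;
xlabel('Eb/No (dB)'); ylabel('BER');
```

Dropping from 0 dB to -2 dB only moves the BER from roughly 8% to 13%, whereas the same 2 dB step around 5-7 dB changes the BER by nearly an order of magnitude.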
For speech, we don’t need to correct all errors; we just need to make it sound like they are corrected. By leaving some residual redundancy in the coded speech parameters we can use probability models to correct errors in the decoded speech with no FEC overhead.
This work is another example of experimental work we can do with an open source codec. It combines knowledge of the channel, the demodulator and the codec parameters to produce a remarkable result – improved performance with no FEC.
This work is in its early stages. But the gains all add up. A few more dB here and there.