Following on from the subset vector quantiser [1] I’ve been working on improving speech quality in the presence of bit errors without the use of Forward Error Correction (FEC). I’ve tried two techniques:
- Optimisation of VQ indexes [2][3]
- Trellis based maximum likelihood decoding of a sequence of VQ indexes [4]
Digital speech and FEC aren’t a great mix. Speech codecs often work fine with a few percent BER, which is unheard of for data applications that need zero bit errors. This is because humans can extract intelligible information from payload speech data with errors. On the channels of interest we like to play at low SNRs and high BERs (around 10%). However, not many FEC codes work at 10% BER, and the ones that do require large block sizes, introducing latency that is problematic for Push To Talk (PTT) speech. So I’m interested in exploring alternatives to FEC that allow gradual degradation of speech in channels with high bit error rates.
Vector Quantisation
Here is a plot that shows a 2 dimensional Vector Quantiser (VQ) in action. The cloud of small dots is the source data we wish to encode; each dot represents 2 source data samples, plotted for convenience on a 2D plot. The circles are the VQ entries, which are trained to approximate the source data. We arrange the VQ entries in a table, each with a unique index. To encode a source data pair, we find the nearest VQ entry and send its index over the channel. At the decoder, we reconstruct the pair using a simple table look-up.
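If you’d like to see the idea in code, here’s a minimal sketch of the encode/decode steps (Python/numpy). The tiny random codebook is just for illustration – a real VQ is trained on speech data, e.g. with LBG/k-means – and this isn’t the Codec 2 implementation:

```python
import numpy as np

# Illustrative 2D codebook with 16 entries (a 4 bit index). A real VQ is
# trained on source data, e.g. with the LBG/k-means algorithm.
rng = np.random.default_rng(0)
vq = rng.standard_normal((16, 2))

def vq_encode(x, codebook):
    """Return the index of the codebook entry nearest to vector x."""
    dist = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(dist))

def vq_decode(index, codebook):
    """Reconstruction at the decoder is just a table look-up."""
    return codebook[index]

x = np.array([0.3, -0.7])      # a source data pair
idx = vq_encode(x, vq)         # index sent over the channel
x_hat = vq_decode(idx, vq)     # decoded pair at the receiver
```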
When we get a bit error in that index, we tend to jump to some random VQ entry, which can introduce a large error in the decoded pair. Index optimisation re-arranges the VQ indexes so that a bit error in the received index results in a jump to a nearby VQ entry, minimising the effect of the error. This is shown by the straight lines in the plot above: each line shows the decoded VQ value for a single bit error. They aren’t super close (the nearest neighbours), as I guess it’s hard to simultaneously optimise all VQ entries for all single bit errors.
For my experiments I used a 16th order VQ, so instead of a pair of samples, 16 samples are encoded by each VQ entry. These 16 samples represent the speech spectrum. It’s hard to plot a 16 dimensional value, but the same ideas apply. We can optimise the VQ indexes so that a single bit error will lead to a decoded value “close” to the desired value. This gives us some robustness to bit errors for free: unlike Forward Error Correction, no additional bits need to be sent. Also unlike FEC, the errors aren’t corrected, just masked to a certain extent (the decoded speech sounds a bit better).
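To make the idea concrete, here’s a toy sketch of the sort of cost a pseudo-Gray/binary switch re-ordering tries to reduce, plus a very naive random-swap optimiser. It illustrates the concept only; it is not the algorithm from [3] or the code in [5]:

```python
import numpy as np

def bit_error_cost(codebook, probs=None):
    """Average squared error introduced when a single index bit is flipped,
    optionally weighted by how often each entry is used (probs)."""
    n = len(codebook)
    nbits = int(np.log2(n))
    if probs is None:
        probs = np.full(n, 1.0 / n)
    cost = 0.0
    for i in range(n):
        for b in range(nbits):
            j = i ^ (1 << b)                       # index i with bit b flipped
            cost += probs[i] * np.sum((codebook[i] - codebook[j]) ** 2)
    return cost / nbits

def naive_index_optimise(codebook, iters=10000, seed=0):
    """Toy optimiser: keep random pairwise swaps of codebook rows (i.e. index
    re-labellings) that reduce the bit error cost."""
    rng = np.random.default_rng(seed)
    cb = codebook.copy()
    best = bit_error_cost(cb)
    for _ in range(iters):
        i, j = rng.integers(0, len(cb), size=2)
        cb[[i, j]] = cb[[j, i]]                    # try the swap
        c = bit_error_cost(cb)
        if c < best:
            best = c                               # keep it
        else:
            cb[[i, j]] = cb[[j, i]]                # undo it
    return cb, best
```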
The Trellis system is described in detail in [4]. This looks at a sequence of received VQ indexes, and the trajectory they take on the figure above. It makes the assumption that the speech signal changes fairly slowly in time, so the trajectory we trace on the figure above tends to be small jumps across the VQ “space”. A large jump means a possible bit error. Simultaneously, we look at the likelihood of receiving each vector index. In a noisy channel, we are more sure of some bits, and less sure of others. These are all arranged on a 2D “trellis” which we search to find the most likely path. This tends to work quite well when the vectors are sampled at a high rate (10 or 20ms), less so when we sample them less often (say 40ms).
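Here’s a toy Python sketch of the trellis idea, assuming BPSK soft bits from an AWGN channel and a simple “large jumps in VQ space are unlikely” transition penalty. It’s a standard Viterbi search, not the actual code in [4]/[6], and the var/lam knobs are mine for illustration:

```python
import numpy as np

def index_bits(k, nbits):
    """Index k as a +/-1 BPSK bit vector, MSB first."""
    return np.array([1.0 if (k >> b) & 1 else -1.0 for b in reversed(range(nbits))])

def trellis_decode(rx_soft, codebook, var=1.0, lam=1.0):
    """Toy maximum likelihood sequence decoder over a run of VQ indexes.

    rx_soft  : (T, nbits) received soft BPSK symbols for each frame's index
    codebook : (2**nbits, dim) VQ table
    var      : channel noise variance (how much we trust each received bit)
    lam      : weight on the 'speech changes slowly' transition penalty
    """
    n = len(codebook)
    nbits = int(np.log2(n))
    T = len(rx_soft)

    # Channel log-likelihood of each candidate index, for each frame.
    bits = np.array([index_bits(k, nbits) for k in range(n)])        # (n, nbits)
    loglik = np.array([-np.sum((r - bits) ** 2, axis=1) / (2 * var)
                       for r in rx_soft])                            # (T, n)

    # Transition cost: large jumps across the VQ space are penalised.
    d2 = np.sum((codebook[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    logtrans = -lam * d2                                             # (n, n)

    # Standard Viterbi recursion over the trellis.
    metric = loglik[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = metric[:, None] + logtrans                            # (prev, cur)
        back[t] = np.argmax(cand, axis=0)
        metric = cand[back[t], np.arange(n)] + loglik[t]

    # Trace back the most likely index sequence.
    path = [int(np.argmax(metric))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```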
The details of the two algorithms tested are in the GitHub PRs [5][6].
Results
This plot compares a few methods. The x-axis is normalised SNR (Eb/No), and the y-axis is spectral distortion. Results are an average over 30 seconds of speech on an AWGN channel.
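Spectral distortion (SD) here measures how far the decoded spectrum is from the original, in dB. A minimal sketch of one common way to compute it, assuming the usual per-frame RMS-of-dB-error definition (not necessarily the exact script behind the plot):

```python
import numpy as np

def spectral_distortion_db(orig_db, decoded_db):
    """Mean spectral distortion in dB.

    orig_db, decoded_db : (frames, K) arrays of spectral samples in dB
    """
    per_frame_sd = np.sqrt(np.mean((orig_db - decoded_db) ** 2, axis=1))
    return float(np.mean(per_frame_sd))
```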
- No errors (blue, bottom) is the basic VQ with no channel errors. Compared to the input samples, it has an average spectral distortion of 3dB, which gets you rough but usable “communications quality” speech.
- Vanilla AWGN (green) is the spectral distortion as we lower the Eb/No (SNR), and bit errors gradually creep in. FreeDV 700C [8] uses no FEC so would respond like this to bit errors.
- The red curve is similar to FreeDV 700D/E – we are using a rate 0.5 LDPC code to protect the speech data. This works well until it doesn’t – you hit a threshold in SNR and the code falls over, introducing more errors than it corrects.
- The upper blue curve is the index optimised VQ (using the binary switch algorithm). This works pretty well (compare to green) and comes at zero cost – we’ve just shuffled a few indexes in the VQ table.
- Black is when we combine the FEC with index optimisation. Even better at high Eb/No, and the “knee” where the FEC falls over is much less obvious than “red”.
- Cyan is the Trellis decoder: quite a good result at low Eb/No, but with a “long tail” – it makes a few mistakes even at high Eb/No.
Here are some speech samples showing the index optimisation and trellis routines in action. They were generated at the Eb/No = 1dB (6% BER) operating point on the plot above. The Codec 2 700C mode is provided as a control. In these spectral distortion tests the 700C mode uses 22 bits/frame at a 40ms frame rate; the “600” mode uses just 12 bits/frame at a 30ms frame rate. I’ve only applied the index optimisation and trellis decoding to the 600 mode.
Mode | BER | VQ | Decoder | Sample1 | Sample2 | Sample3 |
---|---|---|---|---|---|---|
700C | 0.00 | Orig | Normal | Listen | Listen | Listen |
700C | 0.06 | Orig | Normal | Listen | Listen | Listen |
600 | 0.00 | Orig | Normal | Listen | Listen | Listen |
600 | 0.06 | Orig | Normal | Listen | Listen | Listen |
600 | 0.06 | Opt | Normal | Listen | Listen | Listen |
600 | 0.06 | Opt | Trellis | Listen | Listen | Listen |
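As a quick sanity check of the Eb/No = 1dB (6% BER) operating point quoted above, assuming uncoded coherent BPSK/QPSK over AWGN:

```python
from math import erfc, sqrt

ebno_db = 1.0
ebno = 10 ** (ebno_db / 10)
ber = 0.5 * erfc(sqrt(ebno))                        # BPSK/QPSK bit error rate on AWGN
print(f"BER at Eb/No = {ebno_db} dB: {ber:.3f}")    # about 0.056, i.e. ~6%
```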
The index optimisation seems effective, especially on samples 1 and 2. The improvements are less noticeable on the longer sample3, although the longer sample makes it harder to do a quick A/B test. The Trellis scheme is even better at reducing the pops and clicks, but I feel on sample1 at least it tends to “smooth” the speech, so it becomes a little less intelligible.
Discussion
In this experiment I compared two non-redundant techniques to FEC based protection, on a spectral distortion versus Eb/No scale.
While experimenting with this work I found an interesting trade-off between update rate and error protection: with a higher update rate we notice errors less, but unfortunately it also increases the bit rate.
The non-FEC techniques degrade gradually (a “fuzzy” degradation) rather than hitting a knee. This is quite useful for digital speech systems: at the bottom of a fade we might get “readability 3” speech [7] that bounces back up to “readability 5” after the fade. The ear will put it all back together using “Brain FEC”. With FEC based schemes you get readability 5, then R2D2 noises in the fade, then readability 5 again.
So non-FEC schemes have some potential to lower the “minimum SNR” the voice link can handle.
It’s clear that index optimisation does help intelligibility, with or without FEC.
At low Eb/No the Packet Error Rate (PER) is 50%! So every 2nd 12-bit vector index has at least 1 bit error, and yet we are getting readability 3 (readable with difficulty) speech. However the rule of thumb I have developed experimentally still applies – you still need PER=0.1/BER=0.01 for “readability 5” speech.
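The 50% figure follows directly from the 6% BER and the 12 bit index length, assuming independent bit errors:

```python
ber = 0.06
nbits = 12
per = 1 - (1 - ber) ** nbits      # probability a 12 bit index has >= 1 bit error
print(f"PER = {per:.2f}")         # about 0.52 - roughly every 2nd index is hit
```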
There are some use cases where not using FEC might be useful. A rate 0.5 FEC system requires twice the RF bandwidth, and modem synchronisation is harder as each symbol has half the energy. FEC also introduces latency, as the codewords for a decent code are larger than the vocoder frame size. When you lose an FEC codeword, you tend to lose a large chunk of speech. Frame sync is slower too, as it happens at the FEC codeword rate, making recovery after a deep fade or initial PTT sync slower. So having a non-FEC alternative in our toolkit for low SNR digital speech is useful.
On a personal note I quite enjoyed this project. It was tractable, and I managed to get results in a reasonable amount of time without falling down too many R&D rabbit holes. It was also fun and rewarding to come to grips with at least some of the math in [3]. I am quite pleased that (after the usual fight with the concept of coding gain) I managed to reconcile FEC and non-FEC results on a single plot that roughly compares to the perceptual quality of speech.
Ideas for further work:
- Test across a larger number of samples to get a better feel for the effectiveness of these algorithms.
- The index optimisation could be applied to Codec 2 700C (and hence FreeDV 700C/D/E). This would however break compatibility.
- More work with the trellis scheme might be useful. In a general sense, this is a model that takes into account various probabilities, e.g. how likely is it that we received a certain codeword? We could also include source probability information, e.g. for a certain speaker (or globally across all speakers) some vectors will be more likely than others. The probability tables could be updated in real time: when the channel is not faded, we can trust that each index is probably correct (see the small sketch after this list).
- The “600” mode above is a prototype Codec 2 mode based on this work and [1]. We could develop that into a real world Codec 2/FreeDV mode and see how it goes over the air.
- The new crop of Neural Net vocoders use the same parameter set and VQ, so index optimisation/FEC trade offs may also be useful there (they may do this already). For example we could run FreeDV 2020 without any FEC, freeing up RF bandwidth for more pilot symbols so it handles HF channels better.
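As a rough illustration of the source probability idea in the trellis bullet above, a prior over VQ indexes can simply be added to the per-frame channel log-likelihoods before the Viterbi search. The index_counts adaptation scheme here is hypothetical, not something that exists in the codebase:

```python
import numpy as np

def loglik_with_prior(loglik, index_counts):
    """Fold a source prior into the trellis branch metrics.

    loglik       : (T, n) per-frame channel log-likelihood of each VQ index
    index_counts : (n,) counts of how often each index has been seen recently
                   (hypothetical adaptation, e.g. updated when the channel is good)
    """
    prior = (index_counts + 1) / np.sum(index_counts + 1)   # Laplace smoothing
    return loglik + np.log(prior)[None, :]
```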
Reading Further
[1] Subset Vector Quantiser
[2] Codec 2 at 450 bit/s
[3] Pseudo-Gray Coding, K. Zeger and A. Gersho, 1990
[4] Trellis Decoding for Codec 2
[5] PR – Vector Quantiser Index Optimisation
[6] PR – Trellis decoding of VQ
[7] RST Scale
[8] FreeDV Technology
Would it be possible to send a datastream with FEC that could also be decoded without the FEC bits, so that the RX can decide whether the black or the cyan line should be applied depending on the SNR, or switchable by the operator depending on what he prefers/yields the better quality at that moment?
Hi Diego – sure some FEC decoders can tell you if they have successfully corrected bits, or the channel Eb/No estimate can be used to select a certain scheme. User-controlled is also easy with an open source codec.
That’s what I love so much about open source/open development.
Everyone can change, comment, brainstorm, configure and also contribute and use/apply in any kind of way.
Really appreciate your efforts, thanks David