Trellis Decoding for Codec 2

OK, so FreeDV 700 was released a few weeks ago and I’m working on some ideas to improve it. Especially those annoying R2D2 noises due to bit errors at low SNRs.

I’m trying some ideas to improve the speech quality without the use of Forward Error Correction (FEC).

Speech coding is the art of “what can I throw away”. Speech codecs remove a bunch of redundant information. As much as they can. Hopefully with what’s left you can still understand the reconstructed speech.

However there is still a bit of left-over redundancy. One sample of a model parameter can look a lot like the previous and next sample. If our codec quantisation was really clever, adjacent samples would look like noise. The previous and next samples would look nothing like the current one. They would be totally uncorrelated, and our codec bit rate would be minimised.

This leads to a couple of different approaches to the problem of sending coded speech over a channel with bit errors:

The first, conventional approach is to compress the speech as much as we can. This lowers the bit rate but makes the coded speech very susceptible to bit errors. One bit error might make a lot of speech sound bad. So we insert Forward Error Correction (FEC) bits, raising the overall bit rate (not so great), but protecting the delicate coded speech bits.

This is also a common approach for sending data over dodgy channels. For data, we cannot tolerate any bit errors, so we use FEC, which can correct every error (or die trying).

However speech is not like data. If we get a click or a pop in the decoded speech we don’t care much. As long as we can sorta make out what was said. Our “Brain FEC” will then work out what the message was.

Which leads us to another approach. If we leave a little redundancy in the coded speech, we can use that to help correct or at least smooth out the received speech. Remember that for speech, it doesn’t have to be perfect. Near enough is good enough. That can be exploited to get us gain over a system that uses FEC.

Turns out that in the Bit Error Rate (BER) ranges we are playing with (5-10%) it’s hard to get a good FEC code. Many of the short ones break – they introduce more errors than they correct. The really good ones are complex with large block sizes (1000s of bits) that introduce unacceptable delay. For example at 700 bit/s, a 7000 bit FEC codeword is 10 seconds of coded speech. Oops. Not exactly push to talk. And don’t get me started on the memory, MIPS, implementation complexity, and modem synchronisation issues.

These ideas are not new, and I have been influenced by some guys I know who have worked in this area (Philip and Wade if you’re out there). But not influenced enough to actually look up and read their work yet, lol.

The Model

So the idea is to exploit the fact that each codec model parameter changes fairly slowly. Another way of looking at this is the probability of a big change is low. Take a look at the “trellis” diagram below, drawn for a parameter that is represented by a 2 bit “codeword”:

Let’s say we know our current received codeword at time n is 00. We happen to know it’s fairly likely (50%) that the next received bits at time n+1 will be 00. A 11, however, is very unlikely (0%), so if we receive a 11 after a 00 there is very probably an error, which we can correct.
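To make that concrete, here is a toy transition probability matrix in Octave for a 2 bit parameter. The numbers are made up for illustration (they roughly match the 50% and 0% figures above); the real matrix comes from training data:

    % toy transition probabilities for a 2 bit parameter, states 00 01 10 11
    % tp(i,j) = prob of moving from state i-1 to state j-1 (illustrative numbers only)
    tp = [0.50 0.30 0.20 0.00;
          0.25 0.50 0.20 0.05;
          0.05 0.20 0.50 0.25;
          0.00 0.20 0.30 0.50];
    sum(tp,2)    % each row sums to 1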

The model I am using works like this:

  1. We examine three received codewords: the previous, current, and next.
  2. Given a received codeword we can work out the probability of each possible transmitted codeword. For example we might BPSK modulate the two bit codeword 00 as -1 -1. However when we add noise the receiver will see -1.5 -0.25. So the receiver can then say, well … it’s most likely -1 -1 was sent, but it also could have been a -1 1, and maybe the noise messed up the last bit.
  3. So we work out the probability of each sequence of three codewords, given the probability of jumping from one codeword to the next. For example here is one possible “path”, 00-11-00:

    total prob =
    (prob a 00 was sent at time n-1) AND
    (prob of a jump from 00 at time n-1 to 11 at time n) AND
    (prob a 11 was sent at time n) AND
    (prob of a jump from 11 at time n to 00 at time n+1) AND
    (prob a 00 was sent at time n+1)

  4. All possible paths of the three received values are examined, and the most likely one chosen.
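Here is a minimal Octave sketch of steps 2 to 4. It is not the actual trellis.m, just the idea boiled down: it assumes BPSK with Gaussian noise of a known variance, the toy tp matrix from above, and the -1.5 -0.25 example received values in the middle position:

    % minimal sketch of the 3 codeword maximum likelihood search (not trellis.m)
    % assumes the 4x4 toy tp matrix above and a known noise variance
    nbits = 2; nstates = 2^nbits; var_n = 0.5;
    rx = [-1.2 -0.9; -1.5 -0.25; -0.8 -1.1];   % received soft symbols at times n-1, n, n+1

    % map each codeword 0..3 to BPSK symbols, e.g. 00 -> -1 -1
    tx = zeros(nstates, nbits);
    for c = 0:nstates-1
      bits = [bitand(bitshift(c,-1),1) bitand(c,1)];
      tx(c+1,:) = 2*bits - 1;
    end

    % prob of each transmitted codeword at each time, given the received symbols
    pcw = zeros(3, nstates);
    for t = 1:3
      for c = 1:nstates
        pcw(t,c) = prod(exp(-(rx(t,:) - tx(c,:)).^2/(2*var_n)));
      end
      pcw(t,:) = pcw(t,:)/sum(pcw(t,:));       % normalise
    end

    % exhaustive search over all paths of three codewords
    best_p = 0; best_path = [1 1 1];
    for a = 1:nstates
      for b = 1:nstates
        for c = 1:nstates
          p = pcw(1,a)*tp(a,b)*pcw(2,b)*tp(b,c)*pcw(3,c);
          if p > best_p
            best_p = p; best_path = [a b c];
          end
        end
      end
    end
    printf("decoded middle codeword: %s\n", dec2bin(best_path(2)-1, nbits));

With these made-up numbers the middle codeword decodes as 00.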

The transition probabilities are pre-computed using a training database of coded speech, although it would also be possible to measure them on the fly, training up to each speaker.
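Training boils down to counting. A sketch, where train is assumed to be a vector of quantised parameter values (0 to 15 for a 4 bit parameter) decoded from the training database:

    % count transitions in a training sequence of codeword indices, then
    % normalise each row to get the transition probabilities
    % train: vector of quantised parameter values from the training database (assumed)
    nstates = 16;                              % e.g. a 4 bit LSP parameter
    tp = zeros(nstates, nstates);
    for n = 1:length(train)-1
      tp(train(n)+1, train(n+1)+1) += 1;
    end
    tp = tp ./ (sum(tp,2) + 1E-12);            % rows sum to (nearly) 1, avoids divide by zero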

I think this technique is called maximum likelihood decoding.

Demo and Walk through

To test this idea I wrote a GNU Octave simulation called trellis.m.

Here is a test run for a single trellis decode. The internal states are dumped for your viewing pleasure. You can see the probability calculations for each received codeword, the transition probabilities for each state, and the exhaustive search of all possible paths through the 3 received codewords. At the end it gets the right answer: the middle codeword is decoded as a 00.

For convenience the probability calculations are done in the log domain, so rather than multiplies we can use adds. So a large negative “probability” means really unlikely, a positive one likely.
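For example, the metric for the 00-11-00 path above becomes a sum of logs rather than a product. A fragment, assuming the pcw and tp arrays from the earlier sketch (a small floor keeps log(0) away from impossible transitions):

    % path metric in the log domain: multiplies become adds
    logtp  = log(tp  + 1E-12);
    logpcw = log(pcw + 1E-12);
    % path 00-11-00, i.e. states 1-4-1 in Octave's 1-based indexing
    log_p  = logpcw(1,1) + logtp(1,4) + logpcw(2,4) + logtp(4,1) + logpcw(3,1);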

Here is a plot of 10 seconds of a 4 bit LSP parameter:

You can see some segments where it is relatively stable, and some others where it’s bouncing around. This is a mesh plot of the transition probabilities, generated from a small training database:

It’s pretty close to an “eye” (identity) matrix. For example, if you are in state 10, it’s fairly likely the next state will be close by, and less likely you will jump to a remote state like 0 or 15.

Here is a test run using data from several seconds of coded speech:
octave:130> trellis
loading training database and generating tp .... done
loading test database .... done
Eb/No: 3.01 dB nerrors 28 29 BER: 0.03 0.03 std dev: 0.69 1.76

We are decoding using both trellis based decoding and simple hard decision decoding. Note how the number of errors and the BER are about the same? However the std dev (distance) between the transmitted and decoded codewords is much better for trellis based decoding. This plot shows the decoder errors over 10 seconds of a 4 bit parameter:

See how the trellis decoding produces smaller errors?

Not all bit errors are created equal. The trellis based decoding favours small errors that have a smaller perceptual effect (we can’t hear them). Simple hard decision decoding has a random distribution of errors. Sometimes you get the Most Significant Bit (MSB) of the binary codeword flipped, which is bad news. You can see this effect above: with a 4 bit codeword, an MSB error means a jump of +/- 8. These large errors are far less likely with trellis decoding.
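A quick Octave check of that, for a 4 bit natural binary codeword:

    % flipping the MSB of a 4 bit natural binary codeword moves the value by 8,
    % flipping the LSB moves it by only 1
    cw = 5;            % binary 0101
    bitxor(cw, 8)      % MSB flipped -> 13, an error of 8
    bitxor(cw, 1)      % LSB flipped -> 4, an error of 1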

Listen

Here are some samples that compare trellis based decoding to simple hard decision decoding, when applied to Codec 2 at 700 bit/s on an AWGN channel using PSK. Only the 6 LSP parameters are tested (short term spectrum); no errors or correction are applied to the excitation parameters (voicing, pitch, energy).

Eb/No (dB)   BER    Trellis   Simple (hard decision)
big          0.00   Listen    Listen
3.0          0.02   Listen    Listen
0.0          0.08   Listen    Listen

At 3 dB, the trellis based decoding removes most of the effects of bit errors, and it sounds similar to the no error reference. Compared to simple decoding, the bloops and washing machine noises have gone away. At 0 dB Eb/No, the speech quality is improved, with some exceptions. Fast changes, like the “W” in double-you and the “B” in Bruce, become indistinct. This is because when the channel noise is high, the probability model favours slow changes in the parameters.

Still – getting any sort of speech at 8% bit error rates with no FEC is pretty cool.

Further Work

These techniques could also be applied to FreeDV 1600, improving the speech quality with no additional overhead. Further work is required to extend these ideas to all the codec parameters, such as pitch, energy, and voicing.

I need to train the transition probabilities with a larger database, or make it train in real time using off air data.

We could include other information in the model, like the relationship of adjacent LSPs, or how energy and pitch change slowly in strongly voiced speech.

Now 10% BER is an interesting, rarely explored area. The data guys start to sweat above 1E-6, and assume everyone else does. At 10% BER FEC codes don’t work well; you need a really long block size or a low FEC rate. Modems struggle due to synchronisation issues. However at 10% the Eb/No versus BER curves start to get flat, so a few dB either way doesn’t change the BER much. This suggests small changes in intelligibility (not much of a threshold effect). Like analog.

Summary

For speech, we don’t need to correct all errors; we just need to make it sound like they are corrected. By leaving some residual redundancy in the coded speech parameters we can use probability models to correct errors in the decoded speech with no FEC overhead.

This work is another example of experimental work we can do with an open source codec. It combines knowledge of the channel, the demodulator and the codec parameters to produce a remarkable result – improved performance with no FEC.

This work is in its early stages. But the gains all add up. A few more dB here and there.

References

  1. I found this description of Soft Decision Viterbi decoding useful.
  2. Last year I did some related work on natural versus gray coding.

10 thoughts on “Trellis Decoding for Codec 2”

  1. I’m still playing with QAM, and if I leave my perfect test program and use an actual radio, it of course fails miserably.

    Which brings up the need for AGC. So, I’ve been translating the liquid-dsp agc algorithm to Java and am now going to implement that.

    This is probably more important to QAM than PSK, but I wonder if a fairly dynamic agc could also help QPSK?

    I would think that with just four quadrants, the amplitude wouldn’t matter, but maybe on deep fades, and maybe an agc per carrier might help.

    Thinking out loud… 🙂

    1. Hi Steve,

      QPSK has the neat property that it is level independent – no need for AGC at all. That makes set up easy and good performance easy to get, and it makes an SDR rx for FreeDV easier than one for SSB. Also you avoid issues with static crashes messing with your rx audio levels.

      For QAM, the right way to handle varying signal levels is to use pilot symbols. It has to be “feed forward”, no messy and slow feedback loops. The cohpsk modem would be a useful starting point for this. It already corrects phase based on the pilot symbols, and the amplitude information is there if needed.

      – David

      1. Ah, well I think you saved me a step in my experiments. I do need a pilot(s) anyway to equalize. My original idea was a preamble, but that sounds so old school these days. Thanks, and interesting reading about the trellis decoding experiments!

  2. One possibility for additional states in the trellis, without the exponential blowup of merging two frames to consider jointly, is to add hyper-states:

    E.g. for N hyper-states you multiply the set of states by N, one set for each hyper-state. Then you take your training data and randomly assign every frame a hyper-state, learn the probabilities from the training data, use the learned probabilities to find the most likely hyper-state for each frame, and iterate until convergence. The training should be relatively fast because you’re holding the actual codec model parameters constant, and only performing dynamic programming over the hyper-states.

    You may get a better model starting from an initial classification (e.g. run a speech recognizer and attach phonemes to the frames and hash the result into your hyperstates), but you don’t have to do that just to increase the size of the model, the random classification should get you something quite useful.

    Other ways to add states would be to add states for modem/channel parameters, e.g. tuning error, ISI under various equalizer conditions. But training for this can be harder, as you have to simulate the channel.

    I think this could be pretty powerful for codec2 - for robustness reasons the codec itself is limited in its ability to exploit interframe redundancy, but a fancy decoder suffers no such limitation. For low bit rates some pretty impressive stuff can be done: KA9Q’s ISEE decoder (http://www.ka9q.net/isee.html) considers 2^24 states per step and still manages to decode at 225 bits per second on a laptop. As I’ve opined before, considering what people spend on radios and antennas computing power is cheap. A 24 core machine could apply a _lot_ of computing power to fancier decoding… but still leave a signal that basic devices can pick up and work with too.

    1. Hi Greg,

      I’m not quite following you in your 2nd para above – could you please give me a simple example?

      Thanks,

      David

      1. Communication is hard, and I’m not sure where I lost you. So I’ll walk through this in a way that might be fairly redundant to your understanding.

        Let’s imagine you are optimizing an element in the bitstream with three possible states. Your trellis looks like this, one column per frame:

        0 0 0 0 …
        1 1 1 1 …
        2 2 2 2 …

        Imagine a full mesh of edges between each timestep. The edges are assigned probabilities based on training audio. The nodes get costs (also probabilities) based on the modem soft decode. We then ask the Viterbi algorithm to find the path (selection of value for each frame) with the highest total probability.
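        A rough Octave sketch of that plain search, with made-up node and edge probabilities just to pin down the shape of it, before getting to the hyper-states:

          % Viterbi over a 3 state trellis: node probabilities from the modem soft
          % decode, edge probabilities from training (all numbers made up here)
          nstates = 3; T = 4;
          node_p = [0.7 0.6 0.2 0.8;    % node_p(i,t): prob of state i-1 at frame t
                    0.2 0.3 0.5 0.1;
                    0.1 0.1 0.3 0.1];
          edge_p = [0.8 0.1 0.1;        % edge_p(i,j): prob of state i-1 -> state j-1
                    0.1 0.8 0.1;
                    0.1 0.1 0.8];

          % work in the log domain so we add rather than multiply
          delta = log(node_p(:,1));     % best log prob ending in each state at frame 1
          psi   = zeros(nstates, T);    % back pointers
          for t = 2:T
            new_delta = zeros(nstates, 1);
            for j = 1:nstates
              [m, i] = max(delta + log(edge_p(:,j)));
              psi(j,t) = i;
              new_delta(j) = m + log(node_p(j,t));
            end
            delta = new_delta;
          end

          % trace back the most likely path
          path = zeros(1,T);
          [tmp, path(T)] = max(delta);
          for t = T-1:-1:1
            path(t) = psi(path(t+1), t+1);
          end
          path - 1                      % most likely state (0/1/2) for each frame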

        What I suggest with the hyper-states changes it to this, if we have just two hyper-states:

        0a 0a 0a 0a …
        1a 1a 1a 1a …
        2a 2a 2a 2a …
        0b 0b 0b 0b …
        1b 1b 1b 1b …
        2b 2b 2b 2b …

        In this case each frame has its 0/1/2 decision like before, but there is an additional bit: is this an “A frame” or a “B frame”? What does A/B mean? It’s not a bit in the codec bitstream, you can think of it as a blind classification. The only effect this classification has is to change the transition probabilities. E.g. in mode A perhaps the signal is much more stationary (0 is more likely to stay 0) while in mode B it is less so, and there is some low probability of switching from A to B but a high one from B to A, or what have you.

        Where do these additional modes come from? They can be learned by training: (failing a better way of usefully classifying frames) you could randomly assign frames to A/B, then learn the 6 way table given the random classification. Then use Viterbi to reclassify the training data (e.g. replace the random assignments with the ‘most likely’ ones given the first pass table), and iterate.

        For some of the early Daala video codec work our trained intra-predictors worked somewhat similarly to this and I found they did surprisingly well with random initialization (the trained intra-predictors had other issues, e.g. performance costs, poor filter conditioning), e.g. it successfully invented filters that diagonally extended neighboring blocks (all in the DCT domain). http://people.xiph.org/~xiphmont/demo/daala/demo2.shtml

        For this application with enough additional state, I wouldn’t be surprised if it managed to e.g. classify frames into things like an [ss] sound vs an [hh] sound and learned the probability that an [ss] frame is followed by an [hh] frame.

        I’m not sure if a bigger trained model would work better than e.g. a scheme for updating the weights in realtime… though a bigger trained model would have the advantage of avoiding ending up in a degenerate state that gave bad decodes.

        Did this help make my comments more clear?

        1. Thanks Greg, yes I understand it much better now. Sometimes these ideas take a few questions to tease out and the dialog can be really useful to others when done publicly like this.

          For sure – I can see several possibilities for hyperstates: V/UV, silence/non-silence, good channel/bad channel.

          Also ways to extend the number of states using other bits, e.g. Jean-Marc discovered the delta-pitch and delta-energy are highly correlated.

          On a related note I’m currently trying a VQ of the LSPs to help with FreeDV 700 speech quality. Then I will try trellis decoding of the LSP VQ index bits, or perhaps the LSP trajectories themselves. The latter would require mapping of the LSP trajectory PDF to a finite number of states (I think).

          Normally when one uses more powerful compression (like VQ), the sensitivity to bit errors goes up and FEC is required.

          This idea (if it works!) will let me use the compression/quality power of the VQ, but maintain “error correction” using trellis techniques, exploiting the correlation that must exist in the decoded LSPs.

          Not sure if it will work, will take a few weeks to set up a simulation.

          Also I imagine it’s possible to train the transition probabilities “over the air”, e.g. in high SNR segments.

          Cheers,

          David

          1. Another thing on this that occurred to me is that online training might be a way to use a large block, very high rate FEC, … you might not want a 10 second decode latency, but if your code is not that long it will not be long enough to help with some fading patterns. The training process in the background could use a higher latency decode. Though it’s somewhat dubious to me if it would be worth the power budget on bits that aren’t very useful during the primary decode. On the other hand, if a signal was eaten by a fade, the best you can do with a low latency decoder is ask the other side to retransmit. But, if your receiver indicated that a higher latency decode was more successful, you could hit an instant replay button - this would be less disruptive than asking the other side to retransmit. (The replay would also help with making sure the user’s wetware decoder was fully primed and making the best of the signal - and the decoder can catch back up to realtime after a replay by playing faster and/or skipping over silence.)

            Even if not, as you say, instead of explicit FEC it could use residual redundancy in the signal. The tricky part is that, unlike a well designed FEC, making use of the redundancy may be exponentially computationally expensive. (E.g. it’s the special sparseness and absence of cycles that make an LDPC code largely decodable by belief propagation.)

          2. It would be interesting to study (at a rigorous, information theory level) the difference between (i) ideal source codec + FEC and (ii) non ideal source codec + non-perfect decoding based on residual correlation. Would make a good PhD topic.

            The concept of a non-zero bit error rate being useful after decoding is not common in communications theory. We normally toss packets away that fail a checksum. Latency is not usually considered, nor are “softer” received error metrics like “it’s OK to be close as we can’t really hear the difference”, i.e. minimise the MSE in the encoded parameter.

            ATM it’s just assumed the “it’s digital, we must correct every bit error” approach (i) is superior.

            Re Fades – the big problem is that they are longer than acceptable latency for PTT operation. However a long delay is better than no-speaky-at-all, so sure, we can go for really long codes if we want (say 30s).

            Or we can use really wide bandwidth (100 kHz instead of 2 kHz). I’m using frequency diversity on the FreeDV 700 mode and it’s worth > 4 dB.

            Another approach is to lower the bit rate (crappier speech quality) and hence raise Eb/No. The channel is not opaque during a fade, just attenuated.

            Yet another is non-real time, “buffer and dump”. Talk for 10s, transmit at half real time for 20s.

            Or we could break away from COTS simplex HF radios that have 200 ms tx-rx transition times. What if our radio could support ACK packets every 10 ms, so the cost of ARQ was much lower and it worked for real time speech?

            – David

  3. (Bad habits from replying in places where I can edit:) The use of the additional hyperstates can also help limit the blowup from doing your decode across multiple parameters. E.g. with a plain trellis you’d consider N_lsp_values * N_pitch * N_voiceflag * … where if their interaction is bottlenecked through the hyperstates you’d consider N_lsp_values * N_hyperstates + N_pitch * N_hyperstates + N_voiceflag * N_hyperstates. Which could be considerably fewer states even if there were many hyperstates.

    I suppose we’ll know that this direction of development is tapped out when the result of the far end having a power failure is that the receiver finishes their message without them sending it. 🙂
