WaveNet and Codec 2

Yesterday my friend and fellow open source speech coder Jean-Marc Valin (of Speex and Opus fame) emailed me with some exciting news. W. Bastiaan Kleijn and friends have published a paper called “Wavenet based low rate speech coding”. Basically they take the bit stream of Codec 2 running at 2400 bit/s, and replace the Codec 2 decoder with the WaveNet deep learning generative model.
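For a rough sense of scale, here is a back-of-the-envelope sketch of the frame arithmetic. The 20 ms frame size is an assumption about the 2400 bit/s Codec 2 mode, not something stated in the paper:

```python
# Frame arithmetic for a 2400 bit/s bit stream.
# The 20 ms frame size is an assumed value for the Codec 2 2400 mode.
BIT_RATE = 2400   # bit/s
FRAME_MS = 20     # ms per frame (assumed)

bits_per_frame = BIT_RATE * FRAME_MS // 1000   # 48 bits of codec parameters
frames_per_second = 1000 // FRAME_MS           # 50 frames/s

print(bits_per_frame, frames_per_second)       # 48 50
```

So the generative decoder only sees a few dozen bits of parameters per frame, and has to synthesise everything else.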

What is amazing is the quality – it sounds as good as an 8000 bit/s wideband speech codec! They have generated wideband audio from the narrowband Codec 2 model parameters. Here are the samples – compare “Parametrics WaveNet” to Codec 2!

This is a game changer for low bit rate speech coding.

I’m also happy that Codec 2 has been useful for academic research (Yay open source), and that the MOS scores in the paper show it’s close to MELP at 2400 bit/s. Last year we discovered Codec 2 is better than MELP at 600 bit/s. Not bad for an open source codec written (more or less) by one person.

Now I need to do some reading on Deep Learning!

Reading Further

Wavenet based low rate speech coding
Wavenet Speech Samples
AMBE+2 and MELPe 600 Compared to Codec 2

16 thoughts on “WaveNet and Codec 2”

    1. Hi Walter,
      That’s an interesting sample and good on you for experimenting with speech coding! What sort of algorithm are you using?
      – David

      1. The codec I have been devising is more of a waveform codec, but one that takes advantage of speech-specific properties in the spectrum. It will pass music through, but it won’t sound pretty; on the other hand it does not suffer too much with noisy backgrounds, as you can hear from this sample (still at 1375 bit/s).

        Here is a snippet of SolderSmoke ep 127, with some music in the beginning, @ 1250 bit/s

        1. I am a newbie when it comes to deep learning networks (in fact this is my first try ever) but I wanted to see how well this could be made to work. So after some fiddling I made a cascaded network with 20 hidden layers and trained it on just this sample for about an hour (generally one would want to train it on many different speakers/samples, and for much longer). The neural network indeed manages to recover lost detail and readability from the encoded data @ 1375 bit/s. Presumably even lower rates could still give decent results. The drawback is that it requires some heavy duty training to work properly.

          *** Original

          *** Encoded/decoded with my codec @1375 bit/s

          *** Encoded at 1375 bit/s, decoded with the neural network
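This isn’t Walter’s actual network, but a minimal numpy sketch of the idea: a deep stack of fully connected layers mapping a coarsely quantised spectral frame to an enhanced one. The frame size, layer width and random weights are placeholders – a real system would train the weights on many speakers:

```python
import numpy as np

rng = np.random.default_rng(1)

DIM = 40     # spectral coefficients per frame (assumed)
DEPTH = 20   # hidden layers, as in the comment above
WIDTH = 64   # hidden layer width (assumed)

# Random weights stand in for trained ones (He-style scaling for ReLU).
layers = []
in_dim = DIM
for _ in range(DEPTH):
    W = rng.standard_normal((WIDTH, in_dim)) * np.sqrt(2.0 / in_dim)
    layers.append((W, np.zeros(WIDTH)))
    in_dim = WIDTH
w_out = rng.standard_normal((DIM, WIDTH)) * np.sqrt(2.0 / WIDTH)

def enhance(frame):
    """Map a decoded spectral frame to an 'enhanced' frame."""
    h = frame
    for W, b in layers:
        h = np.maximum(0.0, W @ h + b)   # ReLU hidden layers
    return w_out @ h                     # linear output layer

decoded = rng.standard_normal(DIM)       # stand-in for a decoded frame
enhanced = enhance(decoded)
```

The appeal of working on spectral frames rather than PCM samples is that one forward pass per frame (tens per second) is cheap enough to run in real time.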

          1. Wow that’s fantastic Walter, well done! That’s amazing quality at 1400 bit/s. Could you please share some code/details on how your system works?

            – David

          2. The encoder works with a lapped transform, which then sorts and reduces the spectral information down to about 1200-1400 bit/s.
            As for the neural network, it was just a quick test to see if, and how well, it could recover missing information about the spectral components. It was a very crude test, though. I only trained and tested it on sequences from these two speakers, so I would say it’s more a proof of concept, and it gave me some hints on what to try next and, hopefully, how to make it more general. So far it all runs in real time in Matlab, while, as I understand it, WaveNet requires some serious hardware to generate even a few seconds of sound (unless someone has come up with a better algorithm to run it?).
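For readers unfamiliar with lapped transforms: the MDCT is the classic example, and a small numpy sketch shows the property that makes it attractive for coding – 50% overlapping frames but only N coefficients per hop, with the time-domain aliasing cancelling on overlap-add. This is the generic textbook MDCT, not Walter’s codec:

```python
import numpy as np

def mdct(frame):
    """Forward MDCT: 2N windowed samples -> N coefficients."""
    N = len(frame) // 2
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ frame

def imdct(coeffs):
    """Inverse MDCT (2/N normalisation): N coefficients -> 2N samples."""
    N = len(coeffs)
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return (2.0 / N) * (basis.T @ coeffs)

N = 16                                                   # coefficients per hop
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # sine window
rng = np.random.default_rng(0)
x = rng.standard_normal(6 * N)

# Analysis/synthesis with 50% overlap: window, MDCT, IMDCT, window, add.
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):
    frame = x[start:start + 2 * N] * w
    y[start:start + 2 * N] += imdct(mdct(frame)) * w

# Interior samples reconstruct exactly: the aliasing cancels between frames.
print(np.allclose(x[N:-N], y[N:-N]))   # True
```

A speech coder would quantise the N coefficients per hop; the “sorting and reducing” is where the codec-specific work happens.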

          3. Thanks Walter. I’m reading up on Deep Learning and will try some similar tests with the baseline Codec 2 sinusoidal codec. I might try working in the spectral sample domain, rather than PCM samples like Wavenet.

            My understanding is that a GPU is required for training, but a CPU is enough for actual use. However, I’m not 100% sure that’s how the Wavenet/Codec 2 system works.

  1. I listened to the examples and it did sound pretty good.

    I was a little shocked at the Speex sample though. Very raspy. Something seems wrong there.

    MELP to me, sounded a bit worse than codec 2 in the examples given, yet scored higher on their chart. Hmm…

    1. Speex is a CELP-type waveform codec that is not designed for such low rates, so at 2.4 kbit/s it’s operating way outside its range.

    2. At 2.4 kb/s, Speex doesn’t have enough bits to do CELP, so it becomes a vocoder — and a pretty bad one. This was mostly meant as a way to code background noise or a fallback to be able to transmit *something*, rather than anything good. So it’s not surprising that codec2 sounds a lot better than Speex at 2.4 kb/s.

  2. Wow, this is amazing. Does this mean that the bit stream is decodable by either Codec2 or the Wavenet model?

      1. This is very exciting! As I understand it, the model is basically “filling in” the high-frequency voice characteristics to restore the wideband sound? (And doing a good job of it, not just “faking it”, as the speaker-identification listening test seems to suggest.) They didn’t include any packet loss, so that exact model wouldn’t be ideal for radio – but, presumably, a similar algorithm could be trained to “hide” artifacts due to packet loss, potentially improving speech intelligibility over HF radio, in addition to improving audio quality.

        1. Yes that’s right – apparently there is enough information in the Codec 2 bit stream (which was derived from an 8 kHz sample rate, 4 kHz wide speech signal) to regenerate the 4-8 kHz range with great quality. That’s one big take-away from this work.

          I agree with your ideas about using deep learning to cope with packet loss.

    1. I was kinda thinking the same thing.

      However, I believe the goal for now is best quality at comparable bitrates.

      At the same time, I don’t see why not. If the network can extract and reconstruct the missing parts, it should boost quality in one way or the other.
