LPCNet meets Codec 2

The previous post described my attempts to come up to speed with NN based speech synthesis, with the kind help of Jean-Marc Valin and his LPCNet system.

As an exercise, I have adapted LPCNet to use Codec 2 features, and have managed to synthesise high quality speech at a sample rate of 8kHz. Here are the output speech samples:

Sample        original LPCNet   Codec 2
cq_ref        Listen            Listen
hts1a         Listen            Listen
hts2a         Listen            Listen
mmt1          Listen            Listen
morig         Listen            Listen
speech_orig   Listen            Listen

I’m happy with all of the samples except cq_ref. That sample has a lot of low frequency energy (like the pitch fundamental) which may not have been well represented in the training database. mmt1 has some artefacts, but this system already does better than any other low rate codec on this sample.

This is not quite a quantised speech codec, as I used unquantised Codec 2 parameters (10 Line Spectral Pairs, pitch, energy, and a binary voicing flag). However, it does show how LPCNet (and indeed NN synthesis in general) can be trained to use different sets of input features, and the system I have built is close to an open source version of the Codec 2/NN system presented by Kleijn et al.
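
To make that feature set concrete, here is a minimal sketch (in Python/NumPy) of how a per-frame feature vector might be assembled from those Codec 2 parameters before being fed to the network. The ordering, scaling, and function name are illustrative assumptions, not the layout my scripts actually use.

    import numpy as np

    def assemble_features(lsps, pitch_hz, energy, voiced):
        """Pack one frame of (unquantised) Codec 2 parameters into a single
        feature vector for the NN.  Ordering and scaling are guesses."""
        assert len(lsps) == 10                 # 10 Line Spectral Pairs
        feat = np.zeros(13, dtype=np.float32)
        feat[:10] = lsps                       # LSP frequencies
        feat[10] = np.log(pitch_hz)            # pitch on a log scale (assumption)
        feat[11] = np.log(energy + 1e-6)       # log energy
        feat[12] = 1.0 if voiced else 0.0      # binary voicing flag
        return feat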

Why 8kHz rather than the higher quality 16 kHz? Well, LPCNet requires a set of Linear Prediction Coefficients (LPCs). The LPCs dumped by Codec 2 are sampled at 8kHz. It’s possible, but not straightforward, to resample the LPC spectra to 16 kHz, but I chose to avoid that step for now.
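
For the curious, here is one possible (untested) way that resampling step could be approached: evaluate the 8 kHz LPC envelope, extend it with some assumed high band energy, and refit a higher order model at 16 kHz. This is a sketch of the idea only, not code from my repository, and the flat extension of the 4 to 8 kHz band is a guess.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_8k_to_16k(a8, order16=16, nfft=512):
        # a8: 8 kHz LPCs given as the coefficients after the leading 1,
        #     i.e. A(z) = 1 + a8[0]z^-1 + ... (assumed sign convention)
        # Evaluate the 8 kHz LPC envelope |1/A(f)|^2 over 0..4 kHz
        H8 = 1.0 / np.abs(np.fft.rfft(np.r_[1.0, a8], nfft)) ** 2
        # Extend to 0..8 kHz with a low energy shelf (a crude guess; a real
        # implementation would need a smarter high band extension)
        H16 = np.r_[H8, np.full(len(H8) - 1, H8[-1] * 1e-2)]
        # Power spectrum -> autocorrelation (Wiener-Khinchin), then refit
        # a higher order model by solving the normal equations
        r = np.fft.irfft(H16)[:order16 + 1]
        a16 = solve_toeplitz(r[:order16], -r[1:order16 + 1])
        return a16  # 16 kHz LPCs, same sign convention as the input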

Training

My initial attempts led to good quality speech using samples from within the training database, but poor quality on speech samples (like the venerable hts1a) from outside the training database. In Machine Learning land, this suggests “not enough training data”. So I dug up an old TIMIT speech sample database, and did a bunch of filtering on my speech samples to simulate what I have seen from microphones in my Codec 2 adventures. It’s all described in gory detail here (Training Tips section). Then, much to my surprise, it worked! Clean, good quality speech from all sorts of samples.
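
To give a flavour of that filtering, here is a toy augmentation step that applies a random spectral tilt to a training sample. The filter form and coefficient range are purely illustrative; the actual processing is described in the Training Tips notes linked above.

    import numpy as np
    from scipy.signal import lfilter

    def random_mic_filter(x, rng=np.random.default_rng()):
        """Apply a random first-order spectral tilt to a speech sample,
        standing in for microphone/filtering variation.  The coefficient
        range here is an illustrative guess."""
        alpha = rng.uniform(-0.4, 0.4)           # random high- or low-frequency emphasis
        return lfilter([1.0, -alpha], [1.0], x)  # simple first-order FIR filter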

Further Work

  • Add code to generate 16 kHz LPCs from 8 kHz LPCs and try for 16 kHz synthesised speech.
  • Use quantised Codec 2 parameters from, say, Codec 2 2400 or 1300 and see how it sounds.
  • Help Jean-Marc convert LPCNet to C and get it running in real time on commodity hardware.
  • Make a real world, over the air contact using NN based speech synthesis and FreeDV.
  • A computationally large part of LPCNet (and indeed of any *Net speech synthesis system) is dedicated to handling periodic pitch information. The harmonic sinusoidal model used in Codec 2 can remove this information and hence much of the CPU load. So a dramatic further reduction in the number of weights (and hence CPU load) is possible, although this may result in some quality reduction. Another way of looking at this (as highlighted by Jean-Marc’s LPCNet paper) is “how do we model the excitation” in source/filter type speech systems.
  • The Kleijn et al paper had the remarkable result that we can synthesise high quality speech from low bit rate Codec 2 features. What is the trade-off between the bit rate of the features and the speech quality? How coarsely can we quantise the speech features and still get high quality speech? How much of the quality is due to the NN, and how much to the speech features?

Reading Further

Jean-Marc’s blog post on LPCNet, including links to LPCNet source code and his ICASSP 2019 paper.
WaveNet and Codec 2
Source Code for my Codec 2 version of LPCNet

5 thoughts on “LPCNet meets Codec 2”

  1. On further work, also: how should the codec2 features change, given the NN reconstruction?

    For example, if you’re doing wideband reconstruction for unvoiced frames, there should probably be at least 1 bit to distinguish Ffff and Ssss sounds, which I think aren’t especially predictable except via a language model (and aren’t even really distinguishable in narrowband). But more generally I expect the codec signals would need to be somewhat different: fewer bits on things the NN can predict, more bits on things the NN can’t predict.

  2. I understand your question as “How can we make a better bit-stream than codec2 for neural synthesis?” (hopefully that’s what you meant to ask). It’s actually an open question and there are many ways to approach it. One of them would be to do end-to-end training with the entire codec and have the network learn a few binary features that you can even corrupt with simulated transmission noise. Approaches like that would likely give you near optimal performance, but would have the drawback that your format is now defined only by a set of weights that you can no longer change.

    Alternatively, you can just develop a set of wideband features similar to the existing codec2 features but optimized for neural net output. You can still do the NN training with random noise so that it learns that bit errors and lost packets happen, but you have a little less flexibility than with the first option. One advantage of this is that it’s a bit easier to make adjustments to your codec (though you need to make sure your decoder will handle them correctly), and you have the option of using a less complex “traditional” decoder.
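
    A toy sketch of the “corrupt with simulated transmission noise” idea, purely illustrative (the function and the BER value are made up, not LPCNet code): during training, the learned binary features would pass through something like this before reaching the decoder network.

        import numpy as np

        def corrupt_bits(bits, ber=0.01, rng=np.random.default_rng()):
            """Flip each binary feature with probability `ber`, simulating
            channel bit errors during training.  The BER value is just an example."""
            flips = rng.random(bits.shape) < ber
            return np.logical_xor(bits.astype(bool), flips).astype(bits.dtype)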

  3. Nice to see you made some progress using NN.
    Hard to say how much the sound would degrade if you reduce the bit rate, but I would guess that the 1300 rate should still work pretty well.

  4. Hi David,
    I wonder if you could explain to a non-specialist how this method of transmitting information fits within the Shannon channel capacity, and any mathematical relationship with the Shannon-Hartley theorem. It seems to me at first glance that in order to reconstruct a voice faithfully to such a level of quality, one would need to store in the neural network all possible voices and utterances (I believe it might be called training?). But I’m not an expert in this field and may be missing important information.

    1. Hi Adrian,

      Shannon-Hartley deals with how many bit/s you can get through a channel with noise. At this stage, we haven’t looked into channels with noise. This work is more about information theory: given a certain signal, how much can we compress it to a low number of bits/s, then synthesise it, and still make it sound OK? The Kleijn paper referenced above does talk about that.

      Yes, the training learns common traits of human speech signals and uses them as a model for synthesis. Turns out we humans have a lot in common.
