As an exercise, I have adapted LPCNet to use Codec 2 features, and have managed to synthesise high quality speech at a sample rate of 8 kHz. Here are the output speech samples:
| Sample | Original | LPCNet Codec 2 |
I’m happy with all of the samples except cq_ref. That sample has a lot of low-frequency energy (such as the pitch fundamental), which may not have been well represented in the training database. mmt1 has some artefacts, but this system already does better than any other low rate codec on this sample.
This is not quite a quantised speech codec, as I used unquantised Codec 2 parameters (10 Line Spectral Pairs, pitch, energy, and a binary voicing flag). However, it does show how LPCNet (and indeed NN synthesis in general) can be trained to use different sets of input features, and the system I have built is close to an open source version of the Codec 2/NN system presented by Kleijn et al.
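To make the feature set concrete, here is a minimal sketch of how one frame of those parameters could be packed into a single input vector for a network. The `Codec2Frame` class, its field names, and the ordering of the vector are hypothetical, chosen only for illustration; they are not taken from the Codec 2 or LPCNet sources.

```python
from dataclasses import dataclass
from typing import List

NUM_LSPS = 10  # the post uses 10 Line Spectral Pairs per frame


@dataclass
class Codec2Frame:
    """Hypothetical container for one frame of unquantised features."""
    lsps: List[float]  # 10 Line Spectral Pair values
    pitch: float       # pitch estimate (units are an assumption here)
    energy: float      # frame energy
    voiced: bool       # binary voicing flag

    def to_vector(self) -> List[float]:
        """Flatten the frame into 13 floats: 10 LSPs, pitch, energy, voicing."""
        if len(self.lsps) != NUM_LSPS:
            raise ValueError(f"expected {NUM_LSPS} LSPs, got {len(self.lsps)}")
        voicing = 1.0 if self.voiced else 0.0
        return list(self.lsps) + [self.pitch, self.energy, voicing]


if __name__ == "__main__":
    frame = Codec2Frame([0.1 * i for i in range(1, 11)], 60.0, 1.5, True)
    print(len(frame.to_vector()))
```

Whatever the actual layout, the point is that the network only sees a small fixed-size vector per frame, so swapping one feature set for another mostly means retraining with a different input.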
Why 8 kHz rather than the higher quality 16 kHz? Well, LPCNet requires a set of Linear Prediction Coefficients (LPCs). The LPCs dumped by Codec 2 are sampled at 8 kHz. It’s possible, but not straightforward, to resample the LPC spectra at 16 kHz, but I chose to avoid that step for now.
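A small sketch of why the sample rate matters: LPCs describe a filter in terms of normalised frequency, so the same coefficients mean a different spectrum at a different sample rate. The function below evaluates the magnitude of the synthesis filter 1/A(z) at a given frequency. The name `lpc_magnitude` is hypothetical, and I am assuming the convention A(z) = a[0] + a[1]z^-1 + ... with a[0] = 1.

```python
import cmath
import math


def lpc_magnitude(a, freq_hz, sample_rate_hz):
    """Return |1/A(e^{jw})| for A(z) = a[0] + a[1] z^-1 + ... (assumed convention)."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz  # normalised frequency in radians
    A = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
    return 1.0 / abs(A)


if __name__ == "__main__":
    a = [1.0, -0.9]  # toy first-order filter, not real speech LPCs
    # The same coefficients give the same value at 1 kHz for an 8 kHz rate
    # and at 2 kHz for a 16 kHz rate: the envelope is stretched, not extended.
    print(lpc_magnitude(a, 1000.0, 8000.0))
    print(lpc_magnitude(a, 2000.0, 16000.0))
```

So simply reusing 8 kHz LPCs at 16 kHz would stretch the spectral envelope to twice the frequency, which is why a proper resampling of the LPC spectrum would be needed.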
My initial attempts led to good quality speech using samples from within the training database, but poor quality on speech samples (like the venerable hts1a) from outside the training database. In Machine Learning land, this suggests “not enough training data”. So I dug up an old TIMIT speech sample database, and did a bunch of filtering on my speech samples to simulate what I have seen from microphones in my Codec 2 adventures. It’s all described in gory detail here (Training Tips section). Then, much to my surprise, it worked! Clean, good quality speech from all sorts of samples.
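As a rough illustration of the kind of filtering involved, here is a first-order high-pass filter that removes low-frequency energy, the way a cheap microphone or sound card might. This is only a sketch: the choice of a simple RC-style high-pass, the `high_pass` name, and the 300 Hz cutoff are my assumptions for the example, not the actual filters used for training.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC-style high-pass filter, starting from a zero state."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        # y[n] = alpha * (y[n-1] + x[n] - x[n-1])
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


if __name__ == "__main__":
    # A constant (DC) input should decay towards zero at the output.
    dc = [1.0] * 200
    filtered = high_pass(dc, cutoff_hz=300.0, sample_rate_hz=8000.0)
    print(filtered[0], filtered[-1])
```

Running each training sample through a few variations like this gives the network examples of the same speech with different spectral colouring.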
- Add code to generate 16 kHz LPCs from 8 kHz LPCs and try for 16 kHz synthesised speech
- Use quantised Codec 2 parameters from, say, Codec 2 2400 or 1300, and see how they sound.
- Help Jean-Marc convert LPCNet to C and get it running in real time on commodity hardware.
- Make a real world, over the air contact using NN based speech synthesis and FreeDV.
- A computationally large part of LPCNet (and indeed any *Net speech synthesis system) is dedicated to handling periodic pitch information. The harmonic sinusoidal model used in Codec 2 can remove this information and hence much of the CPU load. So a dramatic further reduction in the number of weights (and hence CPU load) is possible, although this may result in some quality reduction. Another way of looking at this (as highlighted by Jean-Marc’s LPCNet paper) is “how do we model the excitation” in source/filter type speech systems.
- The Kleijn et al. paper had the remarkable result that we can synthesise high quality speech from low bit rate Codec 2 features. What is the quality trade-off between the bit rate of the features and the speech quality? How coarsely can we quantise the speech features and still get high quality speech? How much of the quality is due to the NN, and how much to the speech features?
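One simple way to start exploring the "how coarsely can we quantise" question is to pass a feature through a uniform scalar quantiser and vary the number of bits. The sketch below is a generic uniform quantiser, not one of Codec 2's actual quantisers; the `quantise` name and the example ranges are assumptions for illustration.

```python
def quantise(value, lo, hi, bits):
    """Uniform scalar quantiser: return (index, reconstructed value)."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    levels = 1 << bits                   # number of quantiser levels
    step = (hi - lo) / (levels - 1)      # spacing between adjacent levels
    clipped = min(max(value, lo), hi)    # keep the value inside [lo, hi]
    index = int(round((clipped - lo) / step))
    return index, lo + index * step


if __name__ == "__main__":
    # Fewer bits means a larger step and a larger worst-case error.
    for bits in (2, 4, 6):
        index, q = quantise(0.3, 0.0, 1.0, bits)
        print(bits, index, abs(0.3 - q))
```

Each bit removed doubles the step size (roughly), so feeding the synthesiser features quantised at different resolutions is one way to hear where the quality starts to fall away.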