I’ve been working with Neural Net (NN) speech synthesis using LPCNet.
My interest is digital voice over HF radio. To get a NN codec “on the air” I need a fully quantised version at 2000 bit/s or below. The possibility of 8kHz audio over HF radio is intriguing, so I decided to experiment with quantising the LPCNet features. These consist of 18 spectral energy samples, pitch, and the pitch gain which is effectively a measure of voicing.
So I have built a Vector Quantiser (VQ) for the DCT-ed 18 log-magnitude samples. LPCNet updates these every 10ms, which is a bit too fast for my target bit rate. So I decimate to say 30ms, then use linear interpolation to reconstruct the 10ms frames at the decoder. The spectrum changes slowly (most of the time), so I quantise the difference between frames to save a few bits.
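As a rough illustration of the decimation and interpolation step, here is a minimal Python sketch. The function names and the plain-list frame representation are assumptions made for this example, not the code in the LPCNet branch. It keeps every third 10ms frame (a 30ms update rate) and rebuilds the missing frames by linear interpolation, holding the last kept frame at the end of the sequence.

```python
def decimate(frames, dec):
    """Keep every dec-th frame (dec=3 turns 10 ms updates into 30 ms updates)."""
    return frames[::dec]


def interpolate(kept, dec, n_frames):
    """Rebuild n_frames frames by linear interpolation between the kept frames."""
    out = []
    for i in range(n_frames):
        left = min(i // dec, len(kept) - 1)
        right = min(left + 1, len(kept) - 1)
        # Interpolation weight; zero when there is no later frame to move towards.
        w = (i % dec) / dec if right != left else 0.0
        out.append([(1.0 - w) * a + w * b for a, b in zip(kept[left], kept[right])])
    return out


# Hypothetical 2-element "feature vectors" that ramp linearly from frame to frame.
frames = [[float(i), 2.0 * i] for i in range(7)]
kept = decimate(frames, 3)                 # frames 0, 3 and 6 survive
rebuilt = interpolate(kept, 3, len(frames))
```

A linear ramp like this one is reconstructed exactly. Real speech features are not linear between kept frames, which is where the quality loss from decimation comes from.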
I’ve developed a script that generates a bunch of samples, plots various statistics, and builds an HTML page to summarise the results. Here is the current page, including samples for the fully quantised prototype codec at three bit rates between around 2000 and 1400 bits/s. If anyone would like more explanation of that page, just ask.
Discussion of Results
I can hear “birch” losing some quality at the 20ms decimation step. When training my own NN, I have had quite a bit of trouble with very rough speech when synthesising “canadian”. I’m learning that roughness in NN synthesis means more training is required: the network just hasn’t experienced this sort of speaker before. The “canadian” sample is quite low pitch so I may need some more training material with low pitch speakers.
My quantisation scheme works really well on some of the carefully spoken Harvard sentences (oak, glue), in ideal recording conditions. However with more realistic, quickly spoken speech with real world background noise (separately, wanted) it starts to sound vocoder-ish (albeit a pretty good vocoder).
One factor is the frame rate decimation from 10 to 20-30ms, which I used to get the bit rate beneath 2000 bit/s. A better quantisation scheme, or LPCNet running on 20ms frames, could improve this. Or we could just run it at greater than 2000 bit/s (say for VHF/UHF two way radio).
Comparison to Wavenet
|Wavenet, Codec 2 encoder, 2400 bits/s||Listen|
|LPCNet unquantised, 10ms frame rate||Listen|
|Quantised to 1733 bits/s (44bit/30ms)||Listen|
The “separately” sample from the Wavenet team sounds better to me. Ironically, these samples use my Codec 2 encoder, running at just 8kHz! It’s difficult to draw broad conclusions from this, as we don’t have access to a Wavenet system to try many different samples. All codecs tend to break down under certain conditions and samples.
However it does suggest (i) we can eventually get higher quality from NN synthesis and (ii) it is possible to encode high quality wideband speech with features covering a narrow spectral range (e.g. 200-3800Hz for the Codec 2 encoder). The 18 element vectors (covering DC to 8000Hz) I’m currently using ultimately set the bit rate of my current system. After a few VQ stages the elements are independent Gaussians and reduction in quantiser noise is very slow as bits are added.
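To make the idea of VQ stages concrete, here is a minimal multi-stage VQ sketch in Python. It assumes a simple greedy search in which each stage quantises the residual left by the previous one. The toy 2-element codebooks and the function names are made up for illustration, and the search used in the actual quantiser may differ.

```python
def nearest(codebook, target):
    """Index of the codebook entry with the smallest squared error to target."""
    best_index, best_err = 0, float("inf")
    for index, entry in enumerate(codebook):
        err = sum((t - e) ** 2 for t, e in zip(target, entry))
        if err < best_err:
            best_index, best_err = index, err
    return best_index


def msvq_encode(codebooks, vector):
    """Greedy multi-stage VQ: each stage quantises what the last stage left over."""
    residual = list(vector)
    indexes = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indexes.append(i)
        residual = [r - e for r, e in zip(residual, codebook[i])]
    return indexes


def msvq_decode(codebooks, indexes):
    """Sum the chosen entry from every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indexes):
        out = [o + e for o, e in zip(out, codebook[i])]
    return out


# Toy codebooks (assumed values): a coarse first stage and a finer second stage.
stage1 = [[0.0, 0.0], [4.0, 4.0], [-4.0, -4.0]]
stage2 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
indexes = msvq_encode([stage1, stage2], [4.9, 4.2])
decoded = msvq_decode([stage1, stage2], indexes)
```

Each added stage costs log2(codebook size) bits per frame. Once the residual elements look like independent noise, a further stage has little structure left to exploit, which matches the slow reduction in quantiser noise described above.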
The LPCNet engine has several awesome features: it’s open source, runs in real time on regular CPUs, and is available for us to test on a wide variety of samples. The speech quality I am achieving with even my first attempts is rather good compared to any other speech codecs I have played with at these bit rates – in either the open or closed source worlds.
Tips and Observations
I’ve started training my own models, and discovered that if you get rough speech – you probably need more data. For example when I tried training on 1E6 vectors, I had a few samples sounding rough when I tested the network. However with 5E6 vectors, it works just fine.
The LPCNet dump_data program, in its -train mode, helps you by being very clever. It “fuzzes” the speech frequency and gain, and adds a little noise. If the NN hasn’t experienced a particular combination of features before, it tends to get lost – and you get rough sounding speech.
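The general idea of this kind of training-time fuzzing can be sketched in a few lines of Python. This is not the dump_data code: the function name, the gain range and the noise amplitude are all assumptions chosen for illustration, and the frequency fuzzing is left out to keep the example short.

```python
import random


def fuzz(samples, rng, max_gain_db=6.0, noise_amp=2.0):
    """Return a copy of 16-bit samples with a random gain and a little added noise.

    max_gain_db and noise_amp are illustrative values, not taken from LPCNet.
    """
    gain = 10.0 ** (rng.uniform(-max_gain_db, max_gain_db) / 20.0)
    out = []
    for s in samples:
        v = gain * s + rng.uniform(-noise_amp, noise_amp)
        # Clip to the 16-bit range so loud inputs do not wrap around.
        out.append(max(-32768, min(32767, int(round(v)))))
    return out


rng = random.Random(1)          # seeded so the example is repeatable
speech = [0, 1000, -1000, 32767]
fuzzed = fuzz(speech, rng)
```

Applying a different random gain and noise draw to each chunk of the training material exposes the network to more combinations of features than the raw recordings contain.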
I found that 10 Epochs of 5E6 vectors gives me good speech quality on my test samples. That takes about a day with my somewhat underpowered GPU. In fact, most of the training seems to happen on the first few Epochs:
Here is a plot of the training and validation loss for my training database:
This plot shows how much the loss changes on each Epoch: not very much, but not zero. I’m unsure if these small gains lead to meaningful improvements over many Epochs.
I looked into the LPCNet pitch and voicing estimation. Like all estimators (including those in Codec 2), they tend to make occasional mistakes. That’s what happens when you try to fit neat signal processing models to real-world biological signals. Anyway, the amazing thing is that LPCNet doesn’t care very much. I have some samples where pitch is all over the place but the speech still sounds OK.
This is really surprising to me. I’ve put a lot of time into the Codec 2 pitch estimators. Pitch errors are very obvious in traditional, model based low bit rate speech codecs. This suggests that with NNs we can get away with less pitch information – which means fewer bits and better compression. Same with voicing. This leads to intriguing possibilities for very low bit rate (a few hundred bit/s) speech coding.
Conclusions, Further Work and FreeDV 2020
Overall I’m pleased with my first attempt at quantisation. I’ve learnt a lot about VQ and NN synthesis and carefully documented (and even scripted) my work. The learning and experimental experience has been very satisfying.
Next I’d like to get one of these candidates on the air, see how it sounds over real world digital radio channels, and find out what happens when we get bit errors. I’m a bit nervous about predictive quantisation on radio channels, as it causes errors to propagate in time. However I have a good HF modem and FEC, and some spare bits to add some non-predictive quantisation if needed.
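The error propagation concern can be shown with a toy scalar example in Python. This is only a sketch under simplifying assumptions: a single value per frame instead of a vector, a uniform quantiser instead of a VQ, and a prediction coefficient `coeff` that I have introduced for illustration. With `coeff = 1.0` the quantiser codes the plain difference between frames; a value below 1.0 is a "leaky" predictor.

```python
def encode(samples, step, coeff):
    """Quantise each sample's difference from a prediction based on the last decoded value."""
    indexes, prev = [], 0.0
    for x in samples:
        index = round((x - coeff * prev) / step)
        indexes.append(index)
        prev = coeff * prev + index * step   # mirror what the decoder will compute
    return indexes


def decode(indexes, step, coeff):
    """Rebuild samples by adding each decoded difference to the prediction."""
    out, prev = [], 0.0
    for index in indexes:
        prev = coeff * prev + index * step
        out.append(prev)
    return out


samples = [1.0, 1.5, 2.0, 2.0, 2.0, 2.0]
indexes = encode(samples, 0.5, 1.0)
clean = decode(indexes, 0.5, 1.0)

corrupted = list(indexes)
corrupted[1] += 2                            # simulate a channel error in frame 1
damaged = decode(corrupted, 0.5, 1.0)
error = [d - c for d, c in zip(damaged, clean)]
```

With `coeff = 1.0` the single corrupted index shifts every later frame by the same amount, so the error never dies away. Re-running with `coeff = 0.5` halves the error on each following frame, at the cost of a less accurate prediction. Non-predictive quantisation removes the memory entirely.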
My design for a new, experimental “FreeDV 2020” mode employing LPCNet uses just 1600 Hz of RF bandwidth for 8kHz bandwidth speech, and should run at 10dB SNR on a moderate fading channel.
Here is a longer example of LPCNet at 1733 bit/s compared to HF SSB at a SNR of 10dB (we can send error free LPCNet through a similar HF channel). The speech sample is from the MP3 source of the Australian weekly WIA broadcast:
|SSB simulation at 10dB SNR||Listen|
|LPCNet Quantised to 1733 bits/s (44bit/30ms)||Listen|
|Mixed LPCNet Quantised and SSB (thanks Peter VK2TPM!)||Listen|
This is really new technology, and there is a lot to explore. The work presented here represents my initial attempt at quantisation with the LPCNet synthesis engine, and is hopefully useful for other people who would like to experiment in the area.
Thanks Jean-Marc for developing the LPCNet technology, making the code open source, and answering my many questions.
LPCNet introductory page.
The source code for my quantisation work (and notes on how to use it) is available as a branch on the GitHub LPCNet repo.