Since the last post I’ve trained LPCNet using quantised Codec 2 parameters. I’ve also modified the Codec 2 decoder (c2dec) to dump the received features to a disk file so I can use them to synthesise speech using LPCNet.
So now I can decode a Codec 2 bit stream using a LPCNet decoder:
|Original 8kHz sample||Original|
|Codec 2 1300 encoder and Codec 2 decoder||codec2_1300|
|Codec 2 1300 encoder and LPCNet decoder||lpcnet_1300|
|Codec 2 2400 encoder and Codec 2 decoder||codec2_2400|
|Codec 2 2400 encoder and LPCNet decoder||lpcnet_2400|
|Codec 2 unquantised 6th order LSP and LPCNet decoder||lpcnet_6lsps|
I think it sounds pretty good, especially at 2400 bits/s. Even the lpcnet_1300 decoder sounds better than Codec 2 at 2400. LPCNet is much more natural, without the underwater sound of low rate vocoded speech.
The Kleijn et al paper showed NN based synthesises can produce high quality speech from feature sets of legacy vocoders like Codec 2. This implies the legacy vocoders are sending a lot of extra information that is not normally used by the legacy decoder/synthesis algorithms. If high quality speech is not required, it could be argued we are sending “too much” information, and scope exists to reduce the bit rate by coarser quantisation or sending less features.
The lpcnet_6lsps sample uses just 6th order Linear Prediction, in the form of 6 Line Spectral Pairs (LSPs). A LPC order of 10 is common for 8 kHz sampled speech. There is no way 6th order LPC would work (i.e. provide intelligible speech) for any existing LPC based vocoder (CELP, MELP etc). This sample has some odd artefacts; but is intelligible, and also quite natural sounding compared to a vocoder.
Curiously, when the synthesis breaks down (e.g. “depth of the well”), it sounds to me like the pitch is halved. I can hear this on both the male and female segments.
I still have much to learn in this area, but the initial results are promising, especially at 2400 bit/s. The main difference between the Codec 2 feature set at 2400 and 1300 is the update rate, so it would be useful to explore those differences further with a series of tests.
This work is not far from being usable “over the air” as a FreeDV mode, and promises a significant jump in quality. Codec 2 1300 is the vocoder used for FreeDV 1600, so it may be possible to develop a drop in replacement for initial testing.
The “best” (in terms of quality at a given bit rate) encoder and feature set (especially under quantisation) is an open research question. It is amazing that it works with the Codec 2 bit stream at all – we should be able to do better with a custom encoder.