I’ve returned to speech coding after a year of playing with VHF and HF modems. This is somewhat daunting for me, as speech coding is R&D, which tends to be very open ended; it’s possible to work for months with no clear outcomes. In contrast, the modem work is straightforward engineering, and I get the positive feedback of “having stuff work” on a regular basis.
So I’m trying to time box the speech coding projects to a few days work each. This is quite a personal challenge, as there are just so many variables and paths to follow. It’s so easy to go off on a tangent and watch the months pass!
In this particular project I’m looking at the Codec 2 700C Vector Quantiser (VQ), and exploring ways to make it more inherently robust to high pass and low pass filtering at the edges of the spectrum. The broad goal is to improve speech quality at a given bit rate, or support a lower bit rate at the same speech quality. I’m targeting bit rates in the 600 bit/s range, and the lower end of the quality range (communications quality speech).
Subset Vector Quantiser
It’s well known that the most important speech information lies between 300 and 3000 Hz. Energy outside that range makes the speech sound nicer, but doesn’t help intelligibility much. Analog modes such as SSB exploit this by band limiting the speech, so that the transmitter power is used just to punch through information in the narrow bandwidth that matters.
With digital speech the RF bandwidth is not directly linked to the bandwidth of the decoded speech. For example FreeDV 700D uses around 1100 Hz of RF bandwidth, but the decoded speech has energy covering most of the 0 to 4000 Hz range.
Codec 2 encodes and transmits the speech spectrum on a regular basis. As well as encoding the parts of the spectrum necessary for intelligibility, it also has to encode other features, such as the high pass and low pass roll-off of the microphone and the analog filters in the sound card. These features don’t carry any intelligibility, but bits are consumed to encode them. When you have just 28 bits/frame – every bit matters!
It turns out that some of the wilder variations in the speech spectrum across different speech sources are in the 0-300 Hz and 3000-4000 Hz regions, for example different high pass or low pass filtering from a particular microphone or sound card. This can upset the Codec 2 quantiser, e.g. it might expend bits modelling a particular low pass filter response – lowering the quality of the perceptually important 300-3000 Hz information.
So I’ve prototyped a Vector Quantiser (VQ) that uses just the information in the 300-3000 Hz range to quantise the speech spectrum. However, the VQ is trained on the full range, so the full-range codebook entries are used to synthesise the speech, hopefully recovering some of the extended spectrum. There is also a limiter before the VQ to reduce the dynamic range of the frames.
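To make the idea concrete, here is a minimal Python/numpy sketch of a subset search (not the actual Codec 2 implementation): the codebook stores full K=20 band vectors, but the distance search only uses the indexes covering roughly 300-3000 Hz, after a simple limiter. The band index range, limiter rule and codebook contents are illustrative assumptions.

```python
import numpy as np

K = 20                 # mel-spaced magnitude samples per frame (as in newamp1)
SUBSET = slice(2, 17)  # assumed indexes covering roughly 300-3000 Hz (illustrative)
LIMIT_DB = 30.0        # assumed limiter ceiling above the frame mean (illustrative)

def limit(frame_db):
    """Crude limiter: clamp samples to reduce per-frame dynamic range."""
    return np.minimum(frame_db, np.mean(frame_db) + LIMIT_DB)

def subset_vq(frame_db, codebook):
    """Search the codebook using only the 300-3000 Hz subset, but return the
    full-band codebook entry so the extended spectrum is synthesised."""
    target = limit(frame_db)
    # squared error over the perceptually important subset only
    dist = np.sum((codebook[:, SUBSET] - target[SUBSET]) ** 2, axis=1)
    index = int(np.argmin(dist))
    return index, codebook[index]      # full K=20 vector used for synthesis

# usage: a 12 bit codebook has 4096 entries of K=20 dB samples
codebook = np.random.randn(4096, K) * 6 + 30   # stand-in for a trained codebook
frame = np.random.randn(K) * 6 + 30            # stand-in for one frame of magnitudes
idx, quantised = subset_vq(frame, codebook)
```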
Here are some samples processed with the newamp1 two stage VQ, and the new subset VQ. Like the newamp1 algorithm, the subset VQ works on vectors of K=20 mel-spaced magnitude samples. The speech codec is only partially quantised (10ms frames, original phase, unquantised pitch), so it would sound worse in a real world fully quantised codec. The “newamp1” algorithm is used for Codec 2 700C [1] (and employed in FreeDV 700C/D/E), and uses 22 bits/frame (550 bits/s at a 40ms frame rate). The subset VQ uses just 12 bits/frame (300 bits/s at a 40ms frame rate).
Filename | newamp1 | subset |
---|---|---|
big dog | Listen | Listen |
cap | Listen | Listen |
fish | Listen | Listen |
hts2a | Listen | Listen |
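The bit rates quoted above follow directly from the bits/frame and the 40 ms frame rate; a quick back-of-envelope check:

```python
# bits per frame divided by the frame period gives the VQ bit rate
frame_period_s = 0.040
print(22 / frame_period_s)   # newamp1: 550.0 bit/s
print(12 / frame_period_s)   # subset:  300.0 bit/s
```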
There are some samples where newamp1 sounds louder – this could be due to the gain limiter stage in the subset algorithm constraining the dynamic range. There also seems to be more high frequency response with newamp1, indicating subset is not recovering the high frequency speech energy as I had hoped. Both are quite intelligible, and acceptable for communications quality speech.
This table presents the mean square spectral distortion in dB*dB:
Filename | newamp1 | subset |
---|---|---|
big dog | 6.05 | 8.56 |
cap | 9.03 | 8.10 |
fish | 10.81 | 7.31 |
hts2a | 10.51 | 8.62 |
Both the samples and objective results show the subset VQ is holding up OK next to the reference newamp1, despite the low bit rate. I’ve found that around 9 dB*dB spectral distortion gives acceptable results for my (low communications quality) use case.
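For clarity, here is how I read those numbers, assuming the metric is the mean squared difference between the original and quantised dB magnitude vectors, averaged over frames (the exact weighting in the Codec 2 scripts may differ):

```python
import numpy as np

def mean_sq_spectral_distortion(orig_db, quant_db):
    """orig_db, quant_db: (frames, K) arrays of log magnitude samples in dB.
    Returns the average over frames of the per-frame mean squared error,
    giving a figure in dB*dB."""
    return float(np.mean(np.mean((orig_db - quant_db) ** 2, axis=1)))

# a frame that is off by 3 dB in every band contributes 3*3 = 9 dB*dB,
# so the ~9 dB*dB threshold corresponds to roughly a 3 dB rms error
frames = np.full((100, 20), 3.0)
print(mean_sq_spectral_distortion(frames, np.zeros_like(frames)))  # 9.0
```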
I tried applying an artificial low pass filter (cutting off energy above 3400 Hz) to a couple of the input samples, to simulate what might happen with different microphones.
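As a sketch of that sort of pre-processing (not the exact filter I used), a Butterworth low pass via scipy could look like this; the filter order, cut-off and file names are illustrative:

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_mic_lowpass(speech, fs=8000, cutoff_hz=3400, order=6):
    """Apply a low pass filter to speech samples to roughly simulate a
    microphone/sound card roll-off above cutoff_hz."""
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return lfilter(b, a, speech)

# usage with a raw 16-bit/8 kHz file as used by the Codec 2 tools
speech = np.fromfile("fish.raw", dtype=np.int16).astype(np.float32)
filtered = simulate_mic_lowpass(speech)
filtered.astype(np.int16).tofile("fish_lpf.raw")
```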
The subset VQ seems to work OK with the low pass filtered input samples; they sound pretty similar to the samples coded from the original source (one of the goals of this work):
Filename | subset | subset low pass |
---|---|---|
cap | Listen | Listen |
fish | Listen | Listen |
Conclusions and Further work
I’m surprised that a single stage VQ is working quite well at just 12 bits/frame. It’s also encoding both the average frame energy and the spectral shape (I use a separate scalar quantiser for frame energy in newamp1). Jointly quantising them is close to optimal from a VQ theory point of view, but is sometimes impractical due to concerns such as the storage/CPU requirements of a large single stage VQ.
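To put rough numbers on that trade-off (a sketch, assuming K=20 float entries and a full search): a single 12 bit stage needs a 4096-entry codebook and 4096 distance computations per frame, whereas two 6 bit stages need only 2x64 entries and 128 searches, at the cost of the stages being quantised independently.

```python
K = 20  # elements per vector

def vq_cost(bits_per_stage):
    """Rough storage (floats) and search cost (distance computations per
    frame) for a VQ split into the given stages, assuming full search."""
    entries = [2 ** b for b in bits_per_stage]
    return sum(entries) * K, sum(entries)

print(vq_cost([12]))    # single stage, 12 bits: (81920, 4096)
print(vq_cost([6, 6]))  # two stages, 6+6 bits:  (2560, 128)
```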
Further work ideas:
- Try a few more samples, and push this work through to a fully quantised speech codec.
- Seeing it’s doing well with a single stage, it would be interesting to see if it sounds better with multiple stages.
- A single stage VQ enables other tricks, like non-FEC techniques to make the VQ robust to bit errors, such as sorting the VQ indexes [2] or Neural Net style training with bit errors.
- Try different companding curves instead of the hard limiter, to better represent louder signals (see the sketch after this list).
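On that last point, a companding curve would compress large excursions smoothly rather than clipping them flat. A minimal sketch of one possible soft-knee curve, purely illustrative and not what is in the current prototype:

```python
import numpy as np

def hard_limit(frame_db, ceiling_db):
    """Current prototype approach: clamp anything above the ceiling."""
    return np.minimum(frame_db, ceiling_db)

def soft_limit(frame_db, knee_db, ratio=3.0):
    """Soft knee: values below the knee pass through unchanged; the excess
    above the knee is compressed by `ratio` rather than clipped, so louder
    components keep some of their relative shape."""
    excess = np.maximum(frame_db - knee_db, 0.0)
    return np.minimum(frame_db, knee_db) + excess / ratio

frame = np.array([10.0, 25.0, 40.0, 60.0])
print(hard_limit(frame, 30.0))  # [10. 25. 30. 30.]
print(soft_limit(frame, 30.0))  # [10.   25.   33.33 40.  ]
```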
Links
[1] Codec 2 700C
[2] Codec 2 at 450 bits/s. Another single VQ Codec 2 mode, that can generate high frequency information using a similar approach to this post.
[3] Codec 2 700C Equaliser Part 2. A previous approach to handle a similar problem, in this case the speech is equalised before hitting a full band VQ.
[4] The script train_sub_quant.sh in this GitHub PR is used to perform the experiments documented in this post.