In the last post in this series I was getting close to a fully quantised 700 bit/s codec. However as I pushed through I discovered a bug in the post-filter. I was accidentally cheating and using some of the encoder information in the decoder. When I corrected the bug the quality dropped significantly. I’ve hit these sorts of bugs before – the simulation code is complex and it’s easy to “declare victory” prematurely.

So I have abandoned the AbyS approach for now. Oh well, that’s “research and disappointment” for you. Plenty of new ideas though….

For the last few months I have been working on another solution that vector quantises a “fixed rate” version of the spectrum. The masking functions are still used to smooth the spectrum before sampling at the fixed rate. Much like we low pass filter time domain samples before sampling, the masking functions reduce the “bandwidth” and hence sample “rate” we need to represent the spectrum. Here is a block diagram of the current “700C” candidate codec:

The bit allocation is pitch (Wo) 6 bits, 1 bit for voicing, 16 bits for the amplitude VQ, 4 bits for energy and 1 bit spare. All updated every 40ms. The new work is in the “Decimate in Frequency” block, expanded here:

As the pitch of the speech varies, the number of harmonics used to represent the speech, L, varies. The goal is take a vector of L amplitude samples, vector quantise, and send them over a channel. To vector quantise them we need fixed length vectors. So a Discrete Fourier Transform (DFT) is used to resample the L amplitude samples to fixed vectors of length 20 (I have chosen k=10).

BTW a DFT is the generic form of a Fast Fourier Transform (FFT). A FFT is a computationally efficient (fast) way of computing a DFT.

The steps are similar to sampling a time domain signal. The bandwidth of the signal is limited by using the masking function to smooth the variations in the amplitude envelope. The use of masking functions means the smoothing matches the response of the ear, and no perceptually important information is lost.

I’ve recently been playing with OFDM modems, so I used a “cyclic suffix” to further smooth the DFT coefficients. DFTs like cyclic signals. If you have a DFT of an 8kHz signal, the sample at 3900Hz is the “close” to the sample at 0 Hz. If there is a step jump in amplitude – you get a lot of high frequency information in the DFT coefficients which is harder to quantise. So I throw away the last 500Hz of the speech signal (3500-4000 Hz), and replace it with a curve that ensures a smooth match between 3500 Hz and 0 Hz.

Yeah, I don’t know how I dream this stuff up either …… do I use the Force? Too much red wine or espresso? Experience? A life mispent on computers? Subconscious innovation? Plagiarism?

In the past I’ve tried to resample and VQ the spectrum of sinusoidal codecs a few times, without much success. Jean Marc also suggested something similar a few posts back. Anyhoo, getting somewhere this time around.

Here are some plots that show the algorithm in action for a frame of female speech:

Here are the amplitude samples (red crosses). The blue line has the cyclic suffix, note how it meets the first amplitude sample near 0Hz.

This figure shows the difference in the DFT coefficients with (blue) and without (green) the cyclic suffix:

Here is the cumulative energy of DFT coefficients, note that with the cyclic suffix (blue) low frequency energy dominates:

This figure shows a typical 2k=20 length vector that we vector quantise. Note it has zero mean – we extract the DC coefficient and separately quantise this as the frame energy.

**Samples**

Sample | 1300 | 700C Candidate |
---|---|---|

hts1a | Listen | Listen |

hts2a | Listen | Listen |

forig | Listen | Listen |

ve9qrp_10s | Listen | Listen |

mmt1 | Listen | Listen |

vkqi | Listen | Listen |

cq_ref | Listen | Listen |

Through a couple of years of on-air operation we have established that the 1300 bit/s codec (as used in FreeDV 1600 with 300 bit/s of FEC) has acceptable speech quality for HF. So the goal of this work is similar quality at 700 bit/s.

For some samples above (e.g. hts1a and mmt1a), 1300 is superior to the current 700C candidate. For others (e.g. hts2a and vk5qi) 700 sounds a little better. So I think I’m in the ball park.

There’s a bit of clipping at the start of cq_ref, and some level variations between the two modes on some samples. The 700C candidate has a few problems with unvoiced sounds, e.g. the intake of breath on ve9qrp_10, and the “ch” sound at the start of chicken in hts2a. Not sure why.

The cq_ref_1300 sample is a bit poor as the LPC technique used for spectral amplitudes falls over when the spectral dynamic range is high. In this sample the LF energy has much higher energy than the HF, i.e. a strong “Low Pass Filter” effect or spectral slope.

Next step is some refactoring – the Octave code is an untidy mess of 6 months of dead ends and false starts. A mirror of real world R&D I guess. Creating something new is not a tidy process. At least in my head. So many aspects of this algorithm that I could explore but I’d rather get this on the air and see if we really have something here. Would love to have some help with a port from Octave to C. Contact me if you’d like to work in this area.