Over the past few weeks I have made some solid progress on this codec. A “zero phase” model for synthesising the phases has been developed that requires zero bits to transmit the phases and just one voicing bit. The NLP pitch estimator C code is now running, and a pitch tracker has been developed. A post filter has been developed that helps with background noise. A bunch of little bugs have also been tracked down. The quality of the speech codec improves day by day.
There is now a Codec 2 Web page which includes some notes on the algorithms and instructions on how to run the codec. Useful if you would like to run some of your own speech samples through codec2.
I apologise for the length of this post, over the last few weeks I worked on a lot of stuff that I wanted to document. One goal of this project is to leave a trail of information behind for others to follow. Breadcrumbs on the Internet.
First Order Phase Model
Through a process of trial and error I have gradually developed two phase models. I spent the first week scribbling on paper and thinking and discarding a bunch of schemes. Then I reached back into the thesis time machine and borrowed some code from the 1995 David Rowe.
Chapter 6 of my thesis presents a “first order” phase model. This models the harmonic phases as an excitation impulse driving an all pole LPC synthesis filter. One twist is that the excitation impulse gain is a complex number. The net result is that we try to fit a straight line to the excitation phase spectrum. The parameters of the straight line (slope and y-intercept) are then sent to the decoder.
Phases are tricky to work with as they wrap around every 360 degrees (2pi radians). Messes with your head. Here is frame 44 from the file hts1a.raw: first the time domain speech, then the magnitude spectrum, then the phase spectrum:
Now on the phase spectrum, points near +/- 180 degrees are actually very close together. It’s just that everything wraps around at +/- 180 degrees so it gets confusing. For example +150 degrees and -150 degrees are actually 60 degrees apart. Alternatively you can choose to wrap around at 0 and 360 degrees. But it’s still confusing. The steady linear slope of the phase spectrum indicates a constant time shift of the time domain signal relative to the centre of the analysis frame.
OK, so we attempt to fit a first order phase model to the phase samples. We then measure how good this fit is using a Signal to Noise Ratio (SNR) measure. If the SNR is beneath a certain threshold we declare the frame unvoiced, otherwise we treat it as voiced. If the frame is unvoiced, we randomise all of the phases at the decoder.
On the first pass I used the first order model parameters (slope and y-intercept) at the decoder. This works pretty well. The output speech isn’t identical to speech synthesised with the original phases, but it sounds OK, especially through a speaker. Now a linear shift in phase across frequency corresponds to a time shift in the time domain, so the linear phase term actually specifies the position of each pitch pulse. This makes the first order model good at representing aperiodic speech, for example voiced speech with a very long pitch period or no regular pitch structure (say “Ahhhh” and gradually lower your pitch until it gets creaky). The first order model requires about 5 bits for the constant phase term, 7 for the slope, and 1 bit for the voicing, for a total of 13 bits/frame.
Zero Phase Model
The zero phase model uses the same procedure at the encoder. We attempt to fit a straight line to the excitation phases and measure the SNR. However the model parameters are then discarded, all we keep is the voiced/unvoiced decision.
At the decoder we keep track of the phase of the first harmonic. The excitation phase of the other harmonics is derived from this harmonic. We then filter each excitation harmonic using the LPC synthesis filter to get the final phase. The filtering is done in the frequency domain using multiplications rather than time domain convolution. More details on the zero phase model in the phase.c source code.
The zero phase model works surprisingly well, it sounds very similar to the first order model and requires just 1 voicing bit compared to 13 (wave file samples below).
I am sure I read about this zero phase model somewhere 20 years ago for my post grad work. Must dig that paper up some time. Not sure if that paper included the idea of using the LPC synthesis filter for the excitation phase. This accounts for a lot of the speech quality – without it males sound very “clicky”.
This simple, 1-bit voicing decision shouldn’t work as well as it does. Conventional wisdom is that some sort of mixed voicing model is required for high quality speech, for example declaring the first part of the spectrum voiced, then the next part unvoiced. The AMBE algorithm works that way, and I think MELP also has some sort of mixed voicing model.
Just in case I was accidentally fudging it I tried:
    /* just to make sure we are not cheating - kill all phases */

    for(i=0; i<MAX_AMP; i++)
        model.phi[i] = 0;
    phase_synth_zero_order(snr, H, &prev_Wo, &ex_phase);
to make sure I wasn’t using original phases.
Compared to original phases the zero phase model has a few remaining artifacts – some males still sound a little “clicky”, and there are occasional tonal sounds in unvoiced areas of speech. The latter are probably due to the voicing estimator getting it wrong – i.e. declaring unvoiced speech as voiced. However given that a bunch of information (all the phases) has been thrown away I am pretty happy with the overall quality.
In the plots below you can see that the output speech signal is a little more impulsive – more of the energy is concentrated in the central peak at the start of each pitch period. The output signal also peaks at 20,000 compared to 15,000 for the input. I should point out that the output signal is a little behind the input signal (small coding delay), which is why the left hand side of the output looks a little different to the input (in this frame we have just passed a transition region).
For one sample (mmt1 – click to listen to the original (uncoded) sample), the zero phase model didn’t work very well. This sample has a high level of background noise. It sounded “clicky” and the background noise also had annoying periodic artifacts. Curiously, when I removed the contribution of the LPC model phase, mmt1 sounded the same.
I tried a bunch of ideas to improve the sound quality over several days. Eventually I discovered that the clicky artifact was due to noise energy being synthesised as voiced harmonics. In speech corrupted by background noise, the high level parts of the speech spectrum contain speech energy, however in the low level areas (inter-formant regions) the background noise dominates. As we only have a single voicing decision for the whole spectrum, we get into trouble when we synthesise this inter-formant energy as voiced.
Frame 151 of mmt1 is a good example. You can see harmonic structure up to about 1 kHz (regularly spaced harmonics), but after that it’s random background noise, except for maybe above 3 kHz where the speech signal pokes through the noise floor again:
The obvious solution is to adopt a mixed voicing model. However that means more bits and more parameter estimators. Parameter estimators always make occasional mistakes and can be painful to develop (I speak from experience). So I thought I’d try a post filter approach instead. The post filter works out an estimate of the background noise level; any harmonics that are beneath that level are set to unvoiced (i.e. the phases are scrambled).
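Here is a rough sketch of the post filter idea. The update rate and threshold constants are illustrative guesses, not the tuned codec2 values, and the real post filter is smarter about when it updates the background estimate:

```c
#include <stdlib.h>

#define BETA   0.1f   /* background estimate update rate (assumed) */
#define THRESH 1.5f   /* voiced/unvoiced level threshold (assumed) */

/* Post filter sketch.  bg_est is a slowly updated estimate of the
   background noise level; harmonics whose amplitude A[m] falls
   beneath THRESH*bg_est get their phase scrambled, i.e. they are
   synthesised as unvoiced.  Returns the number of harmonics
   scrambled, which needs zero extra bits from the encoder. */
int postfilter(float *phi, const float *A, int L,
               float e_frame, float *bg_est)
{
    int m, scrambled = 0;

    /* slowly track the frame energy to estimate the noise floor */
    *bg_est = (1.0f - BETA)*(*bg_est) + BETA*e_frame;

    for (m = 1; m <= L; m++) {
        if (A[m] < THRESH*(*bg_est)) {
            phi[m] = (2.0f*3.14159265f*rand())/RAND_MAX; /* randomise */
            scrambled++;
        }
    }
    return scrambled;
}
```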
Here are some plots of the post filter variables over the entire mmt1 sample. In some cases quite a high number of harmonics are declared unvoiced.
Current Frame and Background Energy Estimates
Percent Harmonics set to Unvoiced by Post Filter
The post filter worked pretty well on the mmt1 sample, making the speech sound closer to the samples with original phases (but still a little “clicky”). It also improved the quality of the background noise, making it sound less impulsive. The post filter is still experimental, for example I need to make sure it doesn’t mess up clean speech samples. Some more work should improve it.
Compared to mixed voicing methods the post filter approach has the big advantage of requiring zero bits – it works entirely on the information available at the decoder. I recall one version of MBE used around 12 bits/frame for voicing (600 bit/s at a 20ms frame rate).
Thoughts on Phase Models
Conceptually, the zero phase model is very close to the classic LPC-10 vocoder, which also fires impulses at the pitch period into an LPC synthesis filter and has a single bit voiced/unvoiced decision. However the zero phase model produces higher quality speech. The main difference is that I am using a frequency domain synthesis approach. However, time and frequency domain approaches should be interchangeable. Something to explore later – it might lead to an efficient time domain synthesis scheme.
For voiced speech the key role of any phase model is dispersion of energy around the onset of each pitch pulse. We don’t want all of the sine waves to come into phase at the same time or the speech sounds too “clicky”. Our ear is very sensitive to short time domain impulses like clicks. For voiced speech we can’t choose random phases as we perceive this as noise. If more than one pitch pulse forms per pitch period we perceive reverberation. More discussion on phases and speech perception in the Part 3 post in this series.
I am not sure why using the LPC model phase works so well, although I have a theory. The LPC filter has peaks aligned with the peaks of the speech spectrum. The higher level LPC peaks are quite sharp. Like any filter, sharp peaks mean a rapid phase shift over the region of the peak. This means that adjacent high level harmonics have quite different phases applied to them, dispersing the onset of the pitch pulse. So the LPC model naturally applies more dispersion to high level harmonics – just what we want to stop a click forming. Compare the hts1a frame 44 magnitude spectrum and the LPC phase spectrum for the same frame below:
LPC Phase Spectrum
Notice the big phase shift around 500Hz? This matches the first formant peak in the magnitude spectrum.
Note that it is possible to use another voicing estimator for the zero phase model. This might actually be a good thing as fitting the first order model is fairly high in complexity.
Over the past week I hunted down some residual problems with the zero phase model, such as noisy speech and problems with background noise. This sort of development is never a straight line. Lots of dead ends and dud tests and a few days where everything just goes right. For example I spent two days working on a fancy new synthesis routine to “fix” some roughness I could hear in the speech. After two days I discovered the problem was a bug I had introduced earlier in the week! Also when working on the pitch estimator I discovered I had the alignment of the LPC analysis window wrong – upsetting the phase models as well as the LPC/LSP work. Sometimes working on resolving one bug helps you fix another. And so we inch forward.
One technique that is really working well is the “dump file” idea I talked about in Part 2. I dump all of the codec’s internal states into a bunch of text files, then plot them using Octave. I have a little user interface that lets me step backwards and forwards one frame at a time.
Here is a screen shot of my desktop when working on the NLP pitch estimator (Click to enlarge):
Another technique I am finding very powerful is testing each DSP algorithm carefully using artificial signals. I guess this is really only unit testing, but sometimes you need to think pretty hard about exactly how to test a DSP algorithm. For example to test the first order phase model I generated an artificial signal that was a train of impulses. This had a known phase spectrum so I could test that the first order model was getting an exact fit (it wasn’t at first – I needed a finer sampling grid). To test the synthesis code I first drove it with a single sinusoid, then a bunch of sinusoids to make an impulse train. In both tests I discovered subtle bugs.
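For example, an impulse train test signal only takes a few lines of C (a sketch along these lines, not the actual unit test code):

```c
/* Build a test impulse train with pitch period P samples, with the
   first impulse at 'offset'.  Its harmonic magnitudes are flat and
   its harmonic phases are exactly linear (offset sets the slope), so
   an estimator's output can be checked against a known answer. */
void make_impulse_train(float *s, int n, int P, int offset)
{
    int i;
    for (i = 0; i < n; i++)
        s[i] = 0.0f;
    for (i = offset; i < n; i += P)
        s[i] = 1.0f;
}
```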
Non-Linear Pitch estimator
I spent some time getting the C code version of the pitch estimator working. I have changed the post processor and pitch tracker from those described in Chapter 4 of my thesis; see the nlp.c source code. It works OK over the 5 samples I have been testing but will no doubt make occasional errors on a wider database. Pitch estimators always do.
One thing I like about this project is that I am pushing some (I hope) useful open source DSP code out onto the Internet. For example I don’t know of any other open source pitch estimators (but please tell me if you know of any). A working pitch estimator would have saved me heaps of time in my speech coding research. There is a tnlp.c unit test that runs the pitch estimator independently of the rest of the codec. For this codec only a coarse pitch estimate is required, but pitch refinement is much easier than initial estimation if a finer resolution is required. Large (gross) errors are usually the big problem when estimating the pitch of human speech.
LPC Magnitude Modelling and Males
For some low F0 (e.g. Male) speakers LPC modelling of the spectral magnitudes introduces some low frequency artifacts. This is because the LPC model spectrum can’t change quickly near 0Hz (I remember reading somewhere the slope is always 0 for the LPC spectrum at 0Hz). Here is an example from frame 114 of mmt1 (click for a larger image):
LPC Modeling Magnitude Spectrum
The purple line on this rather busy plot is the LPC magnitude model. It does a pretty good job except for near 0Hz. The cyan line at the bottom is the LPC modeling error. In this case the LPC modeling error for the first harmonic is 30dB!
This problem is particularly bad for samples like mmt1 that have some strong high pass filtering applied to the original sample. This means that the fundamental (first) harmonic has been zapped, but the LPC modeling inserts it again at some arbitrary level.
The fix was to add a single bit to the magnitude information. We measure the error in the first harmonic magnitude A at the encoder. If the LPC modeling error is larger than 6dB, a single bit is transmitted to the decoder. This bit instructs the decoder to attenuate A by 30dB after recovering A from the LPC model. This works well in practice – males with and without high pass filtering sound pretty good.
Here are some speech samples that show the current state of the codec algorithms. To hear some of the differences I mention above needs a good set of headphones and patience. The differences are often quite small. However experience has shown that a lot of small differences can add up through the various processing stages. So it’s a good idea to work out the cause of any small coding artifacts and fix them as early as you can.
| | Male | Female | Male + truck | Male | Female |
| Zero phase and post filter | hts1a | hts2a | mmt1 | morig | forig |
| p=10 LPC modeled magnitudes | hts1a | hts2a | mmt1 | morig | forig |
| Speex 1.11 8 kbit/s | hts1a | hts2a | mmt1 | morig | forig |
Jean-Marc, sorry about using an earlier version of Speex. It’s what I had easily available via an apt-get package. I used the default quality level for 8 kbit/s. I will upgrade these samples later (actually hts1a and hts2a are the latest samples from the Speex site).
Note: I can hear some tonal artifacts in hts2a, although the other female (forig) sounds much better. I have suspected for some time that hts2a might have some aliasing, an artifact of that day long ago in 1990 when I sampled it from a DAT player using a DSP32C development system ISA card.
I am fairly happy with the various algorithms, including pitch estimation, the zero phase model, and LPC magnitude modeling. Performance on clean speech, speech with high pass filtering, and speech with background noise is acceptable. Next step is to look again at the LSP quantisation of the LPC model, and work on ways to reduce the current 10ms frame rate to 20ms. We will then have a first pass 2400 bit/s codec ready for alpha testing.
Links

Techniques for Harmonic Sinusoidal Coding (my thesis)
Codec2 Web Page
Open Source Low Rate Codec Part 1 – Project Kick off
Open Source Low Rate Codec Part 2 – LPC Amplitude Modelling
Open Source Low Rate Codec Part 3 – Phase and Male Speech