Open Source Low Rate Codec Part 4 – Zero Phase Model

Over the past few weeks I have made some solid progress on this codec. A “zero phase” model for synthesising the phases has been developed that requires zero bits to transmit the phases and just one voicing bit. The NLP pitch estimator C code is now running, and a pitch tracker has been developed. A post filter has been developed that helps with background noise. A bunch of little bugs have also been tracked down. The quality of the speech codec improves day by day.

There is now a Codec 2 Web page which includes some notes on the algorithms and instructions on how to run the codec. Useful if you would like to run some of your own speech samples through codec2.

I apologise for the length of this post; over the last few weeks I worked on a lot of stuff that I wanted to document. One goal of this project is to leave a trail of information behind for others to follow. Breadcrumbs on the Internet.

First Order Phase Model

Through a process of trial and error I have gradually developed two phase models. I spent the first week scribbling on paper and thinking and discarding a bunch of schemes. Then I reached back into the thesis time machine and borrowed some code from the 1995 David Rowe.

Chapter 6 of [1] presents a “first order” phase model. This models the harmonic phases as an excitation impulse driving an all pole LPC synthesis filter. One twist is that the excitation impulse gain is a complex number. The net result is that we try to fit a straight line to the excitation phase spectrum. The parameters of the straight line (slope and y-intercept) are then sent to the decoder.
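
In symbols (my shorthand here, which may differ from the thesis notation), the model phase of harmonic m is a straight line in frequency plus the LPC synthesis filter phase:

\hat{\phi}_m \;=\; \beta \;+\; m\,\omega_0\,n_0 \;+\; \arg H(e^{j m \omega_0}), \qquad m = 1,\ldots,L

where \omega_0 is the fundamental frequency in radians/sample, H is the LPC synthesis filter, and the slope n_0 (a time shift locating the pitch pulse) and y-intercept \beta are the two parameters fitted and transmitted.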

Phases are tricky to work with as they wrap around every 360 degrees (2 pi radians). Messes with your head. Here is frame 44 from the file hts1a.raw: first the time domain speech, then the magnitude spectrum, then the phase spectrum:

Time domain

Magnitude Spectrum

Phase Spectrum

Now on the phase spectrum, points near +/- 180 degrees are actually very close together. It’s just that everything wraps around at +/- 180 degrees so it gets confusing. For example +150 degrees and -150 degrees are actually 60 degrees apart. Alternatively you can choose to wrap around at 0 and 360 degrees. But it’s still confusing. The steady linear slope of the phase spectrum indicates a constant time shift of the time domain signal relative to the centre of the analysis frame.
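
As a quick illustration (a minimal helper function, not something from the codec), wrapping an angle back into the range (-pi, pi] gives the true angular distance:

#include <math.h>

/* wrap an angle in radians into the range (-pi, pi] */
float wrap_phase(float phi) {
    while (phi > M_PI)   phi -= 2.0f*M_PI;
    while (phi <= -M_PI) phi += 2.0f*M_PI;
    return phi;
}

For example wrap_phase((150.0 - -150.0)*M_PI/180.0) returns -60 degrees worth of radians: the two points are 60 degrees apart, as above.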

OK, so we attempt to fit a first order phase model to the phase samples. We then measure how good this fit is using a Signal to Noise Ratio (SNR) measure. If the SNR is beneath a certain threshold we declare the frame unvoiced; otherwise we treat it as voiced. If the frame is unvoiced, we randomise all of the phases at the decoder.
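
A hedged sketch of the sort of measure I mean (the exact form in the code may differ) compares the energy of the measured harmonics to the energy of the fitting error:

\mathrm{SNR} \;=\; \frac{\sum_{m=1}^{L} A_m^2}{\sum_{m=1}^{L} \left| A_m e^{j\phi_m} - A_m e^{j\hat{\phi}_m} \right|^2}

where A_m and \phi_m are the measured harmonic magnitudes and phases, and \hat{\phi}_m are the model phases from the straight line fit.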

On the first pass I used the first order model parameters (slope and y-intercept) at the decoder. This works pretty well. The output speech isn't identical to speech synthesised with the original phases, but it sounds OK, especially through a speaker. Now a linear shift in phase across frequency corresponds to a time shift in the time domain, so the linear phase term actually specifies the position of each pitch pulse. This makes the first order model good at representing aperiodic speech, for example voiced speech with a very long pitch period or no regular pitch structure (say "Ahhhh" and gradually lower your pitch until it gets creaky). The first order model requires about 5 bits for the constant phase term, 7 for the slope, and 1 bit for voicing, a total of 13 bits/frame.

Zero Phase Model

The zero phase model uses the same procedure at the encoder. We attempt to fit a straight line to the excitation phases and measure the SNR. However the model parameters are then discarded, all we keep is the voiced/unvoiced decision.

At the decoder we keep track of the phase of the first harmonic. The excitation phases of the other harmonics are derived from this harmonic. We then filter each excitation harmonic with the LPC synthesis filter to get the final phase. The filtering is done in the frequency domain using multiplications rather than time domain convolution. More details on the zero phase model are in the phase.c source code.
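
Here is a simplified sketch of the idea (phase.c is the real thing; the function and variable names here are illustrative only):

#include <math.h>
#include <stdlib.h>

#define TWO_PI 6.283185307f

/* Sketch: synthesise harmonic phases from one tracked excitation phase.
   Wo       - fundamental frequency (radians/sample)
   N        - frame size in samples
   H_re/im  - LPC synthesis filter sampled at each harmonic
   L        - number of harmonics (arrays indexed 1..L)
   voiced   - the single voicing bit
   ex_phase - first harmonic excitation phase, tracked across frames    */
void zero_phase_synth(float phi[], float H_re[], float H_im[], int L,
                      float Wo, int N, int voiced, float *ex_phase)
{
    int m;

    /* advance the first harmonic's phase by one frame, keep it wrapped */
    *ex_phase += Wo*N;
    *ex_phase -= TWO_PI*floorf(*ex_phase/TWO_PI + 0.5f);

    for(m=1; m<=L; m++) {
        if (voiced) {
            /* harmonic m of the excitation is locked to harmonic 1,
               then "filtered" by adding the LPC filter phase */
            phi[m] = m*(*ex_phase) + atan2f(H_im[m], H_re[m]);
        } else {
            /* unvoiced frame: random phases are perceived as noise */
            phi[m] = TWO_PI*(float)rand()/RAND_MAX;
        }
    }
}

Because all the voiced excitation phases are multiples of the first harmonic's, the sine waves line up once per pitch period, and the LPC phase term then disperses the energy around that pulse.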

The zero phase model works surprisingly well: it sounds very similar to the first order model and requires just 1 voicing bit compared to 13 (wave file samples below).

I am sure I read about this zero phase model somewhere, 20 years ago during my post grad work. Must dig that paper up some time. Not sure if that paper included the idea of using the LPC synthesis filter for the excitation phase. The LPC filtering accounts for a lot of the speech quality – without it males sound very “clicky”.

This simple, 1-bit voicing decision shouldn’t work as well as it does. Conventional wisdom is that some sort of mixed voicing model is required for high quality speech, for example declaring the first part of the spectrum voiced, then the next part unvoiced. The AMBE algorithm works that way, and I think MELP also has some sort of mixed voicing model.

Just in case I was accidentally fudging it I tried:


/* just to make sure we are not cheating - kill all phases */
for(i=0; i<MAX_AMP; i++)
    model.phi[i] = 0;
phase_synth_zero_order(snr, H, &prev_Wo, &ex_phase);

to make sure I wasn’t using original phases.

Compared to original phases the zero phase model has a few remaining artifacts – some males still sound a little “clicky”, and there are occasional tonal sounds in unvoiced areas of speech. The latter are probably due to the voicing estimator getting it wrong – i.e. declaring unvoiced speech as voiced. However given that a bunch of information (all the phases) has been thrown away I am pretty happy with the overall quality.

In the plots below you can see that the output speech signal is a little more impulsive – more of the energy is concentrated in the central peak at the start of each pitch period. The output signal also peaks at 20,000 versus 15,000 for the input. I should point out that the output signal is a little behind the input signal (small coding delay), which is why the left hand side of the output looks a little different to the input (in this frame we have just passed a transition region).

Input Speech

Output Speech

Post Filter

For one sample (mmt1, the male voice with truck noise in the samples table below), the zero phase model didn’t work very well. This sample has a high level of background noise. It sounded “clicky” and the background noise also had annoying periodic artifacts. Curiously, when I removed the contribution of the LPC model phase, mmt1 sounded the same.

I tried a bunch of ideas to improve the sound quality over several days. Eventually I discovered that the clicky artifact was due to noise energy being synthesised as voiced harmonics. In speech corrupted by background noise, the high level parts of the speech spectrum contain speech energy, however in the low level areas (inter-formant regions) the background noise dominates. As we only have a single voicing decision for the whole spectrum, we get into trouble when we synthesise this inter-formant energy as voiced.

Frame 151 of mmt1 is a good example. You can see harmonic structure up to about 1 kHz (regularly spaced harmonics), but after that it’s random background noise, except for maybe above 3 kHz where the speech signal pokes through the noise floor again:

Time domain

Magnitude Spectrum

The obvious solution is to adopt a mixed voicing model. However that means more bits and more parameter estimators. Parameter estimators always make occasional mistakes and can be painful to develop (I speak from experience). So I thought I’d try a post filter approach instead. The post filter works out an estimate of the background noise level; any harmonics that are beneath that level are set to unvoiced (i.e. the phases are scrambled).
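
In sketch form it looks something like this (the smoothing constant and thresholds here are illustrative guesses, not the values in the code):

#include <math.h>
#include <stdlib.h>

#define TWO_PI       6.283185307f
#define BG_BETA      0.05f   /* background tracker smoothing (illustrative) */
#define BG_MARGIN_DB 40.0f   /* keep loud voiced frames out of the tracker (illustrative) */

/* Sketch of the decoder-side post filter: estimate the background noise
   level, then scramble the phase of any harmonic beneath it.
   A[] and phi[] are the harmonic magnitudes and phases, indexed 1..L. */
void postfilter_sketch(float A[], float phi[], int L, float *bg_est_db)
{
    int   m;
    float e = 1e-6f;
    float e_db;

    /* mean energy (in dB) of this frame's harmonics */
    for(m=1; m<=L; m++)
        e += A[m]*A[m];
    e_db = 10.0f*log10f(e/L);

    /* slowly track the background level, using only the quieter frames */
    if (e_db < *bg_est_db + BG_MARGIN_DB)
        *bg_est_db = (1.0f - BG_BETA)*(*bg_est_db) + BG_BETA*e_db;

    /* harmonics at or below the background estimate are treated as noise */
    for(m=1; m<=L; m++)
        if (20.0f*log10f(A[m] + 1e-6f) < *bg_est_db)
            phi[m] = TWO_PI*(float)rand()/RAND_MAX;
}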

Here are some plots of the post filter variables over the entire mmt1 sample. In some cases quite a high number of harmonics are declared unvoiced.

Current Frame and Background Energy Estimates

Percent Harmonics set to Unvoiced by Post Filter

The post filter worked pretty well on the mmt1 sample, making the speech sound closer to the samples with original phases (but still a little “clicky”). It also improved the quality of the background noise, making it sound less impulsive. The post filter is still experimental; for example, I need to make sure it doesn’t mess up clean speech samples. Some more work should improve it.

Compared to mixed voicing methods the post filter approach has the big advantage of requiring zero bits – it works entirely on the information available at the decoder. I recall one version of MBE used around 12 bits/frame for voicing (600 bit/s at a 20ms frame rate).

Thoughts on Phase Models

Conceptually, the zero phase model is very close to the classic LPC-10 vocoder, which also fires an impulse into an LPC synthesis filter once per pitch period, and has a single bit voiced/unvoiced decision. However the zero phase model produces higher quality speech. The main difference is that I am using a frequency domain synthesis approach. However time and frequency domain approaches should be interchangeable. Something to explore later; it might lead to an efficient time domain synthesis scheme.

For voiced speech the key role of any phase model is dispersion of energy around the onset of each pitch pulse. We don’t want all of the sine waves to come into phase at the same time or the speech sounds too “clicky”. Our ear is very sensitive to short time domain impulses like clicks. For voiced speech we can’t choose random phases as we perceive this as noise. If more than one pitch pulse forms per pitch period we perceive reverberation. More discussion on phases and speech perception in the Part 3 post in this series.

I am not sure why using the LPC model phase works so well, although I have a theory. The LPC filter has peaks aligned with the peaks of the speech spectrum. The higher level LPC peaks are quite sharp. Like any filter, sharp peaks mean a rapid phase shift over the region of the peak. This means that adjacent high level harmonics have quite different phases applied to them, dispersing the onset of the pitch pulse. So the LPC model naturally applies more dispersion to high level harmonics – just what we want to stop a click forming. Compare the hts1a frame 44 magnitude spectrum and the LPC phase spectrum for the same frame below:

Magnitude Spectrum

LPC Phase Spectrum

Notice the big phase shift around 500Hz? This matches the first formant peak in the magnitude spectrum.
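
The standard way to quantify this is group delay, the negative derivative of the filter phase. A sharp LPC peak (a pole close to the unit circle) produces a rapid phase shift, and hence a large group delay, in the region of the peak:

\tau(\omega) \;=\; -\frac{d}{d\omega}\,\arg H(e^{j\omega}), \qquad H(e^{j\omega}) \;=\; \frac{1}{1 + \sum_{k=1}^{p} a_k e^{-jk\omega}}

where the a_k are the order p LPC coefficients.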

Note that it is possible to use another voicing estimator for the zero phase model. This might actually be a good thing as fitting the first order model is fairly high in complexity.

DSP Hacking

Over the past week I hunted down some residual problems with the zero phase model, such as noisy speech and problems with background noise. This sort of development is never a straight line. Lots of dead ends and dud tests and a few days where everything just goes right. For example I spent two days working on a fancy new synthesis routine to “fix” some roughness I could hear in the speech. After two days I discovered the problem was a bug I had introduced earlier in the week! Also when working on the pitch estimator I discovered I had the alignment of the LPC analysis window wrong – upsetting the phase models as well as the LPC/LSP work. Sometimes working on resolving one bug helps you fix another. And so we inch forward.

One technique that is really working well is the “dump file” idea I talked about in Part 2. I dump all the codec’s internal states into a bunch of text files, then plot them using Octave. I have a little user interface that lets me step backwards and forwards one frame at a time.
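
The mechanics are trivial – something like this (a sketch; the real dump code covers many more states, and the file name is up to you):

#include <stdio.h>

#define MAX_AMP 80   /* assumed maximum harmonics per frame */

/* Append one frame of harmonic magnitudes to a text dump file, padded
   to a fixed width so Octave's load() sees a rectangular matrix. */
void dump_amplitudes(FILE *f, float A[], int L)
{
    int m;
    for(m=1; m<=L; m++)
        fprintf(f, "%f\t", A[m]);
    for(; m<=MAX_AMP; m++)
        fprintf(f, "0.0\t");
    fprintf(f, "\n");
}

Then in Octave something like Am = load("am.txt"); plot(Am(44,:)); jumps straight to (for example) frame 44.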

Here is a screen shot of my desktop when working on the NLP pitch estimator:

Another technique I am finding very powerful is testing each DSP algorithm carefully using artificial signals. I guess this is really only unit testing, but sometimes you need to think pretty hard about exactly how to test a DSP algorithm. For example to test the first order phase model I generated an artificial signal that was a train of impulses. This had a known phase spectrum so I could test that the first order model was getting an exact fit (it wasn’t at first, needed a finer sampling grid). To test the synthesis code I first drove it with a single sinusoid, then a bunch of sinusoids to make an impulse train. In both tests I discovered subtle bugs.
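
An impulse train is easy to construct as a sum of equal-amplitude, in-phase harmonics, so every phase the estimator should report is known in advance (a sketch; the frame size and level are arbitrary):

#include <math.h>

#define N  320    /* frame size in samples (arbitrary) */
#define FS 8000   /* sample rate, Hz */

/* Synthesise one frame of an impulse train with fundamental f0 Hz by
   summing equal-amplitude harmonics, all with zero phase at n=0. */
void make_impulse_train(short s[], float f0)
{
    float Wo = 2.0f*(float)M_PI*f0/FS;   /* fundamental, radians/sample */
    int   L  = (int)((float)M_PI/Wo);    /* number of harmonics below Nyquist */
    int   n, m;

    for(n=0; n<N; n++) {
        float acc = 0.0f;
        for(m=1; m<=L; m++)
            acc += cosf(m*Wo*n);
        s[n] = (short)(10000.0f*acc/L);
    }
}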

Non-Linear Pitch Estimator

I spent some time getting the C code version of the pitch estimator working. I have changed the post processor and pitch tracker from those described in Chapter 4 of [1]; see the nlp.c source code. It works OK over the 5 samples I have been testing but will no doubt make occasional errors on a wider database. Pitch estimators always do.

One thing I like about this project is that I am pushing some (I hope) useful open source DSP code out onto the Internet. For example I don’t know of any other open source pitch estimators (but please tell me if you know of any). A working pitch estimator would have saved me heaps of time in my speech coding research. There is a tnlp.c unit test that runs the pitch estimator independently of the rest of the codec. For this codec only a coarse pitch estimate is required, but pitch refinement is much easier than initial estimation if a finer resolution is required. Large (gross) errors are usually the big problem when estimating the pitch of human speech.

LPC Magnitude Modelling and Males

For some low F0 (e.g. male) speakers LPC modelling of the spectral magnitudes introduces some low frequency artifacts. This is because the LPC model spectrum can’t change quickly near 0Hz (I remember reading somewhere that the slope of the LPC spectrum is always 0 at 0Hz, which makes sense: the magnitude spectrum of a real filter is symmetric about DC, so its derivative there must be zero). Here is an example from frame 114 of mmt1:

LPC Modeling Magnitude Spectrum

The purple line on this rather busy plot is the LPC magnitude model. It does a pretty good job except for near 0Hz. The cyan line at the bottom is the LPC modeling error. In this case the LPC modeling error for the first harmonic is 30dB!

This problem is particularly bad for samples like mmt1 that have had strong high pass filtering applied to the original sample. This means that the fundamental (first) harmonic has been zapped, but the LPC modeling inserts it again at some arbitrary level.

The fix was to add a single bit to the magnitude information. We measure the error in the first harmonic magnitude A[1] at the encoder. If the LPC modeling error is larger than 6dB, a single bit is transmitted to the decoder. This bit instructs the decoder to attenuate A[1] by 30dB after recovering A[1] from the LPC model. This works well in practice – low F0 males, with and without high pass filtering, sound pretty good.
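
The logic is just a couple of comparisons (a sketch using the thresholds above; the function names are mine, and I assume the error of interest is the LPC model over-estimating A[1]):

#include <math.h>

/* Encoder: compare the measured first harmonic with the LPC model's
   version of it; return the 1-bit correction flag. */
int encode_A1_bit(float A1_measured, float A1_lpc)
{
    float err_db = 20.0f*log10f(A1_lpc/(A1_measured + 1e-6f));
    return (err_db > 6.0f);               /* 1 means "attenuate A[1]" */
}

/* Decoder: recover A[1] from the LPC model, then apply the bit. */
float decode_A1(float A1_lpc, int bit)
{
    return bit ? A1_lpc*powf(10.0f, -30.0f/20.0f) : A1_lpc;
}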

Samples

Here are some speech samples that show the current state of the codec algorithms. Hearing some of the differences I mention above needs a good set of headphones and patience. The differences are often quite small. However experience has shown that a lot of small differences can add up through the various processing stages. So it’s a good idea to work out the cause of any small coding artifacts and fix them as early as you can.

                              Male    Female   Male + truck   Male    Female
Original                      hts1a   hts2a    mmt1           morig   forig
Sinusoidal                    hts1a   hts2a    mmt1           morig   forig
Zero phase and post filter    hts1a   hts2a    mmt1           morig   forig
p=10 LPC modeled magnitudes   hts1a   hts2a    mmt1           morig   forig
Speex 1.11 8 kbit/s           hts1a   hts2a    mmt1           morig   forig

Jean-Marc, sorry about using an earlier version of Speex. It’s what I had easily available via an apt-get package. I used the default quality level for 8 kbit/s. I will upgrade these samples later (actually hts1a and hts2a are the latest samples from the Speex site).

Note: I can hear some tonal artifacts in hts2a, although the other female (forig) sounds much better. I have suspected for some time that hts2a might have some aliasing, an artifact of that day long ago in 1990 when I sampled it from a DAT player using a DSP32C development system ISA card.

Next Steps

I am fairly happy with the various algorithms, including pitch estimation, the zero phase model, and LPC magnitude modeling. Performance on clean speech, speech with high pass filtering, and speech with background noise is acceptable. Next step is to look again at the LSP quantisation of the LPC model, and work on reducing the frame rate by moving from 10ms to 20ms frames. We will then have a first pass 2400 bit/s codec ready for alpha testing.

Links

[1] Techniques for Harmonic Sinusoidal Coding
[2] Codec2 Web Page
[3] Open Source Low Rate Codec Part 1 – Project Kick off
[4] Open Source Low Rate Codec Part 2 – LPC Amplitude Modelling
[5] Open Source Low Rate Codec Part 3 – Phase and Male Speech

14 comments to Open Source Low Rate Codec Part 4 – Zero Phase Model

  • Hi David,

    Is there something wrong with your wave file samples? All the files in the forig column sound similar and sound very robotic. The same is true of all the files in the morig column. The files at the top of the list certainly don’t sound like originals.

    The hts1a and hts2a files have been used quite a bit, for a variety of purposes, since you first handed them around. I’ve never been happy with either, though. There is a gurgliness in the male voice that sounds unnatural. I was never convinced that’s how the man really sounded. The female voice has something odd about it too. It often degrades through processing more than other comparable female voice recordings, like something isn’t right about it. However, I never thought in terms of aliasing. Both the male and female recordings are pretty dead above 3.5kHz, when you look at their spectra. They certainly give the impression of any aliases being well suppressed.

  • david

    Hello Steve,

    I just downloaded all of the forig files to double check, and I think they are OK. Through headphones I can hear little artifacts. The original sounds original to me, but I guess the people speaking were staging it a bit; they had probably been reading Harvard test sentences all day long.

    So I figure that, in completely the opposite fashion to hts2, forig just codes very well. Speex is particularly good with this sample. This (along with morig) is a sample I obtained from the DVSI web site for AMBE. I found similar samples on a MELP site; I think they must be samples floating around the US for tests they have done there. I imagine these samples were posted to those web sites because they code fairly well.

    morig actually has a very low freq response – I can see some signal near DC around the word “jumped”. I have seen similar audio from soft phones (i.e. audio from headsets as distinct from phones) when testing Oslec.

    I like hts1 as the parts with very low pitch and aperiodic voicing play hell with pitch detectors and adaptive codebooks. Re aliasing on hts2, yeah I am not sure; in a few frames I recall seeing some lines that may have been harmonics above 4 kHz folded back into the passband. But the original doesn’t have the classic grating or mechanical sound of heavily aliased speech. It is rather “breathy”, not sure what that means for the spectrum (mixed voicing?) and I must confess I like samples that mess up speech coders :-)

    Cheers,

    David

  • Oh, there are certainly artifacts amongst the forig samples. They range from horrible, to slightly more horrible. :-) Those samples sound like they’ve been through a low bit rate codec already. Maybe that’s why they code well a second time around. Especially if the frames line up with the previous ones.

    morig certainly has more energy below 300Hz than you’d see off the PSTN, but then these are demos of military codecs.

    Steve

  • Pham Manh Tuan

    Hi David!
    Can Codec2 be used on the IP04 in the future?

  • Morgan Gibson

    Hi.
    Great to see someone working on a FOSS low bitrate vocoder!
    Is there any progress with codec2 development? Latest SVN version is abt. 8 weeks old…

    Thank you for spending your time and skills for this matchless open source project.

  • david

    Thanks for your encouragement Morgan. Not much progress lately, as you noted from the SVN logs. But I will get back to it.

  • Oscar IK1XPV

    Hi David,

    BRAVO! from Italy, your vocoder project is impressive.
    This is real HAM Spirit.
    Please continue.

    Ciao
    Oscar

  • Hello David,

    Congratulations! You have spent a lot of time on this project, I think.

    I want to develop a patent- and license-free DV air interface for existing VHF/UHF FM radios in the remote future (after the D-Star / APCO era).

    In my idea I want to use a higher baud rate, around 8000 baud, modulated with 2 bits per symbol. I prefer to use Speex wideband, with the option that the receiver can fall back to the narrowband part of the DV signal if the signal gets weak.

    I have developed a little piece of hardware (called DV-Modem) to expand the capabilities of a modified former mobile phone (Siemens C5). Now it works with D-Star (using the AMBE-2020). I plan to implement APCO P25 too (I hope I find a way to convert IMBE packets to AMBE without violating DVSI’s patents).

    Maybe your Codec2 sounds like the right thing for transmitting 4 DV streams simultaneously from a relay station (instead of one wideband stream).

    Definitely I want to implement a fixed-point version on the AVR32 µC I use on my hw, but I don’t have the mathematical basics to help you develop your codec – sri.

    Kind regards,
    Jan, DO1FJN

  • david

    Hi Jan,

    Thank you for your comments. I think a license/patent free DV air interface is a very important step. I would suggest you make the bit rate flexible, e.g. 1200-8000 bit/s.

    Cheers,

    David

  • Hi David,

    tnx for the reply. Yes, I plan a flexible framing. There is a frame-id after the frame-preamble, coded in the sync, so it is possible to encode different types of digital voice or data packets frame by frame –
    from ~10 kbit/s (wideband) speech down to 4 × 2.5 kbit/s DV streams.

    Maybe you implement a wideband enhancement (4-8kHz) for your Codec2?

    In my design, the decoder can use the narrowband part alone if the wideband addition is defective (weak signals). Voice is divided into 5 × 40ms blocks (one frame), so I can interleave 2 × 20ms voice + FEC for better short-fade immunity.

    A 2nd design decision I made is strictly separating the frame/protocol from control data (routing…). It isn’t a part of the “air interface”… For network capability a data frame and/or inband data must be used.

    If you have time, you can discuss my ideas… Just send me an email.

    Ciao, Jan

    • Ross, ZL2WRW

      Hi,

      I too want to develop a free-as-in-speech digital ham radio mode to replace both D* and AX.25, but I will have to wait until after my November exams before I can really get into codec2.

      I agree that variable over the air modulation rates are a good idea.

      I think that TDMA operation (as in TETRA, DECT and GSM) at the higher modulation rates (e.g. 4 slots with 3k6 bps throughput/slot @ 19k2 bps GMSK BT=0.3 modulation on a 25 kHz channel) would be a good idea, because it would for example allow:
      - Simultaneous voice QSOs and/or data on the one repeater (a single 25 kHz channel repeater would only be able to RX on half the slots and TX on the other half, but a split RX/TX frequency repeater would have full capacity – 4 RX and 4 TX slots).
      - Telephone style “full duplex” communications on a single 25 kHz simplex channel.
      - Any radio can be quickly configured to operate as a single channel repeater – no cavity filters or crossbanding required (very useful for public service, search and rescue or civil defense purposes).
      - Mobile and portable users would be able to “ping” a repeater to see if they have good coverage without “kerchunking” it for all to hear.

      The downside is that we will all need new radio equipment, because our current rigs are mostly far too slow to switch between RX & TX for TDMA use.

      Would this be enough of an improvement on D* to render it obsolete? What do you think?

      73's Ross

  • Ingo Schmittner

    First of all: thanks for working on this project!
    Just my 2 ct. regarding the post filter / background noise problem described above: try using gammatone filters to factor in an auditory speech model. AFAIK this approach is taken in hearing aids and the latest cochlear implants. I think it will improve Wo estimation in noisy environments.

  • david

    Thanks Ingo – I have never heard of a gammatone filter but will check it out. So far the pitch estimator seems to be very robust, but voicing estimation needs work.

  • Thomas Kalka

    If you are still interested in other open source pitch estimation routines you will find
    one implementation inside praat (http://www.fon.hum.uva.nl/praat/)
