# Simple Keras “Hello World” Example – Mean Removal

Inspired by the Wavenet work with Codec 2 I’m dipping my toe into the word of Deep Learning (DL) using Keras. I’ve read Deep Learning with Python (quite an enjoyable read) and set up a Linux box with a GTX graphics card that is making my teenage sons weep with jealousy.

So after a couple of days of messing about here is my first “hello world” Keras example: mean_removal.py. It might be helpful for other Keras noobs. Assuming you have all the packages installed, it runs with either Python 2:

```\$ python mean_removal.py
```

Or Python 3:

```\$ python3 mean_removal.py
```

It removes the mean from vectors, using just a single layer regression model. The script runs OK on a regular PC without a chunky graphics card.

So I start by generating vectors from random numbers with a zero mean. I then add a random offset to each sample in the vector. Here are 5 vectors with random offsets added to them:

The Keras script is then trained to estimate and remove the offsets, so the output vectors look like:

Estimating the offset is the same as finding the “mean” of the vector. Yes I know we can do that with a “mean” function, but where’s the fun in that!

Here are some other plots that show the training and validation measures, and error metrics at the output:

The last two plots show pretty much all of the offset is removed and it restores the original (non offset) vectors with just a tiny bit of noise. I had to wind back the learning rate to get it to converge without “NAN” losses, possibly as I’m using fairly big input numbers. I’m familiar with the idea of learning rate from NLMS adaptive filters, such as those used for my work in echo cancellation.

Deep Learning for Codec 2

My initial ambitions for DL are somewhat more modest than the sample-by-sample synthesis used in the Wavenet work. I have some problems with Vector Quantisation (VQ) in the low rate Codec 2 modes. The VQ is used to compactly describe the speech spectrum, which carries the intelligibility of the signal.

The VQ gets upset with different microphones, speakers, or minor spectral shaping like gentle high pass/low pass filtering. This shaping often causes a poor vector to be chosen, which results in crappy speech quality. The current VQ error measure can’t tell the difference between spectral features that matter and those that don’t.

So I’d like to try DL to address those issues, and train a system to say “look, this speech and this speech are actually the same. Yes I know one of them has a 3dB/octave low pass filter, please ignore that”.

As emphasised in the text book above, some feature extraction can help with DL problems. For my first pass I’ll be working on parameters extracted by the Codec 2 model (like a compact version of the speech spectrum) rather than speech samples like Wavenet. This will reduce my CPU load significantly, at the expense of speech quality, which will be limited by the unquantised Codec 2 model. But that’s OK as a first step. A notch or two up on Codec 2 at 700 bit/s would be very useful, especially if it can run on a CPU from the first two decades of the 21st century.

Mean Removal on Speech Vectors

So to get started with Keras I chose mean removal. The mean level or constant offset is like the volume or energy in a speech signal, its the simplest form of spectral shaping I could imagine. I trained and tested it with vectors of random numbers, using numbers in the range of the speech spectral samples that Codec 2 plays with.

It’s a bit like an equaliser, vectors with arbitrary spectral shaping go in, “flat” unshaped vectors come out. They can then be sent to a Vector Quantiser. There are probably smarter ways to do this, but I need to start somewhere.

So as a next step I tried offset removal with vectors that represent the spectrum of 40ms speech frame:

This is pretty cool – the network was trained on random numbers but works well with real speech frames. You can also see the spectral slope I mentioned above, the speech energy gradually falls off at high frequencies. This doesn’t affect the intelligibility of the speech but tends to upset traditional Vector Quantisers. Especially mine.

Now that I have something super-basic working, the next step is to train and test networks to deal with some non-trivial spectral shaping.