ARM NEON Optimisation

I’ve been trying to optimise NEON DSP code on a Raspberry Pi. Using the intrinsics I managed a speed increase of about 3 times over vanilla C with just a few hours’ work. However the results are still significantly slower than the theoretical speed of the machine, which is 4 multiply-accumulates (8 float operations) per cycle. On a 1.2 GHz core that’s 9.6 GFLOPs.

Since then I’ve been looking at ARM manuals, Googling, and trying various ad-hoc ideas. There is a lack of working, fully optimised code examples, and I can’t find any data on cycle times and latency information for the Cortex A53 device used for the Rpi. The number of ARM devices and families is bewildering, and trying to find information in a series of thousand-page manuals is daunting.

Fortunately the same NEON assembler seems to work (i.e. it assembles cleanly and you get the right results) on many ARM machines. It’s just unclear how fast it runs and why.

To get a handle on the problem I wrote a series of simple floating point dot product programs, and attempted to optimise them. Each program runs through a total of 1E9 dot product points, using an inner and outer loop. I made the inner loop pretty small (1000 floats) to try to avoid cache miss issues. Here are the results, using cycle counts measured with “perf”:

Program   Test                                        Theory cycles/loop   Measured cycles/loop   GFLOPS
dot1      Dot product, no memory reads                1                    4                      1.2*8/4 = 2.4
dot2      Dot product, no memory reads, unrolled      1                    1                      1.2*8/1 = 9.6
dot3      Dot product with memory reads               3                    9.6                    1.2*8/9.6 = 1
dot4      Dot product with memory reads, assembler    3                    6.1                    1.2*8/6.1 = 1.6
dotne10   Dot product with memory reads, Ne10         3                    11                     1.2*8/11 = 0.87

Cycles/loop is how many cycles are executed for one iteration of the inner loop. The last column assumes a 1.2 GHz clock, and 8 floating point ops for every NEON vector multiply-accumulate (vmla.f32) instruction (a multiply and an add for each of the 4 floats in the vector, processed in parallel).

The only real success I had was dot2, but that’s an unrealistic example as it doesn’t read memory in the inner loop. I guessed that the latencies in the NEON pipeline meant an unrolled loop would work better, as sketched below.
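To illustrate the idea, here is a minimal sketch of an unrolled inner loop using NEON intrinsics. This is not the actual neon-dot source (that’s on GitLab), just the general pattern: four independent accumulators so consecutive multiply-accumulates don’t stall waiting on each other’s results:

    #include <arm_neon.h>

    /* Sketch only: dot product of n floats (n a multiple of 16), with
       four independent accumulators to hide the vmla result latency. */
    float dot_unrolled(const float *x, const float *y, int n)
    {
        float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
        float32x4_t acc2 = vdupq_n_f32(0.0f), acc3 = vdupq_n_f32(0.0f);

        for (int i = 0; i < n; i += 16) {
            acc0 = vmlaq_f32(acc0, vld1q_f32(&x[i]),      vld1q_f32(&y[i]));
            acc1 = vmlaq_f32(acc1, vld1q_f32(&x[i + 4]),  vld1q_f32(&y[i + 4]));
            acc2 = vmlaq_f32(acc2, vld1q_f32(&x[i + 8]),  vld1q_f32(&y[i + 8]));
            acc3 = vmlaq_f32(acc3, vld1q_f32(&x[i + 12]), vld1q_f32(&y[i + 12]));
        }

        /* combine the four vector accumulators and sum the lanes */
        float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(s, s), 0);
    }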

Assuming (as I can’t find any data on instruction timing) two cycles for the memory reads and one for the multiply-accumulate, I was hoping for 3 cycles/loop for dot3 and dot4 – maybe even better if there is some dual issue magic going on. The best I can do is 6 cycles.

I’d rather have enough information to “engineer” the system than have to rely on guesses. I’ve worked on many similar DSP optimisation projects in the past which have had data sheets and worked examples as a starting point.

Here is the neon-dot source code on GitLab. If you can make the code run faster – please send me a patch! The output looks something like:

$ make test
sum: 4e+09 FLOPS: 8e+09
sum: 4e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
FLOPS: 4e+09
grep cycles dot_log.txt
     4,002,420,630      cycles:u    
     1,000,606,020      cycles:u    
     9,150,727,368      cycles:u
     6,361,410,330      cycles:u
    11,047,080,010      cycles:u

The dotne10 program requires the Ne10 library. There’s a bit of floating point round off in some of the program outputs (adding 1.0 to a big number); that’s not really a bug.

Some resources I did find useful:

  1. The tterribe NEON tutorial. I’m not sure if the A53 has the same cycle timings as the Cortex-A device discussed in that document.
  2. ARM docs: I looked at the D0487 ARMv8 Architecture Reference Manual, the DDI500 Cortex-A53 TRM, and the DDI502 Cortex-A53 FPU TRM, which reference the DEN013 ARM Cortex-A Series Programmer’s Guide. I couldn’t find any instruction cycle timing in any of them, but section 20.2 of DEN013 had some general tips.
  3. Linux perf was useful for cycle counts, and in record/report mode may help visualise pipeline stalls (though I’m unclear if that’s what I’m seeing, due to my limited understanding).

Codec 2 and TWELP

DSP Innovations have recently published comparisons of Codec 2 with their TWELP codec at 2400 and 600 bit/s.

Along with some spirited rhetoric, they have published some TWELP 600 samples (including source). The comparison, especially in the 600 bit/s range, is very useful to my work.

I’ve extracted a random subset of the 600 bit/s a_eng.wav samples, broken up into small chunks to make them easier to compare. Have a listen, and see what you think:

Sample   Source   MELP 600e   Codec 2 700C   TWELP 600
1        Listen   Listen      Listen         Listen
2        Listen   Listen      Listen         Listen
3        Listen   Listen      Listen         Listen
4        Listen   Listen      Listen         Listen
5        Listen   Listen      Listen         Listen
6        Listen   Listen      Listen         Listen

The samples do have quite a bit of background noise. The usual approach for noisy samples is to apply a noise suppression algorithm first, e.g. we use the Speex noise suppression in FreeDV. However this is also a test of the codec’s robustness to background noise, so I didn’t perform any noise suppression on the Codec 2 samples.

Comparison

I am broadly in agreement with their results. Using the samples provided, the TWELP codec appears to be comparable to MELP 2400, with Codec 2 2400 a little behind both. This is consistent with other Codec 2 versus MELP/AMBE comparisons at 2400 bit/s. That’s not a rate I have been focussing on; most of my work has been directed at the lower rates required for HF digital voice.

I think – for these samples – their 600 bit/s codec also does better than Codec 2 700C, but not by a huge margin. Their results support our previous findings that Codec 2 is as good as (or even a little better than) MELP 600e. It does depend on the samples used, as I will explain below.

DSP Innovations have done some fine work in handling non-speech signals, a common weakness with speech codecs in this range.

Technology Claims

As to claims of superior technology, and “30 year old technology”:

  1. MELP 2400 was developed in the 1990s, and DSP Innovations’ own results show similar speech quality, especially at 2400 bit/s.
  2. AMBE is in widespread use, and uses a very similar harmonic sinusoidal model to Codec 2.
  3. The fundamental work on speech compression was done in the 1970s and 80s, and much of what we use today (e.g. in your mobile phone) is based on incremental advances on that.
  4. As any reader of this blog will know, Codec 2 has been under continual development for the past decade. I haven’t finished, still plenty of “DSP Innovation” to come!

While a fine piece of engineering, TWELP isn’t in a class of its own – it’s still a communications quality speech codec in the MELP/AMBE/Codec 2 quality range. They have not released any details of their algorithms, so the algorithms cannot be evaluated objectively by peer review.

PESQ and Perceptual evaluation of speech quality

DSP Innovations makes extensive use of the PESQ measure, for both this study and for comparisons to other competitors.

Speech quality is notoriously hard to estimate. The best way is through controlled subjective testing, but this is expensive and time consuming. A utility to accurately estimate fine differences in speech quality would be a wonderful research tool. However in my experience (and that of the speech coding R&D community in general), such a tool does not exist.

The problem is even worse for speech codecs beneath 4 kbit/s, as they distort the signal so significantly.

The P.862 standard acknowledges these limits, and explicitly states in Table 3 “Factors, technologies and applications for which PESQ has not currently been validated … CELP and hybrid codecs < 4 kbit/s”. The standard they are quoting does not support the use of PESQ for their tests.

PESQ is designed for phone networks and much higher bit rate codecs. In section 2 of the standard they present best-case correlation results of +/- 0.5 MOS points (note that on a scale of 1-5, this is roughly +/- 10% error). And that’s when it is used for the > 4 kbit/s speech codecs it was designed for.

So DSP Innovations’ statements like “Superiority of the TWELP 2400 and MELPe 2400 over CODEC2 2400 is on average 0.443 and 0.324 PESQ appropriately” are unlikely to be statistically valid.

The PESQ algorithm (Figure 4a of the standard) throws away all phase information, keeping just the FFT power spectrum. This means it cannot evaluate aspects of the speech signal that are very important for speech quality. For example, PESQ could not tell the difference between voiced speech (like a vowel) and unvoiced speech (like a consonant) with the same power spectrum.

DSP Innovations haven’t shown any error bars or standard deviations on their results. Even the best subjective tests will have error bars wider than the PESQ results DSP Innovations are claiming as significant.

I do sympathise with them. This isn’t a huge market, they are a small company, and subjective testing is expensive. Numbers look good on a commercial web site in a marketing sense. However I suggest you disregard the PESQ numbers.

Speech Samples

Speech codecs tend to work well with some samples and fall over with others. It is natural to present the best examples of your product. DSP Innovations chose what speech material they would present in their evaluation of Codec 2. I have asked them to give me the same courtesy and code speech samples of my choice using TWELP. I have received no response to my request.

Support and Porting

An open source codec can be ported to another machine in seconds (rather than months that DSP Innovations quote) with a cross compiler. At no cost.

Having the source code makes minor problems easy to fix yourself. We have a community that can answer many questions. For tougher issues, well, I’m available for paid support – just like DSP Innovations.

Also … well, open source is just plain cool. As a reminder, here are the reasons I started Codec 2, nearly 10 years ago.

To quote myself:

A free codec helps a large number of people and promotes development and innovation. A closed codec helps a small number of people make money at the expense of stifled business and technical development for the majority.

Reading Further

Open Source Low Rate Speech Codec Part 1, the post that started Codec 2.
P.862 PESQ standard.
CODEC2 vs TWELP on 2400 bps. DSP Innovations evaluate Codec 2, MELP, and TWELP at 2400 bits/s.
CODEC2 vs TWELP on 700 bps. DSP Innovations evaluate Codec 2, MELP, and TWELP at 600 (ish) bits/s.
AMBE+2 and MELPe 600 Compared to Codec 2. An earlier comparison, using samples from DSP Innovations.

LPCNet meets Codec 2

The previous post described my attempts to come up to speed with NN based speech synthesis, with the kind help of Jean-Marc Valin and his LPCNet system.

As an exercise, I have adapted LPCNet to use Codec 2 features, and have managed to synthesise high quality speech at a sample rate of 8kHz. Here are the output speech samples:

Sample        Original   LPCNet Codec 2
cq_ref        Listen     Listen
hts1a         Listen     Listen
hts2a         Listen     Listen
mmt1          Listen     Listen
morig         Listen     Listen
speech_orig   Listen     Listen

I’m happy with all of the samples except cq_ref. That sample has a lot of low freq energy (like the pitch fundamental) which may not have been well represented in the training database. mmt1 has some artefacts, but this system already does better than any other low rate codec on this sample.

This is not quite a quantised speech codec, as I used unquantised Codec 2 parameters (10 Line Spectral Pairs, pitch, energy, and a binary voicing flag). However it does show how LPCNet (and indeed NN synthesis in general) can be trained to use different sets of input features, and the system I have built is close to an open source version of the Codec 2/NN system presented by Kleijn et al.
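For reference, here is the per-frame feature set as a hypothetical C struct. The field names are mine, not from the LPCNet or Codec 2 sources, and in practice the features travel as a flat vector of floats:

    /* Hypothetical layout of the Codec 2 features driving LPCNet */
    struct codec2_features {
        float lsp[10];   /* Line Spectral Pairs: spectral envelope */
        float pitch;     /* pitch (fundamental frequency) estimate */
        float energy;    /* frame energy */
        float voiced;    /* binary voicing flag, 0 or 1 */
    };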

Why 8 kHz rather than the higher quality 16 kHz? Well, LPCNet requires a set of Linear Prediction Coefficients (LPCs), and the LPCs dumped by Codec 2 are sampled at 8 kHz. It’s possible, but not straightforward, to resample the LPC spectra at 16 kHz, but I chose to avoid that step for now.

Training

My initial attempts led to good quality speech using samples from within the training database, but poor quality on speech samples (like the venerable hts1a) from outside the training database. In Machine Learning land, this suggests “not enough training data”. So I dug up an old TIMIT speech sample database, and did a bunch of filtering on my speech samples to simulate what I have seen from microphones in my Codec 2 adventures. It’s all described in gory detail here (Training Tips section). Then, much to my surprise, it worked! Clean, good quality speech from all sorts of samples.

Further Work

  • Add code to generate 16 kHz LPCs from 8 kHz LPCs and try for 16 kHz synthesised speech
  • Use quantised Codec 2 parameters from say Codec 2 2400 or 1300 and see how it sounds.
  • Help Jean-Marc convert LPCNet to C and get it running in real time on commodity hardware.
  • Make a real world, over the air contact using NN based speech synthesis and FreeDV.
  • A computationally large part of the LPCNet (and indeed any *Net speech synthesis system) is dedicated to handling periodic pitch information. The harmonic sinusoidal model used in Codec 2 can remove this information and hence much of the CPU load. So a dramatic further reduction in the number of weights (and hence CPU load) is possible, although this may result in some quality reduction. Another way of looking at this (as highlighted by Jean-Marc’s LPCNet paper) is “how do we model the excitation” in source/filter type speech systems.
  • The Kleijn et al paper had the remarkable result that we can synthesise high quality speech from low bit rate Codec 2 features. What is the trade off between the bit rate of the features and the speech quality? How coarsely can we quantise the speech features and still get high quality speech? How much of the quality is due to the NN, and how much to the speech features?

Reading Further

Jean Marc’s blog post on LPCNet, including links to LPCNet source code and his ICASSP 2019 paper.
WaveNet and Codec 2
Source Code for my Codec 2 version of LPCNet

LPCNet – Open Source Neural Net Speech Synthesis

Jean-Marc Valin has been working on Neural Network (NN) based speech synthesis in his project called LPCNet. It has similar speech quality to Wavenet, but is based on an architecture called WaveRNN, and includes many new innovations.

Jean-Marc’s work is aimed at reducing the synthesis CPU load down to the level of a modern CPU, for example a mobile phone or Raspberry Pi, and he has made significant progress in that direction.

As well as being useful for his research – this code is a working, open source reference system for Neural Net (NN) based synthesis projects. He has also written an ICASSP 2019 paper on LPCNet, which explains many of the finer details of NN speech synthesis. Fantastic resources for other people coming up to speed in NN synthesis. Well done Jean-Marc!

Over the past few weeks Jean-Marc has kindly answered many NN-noob questions from me. I have used the answers to comment his code and add to his README. There are still many aspects of how this code works that I do not understand. However I can drive his software well enough to synthesise high quality speech:


The first sample was from inside the training database, the second outside.

The network is driven by some speech codec like parameters, but it’s not actually running as a speech codec at present. However it’s a great starting point for high quality speech (de)coding, or indeed speech synthesis.

How I trained

My GTX1060 GPU isn’t quite up to spec, so for training I had to reduce the batch_size to 16, and run for 60 epochs. I used the TSP speech database discussed in the LPCNet README, and followed Jean-Marc’s suggestion of resampling it twice (once at +5% Fs, once at -5% Fs) to get 3x the training data. Training took me 14 hours. Synthesis runs 10 times slower than real time on my GPU, however much of this is overhead; if the Keras code was ported to C it would be close to real time on a modern laptop/phone CPU.

References

Jean Marc’s blog post on LPCNet, including links to LPCNet source code and his ICASSP 2019 paper.
WaveNet and Codec 2
WaveRNN
FFTNet, some good figures that helped me get my head around the idea of sampling a probability distribution.

Codec 2 2200 Candidate D

Every time I start working on Deep Learning and Codec 2 I get side tracked! This time I started developing a reference codec that could be used to explore machine learning. However the reference codec was sounding pretty good early in its development, so I have pushed it through to a fully quantised state. For lack of a better name it’s called Candidate D, as that’s where I am up to in a series of codec prototypes.

The previous Codec 2 2200 post described candidate C. That also evolved from a “quick effort” to develop a reference codec to explore my deep learning ideas.

Learning about Vector Quantisation

This time, I explored Vector Quantisation (VQ) of spectral magnitude samples. I feel my VQ skills are weak, so I did a bit of reading. I really enjoy learning, especially in areas I have been fooling around in for a while but never really understood. It’s a special feeling when the theory clicks into place with the practical.

So I have these vectors of K=40 spectral magnitude samples that I want to quantise. Forty dimensions is a bit much to handle, so to get a feel for the data I started out by plotting smaller 2 and 3 dimensional slices. Here are 2D and 3D scatter plots of adjacent samples in the vector:


The data is highly correlated, almost a straight line relationship. An example of a 2-bit, 2D vector quantiser for this data might be the points (0,0), (20,20), (30,30), (40,40). Consider representing the same data with two 1D (scalar) quantisers over the same 2 bit range (0,20,30,40). This would take 4 bits in total, and be wasteful as it would represent points that never occur, such as (40,0).
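Here’s a toy implementation of that 2 bit, 2D quantiser, just to make the idea concrete; a full-rate VQ search is the same thing with a bigger codebook and K=40:

    #include <float.h>

    /* Toy 2 bit, 2D vector quantiser from the example above: return the
       index of the nearest codebook entry by minimum squared error. */
    static const float cb[4][2] = { {0, 0}, {20, 20}, {30, 30}, {40, 40} };

    int vq_search(const float v[2])
    {
        int best = 0;
        float best_e = FLT_MAX;
        for (int i = 0; i < 4; i++) {
            float e0 = v[0] - cb[i][0], e1 = v[1] - cb[i][1];
            float e = e0 * e0 + e1 * e1;
            if (e < best_e) { best_e = e; best = i; }
        }
        return best;   /* 2 bit index */
    }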

[1] helped me understand the relationship between covariance and VQ, using 2D vectors. For Candidate D I extended this to K=40 dimensions, the number of samples I am using for the spectral magnitudes. Then a thirty year old paper [2] showed me how the DCT relates to vector quantisation and the eigenvector/eigenvalue rotations described in [1]. I vaguely remember snoring my way through eigen-thingies in math lectures at University!

My VQ work to date has used minimum Mean Square Error (MSE) to train and match vectors. I have been uncomfortable with MSE matching for a while, as I have observed poor choices in matching vectors to speech. For example if the target vector falls off sharply at high frequencies (say a LPF at 3500 Hz), the VQ will try to select a vector that matches that fall off, and ignore smaller, more perceptually important features like formants.

VQs are often trained to minimise the average error, so they tend to cluster VQ points closer to the samples that are more likely to occur. However I have found that beneath a certain threshold we can’t hear the quantisation error. In Codec 2 it’s hard to hear any distortion when spectral magnitudes are quantised to 6 dB steps. This suggests that we are wasting bits with fine quantiser steps, and that there may be better ways to design VQs, for example a uniform grid of points that covers a few standard deviations of the data on the scatter plots above.

I like the idea of uniform quantisation across vector dimensions, and the concepts I learnt during this work allowed me to do just that. The DCT effectively lets me use scalar quantisation of each vector element, so I can easily choose any quantiser shape I like.
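As a sketch of that idea (not the actual Candidate D code, which is in Octave and adds Huffman coding and a bit limit), a DCT followed by the same uniform scalar quantiser on every coefficient might look like:

    #include <math.h>

    #define K 40

    /* Sketch: DCT-II of the K=40 magnitude vector, then uniform scalar
       quantisation of each coefficient with the same step size. */
    void dct_quantise(const float mag_dB[K], float dct_hat[K], float step)
    {
        float dct[K];
        for (int k = 0; k < K; k++) {
            float sum = 0.0f;
            for (int n = 0; n < K; n++)
                sum += mag_dB[n] * cosf((float)M_PI * (n + 0.5f) * k / K);
            dct[k] = sum;
        }
        for (int k = 0; k < K; k++)
            dct_hat[k] = step * roundf(dct[k] / step);
    }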

Spectral Quantiser

Candidate D uses a similar design and bit allocation to Candidate C, with K=40 resampling of the spectral magnitudes to help preserve the narrow high frequency formants that are present for low pitch speakers like hts1a. The DCT of the rate K vectors is computed, and quantised using a Huffman code.

There are not enough bits to quantise all of the coefficients, so we stop when we run out of bits, typically after 15 or 20 (out of a total of 40) DCT coefficients. On each frame the algorithm tries direct or differential quantisation, and chooses the method with the lowest error.

Results

I have a couple of small databases that I use for listening tests (about 15 samples in total). I feel Candidate D is better than Codec 2 1300, and also Codec 2 2400 for most (but not all) samples.

In particular, Candidate D handles samples with lots of low frequency energy better, e.g. cq_ref and kristoff in the table below.

Sample       1300     2400     2200 D
cq_ref       Listen   Listen   Listen
kristoff     Listen   Listen   Listen
me           Listen   Listen   Listen
vk5local_1   Listen   Listen   Listen
ebs          Listen   Listen   Listen

For a high quality FreeDV mode I want to improve speech quality over FreeDV 1600 (which uses Codec 2 1300 plus some FEC bits), and provide better robustness to different speakers and recording conditions. As you can hear – there is a significant jump in quality between the 1300 bit/s codec and candidate D. Implemented as a FreeDV mode, it would compare well with SSB at high SNRs.

Next Steps

There are many aspects of Candidate D that could be explored:

  • Wideband audio, like the work from last year.
  • Back to my original aim of exploring deep learning with Codec 2.
  • Computing the DCT coefficients from the rate L (time varying) magnitude samples.
  • Better time/freq quantisation using a 2D DCT rather than the simple difference in time scheme used for Candidate D.
  • Porting to C and developing a real time FreeDV 2200 mode.

The current candidate D 2200 codec is implemented in Octave, so porting to C is required before it is usable for real world applications, plus some more C to integrate with FreeDV.

If anyone would like to help, please let me know. It’s fairly straightforward C coding – I have already done the DSP. You’ll learn a lot, and be part of the open source future of digital radio.

Reading Further

[1] A geometric interpretation of the covariance matrix, really helped me understand what was going on with VQ in 2 dimensions, which can then be extended to larger dimensions.

[2] Vector Quantization in Speech Coding, Makhoul et al.

[3] Codec 2 Wideband, previous DCT based Codec 2 work.

Porting a LDPC Decoder to a STM32 Microcontroller

A few months ago, FreeDV 700D was released. In that post, I asked for volunteers to help port 700D to the STM32 microcontroller used for the SM1000. Don Reid, W7DMR stepped up – and has been doing a fantastic job porting modules of C code from the x86 to the STM32.

Here is a guest post from Don, explaining how he has managed to get a powerful LDPC decoder running on the STM32.

LDPC for the STM32

The 700D mode and its LDPC function were developed and used on desktop (x86) platforms. The LDPC decoder is implemented in the mpdecode_core.c source file.

We’d like to run the decoder on the SM1000 platform, which has an STM32F4 processor. This raised the following issues:

  • The code used doubles in several places, while the STM32 has only single precision floating point hardware.
  • It was thought that the memory used might be too much for a system with just 192k bytes of RAM.
  • There are 2 LDPC codes currently supported: HRA_112_112, used in 700D, and H2064_516_sparse, used for balloon telemetry. While only the 700D configuration needed to work on the STM32 platform, any changes made to the mainstream code needed to keep working with the H2064_516_sparse code.

Testing

Before making changes it was important to have a well defined test process to validate new versions. This allowed each change to be validated as it was made. Without this the final debugging would likely have been very difficult.

The ldpc_enc utility can generate standard test frames, and the ldpc_dec utility receives the frames and measures bit errors, so errors can be detected directly and the BER computed. ldpc_enc can also output soft decision symbols, to emulate what the modem would receive and pass to the LDPC decoder. A new utility, ldpc_noise, was written to add AWGN to the sample values between the above utilities. Here is a sample run:

$ ./ldpc_enc /dev/zero - --sd --code HRA_112_112 --testframes 100 | ./ldpc_noise - - 1 | ./ldpc_dec - /dev/null --code HRA_112_112 --sd --testframes
single sided NodB = 1.000000, No = 1.258925
code: HRA_112_112
code: HRA_112_112
Nframes: 100
CodeLength: 224 offset: 0
measured double sided (real) noise power: 0.640595
total iters 3934
Raw Tbits..: 22400 Terr: 2405 BER: 0.107
Coded Tbits: 11200 Terr: 134 BER: 0.012

ldpc_noise is passed a “No” (N-zero) level of 1 dB. The test frames have Eb = 0 dB, so Eb/No = 0 - 1 = -1 dB, and we get a 10% raw BER, and 1% after LDPC decoding. This is a typical operating point for 700D.

A shell script (ldpc_check) combines several runs of these utilities, checks the results, and provides a final pass/fail indication.

All changes were made to new copies of the source files (named *_test*) so that current users of codec2-dev were not disrupted, and so that the behaviour could be compared to the “released” version.

Unused Functions

The code contained several functions which are not used anywhere in the FreeDV/Codec2 system. Removing these made it easier to see the code that was used and allowed the removal of some variables and record elements to reduce the memory used.

First Compiles

The first attempt at compiling for the STM32 platform showed that the code required more memory than was available on the processor. The STM32F405 used in the SM1000 system has 128k bytes of main RAM.

The largest single item was the DecodedBits array, which was used to save the results of each iteration, using 32 bit integers, one per decoded bit.

    int *DecodedBits = calloc( max_iter*CodeLength, sizeof( int ) );

This used almost 90k bytes!

The decode function used for FreeDV (SumProducts) used only the last decoded set. So the code was changed to save only one pass of values, using 8 bit integers. This reduced the ~90k bytes to just 224 bytes!
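In code terms, the change looks something like this (sizes for HRA_112_112 with CodeLength = 224 and 100 iterations; the names are illustrative, not the actual mpdecode_core.c code):

    #include <stdint.h>
    #include <stdlib.h>

    #define MAX_ITER   100
    #define CODELENGTH 224   /* HRA_112_112 */

    void decoded_bits_before_and_after(void)
    {
        /* before: one 32 bit int per decoded bit, per iteration:
           100 * 224 * 4 = 89,600 bytes of heap */
        int *all_iters = calloc(MAX_ITER * CODELENGTH, sizeof(int));
        free(all_iters);

        /* after: keep only the last pass, as 8 bit values: 224 bytes */
        uint8_t last_pass[CODELENGTH];
        (void)last_pass;
    }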

The FreeDV 700D mode requires one LDPC decode every 160 ms. At this point the code compiled and ran, but was too slow – around 25 ms per iteration, or 300 to 2500 ms per frame!

C/V Nodes

The two main data structures of the LDPC decoder are c_nodes and v_nodes. Each is an array where each node contains additional arrays. In the original code these structures used over 17k bytes for the HRA_112_112 code.

Some of the elements of the c and v nodes (index, socket) are indexes into these arrays. Changing these from 32 bit to 16 bit integers, and changing the sign element into an 8 bit char, saved about 6k bytes.

The next problem was the run time. Each 700D frame must be fully processed in 160 ms and the decoder was taking several times this long. The CPU load was traced to the phi0() function, which was calling two maths library functions. After optimising the phi0 function (see below) the largest use of time was the index computations of the nested loops which accessed these c and v node structures.

With each node having separate arrays for index, socket, sign, and message these indexes had to be computed separately. By changing the node structures to hold an array of sub-nodes instead this index computation time was significantly reduced. An additional benefit was about a 4x reduction in the number of memory blocks allocated. Each allocation block includes additional memory used by malloc() and free() so reducing the number of blocks reduces memory use and possible heap fragmentation.
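A hypothetical before/after of the v node layout (the field names are illustrative, not the actual mpdecode_core.c definitions):

    #include <stdint.h>

    /* before: several parallel arrays per node, each separately
       allocated, with 32 bit indexes */
    struct v_node_before {
        int    degree;
        int   *index;
        int   *socket;
        float *message;
        int   *sign;
    };

    /* after: one array of packed sub-nodes per node, 16 bit indexes and
       an 8 bit sign, so each edge's data is adjacent in memory and a
       single index computation finds all of it */
    struct v_sub_node {
        uint16_t index;
        uint16_t socket;
        float    message;
        uint8_t  sign;
    };

    struct v_node_after {
        int degree;
        struct v_sub_node *subs;
    };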

Additional time was saved by only calculating the degree elements of the c and v nodes at start-up rather than for every frame. That data is kept in memory that is statically allocated when the decoder is initialized. This costs some memory but saves time.

This still left the code calling malloc several hundred times for each frame and then freeing that memory later. This sort of memory allocation activity has been known to cause troubles in some embedded systems and is usually avoided. However the LDPC decoder needed too much memory to allow it to be statically allocated at startup and not shared with other parts of the code.

Instead of allocating an array of sub-nodes for each c or v node, a single array of bytes is passed in from the parent. The initialization function which calculates the degree elements of the nodes also counts up the memory space needed and reports this to its caller. When the decoder is called for a frame, the node’s pointers are set to use the space of this array.

Other arrays that the decoder needs were added to this to further reduce the number of separate allocation blocks.

This leaves the decisions of how to allocate and share this memory up to a higher level of the code. The plan is to continue to use malloc() and free() at a higher level initially. Further testing can be done to look for memory leakage and optimise overall memory usage on the STM32.

PHI0

There is a non linear function named “phi0” which is called inside several levels of nested loops within the decoder. The basic operation is:

   phi0(x) = ln( (e^x + 1) / (e^x - 1) )

The original code used double precision exp() and log(), even though the input, output, and intermediate values are all floats. This was probably an oversight. Changing to the single precision versions expf() and logf() provided some improvement, but not enough to meet our CPU load goal.
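For reference, the straightforward single precision version looks like this (a sketch; the table driven replacement described next does away with the library calls entirely):

    #include <math.h>

    static inline float phi0(float x)
    {
        float e = expf(x);
        return logf((e + 1.0f) / (e - 1.0f));
    }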

The original code used a piecewise approximation for some input values. This was extended to cover the full range of inputs, and the code was restructured to make it faster. The original code had a sequence of if () else if () else if () …, which can take a long time when there are many steps in the approximation. Instead, two ranges of input values are covered with linear steps implemented as table lookups.

The third range of inputs is non linear, and is handled by a binary tree of comparisons to reduce the number of levels. All of this code is implemented in a separate file, so that either the original or the optimised version of phi0 can be used.

The ranges of inputs are:

             x >= 10      result always 0
      10   > x >=  5      steps of 1/2
       5   > x >= 1/16    steps of 1/16
    1/16   > x >= 1/4096  use 1/32, 1/64, 1/128, .., 1/4096
    1/4096 > x            result always 10

The range of values that appear as inputs to phi0() can be represented as a fixed point value stored in a 32 bit integer. By converting to this format at the beginning of the function, the code for all of the comparisons and lookups reduces to shifts and integer operations. The step levels use powers of 2, which lets the table index computations use shifts, and makes the comparison constants simple ones that the ARM instruction set can generate efficiently.
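Here is a toy version of that scheme, to show the shape of the code rather than the actual STM32 implementation. x is converted to Q12 fixed point (4096 = 1.0), so the range tests and table indexing become integer compares and shifts, and the tables are filled at init from the exact formula:

    #include <math.h>
    #include <stdint.h>

    static float tab_half[10];   /* 10 > x >= 5,    steps of 1/2  */
    static float tab_16th[79];   /*  5 > x >= 1/16, steps of 1/16 */

    static float phi0_exact(float x)
    {
        float e = expf(x);
        return logf((e + 1.0f) / (e - 1.0f));
    }

    void phi0_lut_init(void)
    {
        for (int i = 0; i < 10; i++) tab_half[i] = phi0_exact(5.0f + i * 0.5f);
        for (int i = 0; i < 79; i++) tab_16th[i] = phi0_exact((i + 1) / 16.0f);
    }

    float phi0_lut(float x)
    {
        int32_t xq = (int32_t)(x * 4096.0f);             /* Q12 */
        if (xq >= 10 * 4096) return 0.0f;                /* x >= 10 */
        if (xq >=  5 * 4096) return tab_half[(xq - 5 * 4096) >> 11];
        if (xq >=  256)      return tab_16th[(xq - 256) >> 8];
        /* 1/16 > x: the real code uses a binary tree of comparisons
           over the 1/32 .. 1/4096 steps; fall back to the formula here */
        return (xq < 1) ? 10.0f : phi0_exact(x);
    }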

Misc

Two of the configuration values are scale factors that get multiplied inside the nested loops. These values are 1.0 in both of the current configurations so that floating point multiply was removed.

Results

The optimised LDPC decoder produces the same output BER as the original.

The optimised decoder uses 12k of heap at init time and needs another 12k of heap at run time. The original decoder only used heap at run time, which was returned after each call. We have traded some statically held memory for the removal of many small heap allocations and a reduction in execution time. It is probably possible to reduce the static space further, perhaps at the cost of longer run times.

The maximum time to decode a frame using 100 iterations is 60.3 ms and the median time is 8.8 ms, far below our budget of 160ms!

Future Possibilities

The remaining floating point computations in the decoder are additions and subtractions, so the values could be represented as fixed point to eliminate the floating point operations entirely.

Some values which are computed from the configuration (degree, index, socket) are constants and could be generated at compile time using a utility called by cmake. However this might actually slow down the operation as the index computations might become slower.

The index and socket elements of C and V nodes could be pointers instead of indexes into arrays.

Experiments would be required to ensure these changes actually speed up the decoder.

Bio

Don got his first amateur license in high school but was soon distracted with getting an engineering degree (BSEE, Univ. of Washington), then family and life. He started his IC design career with the CPU for the HP-41C calculator. Then came ICs for printers and cameras, work on IC design tools, and some firmware for embedded systems. Exposure to ARES public service led to a new amateur license – W7DMR – and active involvement with ARES. He recently retired after 42 years and wanted to find an open project that combined radio, embedded systems and DSP.

Don lives in Corvallis, Oregon, USA, a small city with the state technical university and several high tech companies.

Open Source Projects and Volunteers

Hi it’s David back again ….

Open source projects like FreeDV and Codec 2 rely on volunteers to make them happen. The typical pattern is people get excited, start some work, then drift away after a few weeks. Gold is the volunteer that consistently works week in, week out until their particular project is done. The number of hours/week doesn’t matter – it’s the consistency that is really helpful to the projects. I have a few contributors/testers/users of FreeDV in this category and I appreciate you all deeply – thank you.

If you would like to help out, please contact me. You’ll learn a lot and get to work towards an open source future for HF radio.

If you can’t help out technically, but would like to support this work, please consider Patreon or PayPal.

Reading Further

LDPC using Octave and the CML library. Our LDPC decoder comes from Coded Modulation Library (CML), which was originally used to support Matlab/Octave simulations.

Horus 37 – High Speed SSTV Images. The CML LDPC decoder was converted to a regular C library, and used for sending images from High Altitude Balloons.

Steve Ports an OFDM modem from Octave to C. Steve is another volunteer who put in a fine effort on the C coding of the OFDM modem. He recently modified the modem to handle high bit rates for voice and HF data applications.

Rick Barnich KA8BMA did a fantastic job of designing the SM1000 hardware. Leading edge, HF digital voice hardware, designed by volunteers.

Tony K2MO Tests FreeDV

Tony, K2MO, has recently published some fine videos of FreeDV 1600, 700C, and 700D passing through simulated HF channels. The results are quite interesting.

This video shows the 700C mode’s ability to decode with 50% of its carriers removed:

The 700C modem sends two copies of the tx signal at high and low frequencies, a form of diversity to help overcome selective fading. These are then combined at the receiver.

Tony’s next video shows three FreeDV modes passing through a selective fading HF channel simulation:

This particular channel has slow fading; a notch gradually creeps across the spectrum.

Tony originally started testing to determine which FreeDV mode worked best on NVIS paths. He used path parameters based on VOACAP prediction models, which show the relative time delay and signal power for each propagation mode, i.e. F1, F2:

Note the long delay paths (5ms). The CCIR NVIS path model also suggests a path delay of 7ms. That much delay puts the F-layer at 1000 km (well out into space), which is a bit of a puzzle.

This video shows the results over the VOACAP NVIS path:

In this case 700C does better than 700D. The 700C modem (COHPSK) is a parallel tone design, which is more robust to long multipath delays. The OFDM modem used for 700D is configured for multipath delays of up to 2ms, but tends to fall over after that as the “O” for Orthogonal assumption breaks down. It can be configured for longer delays, at a small cost in low SNR performance.

The OFDM modem gives much tighter packing for carriers, which allows us to include enough bits for powerful FEC, and have a very narrow RF bandwidth compared to 700C. FreeDV 700D has the ability to perform interleaving (Tools-Options “FreeDV 700 Options”), which is a form of time diversity. This feature is not widely used at present, but simulations suggest it is worth up to 4dB.

It would be interesting to combine frequency diversity, LDPC, and OFDM in a wider bandwidth signal. If anyone is interested in doing a little C coding to try this let me know.

I’ve actually seen long delay on NVIS paths in the “real world”. Here is a 40M 700D contact between myself and Mark, VK5QI, who is about 40 km away from me. Note at times there are notches on the waterfall 200 Hz apart, indicating a 5 ms spread in path delay (a 1500 km difference in path length):

Reading Further

Modems for HF Digital Voice Part 1, explaining the frequency diversity used in 700C
Testing FreeDV 700C, shows how to use some built in test features like noise insertion and interfering carriers.
FreeDV 700D
FreeDV User Guide, including new 700D features like interleaving

Simple Keras “Hello World” Example – Mean Removal

Inspired by the Wavenet work with Codec 2 I’m dipping my toe into the world of Deep Learning (DL) using Keras. I’ve read Deep Learning with Python (quite an enjoyable read) and set up a Linux box with a GTX graphics card that is making my teenage sons weep with jealousy.

So after a couple of days of messing about here is my first “hello world” Keras example: mean_removal.py. It might be helpful for other Keras noobs. Assuming you have all the packages installed, it runs with either Python 2:

$ python mean_removal.py

Or Python 3:

$ python3 mean_removal.py

It removes the mean from vectors, using just a single layer regression model. The script runs OK on a regular PC without a chunky graphics card.

So I start by generating vectors from random numbers with a zero mean. I then add a random offset to each sample in the vector. Here are 5 vectors with random offsets added to them:

The Keras script is then trained to estimate and remove the offsets, so the output vectors look like:

Estimating the offset is the same as finding the “mean” of the vector. Yes I know we can do that with a “mean” function, but where’s the fun in that!
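For the record, mean removal is a linear map, which is why a single layer with no fancy activation is enough. With K the vector length and 1 the all-ones vector:

    \bar{x} = \frac{1}{K}\sum_{k=1}^{K} x_k, \qquad
    y = x - \bar{x}\,\mathbf{1} = \left(I - \frac{1}{K}\mathbf{1}\mathbf{1}^T\right)x

so the trained weight matrix should converge to something close to I - (1/K)11^T.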

Here are some other plots that show the training and validation measures, and error metrics at the output:



The last two plots show pretty much all of the offset is removed and it restores the original (non offset) vectors with just a tiny bit of noise. I had to wind back the learning rate to get it to converge without “NAN” losses, possibly as I’m using fairly big input numbers. I’m familiar with the idea of learning rate from NLMS adaptive filters, such as those used for my work in echo cancellation.

Deep Learning for Codec 2

My initial ambitions for DL are somewhat more modest than the sample-by-sample synthesis used in the Wavenet work. I have some problems with Vector Quantisation (VQ) in the low rate Codec 2 modes. The VQ is used to compactly describe the speech spectrum, which carries the intelligibility of the signal.

The VQ gets upset with different microphones, speakers, or minor spectral shaping like gentle high pass/low pass filtering. This shaping often causes a poor vector to be chosen, which results in crappy speech quality. The current VQ error measure can’t tell the difference between spectral features that matter and those that don’t.

So I’d like to try DL to address those issues, and train a system to say “look, this speech and this speech are actually the same. Yes I know one of them has a 3dB/octave low pass filter, please ignore that”.

As emphasised in the text book above, some feature extraction can help with DL problems. For my first pass I’ll be working on parameters extracted by the Codec 2 model (like a compact version of the speech spectrum) rather than speech samples like Wavenet. This will reduce my CPU load significantly, at the expense of speech quality, which will be limited by the unquantised Codec 2 model. But that’s OK as a first step. A notch or two up on Codec 2 at 700 bit/s would be very useful, especially if it can run on a CPU from the first two decades of the 21st century.

Mean Removal on Speech Vectors

So to get started with Keras I chose mean removal. The mean level or constant offset is like the volume or energy in a speech signal; it’s the simplest form of spectral shaping I could imagine. I trained and tested with vectors of random numbers, using numbers in the range of the speech spectral samples that Codec 2 plays with.

It’s a bit like an equaliser, vectors with arbitrary spectral shaping go in, “flat” unshaped vectors come out. They can then be sent to a Vector Quantiser. There are probably smarter ways to do this, but I need to start somewhere.

So as a next step I tried offset removal with vectors that represent the spectrum of a 40ms speech frame:


This is pretty cool – the network was trained on random numbers but works well with real speech frames. You can also see the spectral slope I mentioned above: the speech energy gradually falls off at high frequencies. This doesn’t affect the intelligibility of the speech, but tends to upset traditional Vector Quantisers. Especially mine.

Now that I have something super-basic working, the next step is to train and test networks to deal with some non-trivial spectral shaping.

Reading Further

Deep Learning with Python
WaveNet and Codec 2
Codec 2 700C, the current Codec 2 700 bit/s mode. With better VQ we can improve on this.
Codec 2 at 450 bit/s, some fine work from Thomas and Stefan, that uses a form of machine learning to synthesise 16 kHz speech from 8 kHz inputs.
FreeDV 700D, the recently released FreeDV mode that uses Codec 2 700C. A FreeDV Mode also includes a modem, FEC, protocol.
RNNoise: Learning Noise Suppression, Jean-Marc’s DL network for noise reduction. Thanks Jean-Marc for the brainstorming emails!

Band Pass Filter and Power Amplifier for Simple HF Data

Is it possible to move data over HF radio using very simple, low cost hardware and clever SDR software? In the last few posts (here and here) I’ve been constructing and testing building blocks for a simple HF data terminal. This post describes a few more, a 3-8 MHz Band Pass Filter (BPF) and 1W Power Amplifier (PA).

Band Pass Filter

The RTL-SDR samples at 28.8 MHz, capturing a broad chunk of spectrum. In direct mode we just sample the Q-channel, so any energy above 14.4 MHz will be aliased into our passband; e.g. a signal at 21.8 MHz will appear at 7 MHz after sampling, right on top of a real 7 MHz signal.
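To make the folding concrete, here is the image frequency arithmetic as a one-liner (a sketch, assuming plain real sampling at fs):

    #include <math.h>

    /* For a real sampler at fs, an input at f appears at the image
       frequency |f - fs*round(f/fs)|. alias(21.8e6, 28.8e6) = 7.0e6 */
    double alias(double f, double fs)
    {
        return fabs(f - fs * round(f / fs));
    }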

In the previous post we determined the ADC overloads at -30dBm, so we want to remove any strong signals above or near that level. One source of strong signals is broadcast band AM radio between 500 and 1600 kHz.

The use case is “100 mile” data links, so I’d like the receiver to work on the 80M (3.5 MHz) as well as the 40M (7.1 MHz) band, which sets the BPF passband at 3 to 8 MHz. I hooked up my spec-an to a 40M antenna and could see AM broadcast signals peaking at -40dBm, so I set a BPF specification of > 20dB attenuation at 1.5 MHz, to keep the sum of all those signals well away from the -30dBm limit. At the high frequency end I specified > 30dB attenuation at 21 MHz, to reduce any energy aliased down to 7 MHz.

I designed a cascaded high pass/low pass filter using some tables from my ancient (but still excellent) copy of “RF Circuit Design” by Chris Bowick. The Octave rtl_sdr script does the calculations for me; a spreadsheet would work well too.
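The calculation itself is simple enough to sketch in a few lines: scale the normalised low pass prototype values g[] from the tables to the cutoff frequency and system impedance. The g values below are the 3rd order Butterworth prototype, used purely as an example; the actual filter order and response come from Bowick’s tables and the rtl_sdr script:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* 3rd order Butterworth low pass prototype: shunt C, series L, shunt C */
        const double g[] = { 1.0, 2.0, 1.0 };
        const double R = 50.0, fc = 8.0e6;   /* e.g. the 8 MHz band edge */
        double w = 2.0 * M_PI * fc;

        printf("C1 = %.0f pF\n", g[0] / (w * R) * 1e12);
        printf("L2 = %.2f uH\n", g[1] * R / w * 1e6);
        printf("C3 = %.0f pF\n", g[2] / (w * R) * 1e12);
        return 0;
    }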

I simulated the BPF using LTSpice, fixed a few bugs, and tweaked it for real world component values. Here is the circuit and frequency response on log and linear scales:



I soldered up the BPF Manhattan style using commercial axial 1uH inductors and ceramic capacitors, then tested it using the spec-an and tracking generator (note linear scale):

The table at the bottom shows the measured attenuation at some important frequencies. The attenuation is a bit low at 21 MHz, perhaps due to the finite Q of the real world inductors. Quite a good match to the LTSpice simulation and close enough for my experiments. The little step at around 10 MHz is a tracking generator artefact.

The next plot shows the effect of the BPF when my spec-an is connected to my 40M dipole (0 to 10MHz span). Yellow is the received signal without the filter, purple with the filter.

The big spike around 0 Hz is an artefact on the spec-an. The filter is doing a good job of nailing the AM broadcast band energy. You can see a peak around 7.4 MHz where the dipole is resonant. Actually this is a bit of a surprise to me, as I want it resonant around 7.2MHz, I better check that out! At 7.2-ish the insertion loss (difference between the purple and yellow) is a few dB as per the tracking generator plot above. It’s more like 6dB at 7.4 MHz (the dipole peak), not quite sure why. An insertion loss of 3dB at 7.2 MHz is OK for my application.

Power Amplifier

A few weeks ago I hooked the rpitx to my 40M dipole and managed to demodulate the 11mW signal a few km away (over an urban channel) using a mag loop and my FT-817. I decided to build a small 1W PA to make the system usable over “100 mile” HF channels. The actual power is not that critical, as we can trade power off against bit rate. For example if a given HF channel supports 100 bit/s at 1W, we then know we can get 1000 bit/s at 10W.

Even low bit rates can be very useful if you have no other communication. A text message or Tweet, allowing for some overhead, averages about 1000 bits. So at 1000 bit/s you can send 1 txt per second, 3600 an hour, or 86,400 a day. That’s very useful communication if you are in a disaster situation and want to tell family you are alive, or perhaps live in a remote area with no other communication. Of course HF channels come and go, so the actual throughput will be less than that.

I explored the junk box and found a partially constructed Beach 40. I isolated the driver and PA stage and poked it with my signal generator. Turns out it had a bit too much gain (the rpitx has plenty of drive) so I ended up with this simple PA circuit:



The only spurious output I can see is the 2nd harmonic, at -44 dBC, meeting ACMA specs:

The low pass filter at the output has a 3dB point at about 10 MHz which is a little high. It could be brought down a little to increase stop-band attenuation and reduce the 2nd harmonic further. I haven’t done anything about impedance matching the input, as it hits 1W (30dBm) output with 14dBm drive from the rpitx. The 1 inch square heatsink is quite warm after 10 minutes but I can still hold it. It’s not very efficient, 2.9W DC input power for 1W out, however 16dB power gain is quite good for a PA. Anyhoo, it’s a fine starting point for my experiments, we can optimise the PA later if necessary.

Next Steps

OK, so I have most of the building blocks I need for some over the air HF data experiments. There was a bit of engineering involved in building the BPF and PA, but the designs are very simple and can be constructed for a few $ or even from road kill (recycled) components. We now have a very low cost HF data radio, running high performance modems, connected to a Linux computer and Wifi.

Next I will put some software together to estimate data throughput, set the system up with real antennas, and gather some experimental results over real world HF channels.

Reading Further

Rpitx and 2FSK, first part in this series.
Testing a RTL-SDR with FSK on HF, second part in this series.
rtl_sdr.m script that calculates component values for the BPF.