Porting a LDPC Decoder to a STM32 Microcontroller

A few months ago, FreeDV 700D was released. In that post, I asked for volunteers to help port 700D to the STM32 microcontroller used for the SM1000. Don Reid, W7DMR stepped up – and has been doing a fantastic job porting modules of C code from the x86 to the STM32.

Here is a guest post from Don, explaining how he has managed to get a powerful LDPC decoder running on the STM32.

LDPC for the STM32

The 700D mode and its LDPC function were developed and used on desktop (x86) platforms. The LDPC decoder is implemented in the mpdecode_core.c source file.

We’d like to run the decoder on the SM1000 platform which has an STM32F4 processor. This requires the following changes:

  • The code used doubles in several places, while the stm32 has only single precision floating point hardware.
  • It was thought that the memory used might be too much for a system with just 192k bytes of RAM.
  • There are 2 LDPC codes currently supported, HRA_112_112 used in 700D and, H2064_516_sparse used for Balloon Telemetry. While only the 700D configuration needed to work on the STM32 platform, any changes made to the mainstream code needed to work with the H2064_516_sparse code.

Testing

Before making changes it was important to have a well defined test process to validate new versions. This allowed each change to be validated as it was made. Without this the final debugging would likely have been very difficult.

The ldpc_enc utility can generate standard test frames and the ldpc_dec utility receive the frames and measure bit errors. So errors can be detected directly and BER computed. ldpc_enc can also output soft decision symbols to emulate what the modem would receive and pass into the LDPC decoder. A new utility ldpc_noise was written to add AWGN to the sample values between the above utilities. here is a sample run:

$ ./ldpc_enc /dev/zero - --sd --code HRA_112_112 --testframes 100 | ./ldpc_noise - - 1 | ./ldpc_dec - /dev/null --code HRA_112_112 --sd --testframes
single sided NodB = 1.000000, No = 1.258925
code: HRA_112_112
code: HRA_112_112
Nframes: 100
CodeLength: 224 offset: 0
measured double sided (real) noise power: 0.640595
total iters 3934
Raw Tbits..: 22400 Terr: 2405 BER: 0.107
Coded Tbits: 11200 Terr: 134 BER: 0.012

ldpc_noise is passed a “No” (N-zero) level of 1dB, Eb=0, so Eb/No = -1, and we get a 10% raw BER, and 1% after LDPC decoding. This is a typical operating point for 700D.

A shell script (ldpc_check) combines several runs of these utilities, checks the results, and provides a final pass/fail indication.

All changes were made to new copies of the source files (named *_test*) so that current users of codec2-dev were not disrupted, and so that the behaviour could be compared to the “released” version.

Unused Functions

The code contained several functions which are not used anywhere in the FreeDV/Codec2 system. Removing these made it easier to see the code that was used and allowed the removal of some variables and record elements to reduce the memory used.

First Compiles

The first attempt at compiling for the stm32 platform showed that the the code required more memory than was available on the processor. The STM32F405 used in the SM1000 system has 128k bytes of main RAM.

The largest single item was the DecodedBits array which was used to saved the results for each iteration, using 32 bit integers, one per decoded bit.

    int *DecodedBits = calloc( max_iter*CodeLength, sizeof( int ) );

This used almost 90k bytes!

The decode function used for FreeDV (SumProducts) used only the last decoded set. So the code was changed to save only one pass of values, using 8 bit integers. This reduced the ~90k bytes to just 224 bytes!

The FreeDV 700D mode requires on LDPC decode every 160ms. At this point the code compiled and ran but was too slow – using around 25ms per iteration, or 300 – 2500ms per frame!

C/V Nodes

The two main data structures of the LDPC decoder are c_nodes and v_nodes. Each is an array where each node contains additional arrays. In the original code these structures used over 17k bytes for the HRA_112_112 code.

Some of the elements of the c and v nodes (index, socket) are indexes into these arrays. Changing these from 32 bit to 16 bit integers and changing the sign element into a 8 bit char saved about 6k bytes.

The next problem was the run time. Each 700D frame must be fully processed in 160 ms and the decoder was taking several times this long. The CPU load was traced to the phi0() function, which was calling two maths library functions. After optimising the phi0 function (see below) the largest use of time was the index computations of the nested loops which accessed these c and v node structures.

With each node having separate arrays for index, socket, sign, and message these indexes had to be computed separately. By changing the node structures to hold an array of sub-nodes instead this index computation time was significantly reduced. An additional benefit was about a 4x reduction in the number of memory blocks allocated. Each allocation block includes additional memory used by malloc() and free() so reducing the number of blocks reduces memory use and possible heap fragmentation.

Additional time was saved by only calculating the degree elements of the c and v nodes at start-up rather than for every frame. That data is kept in memory that is statically allocated when the decoder is initialized. This costs some memory but saves time.

This still left the code calling malloc several hundred times for each frame and then freeing that memory later. This sort of memory allocation activity has been known to cause troubles in some embedded systems and is usually avoided. However the LDPC decoder needed too much memory to allow it to be statically allocated at startup and not shared with other parts of the code.

Instead of allocating an array of sub-nodes for each c or v node, a single array of bytes is passed in from the parent. The initialization function which calculates the degree elements of the nodes also counts up the memory space needed and reports this to its caller. When the decoder is called for a frame, the node’s pointers are set to use the space of this array.

Other arrays that the decoder needs were added to this to further reduce the number of separate allocation blocks.

This leaves the decisions of how to allocate and share this memory up to a higher level of the code. The plan is to continue to use malloc() and free() at a higher level initially. Further testing can be done to look for memory leakage and optimise overall memory usage on the STM32.

PHI0

There is a non linear function named “phi0” which is called inside several levels of nested loops within the decoder. The basic operation is:

   phi0(x) = ln( (e^x + 1) / (e^x - 1) )

The original code used double precision exp() and log(), even though the input, output, and intermediate values are all floats. This was probably an oversight. Changing to the single single precision versions expf() and logf() provided some improvements, but not enough to meet our CPU load goal.

The original code used piecewise approximation for some input values. This was extended to cover the full range of inputs. The code was also structured differently to make it faster. The original code had a sequence of if () else if () else if () … This can take a long time when there are many steps in the approximation. Instead two ranges of input values are covered with linear steps that is implemented with table lookups.

The third range of inputs in non linear and is handled by a binary tree of comparisons to reduce the number of levels. All of this code is implemented in a separate file to allow the original or optimised version of phi0 to be used.

The ranges of inputs are:

             x >= 10      result always 0
      10   > x >=  5      steps of 1/2
       5   > x >= 1/16    steps of 1/16
    1/16   > x >= 1/4096  use 1/32, 1/64, 1/128, .., 1/4096
    1/4096 > x            result always 10

The range of values that will appear as inputs to phi0() can be represented with as fixed point value stored in a 32 bit integer. By converting to this format at the beginning of the function the code for all of the comparisons and lookups is reduced and uses shifts and integer operations. The step levels use powers of 2 which let the table index computations use shifts and make the fraction constants of the comparisons simple ones that the ARM instruction set can create efficiently.

Misc

Two of the configuration values are scale factors that get multiplied inside the nested loops. These values are 1.0 in both of the current configurations so that floating point multiply was removed.

Results

The optimised LDPC decoder produces the same output BER as the original.

The optimised decoder uses 12k of heap at init time and needs another 12k of heap at run time. The original decoder just used heap at run time, that was returned after each call. We have traded off the use of static heap to clean up the many small heap allocations and reduce execution time. It is probably possible to reduce the static space further perhaps at the cost of longer run times.

The maximum time to decode a frame using 100 iterations is 60.3 ms and the median time is 8.8 ms, far below our budget of 160ms!

Future Possibilities

The remaining floating point computations in the decoder are addition and subtraction so the values could be represented with fix point values to eliminate the floating point operations.

Some values which are computed from the configuration (degree, index, socket) are constants and could be generated at compile time using a utility called by cmake. However this might actually slow down the operation as the index computations might become slower.

The index and socket elements of C and V nodes could be pointers instead of indexes into arrays.

Experiments would be required to ensure these changes actually speed up the decoder.

Bio

Don got his first amateur license in high school but was soon distracted with getting an engineering degree (BSEE, Univ. of Washington), then family and life. He started his IC design career with the CPU for the HP-41C calculator. Then came ICs for printers and cameras, work on IC design tools, and some firmware for embedded systems. Exposure to ARES public service lead to a new amateur license – W7DMR and active involvement with ARES. He recently retired after 42 years and wanted to find an open project that combined radio, embedded systems and DSP.

Don lives in Corvallis, Oregon, USA a small city with the state technical university and several high tech companies.

Open Source Projects and Volunteers

Hi it’s David back again ….

Open source projects like FreeDV and Codec 2 rely on volunteers to make them happen. The typical pattern is people get excited, start some work, then drift away after a few weeks. Gold is the volunteer that consistently works week in, week out until their particular project is done. The number of hours/week doesn’t matter – it’s the consistency that is really helpful to the projects. I have a few contributors/testers/users of FreeDV in this category and I appreciate you all deeply – thank you.

If you would like to help out, please contact me. You’ll learn a lot and get to work towards an open source future for HF radio.

If you can’t help out technically, but would like to support this work, please consider Patreon or PayPal.

Reading Further

LDPC using Octave and the CML library. Our LDPC decoder comes from Coded Modulation Library (CML), which was originally used to support Matlab/Octave simulations.

Horus 37 – High Speed SSTV Images. The CML LDPC decoder was converted to a regular C library, and used for sending images from High Altitude Balloons.

Steve Ports an OFDM modem from Octave to C. Steve is another volunteer who put in a fine effort on the C coding of the OFDM modem. He recently modified the modem to handle high bit rates for voice and HF data applications.

Rick Barnich KA8BMA did a fantastic job of designing the SM1000 hardware. Leading edge, HF digital voice hardware, designed by volunteers.

7 thoughts on “Porting a LDPC Decoder to a STM32 Microcontroller”

  1. Cool news!
    Great to see its no more just a one man show!
    Keep the good work up!

    The great thing is with porting algorithms to small MCUs will benefit also the big CPUs lots of improvements. Cross platform rules!

  2. This is pretty cool, because I have no idea how LDPC works, and I’m glad others do. Not only that, but can locate and modify the non-optimized code. Thanks Don!

  3. Thanks a lot for the great effort of your team!
    Is the port of LDPC the 1st step of 700D porting to SM1000? The OFDM and codec2 itself should be the more difficult one? If the sram and the MIPS of the STM32F4 ,how about to replace it to STM32F7 or even H7 with SDRAM support?

    Best Regards?
    James

  4. Thanks a lot for your reply!
    Yes I read the information about porting OFDM from Octave to C, but never know you have port them to SM1000 even after checking the codec2 0.8 version. As I have got the information that due to the limitation of the SRAM and MIPS , the 700B and 700C haven’t been ported to SM1000. Am I right? So porting 700D to SM1000 is still in the plan? To running the 700D on the SM1000 like box is my dream.

    Further question about Freedv 1600 mode in SM1000, I am trying to modify the decimation ratio in the down_convert_and_rx_filter() ,from the original decimating by 4 then by 10 to the new arrangement by 10 then by 4, because the large decimation ratio in front will save calculation more in the front FIR filter. The rx_fdm_fcorr is 1st filtered and decimated by 10. The memory of rx_fdm is reduced to 112 complex , still 7 symbol in the buffer for the next down converting. After then, decimating by 4 to get the final 4 pcs sample putting to rx_filt[c]. The new code is running ,but with strong noise companion with the demoded voice, likely getting the wrong symbol estimation. Would you give me some advice to debug?

    Best Regards,
    James

    1. The OFDM modem used in 700D uses less memory that the cohpsk modem used for the earlier 700C modes, which makes it a little easier to port to embedded devices. However the main reason the earlier 700 modes were not ported to the SM1000 is that no one stepped up to do the work. Don is making the 700D port happen, as he is prepared to do the work.

      Test your new filter design using unit tests on the x86 first, and carefully check it’s BER performance. compared to the original code.

  5. Hi, David,
    Thanks a lot!
    For the 700d mode, as the LDPC code have take the 60ms in the 160ms frame interval, and there still have OFDM and 700d codec need to be port , I am wondering if the resource is enough for the rest porting. So I suggest replace the old STM32F4 to the STM32F7 or H7 with the pin to pin one. For the larger SRAM inside and the higher MIPS, it might help to the porting. The SM1000 user could buy the new chip and repalce it themself .

    For the FDMDV, the plan to decimation by 10 then by 4 is totally wrong as it destroy the half of the carrier in the front decimation. The original one it the best choice.

    Still have a nother question, there is just 15 carrier listed in the FDMDV document , but 17 carrier calculated in the down conversion process in the code . Why?

    Best Reegards,
    James

Comments are closed.