A few months ago, FreeDV 700D was released. In that post, I asked for volunteers to help port 700D to the STM32 microcontroller used for the SM1000. Don Reid, W7DMR stepped up – and has been doing a fantastic job porting modules of C code from the x86 to the STM32.
Here is a guest post from Don, explaining how he has managed to get a powerful LDPC decoder running on the STM32.
LDPC for the STM32
The 700D mode and its LDPC function were developed and used on desktop (x86) platforms. The LDPC decoder is implemented in the mpdecode_core.c source file.
We’d like to run the decoder on the SM1000 platform which has an STM32F4 processor. This requires the following changes:
- The code used doubles in several places, while the stm32 has only single precision floating point hardware.
- It was thought that the memory used might be too much for a system with just 192k bytes of RAM.
- There are 2 LDPC codes currently supported, HRA_112_112 used in 700D and, H2064_516_sparse used for Balloon Telemetry. While only the 700D configuration needed to work on the STM32 platform, any changes made to the mainstream code needed to work with the H2064_516_sparse code.
Before making changes it was important to have a well defined test process to validate new versions. This allowed each change to be validated as it was made. Without this the final debugging would likely have been very difficult.
The ldpc_enc utility can generate standard test frames and the ldpc_dec utility receive the frames and measure bit errors. So errors can be detected directly and BER computed. ldpc_enc can also output soft decision symbols to emulate what the modem would receive and pass into the LDPC decoder. A new utility ldpc_noise was written to add AWGN to the sample values between the above utilities. here is a sample run:
$ ./ldpc_enc /dev/zero - --sd --code HRA_112_112 --testframes 100 | ./ldpc_noise - - 1 | ./ldpc_dec - /dev/null --code HRA_112_112 --sd --testframes single sided NodB = 1.000000, No = 1.258925 code: HRA_112_112 code: HRA_112_112 Nframes: 100 CodeLength: 224 offset: 0 measured double sided (real) noise power: 0.640595 total iters 3934 Raw Tbits..: 22400 Terr: 2405 BER: 0.107 Coded Tbits: 11200 Terr: 134 BER: 0.012
ldpc_noise is passed a “No” (N-zero) level of 1dB, Eb=0, so Eb/No = -1, and we get a 10% raw BER, and 1% after LDPC decoding. This is a typical operating point for 700D.
A shell script (ldpc_check) combines several runs of these utilities, checks the results, and provides a final pass/fail indication.
All changes were made to new copies of the source files (named *_test*) so that current users of codec2-dev were not disrupted, and so that the behaviour could be compared to the “released” version.
The code contained several functions which are not used anywhere in the FreeDV/Codec2 system. Removing these made it easier to see the code that was used and allowed the removal of some variables and record elements to reduce the memory used.
The first attempt at compiling for the stm32 platform showed that the the code required more memory than was available on the processor. The STM32F405 used in the SM1000 system has 128k bytes of main RAM.
The largest single item was the DecodedBits array which was used to saved the results for each iteration, using 32 bit integers, one per decoded bit.
int *DecodedBits = calloc( max_iter*CodeLength, sizeof( int ) );
This used almost 90k bytes!
The decode function used for FreeDV (SumProducts) used only the last decoded set. So the code was changed to save only one pass of values, using 8 bit integers. This reduced the ~90k bytes to just 224 bytes!
The FreeDV 700D mode requires on LDPC decode every 160ms. At this point the code compiled and ran but was too slow – using around 25ms per iteration, or 300 – 2500ms per frame!
The two main data structures of the LDPC decoder are c_nodes and v_nodes. Each is an array where each node contains additional arrays. In the original code these structures used over 17k bytes for the HRA_112_112 code.
Some of the elements of the c and v nodes (index, socket) are indexes into these arrays. Changing these from 32 bit to 16 bit integers and changing the sign element into a 8 bit char saved about 6k bytes.
The next problem was the run time. Each 700D frame must be fully processed in 160 ms and the decoder was taking several times this long. The CPU load was traced to the phi0() function, which was calling two maths library functions. After optimising the phi0 function (see below) the largest use of time was the index computations of the nested loops which accessed these c and v node structures.
With each node having separate arrays for index, socket, sign, and message these indexes had to be computed separately. By changing the node structures to hold an array of sub-nodes instead this index computation time was significantly reduced. An additional benefit was about a 4x reduction in the number of memory blocks allocated. Each allocation block includes additional memory used by malloc() and free() so reducing the number of blocks reduces memory use and possible heap fragmentation.
Additional time was saved by only calculating the degree elements of the c and v nodes at start-up rather than for every frame. That data is kept in memory that is statically allocated when the decoder is initialized. This costs some memory but saves time.
This still left the code calling malloc several hundred times for each frame and then freeing that memory later. This sort of memory allocation activity has been known to cause troubles in some embedded systems and is usually avoided. However the LDPC decoder needed too much memory to allow it to be statically allocated at startup and not shared with other parts of the code.
Instead of allocating an array of sub-nodes for each c or v node, a single array of bytes is passed in from the parent. The initialization function which calculates the degree elements of the nodes also counts up the memory space needed and reports this to its caller. When the decoder is called for a frame, the node’s pointers are set to use the space of this array.
Other arrays that the decoder needs were added to this to further reduce the number of separate allocation blocks.
This leaves the decisions of how to allocate and share this memory up to a higher level of the code. The plan is to continue to use malloc() and free() at a higher level initially. Further testing can be done to look for memory leakage and optimise overall memory usage on the STM32.
There is a non linear function named “phi0” which is called inside several levels of nested loops within the decoder. The basic operation is:
phi0(x) = ln( (e^x + 1) / (e^x - 1) )
The original code used double precision exp() and log(), even though the input, output, and intermediate values are all floats. This was probably an oversight. Changing to the single single precision versions expf() and logf() provided some improvements, but not enough to meet our CPU load goal.
The original code used piecewise approximation for some input values. This was extended to cover the full range of inputs. The code was also structured differently to make it faster. The original code had a sequence of if () else if () else if () … This can take a long time when there are many steps in the approximation. Instead two ranges of input values are covered with linear steps that is implemented with table lookups.
The third range of inputs in non linear and is handled by a binary tree of comparisons to reduce the number of levels. All of this code is implemented in a separate file to allow the original or optimised version of phi0 to be used.
The ranges of inputs are:
x >= 10 result always 0 10 > x >= 5 steps of 1/2 5 > x >= 1/16 steps of 1/16 1/16 > x >= 1/4096 use 1/32, 1/64, 1/128, .., 1/4096 1/4096 > x result always 10
The range of values that will appear as inputs to phi0() can be represented with as fixed point value stored in a 32 bit integer. By converting to this format at the beginning of the function the code for all of the comparisons and lookups is reduced and uses shifts and integer operations. The step levels use powers of 2 which let the table index computations use shifts and make the fraction constants of the comparisons simple ones that the ARM instruction set can create efficiently.
Two of the configuration values are scale factors that get multiplied inside the nested loops. These values are 1.0 in both of the current configurations so that floating point multiply was removed.
The optimised LDPC decoder produces the same output BER as the original.
The optimised decoder uses 12k of heap at init time and needs another 12k of heap at run time. The original decoder just used heap at run time, that was returned after each call. We have traded off the use of static heap to clean up the many small heap allocations and reduce execution time. It is probably possible to reduce the static space further perhaps at the cost of longer run times.
The maximum time to decode a frame using 100 iterations is 60.3 ms and the median time is 8.8 ms, far below our budget of 160ms!
The remaining floating point computations in the decoder are addition and subtraction so the values could be represented with fix point values to eliminate the floating point operations.
Some values which are computed from the configuration (degree, index, socket) are constants and could be generated at compile time using a utility called by cmake. However this might actually slow down the operation as the index computations might become slower.
The index and socket elements of C and V nodes could be pointers instead of indexes into arrays.
Experiments would be required to ensure these changes actually speed up the decoder.
Don got his first amateur license in high school but was soon distracted with getting an engineering degree (BSEE, Univ. of Washington), then family and life. He started his IC design career with the CPU for the HP-41C calculator. Then came ICs for printers and cameras, work on IC design tools, and some firmware for embedded systems. Exposure to ARES public service lead to a new amateur license – W7DMR and active involvement with ARES. He recently retired after 42 years and wanted to find an open project that combined radio, embedded systems and DSP.
Don lives in Corvallis, Oregon, USA a small city with the state technical university and several high tech companies.
Open Source Projects and Volunteers
Hi it’s David back again ….
Open source projects like FreeDV and Codec 2 rely on volunteers to make them happen. The typical pattern is people get excited, start some work, then drift away after a few weeks. Gold is the volunteer that consistently works week in, week out until their particular project is done. The number of hours/week doesn’t matter – it’s the consistency that is really helpful to the projects. I have a few contributors/testers/users of FreeDV in this category and I appreciate you all deeply – thank you.
If you would like to help out, please contact me. You’ll learn a lot and get to work towards an open source future for HF radio.
LDPC using Octave and the CML library. Our LDPC decoder comes from Coded Modulation Library (CML), which was originally used to support Matlab/Octave simulations.
Horus 37 – High Speed SSTV Images. The CML LDPC decoder was converted to a regular C library, and used for sending images from High Altitude Balloons.
Steve Ports an OFDM modem from Octave to C. Steve is another volunteer who put in a fine effort on the C coding of the OFDM modem. He recently modified the modem to handle high bit rates for voice and HF data applications.