ARM NEON Optimisation

I’ve been trying to optimise NEON DSP code on a Raspberry Pi. Using the intrinsics I managed a speed increase of about 3 times over vanilla C with just a few hours’ work. However the results are still significantly slower than the theoretical speed of the machine, which is 4 multiply-accumulates (8 float operations) per cycle. On a 1.2 GHz core that’s 9.6 GFLOPS.
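
For concreteness, the intrinsics version of the inner loop looks something like the sketch below. This is the shape of the idea rather than the exact code in the repo (the function name is mine), and it assumes the vector length is a multiple of 4; vmlaq_f32 is the intrinsic that maps to the vmla.f32 multiply-accumulate.

#include <arm_neon.h>

/* Sketch of a NEON intrinsics dot product: 4 floats per iteration,
   one vector multiply-accumulate (vmla.f32). Assumes n % 4 == 0. */
static float dot_intrinsics(const float *x, const float *y, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);          /* 4 partial sums        */

    for (int i = 0; i < n; i += 4) {
        float32x4_t vx = vld1q_f32(&x[i]);        /* load 4 floats of x    */
        float32x4_t vy = vld1q_f32(&y[i]);        /* load 4 floats of y    */
        acc = vmlaq_f32(acc, vx, vy);             /* acc += vx * vy, 4 lanes */
    }

    /* reduce the 4 partial sums to a scalar */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
}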

Since then I’ve been looking at ARM manuals, Googling, and trying various ad-hoc ideas. There is a lack of working, fully optimised code examples, and I can’t find any data on cycle times and latency information for the Cortex-A53 used in the RPi. The number of ARM devices and families is bewildering, and trying to find information in a series of thousand-page manuals is daunting.

Fortunately the same NEON assembler seems to work (i.e. it assembles cleanly and you get the right results) on many ARM machines. It’s just unclear how fast it runs and why.

To get a handle on the problem I wrote a series of simple floating point dot product programs, and attempted to optimise them. Each program runs through a total of 1E9 dot product points, using an inner and outer loop. I made the inner loop pretty small (1000 floats) to try to avoid cache miss issues. Here are the results, using cycle counts measured with “perf”:

Program   Test                                       Theory cycles/loop   Measured cycles/loop   GFLOPS
dot1      Dot product, no memory reads               1                    4                      1.2*8/4   = 2.4
dot2      Dot product, no memory reads, unrolled     1                    1                      1.2*8/1   = 9.6
dot3      Dot product with memory reads              3                    9.6                    1.2*8/9.6 = 1
dot4      Dot product with memory reads, assembler   3                    6.1                    1.2*8/6.1 = 1.6
dotne10   Dot product with memory reads, Ne10        3                    11                     1.2*8/11  = 0.87

Cycles/loop is how many cycles are executed for one iteration of the inner loop. The last column assumes a 1.2 GHz clock, and 8 floating point ops for every NEON vector multiply-accumulate (vmla.f32) instruction (a multiply and an add on each of the 4 floats in the vector, processed in parallel).
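
Written out, the formula is GFLOPS = 1.2 (GHz) * 8 (FLOPs per vmla.f32) / (measured cycles/loop). At the theoretical 1 cycle per multiply-accumulate that gives the 9.6 GFLOPS peak mentioned above; at dot3’s measured 9.6 cycles/loop it drops to 1.0 GFLOPS.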

The only real success I had was dot2, but that’s an unrealistic example as it doesn’t read memory in the inner loop. I guessed that the latencies in the NEON pipeline meant an unrolled loop would work better.
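
The unrolling idea, sketched here with memory reads so it is self-contained (again my own illustration, not the dot2 source, which skips the loads), is to keep several independent accumulators so consecutive vmla.f32 instructions don’t stall waiting on each other’s results:

#include <arm_neon.h>

/* Unrolled dot product sketch: 16 floats per iteration, 4 independent
   accumulators to hide the multiply-accumulate latency. Assumes n % 16 == 0. */
static float dot_unrolled(const float *x, const float *y, int n)
{
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f);
    float32x4_t acc3 = vdupq_n_f32(0.0f);

    for (int i = 0; i < n; i += 16) {
        acc0 = vmlaq_f32(acc0, vld1q_f32(&x[i]),      vld1q_f32(&y[i]));
        acc1 = vmlaq_f32(acc1, vld1q_f32(&x[i + 4]),  vld1q_f32(&y[i + 4]));
        acc2 = vmlaq_f32(acc2, vld1q_f32(&x[i + 8]),  vld1q_f32(&y[i + 8]));
        acc3 = vmlaq_f32(acc3, vld1q_f32(&x[i + 12]), vld1q_f32(&y[i + 12]));
    }

    /* combine the 4 accumulators, then reduce to a scalar */
    float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
}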

Assuming (as I can’t find any data on instruction timing) two cycles for the memory reads and one for the multiply-accumulate, I was hoping for 3 cycles per loop for dot3 and dot4, maybe even better if there is some dual issue magic going on. At 3 cycles/loop that would be 1.2*8/3 = 3.2 GFLOPS. The best I can do is 6 cycles.
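
For what it’s worth, the inner loop I was budgeting for is just three instructions: two vector loads and one multiply-accumulate. The sketch below writes it as GCC inline assembly inside C for an AArch32 build with NEON enabled; the function name, register choices, constraints, and the %q operand modifier are my assumptions about how to express it, not the actual dot4 source:

#include <arm_neon.h>

/* Sketch of the hoped-for 3-instruction inner loop: two vld1.32 loads
   plus one vmla.f32 per 4 floats. Assumes n % 4 == 0. */
static float dot_asm_sketch(const float *x, const float *y, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);

    for (int i = 0; i < n; i += 4) {
        asm volatile(
            "vld1.32  {q0}, [%[px]]!  \n\t"   /* load 4 floats of x, advance pointer */
            "vld1.32  {q1}, [%[py]]!  \n\t"   /* load 4 floats of y, advance pointer */
            "vmla.f32 %q[acc], q0, q1 \n\t"   /* acc += x * y on all 4 lanes         */
            : [px] "+r" (x), [py] "+r" (y), [acc] "+w" (acc)
            :
            : "d0", "d1", "d2", "d3", "memory");   /* d0-d3 cover q0 and q1 */
    }

    /* reduce the 4 accumulator lanes to a scalar */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
}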

I’d rather have enough information to “engineer” the system than have to rely on guesses. I’ve worked on many similar DSP optimisation projects in the past which have had data sheets and worked examples as a starting point.

Here is the neon-dot source code on GitLab. If you can make the code run faster – please send me a patch! The output looks something like:

$ make test
sum: 4e+09 FLOPS: 8e+09
sum: 4e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
FLOPS: 4e+09
grep cycles dot_log.txt
     4,002,420,630      cycles:u    
     1,000,606,020      cycles:u    
     9,150,727,368      cycles:u
     6,361,410,330      cycles:u
    11,047,080,010      cycles:u

The dotne10 program requires the Ne10 library. There’s a bit of floating point round off in some of the program outputs (adding 1.0 to a big number), but that’s not really a bug.
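
The round off is just single precision behaviour: around 4e9 the spacing between adjacent floats is 256, so adding 1.0 to the running sum does nothing. A quick illustration:

#include <stdio.h>

int main(void)
{
    float sum  = 4e9f;
    float sum1 = sum + 1.0f;   /* at this magnitude the float spacing (ulp) is 256, so +1.0f is lost */
    printf("%d\n", sum1 == sum);   /* prints 1 */
    return 0;
}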

Some resources I did find useful:

  1. tterribe NEON tutorial. I’m not sure if the A53 has the same cycle timings as the Cortex-A core discussed in that document.
  2. ARM docs. I looked at the DDI 0487 ARMv8 Architecture Reference Manual, the DDI 0500 Cortex-A53 TRM, and the DDI 0502 Cortex-A53 FPU TRM; the two TRMs reference the DEN0013 ARM Cortex-A Series Programmer’s Guide. I couldn’t find any instruction cycle timing in any of them, but section 20.2 of DEN0013 has some general tips.
  3. Linux perf was useful for cycle counts, and in record/report mode may help visualise pipeline stalls (but I’m unclear if that’s what I’m seeing due to my limited understanding).

5 thoughts on “ARM NEON Optimisation”

    1. Thanks Diego. However I’m not clear if the A53 (ARMv8) in AArch32 mode has the same cycle timing as the Cortex A8, which is a somewhat older family. The paper hints that reference (40) may be a helpful online tool, but the link is dead.

      1. That’s a good question.

        I have to admit I didn’t read the paper, as I was sure I had seen more up to date work on various NEON optimised algorithms, maybe around 2016/17, but I just can’t seem to find the refs. There was a time I was incredibly interested in crypto, but now my life somehow doesn’t let me have the peace of mind to dig into it, you know, paid work and such weird concepts…

        Should I come across something more relevant, I’ll post here.

        1. I was digging a bit deeper and it seems the Cortex A8 is the reference in crypto, and I remembered it wrongly.

          Sorry, but I hope you will still make your way through the poorly documented stretches of the ARM landscape.

  1. There are lies, damn lies, and spec-sheets.

    The memory interface on the RPi isn’t wide enough, and the NEON unit has all sorts of stalls in it. You don’t have enough regular registers in the general case in 32 bit mode. You would be better off trying to get onto a more recent ARM and trying again, and/or quitting at your current 3x level.