I’ve been trying to optimise NEON DSP code on a Raspberry Pi. Using the intrinsics I managed to get a speed increase of about 3 times over vanilla C with just a few hours’ work. However the results are still significantly slower than the theoretical peak of the machine, which is 4 multiply-accumulates (8 float operations) per cycle. On a 1.2 GHz core that’s 9.6 GFLOPS.
Since then I’ve been looking at ARM manuals, Googling, and trying various ad-hoc ideas. There is a lack of working, fully optimised code examples, and I can’t find any cycle time or latency data for the Cortex-A53 core used in the Pi. The number of ARM devices and families is bewildering, and trying to find information in a series of thousand-page manuals is daunting.
Fortunately the same NEON assembler seems to work (i.e. it assembles cleanly and you get the right results) on many ARM machines. It’s just unclear how fast it runs and why.
To get a handle on the problem I wrote a series of simple floating point dot product programs, and attempted to optimise them. Each program runs through a total of 1E9 dot product points, using an inner and outer loop. I made the inner loop pretty small (1000 floats) to try to avoid cache miss issues. Here are the results, using cycle counts measured with “perf”:
| Program | Test | Theory cycles/loop | Measured cycles/loop | GFLOPS |
|---|---|---|---|---|
| dot1 | Dot product, no memory reads | 1 | 4 | 1.2\*8/4 = 2.4 |
| dot2 | Dot product, no memory reads, unrolled | 1 | 1 | 1.2\*8/1 = 9.6 |
| dot3 | Dot product with memory reads | 3 | 9.6 | 1.2\*8/9.6 = 1 |
| dot4 | Dot product with memory reads, assembler | 3 | 6.1 | 1.2\*8/6.1 = 1.6 |
| dotne10 | Dot product with memory reads, Ne10 | 3 | 11 | 1.2\*8/11 = 0.87 |
Cycles/loop is how many cycles are executed for one iteration of the inner loop. The last column assumes a 1.2 GHz clock, and 8 floating point ops for every NEON vector multiply-accumulate (vmla.f32) instruction (a multiply, an add, and 4 floats per vector processed in parallel).
The only real success I had was dot2, but that’s an unrealistic example as it doesn’t read memory in the inner loop. I guessed that the latencies in the NEON pipeline meant an unrolled loop would work better.
Assuming (as I can’t find any data on instruction timing) two cycles for the memory reads and one for the multiply-accumulate, I was hoping for 3 cycles per loop for dot3 and dot4 – maybe even better if there is some dual issue magic going on. The best I can do is 6 cycles.
I’d rather have enough information to “engineer” the system than have to rely on guesses. I’ve worked on many similar DSP optimisation projects in the past which have had data sheets and worked examples as a starting point.
Here is the neon-dot source code on GitLab. If you can make the code run faster – please send me a patch! The output looks something like:
```
$ make test
sum: 4e+09 FLOPS: 8e+09
sum: 4e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
sum: 4.03116e+09 target cycles: 1e+09 FLOPS: 8e+09
FLOPS: 4e+09
$ grep cycles dot_log.txt
     4,002,420,630      cycles:u
     1,000,606,020      cycles:u
     9,150,727,368      cycles:u
     6,361,410,330      cycles:u
    11,047,080,010      cycles:u
```
The dotne10 program requires the Ne10 library. There’s a bit of floating point round-off in some of the program outputs (adding 1.0 to a big number); that’s not really a bug.
Some resources I did find useful:
- tterribe NEON tutorial. I’m not sure if the A53 has the same cycle timings as the Cortex-A core discussed in that document.
- ARM docs. I looked at DDI 0487 (ARMv8 Architecture Reference Manual), DDI 0500 (Cortex-A53 TRM), and DDI 0502 (Cortex-A53 FPU TRM), which reference DEN 0013, the ARM Cortex-A Series Programmer’s Guide. I couldn’t find any instruction cycle timing in any of them, but section 20.2 of DEN 0013 had some general tips.
- Linux perf was useful for cycle counts, and in record/report mode may help visualise pipeline stalls (though I’m unclear whether that’s what I’m seeing, given my limited understanding).