This article describes how to write fast DSP code for your Blackfin.
For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).
We built on this work, reducing the complexity for the encode operation from about 40 MIPs to 23 MIPs. On a typical 500MHz Blackfin this means you can now run (500/23) = 21 Speex encoders in real time.
The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.
We obtained gains from:
1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.
2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.
3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.
4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.
Why The Blackfin is Different
Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.
The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces systems costs over designs that need a separate DSP and micro controller.
The disadvantage for crusty old DSP programmers like me is that things don't always run as fast as you would like them to: for example, if your precious DSP code doesn't happen to be in cache when it is called, you get hit with a big performance penalty.
To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.
The cycles program shows how to optimise a simple dot product routine; I have previously blogged about this here.
A Simple Library for Profiling
To work out where to optimise Speex I developed a simple library to help profile the code. It works like this. You include the samcycles.h header file and insert SAMCYCLES() macros around the code you wish to profile:

SAMCYCLES("start");
for(i=0; i<10; i++)
    ;
SAMCYCLES("end");

Then, when you run the program it dumps the number of cycles executed between each macro:

start, 0
end, 503
Which shows that between “start” and “end” 503 cycles were executed. Here is a more complex output from the “dark interior” of the Speex algorithm:
root:/var/tmp> ./testenc male.wav male.out
start nb_encode, 0
whole frame analysis, 17797
Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others, so those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of "inner loop" operations that are hogging all of the CPU.
Care and Feeding of your Blackfin Cache
One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
"P0 = %2;\n\t"
"R0 = 0;\n\t"
"%0 = CYCLES;\n\t"
"LOOP dot%= LC0 = %3;\n\t"
"W[P0 ] = R0;\n\t"
This also illustrates why DSPs are so good at number crunching – that inner "W[P0++] = R0" instruction executes in one cycle, and the hardware loop means 0 cycles of loop overhead. Try doing that on your Pentium.
However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
Test 1: Write 100 16-bit shorts
686 542 597 542 542 542 542 597 542 542
The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instruction code and data gets placed into cache.
However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, and it should be executing in a single cycle. My 500 MHz DSP is performing like a 100 MHz DSP. I think I am going to cry.
The reason is that in "write through" mode every write must flow through the "narrow pipe" that connects to the Blackfin's external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed: a 500 MHz core writing through a 100 MHz memory works out to about 5 core cycles per write, which matches the 5.4 cycles we measured.
This is not good news for a DSP programmer, where you often have lots of vectors that need to get written to memory. Very quickly.
There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).
Another less exciting way is to enable “write back” cache (a kernel configuration option):
Test 1: Write 100 16-bit shorts
119 102 102 102 102 102 102 102 102 102
Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles, as expected. Note that the first run takes a little longer; this is probably because the program code had to be loaded into the instruction cache. In "write back" mode the values get stored in fast cache until the cache line is flushed to external memory some time later.
On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.
In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…
The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.
The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.
In the meantime I have written a kernel-mode driver, l1_alloc, that allows user-mode programs to access internal memory:
/* try alloc-ing and freeing some memory */
pa = (void*)l1_data_A_sram_alloc(0x4);
printf("pa = 0xx\n",(int)pa);
ret = l1_data_A_sram_free((unsigned int)pa);
which produces the output:
pa = 0xff803e30
i.e. a chunk of memory with an address in internal memory bank A.
To see the effect of internal versus cache/external memory:
Test 1: data in external memory
ret = 100: run time: 173 103 103 103 103 103 103 103 103 103
Test 2: data in internal memory
ret = 100: run time: 103 103 103 103 103 103 103 103 103 103
After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.
I should mention that to make this driver work I needed to add an entry to my device table:
/dev/l1alloc c 664 0 0 254 0 0 0 -
then rebuild Linux, as for some reason I couldn't get mknod to work. Then:
root:~> ls /dev/l1alloc -l
crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc
root:~> cp /var/tmp/l1_alloc_k.ko .
root:~> insmod l1_alloc_k.ko
Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.
Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:
The terms ext/int memory refer to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20ms frame, averaged over a 6 second sample (male.wav).
(1) Write through cache, ext memory: 564
(2) Write through cache, int memory: 455
(3) Write back cache , ext memory: 465
(4) Write back cache , int memory: 438
So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain. At 438 k-cycles per 20ms frame, the best case (4) works out to about 22 MIPs.
Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.
To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
svn co http://svn.xiph.org/trunk/speex
Or you can download a current snapshot from here.