How to make your Blackfin fly Part 2

This article describes how to write fast DSP code for your Blackfin.

For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).

We built on this work, reducing the complexity of the encode operation from about 40 MIPs to 23 MIPs. On a typical 500 MHz Blackfin this means you can now run 500/23 ≈ 21 Speex encoders in real time.

The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.

We obtained gains from:

1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.

2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.

3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.

4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.

Why The Blackfin is Different

Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.

The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces system costs compared to designs that need a separate DSP and microcontroller.

The disadvantage for crusty old DSP programmers like me is that things don’t always run as fast as you would like: if your precious DSP code doesn’t happen to be in cache when it is called, you get hit with a big performance penalty.

Some examples

To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.

The cycles program shows how to optimise a simple dot product routine; I have previously blogged about this here.

A Simple Library for Profiling

To work out where to optimise Speex I developed a simple library to help profile the code. It works like this. You include the samcycles.h header file and insert macros:

SAMCYCLES("start");
/* ...code to profile... */
SAMCYCLES("end");


around the functions you wish to profile. Then, when you run the program it dumps the number of cycles executed between each macro:
root:/var/tmp> ./test_samcycles
start, 0
end, 503
TOTAL, 503

This shows that 503 cycles were executed between “start” and “end”. Here is more complex output from the “dark interior” of the Speex algorithm:
root:/var/tmp> ./testenc male.wav male.out
start nb_encode, 0
move, 1352
autoc, 16149
lpc, 3180
lpc_to_lsp, 21739
whole frame analysis, 17797

Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others. So those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of “inner loop” operations that are hogging
all of the CPU.

Care and Feeding of your Blackfin Cache

One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
"P0 = %2;\n\t"
"R0 = 0;\n\t"
"%0 = CYCLES;\n\t"
"LOOP dot%= LC0 = %3;\n\t"
"LOOP_BEGIN dot%=;\n\t"
"W[P0 ] = R0;\n\t"
"LOOP_END dot%=;\n\t"

This also illustrates why DSPs are so good at number crunching – that inner “W[P0++] = R0” instruction executes in one cycle, and the hardware loop means zero cycles of loop overhead. Try doing that on your Pentium.

However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
686 542 597 542 542 542 542 597 542 542

The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instructions and data get placed into cache.

However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, when it should be executing in a single cycle. My 500 MHz DSP is performing like a 100 MHz DSP. I think I am going to cry.

The reason is that in “write through” mode every write must flow through the “narrow pipe” that connects to the Blackfin’s external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed.

This is not good news for a DSP programmer, as you often have lots of vectors that need to get written to memory. Very quickly.

There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).

Another less exciting way is to enable “write back” cache (a kernel configuration option):
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
119 102 102 102 102 102 102 102 102 102

Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles, as expected. Note that the first run takes a little longer; this is probably because the program code had to be loaded into the instruction cache. In “write back” mode the values are stored in fast cache until the cache line is flushed to external memory some time later.

On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.

In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…

Internal Memory

The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.

The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.

In the meantime I have written a kernel-mode driver, l1_alloc, that allows user-mode programs to access internal memory:
/* try alloc-ing and freeing some memory */

pa = (void*)l1_data_A_sram_alloc(0x4);
printf("pa = 0xx\n",(int)pa);
ret = l1_data_A_sram_free((unsigned int)pa);

which produces the output:
root:~> /var/tmp/test_l1_alloc
pa = 0xff803e30

i.e. a chunk of memory with an address in internal memory bank A.

To see the effect of internal versus cache/external memory:
Test 1: data in external memory
ret = 100: run time: 173 103 103 103 103 103 103 103 103 103
Test 2: data in internal memory
ret = 100: run time: 103 103 103 103 103 103 103 103 103 103

After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.

I should mention that to make this driver work I needed to add an entry to my
uClinux-dist/vendors/AnalogDevices/BF537-STAMP/device_table.txt file:
/dev/l1alloc c 664 0 0 254 0 0 0 -

then rebuild Linux, as for some reason I couldn’t get mknod to work. Then:
root:~> ls /dev/l1alloc -l
crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc
root:~> cp var/tmp/l1_alloc_k.ko .
root:~> insmod l1_alloc_k.ko
Using l1_alloc_k.ko
root:~> /var/tmp/test_l1_alloc

Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.

Speex Benchmarks

Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:

The terms ext/int memory refer to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20 ms frame, averaged over a 6 second sample (male.wav).

(1) Write-through cache, ext memory: 564
(2) Write-through cache, int memory: 455
(3) Write-back cache, ext memory: 465
(4) Write-back cache, int memory: 438

So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain.

Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.

To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
svn co

Or you can download a current snapshot from here.


4 thoughts on “How to make your Blackfin fly Part 2”

  1. Okay, wow – thanks for the very useful info…

    One question though: You write, “There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).”

    I am very curious about this comment… would you care to elaborate? I’m in the process of choosing a DSP for an application right now; I’m thinking of either an ADSP-BF537, or a TMS320VC5509A. Is there something about the difference between the architectures, for the case you discuss, that makes the TI part more straightforward to get to go faster, in this case?

    I’m looking at Linux to avoid buying the $4000 worth of tools to get going with a Blackfin (or a TMS*, for that matter), but I’m wondering if this will just get me a lot of headaches… I wonder if it’s just a much larger amount of work to get efficient object code out of gcc, than from Analog / TI’s native tools…

    Any thoughts?



  2. Hi Dave,

It really depends on the application, and perhaps your preferences. Last time I checked the TI tools were Windows/GUI based, and had expensive single-user licenses. I think the BF537 is much faster than a C55x class chip, but I haven’t checked the benchmarks recently.

However developing Blackfin asm in gcc macros is tough – a nice single-stepping GUI debugger like Code Composer (under Linux) would be very helpful. Actually I liked the DOS-based TI C5x tools of 10 years ago for the C50 better than today’s gdb!

Once you understand the architectures (as this post series describes) it’s not hard to get the performance you require.

I would guess that time-wise you can get going on a Blackfin faster – buy a $226 STAMP, download the free tools, and you can be writing optimised assembler that day.

Of course if you need an operating system, the Blackfin wins hands down, as you have uClinux.

    I have developed on both families. I don’t like windows based development (of anything), so having gcc and command line tools (even with poor debugger) makes me prefer the Blackfin.

    – David

Comments are closed.