How to make your Blackfin fly Part 1

The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux, has a great open source community, and there are even open (free) hardware designs available.

I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

Boring Mathematical Bit

As a first step I have written a test program called cycles.c that demonstrates how to optimise code on the Blackfin for DSP operations. A tar-ball including a Makefile is here.

The sample code just finds the dot product of two vectors:

int dot(short *x, short *y, int len)
{
   int i,dot;
 
   dot = 0;
   for(i=0; i<len; i++)
     dot += x[i] * y[i];
 
   return dot;
}

It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually, that’s all a DSP really is: a processor designed to compute dot products quickly.

The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 million MACs per second.
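To make the two-MACs-per-cycle idea concrete, here is a plain C sketch (not the code from cycles.c) that splits the sum across two accumulators, the way a dual-MAC loop consumes two samples per iteration. The name dot_dual is made up for this illustration, and the sketch assumes len is even. Whether a compiler actually maps this onto both MAC units is not something the sketch guarantees.

```c
#include <assert.h>

/* Illustrative only: two independent accumulators stand in for the
   two MAC units. Assumes len is even. */
int dot_dual(const short *x, const short *y, int len)
{
    int acc0 = 0, acc1 = 0;
    int i;

    assert(len % 2 == 0);

    for (i = 0; i < len; i += 2) {
        acc0 += x[i]     * y[i];      /* first MAC of the pair  */
        acc1 += x[i + 1] * y[i + 1];  /* second MAC of the pair */
    }

    return acc0 + acc1;
}
```

For a 100 point vector the loop body runs 50 times, which is where the N/2 = 50 cycle best case comes from if each iteration costs one clock cycle.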

Down to Business

Enough talk, here is a run of the sample code from my BF537 STAMP:

root:/var/tmp> ./cycles
Theoretical best case is N/2 = 50 cycles
Test 1: Vanilla C
  ret = 100: run time:
  3838 3507 3373 3408 3373 3373 3373 3373 3373 3373
Test 2: data in external memory, outboard cycles function
  ret = 100: run time:
  442 240 239 218 218 218 218 218 218 218
Test 3: data in external memory, inboard cycles
  ret = 100: run time:
  242 103 103 103 103 103 103 103 103 103
Test 4: data in internal memory, inboard cycles
  ret = 100: run time:
  214 53 53 53 53 53 53 53 53 53

A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.

Each test runs 10 times. In several of the tests you can see the number of cycles falling as the instruction and data caches fill over successive runs.

The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:

int cycles() {
  int ret;
 
   __asm__ __volatile__
   (
   "%0 = CYCLES;\n\t"
   : "=&d" (ret)
   :
   : "R1"
   );
 
   return ret;
}

Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles; Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.
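The inlining alternative might look something like the sketch below. It is not taken from cycles.c: the names cycles_inline and timed_dot are hypothetical, the __bfin__ check assumes the GCC Blackfin toolchain defines that macro, and the clock() fallback is only there so the sketch builds on a host PC (where it counts clock ticks, not CPU cycles).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical inlined variant of cycles(): the compiler can expand it
   at the call site, so no call to a counter function sits between the
   two samples. */
static inline unsigned int cycles_inline(void)
{
#if defined(__bfin__)
    unsigned int ret;
    __asm__ __volatile__ ("%0 = CYCLES;" : "=d" (ret));
    return ret;
#else
    /* Host fallback (assumption): clock ticks, not CPU cycles. */
    return (unsigned int)clock();
#endif
}

/* Plain C dot product, used here as the function under test. */
int dot(const short *x, const short *y, int len)
{
    int i, sum = 0;

    for (i = 0; i < len; i++)
        sum += x[i] * y[i];

    return sum;
}

/* Sample the counter before and after one call to dot(). */
unsigned int timed_dot(const short *x, const short *y, int len, int *result)
{
    unsigned int before, after;

    assert(result != NULL);

    before  = cycles_inline();
    *result = dot(x, y, len);
    after   = cycles_inline();

    return after - before;  /* unsigned subtraction survives wrap-around */
}
```

Calling timed_dot() ten times in a row and printing each return value would give a row of numbers in the same style as the output above.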

For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 show how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM. After a run they get to L1 cache, but they are still in the same bank of physical memory, so the run takes roughly twice as long (103 cycles versus 53).
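One way to read those numbers is with a back-of-envelope model. The sketch below is a guess at the arithmetic, not something taken from the Blackfin manuals: it assumes each loop iteration handles two samples and needs one fetch from each vector, and that two fetches from the same bank take a cycle each instead of happening together. The function name estimate_cycles is made up.

```c
#include <assert.h>

/* Rough model, not measured data. same_bank is non-zero when both
   input vectors live in the same memory bank. */
int estimate_cycles(int n, int same_bank)
{
    int iterations, cycles_per_iteration;

    assert(n > 0 && n % 2 == 0);

    iterations = n / 2;                        /* two MACs per iteration */
    cycles_per_iteration = same_bank ? 2 : 1;  /* serialised vs parallel fetch */

    return iterations * cycles_per_iteration;
}
```

For N = 100 the model gives 100 cycles with both vectors in one bank and 50 with them in separate banks, against the measured 103 and 53. The extra 3 cycles in each case would be loop setup and similar overhead.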

Allocating Internal Memory

At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:

/* I know, I know - this is very naughty :-) */
short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */
short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */

I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?
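For anyone checking the address arithmetic in that hack, the sketch below works the numbers out on plain integers, so it can be run on a host PC without touching the memory. The function names are hypothetical, N = 100 is assumed from the N/2 = 50 line in the program output, and sizeof(short) is assumed to be 2.

```c
#include <assert.h>
#include <stdint.h>

#define N 100   /* vector length, assumed from the program output */

/* Byte address of an n-element short array ending exactly at top. */
uint32_t base_exact(uint32_t top, unsigned int n)
{
    uint32_t bytes = (uint32_t)(n * sizeof(short));

    assert(bytes <= top);
    return top - bytes;
}

/* Byte address produced by (short *)top - n * sizeof(short): C scales
   pointer arithmetic by the element size, so the step back is
   n * sizeof(short) elements rather than bytes. */
uint32_t base_as_short_ptr(uint32_t top, unsigned int n)
{
    uint32_t elements = (uint32_t)(n * sizeof(short));
    uint32_t bytes    = elements * (uint32_t)sizeof(short);

    assert(bytes <= top);
    return top - bytes;
}
```

With top = 0xff904000 the exact fit starts at 0xff903f38, while the pointer expression lands at 0xff903e70. That reserves 400 bytes rather than 200, which is harmless for this test since the arrays still sit at the top of each SRAM bank.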

BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.

Summary

Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 80 (3838 cycles in Test 1 versus roughly 50 in Test 4). However, with a little optimisation and some hand-coded assembler, it is possible to get full performance from the chip.

I know hand-coding optimised assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marc’s samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.

Acknowledgements

Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex; all that optimised fixed point DSP code makes my head spin!

5 comments to How to make your Blackfin fly Part 1

  • [...] The cycles program shows how to optimise a simple dot product routine, I have previously blogged on this here. [...]

  • Brent Adams

    David Rowe

    Hi:

    I ran cycles.c from your “How to make your Blackfin Fly Part 1” article on my BF537 STAMP ver 1.3 board and got the results shown below. I am wondering why the cycle counts are so much larger than yours. I was able to duplicate your result of 5.4 cycles per operation as shown in the “How to make your Blackfin fly Part 2” blog. Is this just a write through / write back issue?

    Thanks

    Brent Adams

    Results from cycles.c per “How to Make My Blackfin Fly Part 1″

    Theoretical best case is N/2 = 50 cycles
    Test 1: Vanilla C
    ret = 100: run time:
    45875 45211 45206 45246 45271 45276 45246 45246 45271 45211
    Test 2: data in external memory, outboard cycles function
    ret = 100: run time:
    5970 4983 4978 4933 4978 4978 4933 5003 4998 4933
    Test 3: data in external memory, inboard cycles
    ret = 100: run time:
    4285 4250 4250 4300 4250 4250 4300 4250 4250 4290
    Test 4: data in internal memory, inboard cycles
    ret = 100: run time:
    53 53 53 53 53 53 53 53 53 53

  • David Rowe

    Hi Brent,

    I just ran the program again on both a BF537 and a BF533 (both have write back cache enabled):

    root:/var/tmp> ./cycles
    Theoretical best case is N/2 = 50 cycles
    Test 1: data in external memory, outboard cycles function
    ret = 100: run time: 359 175 174 174 174 174 174 174 174 174
    Test 2: data in external memory, inboard cycles
    ret = 100: run time: 138 103 103 103 103 103 103 103 103 103
    Test 3: data in internal memory, inboard cycles
    ret = 98: run time: 89 53 53 53 53 53 53 53 53 53
    root:/var/tmp>

    I can’t think why your numbers were so high? Are you using a stock kernel?

    - David

  • Pac Austin

    hi, i`m sry that my question is out of issue but i`m kinda newbi to this … i want to do some dsp myself based on a webcamera , but i can`t find any examples of codes anywhere … if you can help pls e-mail me Thx
