## A Bit Exact DSP Fairy Tale

Once upon a time in a far off land there was an underpaid pale skinny guy working in an impoverished start up who wanted to build products with fixed point DSP chips in them. The problem was he had just about killed himself doing the same thing many times in the past.

In particular, he seemed to spend a zillion hours debugging intricate fixed point DSP assembler to get products working. All while some project manager who hadn’t written a line of code for 10 years was hovering behind him asking questions like “is it ready” and mumbling something about milestone dates that were ridiculous when proposed, and soon became hilarious after the project started.

So, our hero wondered, “is there a better way”.

How to Build a DSP Based Product

Many DSP products are developed like this:

The academics “brains trust” guys work out a clever new algorithm. Now these guys are way-smart, but tend to express themselves in algebra with lots of Greek letters. They usually have PhDs, and cost a lot, and there aren’t many of them. They might test their brain-child in a simulation algorithm like Matlab:

x = [1.1 2.0 3.0];
y = [2.5 3.0 4.0];
dot = x * y’;

Now this lets them test the algorithm quickly (dot = 20.75), but you can’t run code like this on a real world product due to (a) it runs 10 million times slower than real time and (b) Matlab licences cost \$5,000 each and only run on \$3000 PCs. Which is a shame when u are building a \$99 product.

So some poor engineer then has to make the DSP algorithm work in real time (i.e. fast) on the “target” hardware, a \$10 DSP chip. Actually, in general, these guys actually like the detail work this involves. The first step they might do is convert to floating point C:
```float x = {1.1,2.0,3.0}; float y = {2.5,3.0,4.0};```

``` float dot(float x[], float y[], int n) { float d = 0.0; int i; for(i=0; i<n; i ) d = x[i]*y[i]; ```

``` return d; }```

But the guys & gals in Marketing (they are a breed closely related to Managers, but tend to be more attractive and use airports more) have decided that this must be a low cost product, so a \$10 fixed point DSP is chosen.

Now a fixed point DSP is a chip that can perform multiplies and adds quickly, as long as you only use integers. They don’t do floats. So the engineer converts the algorithm to fixed point:
```short x = {11,20,30}; /* x*10 */ short y = {25,30,40}; /* x*10 */```

``` int dot(short x[], short y[], int n) { int d = 0; int i; for(i=0; i<n; i ) d = x[i]*y[i]; ```

``` return d; /* returns dot*100 */ }```
Now look what has happened to the x[] and y[] arrays. We have multiplied each number by 10 so we can use an integer to hold the value. If we didn’t scale each number then we would lose information (for example converting the 1.1 to 1) which on certain algorithms would cause problems.

However because each input value is multiplied by 10, that means the result is multiplied by 100 ( 10x * 10y = 100xy ). So we need to take this output scaling into account in any other stages of the algorithm. One more piece of information we need carry around with us as we code. One more source of bugs.

This is a very simple scaling example. In practice it gets much scarier and means you need to hold a lot more information in your head when you program in fixed point (see complex sample below).

There is also the potential of overflow problems. As the output is scaled up by 100, it is easy to overflow the 32 bit range of the integer output, for example if we used x[] and y[] arrays with larger numbers.

I Feel the Need for Speed

The product requires that this routine must run as fast as possible on the DSP chips. So we need to code an optimised version of this routine in assembler:
```int dot(short *x, short *y, int len) { int d;```

``` __asm__ ( "I0 = %3;\n\t" "I1 = %4;\n\t" "A1 = A0 = 0;\n\t" "R0 = [I0 ] || R1 = [I1 ];\n\t" "LOOP dot%= LC0 = %5 >> 1;\n\t" "LOOP_BEGIN dot%=;\n\t" "A1 = R0.H * R1.H, A0 = R0.L * R1.L || R0 = [I0 ] || R1 = [I1 ];\n\t" "LOOP_END dot%=;\n\t" "R0 = (A0 = A1);\n\t" "%0 = R0 >> 1;\n\t" : "=&d" (d), "=&d" (before), "=&d" (after) : "a" (x), "a" (y), "a" (len) : "I0", "I1", "A1", "A0", "R0", "R1" ); ```

``` return d; }```

This is assembler for the Blackfin DSP, the actual code will of course vary for different DSP chips. The important bit is this:
``` A1 = R0.H * R1.H, A0 = R0.L * R1.L || R0 = [I0 ] || R1 = [I1 ]```

In a single clock cycle, our intrepid DSP performs two parallel 16-bit multiplies, two 16-bit adds, and fetches 2 32 bit words from memory, post incrementing the pointers after each fetch. It does all this 500 million times a second, and sometimes it even gives you the right answer.

But not often. First you need to spend a lot of time getting it all to work right. The assembler is much harder to grok than the previous versions of the algorithm, and you use up a lot of brain power on a myriad of fine details. Get any one of the details wrong and the product is a dud.

The target development environment is also tough. In many cases the feedback you use to debug might be an analogue signal flowing into or out of the product (that is processed internally by the DSP). Or a blinking LED. Sometimes you don’t even know when its really working, bugs can simply slip through. You can’t always stop the code and single step using a debugger as parts of the algorithm depend on real-time activity.

It’s a development/testing/maintenance nightmare.

Bit Exact Assembler

There is a way to ease the pain a little. Lets say there is a bug in the assembler version. The problem is that to debug the algorithm you may need to understand both the assembler and the DSP algorithm, I mean to debug something you need to know how it works, right?

This is OK for a simple algorithm, but what if it is even moderately complex? Then you are dealing with (i) a very complex algorithm and (ii) very complex assembler all at the same time. This is usually too much for mere mortals (and even minor deities). The result is long, painful development, failed projects, late nights, angry spouses, and lots of pizza (well its not all bad I guess).

The trick is to divide and conquer. Make sure we are only hitting one tough problem at any given time.

One approach I have found very useful is bit exact porting to assembler. There are two important steps:

1. At the fixed point C stage, you test very very carefully. Run batteries of tests and simulations. These can be performed on a regular PC in non real time, using powerful debuggers. The idea is to verify that (i) the algorithm is OK and (ii) the fixed point port works OK. In particular (ii) is very tough, so its nice to handle this in a relatively benign environment like a PC or workstation.

2. Port each function to real time assembler one by one. Test each function against the fixed point C reference. Make sure the functions give IDENTICAL output – right down to the last bit. This takes a lot of discipline – near enough is NOT good enough.

The benefits of bit exact coding are:

1. When coding assembler, you don’t have to worry or even understand the algorithm. You just have to make sure that the C version matches the assembler version. When they do, you know that the overall system will perform as expected. This lightens the load on your brain, leading to less bugs.

2. As you integrate modules of code together, problems are easy to spot, just look for where the output doesn’t match the fixed point C simulation. This is much simpler that performing batteries of tests in a real time environment.

3. A less senior engineer can be used to perform the assembly coding, as they don’t need algorithm knowledge. This makes it easier to find people for the project, lowering costs. Adding people to the project is less of an issue, as they can get up to speed and be productive quickly.

4. Any bugs are exposed early (while unit testing), rather than late (after integration). You get an honest measure of progress, rather than nasty surprises late in the project.

Where it fails:

1. It takes a lot of discipline to perform (1) CAREFULLY. There is a lot of pressure from managers (and engineers) to jump onto the sexy real time/hardware integration stage. If you are part way through (1), and there is a milestone due, it is very tempting to “declare victory” and move onto the next stage, WAY before you are ready. The results are predictable…

2. It takes discipline to make the assembler bit exact. I mean, look, it’s nearly the same, just a few bits are different? The problem is these little errors may lead to much bigger flaws in performance later on. Try changing a single bit in your operating system binary image and see how well it runs. In some (most) cases bit-exactness matters!

3. It is honest. It gives you a black and white go/no-go indication at various stages of the project. For example the fixed point C stage is not complete until it has shown perfect performance against the spec. This reduces the “room to move” managers have, you can’t declare a phase “complete” just to suit a milestone date. It really has to work. This sort of honesty doesn’t sit well with some managers, for example they may have to admit problems to a customer early on in a project.

Case Study – Non Bit Exact

Now here is what typically happens. Our hero is under pressure from his manager to progress, so he short circuits the fixed point C simulation stage and codes straight into assembler. The C simulation and bit exact assembler have results that are “close” but not really the same. No effort is made to synchronise the results, there isn’t enough time!

Time is also short so he can’t “waste time” on unit testing, there is a milestone with a big fat cash payment attached.

The assembly code is loaded into flash on some brand-new, untested hardware, power is applied, and some basic tests performed.

Damn. It doesn’t work.

So our hero and his team start trying to debug the real time assembler code. However debugging in real time is very time consuming. The best brains in the company spend months tracking down bugs that end up being simple one line fixes. Bugs constantly interact with each other, leading the team off in false directions.

He doesn’t go back to the C simulation as it now gives different results to the assembler and “we don’t have enough time” to make them match.

As the project falls behind, there is no one who can be added, as only people who understand both the algorithms and the assembler can be used. The team is debugging leading edge algorithms in assembler.

A year passes. Senior management is fuming. They keep getting told “real soon now”. However no-one really knows how long it will be, because no one really knows just how many bugs there are. Frustration and stress is very high all around.

Eventually the company runs out of funds and the project is cancelled. There are mass departures of some of the best and most capable people from the company. Our hero leaves to join another start up where he repeats the same process.

Case Study – Bit Exact

Our hero and some members of the brains trust spend many months carefully porting the algorithm to fixed point and testing the entire product in a robust simulation. This apparent lack of progress is tough for the managers to handle, but they understand the benefits and bravely support the team.

Lots of complex bugs are thrown up, and the team works through them one by one. They don’t take too long to fix because these tough problems are at least occurring in a PC environment, where it easy to track them down.

In parallel, the fixed point C modules are passed to a team to perform the assembler coding. This team is made of up many junior engineers, who are instructed to make sure each module is bit exact with the reference C simulation. A few of them don’t like this, but the more senior members keep them on track. The junior members do a good job, as all they need to focus on is the assembler, not the algorithm details.

A few bugs slip through unit testing to integration. A senior member who is a veteran of previous “death march” DSP projects is assigned to fix the bug. She discovers that the the problem was due to an exceptional case not covered in the unit test and sure enough the assembler module wasn’t bit-exact under that case. She had previously been a critic of the extra effort required for bit exactness but the speed in which this bug was found and fixed impresses her.

Early in the project some unforeseen problems are discovered during the work on the fixed point C simulation. Some parts are running slower than expected which means more assembler may be required in the final product. This throws up an early indication that the project is at risk of falling behind.

This critical management information is flagged early, while there was time to do something about it. It is decided to add some extra engineers to the assembler side. As soon as they learn the DSP instruction set they quickly come up to speed. They don’t need to know or understand the algorithm details.

The schedule still slips by a few weeks, but all the stakeholders understand that this is a better outcome than delivering a lemon on time. No further slips occur, and the project manager earns a great deal of trust and respect by delivering “bad news early” rather than trying to hit arbitrary milestone dates and risking the entire project.

When the real-time DSP software is integrated with the hardware there are still a few bugs to find. However by this stage the real time software is at a very high quality level so the bug count is low. Integration actually proceeds on schedule, amazing everybody, especially veterans of past projects.

The product is a success, the start up has a great IPO, and the pale skinny guy retires at 28 years old.

Complex Example

Here is a rather more complex example of fixed point C code from the Speex project:
```static inline spx_word32_t cheb_poly_eva( spx_word16_t *coef, spx_word16_t x, int m, char *stack ) { int i; spx_word16_t b0, b1; spx_word32_t sum;```

``` /*Prevents overflows*/ if (x>16383) x = 16383; if (x<-16383) x = -16383; /* Initialise values */ b1=16384; b0=x; /* Evaluate Chebyshev series formulation using an iterative approach */ sum = ADD32(EXTEND32(coef[m]), EXTEND32(MULT16_16_P14(coef[m-1],x))); for(i=2;i<=m;i ) { spx_word16_t tmp=b0; b0 = SUB16(MULT16_16_Q13(x,b0), b1); b1 = tmp; sum = ADD32(sum, EXTEND32( MULT16_16_P14(coef[m-i],b0))); } ```

``` return sum; }```

In English (sort of), this function works the value of a polynomial at various points around a circle. Don’t ask me why. DSP engineers just like doing things like that. For more information/punishment see the file lsp.c from the Speex project.

Macros like ADD32() simulate the instruction set of the DSP chip in the simulation code, which makes porting this function actual DSP assembler a simpler step. However they also make the code much harder to read.

## Renumbering components using Perl

I am designing a telephony version of the BlackfinOne DSP/uClinux board. To develop the design I am using the gEDA CAD tools. This board runs uClinux on a Blackfin BF532 processor, and has been designed using the open source gEDA tools which makes it really easy for anyone to re-use the design. Thank you very much to Dimitar Penev and Ivan Danov who designed and published this board. They have done an excellent job.

The first stage in any electronic design is usually to enter the schematic. The schematic shows the components and their wiring in a form that is easy to follow:

I wanted to add two Ethernet ports to the BlackfinOne design, so I found a reference design and entered the schematic using the gschem program.

After I entered some sheets for the Ethernet chips, I had a little problem. In the schematic above you can see that every component has a letter & number next to it, for example, C1, C2 for capacitors, or U1, U2 etc for chips. This is a unique ID for each part. The letter generally indicates what sort of part is is (capacitor, resistor, chip etc). These IDs are know as reference designators or as a “refdes”.

Now I wanted the refdes on my new Ethernet sheets to merge nicely with those on the existing BlackfinOne design. For example if the last capacitors refdes on the BlackfinOne design stopped at C51, then my C’s should start at C52.

There is a program to do this (refdes_renum described here in the gEDA tutorial) but unfortunately it renumbers all of the refdes in the design, and I only wanted to renumber the new parts that I had added. I didn’t want to mess up anything in the existing design. Please also see the Links section below for a new alternative – grenum.

Now of course it is possible to number the components manually in gschem via the GUI, but if you know a little Perl then its much more fun to hack out a solution! Now fortunately gschem uses text files to store the schematic. The refdes are stored in the schematic file like this:
```--snip-- C 7000 87800 1 0 0 bf-93LC46.sym { T 8200 89100 5 10 1 1 0 0 1 refdes=U19 } --snip--```

I think this means that the component (leading C) that is drawn using the bf-93LC46.sym symbol file has a refdes called U19. The line starting with T tells gschem where to place the refdes text on the sheet. Anyway all I am really interested in is reading and changing the refdes lines.

So the Perl script num_refdes.pl first scans all of the existing schematic files for lines that have the text /refdes=/. The number of the refdes is then extracted, and stored in a hash if it is the largest refdes so far. On the new schematics, I initially enter each refdes as C?, U? etc. The script scans these, and wherever it finds a ? it replaces it with the next refdes in that series.

Improvements and Reflections

There is one obvious improvement. The current script isn’t smart enough to use a refdes that hasn’t been used yet, for example if C1,C2,C4 are currently used it isn’t smart enough to use C3, it will use C5 instead.

This script took me about 70 minutes to write and it worked perfectly the first time. To be honest it didn’t really save me any time (I could have manually entered the refdes in less time) but it was lots more fun to do it this way. And I guess it will be useful the next time I have this problem. Perhaps now that I have posted on it some one else might also find it useful. Come to think of it writing this blog post took longer than writing the script! DOH!

BTW every time I work with Perl (which is maybe once every two months) I spend the first 20 minutes looking up web sites to work out which regexp to use, how to iterate through a hash, etc. If I don’t work on Perl and regexps constantly it all looks like line noise to me after a few weeks:
`if (\$line =~ /refdes=(\w)(\?)/) {`

But hey, its no big deal to refresh my memory every few months.

Solving little problems like this with Perl is fun. You almost hope there isn’t a solution out there already, just so you have an excuse to write one!

I have also written a few Perl scripts for PCB (the PCB design application in the gEDA suite) here.

After I posted the above, Levente Kovacs kindly pointed out his grenum utility which is now included with gEDA! This has been recently added to gEDA and performs the same job as the Perl script above, but can also spot gaps in refdes sequences. As grenum is now part of the official gEDA distribution I would recommend using it rather than num_refdes.pl. I share a trait common among engineers – a strong desire to “do it myself” rather than checking carefully for existing solutions! Thanks Levente!
14 payday 20 loan directory2500 cash quick loansquick 26 loan 37 paydaycredit bad loan 2b personal45 loan payday 31 advance uksites 37 payday loan 53loan cash payday advance 5 5006 loan online kentucky payday 8loan 6 cost 20,8 payday lowlink 60 payday loanmovies hentaimovie lesbiansex moviesmovies adultfree hentai moviesporn moviesmovie clips porn freemovies gay free Mapmp3 achha ji1990 mp3 temptationsnation acumen mp3drame mp3 adamatecktonik mp3 tepr acdgmp3 ackermann sucheachhe mp3 lagtemp3 blog 1990s Mapfree naked moviesdildo movies hugehardcore clips moviemovies dans pornblow free movies job bestwomen movies nakedgirls free teen moviesdouble penetration free movies Mapporn cliphunterporn clipmasterclips free pornporn anal of clipsof girl porn clipsporn clipsporno clips gratuitclipsporn Map

## Using Make to build hardware

I have been working on some telephony hardware using the gEDA open source CAD software. Now designing Printed Circuit Boards (PCBs) is fun, but very GUI-centric. Click-Click-Click with the mouse all day. After a few frustrating experiences (OK I messed up the design a few times) I thought “there must be a better way” and “why can’t I use the command line and Make for this?”

Lucky for me the gEDA software uses text files to store the PCB layout so as an experiment I wrote some Perl scripts to handle some boring repetitive tasks. The bottom line is I am using Make to assemble a Printed Circuit Board design from smaller “modules” of PCB – just like software. I have written about it (with lots of fancy screen shots) here.movie enemafacesit moviessex women movies fatmovie favorite quotesfemale movies strippershitting females moviesfighting movie website the temptationfinal xi fantasy moviemovies free wmv adultfree sex anal movie downloadsringtone sanyo 5400 sprint freesamsung ringtone a680 midiamc 30 barrington moviesnokia easy 3585i text ringtone freeaudiovox ringtone 8910 cricketclowns 1000 ringtone nextelfree a950 ringtone samsung scha660 samsung sprint free ringtone Mapto credit allowing friends use2bcredit accept 2bcardsunion credit affinity federal 07481accreditation masters divinity aabc inmaine federal union credit acadia madawaskaaccreditation ahrpcard processingcom gambling merchant credit accountbureas 3 credit Map

## How to make your Blackfin fly Part 2

For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).

We built on this work, reducing the complexity for the encode operation from about 40 MIPs to 23 MIPs. On a typical 500MHz Blackfin this means you can now run (500/23) = 21 Speex encoders in real time.

The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.

We obtained gains from:

1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.

2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.

3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.

4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.

Why The Blackfin is Different

Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.

The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces systems costs over designs that need a separate DSP and micro controller.

The disadvantages for crusty old DSP programmers like me is that things don’t always run as fast as you would like them to, for example if your precious DSP code doesn’t happen to be in cache when it is called then you get hit with a big performance penalty.

Some examples

To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.

The cycles program shows how to optimise a simple dot product routine, I have previously blogged on this here.

A Simple Library for Profiling

To work out where to optimise Speex I developed a simple library to help profile the code. It works like this. You include the samcycles.h header file and insert macros:
` SAMCYCLES("start");`

``` for(i=0; i<10; i ); ```

` SAMCYCLES("end");`

around the functions you wish to profile. Then, when you run the program it dumps the number of cycles executed between each macro:
```root:/var/tmp> ./test_samcycles start, 0 end, 503 TOTAL, 503```

Which shows that between “start” and “end” 503 cycles were executed. Here is a more complex output from the “dark interior” of the Speex algorithm:
```root:/var/tmp> ./testenc male.wav male.out start nb_encode, 0 move, 1352 autoc, 16149 lpc, 3180 lpc_to_lsp, 21739 whole frame analysis, 17797 ```

Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others. So those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of “inner loop” operations that are hogging
all of the CPU.

Care and Feeding of your Blackfin Cache

One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
``` "P0 = %2;\n\t" "R0 = 0;\n\t" "%0 = CYCLES;\n\t" "LOOP dot%= LC0 = %3;\n\t" "LOOP_BEGIN dot%=;\n\t" "W[P0 ] = R0;\n\t" "LOOP_END dot%=;\n\t"```

This also illustrates why DSPs are so good at number crunching – that inner “W[P0 ] = R0” instruction executes in one cycle, and the hardware loop means 0-cycles for the loop overhead. Try doing that on your Pentium.

However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
```root:/var/tmp> ./writethru Test 1: Write 100 16-bit shorts 686 542 597 542 542 542 542 597 542 542```

The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instruction code and data gets placed into cache.

However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, and it should be executing in a single cycle. My 500 MHz DSP is performing a like a 100 MHz DSP. I think I am going to cry.

The reason is that in “write through” mode every write must flow through the “narrow pipe” that connects to the Blackfins external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed.

This is not good news for a DSP programmer, where you often have lots of vectors that need to get written to memory. Very quickly.

There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).

Another less exciting way is to enable “write back” cache (a kernel configuration option):
```root:/var/tmp> ./writethru Test 1: Write 100 16-bit shorts 119 102 102 102 102 102 102 102 102 102```

Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles as expected. Note that the first run takes a little longer, this is probably because the program code had to be loaded into the instruction cache. In “write back” cache the values get stored in fast cache until the cache-line is flushed to external memory some time later.

On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.

In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…..

Internal Memory

The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.

The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.

In the mean time I have written a kernel mode driver l1 alloc that allows user-mode programs to access internal memory:
` /* try alloc-ing and freeing some memory */`

``` ```

``` pa = (void*)l1_data_A_sram_alloc(0x4); printf("pa = 0xx\n",(int)pa); ret = l1_data_A_sram_free((unsigned int)pa);```

which produces the output:
```root:~> /var/tmp/test_l1_alloc pa = 0xff803e30```

i.e. a chunk of memory with an address in internal memory bank A.

To see the effect of internal versus cache/external memory:
```Test 1: data in external memory ret = 100: run time: 173 103 103 103 103 103 103 103 103 103 Test 2: data in internal memory ret = 100: run time: 103 103 103 103 103 103 103 103 103 103```

After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.

I should mention that to make this driver work I needed to add a an entry to my
uClinux-dist/vendors/AnalogDevices/BF537-STAMP/device_table.txt file:
`/dev/l1alloc c 664 0 0 254 0 0 0 -`

then rebuild Linux as for some reason I couldn’t get mknod to work. Then:
```root:~> ls /dev/l1alloc -l crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc root:~> cp var/tmp/l1_alloc_k.ko . root:~> insmod l1_alloc_k.ko Using l1_alloc_k.ko root:~> /var/tmp/test_l1_alloc```

Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.

Speex Benchmarks

Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:

The terms ext/int memory refers to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20ms frame, averaged over a 6 second sample (male.wav).

(1) Write through cache, ext memory: 564
(2) Write through cache, int memory: 455
(3) Write back cache , ext memory: 465
(4) Write back cache , int memory: 438

So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain.

Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.

To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
`svn co http://svn.xiph.org/trunk/speex`

## GSM Port for the Blackfin

For my uCasterisk project I needed a couple of optimised codecs for the Blackfin. This post discusses the steps taken to port GSM to the Blackfin.

Usage

1/ To make:
` make`

2/ To test:

to your Blackfin hardware and type:
``` root:/var/tmp> ./tgsm male.wav male.out TOTAL, 0 SNR = 10.4591 dB enc 114 dec 39 k cycles/frame root:/var/tmp>```
When it runs it prints out the number of cycles it took to execute each 20ms encode and decode frame.

You can then upload the output file (male.out) to your host and listen to it. On my Linux box I use “play male.sw”, the sw lets “play” recognise it as a 16-bit signed-word file.

Optimisation

I spent a day or so optimising the code, for example:

a) I wrote Blackfin versions of the macros in gsm/inc/private.h

b) Applied the profiling macros SAMCYCLES and worked out which parts of the code needed the most optimisation.

c) I looked at the assembler output of various functions (gcc -S or -save-temps options) and modified the C code for better output, such as using the hardware loop supported by gcc 4.1. A lot of the original GSM code was written for older x86 compilers, and lots of compiler-specific mods were evident. In many cases to speed up code I just went back to vanilla C and the Blackfin compiler did a better job!

e) By inspecting the assembler I found some important routines were making function calls inside their inner loops which is very inefficient. These were modified to remove the function calls.

f) Use some assembler in the tightest, most cycle-hungry loops.

Performance

Using gcc 4.1 and testing on a Blackfin STAMP BF533 board:
```encode: 114,000 cycles/fr: (114,000/0.02s) = 5.7 MIPs decode: 39,000 cycles/fr: (39,000/0.02s) = 1.95 MIPs```

The initial number of cycles per encode was 274,000, decode 82,000.

Further Work

My gut feel is it might be possible to reduce the total (encode plus decode) cycles by perhaps another 30% with further optimisation.

a) The analysis and synthesis filter functions consume about 50,000 cycles per encode/decode cycle, they could be converted to assembler.

b) The RPE algorithm (rpe.c) could be optimised.

c) Blackfin internal memory might speed some operations, such as autocorrelation.

How To Profile

I have written a set of macros (samcycles.h) to sample the Blackfin cycles counter. Here is an example on how to use them:

a) Patch code.c:
` patch -p0 < code_profile.patch`

``` root:/var/tmp> ./tgsm male.wav male.out start Gsm_Coder, 0 Gsm_Preprocess, 5312 Gsm_LPC_Analysis, 11406 Gsm_Short_Term_Analysis_Filter, 23483 Gsm_Long_Term_Predictor, 11525 Gsm_RPE_Encoding, 8308 Gsm_Long_Term_Predictor, 10947 Gsm_RPE_Encoding, 5411 Gsm_Long_Term_Predictor, 10701 Gsm_RPE_Encoding, 5422 Gsm_Long_Term_Predictor, 10696 Gsm_RPE_Encoding, 5409 end Gsm_Coder, 521 TOTAL, 109141 SNR = 10.4591 dB enc 115 dec 39 k cycles/frame root:/var/tmp>```

c) To investigate further, just add more SAMCYCLES() macros. Its a good idea to remove or disable the macros when you are finished, as they use a few thousand cycles:
` patch -R -p0 < code_profile.patch`

Thanks

To Jean-Marc Valin and the Speex project, I used some of their assembler code (see COPYING.xiph for the copyright message related to this code).
loans \$100loan state alaskapercent 125 home equity loanloan home alabama mobilefast 10,000 loanequity accelerated loansfraud advance fee loana loan paper c b Mapporn review 1280×72013 girl porn14 lesbian pornporn year old 14vfere min 15 pornpreviews 15 porn minclips minute porn 1515 minutes porn Mapagreement laptop loanlaserpro origination loan systemva loans about learnlegal templates document loan personalrepayment loan educational for legislationlenders in loan bad wi makingof loans lenght used rvlibrary loan Map

## How to make your Blackfin fly Part 1

The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux and has a great open source community and there are even open (free) hardware designs available.

I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

Boring Mathematical Bit

As a first step I have written some test program called cycles.c that demonstrates how to optimise the Blackfin for DSP operations. A tar-ball including a Makefile is here.

The sample code just finds the dot product of two vectors:
```int dot(short *x, short *y, int len) { int i,dot;```

``` dot = 0; for(i=0; i<len; i ) dot = x[i] * y[i]; ```

``` return dot; }```

It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually thats all a DSP really is, a processor designed to compute dot-products quickly.

The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 MACs.

Enough talk, here is a run of the sample code from my BF537 STAMP:

```root:/var/tmp> ./cycles Theoretical best case is N/2 = 50 cycles Test 1: Vanilla C ret = 100: run time: 3838 3507 3373 3408 3373 3373 3373 3373 3373 3373 Test 2: data in external memory, outboard cycles function ret = 100: run time: 442 240 239 218 218 218 218 218 218 218 Test 3: data in external memory, inboard cycles ret = 100: run time: 242 103 103 103 103 103 103 103 103 103 Test 4: data in internal memory, inboard cycles ret = 100: run time: 214 53 53 53 53 53 53 53 53 53```

A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.

Each test runs 10 times, in several of the tests you can see the number of cycles reducing as the instruction and data cache gets loaded over successive runs.

The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:
```int cycles() { int ret;```

``` __asm__ __volatile__ ( "%0 = CYCLES;\n\t" : "=&d" (ret) : : "R1" ); ```

``` return ret; }```

Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles, Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.

For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 shows how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM, after a run they get to L1 cache, but they are still in the same bank of physical memory, hence a 100% speed penalty.

Allocating Internal Memory

At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:
```/* I know, I know - this is very naughty :-) */ short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */ short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */```

I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?

BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.

Summary

Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 80 (3838/50 between test1 and test4). However with a little optimisation, and some hand coded assembler, it is possible to get full performance from the chip.

I know coding hand-optimising assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marcs samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.

Acknowledgements

Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex, all that optimised fixed point DSP code makes my head spin!movies sex adultfucking movie black clipsmovie free boobs bouncingmovie euro triplinks adult free moviefree movie fuck sampleserotic free japanese moviemovies porn homemademasterbation moviesfree movies porn

## Measuring Stack Usage in Multi-threaded uClinux Apps

In regular Linux the MMU allows the stack to grow dynamically, the MMU just allocates more physical pages. However in uClinux, the correct amount of stack for each thread must be allocated before the thread is created.

Too little stack and your program will corrupt the system in nasty, unpredictable ways. Thread stack gets malloced from the system heap, so an overflow means writes to an arbitrary address just outside the block of memory allocated to the thread. This memory could possibly be in use by other parts of the system, perhaps for a different application. These sorts of bugs can be very difficult to track down.

If you allocate too much stack, then you are wasting memory, a valuable resource on embedded systems. For example I discovered I was allocating far too much stack and wasting Mbytes of memory, especially when multiple threads were running.

The standard approach is to try random values of stack until you find one that works. However I thought it might be a better idea to actually measure the amount of stack used by each thread. Then I could tweak the stack allocation to optimise memory usage and even check for stack overflows at run time.

I have written a small library (called threadstack):
```unsigned int threadstack_free(pthread_t *thread); unsigned int threadstack_used(pthread_t *thread);```

Here is a sample run on my Blackfin BF537 STAMP, when a 100k stack was allocated to a thread:
```root:/var/tmp> ./test_threadstack stack used: 1300 stack free: 100620 root:/var/tmp>```

How it Works

The functions work by examining memory allocated to the stack. The theory is that if the memory is non-zero, then it must have been used by the thread at some time (the entire block of memory used for the stack is initially set to 0 before the thread starts). So the routines search the stack memory for the first non-zero value, and that is declared the “high water mark” – the point where the stack reached it’s maximum.
```0xff <- stack top 0xff 0xfe <- high water mark 0x00 0x00 0x00 <- stack bottom```

The high water mark will change over time, so after your thread has been running for a while is the best time to measure stack usage.

One weakness with this approach is that if stack allocation is way too low your program may bomb before these routines get a chance to run. However in that case you will at least know there is a problem, and can increase stack to some high number (e.g. Mbytes) to get the program running, before using these functions to determine actual stack requirements.

Usage

In my uClinux Asterisk port I have added code to check for stack overflow just before a thread ends:
```pthread_t thread = pthread_self(); assert(threadstack_free(&thread) > 10*1024);```

This code checks that while the thread was running, the minimum free stack was 10k. The assert will kill the program with an error message and tell me straight away I need more stack. Much nicer than getting an obscure bug in the system due to a stack overflow on a thread. Now the program finds stack overflow bugs for me!

This example above runs from within the actual thread itself, hence the call to pthread_self() to discover the threads handle. You can also call the functions from another thread (e.g. the main thread), for example to periodically meter stack usage.

## Porting multi-threaded apps to uClinux

I have recently been working on improving the stability of uCasterisk, a port of Asterisk to uClinux. This required some research into memory management for multi-threaded apps on uClinux. I didn’t find any one resource that had everything I needed to know so I thought I would collate some of the information I found here as a resource for others. Thanks to all those (especially on the Blackfin forums) who helped answer my questions.

I am using the Blackfin flavour of uClinux and the uCasterisk application as an example, but this information should apply equally to other uClinux systems/applications.

MMU versus no-MMU

Asterisk is a pretty big application for uClinux, the executable is about 2.5M and when running several calls can consume 32M of system memory. The big difference between uCasterisk and other Asterisk implementations is the lack of MMU. A MMU is handy when working with large, multi-threaded apps. For example when a thread is kicked off you can allocate a virtual stack of say 2M, but physical memory will only be allocated as and when it is actually required (say due to a write to a previously unused part of the stack). If your thread never uses all of the stack, then the physical memory is available for other users.

On a MMU-less system you need to work out the maximum stack your thread may need, and allocate that. If you get it wrong, your application (and possibly the whole system) will bomb. This generally means you are wasting memory compared to the MMU case, as you always need to allocate the worst case amount of memory required.

One possible advantage of MMU-less systems is no nasty surprises – any memory allocated really does exist, and no over-commitment is possible. On a MMU-based system physical memory isn’t actually allocated until you write to it, and it may be paged to disk just when you need it (although I understand there are options to control this behaviour).

When you start an app, you get allocated a stack for the application. This is actually a stack for the main thread of the application. When you start a new thread (say with pthread_create()) the thread gets allocated a new stack from the system heap. The two stacks are completely unrelated. The size of each stack is independent, you control the size in different ways (see below).

Tips for Porting to uClinux

Don’t enable stack checking. This feature is very useful for single-threaded apps; it causes the operating system to kill the app when it uses all of it’s stack space. Very useful, as it tells you straight away to increase the stack size. Unfortunately at present this feature hasn’t been extended to multi-thread applications; using it with multi-threaded apps (at least on the Blackfin) causes problems as pointed out in the 2005R4 RC2 release notes and discussed here.

You control the application (main thread) stack with the -s option, on my Blackfin system the command line is:

```bfin-uclinux-gcc -Wl,-elf2flt='-s 1000000' \ -o thread thread.c -pthread```

In this example the stack is set to 1000000 bytes.

You control the size of the stack for each thread you create using pthread_attr_setstacksize(), for example (from the Asterisk utils.c file):

``` pthread_attr_init(&attr); pthread_attr_setstacksize(&attr, 0x1000000); pthread_create(&thread, &attr, thread_func, NULL); ```

Monitoring Memory Usage

cat /proc/meminfo can be very useful, here is the output from my Blackfin STAMP BF533 board, taken while uCasterisk was running with several SIP calls in progress:

```root:/var/log/asterisk> cat /proc/meminfo MemTotal: 59784 kB MemFree: 11084 kB Buffers: 100 kB Cached: 4172 kB SwapCached: 0 kB Active: 3828 kB Inactive: 444 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 59784 kB LowFree: 11084 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 4 kB Writeback: 0 kB Mapped: 0 kB Slab: 43744 kB CommitLimit: 29892 kB Committed_AS: 0 kB PageTables: 0 kB VmallocTotal: 0 kB VmallocUsed: 0 kB VmallocChunk: 0 kB```

The most important fields are MemFree (total system memory free) and Slab (system wide heap in use).

In earlier versions of Linux the CommitLimit field indicated the maximum Slab was allowed to reach before processes were killed (with Out-Of-Memory errors). However on my distro I discovered by experiment that you can actually increase the Slab well beyond this limit, as indicated above. Looking at the kernel source file uClinux-dist/linux-2.6.x/mm/nommu.c, __vm_enough_memory() function it appears that the memory allocator uses the OVERCOMMIT_GUESS method, which ignores the CommitLimit and allows up to 97% of memory to be allocated.

It is interesting to observe MemFree as you perform different operations. For example on uCasterisk when a new SIP call starts, a thread is created, which requires stack and heap space. I also noticed MemFree decreasing when I copied files on a ram file system – this caught me for a while as uCasterisk was chewing through available system memory writing Call Data Records to the ram disk and eventually causing Out of Memory errors.

ps an top are also useful, as they indicate the amount of memory allocated to the system/application.

CommitLimit and OOM Killer
Why malloc is different under uClinux
Application Debugging on the Blackfin
Intro to Linux Apps on the Blackfin (skip to bottom of page)

Summary

## The YouBox – Hardware for YouOS

I have been following an idea I originally discovered in a Paul Graham essay about the advantages of placing applications on the web server rather than the desktop PC. Hot mail and Gmail are good examples. The application and the data for that application (your email) are stored on a server.

YouOS is an interesting development along this path – it is an entire operating system that runs on a server, complete with an IDE for application development and some really powerful collaboration models. One very powerful feature is the ability to move from one computer to another, fire up YouOS on the browser, and there are all your applications and data – just as you left them.

Looking into the future a little there will come a day where ALL your applications, and ALL your data can be stored on a server.

This transition seems to be already happening in some parts of the world. This point from the Ajax web site really interested me:

• Web as the only Platform Thanks to the widespread adoption of public internet access, the so-called technology gap between countries and between socioeconomic groups is closing. Many people don’t actually own a PC, but do have regular access to the web at internet cafes or schools or friends’ homes. For this diverse category of user, there’s no point installing applications and keeping their data locally. The web is their only platform.

So we have a large number of people who use the web as a platform for economic reasons. YouOS will increase the power of the web platform. What would also help is lowering the cost of web access.

With YouOS the only application you need to run on your PC is a browser. Which suggests to me that you don’t need a PC anymore, just some bare-bones hardware with internet connectivity capable of running a browser.

One thought I have had is adding a monochrome LCD display and keyboard to an embedded linux platform (like the hardware in a WRT54G router). These little routers retail for \$60, so must cost about \$20 to build. Add a keyboard and LCD and you still have a device that still costs less than \$50 to build. Then you would have a small, Wifi connected computer with plenty of CPU power/memory to run a browser, basic command line tools etc.

Combined with ubiquitous connectivity we have a YouOS-Access-Device (YAD? or maybe “the YouBox”) and can replace the desktop in many applications. In a laptop form factor the YouBox could be really light and thin and almost disposable. It would be very portable and lightweight and would use much less power than a regular laptop. It’s more like a larger version of a Palm. For a few extra dollars you could add sound-blaster type audio and the device is also a telephone.

I know there is a sub-\$100 laptop project out there, the YouBox is another approach that uses the web as a platfrom paradigm to optimise the hardware. One advantage of the YouBox is that it can be put together in small quantities using off the shelf components.

The YouBox concept still has many questions – for example the need for an internet backbone to connect the YouBox to a server and also the YouOS servers. However these may be a little easier to solve, for example if one backbone/server can handle X YouBox clients, you amortise the backbone/server cost by X.