A Bit Exact DSP Fairy Tale

Once upon a time in a far-off land there was an underpaid pale skinny guy working at an impoverished start-up who wanted to build products with fixed point DSP chips in them. The problem was he had just about killed himself doing the same thing many times in the past.

In particular, he seemed to spend a zillion hours debugging intricate fixed point DSP assembler to get products working. All while some project manager who hadn’t written a line of code for 10 years was hovering behind him asking questions like “is it ready” and mumbling something about milestone dates that were ridiculous when proposed, and soon became hilarious after the project started.

So, our hero wondered, “is there a better way?”

How to Build a DSP Based Product

Many DSP products are developed like this:

The academic “brains trust” guys work out a clever new algorithm. Now these guys are way-smart, but tend to express themselves in algebra with lots of Greek letters. They usually have PhDs, and cost a lot, and there aren’t many of them. They might test their brain-child in a simulation environment like Matlab:

x = [1.1 2.0 3.0];
y = [2.5 3.0 4.0];
dot = x * y';

Now this lets them test the algorithm quickly (dot = 20.75), but you can’t run code like this on a real world product because (a) it runs 10 million times slower than real time and (b) Matlab licences cost $5,000 each and only run on $3,000 PCs. Which is a shame when you are building a $99 product.

So some poor engineer then has to make the DSP algorithm work in real time (i.e. fast) on the “target” hardware, a $10 DSP chip. Actually, these guys generally like the detail work this involves. The first step might be to convert the algorithm to floating point C:
float x[] = {1.1, 2.0, 3.0};
float y[] = {2.5, 3.0, 4.0};

float dot(float x[], float y[], int n) {
    float d = 0.0;
    int   i;

    for(i=0; i<n; i++)
        d += x[i]*y[i];

    return d;
}

But the guys & gals in Marketing (they are a breed closely related to Managers, but tend to be more attractive and use airports more) have decided that this must be a low cost product, so a $10 fixed point DSP is chosen.

Now a fixed point DSP is a chip that can perform multiplies and adds quickly, as long as you only use integers. They don’t do floats. So the engineer converts the algorithm to fixed point:
short x[] = {11, 20, 30}; /* x scaled by 10 */
short y[] = {25, 30, 40}; /* y scaled by 10 */

int dot(short x[], short y[], int n) {
    int d = 0;
    int i;

    for(i=0; i<n; i++)
        d += x[i]*y[i];

    return d; /* returns dot*100 */
}

Now look what has happened to the x[] and y[] arrays. We have multiplied each number by 10 so we can use an integer to hold the value. If we didn’t scale each number then we would lose information (for example converting the 1.1 to 1) which on certain algorithms would cause problems.

However, because each input value is multiplied by 10, the result is multiplied by 100 ( 10x * 10y = 100xy ). So we need to take this output scaling into account in the other stages of the algorithm. One more piece of information we need to carry around with us as we code. One more source of bugs.
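
To make the bookkeeping concrete, here is a small test harness (just a sketch) that calls the fixed point dot() above and divides the ×100 scale factor back out at the end:

#include <stdio.h>

int dot(short x[], short y[], int n);   /* fixed point version above */

int main(void) {
    short x[] = {11, 20, 30};   /* 1.1, 2.0, 3.0 scaled by 10 */
    short y[] = {25, 30, 40};   /* 2.5, 3.0, 4.0 scaled by 10 */
    int d100 = dot(x, y, 3);    /* 2075, i.e. 20.75 * 100     */

    /* every later stage must know about the x100 scaling; here we
       simply divide it back out to print the "real" answer */
    printf("dot*100 = %d, dot = %d.%02d\n", d100, d100/100, d100%100);
    return 0;
}

The output is 20.75, matching the Matlab result, but only because we remembered (and wrote down in a comment) that the function returns dot*100.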

This is a very simple scaling example. In practice it gets much scarier and means you need to hold a lot more information in your head when you program in fixed point (see the complex example below).

There is also the potential for overflow problems. As the output is scaled up by 100, it is easy to overflow the 32-bit range of the integer output, for example if we used x[] and y[] arrays with larger numbers.
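
As a rough illustration of one common defence (many fixed point DSPs also provide saturating arithmetic in hardware), the accumulation can be done in 64 bits and the final result clipped rather than allowed to wrap around:

#include <stdint.h>

/* sketch only: accumulate in 64 bits, then saturate to the 32 bit range */
int32_t dot_sat(const int16_t x[], const int16_t y[], int n) {
    int64_t acc = 0;
    int i;

    for(i=0; i<n; i++)
        acc += (int32_t)x[i] * y[i];   /* each 16x16 product fits in 32 bits */

    if (acc > INT32_MAX) acc = INT32_MAX;   /* clip instead of wrapping */
    if (acc < INT32_MIN) acc = INT32_MIN;

    return (int32_t)acc;
}

Clipping still loses information, but a clipped result is usually far less damaging to a DSP algorithm than a wrapped one that flips sign.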

I Feel the Need for Speed

The product requires that this routine must run as fast as possible on the DSP chips. So we need to code an optimised version of this routine in assembler:
int dot(short *x, short *y, int len)
{
    int d;

    __asm__
    (
        "I0 = %1;\n\t"
        "I1 = %2;\n\t"
        "A1 = A0 = 0;\n\t"
        "R0 = [I0++] || R1 = [I1++];\n\t"
        "LOOP dot%= LC0 = %3 >> 1;\n\t"
        "LOOP_BEGIN dot%=;\n\t"
        "A1 += R0.H * R1.H, A0 += R0.L * R1.L "
        "|| R0 = [I0++] || R1 = [I1++];\n\t"
        "LOOP_END dot%=;\n\t"
        "R0 = (A0 += A1);\n\t"
        "%0 = R0 >> 1;\n\t"
        : "=&d" (d)
        : "a" (x), "a" (y), "a" (len)
        : "I0", "I1", "A1", "A0", "R0", "R1"
    );

    return d;
}

This is assembler for the Blackfin DSP; the actual code will of course vary for different DSP chips. The important bit is this:
A1 += R0.H * R1.H, A0 += R0.L * R1.L
    || R0 = [I0++]
    || R1 = [I1++]

In a single clock cycle, our intrepid DSP performs two parallel 16-bit multiplies, two accumulator adds, and fetches two 32-bit words from memory, post-incrementing the pointers after each fetch. It does all this 500 million times a second, and sometimes it even gives you the right answer.

But not often. First you need to spend a lot of time getting it all to work right. The assembler is much harder to grok than the previous versions of the algorithm, and you use up a lot of brain power on a myriad of fine details. Get any one of the details wrong and the product is a dud.

The target development environment is also tough. In many cases the feedback you use to debug might be an analogue signal flowing into or out of the product (that is processed internally by the DSP). Or a blinking LED. Sometimes you don’t even know when it’s really working; bugs can simply slip through. You can’t always stop the code and single-step using a debugger, as parts of the algorithm depend on real-time activity.

It’s a development/testing/maintenance nightmare.

Bit Exact Assembler

There is a way to ease the pain a little. Let’s say there is a bug in the assembler version. The problem is that to debug the algorithm you may need to understand both the assembler and the DSP algorithm. I mean, to debug something you need to know how it works, right?

This is OK for a simple algorithm, but what if it is even moderately complex? Then you are dealing with (i) a very complex algorithm and (ii) very complex assembler all at the same time. This is usually too much for mere mortals (and even minor deities). The result is long, painful development, failed projects, late nights, angry spouses, and lots of pizza (well, it’s not all bad I guess).

The trick is to divide and conquer. Make sure we are only hitting one tough problem at any given time.

One approach I have found very useful is bit exact porting to assembler. There are two important steps:

1. At the fixed point C stage, you test very, very carefully. Run batteries of tests and simulations. These can be performed on a regular PC in non real time, using powerful debuggers. The idea is to verify that (i) the algorithm is OK and (ii) the fixed point port works OK. In particular (ii) is very tough, so it’s nice to handle this in a relatively benign environment like a PC or workstation.

2. Port each function to real time assembler one by one. Test each function against the fixed point C reference. Make sure the functions give IDENTICAL output – right down to the last bit. This takes a lot of discipline – near enough is NOT good enough.
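
A unit test for step 2 can be brutally simple: feed the C reference and the assembler version the same (ideally randomised) input and fail on the first difference. A minimal sketch, assuming the hand coded routine has been wrapped as a hypothetical dot_asm() with the same prototype as the C reference dot():

#include <stdio.h>
#include <stdlib.h>

#define N      100      /* vector length           */
#define FRAMES 10000    /* number of random frames */

int dot(short x[], short y[], int n);      /* fixed point C reference                  */
int dot_asm(short x[], short y[], int n);  /* assembler version (hypothetical wrapper) */

int main(void) {
    short x[N], y[N];
    int   i, f;

    for(f=0; f<FRAMES; f++) {
        for(i=0; i<N; i++) {
            x[i] = (short)((rand() & 0x7fff) - 16384);   /* -16384..16383 */
            y[i] = (short)((rand() & 0x7fff) - 16384);
        }

        /* bit exact means exactly equal, with no "close enough" tolerance */
        if (dot(x, y, N) != dot_asm(x, y, N)) {
            printf("FAIL at frame %d\n", f);
            return 1;
        }
    }

    printf("PASS: %d frames bit exact\n", FRAMES);
    return 0;
}

In practice the assembler side runs on the target or a cycle accurate simulator, with the same test vectors shipped across or regenerated from the same random seed.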

The benefits of bit exact coding are:

1. When coding assembler, you don’t have to worry about or even understand the algorithm. You just have to make sure that the C version matches the assembler version. When they do, you know that the overall system will perform as expected. This lightens the load on your brain, leading to fewer bugs.

2. As you integrate modules of code together, problems are easy to spot: just look for where the output doesn’t match the fixed point C simulation. This is much simpler than performing batteries of tests in a real time environment.

3. A less senior engineer can be used to perform the assembly coding, as they don’t need algorithm knowledge. This makes it easier to find people for the project, lowering costs. Adding people to the project is less of an issue, as they can get up to speed and be productive quickly.

4. Any bugs are exposed early (while unit testing), rather than late (after integration). You get an honest measure of progress, rather than nasty surprises late in the project.

Where it fails:

1. It takes a lot of discipline to perform step (1) CAREFULLY. There is a lot of pressure from managers (and engineers) to jump onto the sexy real time/hardware integration stage. If you are part way through (1), and there is a milestone due, it is very tempting to “declare victory” and move on to the next stage, WAY before you are ready. The results are predictable…

2. It takes discipline to make the assembler bit exact. I mean, look, it’s nearly the same, just a few bits are different? The problem is these little errors may lead to much bigger flaws in performance later on. Try changing a single bit in your operating system binary image and see how well it runs. In some (most) cases bit-exactness matters!

3. It is honest. It gives you a black and white go/no-go indication at various stages of the project. For example, the fixed point C stage is not complete until it has shown perfect performance against the spec. This reduces the “room to move” managers have: you can’t declare a phase “complete” just to suit a milestone date. It really has to work. This sort of honesty doesn’t sit well with some managers; for example, they may have to admit problems to a customer early on in a project.

Case Study – Non Bit Exact

Now here is what typically happens. Our hero is under pressure from his manager to progress, so he short circuits the fixed point C simulation stage and codes straight into assembler. The C simulation and bit exact assembler have results that are “close” but not really the same. No effort is made to synchronise the results, there isn’t enough time!

Time is also short so he can’t “waste time” on unit testing, there is a milestone with a big fat cash payment attached.

The assembly code is loaded into flash on some brand-new, untested hardware, power is applied, and some basic tests performed.

Damn. It doesn’t work.

So our hero and his team start trying to debug the real time assembler code. However debugging in real time is very time consuming. The best brains in the company spend months tracking down bugs that end up being simple one line fixes. Bugs constantly interact with each other, leading the team off in false directions.

He doesn’t go back to the C simulation as it now gives different results to the assembler and “we don’t have enough time” to make them match.

As the project falls behind, there is no one who can be added, as only people who understand both the algorithms and the assembler can be used. The team is debugging leading edge algorithms in assembler.

A year passes. Senior management is fuming. They keep getting told “real soon now”. However no one really knows how long it will be, because no one really knows just how many bugs there are. Frustration and stress are very high all around.

Eventually the company runs out of funds and the project is cancelled. There are mass departures of some of the best and most capable people from the company. Our hero leaves to join another start-up, where he repeats the same process.

Case Study – Bit Exact

Our hero and some members of the brains trust spend many months carefully porting the algorithm to fixed point and testing the entire product in a robust simulation. This apparent lack of progress is tough for the managers to handle, but they understand the benefits and bravely support the team.

Lots of complex bugs are thrown up, and the team works through them one by one. They don’t take too long to fix because these tough problems are at least occurring in a PC environment, where it is easy to track them down.

In parallel, the fixed point C modules are passed to a team to perform the assembler coding. This team is made up of many junior engineers, who are instructed to make sure each module is bit exact with the reference C simulation. A few of them don’t like this, but the more senior members keep them on track. The junior members do a good job, as all they need to focus on is the assembler, not the algorithm details.

A few bugs slip through unit testing to integration. A senior member who is a veteran of previous “death march” DSP projects is assigned to fix one of them. She discovers that the problem was due to an exceptional case not covered in the unit test, and sure enough the assembler module wasn’t bit exact for that case. She had previously been a critic of the extra effort required for bit exactness, but the speed with which this bug was found and fixed impresses her.

Early in the project some unforeseen problems are discovered during the work on the fixed point C simulation. Some parts are running slower than expected which means more assembler may be required in the final product. This throws up an early indication that the project is at risk of falling behind.

This critical management information is flagged early, while there is still time to do something about it. It is decided to add some extra engineers to the assembler side. As soon as they learn the DSP instruction set they quickly come up to speed. They don’t need to know or understand the algorithm details.

The schedule still slips by a few weeks, but all the stakeholders understand that this is a better outcome than delivering a lemon on time. No further slips occur, and the project manager earns a great deal of trust and respect by delivering “bad news early” rather than trying to hit arbitrary milestone dates and risking the entire project.

When the real-time DSP software is integrated with the hardware there are still a few bugs to find. However by this stage the real time software is at a very high quality level so the bug count is low. Integration actually proceeds on schedule, amazing everybody, especially veterans of past projects.

The product is a success, the start-up has a great IPO, and the pale skinny guy retires at 28 years old.

Complex Example

Here is a rather more complex example of fixed point C code from the Speex project:
static inline spx_word32_t cheb_poly_eva(
    spx_word16_t *coef,
    spx_word16_t  x,
    int           m,
    char         *stack
)
{
    int i;
    spx_word16_t b0, b1;
    spx_word32_t sum;

    /* Prevents overflows */
    if (x>16383)
        x = 16383;
    if (x<-16383)
        x = -16383;

    /* Initialise values */
    b1=16384;
    b0=x;

    /* Evaluate Chebyshev series formulation
       using an iterative approach */
    sum = ADD32(EXTEND32(coef[m]),
                EXTEND32(MULT16_16_P14(coef[m-1],x)));
    for(i=2;i<=m;i++)
    {
        spx_word16_t tmp=b0;
        b0 = SUB16(MULT16_16_Q13(x,b0), b1);
        b1 = tmp;
        sum = ADD32(sum,
                    EXTEND32(MULT16_16_P14(coef[m-i],b0)));
    }

    return sum;
}

In English (sort of), this function works out the value of a polynomial at various points around a circle. Don’t ask me why. DSP engineers just like doing things like that. For more information/punishment see the file lsp.c from the Speex project.

Macros like ADD32() simulate the instruction set of the DSP chip in the simulation code, which makes porting this function to actual DSP assembler a simpler step. However, they also make the code much harder to read.
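
To give a feel for what is behind them, here is roughly how a few of these macros can be written in portable C. This is a simplified illustration (types and rounding details trimmed); the real definitions live in the Speex fixed point headers and have DSP specific variants as well:

typedef short spx_word16_t;   /* 16 bit fixed point word (simplified) */
typedef int   spx_word32_t;   /* 32 bit fixed point word (simplified) */

/* widen a 16 bit value to 32 bits */
#define EXTEND32(x)        ((spx_word32_t)(x))

/* 32 bit add */
#define ADD32(a,b)         ((spx_word32_t)(a) + (spx_word32_t)(b))

/* 16 bit subtract */
#define SUB16(a,b)         ((spx_word16_t)((a) - (b)))

/* 16x16 -> 32 bit multiply */
#define MULT16_16(a,b)     ((spx_word32_t)(spx_word16_t)(a) * \
                            (spx_word32_t)(spx_word16_t)(b))

/* 16x16 multiply, result shifted back down by 13 bits (Q13) */
#define MULT16_16_Q13(a,b) ((spx_word16_t)(MULT16_16((a),(b)) >> 13))

/* 16x16 multiply, result rounded then shifted back down by 14 bits */
#define MULT16_16_P14(a,b) ((spx_word16_t)((MULT16_16((a),(b)) + (1<<13)) >> 14))

On the target each macro can be swapped for a single instruction or compiler intrinsic, while the plain C versions keep the PC simulation well defined, so the assembler can be checked against it bit for bit.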
