How to make your Blackfin fly Part 1

The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux and has a great open source community and there are even open (free) hardware designs available.

I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

Boring Mathematical Bit

As a first step I have written some test program called cycles.c that demonstrates how to optimise the Blackfin for DSP operations. A tar-ball including a Makefile is here.

The sample code just finds the dot product of two vectors:
int dot(short *x, short *y, int len)
{
int i,dot;

dot = 0;
for(i=0; i<len; i )
dot = x[i] * y[i];

return dot;
}

It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually thats all a DSP really is, a processor designed to compute dot-products quickly.

The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 MACs.

Down to Business

Enough talk, here is a run of the sample code from my BF537 STAMP:

root:/var/tmp> ./cycles
Theoretical best case is N/2 = 50 cycles
Test 1: Vanilla C
ret = 100: run time:
3838 3507 3373 3408 3373 3373 3373 3373 3373 3373
Test 2: data in external memory, outboard cycles function
ret = 100: run time:
442 240 239 218 218 218 218 218 218 218
Test 3: data in external memory, inboard cycles
ret = 100: run time:
242 103 103 103 103 103 103 103 103 103
Test 4: data in internal memory, inboard cycles
ret = 100: run time:
214 53 53 53 53 53 53 53 53 53

A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.

Each test runs 10 times, in several of the tests you can see the number of cycles reducing as the instruction and data cache gets loaded over successive runs.

The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:
int cycles() {
int ret;

__asm__ __volatile__
(
"%0 = CYCLES;\n\t"
: "=&d" (ret)
:
: "R1"
);

return ret;
}

Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles, Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.

For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 shows how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM, after a run they get to L1 cache, but they are still in the same bank of physical memory, hence a 100% speed penalty.

Allocating Internal Memory

At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:
/* I know, I know - this is very naughty :-) */
short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */
short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */

I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?

BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.

Summary

Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 80 (3838/50 between test1 and test4). However with a little optimisation, and some hand coded assembler, it is possible to get full performance from the chip.

I know coding hand-optimising assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marcs samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.

Acknowledgements

Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex, all that optimised fixed point DSP code makes my head spin!movies sex adultfucking movie black clipsmovie free boobs bouncingmovie euro triplinks adult free moviefree movie fuck sampleserotic free japanese moviemovies porn homemademasterbation moviesfree movies porn

Measuring Stack Usage in Multi-threaded uClinux Apps

In regular Linux the MMU allows the stack to grow dynamically, the MMU just allocates more physical pages. However in uClinux, the correct amount of stack for each thread must be allocated before the thread is created.

Too little stack and your program will corrupt the system in nasty, unpredictable ways. Thread stack gets malloced from the system heap, so an overflow means writes to an arbitrary address just outside the block of memory allocated to the thread. This memory could possibly be in use by other parts of the system, perhaps for a different application. These sorts of bugs can be very difficult to track down.

If you allocate too much stack, then you are wasting memory, a valuable resource on embedded systems. For example I discovered I was allocating far too much stack and wasting Mbytes of memory, especially when multiple threads were running.

The standard approach is to try random values of stack until you find one that works. However I thought it might be a better idea to actually measure the amount of stack used by each thread. Then I could tweak the stack allocation to optimise memory usage and even check for stack overflows at run time.

Threadstack Library

I have written a small library (called threadstack):
unsigned int threadstack_free(pthread_t *thread);
unsigned int threadstack_used(pthread_t *thread);

Here is a sample run on my Blackfin BF537 STAMP, when a 100k stack was allocated to a thread:
root:/var/tmp> ./test_threadstack
stack used: 1300
stack free: 100620
root:/var/tmp>

How it Works

The functions work by examining memory allocated to the stack. The theory is that if the memory is non-zero, then it must have been used by the thread at some time (the entire block of memory used for the stack is initially set to 0 before the thread starts). So the routines search the stack memory for the first non-zero value, and that is declared the “high water mark” – the point where the stack reached it’s maximum.
0xff <- stack top
0xff
0xfe <- high water mark
0x00
0x00
0x00 <- stack bottom

The high water mark will change over time, so after your thread has been running for a while is the best time to measure stack usage.

One weakness with this approach is that if stack allocation is way too low your program may bomb before these routines get a chance to run. However in that case you will at least know there is a problem, and can increase stack to some high number (e.g. Mbytes) to get the program running, before using these functions to determine actual stack requirements.

Usage

In my uClinux Asterisk port I have added code to check for stack overflow just before a thread ends:
pthread_t thread = pthread_self();
assert(threadstack_free(&thread) > 10*1024);

This code checks that while the thread was running, the minimum free stack was 10k. The assert will kill the program with an error message and tell me straight away I need more stack. Much nicer than getting an obscure bug in the system due to a stack overflow on a thread. Now the program finds stack overflow bugs for me!

This example above runs from within the actual thread itself, hence the call to pthread_self() to discover the threads handle. You can also call the functions from another thread (e.g. the main thread), for example to periodically meter stack usage.

Links

More information on multi-threaded applications for uClinux
movies hardcoremovies ebony girls buttclips sybian moviemovies homemade sexmovie stars nudesamples movie hot xxxmovies free handjobmovies anal free Mapstar porn blonde galleryporn gallereis thumnail blondeporn videos blondepornstar blonde pussyporno teen blondepornstar blonde videoporn blondesblondes pornstars Mapporn cliphunterclipmaster pornfree clips pornclips anal porn ofof clips porn girlclips porngratuit porno clipsclipsporn Mapringtones boltblueboombastic ringtone freebreakfast club ringtonesringtone brewerspride brown ringtonescoupon ringtone cingluarcingular phones ringtones recordsurvive circa ringtones Map

Porting multi-threaded apps to uClinux

I have recently been working on improving the stability of uCasterisk, a port of Asterisk to uClinux. This required some research into memory management for multi-threaded apps on uClinux. I didn’t find any one resource that had everything I needed to know so I thought I would collate some of the information I found here as a resource for others. Thanks to all those (especially on the Blackfin forums) who helped answer my questions.

I am using the Blackfin flavour of uClinux and the uCasterisk application as an example, but this information should apply equally to other uClinux systems/applications.

MMU versus no-MMU

Asterisk is a pretty big application for uClinux, the executable is about 2.5M and when running several calls can consume 32M of system memory. The big difference between uCasterisk and other Asterisk implementations is the lack of MMU. A MMU is handy when working with large, multi-threaded apps. For example when a thread is kicked off you can allocate a virtual stack of say 2M, but physical memory will only be allocated as and when it is actually required (say due to a write to a previously unused part of the stack). If your thread never uses all of the stack, then the physical memory is available for other users.

On a MMU-less system you need to work out the maximum stack your thread may need, and allocate that. If you get it wrong, your application (and possibly the whole system) will bomb. This generally means you are wasting memory compared to the MMU case, as you always need to allocate the worst case amount of memory required.

One possible advantage of MMU-less systems is no nasty surprises – any memory allocated really does exist, and no over-commitment is possible. On a MMU-based system physical memory isn’t actually allocated until you write to it, and it may be paged to disk just when you need it (although I understand there are options to control this behaviour).

Stacks for Threads

When you start an app, you get allocated a stack for the application. This is actually a stack for the main thread of the application. When you start a new thread (say with pthread_create()) the thread gets allocated a new stack from the system heap. The two stacks are completely unrelated. The size of each stack is independent, you control the size in different ways (see below).

Tips for Porting to uClinux

Don’t enable stack checking. This feature is very useful for single-threaded apps; it causes the operating system to kill the app when it uses all of it’s stack space. Very useful, as it tells you straight away to increase the stack size. Unfortunately at present this feature hasn’t been extended to multi-thread applications; using it with multi-threaded apps (at least on the Blackfin) causes problems as pointed out in the 2005R4 RC2 release notes and discussed here.

You control the application (main thread) stack with the -s option, on my Blackfin system the command line is:

bfin-uclinux-gcc -Wl,-elf2flt='-s 1000000' \
-o thread thread.c -pthread

In this example the stack is set to 1000000 bytes.

You control the size of the stack for each thread you create using pthread_attr_setstacksize(), for example (from the Asterisk utils.c file):

pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 0x1000000);
pthread_create(&thread, &attr, thread_func, NULL);

Monitoring Memory Usage

cat /proc/meminfo can be very useful, here is the output from my Blackfin STAMP BF533 board, taken while uCasterisk was running with several SIP calls in progress:

root:/var/log/asterisk> cat /proc/meminfo
MemTotal: 59784 kB
MemFree: 11084 kB
Buffers: 100 kB
Cached: 4172 kB
SwapCached: 0 kB
Active: 3828 kB
Inactive: 444 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 59784 kB
LowFree: 11084 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 4 kB
Writeback: 0 kB
Mapped: 0 kB
Slab: 43744 kB
CommitLimit: 29892 kB
Committed_AS: 0 kB
PageTables: 0 kB
VmallocTotal: 0 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB

The most important fields are MemFree (total system memory free) and Slab (system wide heap in use).

In earlier versions of Linux the CommitLimit field indicated the maximum Slab was allowed to reach before processes were killed (with Out-Of-Memory errors). However on my distro I discovered by experiment that you can actually increase the Slab well beyond this limit, as indicated above. Looking at the kernel source file uClinux-dist/linux-2.6.x/mm/nommu.c, __vm_enough_memory() function it appears that the memory allocator uses the OVERCOMMIT_GUESS method, which ignores the CommitLimit and allows up to 97% of memory to be allocated.

It is interesting to observe MemFree as you perform different operations. For example on uCasterisk when a new SIP call starts, a thread is created, which requires stack and heap space. I also noticed MemFree decreasing when I copied files on a ram file system – this caught me for a while as uCasterisk was chewing through available system memory writing Call Data Records to the ram disk and eventually causing Out of Memory errors.

ps an top are also useful, as they indicate the amount of memory allocated to the system/application.

Links

CommitLimit and OOM Killer
Why malloc is different under uClinux
Application Debugging on the Blackfin
Intro to Linux Apps on the Blackfin (skip to bottom of page)
Blackfin forum thread where I asked some questions on this topic

Summary

I hope this was useful – pls email me or add a comment below if you have any comments/suggestions/corrections.
payday loan 6 8 australiapayday loan 8 day pay loanprocessor account mortgage loan manager processorachieve and loans studentadult loan site personals personalscredit loan secured unsecured adverse onlineloan direct student aid federal moneydirect aid financial loan student Mapringtones lg freelg ringtones howtoringtones lg4650wow basketball ringtones bow lilof ringtones funny listlocomotive ringtonesgood long friday ringtoneslow ringtones rider Mapstarfire hentaisexy secretarieshentai shemaleproposal xxxporn trailerswrestling nudepussy catanime pussy Map

The YouBox – Hardware for YouOS

I have been following an idea I originally discovered in a Paul Graham essay about the advantages of placing applications on the web server rather than the desktop PC. Hot mail and Gmail are good examples. The application and the data for that application (your email) are stored on a server.

YouOS is an interesting development along this path – it is an entire operating system that runs on a server, complete with an IDE for application development and some really powerful collaboration models. One very powerful feature is the ability to move from one computer to another, fire up YouOS on the browser, and there are all your applications and data – just as you left them.

Looking into the future a little there will come a day where ALL your applications, and ALL your data can be stored on a server.

This transition seems to be already happening in some parts of the world. This point from the Ajax web site really interested me:

  • Web as the only Platform Thanks to the widespread adoption of public internet access, the so-called technology gap between countries and between socioeconomic groups is closing. Many people don’t actually own a PC, but do have regular access to the web at internet cafes or schools or friends’ homes. For this diverse category of user, there’s no point installing applications and keeping their data locally. The web is their only platform.

So we have a large number of people who use the web as a platform for economic reasons. YouOS will increase the power of the web platform. What would also help is lowering the cost of web access.

With YouOS the only application you need to run on your PC is a browser. Which suggests to me that you don’t need a PC anymore, just some bare-bones hardware with internet connectivity capable of running a browser.

One thought I have had is adding a monochrome LCD display and keyboard to an embedded linux platform (like the hardware in a WRT54G router). These little routers retail for $60, so must cost about $20 to build. Add a keyboard and LCD and you still have a device that still costs less than $50 to build. Then you would have a small, Wifi connected computer with plenty of CPU power/memory to run a browser, basic command line tools etc.

Combined with ubiquitous connectivity we have a YouOS-Access-Device (YAD? or maybe “the YouBox”) and can replace the desktop in many applications. In a laptop form factor the YouBox could be really light and thin and almost disposable. It would be very portable and lightweight and would use much less power than a regular laptop. It’s more like a larger version of a Palm. For a few extra dollars you could add sound-blaster type audio and the device is also a telephone.

I know there is a sub-$100 laptop project out there, the YouBox is another approach that uses the web as a platfrom paradigm to optimise the hardware. One advantage of the YouBox is that it can be put together in small quantities using off the shelf components.

The YouBox concept still has many questions – for example the need for an internet backbone to connect the YouBox to a server and also the YouOS servers. However these may be a little easier to solve, for example if one backbone/server can handle X YouBox clients, you amortise the backbone/server cost by X.

free movies asswatcherbehind door green movie thesex movies bimovie tits bigroderick brande moviecelebrity archive browse moviemovies children pornfootjob movieslesbian free asian moviescheerleader sex of hardcore movies freecredit and $1000 loan bad1 unsecured loan hourvalue 125 loan 2nd mortgageno credit 20,000 loans check4.5 rate interest loansloan servicing access grouploan accredited problems homecar alabama loans Map