Renumbering components using Perl

I am designing a telephony version of the BlackfinOne DSP/uClinux board. To develop the design I am using the gEDA CAD tools. This board runs uClinux on a Blackfin BF532 processor, and has been designed using the open source gEDA tools which makes it really easy for anyone to re-use the design. Thank you very much to Dimitar Penev and Ivan Danov who designed and published this board. They have done an excellent job.

The first stage in any electronic design is usually to enter the schematic. The schematic shows the components and their wiring in a form that is easy to follow:

I wanted to add two Ethernet ports to the BlackfinOne design, so I found a reference design and entered the schematic using the gschem program.

After I entered some sheets for the Ethernet chips, I had a little problem. In the schematic above you can see that every component has a letter & number next to it, for example, C1, C2 for capacitors, or U1, U2 etc for chips. This is a unique ID for each part. The letter generally indicates what sort of part is is (capacitor, resistor, chip etc). These IDs are know as reference designators or as a “refdes”.

Now I wanted the refdes on my new Ethernet sheets to merge nicely with those on the existing BlackfinOne design. For example if the last capacitors refdes on the BlackfinOne design stopped at C51, then my C’s should start at C52.

There is a program to do this (refdes_renum described here in the gEDA tutorial) but unfortunately it renumbers all of the refdes in the design, and I only wanted to renumber the new parts that I had added. I didn’t want to mess up anything in the existing design. Please also see the Links section below for a new alternative – grenum.

Now of course it is possible to number the components manually in gschem via the GUI, but if you know a little Perl then its much more fun to hack out a solution! Now fortunately gschem uses text files to store the schematic. The refdes are stored in the schematic file like this:
C 7000 87800 1 0 0 bf-93LC46.sym
T 8200 89100 5 10 1 1 0 0 1

I think this means that the component (leading C) that is drawn using the bf-93LC46.sym symbol file has a refdes called U19. The line starting with T tells gschem where to place the refdes text on the sheet. Anyway all I am really interested in is reading and changing the refdes lines.

So the Perl script first scans all of the existing schematic files for lines that have the text /refdes=/. The number of the refdes is then extracted, and stored in a hash if it is the largest refdes so far. On the new schematics, I initially enter each refdes as C?, U? etc. The script scans these, and wherever it finds a ? it replaces it with the next refdes in that series.

Improvements and Reflections

There is one obvious improvement. The current script isn’t smart enough to use a refdes that hasn’t been used yet, for example if C1,C2,C4 are currently used it isn’t smart enough to use C3, it will use C5 instead.

This script took me about 70 minutes to write and it worked perfectly the first time. To be honest it didn’t really save me any time (I could have manually entered the refdes in less time) but it was lots more fun to do it this way. And I guess it will be useful the next time I have this problem. Perhaps now that I have posted on it some one else might also find it useful. Come to think of it writing this blog post took longer than writing the script! DOH!

BTW every time I work with Perl (which is maybe once every two months) I spend the first 20 minutes looking up web sites to work out which regexp to use, how to iterate through a hash, etc. If I don’t work on Perl and regexps constantly it all looks like line noise to me after a few weeks:
if ($line =~ /refdes=(\w)(\?)/) {

But hey, its no big deal to refresh my memory every few months.

Solving little problems like this with Perl is fun. You almost hope there isn’t a solution out there already, just so you have an excuse to write one!


I have also written a few Perl scripts for PCB (the PCB design application in the gEDA suite) here.

After I posted the above, Levente Kovacs kindly pointed out his grenum utility which is now included with gEDA! This has been recently added to gEDA and performs the same job as the Perl script above, but can also spot gaps in refdes sequences. As grenum is now part of the official gEDA distribution I would recommend using it rather than I share a trait common among engineers – a strong desire to “do it myself” rather than checking carefully for existing solutions! Thanks Levente!
14 payday 20 loan directory2500 cash quick loansquick 26 loan 37 paydaycredit bad loan 2b personal45 loan payday 31 advance uksites 37 payday loan 53loan cash payday advance 5 5006 loan online kentucky payday 8loan 6 cost 20,8 payday lowlink 60 payday loanmovies hentaimovie lesbiansex moviesmovies adultfree hentai moviesporn moviesmovie clips porn freemovies gay free Mapmp3 achha ji1990 mp3 temptationsnation acumen mp3drame mp3 adamatecktonik mp3 tepr acdgmp3 ackermann sucheachhe mp3 lagtemp3 blog 1990s Mapfree naked moviesdildo movies hugehardcore clips moviemovies dans pornblow free movies job bestwomen movies nakedgirls free teen moviesdouble penetration free movies Mapporn cliphunterporn clipmasterclips free pornporn anal of clipsof girl porn clipsporn clipsporno clips gratuitclipsporn Map

Using Make to build hardware

I have been working on some telephony hardware using the gEDA open source CAD software. Now designing Printed Circuit Boards (PCBs) is fun, but very GUI-centric. Click-Click-Click with the mouse all day. After a few frustrating experiences (OK I messed up the design a few times) I thought “there must be a better way” and “why can’t I use the command line and Make for this?”

Lucky for me the gEDA software uses text files to store the PCB layout so as an experiment I wrote some Perl scripts to handle some boring repetitive tasks. The bottom line is I am using Make to assemble a Printed Circuit Board design from smaller “modules” of PCB – just like software. I have written about it (with lots of fancy screen shots) enemafacesit moviessex women movies fatmovie favorite quotesfemale movies strippershitting females moviesfighting movie website the temptationfinal xi fantasy moviemovies free wmv adultfree sex anal movie downloadsringtone sanyo 5400 sprint freesamsung ringtone a680 midiamc 30 barrington moviesnokia easy 3585i text ringtone freeaudiovox ringtone 8910 cricketclowns 1000 ringtone nextelfree a950 ringtone samsung scha660 samsung sprint free ringtone Mapto credit allowing friends use2bcredit accept 2bcardsunion credit affinity federal 07481accreditation masters divinity aabc inmaine federal union credit acadia madawaskaaccreditation ahrpcard processingcom gambling merchant credit accountbureas 3 credit Map

How to make your Blackfin fly Part 2

This article describes how to write fast DSP code for your Blackfin.

For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).

We built on this work, reducing the complexity for the encode operation from about 40 MIPs to 23 MIPs. On a typical 500MHz Blackfin this means you can now run (500/23) = 21 Speex encoders in real time.

The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.

We obtained gains from:

1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.

2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.

3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.

4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.

Why The Blackfin is Different

Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.

The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces systems costs over designs that need a separate DSP and micro controller.

The disadvantages for crusty old DSP programmers like me is that things don’t always run as fast as you would like them to, for example if your precious DSP code doesn’t happen to be in cache when it is called then you get hit with a big performance penalty.

Some examples

To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.

The cycles program shows how to optimise a simple dot product routine, I have previously blogged on this here.

A Simple Library for Profiling

To work out where to optimise Speex I developed a simple library to help profile the code. It works like this. You include the samcycles.h header file and insert macros:

for(i=0; i<10; i );


around the functions you wish to profile. Then, when you run the program it dumps the number of cycles executed between each macro:
root:/var/tmp> ./test_samcycles
start, 0
end, 503
TOTAL, 503

Which shows that between “start” and “end” 503 cycles were executed. Here is a more complex output from the “dark interior” of the Speex algorithm:
root:/var/tmp> ./testenc male.wav male.out
start nb_encode, 0
move, 1352
autoc, 16149
lpc, 3180
lpc_to_lsp, 21739
whole frame analysis, 17797

Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others. So those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of “inner loop” operations that are hogging
all of the CPU.

Care and Feeding of your Blackfin Cache

One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
"P0 = %2;\n\t"
"R0 = 0;\n\t"
"%0 = CYCLES;\n\t"
"LOOP dot%= LC0 = %3;\n\t"
"LOOP_BEGIN dot%=;\n\t"
"W[P0 ] = R0;\n\t"
"LOOP_END dot%=;\n\t"

This also illustrates why DSPs are so good at number crunching – that inner “W[P0 ] = R0” instruction executes in one cycle, and the hardware loop means 0-cycles for the loop overhead. Try doing that on your Pentium.

However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
686 542 597 542 542 542 542 597 542 542

The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instruction code and data gets placed into cache.

However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, and it should be executing in a single cycle. My 500 MHz DSP is performing a like a 100 MHz DSP. I think I am going to cry.

The reason is that in “write through” mode every write must flow through the “narrow pipe” that connects to the Blackfins external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed.

This is not good news for a DSP programmer, where you often have lots of vectors that need to get written to memory. Very quickly.

There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).

Another less exciting way is to enable “write back” cache (a kernel configuration option):
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
119 102 102 102 102 102 102 102 102 102

Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles as expected. Note that the first run takes a little longer, this is probably because the program code had to be loaded into the instruction cache. In “write back” cache the values get stored in fast cache until the cache-line is flushed to external memory some time later.

On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.

In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…..

Internal Memory

The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.

The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.

In the mean time I have written a kernel mode driver l1 alloc that allows user-mode programs to access internal memory:
/* try alloc-ing and freeing some memory */

pa = (void*)l1_data_A_sram_alloc(0x4);
printf("pa = 0xx\n",(int)pa);
ret = l1_data_A_sram_free((unsigned int)pa);

which produces the output:
root:~> /var/tmp/test_l1_alloc
pa = 0xff803e30

i.e. a chunk of memory with an address in internal memory bank A.

To see the effect of internal versus cache/external memory:
Test 1: data in external memory
ret = 100: run time: 173 103 103 103 103 103 103 103 103 103
Test 2: data in internal memory
ret = 100: run time: 103 103 103 103 103 103 103 103 103 103

After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.

I should mention that to make this driver work I needed to add a an entry to my
uClinux-dist/vendors/AnalogDevices/BF537-STAMP/device_table.txt file:
/dev/l1alloc c 664 0 0 254 0 0 0 -

then rebuild Linux as for some reason I couldn’t get mknod to work. Then:
root:~> ls /dev/l1alloc -l
crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc
root:~> cp var/tmp/l1_alloc_k.ko .
root:~> insmod l1_alloc_k.ko
Using l1_alloc_k.ko
root:~> /var/tmp/test_l1_alloc

Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.

Speex Benchmarks

Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:

The terms ext/int memory refers to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20ms frame, averaged over a 6 second sample (male.wav).

(1) Write through cache, ext memory: 564
(2) Write through cache, int memory: 455
(3) Write back cache , ext memory: 465
(4) Write back cache , int memory: 438

So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain.

Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.

To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
svn co

Or you can download a current snapshot from here.

movies sapphic contentfree adult trailers moviemovies free voyeursapphic petite moviesdancing dirty moviefree clips movie ebony pornmovie pornosex movie hidden disneyfucking granny moviesmovie the blowbarrington alfreds of island rhodetheme g ali ringtonesringtones 3ringtones 311ringtones 99missed call ringtone 1mobile ringtones absolutely freeac ringtone dc Mapviagra pills 100mg priceviagra acapulcoviagra achetertramadol $79 180200 cod overnight tramadolper hcl tramadol acetaminophentramadol all aboutversus ambien xanax Mapringtone acdc midi12 ringtone adamringtones australia fair advancesecond ringtone 203310 ringtone composebalut abs cbn ringtone64 poliphony ringtones1000 words ringtone Mapporn 69 position69ing porngrade 6th porn7 penis porn inchporno 70mivies 70s pornporn pictures 70sporn s fan 80 Map

GSM Port for the Blackfin

For my uCasterisk project I needed a couple of optimised codecs for the Blackfin. This post discusses the steps taken to port GSM to the Blackfin.

The GSM codec for the Blackfin can be downloaded here.


1/ To make:

2/ To test:

Download tgsm (test program produced by make) to your target and also download a source speech file like:

to your Blackfin hardware and type:
root:/var/tmp> ./tgsm male.wav male.out
SNR = 10.4591 dB enc 114 dec 39 k cycles/frame

When it runs it prints out the number of cycles it took to execute each 20ms encode and decode frame.

You can then upload the output file (male.out) to your host and listen to it. On my Linux box I use “play male.sw”, the sw lets “play” recognise it as a 16-bit signed-word file.


I spent a day or so optimising the code, for example:

a) I wrote Blackfin versions of the macros in gsm/inc/private.h

b) Applied the profiling macros SAMCYCLES and worked out which parts of the code needed the most optimisation.

c) I looked at the assembler output of various functions (gcc -S or -save-temps options) and modified the C code for better output, such as using the hardware loop supported by gcc 4.1. A lot of the original GSM code was written for older x86 compilers, and lots of compiler-specific mods were evident. In many cases to speed up code I just went back to vanilla C and the Blackfin compiler did a better job!

e) By inspecting the assembler I found some important routines were making function calls inside their inner loops which is very inefficient. These were modified to remove the function calls.

f) Use some assembler in the tightest, most cycle-hungry loops.


Using gcc 4.1 and testing on a Blackfin STAMP BF533 board:
encode: 114,000 cycles/fr: (114,000/0.02s) = 5.7 MIPs
decode: 39,000 cycles/fr: (39,000/0.02s) = 1.95 MIPs

The initial number of cycles per encode was 274,000, decode 82,000.

Further Work

My gut feel is it might be possible to reduce the total (encode plus decode) cycles by perhaps another 30% with further optimisation.

a) The analysis and synthesis filter functions consume about 50,000 cycles per encode/decode cycle, they could be converted to assembler.

b) The RPE algorithm (rpe.c) could be optimised.

c) Blackfin internal memory might speed some operations, such as autocorrelation.

How To Profile

I have written a set of macros (samcycles.h) to sample the Blackfin cycles counter. Here is an example on how to use them:

a) Patch code.c:
patch -p0 < code_profile.patch

b) make, download tgsm and re-run on the target:
root:/var/tmp> ./tgsm male.wav male.out
start Gsm_Coder, 0
Gsm_Preprocess, 5312
Gsm_LPC_Analysis, 11406
Gsm_Short_Term_Analysis_Filter, 23483
Gsm_Long_Term_Predictor, 11525
Gsm_RPE_Encoding, 8308
Gsm_Long_Term_Predictor, 10947
Gsm_RPE_Encoding, 5411
Gsm_Long_Term_Predictor, 10701
Gsm_RPE_Encoding, 5422
Gsm_Long_Term_Predictor, 10696
Gsm_RPE_Encoding, 5409
end Gsm_Coder, 521
TOTAL, 109141
SNR = 10.4591 dB enc 115 dec 39 k cycles/frame

c) To investigate further, just add more SAMCYCLES() macros. Its a good idea to remove or disable the macros when you are finished, as they use a few thousand cycles:
patch -R -p0 < code_profile.patch


To Jean-Marc Valin and the Speex project, I used some of their assembler code (see COPYING.xiph for the copyright message related to this code).
loans $100loan state alaskapercent 125 home equity loanloan home alabama mobilefast 10,000 loanequity accelerated loansfraud advance fee loana loan paper c b Mapporn review 1280×72013 girl porn14 lesbian pornporn year old 14vfere min 15 pornpreviews 15 porn minclips minute porn 1515 minutes porn Mapagreement laptop loanlaserpro origination loan systemva loans about learnlegal templates document loan personalrepayment loan educational for legislationlenders in loan bad wi makingof loans lenght used rvlibrary loan Map

How to make your Blackfin fly Part 1

The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux and has a great open source community and there are even open (free) hardware designs available.

I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

Boring Mathematical Bit

As a first step I have written some test program called cycles.c that demonstrates how to optimise the Blackfin for DSP operations. A tar-ball including a Makefile is here.

The sample code just finds the dot product of two vectors:
int dot(short *x, short *y, int len)
int i,dot;

dot = 0;
for(i=0; i<len; i )
dot = x[i] * y[i];

return dot;

It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually thats all a DSP really is, a processor designed to compute dot-products quickly.

The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 MACs.

Down to Business

Enough talk, here is a run of the sample code from my BF537 STAMP:

root:/var/tmp> ./cycles
Theoretical best case is N/2 = 50 cycles
Test 1: Vanilla C
ret = 100: run time:
3838 3507 3373 3408 3373 3373 3373 3373 3373 3373
Test 2: data in external memory, outboard cycles function
ret = 100: run time:
442 240 239 218 218 218 218 218 218 218
Test 3: data in external memory, inboard cycles
ret = 100: run time:
242 103 103 103 103 103 103 103 103 103
Test 4: data in internal memory, inboard cycles
ret = 100: run time:
214 53 53 53 53 53 53 53 53 53

A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.

Each test runs 10 times, in several of the tests you can see the number of cycles reducing as the instruction and data cache gets loaded over successive runs.

The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:
int cycles() {
int ret;

__asm__ __volatile__
"%0 = CYCLES;\n\t"
: "=&d" (ret)
: "R1"

return ret;

Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles, Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.

For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 shows how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM, after a run they get to L1 cache, but they are still in the same bank of physical memory, hence a 100% speed penalty.

Allocating Internal Memory

At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:
/* I know, I know - this is very naughty :-) */
short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */
short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */

I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?

BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.


Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 80 (3838/50 between test1 and test4). However with a little optimisation, and some hand coded assembler, it is possible to get full performance from the chip.

I know coding hand-optimising assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marcs samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.


Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex, all that optimised fixed point DSP code makes my head spin!movies sex adultfucking movie black clipsmovie free boobs bouncingmovie euro triplinks adult free moviefree movie fuck sampleserotic free japanese moviemovies porn homemademasterbation moviesfree movies porn

Measuring Stack Usage in Multi-threaded uClinux Apps

In regular Linux the MMU allows the stack to grow dynamically, the MMU just allocates more physical pages. However in uClinux, the correct amount of stack for each thread must be allocated before the thread is created.

Too little stack and your program will corrupt the system in nasty, unpredictable ways. Thread stack gets malloced from the system heap, so an overflow means writes to an arbitrary address just outside the block of memory allocated to the thread. This memory could possibly be in use by other parts of the system, perhaps for a different application. These sorts of bugs can be very difficult to track down.

If you allocate too much stack, then you are wasting memory, a valuable resource on embedded systems. For example I discovered I was allocating far too much stack and wasting Mbytes of memory, especially when multiple threads were running.

The standard approach is to try random values of stack until you find one that works. However I thought it might be a better idea to actually measure the amount of stack used by each thread. Then I could tweak the stack allocation to optimise memory usage and even check for stack overflows at run time.

Threadstack Library

I have written a small library (called threadstack):
unsigned int threadstack_free(pthread_t *thread);
unsigned int threadstack_used(pthread_t *thread);

Here is a sample run on my Blackfin BF537 STAMP, when a 100k stack was allocated to a thread:
root:/var/tmp> ./test_threadstack
stack used: 1300
stack free: 100620

How it Works

The functions work by examining memory allocated to the stack. The theory is that if the memory is non-zero, then it must have been used by the thread at some time (the entire block of memory used for the stack is initially set to 0 before the thread starts). So the routines search the stack memory for the first non-zero value, and that is declared the “high water mark” – the point where the stack reached it’s maximum.
0xff <- stack top
0xfe <- high water mark
0x00 <- stack bottom

The high water mark will change over time, so after your thread has been running for a while is the best time to measure stack usage.

One weakness with this approach is that if stack allocation is way too low your program may bomb before these routines get a chance to run. However in that case you will at least know there is a problem, and can increase stack to some high number (e.g. Mbytes) to get the program running, before using these functions to determine actual stack requirements.


In my uClinux Asterisk port I have added code to check for stack overflow just before a thread ends:
pthread_t thread = pthread_self();
assert(threadstack_free(&thread) > 10*1024);

This code checks that while the thread was running, the minimum free stack was 10k. The assert will kill the program with an error message and tell me straight away I need more stack. Much nicer than getting an obscure bug in the system due to a stack overflow on a thread. Now the program finds stack overflow bugs for me!

This example above runs from within the actual thread itself, hence the call to pthread_self() to discover the threads handle. You can also call the functions from another thread (e.g. the main thread), for example to periodically meter stack usage.


More information on multi-threaded applications for uClinux
movies hardcoremovies ebony girls buttclips sybian moviemovies homemade sexmovie stars nudesamples movie hot xxxmovies free handjobmovies anal free Mapstar porn blonde galleryporn gallereis thumnail blondeporn videos blondepornstar blonde pussyporno teen blondepornstar blonde videoporn blondesblondes pornstars Mapporn cliphunterclipmaster pornfree clips pornclips anal porn ofof clips porn girlclips porngratuit porno clipsclipsporn Mapringtones boltblueboombastic ringtone freebreakfast club ringtonesringtone brewerspride brown ringtonescoupon ringtone cingluarcingular phones ringtones recordsurvive circa ringtones Map

Porting multi-threaded apps to uClinux

I have recently been working on improving the stability of uCasterisk, a port of Asterisk to uClinux. This required some research into memory management for multi-threaded apps on uClinux. I didn’t find any one resource that had everything I needed to know so I thought I would collate some of the information I found here as a resource for others. Thanks to all those (especially on the Blackfin forums) who helped answer my questions.

I am using the Blackfin flavour of uClinux and the uCasterisk application as an example, but this information should apply equally to other uClinux systems/applications.

MMU versus no-MMU

Asterisk is a pretty big application for uClinux, the executable is about 2.5M and when running several calls can consume 32M of system memory. The big difference between uCasterisk and other Asterisk implementations is the lack of MMU. A MMU is handy when working with large, multi-threaded apps. For example when a thread is kicked off you can allocate a virtual stack of say 2M, but physical memory will only be allocated as and when it is actually required (say due to a write to a previously unused part of the stack). If your thread never uses all of the stack, then the physical memory is available for other users.

On a MMU-less system you need to work out the maximum stack your thread may need, and allocate that. If you get it wrong, your application (and possibly the whole system) will bomb. This generally means you are wasting memory compared to the MMU case, as you always need to allocate the worst case amount of memory required.

One possible advantage of MMU-less systems is no nasty surprises – any memory allocated really does exist, and no over-commitment is possible. On a MMU-based system physical memory isn’t actually allocated until you write to it, and it may be paged to disk just when you need it (although I understand there are options to control this behaviour).

Stacks for Threads

When you start an app, you get allocated a stack for the application. This is actually a stack for the main thread of the application. When you start a new thread (say with pthread_create()) the thread gets allocated a new stack from the system heap. The two stacks are completely unrelated. The size of each stack is independent, you control the size in different ways (see below).

Tips for Porting to uClinux

Don’t enable stack checking. This feature is very useful for single-threaded apps; it causes the operating system to kill the app when it uses all of it’s stack space. Very useful, as it tells you straight away to increase the stack size. Unfortunately at present this feature hasn’t been extended to multi-thread applications; using it with multi-threaded apps (at least on the Blackfin) causes problems as pointed out in the 2005R4 RC2 release notes and discussed here.

You control the application (main thread) stack with the -s option, on my Blackfin system the command line is:

bfin-uclinux-gcc -Wl,-elf2flt='-s 1000000' \
-o thread thread.c -pthread

In this example the stack is set to 1000000 bytes.

You control the size of the stack for each thread you create using pthread_attr_setstacksize(), for example (from the Asterisk utils.c file):

pthread_attr_setstacksize(&attr, 0x1000000);
pthread_create(&thread, &attr, thread_func, NULL);

Monitoring Memory Usage

cat /proc/meminfo can be very useful, here is the output from my Blackfin STAMP BF533 board, taken while uCasterisk was running with several SIP calls in progress:

root:/var/log/asterisk> cat /proc/meminfo
MemTotal: 59784 kB
MemFree: 11084 kB
Buffers: 100 kB
Cached: 4172 kB
SwapCached: 0 kB
Active: 3828 kB
Inactive: 444 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 59784 kB
LowFree: 11084 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 4 kB
Writeback: 0 kB
Mapped: 0 kB
Slab: 43744 kB
CommitLimit: 29892 kB
Committed_AS: 0 kB
PageTables: 0 kB
VmallocTotal: 0 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB

The most important fields are MemFree (total system memory free) and Slab (system wide heap in use).

In earlier versions of Linux the CommitLimit field indicated the maximum Slab was allowed to reach before processes were killed (with Out-Of-Memory errors). However on my distro I discovered by experiment that you can actually increase the Slab well beyond this limit, as indicated above. Looking at the kernel source file uClinux-dist/linux-2.6.x/mm/nommu.c, __vm_enough_memory() function it appears that the memory allocator uses the OVERCOMMIT_GUESS method, which ignores the CommitLimit and allows up to 97% of memory to be allocated.

It is interesting to observe MemFree as you perform different operations. For example on uCasterisk when a new SIP call starts, a thread is created, which requires stack and heap space. I also noticed MemFree decreasing when I copied files on a ram file system – this caught me for a while as uCasterisk was chewing through available system memory writing Call Data Records to the ram disk and eventually causing Out of Memory errors.

ps an top are also useful, as they indicate the amount of memory allocated to the system/application.


CommitLimit and OOM Killer
Why malloc is different under uClinux
Application Debugging on the Blackfin
Intro to Linux Apps on the Blackfin (skip to bottom of page)
Blackfin forum thread where I asked some questions on this topic


I hope this was useful – pls email me or add a comment below if you have any comments/suggestions/corrections.
payday loan 6 8 australiapayday loan 8 day pay loanprocessor account mortgage loan manager processorachieve and loans studentadult loan site personals personalscredit loan secured unsecured adverse onlineloan direct student aid federal moneydirect aid financial loan student Mapringtones lg freelg ringtones howtoringtones lg4650wow basketball ringtones bow lilof ringtones funny listlocomotive ringtonesgood long friday ringtoneslow ringtones rider Mapstarfire hentaisexy secretarieshentai shemaleproposal xxxporn trailerswrestling nudepussy catanime pussy Map

The YouBox – Hardware for YouOS

I have been following an idea I originally discovered in a Paul Graham essay about the advantages of placing applications on the web server rather than the desktop PC. Hot mail and Gmail are good examples. The application and the data for that application (your email) are stored on a server.

YouOS is an interesting development along this path – it is an entire operating system that runs on a server, complete with an IDE for application development and some really powerful collaboration models. One very powerful feature is the ability to move from one computer to another, fire up YouOS on the browser, and there are all your applications and data – just as you left them.

Looking into the future a little there will come a day where ALL your applications, and ALL your data can be stored on a server.

This transition seems to be already happening in some parts of the world. This point from the Ajax web site really interested me:

  • Web as the only Platform Thanks to the widespread adoption of public internet access, the so-called technology gap between countries and between socioeconomic groups is closing. Many people don’t actually own a PC, but do have regular access to the web at internet cafes or schools or friends’ homes. For this diverse category of user, there’s no point installing applications and keeping their data locally. The web is their only platform.

So we have a large number of people who use the web as a platform for economic reasons. YouOS will increase the power of the web platform. What would also help is lowering the cost of web access.

With YouOS the only application you need to run on your PC is a browser. Which suggests to me that you don’t need a PC anymore, just some bare-bones hardware with internet connectivity capable of running a browser.

One thought I have had is adding a monochrome LCD display and keyboard to an embedded linux platform (like the hardware in a WRT54G router). These little routers retail for $60, so must cost about $20 to build. Add a keyboard and LCD and you still have a device that still costs less than $50 to build. Then you would have a small, Wifi connected computer with plenty of CPU power/memory to run a browser, basic command line tools etc.

Combined with ubiquitous connectivity we have a YouOS-Access-Device (YAD? or maybe “the YouBox”) and can replace the desktop in many applications. In a laptop form factor the YouBox could be really light and thin and almost disposable. It would be very portable and lightweight and would use much less power than a regular laptop. It’s more like a larger version of a Palm. For a few extra dollars you could add sound-blaster type audio and the device is also a telephone.

I know there is a sub-$100 laptop project out there, the YouBox is another approach that uses the web as a platfrom paradigm to optimise the hardware. One advantage of the YouBox is that it can be put together in small quantities using off the shelf components.

The YouBox concept still has many questions – for example the need for an internet backbone to connect the YouBox to a server and also the YouOS servers. However these may be a little easier to solve, for example if one backbone/server can handle X YouBox clients, you amortise the backbone/server cost by X.

free movies asswatcherbehind door green movie thesex movies bimovie tits bigroderick brande moviecelebrity archive browse moviemovies children pornfootjob movieslesbian free asian moviescheerleader sex of hardcore movies freecredit and $1000 loan bad1 unsecured loan hourvalue 125 loan 2nd mortgageno credit 20,000 loans check4.5 rate interest loansloan servicing access grouploan accredited problems homecar alabama loans Map