How To Make Your Kids Rich

This page describes a simple plan to generate $150,000 for your child by the time they turn 21.

The Plan

What you need to supply is:

  • Time: About 20 years is great; more or less time can work if you adjust the plan. The plan works very well if you start when your child is born.
  • Saving: You need to save $8 each work-day. You get weekends off, so that's $40 a week. This is the tough bit. Most people can spare $8 a day, but very few can save it over the long term.

It works like this:

  1. You save $8 a day, $40 a week, or around $2,000 a year.
  2. You invest this in a managed share fund that is indexed to the share market. Based on historical performance this means an average of around 10% appreciation every year. The share market has many ups and downs, but over the long term 10% is about right if you reinvest dividends. You are interested in the long term, so the swings will average out. If you don't like the share market you could choose some other investment with similar returns, like a property fund.
  3. You increase your savings each year in line with inflation. For example, if average inflation is 3%, you increase your savings from $40/week in the first year to $41.20/week in the second year.
  4. You keep this up for 20 years. Your saving and the magic of compound interest mean your child has over $150,000 at age 21 (the sketch below shows the arithmetic). Even allowing for inflation, that's still a lot of money, just when they need it the most. Invested wisely, it could make a big difference to your child's life. The cool thing is that, unlike you and me, kids have a pretty long time horizon, which lets compound interest go to work with a vengeance.
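Here is a minimal C sketch of the plan's arithmetic, under the assumptions above (10% annual growth, 3% inflation, a year's savings deposited at the start of each year). The exact figure depends on when deposits land and when growth is applied, so treat it as illustrative:

#include <stdio.h>

int main(void)
{
    double balance = 0.0;
    double saving = 40.0 * 52; /* about $2,000 in year one */
    int year;

    for (year = 1; year <= 20; year++) {
        balance = (balance + saving) * 1.10; /* deposit, then 10% growth */
        saving *= 1.03;                      /* index savings to inflation */
        printf("year %2d: $%10.2f\n", year, balance);
    }
    /* the final balance comes out at around $160,000 */
    return 0;
}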

The Spreadsheet

Here is an Excel spreadsheet that shows you how it all works. Here is the same table in PDF form. I wanted to put the table directly in this post but couldn’t work out a nice way of getting a formatted spreadsheet table into WordPress.

BTW I find Excel (or the spreadsheet of your choice) really cool for basic financial modelling like this. It’s a lot of fun to experiment with. If you think part of my model is wrong, try entering your own parameters and see what develops. What is the effect of 5% inflation, or putting $80/week into the plan?

The Results

Does it work? I think so. I started it about 8 years ago (when my first son was born and my daughter was 2), and while not exactly matching the model results, both of them could now buy a (small) new car outright. My 8 year old probably has more savings than many 28 year olds. Still, my kids are a few years behind the curves in the spreadsheet, which makes me wish I had put in more money earlier. Doh!

One problem is that I am struggling to put the $40/week in. I like to do it in lump sums of a couple of $k a year, and my wife often objects when I suggest it. It's kind of hard for many people to visualise the compounded effect $2k can have over 20 years, especially if you are focused on paying down a mortgage or have bills to pay. However, I can't think of a better use for my money than my kids' future. Can you?

In Australia the government gives $4,000 to the parents of every new baby born. So when our 3rd child was born last year I invested this baby bonus in my newborn son's name. I am not sure what most parents do with their baby bonus, but I suspect a lot of it ends up being spent on flat screen TVs. Now my 11 month old son needs to file a tax return, as he made a few hundred dollars in dividends! As my daughter says, “he can't even talk and he is making money”!


Roads and Wealth Creation

For years I have been interested in travel to some of the more remote areas of Australia. Recently I bought an old (but tough) 4WD, and with my children Amy and William we set out for the Flinders Ranges, about 500km north of our home in Adelaide, to spend a few days camping and doing a little off road driving.

One day we visited Arkaba Station, a private property where for $35 we could spend a few hours on a self guided 4WD tour through the station. There was a mixture of terrains: creeks, some very steep sections (at least for me), and very rough roads. It was really an ideal trip for me, just right for my (very basic) 4WD skill level. We were the only ones on the track the entire time, and the scenery was very nice, although I was focused mostly on the road. The kids really liked the steep bits and being shaken to pieces on some of the rough roads. They also spotted lots of wildlife (emus and kangaroos, as well as the sheep that are run on the station).

One section was covered with large rocks sticking out at all angles. Our speed was limited to about 5km/hr, and inside the car we were shaking and bouncing all over the place. We could have got out and walked faster. It occurred to me that this was probably a typical road for 99% of history, and that travel at walking pace (and being shaken to pieces if in a vehicle) was probably the norm. This must have made communications very difficult, expensive and therefore very rare.

Driving off road made me appreciate that a simple flat ribbon of bitumen that enables us to travel at highway speeds really is a miracle. The distances we can cover in a few hours on a modern road must have taken days 100 years ago using the poor roads and horse and carts of the time.

Roads as a creator of wealth

One question I have been thinking about lately is “where does wealth come from?” I mean, we all talk about economic growth, which to me means more money in a country or region as a whole. But if an economy grows, we all get richer – so where does the wealth come from? Who is pumping the money in? If we are getting richer, is someone else getting poorer?

Thinking about roads gave me a good example of wealth creation.

Roads allow us to move more efficiently; less effort is required than without them. The government took our taxes and invested them in this infrastructure, and now it costs much less to get from A to B. The cost of travel has been reduced.

Let's look at an example. In 1850, Farmer Joe needs to make a 100km round trip. He uses his horse and cart and spends two days travelling; in those two days he cannot do any other work. In 2006, Farmer Joe jumps in his car and covers the 100km in 1 hour. He returns to the farm and can work for two full days (minus the 1 hour). That is two days of extra work that Joe can earn income from – additional wealth created by the road.

The interesting thing is that no one else has lost anything to make Joe richer. It is a win-win situation. Communications infrastructure (the road) has created wealth, and no one loses.

What about the cost of the road? Well, I am assuming that the government paid for it out of the same taxes Joe would have paid anyway; they just decided to spend the taxes on a road rather than a warship or a new town hall or something. I am also guessing that the cost of the road is far less than its economic benefit to those who use it over the years.

Wealth Creation

Looking at the people around me, I notice that most of us are much better off in a material sense than we were, say, 20 years ago. Much of the stuff we buy (especially consumer goods) is also much cheaper than it was 20 years ago. For example, in 1989 I bought a VCR for $1100; in 2006 I bought a combined VCR/DVD player for $150. Adjusting for inflation, that's a cost reduction of about 15 times. Fifteen VCRs for the price of one!

Somehow, wealth has been created. We all have a lot more money. My mother tells me stories about “stretching” the food money to last until pay day in the early 70s. Now our biggest problem is when we are going to get a flat screen LCD TV.

The only exception to this rule seems to be real estate. My theory here is that real estate prices have been pushed up by demand, as people now have enough discretionary money to cover huge interest repayments. Real estate has soaked up excess wealth through the laws of supply and demand.

But I digress. The question I was looking into is “how is wealth created”?

The best answer I have worked out so far (in discussions with several friends) is this: we all get richer as the cost of producing stuff is reduced.

So I think it works like this:

  • Technology is developed that allows something to be produced cheaper than before. For example fertilisers make food production more efficient. Silicon chips lower the cost of electronics.
  • This makes some product or service cheaper to deliver.
  • So we spend less to obtain a given product or service, and have more money free for other things. Like flat screen TVs, or huge interest payments on your mortgage. Or sending our kids to expensive private schools.

Advancing technology is not good for everybody. Some industries get displaced. The development of roads and cars probably put a lot of the horse and cart industry out of business. This must have caused a small group of people a lot of distress, despite the huge benefit to a large number of other people. And yet the world is clearly a better place for having efficient transport.

It is also interesting to see what we do with our increased material wealth. Somehow soaking it up in big interest payments or mobile phone bills doesn't seem right when 10,000 children are dying each day from preventable disease. But that's another story. I am currently reading a couple of books on these subjects, and recommend them to you.

Communications Technology for Wealth Creation

I see a lot of parallels between physical communications assets like roads and communications technologies like VoIP and the Internet. They all let us communicate faster and better.

They are engines for wealth creation.

So reducing the cost of communications is a good thing. It lets us communicate faster and cheaper. Reduce the cost enough and maybe some kid in a disadvantaged country can make his first telephone call or obtain an education over the web.

Summary

So on balance I think “decreasing the cost of stuff” is a good thing. It is a mechanism for creating wealth. Like the example of roads, it can be win-win – wealth can be created like magic without disadvantaging a large group of others.

I must admit I find this pretty cool. It suggests that we can actually increase wealth (and the material well-being of everyone on the planet) in a win-win situation. For us fortunate people in the 1st world this means more gadgets or a nicer house. For the less fortunate it means hope that one day they will be able to eat 3 meals a day and not die of malaria.

A Bit Exact DSP Fairy Tale

Once upon a time in a far off land there was an underpaid pale skinny guy working in an impoverished start up who wanted to build products with fixed point DSP chips in them. The problem was he had just about killed himself doing the same thing many times in the past.

In particular, he seemed to spend a zillion hours debugging intricate fixed point DSP assembler to get products working. All while some project manager who hadn’t written a line of code for 10 years was hovering behind him asking questions like “is it ready” and mumbling something about milestone dates that were ridiculous when proposed, and soon became hilarious after the project started.

So, our hero wondered, “is there a better way?”

How to Build a DSP Based Product

Many DSP products are developed like this:

The academic “brains trust” guys work out a clever new algorithm. Now these guys are way-smart, but tend to express themselves in algebra with lots of Greek letters. They usually have PhDs, they cost a lot, and there aren't many of them. They might test their brain-child in a simulation environment like Matlab:

x = [1.1 2.0 3.0];
y = [2.5 3.0 4.0];
dot = x * y';

Now this lets them test the algorithm quickly (dot = 20.75), but you can't run code like this in a real world product because (a) it runs 10 million times slower than real time and (b) Matlab licences cost $5,000 each and only run on $3,000 PCs. Which is a shame when you are building a $99 product.

So some poor engineer then has to make the DSP algorithm work in real time (i.e. fast) on the “target” hardware, a $10 DSP chip. Actually, these guys generally like the detail work this involves. The first step they might do is convert the algorithm to floating point C:
float x[] = {1.1, 2.0, 3.0};
float y[] = {2.5, 3.0, 4.0};

float dot(float x[], float y[], int n) {
    float d = 0.0;
    int i;

    for(i=0; i<n; i++)
        d += x[i]*y[i];

    return d;
}

But the guys & gals in Marketing (they are a breed closely related to Managers, but tend to be more attractive and use airports more) have decided that this must be a low cost product, so a $10 fixed point DSP is chosen.

Now a fixed point DSP is a chip that can perform multiplies and adds quickly, as long as you only use integers. They don’t do floats. So the engineer converts the algorithm to fixed point:
short x[] = {11, 20, 30}; /* x*10 */
short y[] = {25, 30, 40}; /* x*10 */

int dot(short x[], short y[], int n) {
    int d = 0;
    int i;

    for(i=0; i<n; i++)
        d += x[i]*y[i];

    return d; /* returns dot*100 */
}

Now look at what has happened to the x[] and y[] arrays. We have multiplied each number by 10 so we can use an integer to hold the value. If we didn't scale each number we would lose information (for example converting the 1.1 to 1), which in certain algorithms would cause problems.

However, because each input value is multiplied by 10, the result is multiplied by 100 (10x * 10y = 100xy). So we need to take this output scaling into account in any other stages of the algorithm. One more piece of information we need to carry around with us as we code. One more source of bugs.
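For example, a hypothetical caller has to remember to undo the scaling (the divide by 100 below is mine, not part of the routine above):

short x[] = {11, 20, 30}; /* values * 10 */
short y[] = {25, 30, 40}; /* values * 10 */

int d100 = dot(x, y, 3); /* result carries a *100 scale factor */
int d = d100 / 100;      /* back to natural units: 20 (20.75 truncated) */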

This is a very simple scaling example. In practice it gets much scarier and means you need to hold a lot more information in your head when you program in fixed point (see complex sample below).

There is also the potential for overflow problems. As the output is scaled up by 100, it is easy to overflow the 32 bit range of the integer output, for example if we used x[] and y[] arrays with larger numbers.
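A hypothetical worst case makes the point:

short x[] = {32767, 32767, 32767}; /* biggest value a short can hold */
short y[] = {32767, 32767, 32767}; /* (3276.7 before the x10 scaling) */

/* each product is 32767 * 32767 = 1,073,676,289, so the sum blows
   past 2^31 - 1 = 2,147,483,647 on the third accumulation */
int d = dot(x, y, 3); /* garbage due to overflow */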

I Feel the Need for Speed

The product requires that this routine must run as fast as possible on the DSP chips. So we need to code an optimised version of this routine in assembler:
int dot(short *x, short *y, int len)
{
    int d;

    __asm__
    (
    "I0 = %1;\n\t"
    "I1 = %2;\n\t"
    "A1 = A0 = 0;\n\t"
    "R0 = [I0++] || R1 = [I1++];\n\t"
    /* two 16-bit pairs per loop iteration, so len is assumed even */
    "LOOP dot%= LC0 = %3 >> 1;\n\t"
    "LOOP_BEGIN dot%=;\n\t"
    "A1 += R0.H * R1.H, A0 += R0.L * R1.L"
    " || R0 = [I0++]"
    " || R1 = [I1++];\n\t"
    "LOOP_END dot%=;\n\t"
    "R0 = (A0 += A1);\n\t"
    "%0 = R0 >> 1;\n\t"
    : "=&d" (d)
    : "a" (x), "a" (y), "a" (len)
    : "I0", "I1", "A1", "A0", "R0", "R1"
    );

    return d;
}

This is assembler for the Blackfin DSP; the actual code will of course vary for different DSP chips. The important bit is this:
A1 += R0.H * R1.H, A0 += R0.L * R1.L
|| R0 = [I0++]
|| R1 = [I1++]

In a single clock cycle, our intrepid DSP performs two parallel 16-bit multiplies, two 16-bit adds, and fetches two 32-bit words from memory, post-incrementing the pointers after each fetch. It does all this 500 million times a second, and sometimes it even gives you the right answer.

But not often. First you need to spend a lot of time getting it all to work right. The assembler is much harder to grok than the previous versions of the algorithm, and you use up a lot of brain power on a myriad of fine details. Get any one of the details wrong and the product is a dud.

The target development environment is also tough. In many cases the feedback you use to debug might be an analogue signal flowing into or out of the product (that is processed internally by the DSP). Or a blinking LED. Sometimes you don't even know when it's really working; bugs can simply slip through. You can't always stop the code and single step using a debugger, as parts of the algorithm depend on real-time activity.

It’s a development/testing/maintenance nightmare.

Bit Exact Assembler

There is a way to ease the pain a little. Let's say there is a bug in the assembler version. The problem is that to debug the algorithm you may need to understand both the assembler and the DSP algorithm – I mean, to debug something you need to know how it works, right?

This is OK for a simple algorithm, but what if it is even moderately complex? Then you are dealing with (i) a very complex algorithm and (ii) very complex assembler, all at the same time. This is usually too much for mere mortals (and even minor deities). The result is long, painful development, failed projects, late nights, angry spouses, and lots of pizza (well, it's not all bad I guess).

The trick is to divide and conquer. Make sure we are only hitting one tough problem at any given time.

One approach I have found very useful is bit exact porting to assembler. There are two important steps:

1. At the fixed point C stage, you test very, very carefully. Run batteries of tests and simulations. These can be performed on a regular PC in non real time, using powerful debuggers. The idea is to verify that (i) the algorithm is OK and (ii) the fixed point port works OK. In particular (ii) is very tough, so it's nice to handle it in a relatively benign environment like a PC or workstation.

2. Port each function to real time assembler one by one. Test each function against the fixed point C reference. Make sure the functions give IDENTICAL output – right down to the last bit. This takes a lot of discipline – near enough is NOT good enough.
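Here is a minimal sketch of such a unit test, reusing the dot product example (dot_c() and dot_asm() are hypothetical names for the fixed point C reference and the assembler port):

#include <assert.h>

int dot_c(short x[], short y[], int n);   /* fixed point C reference */
int dot_asm(short *x, short *y, int len); /* hand coded assembler port */

void test_dot_bit_exact(void)
{
    short x[] = {11, 20, 30};
    short y[] = {25, 30, 40};

    /* identical means IDENTICAL - right down to the last bit */
    assert(dot_c(x, y, 3) == dot_asm(x, y, 3));
}

In practice you would sweep many input vectors, including extreme values – corner cases are exactly where fixed point ports go wrong.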

The benefits of bit exact coding are:

1. When coding assembler, you don't have to worry about or even understand the algorithm. You just have to make sure that the assembler version matches the C version. When they do, you know that the overall system will perform as expected. This lightens the load on your brain, leading to fewer bugs.

2. As you integrate modules of code together, problems are easy to spot – just look for where the output doesn't match the fixed point C simulation. This is much simpler than performing batteries of tests in a real time environment.

3. A less senior engineer can be used to perform the assembly coding, as they don’t need algorithm knowledge. This makes it easier to find people for the project, lowering costs. Adding people to the project is less of an issue, as they can get up to speed and be productive quickly.

4. Any bugs are exposed early (while unit testing), rather than late (after integration). You get an honest measure of progress, rather than nasty surprises late in the project.

Where it fails:

1. It takes a lot of discipline to perform step (1) CAREFULLY. There is a lot of pressure from managers (and engineers) to jump onto the sexy real time/hardware integration stage. If you are part way through (1) and there is a milestone due, it is very tempting to “declare victory” and move on to the next stage, WAY before you are ready. The results are predictable…

2. It takes discipline to make the assembler bit exact. I mean, look, it’s nearly the same, just a few bits are different? The problem is these little errors may lead to much bigger flaws in performance later on. Try changing a single bit in your operating system binary image and see how well it runs. In some (most) cases bit-exactness matters!

3. It is honest. It gives you a black and white go/no-go indication at various stages of the project. For example the fixed point C stage is not complete until it has shown perfect performance against the spec. This reduces the “room to move” managers have, you can’t declare a phase “complete” just to suit a milestone date. It really has to work. This sort of honesty doesn’t sit well with some managers, for example they may have to admit problems to a customer early on in a project.

Case Study – Non Bit Exact

Now here is what typically happens. Our hero is under pressure from his manager to progress, so he short circuits the fixed point C simulation stage and codes straight into assembler. The C simulation and bit exact assembler have results that are “close” but not really the same. No effort is made to synchronise the results, there isn’t enough time!

Time is also short so he can’t “waste time” on unit testing, there is a milestone with a big fat cash payment attached.

The assembly code is loaded into flash on some brand-new, untested hardware, power is applied, and some basic tests performed.

Damn. It doesn’t work.

So our hero and his team start trying to debug the real time assembler code. However debugging in real time is very time consuming. The best brains in the company spend months tracking down bugs that end up being simple one line fixes. Bugs constantly interact with each other, leading the team off in false directions.

He doesn’t go back to the C simulation as it now gives different results to the assembler and “we don’t have enough time” to make them match.

As the project falls behind, there is no one who can be added, as only people who understand both the algorithms and the assembler can be used. The team is debugging leading edge algorithms in assembler.

A year passes. Senior management is fuming. They keep getting told “real soon now”. However no one really knows how long it will be, because no one really knows just how many bugs there are. Frustration and stress are very high all around.

Eventually the company runs out of funds and the project is cancelled. There are mass departures of some of the best and most capable people from the company. Our hero leaves to join another start up where he repeats the same process.

Case Study – Bit Exact

Our hero and some members of the brains trust spend many months carefully porting the algorithm to fixed point and testing the entire product in a robust simulation. This apparent lack of progress is tough for the managers to handle, but they understand the benefits and bravely support the team.

Lots of complex bugs are thrown up, and the team works through them one by one. They don't take too long to fix, because these tough problems are at least occurring in a PC environment, where it is easy to track them down.

In parallel, the fixed point C modules are passed to a team to perform the assembler coding. This team is made up of many junior engineers, who are instructed to make sure each module is bit exact with the reference C simulation. A few of them don't like this, but the more senior members keep them on track. The junior members do a good job, as all they need to focus on is the assembler, not the algorithm details.

A few bugs slip through unit testing to integration. A senior member who is a veteran of previous “death march” DSP projects is assigned to fix one of them. She discovers that the problem was due to an exceptional case not covered in the unit test, and sure enough the assembler module wasn't bit-exact for that case. She had previously been a critic of the extra effort required for bit exactness, but the speed with which this bug was found and fixed impresses her.

Early in the project some unforeseen problems are discovered during the work on the fixed point C simulation. Some parts are running slower than expected which means more assembler may be required in the final product. This throws up an early indication that the project is at risk of falling behind.

This critical management information is flagged early, while there was time to do something about it. It is decided to add some extra engineers to the assembler side. As soon as they learn the DSP instruction set they quickly come up to speed. They don’t need to know or understand the algorithm details.

The schedule still slips by a few weeks, but all the stakeholders understand that this is a better outcome than delivering a lemon on time. No further slips occur, and the project manager earns a great deal of trust and respect by delivering “bad news early” rather than trying to hit arbitrary milestone dates and risking the entire project.

When the real-time DSP software is integrated with the hardware there are still a few bugs to find. However by this stage the real time software is at a very high quality level so the bug count is low. Integration actually proceeds on schedule, amazing everybody, especially veterans of past projects.

The product is a success, the start up has a great IPO, and the pale skinny guy retires at 28 years old.

Complex Example

Here is a rather more complex example of fixed point C code from the Speex project:
static inline spx_word32_t cheb_poly_eva(
    spx_word16_t *coef,
    spx_word16_t x,
    int m,
    char *stack
)
{
    int i;
    spx_word16_t b0, b1;
    spx_word32_t sum;

    /* Prevents overflows */
    if (x > 16383)
        x = 16383;
    if (x < -16383)
        x = -16383;

    /* Initialise values */
    b1 = 16384;
    b0 = x;

    /* Evaluate Chebyshev series formulation
       using an iterative approach */
    sum = ADD32(EXTEND32(coef[m]),
                EXTEND32(MULT16_16_P14(coef[m-1], x)));
    for (i = 2; i <= m; i++)
    {
        spx_word16_t tmp = b0;
        b0 = SUB16(MULT16_16_Q13(x, b0), b1);
        b1 = tmp;
        sum = ADD32(sum,
                    EXTEND32(MULT16_16_P14(coef[m-i], b0)));
    }

    return sum;
}

In English (sort of), this function works out the value of a polynomial at various points around a circle. Don't ask me why – DSP engineers just like doing things like that. For more information/punishment see the file lsp.c from the Speex project.

Macros like ADD32() simulate the instruction set of the DSP chip in the simulation code, which makes porting this function to actual DSP assembler a simpler step. However, they also make the code much harder to read.
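To give a feel for the style, here is roughly what two of these macros expand to in the portable C build – this is from my recollection of Speex's fixed_generic.h, so check the real headers before relying on it:

/* 16x16 multiply producing a 32 bit result */
#define MULT16_16(a,b) ((spx_word32_t)(spx_word16_t)(a) * (spx_word32_t)(spx_word16_t)(b))

/* multiply, add a rounding constant, then shift down 14 places */
#define MULT16_16_P14(a,b) ((MULT16_16((a),(b)) + 8192) >> 14)

On the DSP each of these maps to one or two native instructions; in the simulation they are plain C, so the same code runs (slowly, but bit exactly) on a PC.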

Renumbering components using Perl

I am designing a telephony version of the BlackfinOne DSP/uClinux board, which runs uClinux on a Blackfin BF532 processor. To develop the design I am using the gEDA CAD tools; the original board was also designed using the open source gEDA tools, which makes it really easy for anyone to re-use the design. Thank you very much to Dimitar Penev and Ivan Danov, who designed and published this board. They have done an excellent job.

The first stage in any electronic design is usually to enter the schematic. The schematic shows the components and their wiring in a form that is easy to follow.

I wanted to add two Ethernet ports to the BlackfinOne design, so I found a reference design and entered the schematic using the gschem program.

After I entered some sheets for the Ethernet chips, I had a little problem. In the schematic, every component has a letter & number next to it, for example C1, C2 for capacitors, or U1, U2 etc for chips. This is a unique ID for each part; the letter generally indicates what sort of part it is (capacitor, resistor, chip etc). These IDs are known as reference designators, or a “refdes”.

Now I wanted the refdes on my new Ethernet sheets to merge nicely with those on the existing BlackfinOne design. For example, if the last capacitor refdes on the BlackfinOne design stopped at C51, then my C's should start at C52.

There is a program to do this (refdes_renum described here in the gEDA tutorial) but unfortunately it renumbers all of the refdes in the design, and I only wanted to renumber the new parts that I had added. I didn’t want to mess up anything in the existing design. Please also see the Links section below for a new alternative – grenum.

Now of course it is possible to number the components manually in gschem via the GUI, but if you know a little Perl then it's much more fun to hack out a solution! Fortunately gschem uses text files to store the schematic. The refdes are stored in the schematic file like this:
--snip--
C 7000 87800 1 0 0 bf-93LC46.sym
{
T 8200 89100 5 10 1 1 0 0 1
refdes=U19
}
--snip--

I think this means that the component (the leading C on the first line) drawn using the bf-93LC46.sym symbol file has a refdes of U19. The line starting with T tells gschem where to place the refdes text on the sheet. Anyway, all I am really interested in is reading and changing the refdes lines.

So the Perl script num_refdes.pl first scans all of the existing schematic files for lines that have the text /refdes=/. The number of the refdes is then extracted, and stored in a hash if it is the largest refdes so far. On the new schematics, I initially enter each refdes as C?, U? etc. The script scans these, and wherever it finds a ? it replaces it with the next refdes in that series.

Improvements and Reflections

There is one obvious improvement: the current script isn't smart enough to use a refdes that hasn't been used yet. For example, if C1, C2 and C4 are currently used, it isn't smart enough to use C3 – it will use C5 instead.

This script took me about 70 minutes to write and it worked perfectly the first time. To be honest it didn't really save me any time (I could have entered the refdes manually in less time) but it was lots more fun to do it this way. And I guess it will be useful the next time I have this problem. Perhaps now that I have posted on it someone else might also find it useful. Come to think of it, writing this blog post took longer than writing the script! DOH!

BTW every time I work with Perl (which is maybe once every two months) I spend the first 20 minutes looking up web sites to work out which regexp to use, how to iterate through a hash, etc. If I don't work with Perl and regexps constantly, it all looks like line noise to me after a few weeks:
if ($line =~ /refdes=(\w)(\?)/) {

But hey, it's no big deal to refresh my memory every few months.

Solving little problems like this with Perl is fun. You almost hope there isn’t a solution out there already, just so you have an excuse to write one!

Links

I have also written a few Perl scripts for PCB (the PCB design application in the gEDA suite) here.

After I posted the above, Levente Kovacs kindly pointed out his grenum utility, which has recently been added to the official gEDA distribution. It performs the same job as the Perl script above, but can also spot gaps in refdes sequences, so I would recommend using it rather than num_refdes.pl. I share a trait common among engineers – a strong desire to “do it myself” rather than checking carefully for existing solutions! Thanks Levente!

Using Make to build hardware

I have been working on some telephony hardware using the gEDA open source CAD software. Now designing Printed Circuit Boards (PCBs) is fun, but very GUI-centric. Click-Click-Click with the mouse all day. After a few frustrating experiences (OK I messed up the design a few times) I thought “there must be a better way” and “why can’t I use the command line and Make for this?”

Lucky for me the gEDA software uses text files to store the PCB layout, so as an experiment I wrote some Perl scripts to handle some boring repetitive tasks. The bottom line is I am using Make to assemble a Printed Circuit Board design from smaller “modules” of PCB – just like software. I have written about it (with lots of fancy screen shots) here.

How to make your Blackfin fly Part 2

This article describes how to write fast DSP code for your Blackfin.

For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).

We built on this work, reducing the complexity for the encode operation from about 40 MIPs to 23 MIPs. On a typical 500MHz Blackfin this means you can now run (500/23) = 21 Speex encoders in real time.

The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.

We obtained gains from:

1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.

2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.

3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.

4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.

Why The Blackfin is Different

Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.

The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces system costs over designs that need a separate DSP and microcontroller.

The disadvantage for crusty old DSP programmers like me is that things don't always run as fast as you would like them to; for example, if your precious DSP code doesn't happen to be in cache when it is called then you get hit with a big performance penalty.

Some examples

To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.

The cycles program shows how to optimise a simple dot product routine, I have previously blogged on this here.

A Simple Library for Profiling

To work out where to optimise Speex I developed a simple library to help profile the code. It works like this: you include the samcycles.h header file and insert macros:
SAMCYCLES("start");

for(i=0; i<10; i++);

SAMCYCLES("end");

around the functions you wish to profile. Then, when you run the program it dumps the number of cycles executed between each macro:
root:/var/tmp> ./test_samcycles
start, 0
end, 503
TOTAL, 503

Which shows that between “start” and “end” 503 cycles were executed. Here is a more complex output from the “dark interior” of the Speex algorithm:
root:/var/tmp> ./testenc male.wav male.out
start nb_encode, 0
move, 1352
autoc, 16149
lpc, 3180
lpc_to_lsp, 21739
whole frame analysis, 17797

Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others. So those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of “inner loop” operations that are hogging all of the CPU.
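I won't list samcycles.h here, but the core idea is simple: read the Blackfin CYCLES register at each sample point and print the delta since the last one. Below is my guess at how such a macro might be built – a sketch, not the library's actual code (the real library also tracks the running TOTAL):

#include <stdio.h>

static unsigned int samcycles_last;

#define SAMCYCLES(label) do { \
    unsigned int now; \
    /* read the Blackfin cycle counter */ \
    __asm__ __volatile__ ("%0 = CYCLES;" : "=d" (now)); \
    printf("%s, %u\n", (label), now - samcycles_last); \
    samcycles_last = now; \
} while (0)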

Care and Feeding of your Blackfin Cache

One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
"P0 = %2;\n\t"
"R0 = 0;\n\t"
"%0 = CYCLES;\n\t"
"LOOP dot%= LC0 = %3;\n\t"
"LOOP_BEGIN dot%=;\n\t"
"W[P0++] = R0;\n\t"
"LOOP_END dot%=;\n\t"

This also illustrates why DSPs are so good at number crunching – that inner “W[P0++] = R0” instruction executes in one cycle, and the hardware loop means zero cycles of loop overhead. Try doing that on your Pentium.

However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
686 542 597 542 542 542 542 597 542 542

The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instruction code and data gets placed into cache.

However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, and it should be executing in a single cycle. My 500 MHz DSP is performing a like a 100 MHz DSP. I think I am going to cry.

The reason is that in “write through” mode every write must flow through the “narrow pipe” that connects to the Blackfins external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed.

This is not good news for a DSP programmer, where you often have lots of vectors that need to get written to memory. Very quickly.

There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).

Another less exciting way is to enable “write back” cache (a kernel configuration option):
root:/var/tmp> ./writethru
Test 1: Write 100 16-bit shorts
119 102 102 102 102 102 102 102 102 102

Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles as expected. Note that the first run takes a little longer, this is probably because the program code had to be loaded into the instruction cache. In “write back” cache the values get stored in fast cache until the cache-line is flushed to external memory some time later.

On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.

In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…..

Internal Memory

The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.

The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.

In the mean time I have written a kernel mode driver l1 alloc that allows user-mode programs to access internal memory:
/* try alloc-ing and freeing some memory */

pa = (void*)l1_data_A_sram_alloc(0x4);
printf("pa = 0xx\n",(int)pa);
ret = l1_data_A_sram_free((unsigned int)pa);

which produces the output:
root:~> /var/tmp/test_l1_alloc
pa = 0xff803e30

i.e. a chunk of memory with an address in internal memory bank A.

To see the effect of internal versus cache/external memory:
Test 1: data in external memory
ret = 100: run time: 173 103 103 103 103 103 103 103 103 103
Test 2: data in internal memory
ret = 100: run time: 103 103 103 103 103 103 103 103 103 103

After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.

I should mention that to make this driver work I needed to add a an entry to my
uClinux-dist/vendors/AnalogDevices/BF537-STAMP/device_table.txt file:
/dev/l1alloc c 664 0 0 254 0 0 0 -

then rebuild Linux as for some reason I couldn’t get mknod to work. Then:
root:~> ls /dev/l1alloc -l
crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc
root:~> cp var/tmp/l1_alloc_k.ko .
root:~> insmod l1_alloc_k.ko
Using l1_alloc_k.ko
root:~> /var/tmp/test_l1_alloc

Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.

Speex Benchmarks

Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:

The terms ext/int memory refers to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20ms frame, averaged over a 6 second sample (male.wav).

(1) Write through cache, ext memory: 564
(2) Write through cache, int memory: 455
(3) Write back cache , ext memory: 465
(4) Write back cache , int memory: 438

So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain.

Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.

To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
svn co http://svn.xiph.org/trunk/speex

Or you can download a current snapshot from here.

movies sapphic contentfree adult trailers moviemovies free voyeursapphic petite moviesdancing dirty moviefree clips movie ebony pornmovie pornosex movie hidden disneyfucking granny moviesmovie the blowbarrington alfreds of island rhodetheme g ali ringtonesringtones 3ringtones 311ringtones 99missed call ringtone 1mobile ringtones absolutely freeac ringtone dc Mapviagra pills 100mg priceviagra acapulcoviagra achetertramadol $79 180200 cod overnight tramadolper hcl tramadol acetaminophentramadol all aboutversus ambien xanax Mapringtone acdc midi12 ringtone adamringtones australia fair advancesecond ringtone 203310 ringtone composebalut abs cbn ringtone64 poliphony ringtones1000 words ringtone Mapporn 69 position69ing porngrade 6th porn7 penis porn inchporno 70mivies 70s pornporn pictures 70sporn s fan 80 Map

GSM Port for the Blackfin

For my uCasterisk project I needed a couple of optimised codecs for the Blackfin. This post discusses the steps taken to port GSM to the Blackfin.

The GSM codec for the Blackfin can be downloaded here.

Usage

1/ To make:
make

2/ To test:

Download tgsm (test program produced by make) to your target and also download a source speech file like:

to your Blackfin hardware and type:
root:/var/tmp> ./tgsm male.wav male.out
TOTAL, 0
SNR = 10.4591 dB enc 114 dec 39 k cycles/frame
root:/var/tmp>

When it runs it prints out the number of cycles it took to execute each 20ms encode and decode frame.

You can then upload the output file (male.out) to your host and listen to it. On my Linux box I use “play male.sw”, the sw lets “play” recognise it as a 16-bit signed-word file.

Optimisation

I spent a day or so optimising the code, for example:

a) I wrote Blackfin versions of the macros in gsm/inc/private.h

b) Applied the profiling macros SAMCYCLES and worked out which parts of the code needed the most optimisation.

c) I looked at the assembler output of various functions (gcc -S or -save-temps options) and modified the C code for better output, such as using the hardware loop supported by gcc 4.1. A lot of the original GSM code was written for older x86 compilers, and lots of compiler-specific mods were evident. In many cases to speed up code I just went back to vanilla C and the Blackfin compiler did a better job!

e) By inspecting the assembler I found some important routines were making function calls inside their inner loops which is very inefficient. These were modified to remove the function calls.

f) Use some assembler in the tightest, most cycle-hungry loops.

Performance

Using gcc 4.1 and testing on a Blackfin STAMP BF533 board:
encode: 114,000 cycles/fr: (114,000/0.02s) = 5.7 MIPs
decode: 39,000 cycles/fr: (39,000/0.02s) = 1.95 MIPs

The initial number of cycles per encode was 274,000, decode 82,000.

Further Work

My gut feel is it might be possible to reduce the total (encode plus decode) cycles by perhaps another 30% with further optimisation.

a) The analysis and synthesis filter functions consume about 50,000 cycles per encode/decode cycle, they could be converted to assembler.

b) The RPE algorithm (rpe.c) could be optimised.

c) Blackfin internal memory might speed some operations, such as autocorrelation.

How To Profile

I have written a set of macros (samcycles.h) to sample the Blackfin cycles counter. Here is an example on how to use them:

a) Patch code.c:
patch -p0 < code_profile.patch

b) make, download tgsm and re-run on the target:
root:/var/tmp> ./tgsm male.wav male.out
start Gsm_Coder, 0
Gsm_Preprocess, 5312
Gsm_LPC_Analysis, 11406
Gsm_Short_Term_Analysis_Filter, 23483
Gsm_Long_Term_Predictor, 11525
Gsm_RPE_Encoding, 8308
Gsm_Long_Term_Predictor, 10947
Gsm_RPE_Encoding, 5411
Gsm_Long_Term_Predictor, 10701
Gsm_RPE_Encoding, 5422
Gsm_Long_Term_Predictor, 10696
Gsm_RPE_Encoding, 5409
end Gsm_Coder, 521
TOTAL, 109141
SNR = 10.4591 dB enc 115 dec 39 k cycles/frame
root:/var/tmp>

c) To investigate further, just add more SAMCYCLES() macros. Its a good idea to remove or disable the macros when you are finished, as they use a few thousand cycles:
patch -R -p0 < code_profile.patch

Thanks

To Jean-Marc Valin and the Speex project, I used some of their assembler code (see COPYING.xiph for the copyright message related to this code).
loans $100loan state alaskapercent 125 home equity loanloan home alabama mobilefast 10,000 loanequity accelerated loansfraud advance fee loana loan paper c b Mapporn review 1280×72013 girl porn14 lesbian pornporn year old 14vfere min 15 pornpreviews 15 porn minclips minute porn 1515 minutes porn Mapagreement laptop loanlaserpro origination loan systemva loans about learnlegal templates document loan personalrepayment loan educational for legislationlenders in loan bad wi makingof loans lenght used rvlibrary loan Map

How to make your Blackfin fly Part 1

The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux and has a great open source community and there are even open (free) hardware designs available.

I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

Boring Mathematical Bit

As a first step I have written some test program called cycles.c that demonstrates how to optimise the Blackfin for DSP operations. A tar-ball including a Makefile is here.

The sample code just finds the dot product of two vectors:
int dot(short *x, short *y, int len)
{
int i,dot;

dot = 0;
for(i=0; i<len; i )
dot = x[i] * y[i];

return dot;
}

It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually thats all a DSP really is, a processor designed to compute dot-products quickly.

The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 MACs.

Down to Business

Enough talk, here is a run of the sample code from my BF537 STAMP:

root:/var/tmp> ./cycles
Theoretical best case is N/2 = 50 cycles
Test 1: Vanilla C
ret = 100: run time:
3838 3507 3373 3408 3373 3373 3373 3373 3373 3373
Test 2: data in external memory, outboard cycles function
ret = 100: run time:
442 240 239 218 218 218 218 218 218 218
Test 3: data in external memory, inboard cycles
ret = 100: run time:
242 103 103 103 103 103 103 103 103 103
Test 4: data in internal memory, inboard cycles
ret = 100: run time:
214 53 53 53 53 53 53 53 53 53

A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.

Each test runs 10 times, in several of the tests you can see the number of cycles reducing as the instruction and data cache gets loaded over successive runs.

The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:
int cycles() {
int ret;

__asm__ __volatile__
(
"%0 = CYCLES;\n\t"
: "=&d" (ret)
:
: "R1"
);

return ret;
}

Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles, Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.

For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 shows how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM, after a run they get to L1 cache, but they are still in the same bank of physical memory, hence a 100% speed penalty.

Allocating Internal Memory

At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:
/* I know, I know - this is very naughty :-) */
short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */
short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */

I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?

BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.

Summary

Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 80 (3838/50 between test1 and test4). However with a little optimisation, and some hand coded assembler, it is possible to get full performance from the chip.

I know coding hand-optimising assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marcs samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.

Acknowledgements

Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex, all that optimised fixed point DSP code makes my head spin!movies sex adultfucking movie black clipsmovie free boobs bouncingmovie euro triplinks adult free moviefree movie fuck sampleserotic free japanese moviemovies porn homemademasterbation moviesfree movies porn

Measuring Stack Usage in Multi-threaded uClinux Apps

In regular Linux the MMU allows the stack to grow dynamically, the MMU just allocates more physical pages. However in uClinux, the correct amount of stack for each thread must be allocated before the thread is created.

Too little stack and your program will corrupt the system in nasty, unpredictable ways. Thread stack gets malloced from the system heap, so an overflow means writes to an arbitrary address just outside the block of memory allocated to the thread. This memory could possibly be in use by other parts of the system, perhaps for a different application. These sorts of bugs can be very difficult to track down.

If you allocate too much stack, then you are wasting memory, a valuable resource on embedded systems. For example I discovered I was allocating far too much stack and wasting Mbytes of memory, especially when multiple threads were running.

The standard approach is to try random values of stack until you find one that works. However I thought it might be a better idea to actually measure the amount of stack used by each thread. Then I could tweak the stack allocation to optimise memory usage and even check for stack overflows at run time.

Threadstack Library

I have written a small library (called threadstack):
unsigned int threadstack_free(pthread_t *thread);
unsigned int threadstack_used(pthread_t *thread);

Here is a sample run on my Blackfin BF537 STAMP, when a 100k stack was allocated to a thread:
root:/var/tmp> ./test_threadstack
stack used: 1300
stack free: 100620
root:/var/tmp>

How it Works

The functions work by examining memory allocated to the stack. The theory is that if the memory is non-zero, then it must have been used by the thread at some time (the entire block of memory used for the stack is initially set to 0 before the thread starts). So the routines search the stack memory for the first non-zero value, and that is declared the “high water mark” – the point where the stack reached it’s maximum.
0xff <- stack top
0xff
0xfe <- high water mark
0x00
0x00
0x00 <- stack bottom

The high water mark will change over time, so after your thread has been running for a while is the best time to measure stack usage.

One weakness with this approach is that if stack allocation is way too low your program may bomb before these routines get a chance to run. However in that case you will at least know there is a problem, and can increase stack to some high number (e.g. Mbytes) to get the program running, before using these functions to determine actual stack requirements.

Usage

In my uClinux Asterisk port I have added code to check for stack overflow just before a thread ends:
pthread_t thread = pthread_self();
assert(threadstack_free(&thread) > 10*1024);

This code checks that while the thread was running, the minimum free stack was 10k. The assert will kill the program with an error message and tell me straight away I need more stack. Much nicer than getting an obscure bug in the system due to a stack overflow on a thread. Now the program finds stack overflow bugs for me!

This example above runs from within the actual thread itself, hence the call to pthread_self() to discover the threads handle. You can also call the functions from another thread (e.g. the main thread), for example to periodically meter stack usage.

Links

More information on multi-threaded applications for uClinux
movies hardcoremovies ebony girls buttclips sybian moviemovies homemade sexmovie stars nudesamples movie hot xxxmovies free handjobmovies anal free Mapstar porn blonde galleryporn gallereis thumnail blondeporn videos blondepornstar blonde pussyporno teen blondepornstar blonde videoporn blondesblondes pornstars Mapporn cliphunterclipmaster pornfree clips pornclips anal porn ofof clips porn girlclips porngratuit porno clipsclipsporn Mapringtones boltblueboombastic ringtone freebreakfast club ringtonesringtone brewerspride brown ringtonescoupon ringtone cingluarcingular phones ringtones recordsurvive circa ringtones Map

Porting multi-threaded apps to uClinux

I have recently been working on improving the stability of uCasterisk, a port of Asterisk to uClinux. This required some research into memory management for multi-threaded apps on uClinux. I didn’t find any one resource that had everything I needed to know so I thought I would collate some of the information I found here as a resource for others. Thanks to all those (especially on the Blackfin forums) who helped answer my questions.

I am using the Blackfin flavour of uClinux and the uCasterisk application as an example, but this information should apply equally to other uClinux systems/applications.

MMU versus no-MMU

Asterisk is a pretty big application for uClinux, the executable is about 2.5M and when running several calls can consume 32M of system memory. The big difference between uCasterisk and other Asterisk implementations is the lack of MMU. A MMU is handy when working with large, multi-threaded apps. For example when a thread is kicked off you can allocate a virtual stack of say 2M, but physical memory will only be allocated as and when it is actually required (say due to a write to a previously unused part of the stack). If your thread never uses all of the stack, then the physical memory is available for other users.

On a MMU-less system you need to work out the maximum stack your thread may need, and allocate that. If you get it wrong, your application (and possibly the whole system) will bomb. This generally means you are wasting memory compared to the MMU case, as you always need to allocate the worst case amount of memory required.

One possible advantage of MMU-less systems is no nasty surprises – any memory allocated really does exist, and no over-commitment is possible. On a MMU-based system physical memory isn’t actually allocated until you write to it, and it may be paged to disk just when you need it (although I understand there are options to control this behaviour).

Stacks for Threads

When you start an app, you get allocated a stack for the application. This is actually a stack for the main thread of the application. When you start a new thread (say with pthread_create()) the thread gets allocated a new stack from the system heap. The two stacks are completely unrelated. The size of each stack is independent, you control the size in different ways (see below).

Tips for Porting to uClinux

Don’t enable stack checking. This feature is very useful for single-threaded apps; it causes the operating system to kill the app when it uses all of it’s stack space. Very useful, as it tells you straight away to increase the stack size. Unfortunately at present this feature hasn’t been extended to multi-thread applications; using it with multi-threaded apps (at least on the Blackfin) causes problems as pointed out in the 2005R4 RC2 release notes and discussed here.

You control the application (main thread) stack with the -s option, on my Blackfin system the command line is:

bfin-uclinux-gcc -Wl,-elf2flt='-s 1000000' \
-o thread thread.c -pthread

In this example the stack is set to 1000000 bytes.

You control the size of the stack for each thread you create using pthread_attr_setstacksize(), for example (from the Asterisk utils.c file):

pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 0x1000000);
pthread_create(&thread, &attr, thread_func, NULL);

Monitoring Memory Usage

cat /proc/meminfo can be very useful, here is the output from my Blackfin STAMP BF533 board, taken while uCasterisk was running with several SIP calls in progress:

root:/var/log/asterisk> cat /proc/meminfo
MemTotal: 59784 kB
MemFree: 11084 kB
Buffers: 100 kB
Cached: 4172 kB
SwapCached: 0 kB
Active: 3828 kB
Inactive: 444 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 59784 kB
LowFree: 11084 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 4 kB
Writeback: 0 kB
Mapped: 0 kB
Slab: 43744 kB
CommitLimit: 29892 kB
Committed_AS: 0 kB
PageTables: 0 kB
VmallocTotal: 0 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB

The most important fields are MemFree (total system memory free) and Slab (system wide heap in use).

In earlier versions of Linux the CommitLimit field indicated the maximum Slab was allowed to reach before processes were killed (with Out-Of-Memory errors). However on my distro I discovered by experiment that you can actually increase the Slab well beyond this limit, as indicated above. Looking at the kernel source file uClinux-dist/linux-2.6.x/mm/nommu.c, __vm_enough_memory() function it appears that the memory allocator uses the OVERCOMMIT_GUESS method, which ignores the CommitLimit and allows up to 97% of memory to be allocated.

It is interesting to observe MemFree as you perform different operations. For example on uCasterisk when a new SIP call starts, a thread is created, which requires stack and heap space. I also noticed MemFree decreasing when I copied files on a ram file system – this caught me for a while as uCasterisk was chewing through available system memory writing Call Data Records to the ram disk and eventually causing Out of Memory errors.

ps an top are also useful, as they indicate the amount of memory allocated to the system/application.

Links

CommitLimit and OOM Killer
Why malloc is different under uClinux
Application Debugging on the Blackfin
Intro to Linux Apps on the Blackfin (skip to bottom of page)
Blackfin forum thread where I asked some questions on this topic

Summary

I hope this was useful – pls email me or add a comment below if you have any comments/suggestions/corrections.
payday loan 6 8 australiapayday loan 8 day pay loanprocessor account mortgage loan manager processorachieve and loans studentadult loan site personals personalscredit loan secured unsecured adverse onlineloan direct student aid federal moneydirect aid financial loan student Mapringtones lg freelg ringtones howtoringtones lg4650wow basketball ringtones bow lilof ringtones funny listlocomotive ringtonesgood long friday ringtoneslow ringtones rider Mapstarfire hentaisexy secretarieshentai shemaleproposal xxxporn trailerswrestling nudepussy catanime pussy Map