Building a BlackfinOne

Over the past week I have been assembling, debugging and testing my own BlackfinOne board:

During this time I have had a lot of help from the BlackfinOne community, in particular via ongoing chat sessions with Wojtek (pronounced Voycheck) Tryc. Yesterday, I booted uClinux on hardware that I had helped design, soldered and debugged myself. A tremendously satisfying experience.

Building hardware is a bit like software – you go quickly through stages of construction; get stuck for a day or so on a tough bug; then rapidly scream through construction again until you hit the next bug. By documenting some of my experiences I hope I can help other people with similar projects.

The BlackfinOne

The BlackfinOne is a uClinux board based on the Blackfin DSP/RISC processor. There are several features of the BlackfinOne project that make it a great project for hackers:

  1. The design is open as in GPL. Free as in freedom/libre. Anyone can use, re-use, and modify the design, just like open source software.
  2. The BlackfinOne has been designed using the gEDA open source CAD tools. This means the design can be edited without needing expensive proprietary CAD software.
  3. Quite remarkably, it is implemented (and runs well) on a two layer PCB. To run a high speed digital design (it has a 400MHz clock and 133MHz bus) on a two layer board is generally regarded as impossible. Standard engineering practice says that you must have ground plane layers to ensure signal integrity and to make routing the PCB possible. For example the Blackfin STAMP design uses 8 layers. The trick that the original designers (Ivan Danov and Dimitar Penev) used was to place the high speed SDRAM immediately behind the Blackfin chip, on the opposite side of the PCB. This minimises the length of the nets, ensuring signal integrity despite the absence of a continuous ground plane. Using two layers makes low quantity fabrication of the PCB cheap and easy for hackers compared to a multilayer PCB.
  4. A non-BGA version of the Blackfin was used to make soldering easy for hackers.
  5. It is based around the Blackfin community (sponsored by Analog Devices, the makers of the Blackfin chip) which has very good uClinux support and a range of open hardware building blocks (such as USB and other interfaces) for the Blackfin.

The initial BlackfinOne V1.0 design (by Ivan and Dimitar) first appeared in mid 2006. This design had SDRAM, Flash, and serial support. I was interested in using it as a DSP motherboard for my telephony work so I (along with several other people) helped add dual Ethernet ports and a USB port to create the BlackfinOne V2.0. Many other people (both commercial companies and private developers) have also contributed to the hardware and software side.

Several BlackfinOne V2.0 boards have now been made and brought to life by people all over the world. For the remainder of this post, when I say BlackfinOne (or bf1 for short) I mean BlackfinOne V2.0.

The schematics for the bf1 are useful to understand the rest of this post.

Building an Igloo JTAG cable

To bring up the BlackfinOne (for example to flash the boot loader) a JTAG cable is required. I built a copy of the Igloo JTAG cable using some spare parts and a spare PCB I had lying around. My Igloo is not a shining example of hardware construction:

By the way, I obtain the 3.3V power for the cable from the board being tested/flashed, rather than using the 5V to 3.3V regulator suggested in the Igloo design. A nicer version of the same Igloo circuit was constructed by Wojtek:

I tested my Igloo using a BF533 STAMP board. The idea was to test the unknown hardware (my JTAG cable) using known working hardware (the STAMP). To drive the cable I used a version of the jtagprog software that has been modified slightly for the BF532 processor and BlackfinOne platforms.

It is important to use the CVS version of the jtagprog software as this has been modified to work with the latest versions of the BF532 chip. If you use the non-CVS version you may get a “stepping error”, which means the software doesn’t recognise this version of the BF532 silicon.

OK, so I tested the cable on a BF533 STAMP board and it seemed to work OK:
jtag> cable parallel 0x378 IGLOO
Initializing Excelpoint IGLOO JTAG Cable on parallel port at 0x378
jtag> detect
IR length: 5
Chain length: 1
Device Id: 00110010011110100101000011001011
Manufacturer: Analog Devices
Part: BF533
Stepping: 3
Filename: /usr/local/share/jtag/analog/bf533/bf533
jtag> initbus bf533_stamp
jtag> detectflash 0x20000000

The last command “detectflash” gave me lots of information on the STAMP’s flash chip. As a further test I also used the cable to re-flash a BF533 STAMP with a new version of u-boot. This proved that despite the rough construction my cable was OK (thanks Wojtek for this suggestion).

Other JTAG cables can be used as well. For example, other developers have successfully used the JTAGblue cable, which emulates the very popular Xilinx Parallel Cable III.

For reference here is an example of a jtag session from a working BlackfinOne. However it took me a while to get that far, as described below!

Tools and Tips for Surface Mount Assembly

The BlackfinOne uses some fine pitch surface mount chips and resistors and capacitors in the 0603 size range. So you need to think a little bit about assembly.

To load my BlackfinOne I used a stereo microscope like the one in this picture (being used by my wife Rosemary, whose brief geeky phase is regrettably now over):

I bought mine new for about $700 but there are very nice ones available on eBay from about $100. A much cheaper alternative is an illuminated magnifier:

But trust me: a stereo microscope will change your (soldering) life once you buy one. It really is worth the investment. My eyes aren’t that great but I can work all day under the microscope with no problems or eyestrain. Mine is variable zoom 10 to 40x but I find even 10x is just a bit too much and I never use the variable zoom. So I suggest fixed magnification in the 5 to 7x range is just right; however, it’s a personal choice.

Wojtek bought this microscope for around US$100. He finds the fixed 15x magnification perfect for him:

If you can, get a microscope that has circular illumination all around the objective lens. Mine just illuminates from one direction which gives me shadows so I have to rotate the work a lot. For US$45 Wojtek found this great LED based circular light that clamps onto his microscope:

Here is a view through the microscope of one edge of the Blackfin chip. A sewing needle is in the picture for comparison.

Here is a similar image from Wojtek, through his microscope. You can just see how much fun we are having with the toys we bought for this project!

I found Darrell Harmon’s notes on surface mount soldering very useful. I bought the tweezers and flux that he recommends from Digikey and also strongly recommend them. The flux makes a very big difference in surface mount work; I have never really needed flux previously in through hole work. The flux really makes the solder “leap” onto the parts being soldered.

I also use and recommend the same procedure Darrell suggests for soldering 0603 parts: blob of solder on one pad, then slide the part in with the tweezers. Thanks Darrell.

Wojtek has also done some research on soldering irons, from a recent email:

There are two leaders: the Hakko 936 and the Metcal PS-800. The Metcal is very small, light (!) and reliable. The tips are hot in just a few seconds, and are available in a variety of sizes and temperatures. For most work I use 650 tips, which are designated for non-RoHS components. For ground planes and headers I use a 750 – 1.5mm tip, which makes it much easier (750 is the RoHS temperature). The Metcal does not have a dial to change the temperature; you change it by replacing tips (2 seconds, a pull and push operation). The Metcal is available from DigiKey at a very competitive price.

Flux Removal is Very Important

As Darrell notes, the Digikey flux is conductive, so you really need to get it all off ASAP after soldering or it will cause problems. For example I added some flux before soldering a 10M resistor (part of the bf1 clock circuit). After soldering, the resistance across the resistor measured only 340k! Some vigorous scrubbing with the flux remover and brush fixed it, and the 10M resistor once again measured as 10M. In one or two cases I have even had to remove parts as the flux was trapped underneath and not easily removed.

I use a spray can of solvent with a little brush attached to the nozzle for flux removal. Wojtek actually gave his BlackfinOne a bath (full immersion in water) with good results. Wojtek also uses this flux remover:

Both Wojtek and I have fixed many problems that were caused by flux left on the board. So Scrub-Scrub-Scrub that flux away!

From Wojtek:

It is critical to remove flux as soon as soldering is finished. Different methods can be used; solvents and a water bath both work great. But leaving flux on the PCB for hours or days could potentially affect a board in a very negative way (possibly even ruining it) – this stuff contains acid and is corrosive!

Initial Assembly and Smoke Test

I started with a bare PCB board, kindly supplied by Jeff Knighton:

I first loaded all the surface mount passives, then Blackfin chip, SDRAM, Flash and other chips. By loading the surface mount parts first the board was nice and flat and so easy to work with, especially under the microscope I use for surface mount parts. I loaded the large and bulky through hole passives last, as they prevent the board from sitting flat, hindering surface mount work a little.

My approach was to load various sections of the board progressively. For example I first started with the CPU and memory but left the Ethernet and USB ports unloaded for the initial tests. In hindsight I am not sure if this is a good idea, as it is easy to miss something this way (like a pull up resistor you didn’t think you needed at this stage of construction).

If you are loading a V2.0.0 PCB it is important to read and fix the BlackfinOne PCB Errors; this only takes about 10 minutes. These problems will be fixed on later PCBs.

After loading the CPU/memory section of the board here are the initial “smoke test” steps I went through:

  1. I checked my parts orientation against the reference photos of the BlackfinOne, for example I made sure that the writing on my chips faced the same way as the photos. Especially that damn expensive 64M SD-RAM chip.
  2. Before powering up I checked the resistance between the 3.3V rail and ground. About 300 ohms, which seemed OK. I would have been worried about a dead short.
  3. The next step was to apply power and sniff for that familiar smell of molten silicon. I used a variable power supply first with the current limit set to 100mA. Just in case I had done something stupid (a common occurrence). Mmm, no smell. Feel the top of all chips, no burnt fingers. Excellent. A better start than most projects.
  4. I measured the 3.3V rail to make sure the switching power supply came up OK. Check.
  5. Measured the 1.2V rail – this was present which proved that the Blackfin chip was doing something, as the Blackfin contains the switching power supply circuit for the 1.2V rail.
  6. With the scope I checked that the 10MHz system clock was present. Check, another sign of Blackfin life.
  7. Check reset line is H. Ooops, it’s stuck L. That means the system will be held in reset forever!

It took me a couple of hours to find the reset problem. The capacitor C53 that sets the duration of the reset pulse wasn’t fully charging due to some residual flux on the board. I gave the reset part of the board a good scrub with the solvent and reset started working OK.

Programming the CPLD was fairly straightforward: I used the Verilog file bf1_cpld.vl from the hw-0.1 CVS repository. This needed to be renamed to bf1_cpld.v for my Xilinx synthesis tools to understand it. Don’t forget to include the pin locking file bf1_cpld.ucf when you synthesise the design. Free (but not open) synthesis tools are available from Xilinx for Windows or Linux. I used an ancient Windows version and a serial JTAG cable to program the CPLD.

To save time, you are welcome to use my xlinx_bf1.jed file, just program the CPLD using your Xilinx JTAG cable.

Flash Problems

The next step was to connect the JTAG cable to my BF532 and try to flash the board. This is a “nervous” step – it’s the first time you actually get to see if the Blackfin chip is responding via the JTAG port. It’s also where you find out if you have soldered any memory chips on backwards.

The jtag tools detected the BF532 OK:
jtag> detect
IR length: 5
Chain length: 1
Device Id: 01010010011110100101000011001011
Manufacturer: Analog Devices
Part: BF533
Stepping: 5
Filename: /usr/local/share/jtag/analog/bf533/bf533

Note that the BF532 is detected as a BF533. This doesn’t seem to matter in practice.
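The Device Id printed by “detect” can be decoded by hand, which shows why the two chips look alike to the tools. The sketch below assumes the standard IEEE 1149.1 IDCODE layout (4-bit version, 16-bit part number, 11-bit manufacturer code, and a trailing 1 bit). The struct and function names are made up for illustration and are not part of the jtag tools.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Fields of a 32-bit JTAG IDCODE, assuming the IEEE 1149.1 layout. */
typedef struct {
    unsigned version;       /* bits 31..28, printed by jtag as "Stepping" */
    unsigned part;          /* bits 27..12 */
    unsigned manufacturer;  /* bits 11..1 */
} idcode_fields;

/* Convert the 32-character binary string printed by "detect" to a word. */
uint32_t idcode_from_bits(const char *bits)
{
    uint32_t word = 0;
    size_t i;

    assert(bits != NULL);
    for (i = 0; i < 32 && bits[i] != '\0'; i++)
        word = (word << 1) | (uint32_t)(bits[i] == '1');
    return word;
}

/* Split the IDCODE word into its three fields. */
idcode_fields idcode_decode(uint32_t word)
{
    idcode_fields f;

    f.version = (word >> 28) & 0xFu;
    f.part = (word >> 12) & 0xFFFFu;
    f.manufacturer = (word >> 1) & 0x7FFu;
    return f;
}
```

Decoding the two Device Id strings from the sessions above gives part number 0x27A5 and manufacturer code 0x065 in both cases. Only the version field differs (3 on the STAMP, 5 on the bf1), which matches the Stepping lines.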

However when I tried the next few steps:
jtag> initbus bf532_bf1
jtag> detectflash 0x20000000

It couldn’t find the flash chip.

My first suspicion was the CPLD: maybe I had messed up something like the pin locking, and the control signals like AOE and FLASH_CS weren’t getting through to the flash chip. So I checked the control signals by watching them on the scope while I tried to read words from the flash:
jtag> peek 0x20000000
The signals were asserted when I did the read which suggested the CPLD was OK.

I then checked out the flash data sheet to work out how it should be responding. The flash chips have state machines built in, you can send commands to them and they respond in certain ways. One of the commands can tell you all about the flash chip. This is what JTAG tries to send when you run the “detectflash” command.

You can generate these commands and read the responses from the flash chip using the jtag peek/poke commands. For example on the BF533:
jtag> poke 0x20000055 0x98
jtag> peek 0x20000020
bus_read(0x20000020) = 0x00000051 (81)

Writing 0x98 to 0x55 on the chip (mapped to 0x20000055 on the Blackfin) puts the chip into “common flash interface” mode. In this mode, address 0x20 should always return 0x51 (the letter Q in ASCII). This is a good way to test if the Blackfin is talking to the Flash chip OK.

The same flash command/response on the bf1 showed just 0xff responses, which suggested either the command wasn’t getting through, or something was wrong with reading the response. I had already checked the control signals, so the next step was to check the address and data bus. Damn. This meant 32 nets to manually check.

To check the electrical connections of each net I used two methods:

  1. Under the microscope, I applied gentle pressure to each pin with a sewing needle. If the pin moved when I applied pressure, it would indicate a bad solder joint. This check didn’t find any problems.
  2. I then tested the continuity of every solder joint on the major chips using my multimeter. I placed one probe on the top of each pin, and the other on the pad it was soldered to. This is tricky to do on fine pitched pins, so I improvised some fine tipped probes (see below)!

While I was checking some pins on the SDRAM I accidentally touched two adjacent pins (pins 1 and 2) and my multimeter went “beep”, indicating a short. Hmmm, that doesn’t look right. Turns out that the D0 net was shorted to VCC, which was why the flash chip wasn’t responding. Any commands were being mangled as D0 was stuck H.

I spent an hour trying to work out why. It could be a soldering error or a PCB fault. However the track in question lives mainly under the Blackfin and SDRAM so there was no way to check it visually without removing these chips, something I don’t have the tools for.

I used my multimeter to measure the resistance of the short. It was 0.4 ohms between VCC and D0 near the SDRAM, but 0.8 ohms between VCC and D0 on the flash chip. This suggests the short was nearer the SDRAM, right where I couldn’t get to it.

Fun with Burning Out Shorts

One fun technique I tried was to burn out the short. I have used this technique before on production PCBs that have shorts. You apply a high current (Amps), low voltage (say 0.5-1V) across the short and try to burn it out. Literally. You sometimes even hear a nice crack. It is safe for the logic chips as the low voltage means they won’t conduct any significant currents. It works best with shorts that measure a few ohms. Small enough to pass lots of current so they get hot, but larger than the parts of the track you want to keep. Too small and you risk burning out the wanted part of the track. Too large and it won’t get hot enough.

Anyway, it didn’t work this time. The short passed 3A happily and I wasn’t going to risk going higher in case I vaporised the whole D0 net! So I cut out the offending parts of the D0 net and replaced them with some fine wire. Ugly, but when I fired up jtag again, there was my flash chip! Whew. Problem fixed after a day of bug hunting. On to the next one.

Bug Hunts and the Geeky Mindset

If all the above sounds very cool and logical, rest assured it wasn’t.

When you are in the middle of a bug hunt two things happen.

The first: the logic centres of your brain shut down. You don’t think straight. You bounce from one theory to another.

The second: you don’t want to work on anything else. You get obsessed and can’t leave it alone. Worse, doing anything else is impossible. Intense bug hunts lead to late nights, neglect of significant others, no exercise, fatigue and a further dip in intellect.

Despite all that, I really enjoy a good bug hunt. When you come through at the end, the satisfaction is supreme. And you always end up learning lots along the way that comes in useful later. Call me a geek, call me a nerd, or call my wife and commiserate with her.

Flashing problems

This was a strange one. Now that jtag could actually see my flash chip I tried to flash the chip:
jtag> cable parallel 0x378 IGLOO
jtag> detect
jtag> initbus bf532_bf1
jtag> detectflash 0x20000000
jtag> flashmem 0x20000000 /path/to/u-boot.bin

However I kept getting verify errors. Whole sectors of the flash (each sector is 64k long) were not being written. I tried a few things like checking the JTAG signals for ringing (they were OK) and shortening the JTAG cables just in case I had signal integrity problems like ringing on the JTAG clock signal. Still verify errors.

Each attempt was very time consuming as flashing via the JTAG is slow: it takes about 20 minutes to flash the 128k u-boot program.

I was in chat contact with Wojtek at this time and he suggested I try the Igloo JTAG cable on a STAMP board. So I tried flashing u-boot onto a BF533 STAMP and it worked fine. U-boot even ran OK on the STAMP. This was a good test as it confirmed my Igloo was OK. One variable removed from the bug hunt.

So then back to the bf1 board. Suddenly it was flashing OK! Despite nothing changing. So I tried it three more times and each time it flashed and verified OK. So I don’t know. It doesn’t always boil down to logical cause and effect. Sometimes stuff just happens. So I “declared victory” and moved on.

Now I had u-boot starting up OK when I powered up – I could interact via the serial port. This was pretty exciting, as it proved that all the major components – Blackfin, flash, SDRAM, and CPLD were working fine. The bf1 was talking to me. So I powered down and soldered on all the parts for one Ethernet port. The idea was to use Ethernet and tftp to download a uClinux image.

Torpedoed by U-Boot and Ethernet

After loading the Ethernet parts I booted into u-boot and tried to tftp. Something was happening, but there were lots of tftp time out “T” symbols and only the occasional OK packet. Ping worked OK from u-boot, but not tftp or dhcp.

I did some packet sniffing on my host and found out that packets were being received OK from the bf1, but it looked like anything the host sent back was being ignored by the bf1. So I checked the rx side wiring of the Ethernet port but couldn’t find any problems. During tftp attempts I could also see what looked like valid Ethernet signals (1Vpp nice looking data signals) on the tx and rx nets at the RJ45 connector.

I was using a binary u-boot image that I found on the bf1 site. I wanted a way to inspect packets being received and that meant I needed to compile u-boot from source. So I checked out the CVS version of u-boot for the BlackfinOne and tried to compile. However I kept getting linker errors. This is very frustrating: you are trying to fix one bug and keep getting caught by others! Such is life.

Eventually a post to the bf1 forum solved the linker problem – there was a recent bf1 u-boot fix just checked into CVS. Now I could compile u-boot and start inserting some test code. So I flashed the CVS version of u-boot and hit another problem – now Ethernet wouldn’t work at all (not even ping) and the MAC address was always a string of 0x00s. This was a step backwards from the previous binary where the MAC was set up OK and at least ping worked!

One other problem was slowing me down – the 20 minutes it took to flash u-boot every time I wanted to add a printf to test something. This made debugging very tedious and frustrating. It felt like I was getting nowhere fast.

One thing I did discover is that it is a good idea to erase the flash sectors before flashing using jtag, just to make sure:
jtag> eraseflash 0x20000000 2
When I didn’t perform this step I sometimes found I was booting a previous version of u-boot, not the one I had just flashed!
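The “2” in the erase command lines up with the numbers given earlier: a 128k u-boot image on a flash with 64k sectors. The small helper below is a sketch of that arithmetic. It assumes the last argument to eraseflash is a count of sectors, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole flash sectors needed to hold image_bytes (rounds up). */
size_t sectors_needed(size_t image_bytes, size_t sector_bytes)
{
    assert(sector_bytes > 0);
    return (image_bytes + sector_bytes - 1) / sector_bytes;
}
```

For a 128k image and 64k sectors this gives 2. An image that grows by even one byte past 128k would need a third sector erased.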

After a break I thought of another idea to help speed up debugging. I used a trick I read about in the Blackfin u-boot documentation – you can download a new u-boot image into SDRAM using u-boot, then execute the new u-boot image using the “go” command. As I didn’t have Ethernet working I used ymodem to download u-boot images to SDRAM. This worked really well and enabled me to test and debug in 2 minute cycles – much faster.

Note that this technique doesn’t work for all u-boot images. For example, when I tried to run a u-boot image configured for 64M RAM from a 32M u-boot it failed, and I needed to reflash instead. Some people have reported that ymodem doesn’t work for them; they use the kermit protocol instead.

Two iterations later I had the bug nailed. The problem was that u-boot was trying to read the MAC from the little serial EEPROM used by the Ethernet chip. However I hadn’t loaded this chip, and even if I had it would be un-programmed and the MAC would be all 0xff’s. So after some reading and grepping of the u-boot source I changed this line in include/configs/bf1.h from:
#define CONFIG_DM9000_HAVE_EEPROM 1
to:
#undef CONFIG_DM9000_HAVE_EEPROM
This allows the MAC to be set by the values compiled into bf1.h, or the environment variables used by u-boot. If this option is defined then the EEPROM contents always override the MAC from the environment.
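Both failure cases in this story (a MAC of all 0x00s and an unprogrammed EEPROM reading all 0xff’s) are easy to detect in software. The sketch below shows one way a driver could guard against them and fall back to another source. It is not the logic in the bf1 u-boot source, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the 6-byte MAC looks programmed: not all 0x00, not all 0xff,
 * and not a multicast address (low bit of the first byte is clear). */
int mac_looks_valid(const uint8_t mac[6])
{
    static const uint8_t zeros[6] = {0, 0, 0, 0, 0, 0};
    static const uint8_t ones[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    assert(mac != NULL);
    if (memcmp(mac, zeros, 6) == 0 || memcmp(mac, ones, 6) == 0)
        return 0;
    return (mac[0] & 0x01) == 0;
}

/* Use the EEPROM MAC only when it looks valid, otherwise the fallback. */
const uint8_t *choose_mac(const uint8_t eeprom[6], const uint8_t fallback[6])
{
    return mac_looks_valid(eeprom) ? eeprom : fallback;
}
```

With a check like this, an unloaded or blank EEPROM would not override a good MAC from the environment.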

Suddenly my u-boot Ethernet was up and downloading images via tftp just fine. Cool. After 1.5 days of chasing this bug it never looked so good to see those tftp “*” symbols flashing across the screen as an image was downloaded. I also learnt a lot about u-boot and how Ethernet drivers work which was interesting.

So now we are getting real close – time to build a uClinux image!

Building uClinux for the BlackfinOne

To create a uClinux-dist for the BlackfinOne I copied all of the files in
uClinux-dist-R06R2-RC2-bf1-diff-files.061219.tar.bz2
on top of a fresh 2006R2 Rc5 uClinux-dist using “cp -R”.

This was set up for the jffs2 file system, however I wanted to try a ramfs config as I was more familiar with that from my STAMP work. This was probably a mistake, as I spent several hours working out how to get this working. uClinux was starting OK but kernel panicking when it couldn’t mount root. I was using these u-boot commands to download and boot the uImage:
bf1> tftp 0x1000000 uImage
bf1> bootm 0x1000000

The uImage file combines the kernel and the root file system in one compressed image.

At Wojtek’s suggestion (again via chat) I compared the boot sequence of the bf1 to my STAMPs. I noticed something odd about the Memory Map section of the boot sequence. When I booted the STAMP I was getting something like this.
Memory map:
text = 0x00001000-0x00107f94
init = 0x00108000-0x00115508
data = 0x00115b4c-0x00145e60
stack = 0x00116000-0x00118000
bss = 0x00145e60-0x00153394
available = 0x00153394-0x03800000
rootfs = 0x07700000-0x07f00000
DMA Zone = 0x07f00000-0x08000000

Note the “rootfs” line. This says that 8M is allocated to the rootfs. This was the BF533 STAMP, which has 128M available, which means the top address is 0x0800 0000. Note that the “available” memory is about 57M. Now when I compared the rootfs line for the bf1, I saw:
rootfs = 0x00600000-0x01f00000
Which is 25M allocated for rootfs. Now at this stage I only had my bf1 configured for 32M, so that rootfs was consuming most of the available memory. In fact when I checked the available line, only 416k was free for the entire operating system! I traced the problem to the vendors/IvanDanov/BLACKFINONEV2/Makefile. The BLOCKS line that controls the size of the rootfs was set to 25600 rather than 8000. As I mentioned earlier, this is only really a problem if you are using ramfs.
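The sizes quoted here can be checked directly from the memory map addresses. The sketch below does the subtraction, reading the bf1 rootfs end address as 0x01f00000, and also converts a BLOCKS value to bytes on the assumption that BLOCKS counts 1 KiB blocks. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024u * 1024u)

/* Size in bytes of a region from the boot-time memory map. */
uint32_t region_bytes(uint32_t start, uint32_t end)
{
    assert(end >= start);
    return end - start;
}

/* RAM disk size in bytes for a BLOCKS value, assuming 1 KiB blocks. */
uint32_t blocks_to_bytes(uint32_t blocks)
{
    return blocks * 1024u;
}
```

The STAMP rootfs region (0x07700000 to 0x07f00000) comes out at exactly 8M and the bf1 region at exactly 25M. Under the 1 KiB assumption, BLOCKS=25600 is also exactly 25M, while BLOCKS=8000 is 8,192,000 bytes, which fits inside an 8M region.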

I changed the line back to BLOCKS=8000, rebuilt the uImage and managed to (finally) boot uClinux. So I ran around the house telling everyone but nobody cared much (except my baby son, as discussed below). Oh well, it still made my day!

A little more work to u-boot and kernel configuration and I had a 64M BlackfinOne booting OK. Wojtek showed me how to configure u-boot for 64M:

  1. In board/bf1/config.mk make TEXT_BASE = 0x03fe0000.
  2. In include/configs/bf1.h set the #define CONFIG_MEM_SIZE to 64.
  3. make clean; make mrproper; make bf1_config; make.
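The TEXT_BASE value in step 1 is consistent with u-boot sitting in the top 128k of SDRAM: 64M is 0x04000000, and 0x04000000 minus 0x20000 is 0x03fe0000. The helper below is a sketch of that relationship. The 128k figure is inferred from the value above rather than taken from the bf1 sources, so treat results for other memory sizes as a guess to be checked against the board config.

```c
#include <assert.h>
#include <stdint.h>

/* 128 KiB reserved at the top of SDRAM (inferred, not from the sources). */
#define UBOOT_RESERVED_BYTES 0x20000u

/* TEXT_BASE that would place u-boot in the top 128 KiB of SDRAM. */
uint32_t text_base_for(uint32_t mem_mib)
{
    assert(mem_mib > 0 && mem_mib <= 2048);
    return mem_mib * 1024u * 1024u - UBOOT_RESERVED_BYTES;
}
```

For 64M this returns 0x03fe0000, matching step 1. For 32M it returns 0x01fe0000.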

To configure uClinux for 64M, run “make linux_menuconfig” and set “Blackfin Processor Options – Use SDRAM chip” to your memory chip, then rebuild uClinux-dist with the usual “make”.

If you have built a 64M BlackfinOne you are welcome to try my u-boot and uClinux uImages. They might be useful for initial testing and save you the steps of compiling your own. Once you have flashed u-boot-bf1v2-64M.bin, place uImagebf1v2-64M on your tftp server and on the bf1 console:
bf1> tftp 0x1000000 uImagebf1v2-64M
bf1> bootm 0x1000000

Note that this uImage uses ramfs not jffs2, and I used the -75 memory chip. This should work OK with the slightly faster -7E chips.

While experimenting I noticed that you need a u-boot configured for 64M to boot a 64M uClinux, otherwise uClinux bombed for me shortly after the uImage was uncompressed. However you can boot a 32M linux using a 64M u-boot.

I set the Ethernet MAC and IP for uClinux using environment variables in u-boot that are passed to uClinux, using the addnet script:
bf1> print
ethaddr=02:80:ad:20:31:b8
gatewayip=192.168.1.1
netmask=255.255.255.0
ipaddr=192.168.1.30
serverip=192.168.1.2
bootargs=root=/dev/mtdblock0 rw
addnet=setenv bootargs $(bootargs) ip=$(ipaddr):$(serverip):$(gatewayip):$(netmask):$(hostname):eth0:off
flashboot=run addnet;bootm 0x20040000

Flashboot is called when the bf1 starts, which calls the addnet script, and boots uClinux. I would rather have u-boot set just the MAC and let uClinux set the IP through /etc/rc, however I haven’t worked out how to do this just yet.
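It can help to see what the kernel command line looks like once addnet has run. The sketch below builds the same string in C, following the kernel’s ip=client:server:gateway:netmask:hostname:device:autoconf format. It assumes an unset hostname expands to an empty string, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Build the command line that "addnet" would produce.
 * Returns the number of characters written, or -1 if buf is too small. */
int build_bootargs(char *buf, size_t len, const char *bootargs,
                   const char *ipaddr, const char *serverip,
                   const char *gatewayip, const char *netmask,
                   const char *hostname)
{
    int n;

    assert(buf != NULL && len > 0);
    assert(strlen(ipaddr) > 0);  /* a static setup needs a client address */
    n = snprintf(buf, len, "%s ip=%s:%s:%s:%s:%s:eth0:off",
                 bootargs, ipaddr, serverip, gatewayip, netmask, hostname);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With the environment printed above and no hostname set, the result is “root=/dev/mtdblock0 rw ip=192.168.1.30:192.168.1.2:192.168.1.1:255.255.255.0::eth0:off”.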

Open Hardware

The BlackfinOne project is a great example of open hardware development. There is a growing community of developers that have contributed to hardware, software, and even tracking down problem parts for each other. There are parts being air-mailed all over the world as one person helps another get parts that are tough to find in their country. An interesting example of open/community development being applied to hardware – obtaining parts is not a problem for open software development!

Web-based parts supply companies like Digikey have helped make this sort of project possible. You can now easily buy many obscure parts and get them delivered anywhere in the world in a week or so. The web has helped hardware hacking get easier, and made it possible to build complex, sophisticated designs like the bf1.

Chat and mailing lists proved invaluable. Wojtek and I were in constant chat contact, and we used the bf1 forums to help us when we got stuck, even though at this time of year we are about 12 hours apart in time (and about 40 degrees C in temperature). Just being able to explain our problems to each other was very useful; a problem becomes clear in your own head when you have to express it to another person. I have been developing hardware for about 20 years and this was truly a new way of working for me. Nothing new for software development I guess, but very different for hardware (at least for me). In my experience (mainly in smaller companies), the hardware guy has often worked more or less alone.

Conclusion

This has been a really great experience. I assembled my very own uClinux hardware, then brought it up through various stages until it booted. I still can’t believe that I actually soldered together my very own Linux machine! If you are interested in trying out a hardware project, I thoroughly recommend this project. Great design, great community. Open hardware in action.

When I told my baby son about my working BlackfinOne, this was his reaction:

I know exactly how he feels!

Thanks

Wojtek for your review of this post and photos.


Building an Embedded Asterisk PBX Part 2

Here is the next installment in my adventures of building an embedded IP-PBX around the Blackfin-Asterisk. The big news is that we now have a working 4-port embedded IP-PBX and low cost hardware for sale!

DTMF Fixed Point Port

I spent a few days converting the Asterisk floating point DTMF detection code (dsp.c) to fixed point. The Blackfin doesn’t have an FPU, so any significant floating point work (like DSP) needs to run in fixed point. This work brought the MIPS per channel down from about 200 to 5 (the Blackfin has about 500 MIPS available). It could run much faster if I ported the inner loop code to assembler; however, I think it’s fast enough for now.
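For readers who have not seen fixed point DSP code, here is a minimal sketch of an integer-only Goertzel filter, the kind of single-frequency detector commonly used for DTMF. It is not the ported dsp.c code. The Q14 coefficient format, the names, and the lack of input scaling are all choices made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Goertzel state. coef_q14 is 2*cos(2*pi*f/fs) in Q14, so 1.0 == 16384. */
typedef struct {
    int32_t v2;
    int32_t v3;
    int32_t coef_q14;
} goertzel_fx;

void goertzel_fx_init(goertzel_fx *g, int32_t coef_q14)
{
    assert(g != NULL);
    g->v2 = 0;
    g->v3 = 0;
    g->coef_q14 = coef_q14;
}

/* Q14 multiply with a 64-bit intermediate.
 * Assumes arithmetic right shift for negative products. */
static int32_t mul_q14(int32_t coef_q14, int32_t v)
{
    return (int32_t)(((int64_t)coef_q14 * v) >> 14);
}

/* Run n samples through the filter. Real code would also scale the
 * input so the state cannot overflow on long blocks. */
void goertzel_fx_update(goertzel_fx *g, const int16_t *x, size_t n)
{
    size_t i;

    assert(g != NULL && x != NULL);
    for (i = 0; i < n; i++) {
        int32_t v1 = g->v2;
        g->v2 = g->v3;
        g->v3 = mul_q14(g->coef_q14, g->v2) - v1 + x[i];
    }
}

/* Unnormalised squared magnitude at the target frequency. */
int64_t goertzel_fx_energy(const goertzel_fx *g)
{
    int64_t v2 = g->v2, v3 = g->v3;

    assert(g != NULL);
    return v3 * v3 + v2 * v2 - (int64_t)mul_q14(g->coef_q14, g->v2) * v3;
}
```

As a hand-checkable case, a coefficient of 16384 (1.0) tunes the filter to one sixth of the sample rate. Six samples of a cosine at that frequency with amplitude 1000 give an energy of 9,000,000, which is (6 × 1000 / 2) squared, while six samples of DC give zero.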

To test the Asterisk DTMF detector I used Steve Underwood’s dtmf_rx_tests.c program from his very well written spandsp library. I moved from floating point to fixed point in a series of very small steps. After each step I ran Steve’s unit test to make sure I hadn’t screwed anything up. This is really the only way to test DSP code; you can’t just hack real time code then push a few buttons on the phone and hope it dials OK!

Here is some typical output from the unit test:
Test 4: Acceptable amplitude ratio (twist)
1 normal twist = 8.00dB
1 reverse twist = 4.20dB
5 normal twist = 8.40dB
5 reverse twist = 4.60dB
9 normal twist = 8.40dB
9 reverse twist = 4.60dB
D normal twist = 8.70dB
D reverse twist = 4.30dB
Passed
Test 5: Dynamic range
Dynamic range = 41dB
Passed
Test 6: Guard time
Guard time = 25ms
Passed
Test 7: Acceptable signal to noise ratio
Acceptable S/N ratio is 10dB
Passed
Test: Dial tone tolerance.
Acceptable signal to dial tone ratio is 15dB
Failed

Note the last test failed. This test also fails on the floating point code (i.e. running on a PC, before I ported it to the Blackfin). I am not sure why. Could be a switch I forgot to turn on or a bug in the dsp.c code. Need to look into that some day.

Echo Canceller Optimisation

I also spent some time looking at the mec2.h echo canceller in the zaptel package with a view to speeding up code execution. If we are running 4-8 analog channels we need to make sure the echo canceller is fairly efficient. In fact, the echo canceller is likely to dominate the CPU load of the PBX; Asterisk and the other DSP code use a relatively small amount of MIPs in comparison.

I have identified a few areas where mec2.h could be optimised. One example is in the tap update code:
for (k=0; k<ec->N_d; k++) {
grad2 = CONVOLVE2(yada yada);
ec->a_i[k] = grad2 / two_beta_i;
ec->a_s[k] = ec->a_i[k] >> 16;
}

BTW I have deleted a lot of code for clarity. On the Blackfin the divide is a function call which is a no-no for real time DSP code. In fact divides are generally a bad idea for real time DSP, you want everything to be expressed in terms of multiplies and adds.

However, we are in luck. As we are dividing by a constant the divide can be pulled out of the inner loop:
inv_two_beta_i = 1/two_beta_i;
for (k=0; k<ec->N_d; k++) {
grad2 = CONVOLVE2(yada yada);
ec->a_i[k] = grad2 * inv_two_beta_i;
ec->a_s[k] = ec->a_i[k] >> 16;
}
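
One catch worth spelling out: if two_beta_i is an integer greater than one, a literal 1/two_beta_i truncates to zero, so in fixed point the reciprocal needs some fractional bits. Below is a minimal sketch of that idea, assuming a positive divisor and a Q16 reciprocal. The names recip_q16 and scale_taps are hypothetical and this is not the mec2.h code:

```c
#include <stdint.h>

/* Q16 reciprocal of a positive divisor: the one and only divide */
static int32_t recip_q16(int32_t divisor)
{
    return (int32_t)((1LL << 16) / divisor);
}

/* Scale each value with a multiply and a shift instead of a divide.
   Assumes >> on a negative value is an arithmetic shift (true for gcc). */
static void scale_taps(int32_t *out, const int32_t *grad, int n, int32_t divisor)
{
    int32_t inv = recip_q16(divisor);   /* computed once, outside the loop */
    for (int k = 0; k < n; k++)
        out[k] = (int32_t)(((int64_t)grad[k] * inv) >> 16);
}
```

The result can differ from a true divide by one LSB when the divisor is not a power of two (for example 3000 divided by 3 comes out as 999), which may or may not matter for a given algorithm.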

There are also several other places where the echo canceller could be optimised. This would also help performance on x86 platforms, for example there is no reason why much larger tails (or larger spans) couldn’t be handled on a PC with a little more optimisation.

Multiple Analog Ports

Once I had the DSP code moving along nicely it was time to port the driver to handle multiple analog ports. Here is the output from the driver as it boots and auto detects 4 modules:
root:/var/tmp> insmod wcfxs.ko debug=1
Using wcfxs.ko

Registered Span 1 ('WCTDM/0') with 8 channels
Span ('WCTDM/0') is new master
iRxBuffer1 = 0xff803e58
iTxBuffer1 = 0xff803ed8
ISR installed OK
port: 1 port_type: O
port: 2 port_type: O
port: 3 port_type: S
port: 4 port_type: S
port: 5 port_type: -
port: 6 port_type: -
port: 7 port_type: -
port: 8 port_type: -

O means an FXO port was detected, S means an FXS port. In this case just four ports are loaded, out of a possible 8. You know I really should have added the letters “FX” in front of those strings. Hmmmmm. Maybe when I finish this blog post.

Here is what it all looks like when configured for four ports:

A pretty red light means an FXO port, green means FXS. The whole thing isn’t very big, about the size of a phone handset:

Want more than 4 ports? No problem. Just stack another board on top:

In this example I didn’t populate all the ports as I hadn’t soldered up enough modules at the time. Can you guess from the lights how each port is configured?

It might be useful to introduce a few terms:

  1. The mother board is the Blackfin STAMP card on the bottom. These are made by Analog Devices and are available off the shelf for about $200. They run uClinux and also support way-fast DSP work.
  2. On top of that I plug in a daughter board (why are boards always girls?). This puppy holds some glue logic and sockets for the modules and SD card.
  3. The modules are the little boards that plug into the daughter board. There are two types of modules, FXS and FXO. The daughter board holds four modules.

So the whole thing is very similar to the Digium TDM400 design (and other companies who use modular approaches I guess), except that here the mother board is an embedded system and the daughter board uses a serial bus rather than PCI.

Stack Overflow

I am pretty happy with the hardware stacking architecture, here are some other cool things it can do:

  1. Although I haven’t tried it you might be able to stack more boards on top, to give a total of 12, 16 ports etc.
  2. It would be easy to design a daughter card with sockets for 8 or even 12 modules, that way you wouldn’t have to stack it so high. You could then make an IP-PBX in the shape of a channel-bank.
  3. It’s possible to combine analog and other interfaces in one stack. For example you could combine analog ports and say BRI-ISDN using the fourfin board.
  4. If the Blackfin DSP starts to glow cherry red we can always add a DSP daughter card to handle say echo cancellation.

Status

So how well does it work? Well it’s early days but so far so good:

  1. It works (really) and stays up until I bring it down, i.e. as far as I can tell it’s stable.
  2. I can make calls between ports and have run calls on 3 out of 4 ports at the same time. I ran out of phones and phone lines at that point!
  3. I can play the “Congratulations, you have successfully installed….” demo and even call Digium via the IAX2 demo.
  4. It makes and receives IAX2 & SIP calls OK.

Getting Involved

There are still plenty of things to do. If you would like to work on a leading-edge project with open hardware and software, you are very welcome to join our community and get involved.

Corporate sponsorship is welcome, however please don’t ask me to close the hardware designs (I get a lot of that). Some thoughts on the business and social possibilities are here. Some ways to contribute are engineering time, donation of test equipment, and direct financial support. In return you get high quality, well tested, open hardware designs and quality open DSP software.

We already have people working on software, hardware, and some companies donating test equipment and engineering time.

Next Steps

  1. Lots of testing. I would like to give the platform a good hammering using automated tests, for example have FXS ports call FXO ports continually and pass a few tones back and forth while measuring signal quality automatically.
  2. I would like to improve the echo canceller algorithm. I have a bunch of ideas and a “brains trust” of strong DSP guys who I am in email contact with to help on this one. I don’t see any reason why an open echo canceller can’t be made just as good as the proprietary echo cancellers being used in “hardware” echo cancellers today. After all, they are just software running on a DSP chip. I am not saying it is a trivial problem (echo cancellation is tough DSP voodoo), but I am saying it is do-able. Any echo cancellation gurus out there – please email me if you would like to help with effort or even just advice.
  3. Implement booting via the SD-card.
  4. Complete the port to a late model Asterisk.
  5. Compliance Testing. I have booked the first set of compliance tests and will be aiming at approvals for the US, Canada, Australia and New Zealand. Once testing is complete you will be able to build and deploy real world products that are approved for connection to the telephone networks in these countries.
  6. The ultimate test. I will install one at my Mum’s house. If she can’t break it no one can. She is death to anything with IT in it. She doesn’t need a GUI, rather an RPI (rotary phone interface).

Hardware for Sale

I have started manufacture of 20 Beta units, they are due to ship in mid October. The price for a kit consisting of 1 daughter card and a total of 4 FXS/FXO modules (see photo below) is US$299 plus shipping (McDonalds ruler not included unless you really want one).

Combined with a US$226 BF537 STAMP card from Digikey (enter ADDS-BF537-STAMP-ND in the search box) you can start experimenting with your very own embedded Asterisk PBX with 4 analog ports for around US$500. Please email me if you are interested.

By purchasing my products you directly support open telephony hardware development.

Links

  • Building an Embedded Asterisk PBX Part 1
  • Building an Embedded Asterisk PBX Part 3

    Building an Embedded Asterisk PBX Part 1

    Over the last few days I have been bringing some telephony hardware to life. I have finally obtained all the parts I need and am assembling, testing, and blogging as I go!

    This work is part of a project to develop and build “open” IP-PBX hardware. Now by building I mean really building. Like designing the circuits and Printed Circuit Boards (PCBs) and then hand-loading the PCBs with a soldering iron. The PBX is an embedded Asterisk design running on a Blackfin STAMP platform. This work is part of the Free Telephony Project.

    Now the first priority is to start with a clean, tidy, professional work area:

    Mmmmmmmmm. Oh well, I will tidy it up one day.

    Let’s start with the 4fx board. The 4fx board interfaces the Blackfin STAMP to the FXS & FXO modules. It has a little Xilinx XC9536 programmable logic chip (or CPLD). I had previously designed and simulated the Verilog code for this CPLD so all I had to do was program the chip using a JTAG cable that connects between my PC and a header on the 4fx card:

    It actually took me a few hours of head scratching to get the chip to program. The problem was I had accidentally selected the wrong chip type when I synthesised the CPLD code. DOH! Anyway once I worked that out it programmed straight away which was a relief – you never know with a new design if you have messed up something fundamental like connecting power in reverse. So the first “sign of life” you get from a new board is always a big relief.

    I then poked around with the scope while running a unit test program on the Blackfin that put the CPLD through a few tests. Just like software, it is very important to make sure the components of a hardware design are working before integrating them into a larger design. The typical trap is that we get excited and try to move forward too fast, for example testing several new and unknown parts of the design all at once. Simple errors compound to tough bugs when combined with other untested hardware and software.

    So I always try to test thoroughly at the earliest possible stage. In fact I often organise my designs so they can be broken apart into little chunks and tested, rather than thinking about testing as an after thought. In the case of the CPLD I ran many simulations using the Icarus Verilog simulation tools before even going near the hardware. Experience (OK plenty of screw-ups) has taught me that it takes much less effort to test carefully earlier than to debug later.

    Anyway, back to the story. On the CPLD I messed up one pin’s position in the pin-locking file (easily fixed by recompiling the CPLD image), but apart from that the CPLD appears to be working fine. All the chip select signals are being generated in response to the commands from my test software.

    OK, the next step is to see if I can make some LEDs on the board light under software control. The LEDs are connected to the CPLD and will be used to show the status of each telephony port. So if we can make the LEDs do their thing this will prove another chunk of the CPLD code is OK.

    I modified the unit test program to write to the register that controls the LEDS:
    bfsi_spi_init(baud, (1<<NCS_A) | (1<<NCS_B));

    for(i=0; i<tests; i++) {
    bfsi_spi_write_8_bits(NCS_B, select);
    bfsi_spi_write_8_bits(NCS_A, data);
    }

    In the for loop, the first write sets the “destination” of the data (which SPI device we wish to write to). The second write sets the actual value. The way the LED is wired up if we write a 01 (binary) we should get the LED to glow red, and 10 (binary) to make it glow green. The for loop makes it repeat many times, just so I can see what is going on with my ancient analog scope. Only one write is actually needed.
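
    As a rough sketch of that two-write sequence, here is a host-side version that logs the writes instead of driving hardware. Everything in it is a stand-in: fake_spi_write, led_set, the FAKE_NCS_A/FAKE_NCS_B values and the LED_OFF code are assumptions for illustration, and only the 01 = red and 10 = green codes come from the description above:

```c
#include <stdint.h>

#define FAKE_NCS_A 0   /* hypothetical chip select numbers */
#define FAKE_NCS_B 1

/* LED_RED and LED_GREEN follow the text; LED_OFF is an assumption */
enum led_colour { LED_OFF = 0x0, LED_RED = 0x1, LED_GREEN = 0x2 };

struct spi_write { int chip_select; uint8_t value; };
static struct spi_write spi_log[8];
static int spi_log_len;

/* Stand-in for the real SPI write: just records what would be sent */
static void fake_spi_write(int chip_select, uint8_t value)
{
    if (spi_log_len < 8) {
        spi_log[spi_log_len].chip_select = chip_select;
        spi_log[spi_log_len].value = value;
        spi_log_len++;
    }
}

/* First write picks the destination, second write carries the LED bits */
static void led_set(uint8_t select, enum led_colour colour)
{
    fake_spi_write(FAKE_NCS_B, select);
    fake_spi_write(FAKE_NCS_A, (uint8_t)colour);
}
```

    Calling led_set(select, LED_RED) then mirrors what data=0x1 does in the test module, and LED_GREEN mirrors data=0x2.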

    I peer at the LED. It stares back, blank and just daring me to try:

    I hit the magic command line:
    root:~> insmod tspi_4fx.ko data=0x1

    Hey – it worked! That’s not meant to happen! Not first time! WHOO-HOO! OK, let’s try making it green:
    root:~> insmod tspi_4fx.ko data=0x2

    Coooooooool……..

    It is hard to explain the feeling of achievement you can get from just making an LED light. You never really understand how much complex technology is between the vision and reality of making a simple LED come on – until you start to build chunks of that technology, solder the LED yourself, write the driver etc. Then you realise, and a simple LED turning on when you tell it to seems like an unlikely miracle! Anyone who has ever worked on making computers talk to hardware will understand what I mean.

    Especially if you have had your share of times when that LED wouldn’t turn on. For like days or weeks.

    OK so the next step was to test the FXO and FXS modules. Here they are all soldered and ready to smoke up, errr I mean test. The large, ugly resistors hanging off them are because I couldn’t easily source some very high (15M) and very low (0.5 ohm) resistors I needed in 0603/0805 packages. Can anyone send me a few please?

    First I wanted to test the FXO module. I connected it directly to the Blackfin STAMP card, rather than using the 4fx card just yet. Golden rule – always test the minimum possible:

    I already had some Asterisk software for the Blackfin running and tested (using other hardware). That meant I had tested and working software to test the unknown hardware. So it was just a matter of firing that up and seeing if it detected the card:
    Welcome to:
    ____ _ _
    / __| ||_| _ _
    _ _| | | | _ ____ _ _ \ \/ /
    | | | | | | || | _ \| | | | \ /
    | |_| | |__| || | | | | |_| | / \
    | ___\____|_||_|_| |_|\____|/_/\_\
    |_|

    For further information see:
    http://www.uclinux.org/
    http://blackfin.uclinux.org/

    BusyBox v1.00 (2006.08.25-23:13+0000) Built-in shell (msh)
    Enter 'help' for a list of built-in commands.

    root:~> eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
    Zapata Telephony Interface Registered on major 196
    Registered Span 1 ('WCTDM/0') with 1 channels
    Span ('WCTDM/0') is new master
    iRxBuffer1 = 0xff800000
    iTxBuffer1 = 0xff800080
    ISR installed OK
    Testing for ProSLIC
    ProSLIC not loaded...
    Testing for DAA...
    VoiceDAA System: 04
    ISO-Cap is now up, line side: 03 rev 06
    Module 0: Installed -- AUTO FXO (FCC mode)
    Found: Blackfin STAMP (1 modules)
    Registered tone zone 0 (United States / North America)
    4294895942 Polarity reversed (0 -> 1)

    root:~> /var/tmp/asterisk -vc

    That’s a pretty good result – the FXO port was detected OK. So then I started Asterisk and put a few calls through it. I placed a call into the PBX (using another Asterisk PBX running on an x86 box) and it detected the ring signal and went off hook OK:
    *CLI> RING on 1/1!
    NO RING on 1/1!
    RING on 1/1!
    NO RING on 1/1!
    Jan 1 02:51:59 NOTICE[96]: chan_zap.c:5406 ss_thread: Got event 2 (Ring/Answer)

    However the audio had lots of sharp clicks and pops. Crack-Crack-Crack every few seconds. Damn.

    I spent half a day chasing this bug. It puzzled me a bit as I knew the circuit was straight out of the Silicon Labs data sheet and that I (and a few others) had carefully checked it. So I figured it must have been an assembly error like a wrong component or bad solder joint. Actually I wasn’t quite that logical: in the real world bugs tend to get your emotions involved. You really want it to work so you get a little stressed and start doing and thinking stupid things. So you end up checking a bunch of things you don’t need to (like the schematic five times) and perhaps missing some other more sensible checks – you don’t always think straight when your emotions are in play. Such is the psychology of bug hunts.

    I started checking signals on the header and had trouble getting a good contact with my scope probe. I looked at the pin and there was some flux residue stuck to it. So I gave that part of the board a scrub with a fine brush and some solvent and then fired it up again to check that signal. Huh – now the audio is OK – clicks gone! WTF? I am still not sure what happened here – perhaps the brush dislodged a small short or the flux was conducting a little.

    So anyway the FXO module (fxomod) seems to work OK now.

    I then tried the FXS module and it worked on the first try. I was really happy about that – I was placing calls over it 5 minutes after the first time I applied power. Hardware development isn’t meant to work like that! Anyway I guess I will get my fair share of bugs later (it’s the conservation of bugs law), there is still plenty of development to go.

    My next step is to integrate the FXS and FXO modules with the 4fx board. More on that in a later post.

    This is what the whole thing looks like when put together with the STAMP, 4fx, and (for now) a single FXO module:

    The idea is that you can stack more 4fx boards to get multiples of 4 ports. You could also stack other cards, for example BRI-ISDN, E1/T1, or cards that give you additional DSP horsepower.

    You might have also noticed the SD-card. The driver for that was developed by Hans Eklund and the team at Rubico. They have done a fantastic job. I compiled the latest uClinux version with SD/MMC card support and it worked perfectly first time. It is really cool to read and write files to a SD card on the Blackfin, then transfer the card to a PC and find the files all there and readable. Such a simple hardware interface too (just a few wires).

    Geekiness is contagious. Just last week I convinced my wife Rosemary to help me with some board stuffing. I started her off on a simple thru-hole kit to teach her soldering. A few days later here she is soldering tiny 0603 resistors and doing a fine job:

    That’s all for now. I might blog some more later as I work through the steps to bring up the rest of the board.

    Links

    1. Building an Embedded Asterisk PBX Part 2
    2. Building an Embedded Asterisk PBX Part 3
    3. More information on the Free Telephony Project here.
    4. Blackfin MMC/SD card how-to.
    5. More information on the design I am building here.
    6. Here is the (current) 4fx schematic in PDF form. It will probably change as the bugs are found and fixed.
    7. You can download source files for the schematics, PCB design, CPLD code here. Grab the latest hardware-x.y.tar.gz file. In the cpld directory there is a README that explains the CPLD code as well as “test benches” – Verilog code that tests other Verilog code.


    How to make your Blackfin fly Part 2

    This article describes how to write fast DSP code for your Blackfin.

    For the last month or so I have been working with Jean-Marc Valin from the Speex project to optimise Speex for the Blackfin. Jean-Marc had previously performed some optimisation work in the middle of 2005 (sponsored by Analog Devices).

    We built on this work, reducing the complexity for the encode operation from about 40 MIPs to 23 MIPs. On a typical 500MHz Blackfin this means you can now run (500/23) = 21 Speex encoders in real time.

    The really cool thing is that you can compress 21 channels of toll-quality speech using open source hardware (the STAMP boards), using an open source voice codec.

    We obtained gains from:

    1. Algorithm improvements, for example Jean-Marc ported large parts of the code from 32-bit to 16-bit fixed point.

    2. Profiling the code and optimising the implementation of the most CPU intensive parts, for example LSP encoding and decoding, vector quantisation.

    3. Experimenting with and learning about the Blackfin (e.g. care and feeding of instruction and data cache). It took us a while to work out just how to make code run fast on this processor.

    4. The gcc 4.1 compiler, which uses the Blackfin hardware loop instruction, making for() loops much faster.

    Why The Blackfin is Different

    Most DSPs just have a relatively small amount (say 64k) of very fast internal memory. In a uClinux environment, the Blackfin has a large amount (say 64M) of slow memory, and small amounts of fast cache and internal memory.

    The advantage of this arrangement is that you can run big programs (like an operating system) on the same chip while also performing hard-core DSP operations. This really reduces systems costs over designs that need a separate DSP and micro controller.

    The disadvantage for crusty old DSP programmers like me is that things don’t always run as fast as you would like them to. For example, if your precious DSP code doesn’t happen to be in cache when it is called then you get hit with a big performance penalty.

    Some examples

    To get a feel for the Blackfin I have written a bunch of test programs, some of them based around code from Speex. They can be downloaded here.

    The cycles program shows how to optimise a simple dot product routine, I have previously blogged on this here.

    A Simple Library for Profiling

    To work out where to optimise Speex I developed a simple library to help profile the code. It works like this. You include the samcycles.h header file and insert macros:
    SAMCYCLES("start");

    for(i=0; i<10; i++);

    SAMCYCLES("end");

    around the functions you wish to profile. Then, when you run the program it dumps the number of cycles executed between each macro:
    root:/var/tmp> ./test_samcycles
    start, 0
    end, 503
    TOTAL, 503

    Which shows that between “start” and “end” 503 cycles were executed. Here is a more complex output from the “dark interior” of the Speex algorithm:
    root:/var/tmp> ./testenc male.wav male.out
    start nb_encode, 0
    move, 1352
    autoc, 16149
    lpc, 3180
    lpc_to_lsp, 21739
    whole frame analysis, 17797

    Ignoring the magical DSP incantations here, we can see that some routines are much heavier on the cycles than others. So those are the ones that get targeted for optimisation. You often get big gains by optimising a small number of “inner loop” operations that are hogging all of the CPU.
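
    For readers who want to try the same idea on another platform, here is a minimal sketch of how such a sampling macro could be built. It is not the samcycles.h source: the names PROF, prof_init, prof_sample and prof_dump are hypothetical, and the cycle counter is passed in as a function pointer (on the Blackfin it would read the CYCLES register, here a fake counter stands in so the sketch runs anywhere):

```c
#include <stdio.h>
#include <stdint.h>

#define PROF_MAX 32

typedef uint32_t (*prof_counter_fn)(void);

static prof_counter_fn prof_counter;
static const char *prof_label[PROF_MAX];
static uint32_t prof_delta[PROF_MAX];
static uint32_t prof_last;
static int prof_n;

/* Fake counter for host testing; the caller sets fake_cycles by hand */
static uint32_t fake_cycles;
static uint32_t fake_counter(void) { return fake_cycles; }

static void prof_init(prof_counter_fn fn)
{
    prof_counter = fn;
    prof_n = 0;
}

/* Record the cycles elapsed since the previous sample (0 for the first) */
static void prof_sample(const char *label)
{
    uint32_t now = prof_counter();
    if (prof_n < PROF_MAX) {
        prof_label[prof_n] = label;
        /* unsigned subtraction stays correct across counter wrap-around */
        prof_delta[prof_n] = (prof_n == 0) ? 0 : now - prof_last;
        prof_n++;
    }
    prof_last = now;
}

/* Print "label, cycles" for each sample, then the total; returns the total */
static uint32_t prof_dump(FILE *f)
{
    uint32_t total = 0;
    for (int i = 0; i < prof_n; i++) {
        fprintf(f, "%s, %u\n", prof_label[i], (unsigned)prof_delta[i]);
        total += prof_delta[i];
    }
    fprintf(f, "TOTAL, %u\n", (unsigned)total);
    return total;
}

/* Compile with -DPROFILE_OFF to make the macro cost nothing */
#ifdef PROFILE_OFF
#define PROF(label) ((void)0)
#else
#define PROF(label) prof_sample(label)
#endif
```

    Storing the samples and printing them only at the end keeps the printf overhead out of the measured code.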

    Care and Feeding of your Blackfin Cache

    One interesting test was “writethru” – this simply tests writing to external memory using a really tight inner loop:
    "P0 = %2;\n\t"
    "R0 = 0;\n\t"
    "%0 = CYCLES;\n\t"
    "LOOP dot%= LC0 = %3;\n\t"
    "LOOP_BEGIN dot%=;\n\t"
    "W[P0++] = R0;\n\t"
    "LOOP_END dot%=;\n\t"

    This also illustrates why DSPs are so good at number crunching – that inner “W[P0++] = R0” instruction executes in one cycle, and the hardware loop means 0-cycles for the loop overhead. Try doing that on your Pentium.

    However look at what happens when we try this on the Blackfin target which has “write through” data-cache enabled:
    root:/var/tmp> ./writethru
    Test 1: Write 100 16-bit shorts
    686 542 597 542 542 542 542 597 542 542

    The write test runs 10 times. On each run we print out the number of cycles it took to write 100 shorts. You can see the execution time decreasing as the instruction code and data gets placed into cache.

    However there is something funny going on here. Even in the best case (542 cycles) we are taking something like 5.4 cycles for each write, and it should be executing in a single cycle. My 500 MHz DSP is performing like a 100 MHz DSP. I think I am going to cry.

    The reason is that in “write through” mode every write must flow through the “narrow pipe” that connects to the Blackfin’s external memory. This external memory operates at 100 MHz (at least on my STAMP), so a burst of writes gets throttled to this speed.

    This is not good news for a DSP programmer, where you often have lots of vectors that need to get written to memory. Very quickly.

    There are a couple of solutions here. One is to take a hammer to your Blackfin STAMP hardware and go buy a Texas Instruments DSP (just kidding).

    Another less exciting way is to enable “write back” cache (a kernel configuration option):
    root:/var/tmp> ./writethru
    Test 1: Write 100 16-bit shorts
    119 102 102 102 102 102 102 102 102 102

    Now we are getting somewhere. Writing 100 shorts is taking about 100 cycles as expected. Note that the first run takes a little longer, this is probably because the program code had to be loaded into the instruction cache. In “write back” cache the values get stored in fast cache until the cache-line is flushed to external memory some time later.

    On a system like the Blackfin, we may run a lot of other code between calls to the DSP routines. This effectively means that the instruction and data caches are often “flushed” between calls to our DSP routines. In practice this leads to extra overhead as our DSP instructions and data need to be reloaded into cache.

    In the example above the overhead was about 20%. This is very significant in DSP coding. A way to reduce this overhead is to use internal memory…..

    Internal Memory

    The Blackfin has a small amount of internal data (e.g. 32k) and instruction memory (e.g. 16k). Internal memory has single cycle access for reads and writes. The Blackfin uClinux-dist actually has kernel-mode alloc() functions that allow internal memory to be accessed.

    The Blackfin toolchain developers are busy working on support for using internal memory in user mode programs, see this thread from the Blackfin forums.

    In the mean time I have written a kernel mode driver l1 alloc that allows user-mode programs to access internal memory:
    /* try alloc-ing and freeing some memory */

    pa = (void*)l1_data_A_sram_alloc(0x4);
    printf("pa = 0x%x\n",(int)pa);
    ret = l1_data_A_sram_free((unsigned int)pa);

    which produces the output:
    root:~> /var/tmp/test_l1_alloc
    pa = 0xff803e30

    i.e. a chunk of memory with an address in internal memory bank A.

    To see the effect of internal versus cache/external memory:
    Test 1: data in external memory
    ret = 100: run time: 173 103 103 103 103 103 103 103 103 103
    Test 2: data in internal memory
    ret = 100: run time: 103 103 103 103 103 103 103 103 103 103

    After a few runs there is no difference – i.e. on Test 1 the data from external memory has been loaded into cache. However check out the difference in the first run – Test 2 is much faster. This means that by using internal memory we avoid the overhead where the DSP code/data is out of cache, for example when your DSP code is part of a much larger program.

    I should mention that to make this driver work I needed to add an entry to my uClinux-dist/vendors/AnalogDevices/BF537-STAMP/device_table.txt file:
    /dev/l1alloc c 664 0 0 254 0 0 0 -

    then rebuild Linux as for some reason I couldn’t get mknod to work. Then:
    root:~> ls /dev/l1alloc -l
    crw-rw-r-- 1 0 0 254, 0 /dev/l1alloc
    root:~> cp var/tmp/l1_alloc_k.ko .
    root:~> insmod l1_alloc_k.ko
    Using l1_alloc_k.ko
    root:~> /var/tmp/test_l1_alloc

    Another problem I had was that insmod wouldn’t load device drivers in /var/tmp, which is where I download files from my host. Hence the copy to / above.

    Speex Benchmarks

    Here are the current results for Speex on the Blackfin, operating at Quality=8 (15 kbit/s), Complexity=1, non-VBR (variable bit rate) mode:

    The terms ext/int memory refer to where the Speex state and automatic variables are stored. The units are k-cycles to encode a single 20ms frame, averaged over a 6 second sample (male.wav).

    (1) Write through cache, ext memory: 564
    (2) Write through cache, int memory: 455
    (3) Write back cache , ext memory: 465
    (4) Write back cache , int memory: 438

    So you can see that write-back cache (3) gave us performance close to that of using internal memory (2 & 4) – quite a significant gain.
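
    The k-cycle figures convert to MIPs and channel counts with a little arithmetic. This sketch assumes one instruction per cycle (as the MIPs figures in this article do) and uses made-up helper names cycles_per_second and max_channels:

```c
#include <stdint.h>

/* Cycles used per second of audio, from cycles per frame and frame length in ms */
static int64_t cycles_per_second(int64_t cycles_per_frame, int frame_ms)
{
    return cycles_per_frame * 1000 / frame_ms;
}

/* Whole number of real time channels that fit in the given core clock */
static int max_channels(int64_t clock_hz, int64_t cycles_per_frame, int frame_ms)
{
    return (int)(clock_hz / cycles_per_second(cycles_per_frame, frame_ms));
}
```

    For case (4), 438 k-cycles per 20ms frame is 21.9 million cycles per second, so a 500 MHz part has room for 22 encoders. Case (3) at 465 k-cycles works out to 23.25 million cycles per second, or 21 encoders.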

    Optimisation work is in progress so we hope to reduce these numbers a little further in the near future. Also, there is still plenty of scope for optimisation of the decoder, which currently consumes about 5 MIPs with the enhancer enabled.

    To test out the current Speex code for the Blackfin (or other processors for that matter) you can download from Speex SVN:
    svn co http://svn.xiph.org/trunk/speex

    Or you can download a current snapshot from here.


    GSM Port for the Blackfin

    For my uCasterisk project I needed a couple of optimised codecs for the Blackfin. This post discusses the steps taken to port GSM to the Blackfin.

    The GSM codec for the Blackfin can be downloaded here.

    Usage

    1/ To make:
    make

    2/ To test:

    Download tgsm (test program produced by make) to your target and also download a source speech file like:

    to your Blackfin hardware and type:
    root:/var/tmp> ./tgsm male.wav male.out
    TOTAL, 0
    SNR = 10.4591 dB enc 114 dec 39 k cycles/frame
    root:/var/tmp>

    When it runs it prints out the number of cycles it took to execute each 20ms encode and decode frame.

    You can then upload the output file (male.out) to your host and listen to it. On my Linux box I use “play male.sw”, the sw lets “play” recognise it as a 16-bit signed-word file.

    Optimisation

    I spent a day or so optimising the code, for example:

    a) I wrote Blackfin versions of the macros in gsm/inc/private.h

    b) Applied the profiling macros SAMCYCLES and worked out which parts of the code needed the most optimisation.

    c) I looked at the assembler output of various functions (gcc -S or -save-temps options) and modified the C code for better output, such as using the hardware loop supported by gcc 4.1. A lot of the original GSM code was written for older x86 compilers, and lots of compiler-specific mods were evident. In many cases to speed up code I just went back to vanilla C and the Blackfin compiler did a better job!

    d) By inspecting the assembler I found some important routines were making function calls inside their inner loops which is very inefficient. These were modified to remove the function calls.

    e) Use some assembler in the tightest, most cycle-hungry loops.
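
    To give a flavour of (a) and of removing function calls from inner loops, here is a minimal sketch in portable C of two typical codec primitives, a saturating 16-bit add and a rounding Q15 multiply, written as static inline functions so the compiler can fold them into the loop rather than making a call. The names sat_add16, mult_r_q15 and dot_q15 are hypothetical and this is not the Blackfin-specific code from the port:

```c
#include <stdint.h>

/* 16-bit add that clips at the limits instead of wrapping */
static inline int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Q15 multiply with rounding; the result is clipped so that
   -32768 * -32768 does not overflow 16 bits.
   Assumes >> on a negative value is an arithmetic shift (true for gcc). */
static inline int16_t mult_r_q15(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b + 16384) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;
    return (int16_t)p;
}

/* Inner loop built from the inline primitives: no function call overhead */
static int16_t dot_q15(const int16_t *x, const int16_t *h, int n)
{
    int16_t acc = 0;
    for (int k = 0; k < n; k++)
        acc = sat_add16(acc, mult_r_q15(x[k], h[k]));
    return acc;
}
```

    Checking the assembler output (gcc -S) is the way to confirm the calls really have been inlined on a given compiler and optimisation level.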

    Performance

    Using gcc 4.1 and testing on a Blackfin STAMP BF533 board:
    encode: 114,000 cycles/fr: (114,000/0.02s) = 5.7 MIPs
    decode: 39,000 cycles/fr: (39,000/0.02s) = 1.95 MIPs

    The initial number of cycles per encode was 274,000, decode 82,000.

    Further Work

    My gut feel is it might be possible to reduce the total (encode plus decode) cycles by perhaps another 30% with further optimisation.

    a) The analysis and synthesis filter functions consume about 50,000 cycles per encode/decode cycle, they could be converted to assembler.

    b) The RPE algorithm (rpe.c) could be optimised.

    c) Blackfin internal memory might speed some operations, such as autocorrelation.

    How To Profile

    I have written a set of macros (samcycles.h) to sample the Blackfin cycles counter. Here is an example on how to use them:

    a) Patch code.c:
    patch -p0 < code_profile.patch

    b) make, download tgsm and re-run on the target:
    root:/var/tmp> ./tgsm male.wav male.out
    start Gsm_Coder, 0
    Gsm_Preprocess, 5312
    Gsm_LPC_Analysis, 11406
    Gsm_Short_Term_Analysis_Filter, 23483
    Gsm_Long_Term_Predictor, 11525
    Gsm_RPE_Encoding, 8308
    Gsm_Long_Term_Predictor, 10947
    Gsm_RPE_Encoding, 5411
    Gsm_Long_Term_Predictor, 10701
    Gsm_RPE_Encoding, 5422
    Gsm_Long_Term_Predictor, 10696
    Gsm_RPE_Encoding, 5409
    end Gsm_Coder, 521
    TOTAL, 109141
    SNR = 10.4591 dB enc 115 dec 39 k cycles/frame
    root:/var/tmp>

    c) To investigate further, just add more SAMCYCLES() macros. It's a good idea to remove or disable the macros when you are finished, as they use a few thousand cycles:
    patch -R -p0 < code_profile.patch
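
    The listing in step b) is a series of "label, delta" lines followed by a total. The sketch below shows one way such sampling could be structured. It is not the contents of samcycles.h; the struct and function names are hypothetical, and the counter value is passed in by the caller so the code also builds on a host PC:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the idea behind SAMCYCLES(); not samcycles.h. */
struct cycle_profile {
    unsigned int last;    /* counter value at the previous sample */
    unsigned int total;   /* sum of all deltas so far */
    int started;          /* 0 until the first sample is taken */
};

/* Record a sample: print "label, delta" and return the delta.
 * `now` is the current counter value, e.g. read from CYCLES. */
unsigned int profile_sample(struct cycle_profile *p, const char *label,
                            unsigned int now)
{
    unsigned int delta;

    assert(p != NULL && label != NULL);
    /* The first sample is the zero point; unsigned subtraction
     * still gives the right delta if the counter wraps. */
    delta = p->started ? now - p->last : 0;
    p->last = now;
    p->started = 1;
    p->total += delta;
    printf("%s, %u\n", label, delta);
    return delta;
}
```

    Each call reports the cycles since the previous call, so the deltas add up to the total, just as the per-function figures in the listing above sum to 109141.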

    Thanks

    To Jean-Marc Valin and the Speex project, I used some of their assembler code (see COPYING.xiph for the copyright message related to this code).

    How to make your Blackfin fly Part 1

    The Blackfin processor is one of the fastest DSPs available today. It also runs uClinux and has a great open source community and there are even open (free) hardware designs available.

    I am interested in using the Blackfin for telephony applications, where DSP grunt is required for codecs and echo cancellation. Now that I have a reasonable port of Asterisk running on the Blackfin, I am exploring the DSP capabilities of the Blackfin.

    Boring Mathematical Bit

    As a first step I have written a test program called cycles.c that demonstrates how to optimise the Blackfin for DSP operations. A tar-ball including a Makefile is here.

    The sample code just finds the dot product of two vectors:
    int dot(short *x, short *y, int len)
    {
        int i, dot;

        dot = 0;
        for (i = 0; i < len; i++)
            dot += x[i] * y[i];    /* one multiply-accumulate per sample */

        return dot;
    }

    It’s a really common operation for DSP, and DSP hardware is carefully designed to compute dot products efficiently. Actually that’s all a DSP really is, a processor designed to compute dot-products quickly.

    The core operation is called a multiply-accumulate, or MAC. One multiply, one add. A DSP chip is defined by how fast this can be done.

    Theoretically, the Blackfin can perform two MACs in a clock cycle. That means on a 500MHz Blackfin you get 1000 million MACs per second.

    Down to Business

    Enough talk, here is a run of the sample code from my BF537 STAMP:

    root:/var/tmp> ./cycles
    Theoretical best case is N/2 = 50 cycles
    Test 1: Vanilla C
    ret = 100: run time:
    3838 3507 3373 3408 3373 3373 3373 3373 3373 3373
    Test 2: data in external memory, outboard cycles function
    ret = 100: run time:
    442 240 239 218 218 218 218 218 218 218
    Test 3: data in external memory, inboard cycles
    ret = 100: run time:
    242 103 103 103 103 103 103 103 103 103
    Test 4: data in internal memory, inboard cycles
    ret = 100: run time:
    214 53 53 53 53 53 53 53 53 53

    A low number of cycles is good. A 100 point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks.
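
    The "N/2" best case follows directly from two MACs per clock. A small sketch of that bound, with a made-up function name:

```c
#include <assert.h>

/* Lower bound on cycles for an n-point dot product on a core that
 * retires `macs_per_cycle` multiply-accumulates per clock (2 for the
 * Blackfin case described above). Rounds up when n is not a
 * multiple of macs_per_cycle. Hypothetical helper. */
unsigned int dot_best_case_cycles(unsigned int n, unsigned int macs_per_cycle)
{
    assert(macs_per_cycle > 0);
    return (n + macs_per_cycle - 1) / macs_per_cycle;
}
```

    dot_best_case_cycles(100, 2) is 50, so the 53 cycles of Test 4 is within 3 cycles (6%) of the bound.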

    Each test runs 10 times. In several of the tests you can see the number of cycles reducing as the instruction and data caches get loaded over successive runs.

    The Blackfin has a handy CYCLES register that tells you how many clock cycles have passed. By sampling this before and after the function-under-test you can measure how long the function takes to execute. I wrote a simple C function to read this register:
    int cycles() {
    int ret;

    __asm__ __volatile__
    (
    "%0 = CYCLES;\n\t"
    : "=&d" (ret)
    :
    : "R1"
    );

    return ret;
    }

    Between Test 2 and Test 3 I moved the CYCLES register sampling inside the dot product function. The C-function version was consuming too many clock cycles, Jean-Marc suggested this was due to cache misses when you perform function calls. I suppose as an alternative I could have inlined the cycles() function.
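
    For reference, here is a sketch of what the inlined alternative might look like. The Blackfin branch reuses the asm from cycles() above. Selecting it with the __bfin__ predefined macro and falling back to clock() elsewhere are my assumptions, added so the sketch also compiles on a host PC. I haven't measured whether inlining recovers the cycles lost in Test 2.

```c
#include <assert.h>
#include <time.h>

/* Inlined counter read. The __bfin__ test and the clock() fallback
 * are assumptions; only the asm statement comes from cycles(). */
static inline unsigned int cycles_inline(void)
{
#if defined(__bfin__)
    unsigned int ret;

    __asm__ __volatile__("%0 = CYCLES;\n\t" : "=&d" (ret) : : "R1");
    return ret;
#else
    return (unsigned int)clock();   /* host stand-in, not a cycle count */
#endif
}

/* Sample the counter before and after one call of fn(). Unsigned
 * subtraction copes with the counter wrapping between samples. */
unsigned int time_call(void (*fn)(void))
{
    unsigned int start, stop;

    assert(fn != NULL);
    start = cycles_inline();
    fn();
    stop = cycles_inline();
    return stop - start;
}
```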

    For best performance place the input vectors into different banks of internal memory. Test 3 and Test 4 show how clock cycles can be halved using this technique. In Test 3 the arrays are initially in SDRAM, after a run they get to L1 cache, but they are still in the same bank of physical memory, hence a 100% speed penalty.

    Allocating Internal Memory

    At the time of writing I understand there are kernel-mode malloc functions for obtaining blocks of internal memory, but I am not sure about how to access them in user mode. So I hacked it:
    /* I know, I know - this is very naughty :-) */
    short *x=(short*)0xff904000 - N*sizeof(short); /* Top of Data B SRAM */
    short *y=(short*)0xff804000 - N*sizeof(short); /* Top of Data A SRAM */
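
    One thing to note about the hack: the subtraction is done on a short pointer, so it steps back N*sizeof(short) elements, which is twice as many bytes as an N-element array needs. That is harmless here, but it reserves more internal memory than necessary. The sketch below does the same calculation in bytes; the helper name is hypothetical and the address is only computed, never dereferenced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start address of an n-element short array that ends at bank_top,
 * worked out in bytes. Hypothetical helper; nothing is dereferenced,
 * so it is safe to run on a host PC. */
uintptr_t sram_array_base(uintptr_t bank_top, size_t n)
{
    uintptr_t bytes = (uintptr_t)(n * sizeof(short));

    assert(bytes <= bank_top);
    return bank_top - bytes;
}
```

    On the target the result would be cast to short *, for example (short *)sram_array_base(0xff904000, N) for the Data B bank.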

    I am sure I will be condemned to uClinux-hell for this, but hey, I got my 50 cycles, didn’t I?

    BTW I haven’t turned any optimisation flags on for the C code, as my gut feel was the difference wouldn’t be significant compared to what hand-optimised assembler can produce.

    Summary

    Even though the Blackfin is designed for DSP, it is really easy to slow your DSP program down by a factor of about 70 (3838/53 between Test 1 and Test 4). However with a little optimisation, and some hand coded assembler, it is possible to get full performance from the chip.

    I know hand-coding optimised assembler sounds terrible, but usually it’s just a few “inner loop” routines. The whole cycles.c program took me about 2 hours to write (having Jean-Marc’s samples handy was very useful), and it was my first attempt at Blackfin assembler. So it’s no big deal, especially given the speed increases you can obtain.

    Acknowledgements

    Thanks to Jean-Marc Valin of Speex for his comments and code samples. He really has done a fantastic job with Speex, all that optimised fixed point DSP code makes my head spin!

    Measuring Stack Usage in Multi-threaded uClinux Apps

    In regular Linux the MMU allows the stack to grow dynamically, the MMU just allocates more physical pages. However in uClinux, the correct amount of stack for each thread must be allocated before the thread is created.

    Too little stack and your program will corrupt the system in nasty, unpredictable ways. Thread stack gets malloced from the system heap, so an overflow means writes to an arbitrary address just outside the block of memory allocated to the thread. This memory could possibly be in use by other parts of the system, perhaps for a different application. These sorts of bugs can be very difficult to track down.

    If you allocate too much stack, then you are wasting memory, a valuable resource on embedded systems. For example I discovered I was allocating far too much stack and wasting Mbytes of memory, especially when multiple threads were running.

    The standard approach is to try random values of stack until you find one that works. However I thought it might be a better idea to actually measure the amount of stack used by each thread. Then I could tweak the stack allocation to optimise memory usage and even check for stack overflows at run time.

    Threadstack Library

    I have written a small library (called threadstack):
    unsigned int threadstack_free(pthread_t *thread);
    unsigned int threadstack_used(pthread_t *thread);

    Here is a sample run on my Blackfin BF537 STAMP, when a 100k stack was allocated to a thread:
    root:/var/tmp> ./test_threadstack
    stack used: 1300
    stack free: 100620
    root:/var/tmp>

    How it Works

    The functions work by examining memory allocated to the stack. The theory is that if the memory is non-zero, then it must have been used by the thread at some time (the entire block of memory used for the stack is initially set to 0 before the thread starts). So the routines search the stack memory for the first non-zero value, and that is declared the “high water mark” – the point where the stack reached its maximum.
    0xff <- stack top
    0xff
    0xfe <- high water mark
    0x00
    0x00
    0x00 <- stack bottom

    The high water mark will change over time, so after your thread has been running for a while is the best time to measure stack usage.

    One weakness with this approach is that if stack allocation is way too low your program may bomb before these routines get a chance to run. However in that case you will at least know there is a problem, and can increase stack to some high number (e.g. Mbytes) to get the program running, before using these functions to determine actual stack requirements.
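
    The scan described above can be sketched in a few lines of C. This is not the threadstack source, just an illustration with made-up function names. It assumes the stack grows downwards from the top of the block (as in the diagram) and that the caller already knows where the block starts and how big it is:

```c
#include <assert.h>
#include <stddef.h>

/* Count the untouched (still zero) bytes at the low-address end of a
 * stack block. Assumes the stack grows down from the top of the
 * block. Hypothetical helper, not the threadstack source. */
size_t stack_free_bytes(const unsigned char *stack_bottom, size_t size)
{
    size_t i;

    assert(stack_bottom != NULL);
    for (i = 0; i < size; i++) {
        if (stack_bottom[i] != 0)
            break;          /* first non-zero byte: the high water mark */
    }
    return i;
}

/* Bytes from the high water mark up to the top of the block. */
size_t stack_used_bytes(const unsigned char *stack_bottom, size_t size)
{
    return size - stack_free_bytes(stack_bottom, size);
}
```

    For the six byte block in the diagram, stack_free_bytes() returns 3 and stack_used_bytes() returns 3. A thread that only ever writes zeros at its deepest point would be under-counted, so treat the result as a lower bound on usage.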

    Usage

    In my uClinux Asterisk port I have added code to check for stack overflow just before a thread ends:
    pthread_t thread = pthread_self();
    assert(threadstack_free(&thread) > 10*1024);

    This code checks that while the thread was running, the minimum free stack was 10k. The assert will kill the program with an error message and tell me straight away I need more stack. Much nicer than getting an obscure bug in the system due to a stack overflow on a thread. Now the program finds stack overflow bugs for me!

    The example above runs from within the actual thread itself, hence the call to pthread_self() to discover the thread’s handle. You can also call the functions from another thread (e.g. the main thread), for example to periodically meter stack usage.

    Links

    More information on multi-threaded applications for uClinux

    Porting multi-threaded apps to uClinux

    I have recently been working on improving the stability of uCasterisk, a port of Asterisk to uClinux. This required some research into memory management for multi-threaded apps on uClinux. I didn’t find any one resource that had everything I needed to know so I thought I would collate some of the information I found here as a resource for others. Thanks to all those (especially on the Blackfin forums) who helped answer my questions.

    I am using the Blackfin flavour of uClinux and the uCasterisk application as an example, but this information should apply equally to other uClinux systems/applications.

    MMU versus no-MMU

    Asterisk is a pretty big application for uClinux; the executable is about 2.5M and when running several calls it can consume 32M of system memory. The big difference between uCasterisk and other Asterisk implementations is the lack of an MMU. An MMU is handy when working with large, multi-threaded apps. For example when a thread is kicked off you can allocate a virtual stack of say 2M, but physical memory will only be allocated as and when it is actually required (say due to a write to a previously unused part of the stack). If your thread never uses all of the stack, then the physical memory is available for other users.

    On an MMU-less system you need to work out the maximum stack your thread may need, and allocate that. If you get it wrong, your application (and possibly the whole system) will bomb. This generally means you are wasting memory compared to the MMU case, as you always need to allocate the worst case amount of memory required.

    One possible advantage of MMU-less systems is no nasty surprises – any memory allocated really does exist, and no over-commitment is possible. On an MMU-based system physical memory isn’t actually allocated until you write to it, and it may be paged to disk just when you need it (although I understand there are options to control this behaviour).

    Stacks for Threads

    When you start an app, you get allocated a stack for the application. This is actually a stack for the main thread of the application. When you start a new thread (say with pthread_create()) the thread gets allocated a new stack from the system heap. The two stacks are completely unrelated. The size of each stack is independent, you control the size in different ways (see below).

    Tips for Porting to uClinux

    Don’t enable stack checking. This feature is very useful for single-threaded apps; it causes the operating system to kill the app when it uses all of its stack space. Very useful, as it tells you straight away to increase the stack size. Unfortunately at present this feature hasn’t been extended to multi-threaded applications; using it with multi-threaded apps (at least on the Blackfin) causes problems as pointed out in the 2005R4 RC2 release notes and discussed here.

    You control the application (main thread) stack with the -s option, on my Blackfin system the command line is:

    bfin-uclinux-gcc -Wl,-elf2flt='-s 1000000' \
    -o thread thread.c -pthread

    In this example the stack is set to 1000000 bytes.

    You control the size of the stack for each thread you create using pthread_attr_setstacksize(), for example (from the Asterisk utils.c file):

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 0x1000000);
    pthread_create(&thread, &attr, thread_func, NULL);
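
    The fragment above leaves out the declarations and error checks. Here is a fuller sketch using the same three pthread calls. The wrapper and the trivial thread body are hypothetical names for illustration, not code from Asterisk:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body for the usage note: hands its argument back. */
static void *example_thread(void *arg)
{
    return arg;
}

/* Start a thread with an explicit stack size. Returns 0 on success,
 * or the error number from the failing pthread call.
 * Hypothetical wrapper. */
int start_thread_with_stack(pthread_t *thread, size_t stack_bytes,
                            void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int err;

    assert(thread != NULL && fn != NULL);

    err = pthread_attr_init(&attr);
    if (err != 0)
        return err;

    err = pthread_attr_setstacksize(&attr, stack_bytes);
    if (err == 0)
        err = pthread_create(thread, &attr, fn, arg);

    pthread_attr_destroy(&attr);    /* not needed once the thread exists */
    return err;
}
```

    A call such as start_thread_with_stack(&thread, 256 * 1024, example_thread, NULL) asks for a 256k stack. Checking the return value matters on uClinux, since a large request can fail when the heap cannot supply one block of that size.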

    Monitoring Memory Usage

    cat /proc/meminfo can be very useful, here is the output from my Blackfin STAMP BF533 board, taken while uCasterisk was running with several SIP calls in progress:

    root:/var/log/asterisk> cat /proc/meminfo
    MemTotal: 59784 kB
    MemFree: 11084 kB
    Buffers: 100 kB
    Cached: 4172 kB
    SwapCached: 0 kB
    Active: 3828 kB
    Inactive: 444 kB
    HighTotal: 0 kB
    HighFree: 0 kB
    LowTotal: 59784 kB
    LowFree: 11084 kB
    SwapTotal: 0 kB
    SwapFree: 0 kB
    Dirty: 4 kB
    Writeback: 0 kB
    Mapped: 0 kB
    Slab: 43744 kB
    CommitLimit: 29892 kB
    Committed_AS: 0 kB
    PageTables: 0 kB
    VmallocTotal: 0 kB
    VmallocUsed: 0 kB
    VmallocChunk: 0 kB

    The most important fields are MemFree (total system memory free) and Slab (system wide heap in use).

    In earlier versions of Linux the CommitLimit field indicated the maximum Slab was allowed to reach before processes were killed (with Out-Of-Memory errors). However on my distro I discovered by experiment that you can actually increase the Slab well beyond this limit, as indicated above. Looking at the __vm_enough_memory() function in the kernel source file uClinux-dist/linux-2.6.x/mm/nommu.c, it appears that the memory allocator uses the OVERCOMMIT_GUESS method, which ignores the CommitLimit and allows up to 97% of memory to be allocated.

    It is interesting to observe MemFree as you perform different operations. For example on uCasterisk when a new SIP call starts, a thread is created, which requires stack and heap space. I also noticed MemFree decreasing when I copied files on a ram file system – this caught me for a while as uCasterisk was chewing through available system memory writing Call Data Records to the ram disk and eventually causing Out of Memory errors.
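
    If you want to watch MemFree or Slab from inside a program rather than by hand, the text of /proc/meminfo is easy to parse. A minimal sketch, assuming the "Name: value kB" layout shown above (the function name is made up):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Look up one field (e.g. "MemFree") in the text of /proc/meminfo
 * and return its value in kB, or -1 if the field is not present.
 * Hypothetical helper; assumes one "Name: value kB" entry per line. */
long meminfo_field_kb(const char *text, const char *name)
{
    size_t len;
    const char *line = text;

    assert(text != NULL && name != NULL);
    len = strlen(name);

    while (line != NULL && *line != '\0') {
        /* The name must start the line and be followed by ':' */
        if (strncmp(line, name, len) == 0 && line[len] == ':')
            return strtol(line + len + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}
```

    Read the file into a buffer with fopen() and fread(), add a terminating zero, then call meminfo_field_kb(buf, "MemFree") before and after the operation you are interested in.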

    ps and top are also useful, as they indicate the amount of memory allocated to the system/application.

    Links

    CommitLimit and OOM Killer
    Why malloc is different under uClinux
    Application Debugging on the Blackfin
    Intro to Linux Apps on the Blackfin (skip to bottom of page)
    Blackfin forum thread where I asked some questions on this topic

    Summary

    I hope this was useful – pls email me or add a comment below if you have any comments/suggestions/corrections.