Mesh Potato Spectrum Analyser

For a while I have been thinking about how to fix the problem Wifi links in the Dili Village Telco. About 20% of the nodes either can’t make a phone call at all, or can’t make calls at certain times of day. We think the problems are due to interference, or perhaps Line of Sight (LOS), but it’s hard to tell. We need some more visibility into the problem. For the Village Telco in general, we need a way to fix these problems that non-geeks can understand. Ease of use is key.

Elektra pointed me at a console based Wifi analyser tool called Horst by Bruno Randolf. Horst captures any and all Wifi packets on the current channel then presents them in interesting ways. Horst can even scan over a range of Wifi channels. For the last few weeks I have been thinking of some way to scan Wifi activity, and Horst already has a framework that does just that. So I decided to add a Spectrum Analyser display mode to it. I found the Horst code really nice to work with; within an hour I was making useful changes and building up my new feature. This says a lot for Bruno’s code and the design of Horst.

Here is an example of the spec-an in operation, taken from a Mesh Potato (Node 36) on my roof:

Each “*” represents a packet from any radio on any other network; the numbers are the IPs of nearby Mesh Potatoes. To save space I only print the last octet of each IP. If there is a * above the level of your mesh IPs, it indicates an interference problem. So .21 is good, .14 is OK, and .56 is down in the mud. This is a good match to the ping results and batman scores.
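The reading rule above can be sketched in a few lines of Python. This is only an illustration and not code from Horst: the signal levels, the `margin_db` threshold and the function name are all made-up assumptions, chosen to mirror the .21/.14/.56 example.

```python
def classify_link(node_dbm, interferer_dbm, margin_db=10.0):
    """Compare a mesh node's signal with the strongest interfering packet.

    margin_db is an assumed threshold for a comfortable link,
    not a value taken from Horst.
    """
    headroom = node_dbm - interferer_dbm
    if headroom >= margin_db:
        return "good"
    if headroom > 0:
        return "OK"
    return "in the mud"


# Hypothetical readings in dBm, keyed by the last octet of each node's IP.
nodes = {21: -50.0, 14: -68.0, 56: -85.0}
# Hypothetical level of the highest "*" on the plot.
strongest_interferer = -72.0

report = {octet: classify_link(level, strongest_interferer)
          for octet, level in nodes.items()}
for octet, verdict in report.items():
    print(f".{octet}: {verdict}")
```

The point is that the useful number is the headroom between a node and the strongest interferer, not the node's signal strength on its own.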

But in general the spectrum around my mesh on Channel 10 isn’t very busy; there is more activity on the lower channels. I have asked Lemi to try this out in Dili, and see if it is useful for debugging his problem links.

Here is a README describing the patch and how to use it (note the patch may not apply cleanly as the Horst code has moved on). You can download Linux x86 and Mesh Potato executable versions (you also need this start-up script for the MP). It works really well on an x86 Linux machine and it’s amazing to see what other traffic is floating around your Wifi spectrum.

It’s still early days, and I am working together with Bruno to integrate a spec-an feature into Horst. The latest version of Horst is available here.

Interference/LOS Example

From node 10.130.1.36 (which I’ll just call 36) on my roof I have a good link to node 14, about 200m away. However the link to node 56 is poor. When I ping 56 I get poor results, that vary with the time of day.

To use Horst in scanning mode you need to connect to the Mesh Potato via Ethernet, as scanning all channels would cause the Wifi link to go down. As 14 is on a friend’s roof it wasn’t convenient to connect an Ethernet cable to it. So I ssh-ed into 14 via Wifi, and ran Horst in single channel mode. Here is the result:

This simple image tells me a lot. We can see a nice strong signal back to 36 (the node on my roof), but the link to 56 is down among interfering packets from other networks. The two nodes are only 400m apart so it’s not a signal strength issue caused by distance. It’s probably a Line of Sight (LOS) issue caused by trees and other obstructions like two-storey homes. This suggests I need to raise either 56 or 14 on higher masts. However that’s a problem I can’t fix, as tall masts are required to clear the local obstacles around here. At present 14 and 56 are mounted discreetly at TV antenna height on borrowed chimneys; it isn’t feasible (or polite to my friends) to put tall masts up at each house.

Even though I can’t fix the problem right now, it’s very nice to get a deeper understanding. Ping is great for telling you when a link is good, but doesn’t tell you much about why it is bad. Some interference comes and goes over the day. Ping can’t help you with this in advance; it can only tell you the state of the link right now. However the spec-an plots show the presence of nearby networks that may become interferers when those networks get busy.

Sometimes the Line of Sight (LOS) feature of Wifi can work for you. Because node 14 is not mounted very high, it doesn’t receive much interference from other networks. Node 36 on my roof is mounted on a higher mast, so receives more interference from other networks. This can be seen on the plots above. The top plot is taken from node 36; you can see node 14 is “OK” relative to the interference. However on the plot above taken from 14, 36 is way above the interference. This suggests another solution to my problem – some more nodes in the neighbourhood mounted low down but with good LOS to 14 and 56.

Now I wonder. Could we somehow present the spectrum information (like the presence and signal strength of packets from networks) in audio form, along the lines of the audio ping concept? Something that could be played through a telephone handset when you are half way up a mast. Hmmmm, this would mean integrating chunks of Horst into an Asterisk application.

WindSurfer Antenna

In Dili we have some cases of a strong signal (i.e. no LOS problems) but ping still indicates a poor link. This is probably interference. So once we have found an interference problem, how do we fix it? Well we have had some success in Dili using Nanostation 2s that sport a built-in directional antenna. The directional antenna can effectively block interfering signals. However Nanostations are expensive in Dili, and don’t do telephony.

I recall Bart and Paul Gardner-Steven mentioning some home made reflectors. So I Googled and found the Windsurfer. It really is a clever and elegant design. In 10 minutes of snipping and gluing I made one:

I ran across the road to the local park with a Mesh Potato and pointed it at another MP on the roof of my house. However the signal strength was bouncing around from packet to packet, making it hard to take any meaningful measurements. To smooth this out I added a low pass filter to Horst to average the signal strength for each node. Much better.
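For anyone unfamiliar with the idea, here is a minimal sketch of that kind of smoothing in Python. It is not the filter from the Horst patch: I am assuming a simple exponential moving average, and the `alpha` value and the sample readings are invented for illustration. It also averages the dB values directly, which is a simplification.

```python
def low_pass(samples, alpha=0.1):
    """Exponential moving average, a single-pole low pass filter.

    alpha close to 0 smooths heavily; alpha close to 1 tracks the
    raw input closely. The first sample seeds the average.
    """
    smoothed = []
    average = None
    for sample in samples:
        if average is None:
            average = float(sample)
        else:
            average += alpha * (sample - average)
        smoothed.append(average)
    return smoothed


# Hypothetical per-packet signal readings for one node, in dBm.
raw = [-59, -66, -54, -61, -57, -63, -58, -60]
filtered = low_pass(raw, alpha=0.2)

print("raw spread:     ", max(raw) - min(raw), "dB")
print("filtered spread:", round(max(filtered) - min(filtered), 2), "dB")
```

In practice one filter state would be kept per node (for example in a dictionary keyed by MAC address), so that each node's readings are smoothed independently.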

I selected a fixed channel (i.e. stopped scanning) and filtered on BSSID, so that all Horst could “see” was the MP on my house where I was pointing. The signal was reading -59/60 (+/- 1dB) with a 2dB omni. Then I changed to the 9dB Windsurfer antenna and got -52/53. Very nice.

Even better, when I turned the directional antenna around it attenuated the MP signal (and other signals in that direction) by 20dB. Great for removing interference. Quite amazing this can all be done on a low cost router, some clever software, with a cardboard + foil antenna that took 10 minutes to make.
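A quick sanity check on those numbers, written as a small Python sketch. The gains and readings come from the measurements above; taking the midpoint of each pair of readings is my own simplification, and the variable names are just for illustration.

```python
def db_to_power_ratio(db):
    """Convert a difference in dB to a linear power ratio."""
    return 10 ** (db / 10)


omni_gain_db = 2.0
windsurfer_gain_db = 9.0

# Midpoints of the -59/60 and -52/53 readings (a simplification).
omni_reading = -59.5
windsurfer_reading = -52.5

expected_improvement_db = windsurfer_gain_db - omni_gain_db
measured_improvement_db = windsurfer_reading - omni_reading

print("expected improvement:", expected_improvement_db, "dB")
print("measured improvement:", measured_improvement_db, "dB")

# 20dB of attenuation off the back of the antenna, as a power ratio.
print("rejection ratio:", db_to_power_ratio(20.0))
```

So the measured improvement of about 7dB lines up with the difference between the two nominal antenna gains, and 20dB of rejection means an interferer behind the antenna arrives at roughly one hundredth of the power.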

I have asked Lemi to try this antenna in Dili. It should give his Mesh Potatoes interference rejection equivalent to that of a Nanostation 2.

Wifi Tuning and 802.11b Signals

Mysteriously, it appears that Wifi packets transmitted on one channel can be received on other channels. At first I thought this was an artefact of the scanning process. However when I monitored a fixed channel (say 6), I still picked up packets transmitted on 4 and 10. With a Mesh Potato set to Channel 9 or 11, I could reliably ping other nodes on Channel 10.

This is apparent on the spectrum below of two radios within 1m of the machine running Horst. Here we are plotting the BSSID of each packet:

You can see very low level “babe” packets being received a long way from Channel 10 where they were transmitted. Likewise with packets from my AP on Channel 4 “88a3”. These beacons (or Batman broadcast messages) are sent at the lowest bit rate, a 1 Mbit/s 802.11b spread spectrum signal.

Here is the spectrum of such a signal on a commercial spectrum analyser (captured by Elektra and the helpful people at the CSIR last year):

Each vertical division is 10MHz. You can see significant energy 20MHz away from the centre. The side-lobes actually continue even further in each direction, but are beneath the noise floor of the spectrum analyser. Wifi channels are only 5MHz apart, so it’s easy to see that there is spread spectrum 802.11b energy at some distance from the nominal Wifi channel. This is why packets can be detected by receivers tuned well away from the transmit frequency, especially when the transmitter is nearby. I think this is only possible for the lower rates; 802.11g uses a different modulation scheme that is more sensitive to frequency offsets.

This also shows how nearby transmitters can mess up reception of more distant WiFi signals, even if the transmitters are tuned several channels away.

Here is a plot taken from a spectrum analyser in my home office. It is sampling two Wifi networks a few metres away. The centre is Channel 4. Three divisions (30MHz, 5MHz per channel) to the right is Channel 10. I am sending ping broadcasts on Channel 10. You can see the clear shape of 802.11b modulation. On Channel 4 it’s a mixture of 802.11b broadcast packets and other 802.11bg traffic from the AP. You can see that the side lobes just about overlap, despite being 6 channels away.
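The channel arithmetic used in this section is easy to check in Python. This is a small sketch using the standard 2.4GHz channel plan for channels 1 to 13 (centre frequency 2407 + 5 × channel, in MHz); the function names are my own, and the 10MHz per division figure is the analyser setting described above.

```python
def channel_centre_mhz(channel):
    """Centre frequency in MHz of a 2.4GHz Wifi channel (1 to 13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers channels 1 to 13")
    return 2407 + 5 * channel


def separation_mhz(channel_a, channel_b):
    """Distance in MHz between the centres of two channels."""
    return abs(channel_centre_mhz(channel_a) - channel_centre_mhz(channel_b))


MHZ_PER_DIVISION = 10  # analyser setting described in the text

# Channel pairs mentioned in this section.
for tx, rx in [(10, 9), (10, 6), (4, 10)]:
    sep = separation_mhz(tx, rx)
    print(f"ch {tx} -> ch {rx}: {sep} MHz, "
          f"{sep / MHZ_PER_DIVISION} divisions")
```

Channels 4 and 10 come out 30MHz apart, which is the three divisions seen on the plot, while a receiver one channel away from the transmitter is only 5MHz off centre.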

7 thoughts on “Mesh Potato Spectrum Analyser”

  1. Hello Mike,

    No QoS, we just have the policy of voice only on the Village Telco mesh networks. Not sure what the size of the tx queues is. Our evidence for the 20% of links that are poor links is pretty solid, e.g. we can’t ping a node. When the links are good, the calls sound great.

    David

  2. You’ll never see a problem when your links are *not* loaded; it’s latency under load that goes through the roof. Empty queues don’t cause problems.

    The Linux “ip” or “ifconfig” commands will tell you your transmit queue length, but many / most / all current network devices may also have excessive transmit rings in the device driver/hardware on top of the transmit queue.

    On the OLPC hardware (which implements a mesh), there is the transmit queue, and 4 packets of buffering out in the network device itself; I’m trying to check about the driver (and bus transport).

    1. Hi Jim,

      Yes we haven’t characterised Village Telco mesh networks under load yet; most of the time they are lightly loaded with just a few voice calls. From ifconfig/ip on a Mesh Potato I think the tx queue is 195:

      root@Mesh-Potato:~# ip link

      1: lo: mtu 16436 qdisc noqueue state UNKNOWN
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      2: eth0: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
      link/ether 00:09:45:57:71:0d brd ff:ff:ff:ff:ff:ff
      3: wifi0: mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 195
      link/ieee802.11 00:09:45:57:71:0c brd ff:ff:ff:ff:ff:ff
      4: ath0: mtu 1500 qdisc noqueue state UNKNOWN
      link/ether 00:09:45:57:71:0c brd ff:ff:ff:ff:ff:ff

      Actually the mesh networking we use (Layer 3 Batman algorithm) is quite different to the OLPC Mesh networking.

      But thanks for the tip Mike and Jim. I think that our current problems (e.g. packet loss on lightly loaded links) are not related to buffers. However it is something to watch for in the future as we scale up Village Telco networks.

      Cheers,

      David

      1. My mesh is also a layer 3 routing protocol (babel). Under poor conditions (rain, not interference, in my case) in addition to seeing packet loss, tcp retransmits, duplicate naks, and overall performance drop to 4Mbit/sec I would see latencies in the 400 ms range for basic udp packet types such as DNS and ping over three hops – in addition to loss!

        Señor Gettys basically identified the problem of bufferbloat, and in checking into my hardware I was horrified to discover that in some components (openrd) of the architecture I had a txqueuelen of 1000, and 256 DMA TX buffers. The radios themselves (nanostation m5s) were defaulting to txqueuelens of 1000 and DMA TX buffer sizes of 64, too. This meant that no matter how much traffic shaping I did in the higher levels of the stack, once the network got flooded, under poor conditions, it stayed that way.

        The steering wheel gets disconnected from the road, basically. I never noticed it under good (eg, lab) conditions.

        After knocking down the buffer size end to end to the lowest values I can – which are still too high – txqueuelen of 16 and DMA bufs of 64 – I am hoping to cut the dma bufs in the ath9k hardware down to 16 or less –

        and closely analyzing/fiddling with traffic shaping – I have dramatically reduced the overall jitter and delays of critical packets such as link local multicast (babel routing in my case), ping, and sip under such conditions as I’ve simulated since leaving Nicaragua (artificially cutting the rate on the radios down to 5.5Mbit/sec for example, misaligning the radios as another)

        Admittedly my net runs more and different kinds of traffic than yours.

        In your case, I think (?) you are using the ath5k driver, which, in grepping the code, appears to have 10 TX queues, each with *200* entries… which your txqueuelen of 195 feeds into.

        And your radios go as low as a Mbit.

        It’s not clear to me how packet traffic is classified into these buckets:

        enum ath5k_tx_queue_subtype {
            AR5K_WME_AC_BK = 0, /* Background traffic */
            AR5K_WME_AC_BE,     /* Best-effort (normal) traffic */
            AR5K_WME_AC_VI,     /* Video traffic */
            AR5K_WME_AC_VO,     /* Voice traffic */
        };

        and end up in the appropriate queues.

        AND… on top of all this the default TX_RETRY for this driver is:

        #define AR5K_INIT_TX_RETRY 10

        So a good theory at this point is that all the transmit buffers (dma, txqueue) are far too deep for voice traffic under bad conditions. Further, exploring traffic shaping (you aren’t doing ANY?) and packet marking so that voice traffic ends up in the right (and smaller) queues in the driver/hardware (higher priority and less aggressive TX_RETRY) seems worthwhile.
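To put a rough number on that theory, here is a back-of-the-envelope sketch in Python. It assumes the worst case of a completely full queue of MTU-sized packets, and it ignores 802.11 framing overhead, retries and the extra driver/DMA buffers, so real delays could differ. The queue length and MTU are taken from the `ip link` output earlier in this thread; the function name is just for illustration.

```python
def drain_time_s(packets, packet_bytes, rate_bit_per_s):
    """Time to transmit a full queue at a given link rate, in seconds."""
    return packets * packet_bytes * 8 / rate_bit_per_s


QLEN = 195   # qlen reported for wifi0
MTU = 1500   # bytes, as reported for wifi0

# 802.11b data rates in bit/s.
for rate in (1_000_000, 5_500_000, 11_000_000):
    delay = drain_time_s(QLEN, MTU, rate)
    print(f"{rate / 1e6:4.1f} Mbit/s: {delay:.2f} s to drain a full queue")
```

At 1 Mbit/s a full queue of that size takes over two seconds to drain, so a voice packet stuck behind it would arrive far too late to be useful.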

        ethtool -g device # will generally show you the DMA TX queue
        ethtool -G device tx N # will for some devices let you set it at run time.

        There appears to be no match between the driver’s 10 TX queues and the other internals, I’m still digging into the code…

        1. And lastly, I note that the bufferbloat problem may have been endemic on all the internet gateways down in Nica – with typical uplink speeds of 128KB/sec or (best case) about 1Mbit – critical packets were not getting through and tcp’s congestion control would fail to work, leading to all sorts of network misbehavior that brings back my own nightmares of the mid-80s congestion collapse….

          With a slow asymmetric internet network uplink you have to drop the priority of the tcp acks well below most other kinds of packets AND have short queues… I’ve got a guy experimenting with this in his cybercafe right now… the famed wondershaper is not quite aggressive enough.

          So all those other radios you are getting interference from are probably ALSO having a merry time being congested on their connections as well, and feeding back into the overall problem.

  3. From my experience in this field LOS (or lack of it) can cause many problems, especially when the traffic levels increase. This causes hidden node problems, where two or more sites which cannot “see” each other transmit at the same time to a common AP and therefore cannot (or do not) attempt any form of collision avoidance. The result is errors and dropped packets.
    Using master/slave modes and RTS/CTS type control for the host AP can improve performance, but not in all cases.
    Do a search for “hidden node” and see what you can find.

    Ian
