Tuesday 17 June 2014

By Far The Biggest Issue I Encounter in Wi-Fi Deployments Is…

… high airtime utilisation caused by infrastructure support for low data rates. Actually, it is one of the two biggest issues that I see; however, the solution for this issue is far more plug and play than the other. If you’re interested, the other issue is poor coverage caused by automatic AP transmit power algorithms (RRM) – more on that in a future post.

Now, this will not be a revelation for anyone working in Wi-Fi. Unfortunately, the vast majority of WLANs are deployed by people with minimal Wi-Fi knowledge. This post is not a criticism of those folks, however. Nor is its purpose to delve into the technical details behind this issue; rather, it is to look at the party that could essentially solve (or at the very least, significantly reduce) this issue overnight – the enterprise Wi-Fi vendors.

Out of the box, every enterprise vendor’s equipment that I’ve looked at ships with the lowest data rates enabled – the 1 and 2 Mbps rates from the original 802.11 standard and the 5.5 and 11 Mbps rates that the 802.11b amendment brought us. These rates came about 17 and 15 years ago, respectively, and yet vendors still ship equipment supporting these rates by default despite the devastation they cause.

The term ‘junk band’ is sometimes used to describe the 2.4 GHz band that these data rates operate in and as a reason why Wi-Fi often performs poorly in this band. The huge irony here is that many Wi-Fi deployments are by far the biggest contributor of ‘junk’ in the band – the junk, of course, being these airtime-hogging frames. Yes, non-Wi-Fi interference does consume some airtime (in many cases, less than it did when Wi-Fi was much younger!) and yes, APs outside of the customer’s WLAN also consume a portion of airtime, but often it is the low rates supported by the customer’s own WLAN that consume by far the largest amount of airtime. In addition, most non-Wi-Fi interfering devices do not operate 24/7, so the interference is sporadic. Many low data rate frames are on the air for as long as clients are using the WLAN (for example, 8 hours in the day) whilst others are sucking up airtime 24/7/365!
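To put a rough number on those 24/7/365 frames, here is a minimal back-of-the-envelope sketch of beacon airtime. The figures are my assumptions for illustration, not measurements from any deployment: a 300-byte beacon, the common 102.4 ms beacon interval, a 192 µs long preamble for the 802.11b rates and a 20 µs OFDM preamble. It ignores contention, OFDM symbol padding and every other frame type, so treat the result as an order-of-magnitude guide only.

```python
# Rough beacon airtime estimate. All default values are assumptions
# chosen for illustration; real beacon sizes and intervals vary.

LONG_PREAMBLE_US = 192   # 802.11b long preamble + PLCP header
OFDM_PREAMBLE_US = 20    # OFDM preamble + SIGNAL field


def frame_airtime_us(frame_bytes, rate_mbps, preamble_us):
    """Time on air for one frame: preamble plus payload at the given rate."""
    return preamble_us + (frame_bytes * 8) / rate_mbps


def beacon_airtime_percent(rate_mbps, preamble_us, ssids, aps,
                           frame_bytes=300, beacon_interval_ms=102.4):
    """Share of a channel's airtime spent on beacons alone.

    ssids: WLANs advertised per AP
    aps:   APs audible on the same channel
    """
    per_beacon_us = frame_airtime_us(frame_bytes, rate_mbps, preamble_us)
    total_us = per_beacon_us * ssids * aps
    return 100 * total_us / (beacon_interval_ms * 1000)


if __name__ == "__main__":
    low = beacon_airtime_percent(1, LONG_PREAMBLE_US, ssids=4, aps=3)
    high = beacon_airtime_percent(12, OFDM_PREAMBLE_US, ssids=4, aps=3)
    print(f"Beacons at 1 Mbps:  {low:.1f}% of airtime")
    print(f"Beacons at 12 Mbps: {high:.1f}% of airtime")
```

Under these assumptions, four SSIDs on three co-channel APs spend roughly 30% of the channel on beacons at 1 Mbps, against under 3% at 12 Mbps, and that is before a single client has sent any data.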

You may be thinking that it isn’t the vendor’s job to design the WLAN for the customer and that the vendor stresses the importance of disabling these low rates through documentation, training and vendor seminars. Whilst this is all true, it clearly isn’t enough or this wouldn’t be such a massive issue. The complexity of Wi-Fi is only matched, inversely, by the degree to which it is poorly understood. It just isn’t fair to push all of the blame on the customer. If the WLAN was deployed by a VAR then it may be fair to push some of the blame in their direction; however, once again, the reality is that most VARs, like customers, have minimal Wi-Fi knowledge.

These default rates also hurt the vendors. Numerous times the Wi-Fi vendors are blamed for a performance issue that is simply a result of a poor WLAN design. If these rates were disabled out of the box it would be one less (but significant) issue that uninformed customers could throw back at the vendor.

There are certainly other default, 'out of the box' pieces of configuration that I feel should be changed, so why single this one out? Simply put, no other default that I’ve come across causes anywhere near as many issues and on such a large scale. Not only does it affect the customer’s WLAN but also neighbouring businesses and home users.

All vendors have qualified staff who realise this is an issue, so why haven’t they made a change? Most likely because, yes, there are still some 802.11b (and even some problematic 802.11g/n) clients out there and by disabling these rates out of the box, these clients will be unable to associate. But so what? Out of the box, many things have to be configured to work and this will just be another. For example, there is a good chance that the out of the box WLAN you create will only support WPA2/AES by default. So if you have clients that only support WPA/TKIP they’re not going to be able to associate. You’re going to have to change those defaults to support your legacy clients. How is this any different? In fact it would be preferable if clients could NOT associate due to these issues. At least this way, the problem would be identified and fixed before the WLAN went into production. Most low data rate utilisation issues persist for a long time, often years, and many will never be fixed.

It doesn’t have to be a brute force approach either. I can see a number of options to ease customers into a life of low channel utilisation!
  • A setup wizard used to create the WLAN could ask whether the customer has any 802.11b clients that need to be supported (or problematic 802.11g/n clients that won’t associate with the rates disabled – ok, most customers won’t know this until they flip the switch!).
  • The setup wizard could ask what vertical the WLAN is being deployed in and if one of the likely candidates (retail, warehousing and healthcare) is chosen, suggest that low data rates may need to be left enabled but that the customer should start at 11 Mbps and work backwards. If these verticals are not selected, the low rates are disabled.
  • If customers do enable low data rates, the wizard might suggest that this configuration could have a significant negative impact on their WLAN and that if they must support low rates, to minimise the number of WLANs advertised on each AP.
  • Back in the day Cisco APs shipped with the default SSID of tsunami configured. Cisco obviously realized this was something of a security issue and removed the SSID from the default configuration, shipped with the radios disabled and put a nice bright yellow sticker on the AP box informing the customer of the fact. Maybe APs could have such a sticker or a slip put into the top of the AP or, where applicable, WLC box.

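The wizard logic in the list above could be sketched along these lines. This is purely hypothetical: no vendor’s wizard works this way as far as I know, and the function name, the vertical names and the choice of the OFDM rates as the ‘low rates disabled’ baseline are all my own assumptions.

```python
# Hypothetical setup-wizard logic for choosing 2.4 GHz data rates.
# Names and the baseline rate set are illustrative assumptions only.

DSSS_CCK_RATES = [1, 2, 5.5, 11]              # 802.11 / 802.11b rates (Mbps)
OFDM_RATES = [6, 9, 12, 18, 24, 36, 48, 54]   # 802.11g OFDM rates (Mbps)
LEGACY_VERTICALS = {"retail", "warehousing", "healthcare"}


def recommend_rates(vertical, has_11b_clients=False):
    """Return (enabled_rates, notes) for a new WLAN.

    Default: low (DSSS/CCK) rates disabled. If legacy clients are likely,
    start with 11 Mbps as the lowest rate and work backwards only if needed.
    """
    rates = list(OFDM_RATES)
    notes = []
    if has_11b_clients or vertical.lower() in LEGACY_VERTICALS:
        rates = [11] + rates
        notes.append("Low rates can significantly increase channel "
                     "utilisation; enable 5.5, 2 or 1 Mbps only if "
                     "clients fail to associate at 11 Mbps.")
        notes.append("Minimise the number of WLANs advertised on each AP.")
    return rates, notes


if __name__ == "__main__":
    for vertical in ("office", "Retail"):
        rates, notes = recommend_rates(vertical)
        print(vertical, "-> lowest rate", rates[0], "Mbps")
        for note in notes:
            print("  warning:", note)
```

The point of the sketch is only that the safe choice becomes the default and the customer has to opt in, with a warning, to anything lower.
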
So now you’re thinking, “oh but the 5 GHz band will save us!” A recent piece of customer trouble-shooting showed why this is short-sighted. The issue was severe performance problems. One of the first things I checked was the airtime utilisation reported by the APs; the highest I saw was a new record for me, an AP at 93% channel utilisation (beating my old record by 1%!). The rest of the APs weren’t much better. I couldn’t work it out at first though; 80% of the clients were associating to the 5 GHz band where the utilisation was typically low, so why such big performance issues? A look at client association history showed the majority of clients were fluctuating between bands. Yes, a driver update would likely have helped somewhat, but even the latest clients with the latest drivers may still prefer the 2.4 GHz band – I saw this often with early dual-band 802.11n clients. Whilst it’s been 7 years since these initial 802.11n clients came about and more client vendors have started to prefer 5 GHz over 2.4 GHz, this is not a universal truth. I expect a large enough percentage of 802.11ac clients will still make significant use of the 2.4 GHz band and therefore the importance of ‘cleaning up’ the band remains. This trouble-shooting experience was certainly not unique; I’ve seen this many times.

Yes, this proposed solution won’t help with all Wi-Fi airtime issues - non-Wi-Fi interference, external and internal ACI and CCI from SOHO rogues, non-overlapping ACI (AP co-location), sticky clients, clients probing at low rates, clients probing for every WLAN they’ve ever associated to, overly high AP density, overly high transmit power… the list goes on. It will however help with one significant Wi-Fi issue and one that has a very simple plug and play solution. Would this have been advisable 10 years ago? Of course not! Even 5 years ago? Perhaps not even then. But 17 years is a long time in technology circles – it’s time we moved on!

Lastly, I need to acknowledge the fantastic proposal from Cisco’s Brian Hart and Andrew Myles. In short, they’re proposing that the Wi-Fi Alliance start looking at making low data rates optional. Whilst I suspect that the impetus behind this proposal may have come from the issues seen in stadium Wi-Fi in recent years (1 Mbps probes + very high client density + very open space = chaos), it would obviously benefit all new Wi-Fi deployments where the equipment had this certification. But this leads to the next logical question, “how about cleaning up Cisco’s backyard first?” Obviously this isn’t a Cisco-specific issue but even if this proposal sees the light of day, it will likely be several years before we see it bear fruit. Why wait? – take the lead!

Despite the marketing claims of 802.11ac, Wi-Fi in the 2.4 GHz band is going to be around for the foreseeable future and it’s about time the mess was cleaned up!

Sunday 15 June 2014

You down with TDD (yeah you know me)

Recently I was performing a piece of wireless trouble-shooting and came across something I hadn't seen before. I was called out because of wireless issues. You know; those vague, all-too-common wireless issues!

Fast-forward to me being on-site. Whilst surveying I often try to simultaneously perform as many of the required tasks as is practical. So I performed a survey to check out the customer’s WLAN coverage, looked for internal and external CCI and ACI and performed a spectrum analysis. Later on came a spot of analysis and sniffing.

In one area I noticed a high level of utilisation on channels 44 and 48.

40 MHz Wi-Fi channel... right?
This was clearly a 40 MHz channel where a file transfer or something similar was occurring… wasn’t it? I looked at my survey results but none of the customer’s APs were on these channels in this area. I then took a look for rogue APs in the vicinity.

Found the culprit?
OK – this looks like it. According to the customer this AP was being used because the corporate Wi-Fi wasn’t working well. Well yes, that was indeed why I was on-site. To confirm what I was seeing I pulled out the Fluke AirCheck.

What the.....
Hold up, what do we have here? 89% of utilisation from non-Wi-Fi sources? The customer mentioned they were running a Raspberry Pi. I know nothing about the Pi and wondered if it was performing some non-Wi-Fi, Wi-Fi look-alike transmissions. Something along the lines of the Nuts About Nets AirHORN? A few questions later and it was established that the Pi was 2.4 GHz-only and that a dual-band Netgear wireless router was also in use. The Netgear was broadcasting the Swifty5 SSID pictured above. So was the Netgear to blame or was it a bug in the AirCheck reporting Wi-Fi transmissions as non-Wi-Fi? I powered off the Netgear and the non-Wi-Fi utilisation didn’t stop. I continued on with the survey, planning to return later on.

Later back at my desk I was going through my notes and remembered a screengrab from the WLAN controller I took the day before when doing some pre-visit preparation. I probably should have remembered this earlier but at least 18 hrs had elapsed! – so right there, you can see the problem!

Light bulb moment!

Ah ha! A quick check of the AP location confirmed it; TDD was the source. Yes, channel 36 is reported but later I noticed another AP in the area reporting TDD on channel 44 also. I had seen TDD transmitters detected by the APs’ on-board spectrum analyser previously and had seen reference to it in vendor documentation countless times; however, I had never delved any deeper. TDD stands for Time Division Duplex. Just from the name it sounded like something a licensed microwave, outdoor P2P link would use but was in fact operating in an unlicensed band. I suspected a P2P link mounted on a nearby building shooting a narrow beam of non-Wi-Fi ‘bite me’ through the customer’s building. Further analysis revealed this to be the case.

I suspected that what I could see on channels 36 + 40 in the first spectrum analysis image was another P2P link, albeit causing lower utilisation. A quick Google later and I suspect this may in fact be FDD – Frequency Division Duplexing with the uplink and downlink running on 36+40 and 44+48, respectively. Whilst the source was a continuous transmitter (100% utilisation while active), it did not operate 100% of the time, as some continuous transmitters do. The AirCheck showed it was bursty, which is what you may expect to see on a P2P link.
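When lining a spectrum trace up against channel numbers, it helps to translate the channels into frequencies. A minimal sketch, using the standard 5 GHz numbering (centre frequency in MHz = 5000 + 5 × channel) and assuming nominal 20 MHz channel widths; the function names are my own.

```python
# Map 5 GHz channel numbers to frequencies so spectrum traces can be
# compared with channel plans. Assumes nominal 20 MHz channel widths.

def centre_mhz(channel):
    """Centre frequency of a 5 GHz 20 MHz channel, in MHz."""
    return 5000 + 5 * channel


def bonded_span_mhz(primary, secondary):
    """(low edge, high edge) in MHz of a 40 MHz pair of adjacent channels."""
    centre = (centre_mhz(primary) + centre_mhz(secondary)) / 2
    return centre - 20, centre + 20


if __name__ == "__main__":
    for pair in ((36, 40), (44, 48)):
        low, high = bonded_span_mhz(*pair)
        print(f"Channels {pair[0]}+{pair[1]}: {low:.0f}-{high:.0f} MHz")
```

So the two suspected links sit side by side, together covering the whole of UNII-1 from 5170 to 5250 MHz.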

As you would hope, the result of these interferers is that the RRM algorithm in the wireless infrastructure has chosen to use other channels on this side of the building. I can see that another business on the bottom floor of the building is running an enterprise WLAN also and those APs have also chosen not to use these channels. Losing four channels is not ideal; fortunately, the customer is running 20 MHz channels so another eight are available (supporting UNII-2e is far from plug and play, particularly in this part of the world, so enabling these channels is unlikely). Before discovering this issue I was considering moving the customer to 40 MHz channels but that may not be worthwhile now.

As for the previously mentioned rogue AP that the customer had decided to use, it just happened to be running on the exact two channels that the interferer is running on. This presented a red herring whilst trouble-shooting due to the very similar signature (Wi-Fi vs. TDD). It also meant that the customer shot themselves in the foot – the SOHO wireless router remained on the problematic channels despite high utilisation whilst the enterprise WLAN performed as you would hope and didn’t use those channels. A pat on the back for me having tweaked the WLAN infrastructure’s AP spectrum analysis configuration 12 months earlier ;).

A few closing thoughts
  • Whilst many non-Wi-Fi interferers have unique signatures, some are misleadingly similar.
  • Metageek features I'd love to see in the future:
    • As much as I like Chanalyzer, I hope to see improved hardware from Metageek in the future to allow better signature detection to become a reality.
    • Tabbed support in Chanalyzer – it would really help when examining multiple files, post-capture.
    • Utilisation-specific 802.11 frame analysis; despite this example of severe non-Wi-Fi interference, the majority of interference I see is still from CCI. I’m not talking packet sniffer level stuff; even something as simple as what the AirCheck can do (x% Wi-Fi utilisation / x% non-Wi-Fi utilisation).
  • Whilst you shouldn’t rely on spectrum analysis signatures, they can certainly be helpful. Sure, you can purchase a whole bunch of non-Wi-Fi interferers for your lab in order to learn the different signatures (certainly a worthy venture) but you’re unlikely to ever get your hands on all of them – I’d certainly never have forked out for this interferer in order to learn its signature!
  • AP-based spectrum analysis is not a replacement for a stand-alone spectrum analyser and vice versa; they complement one another.
  • Although the majority of non-Wi-Fi interference is seen in the 2.4 GHz band, 5 GHz is not immune.
  • In Australia, much like the US, UNII-1 is restricted to indoor-use only. It is likely that a call to the ACMA (FCC equivalent) may be required. The US is in the process of opening up UNII-1 for outdoor use and I expect Australia will follow at some point.
  • When the utilisation was closer to 60% (when I initially noticed the issue) it was my backup device (the AirCheck) that raised the red flag that this utilisation wasn’t from Wi-Fi - my favourite new toy of late! 
  • Finally, a side-by-side – Wi-Fi vs. TDD/FDD. The amplitude differs as expected but the significant difference is the lack of side-lobes on the TDD/FDD.
    Wi-Fi (left) vs. TDD/FDD (right)
  • Despite the images above, in one instance (admittedly out of many) the TDD/FDD did actually show side lobes making it all the more difficult to identify. The 100% utilisation gives it away though.

If you’ve dealt with this type of interferer before and can provide any more detail, please leave a comment or hit me up on Twitter.