When it was but wasn’t WiFi’s fault – CWNE Essay 3
One of the problems with large organizations is that individual departments are often disjointed. I'd spent a month working on a problem at one of our remote distribution centers: countless hours and trips to the site to troubleshoot a wireless issue that both was and wasn't one. The underlying problem was that the application server sat 350ms away from the site, and nothing could be done to bring it closer. The application team would see this figure and immediately blame the network. Although we showed there was only a small variation between LAN and WiFi, the fact remained: the devices were WiFi only, and there was a performance issue.
We couldn't rule out the wireless as the cause; there were no doubt wireless issues at this site, as the design was poor. Omni-directional Access Points (APs) were used throughout the facility, mounted amongst the warehouse aisles. The site was 2.4GHz only, the clients were programmed to use only 802.11b, the Co-Channel Interference (CCI) was through the roof, and everything was set to 100mW. This incident quickly grew into a project.
After fixing the immediate flaws in the network by installing directional antennas in strategic places, implementing 5GHz, upgrading the clients so that 802.11g/n/a was available, and controlling CCI as much as possible, the problem still remained, even though the design was now more or less fixed.
I then turned to spectrum analysis and found a number of intermittent narrowband interferers. The site had deployed a number of motion detectors and cameras around the facility, each of which wirelessly transmitted a short burst of data back to a control unit every time motion was detected. The interference sat in the 2.4GHz band and was quite strong, with a high duty cycle. Clients were moved to 5GHz, and a request was placed to cable the devices.
The problem still existed, so I moved up the OSI model and began troubleshooting with a packet analyzer. I captured traffic close to the client, at the AP, and exiting to the WAN. The AP was configured for local switching, which made this task easier. My first step was to set up a single access point with a new SSID so that I didn't have to worry about roaming events. Then I performed a series of steps reproducing the problem. I noticed a small number of retransmissions in both directions between the client and the access point. The rate was around 3% over a 30-minute capture, so it was not considered an issue. Channel utilization throughout the facility averaged around 5%, also not an issue.
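The retry check above boils down to counting frames with the 802.11 Retry bit set. A minimal sketch of that calculation, assuming the flags byte of each frame's Frame Control field has already been extracted from a capture (the sample data here is hypothetical, not from the actual site):

```python
# Retry flag lives in the second byte of the 802.11 Frame Control field.
RETRY_BIT = 0x08

def retry_rate(fc_flag_bytes: list[int]) -> float:
    """Return the percentage of frames with the Retry bit set."""
    if not fc_flag_bytes:
        return 0.0
    retries = sum(1 for flags in fc_flag_bytes if flags & RETRY_BIT)
    return 100.0 * retries / len(fc_flag_bytes)

# 100 hypothetical frames, 3 of them retransmissions
sample = [RETRY_BIT] * 3 + [0x00] * 97
print(f"{retry_rate(sample):.1f}% retries")  # 3.0% retries
```

In practice a tool like Wireshark does this for you with a display filter on the retry flag, but the arithmetic is the same.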
The problem started to reveal itself once I moved past the 802.11 headers and into the payload. The data was being transmitted over Telnet, which meant I could analyze the application traffic in the clear. I noticed that the application was truncating the traffic, sending only 8 characters per frame. Combined with 350ms of latency, this meant the average barcode would take a minimum of 700ms to reach the server. Once this was resolved, performance improved dramatically. Still, one issue remained for a specific process used by the distribution team. We performed further packet captures with the same setup as before. This time the traffic was not truncated, but there was a 2-second delay on all return traffic. Once escalated to the application team, they found a corrupt index on a specific table; correcting it resolved the issue.
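The arithmetic behind the 700ms figure can be sketched as follows. This assumes each 8-character chunk waits a full round trip before the next is sent, and the 16-character barcode length is a hypothetical illustration; the 8-characters-per-frame truncation and 350ms latency are from the captures:

```python
import math

LATENCY_MS = 350      # round-trip latency to the application server
CHARS_PER_FRAME = 8   # observed Telnet truncation: 8 characters per frame

def transfer_time_ms(barcode_len: int) -> int:
    """Each 8-character chunk costs one round trip to the server."""
    frames = math.ceil(barcode_len / CHARS_PER_FRAME)
    return frames * LATENCY_MS

# A hypothetical 16-character barcode needs two frames, so two round trips
print(transfer_time_ms(16))  # 700
```

The fix removed the truncation, so a barcode fit in a single frame and paid the 350ms latency only once.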
Thankfully, the knowledge I'd taken from the CWAP curriculum allowed me to rectify the issues with the wireless network, and to defend it once those issues were fixed but the problem remained. I now find that almost half of my time troubleshooting the wireless network is spent defending it and proving that the issue lies in things like application performance, geographic latency, DHCP/DNS/captive portal behavior, or unrealistic expectations. At the end of the day, what appears to be a wireless issue most likely is not a wireless issue.