How to chase down a bug

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

This is a good story, and it communicates how insane things can get when one is chasing down a bug.

Having calculated the theoretical peak throughput, I decided there was no good reason this microprocessor shouldn’t be able to maintain a much higher level of throughput. Time to do some low-level packet analysis.

I set up Wireshark and started capturing packets. At first, everything seemed ok but looking at the timestamps showed clearly that the transmissions were very bursty. Sometimes there were delays of a few seconds between packets! No wonder it was taking so long for a full status dump… but what was causing this?
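As an aside to the quote: the "look at the timestamps" step can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the story. It assumes the capture times have already been exported as a plain list of seconds, and the function name `find_gaps`, the one-second threshold, and the sample values are all made up for the example.

```python
def find_gaps(timestamps, threshold=1.0):
    """Return (index, gap_seconds) for every packet that arrived more than
    `threshold` seconds after the packet before it."""
    gaps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if delta > threshold:
            gaps.append((i, delta))
    return gaps


# Hypothetical capture times in seconds (not data from the original story).
sample = [0.00, 0.01, 0.02, 2.52, 2.53, 5.03]

for index, gap in find_gaps(sample):
    print(f"packet {index} arrived {gap:.2f}s after the previous one")
```

A burst shows up as a run of small deltas followed by one large one, which is the pattern the quote describes.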

Looking at the IP layer, I decoded and inspected the session piece by piece, from the very first packet. SYN, SYN-ACK, ACK… All good so far. But after transmitting only a few data packets: NAK. Retries? Backoff? Delays! What on earth was going on? The trace showed the micro was resending packets it had successfully sent. Yet by matching up the sequence numbers, it showed the packets were being ACKed by the other end. Eventually after receiving a few out-of-order packets, the receiver tried to back off by increasing timeouts. This perfectly illustrates the bursty nature of the traffic. But what could be causing it?
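Matching sequence numbers against ACKs, as the quote describes, can also be sketched in code. The version below is a simplification under loud assumptions: the trace has been reduced by hand to a list of tuples, either `("tx", seq, length)` for a data segment from the micro or `("ack", ack_no)` for an acknowledgement from the other end, and sequence-number wraparound is ignored. The event format and the name `spurious_retransmits` are hypothetical, not something from the original article or from Wireshark.

```python
def spurious_retransmits(events):
    """Return the indexes of transmitted segments whose bytes had already
    been fully acknowledged by the receiver at the time they were sent."""
    highest_ack = None  # highest cumulative ACK number seen so far
    flagged = []
    for i, event in enumerate(events):
        if event[0] == "ack":
            ack_no = event[1]
            if highest_ack is None or ack_no > highest_ack:
                highest_ack = ack_no
        elif event[0] == "tx":
            _, seq, length = event
            # A cumulative ACK of N means every byte below N was received.
            if highest_ack is not None and seq + length <= highest_ack:
                flagged.append(i)
    return flagged


# Hypothetical trace: the first segment is sent, ACKed, and then sent again.
trace = [
    ("tx", 1, 100),
    ("ack", 101),
    ("tx", 101, 100),
    ("tx", 1, 100),
]
print(spurious_retransmits(trace))
```

Any index this returns is a segment the sender retransmitted even though the other end had already confirmed it, which is the odd behaviour the trace in the story revealed.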

Not leaving anything to chance, I tried changing Ethernet cables to make sure it wasn’t a dodgy connection causing the fault. No dice.

Post external references

  1. Source: http://antonym.org/2011/12/bug-of-the-year-2011.html