How difficult is it to work with TCP?

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow me on Twitter.

I am thinking of creating a new protocol on top of TCP. With all the good libraries out there, I assumed I would not have to do much work. But then I read this, and it gives me pause. Maybe TCP is tougher than I thought:

Zero windows are another frequent source of grief, to such lengths that at multiple mobile operators the technical staff have quizzed us extensively on our use of zero windows. I don’t quite know why zero windows have that reputation, but we have definitely seen that class of problems in the wild occasionally (for example the FreeBSD problem from a few years back was very annoying).

But here’s a new one we saw recently, which was good for an afternoon of puzzling and is a case that I hadn’t heard any scary stories about. A customer reported failures for a certain website when using our TCP implementation, but a success when using a standard one. Not consistently, though: there were multiple different failure and success cases.

Sometimes we were seeing connections hanging right after the handshake; the SYNACK would have no options set at all (a big red flag) and would advertise a zero window, and the server would never reply to any zero window probes or otherwise open any window space:

19:53:40.384444 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [S], seq 2054608140, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:53:40.779236 IP x.x.x.x.443 > 10.0.1.110.34098: Flags [S.], seq 3403190647, ack 2054608141, win 0, length 0
19:53:40.885177 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [S], seq 2054608140, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:53:41.189576 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 29200, length 0
19:53:41.189576 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 29200, length 0
19:53:42.189892 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0
19:53:43.391186 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0
19:53:44.832112 IP 10.0.1.110.34098 > x.x.x.x.443: Flags [.], ack 1, win 64000, length 0
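
The two red flags in this trace (a SYNACK with no options and a zero window) can be checked mechanically. What follows is a minimal sketch in Python, assuming tcpdump's default one-line text output as shown above. The function name `suspicious_synacks` and the regular expression are illustrative assumptions, not part of tcpdump or of the quoted author's tooling.

```python
import re

# Pull out the flags, the advertised window, and the options list (if any)
# from one line of tcpdump text output. Illustrative pattern only.
LINE_RE = re.compile(
    r"Flags \[(?P<flags>[^\]]+)\].*?win (?P<win>\d+)"
    r"(?:, options \[(?P<opts>[^\]]*)\])?"
)

def suspicious_synacks(lines):
    """Return SYN-ACK lines that advertise a zero window or carry no options."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m.group("flags") != "S.":
            continue  # not a SYN-ACK
        zero_window = int(m.group("win")) == 0
        no_options = m.group("opts") is None
        if zero_window or no_options:
            hits.append(line)
    return hits
```

Fed the lines of the trace above, this would flag only the second packet, the SYNACK with `win 0` and no options list.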

Other times the SYNACK would be a lot more reasonable looking, and the connection would work fine:

19:29:16.457114 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [S], seq 1336309505, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:29:17.264497 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [S.], seq 2619514903, ack 1336309506, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 6], length 0
19:29:17.264556 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [.], ack 1, win 229, length 0
19:29:17.265665 IP 10.0.1.110.33842 > x.x.x.x.443: Flags [P.], seq 1:305, ack 1, win 229, length 304
19:29:18.059278 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [.], ack 305, win 995, length 0
19:29:18.087425 IP x.x.x.x.443 > 10.0.1.110.33842: Flags [.], seq 1:1461, ack 305, win 1000, length 1460

And there were also occasions where we’d get back two SYNACKs with different sequence numbers, which of course didn’t always work too well:

19:37:41.677890 IP 10.0.1.110.33933 > x.x.x.x.443: Flags [S], seq 2689636737, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:37:41.877046 IP 10.0.1.110.33933 > x.x.x.x.443: Flags [S], seq 2689636737, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
19:37:42.076611 IP x.x.x.x.443 > 10.0.1.110.33933: Flags [S.], seq 3107565270, ack 2689636738, win 0, length 0
19:37:42.275471 IP x.x.x.x.443 > 10.0.1.110.33933: Flags [S.], seq 3109157454, ack 2689636738, win 0, length 0
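
The same kind of check works for the double SYNACK case: group SYNACKs by flow and report any flow that saw more than one initial sequence number. This is a sketch under the same assumption about tcpdump's text format; `conflicting_synacks` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Capture source, destination, and initial sequence number of each SYN-ACK.
SYNACK_RE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[S\.\], seq (?P<seq>\d+)"
)

def conflicting_synacks(lines):
    """Map (src, dst) to its sorted SYN-ACK sequence numbers when there are several."""
    seqs = defaultdict(set)
    for line in lines:
        m = SYNACK_RE.search(line)
        if m:
            seqs[(m.group("src"), m.group("dst"))].add(int(m.group("seq")))
    # A healthy handshake yields exactly one initial sequence number per flow.
    return {flow: sorted(s) for flow, s in seqs.items() if len(s) > 1}
```

For the trace above, the only flow reported would be the one from `x.x.x.x.443` to `10.0.1.110.33933`, with both sequence numbers listed.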

You might be able to guess the problem just from the above traces, but actually verifying it required quite a few attempts with slightly tweaked parameters to find the boundary conditions. Who knew that there are systems around that can’t handle receiving a duplicate SYN? The three different behaviors seem to correspond to no SYN being retransmitted, the retransmission arriving at the middlebox before it emits a SYNACK, and the retransmission arriving after the middlebox has emitted a SYNACK.

The middlebox was located in Australia, but most likely that IP was just a loadbalancer, transparent reverse proxy, or some similar form of traffic redirection with a real final destination somewhere in the US. When being accessed from Europe, this resulted in an aggregate RTT of something like 450-550ms. Our TCP implementation has a variable base SYN retransmit timeout, and in this case it was roughly 500ms. So most of the time the page load would fail with our TCP stack, but succeed with an off-the-shelf one that had a SYN retransmit timeout of 1 second.
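
To see why a 500ms timeout sits right on the boundary, here is a rough timing model of the three behaviors. It is a simplified sketch, not the quoted author's analysis. It assumes symmetric one-way delays, a client that cancels its SYN timer the moment a SYNACK arrives, and a middlebox that waits a fixed time (for example on its backend) before emitting a SYNACK. The function name and the split of the RTT into `to_middlebox_ms` and `middlebox_delay_ms` are hypothetical.

```python
def syn_retransmit_case(syn_rto_ms, to_middlebox_ms, middlebox_delay_ms):
    """Classify a handshake into one of the three behaviors.

    All arguments are integer milliseconds and are assumptions of this sketch:
      syn_rto_ms         -- client's initial SYN retransmit timeout
      to_middlebox_ms    -- one-way delay between client and middlebox
      middlebox_delay_ms -- time the middlebox waits before emitting a SYN-ACK
    """
    # Time at which the SYN-ACK reaches the client (the aggregate RTT).
    rtt_ms = 2 * to_middlebox_ms + middlebox_delay_ms
    if syn_rto_ms >= rtt_ms:
        return "no retransmit"
    # The retransmitted SYN reaches the middlebox at syn_rto + to_middlebox;
    # the SYN-ACK is emitted at to_middlebox + middlebox_delay.
    if syn_rto_ms < middlebox_delay_ms:
        return "retransmit arrives before SYN-ACK is emitted"
    return "retransmit arrives after SYN-ACK is emitted"
```

Under these assumptions, a 1000ms timeout never retransmits for any RTT in the 450-550ms range, while a 500ms timeout retransmits whenever the RTT exceeds 500ms. Which of the two failure modes follows depends on how much of the RTT is spent inside the middlebox.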

Source: https://www.snellman.net/blog/archive/2014-11-11-tcp-is-harder-than-it-looks.html