Many large Internet companies violate the TCP protocol to boost their own speed

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

If everyone did this, it would put a terrible strain on the infrastructure of the Internet. These companies violate the TCP protocol in ways that are probably unfair to others.

My first step was to measure the load time of www.google.com over my home cable modem connection. As a first pass, I timed the download with curl:

$ time curl www.google.com > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8885    0  8885    0     0   115k      0 --:--:-- --:--:-- --:--:--  173k
real 0m0.085s

Holy smokes, that was fast! We were able to open a tcp connection, make an http request, receive an 8KB response, and close the connection all in 85ms! That’s even faster than I expected, and demonstrates that it should be possible to build an app with a page load time below the threshold that humans perceive as instantaneous (about 150ms, according to one study). Sign me up.

Curious about how they pulled that off (did someone sneak into my house and install a GGC node in the attic?), I fired up tcpdump to take a closer look. What I saw surprised me:

$ tcpdump -ttttt host www.google.com

# 3-way handshake (RTT 16ms)
00:00:00.000000 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [S], seq 2726806947, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 949329348 ecr 0,sackOK,eol], length 0
00:00:00.016255 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [S.], seq 3505557820, ack 2726806948, win 5672, options [mss 1430,sackOK,TS val 688795316 ecr 949329348,nop,wscale 6], length 0
00:00:00.016376 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 1, win 65535, options [nop,nop,TS val 949329348 ecr 688795316], length 0

# client sends request and server acks
00:00:00.017437 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [P.], seq 1:180, ack 1, win 65535, options [nop,nop,TS val 949329348 ecr 688795316], length 179
00:00:00.037139 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], ack 180, win 106, options [nop,nop,TS val 688795338 ecr 949329348], length 0

# server sends 8 segments in the space of 3ms (interspersed with client acks)
00:00:00.067151 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 1:1419, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 1
00:00:00.069693 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 1419:2837, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 2
00:00:00.069814 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 2837, win 65405, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.069918 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 2837:4255, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 3
00:00:00.070374 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 4255:4711, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 456 # segment 4
00:00:00.070486 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 4711, win 65525, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070796 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 4711:6129, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 5
00:00:00.070847 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [.], seq 6129:7547, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1418 # segment 6
00:00:00.070853 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 7547:8109, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 562 # segment 7
00:00:00.070876 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 7547, win 65228, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070900 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 8109, win 65512, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.070962 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [P.], seq 8109:9501, ack 180, win 106, options [nop,nop,TS val 688795368 ecr 949329348], length 1392 # segment 8
00:00:00.070990 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 9501, win 65408, options [nop,nop,TS val 949329349 ecr 688795368], length 0

# connection close (RTT 22 ms)
00:00:00.071300 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [F.], seq 180, ack 9501, win 65535, options [nop,nop,TS val 949329349 ecr 688795368], length 0
00:00:00.093299 IP 74.125.227.16.http > 192.168.1.21.52238: Flags [F.], seq 9501, ack 181, win 106, options [nop,nop,TS val 688795393 ecr 949329349], length 0
00:00:00.093469 IP 192.168.1.21.52238 > 74.125.227.16.http: Flags [.], ack 9502, win 65535, options [nop,nop,TS val 949329349 ecr 688795393], length 0

On the performance front, this is really exciting. They actually managed to deliver the whole response in just 70ms, 30ms of which was spent generating the response (come on Google, you can do better than 30ms). That means that a load time under 50ms should be possible.

How they accomplished that is what surprised me. The rate at which a server can send data over a new connection is limited by the tcp slow-start algorithm, which works as follows: The server maintains a congestion window which controls how many tcp segments it can send before receiving acks from the client. The server starts with a small initial window (IW), and then for each ack received from the client, increases the window size by one segment until it either reaches the client’s receive window size or encounters congestion. This allows the server to discover the true bandwidth of the path in a way that’s fair to other flows and minimizes congestion.
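The growth rule described above can be sketched in a few lines of Python. This is a simplified model rather than a real TCP stack: it assumes one ack per segment, no loss, and a hypothetical receive window of 44 segments (roughly a 64KB window divided by a 1460-byte MSS). The function name and parameters are illustrative only.

```python
def slow_start_windows(initial_window, receive_window, rounds):
    """Return the congestion window (in segments) at the start of each round trip.

    Simplified model: every segment is acked individually and nothing is lost.
    """
    windows = []
    cwnd = initial_window
    for _ in range(rounds):
        windows.append(cwnd)
        # Each of the cwnd segments in flight produces one ack, and each ack
        # grows the window by one segment, so the window doubles per round
        # trip until the receive window caps it.
        cwnd = min(cwnd + cwnd, receive_window)
    return windows


# Assumed values: IW of 3 segments, receive window of 44 segments.
print(slow_start_windows(initial_window=3, receive_window=44, rounds=5))
# prints [3, 6, 12, 24, 44]
```

Under these assumptions, a larger initial window simply starts the doubling sequence further along, which is why it saves round trips on short responses.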

If you look at the trace, though, you’ll notice that the server is actually sending the entire 8-segment response before there’s time for the first client ack to reach it. This is a clear violation of RFC 3390, which defines the following algorithm for determining the max IW:

The upper bound for the initial window is given more precisely in
(1): min (4*MSS, max (2*MSS, 4380 bytes))

Note: Sending a 1500 byte packet indicates a maximum segment size
(MSS) of 1460 bytes (assuming no IP or TCP options). Therefore,
limiting the initial window’s MSS to 4380 bytes allows the sender to
transmit three segments initially in the common case when using 1500
byte packets.
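As a quick check of the formula quoted above, here is a minimal Python sketch that evaluates the RFC 3390 upper bound for a given MSS. The function name is made up for illustration; only the `min`/`max` expression comes from the RFC text.

```python
def rfc3390_initial_window(mss):
    """Return (bytes, whole segments) allowed by the RFC 3390 upper bound."""
    iw_bytes = min(4 * mss, max(2 * mss, 4380))
    # Count only full-sized segments that fit within the byte limit.
    return iw_bytes, iw_bytes // mss


for mss in (536, 1430, 1460):
    iw_bytes, segments = rfc3390_initial_window(mss)
    print(f"MSS {mss}: IW <= {iw_bytes} bytes ({segments} segments)")
# prints:
# MSS 536: IW <= 2144 bytes (4 segments)
# MSS 1430: IW <= 4380 bytes (3 segments)
# MSS 1460: IW <= 4380 bytes (3 segments)
```

Both MSS values that appear in the traces (1430 and 1460) land on the same 4380-byte cap, so the formula allows three full-sized segments in either case.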

In the trace, www.google.com advertises an MSS of 1430 in its SYN-ACK (the client advertises 1460); with either value, the 4380-byte cap allows an IW of 3 segments according to the RFC. In our trace, they appear to be using an IW of at least 8, which allows them to shave off 2 round trips (~50ms) over an IW of 3 for this request. This raises the question: just how far will they go? Let’s request a larger file and see what happens:

$ tcpdump -i en1 -ttttt host www.google.com

# 3-way handshake (RTT 22ms)
00:00:00.000000 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [S], seq 2589435808, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 949341091 ecr 0,sackOK,eol], length 0
00:00:00.022780 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [S.], seq 4085145017, ack 2589435809, win 5672, options [mss 1430,sackOK,TS val 990595778 ecr 949341091,nop,wscale 6], length 0
00:00:00.022913 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 1, win 65535, options [nop,nop,TS val 949341092 ecr 990595778], length 0

# client request and server ack
00:00:00.023699 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [P.], seq 1:193, ack 1, win 65535, options [nop,nop,TS val 949341092 ecr 990595778], length 192
00:00:00.048205 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], ack 193, win 106, options [nop,nop,TS val 990595802 ecr 949341092], length 0

# server sends 9 segments in 4ms (interspersed with client acks)
00:00:00.082766 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 1:1419, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.083077 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 1419:2837, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.083118 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 2837, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.083284 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 2837:3966, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1129
00:00:00.083318 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 3966, win 65441, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.085550 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 3966:5384, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.085875 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 5384:6802, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.085976 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 6802, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086045 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 6802:8062, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1260
00:00:00.086067 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 8062, win 65425, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086601 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 8062:9480, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.086709 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 9480:10898, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1418
00:00:00.086728 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 10898, win 65405, options [nop,nop,TS val 949341092 ecr 990595836], length 0
00:00:00.086820 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 10898:12158, ack 193, win 106, options [nop,nop,TS val 990595836 ecr 949341092], length 1260
00:00:00.086836 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 12158, win 65425, options [nop,nop,TS val 949341092 ecr 990595836], length 0

# 24ms after first client ack was sent, we get 2 more segments
00:00:00.107116 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [.], seq 12158:13576, ack 193, win 106, options [nop,nop,TS val 990595860 ecr 949341092], length 1418
00:00:00.107403 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [P.], seq 13576:14651, ack 193, win 106, options [nop,nop,TS val 990595860 ecr 949341092], length 1075
00:00:00.107518 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 14651, win 65448, options [nop,nop,TS val 949341092 ecr 990595860], length 0

# connection close (RTT 25ms)
00:00:00.107938 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [F.], seq 193, ack 14651, win 65535, options [nop,nop,TS val 949341092 ecr 990595860], length 0
00:00:00.129947 IP 74.125.227.50.http > 192.168.1.21.52287: Flags [F.], seq 14651, ack 194, win 106, options [nop,nop,TS val 990595884 ecr 949341092], length 0
00:00:00.130071 IP 192.168.1.21.52287 > 74.125.227.50.http: Flags [.], ack 14652, win 65535, options [nop,nop,TS val 949341093 ecr 990595884], length 0

Interestingly, the server waits for ~1 RTT after sending 9 segments, indicating an IW of 9. This suggests that the value was tuned for the home page (or for the similarly-sized search results page).
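The same deduction can be made mechanically. The sketch below counts how many data segments the server sends before its first long pause, using the server send times from the second trace. The half-RTT threshold for what counts as a pause is an assumption of this sketch, not something taken from the original analysis, and the function name is hypothetical.

```python
def initial_burst_size(send_times, rtt):
    """Count segments sent before the first gap of at least half an RTT."""
    count = 1 if send_times else 0
    for previous, current in zip(send_times, send_times[1:]):
        if current - previous >= rtt / 2:
            break  # the server paused, presumably waiting for acks
        count += 1
    return count


# Timestamps (seconds) of the server's data segments in the larger download.
server_data_times = [
    0.082766, 0.083077, 0.083284, 0.085550, 0.085875, 0.086045,
    0.086601, 0.086709, 0.086820,  # first burst
    0.107116, 0.107403,            # arrives roughly one RTT later
]

print(initial_burst_size(server_data_times, rtt=0.022))
# prints 9
```

This only estimates the IW when the response is larger than the first burst; for the 8-segment home page, the count gives a lower bound.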

How Common is This?

So, is this common practice that I just never noticed before, or is Google the only one doing it? I thought I’d run traces against a few more sites and try to deduce their IWs. Here’s what I found:

akamai: 4
amazon: 3
cisco: 2
facebook: 4
limelightnetworks: 4
yahoo: 3

It looks like goosing the IW to 4 is pretty common practice, but I was about to give up on finding anyone pushing as far as Google until, almost as an afterthought, I tried www.microsoft.com.

Post external references

  1. http://blog.benstrong.com/2010/11/google-and-microsoft-cheat-on-slow.html