Great programmers make simple mistakes

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

This sounds like the kind of mistake I’ve made on TMA, many times:

On a comment thread, a new user had posted some replies as siblings instead of children. I posted a comment explaining how HN worked. But then I decided to just fix it for him by doing some surgery in the repl. Unfortunately I used the wrong id for one of the comments and created a loop in the comment tree; I caused an item to be its own grandchild. After which, when anyone tried to view the thread, the server would try to generate an infinitely long page. The story in question was on the frontpage, so this happened a lot.

For some reason I didn’t check the comments after the surgery to see if they were in the right place. I must have been distracted by something. So I didn’t notice anything was wrong till a bit later when the server seemed to be swamped.

When I tailed the logs to see what was going on, the pattern looked a lot like what happens when HN runs short of memory and starts GCing too much. Whether it was that or something else, such problems can usually be fixed by restarting HN. So that’s what I did. But first, since I had been writing code that day, I pushed the latest version to the server. As long as I was going to have to restart HN, I might as well get a fresh version.

After I restarted HN, the problem was still there. So I guessed the problem must be due to something in the code I’d written that day, and tried reverting to the previous version, and restarting the server again. But the problem was still there. Then we (because by this point I’d managed to get hold of Nick Sivo, YC’s hacker in residence) tried reverting to the version of HN that was on the old server, and that didn’t work either. We knew that code had worked fine, so we figured the problem must be with the new server. So we tried to switch back to the old server. I don’t know if Nick succeeded, because in the middle of this I gave up and went to bed.

When I woke up this morning, Rtm had HN running on the new server. The bad thread was still there, but it had been pushed off the frontpage by newer stuff. So HN as a whole wasn’t dying, but there were still signs something was amiss, e.g. that /threads?id=pg didn’t work, because of the comment I made on the thread with the loop in it.

Eventually Rtm noticed that the problem seemed to be related to a certain item id. When I looked at the item on disk I realized what must have happened.

So I did some more surgery in the repl, this time more carefully, and everything seems fine now.

Source