Why is the technology for blogs so difficult?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

The web is broken and can never be fixed. The only way forward is to start over again, to get rid of the Internet Protocol, the Transmission Control Protocol and the Hypertext Transfer Protocol, and to invent a new protocol that fixes the problems with the old protocols. I am doubtful this can be achieved.

Back in 2005, David Heinemeier Hansson offered a Rails tutorial showing how you could create a blog in 15 minutes.

This was a world changing moment. Everyone I knew watched that video and talked about it. Here was a huge shift away from the overly complex frameworks of the past, and yet here was a framework that really worked, something we could use instead of dealing with the chaos of writing everything ourselves.

You could build blog software in 15 minutes!

That was 2005.

Now it is 2014 and David Heinemeier Hansson’s blog is broken. I wanted to find a quote from something he wrote in 2006. (I no longer read him, because he no longer says interesting things, but there was a stretch in the early 00s when he did, so I read him then.)

I wanted to quote the article:

The inevitable destruction of the WS-Deathstar

This no longer exists as a standalone article; it survives only in the monthly archive and, in particular, all the comments are gone. There is a link that says “34 comments,” but it leads to a 404 page.

Creating blog software seems simple, yet the popular options seem unable to solve the longevity issue. I wonder if we misunderstand the problem.

Posterous closed in 2013.

In 2010, the makers of the ultrasimple blogging service Posterous went on an aggressive recruiting campaign to snatch users from “dying platforms.” The startup released 15 importers that made it easy for users to migrate their blogs and photos from services like Ning, Twitpic, and Blogger. “Whatever the reason, whatever the site, we want you to switch to Posterous,” the company said.

Anyone who was compelled by that campaign probably regrets it now. After growing quickly for three years, Posterous started struggling and pivoted to become Posterous Spaces, a Google Groups-like discussion app. When Twitter bought Posterous in the spring of last year, it was widely understood to be an “acqui-hire,” a way for Twitter to absorb the Posterous team. In February of this year, Posterous announced it would be shutting down all blogs after ten weeks.

David Winer has incorrectly argued that he invented weblogs (the distinction probably belongs to Blogger.com), and in any case his weblog service ended in disaster. Wikipedia disguises the rage that I recall existed at the time, the disgust that was aimed at Winer:

Weblogs.com provided a free ping-server used by many blogging applications, as well as free hosting to many bloggers. After leaving Userland, Winer claimed personal ownership of the site, and in mid-June 2004 he shut down its free blog-hosting service, citing lack of resources and personal problems. A swift and orderly migration off Winer’s server was facilitated by Rogers Cadenhead, whom Winer then hired to port the server to a more stable platform. In October, 2005, VeriSign bought the Weblogs.com ping-server from Winer and promised that its free services would remain free. The podcasting-related web site audio.weblogs.com was also included in the $2.3 million deal.

I love the line “swift and orderly migration off Winer’s server was facilitated by Rogers Cadenhead.” Remember that Winer later threatened Cadenhead with multiple lawsuits (Winer has a long history of threatening people with lawsuits).

Shelley Powers, an early weblog pioneer, has re-organized her site many times over the years, leaving most old links broken. For a while she favored sub-domains over directories, then she switched to directories, then she switched back. I read her blog every day, for many years, but nowadays, when I want to quote something, I find it difficult to find the old posts.

I started my blog in 2001, but I ran out of money, my web hosting company closed my account, and I did not have a backup, so I lost everything pre-2004. I started another blog that ran for some years, and I lost all of that too. My current blog only goes back to 2007, so the first 6 years of my blogging are lost.

The longevity issue is a serious one. From about 1750 to 2000 Western countries made great strides in improving the preservation of their knowledge, in particular with the spread of libraries. The Internet is revealing itself as a double-edged sword: it makes data more available, yet it is a transitory and ephemeral medium, enabling the loss of vast treasures of information. If I want to read how people felt about the re-election campaign of President Carter, and why people favored Reagan, I can read many thousands of essays published in many thousands of newspapers and preserved in many thousands of libraries. But if I want to read the great debates that surrounded the re-election campaign of George W. Bush, most of the great essays that I read are gone forever now, because they were on blogs that have either ceased to exist, or which re-formatted themselves in ways that made their old archives inaccessible.

Is there a solution to this problem? Not exactly: nothing will change the fundamental fact that TCP/IP is a transitory and ephemeral medium. But there are small steps we can take to make preservation a little easier. The thing that breaks down most often, in all of these systems, is the step where data is taken from a database and formatted through a template. Therefore, saving every blog post as a fixed HTML file is an important step in the right direction. And those files must be downloadable by whoever owns them, so the content can be exported as easily as possible. Raw HTML is the only truly universal language of the Web.
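To make this concrete, here is a minimal sketch, in Python, of “baking” posts out of a database into fixed HTML files. The database file, table, and column names (blog.db, posts, slug/title/body) are hypothetical; any real blog schema would differ.

```python
# A minimal sketch of "baking" blog posts from a database into
# permanent, template-free HTML files. The database file, table,
# and columns here are hypothetical.
import sqlite3
from pathlib import Path

def bake(db_path="blog.db", out_dir="baked"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    conn = sqlite3.connect(db_path)
    for slug, title, body in conn.execute(
            "SELECT slug, title, body FROM posts"):
        # One self-contained HTML file per post: if the database or
        # the templating engine dies someday, these files survive.
        html = (f'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8">'
                f"<title>{title}</title></head>\n<body>\n"
                f"<h1>{title}</h1>\n{body}\n</body>\n</html>\n")
        (out / f"{slug}.html").write_text(html, encoding="utf-8")
    conn.close()

if __name__ == "__main__":
    bake()
```

Once baked, each post is an ordinary file that any web server, or no web server at all, can deliver.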

By the way, are you using a JavaScript framework, such as Angular or Ember, to render your web pages? Do you realize there is absolutely no hope that these things will work 10 years from now?

[UPDATE May 12, 2014] — I just noticed another example of the disappearing web. Here is a post from 2008 wherein Violet Blue talks about being erased from the BoingBoing archives. She links to dozens of other sources, most of which 404. It is fascinating to realize how much the web is the opposite of what I once thought it was — because of the low cost of maintenance I thought it would be one of the most permanent forms of publishing, and yet it has turned out to be the most ephemeral type of publishing. Nothing lasts on the web.

Consider this link to the LA Times, a major media outlet:

http://latimesblogs.latimes.com/webscout/2008/07/xeni-jardin-and.html

I get a 404 error.

Using Google, this search:

https://www.google.com/search?q=xeni+jardin+site%3Alatimes.com&oq=xeni+jardin+site%3Alatimes.com&aqs=chrome..69i57.14103j0j4&sourceid=chrome&es_sm=91&ie=UTF-8#q=xeni+jardin+violet+blue+site:latimes.com

shows me only one article, which is this:

http://articles.latimes.com/2008/jul/09/entertainment/et-webscout9

but if I go here:

http://boingboing.net/2008/07/01/that-violet-blue-thi.html

the page links to 2 posts at the LA Times, with the titles:

(1) BoingBoing bloggers talk about Violet Blue controversy’s implications
(2) BoingBoing’s Xeni Jardin on unpublishing the Violet Blue posts

Google cannot find that second article, nor can the LA Times.

I will admit, this has been one of the biggest shocks I have felt about the web: the web disappears. The web is maturing in a manner very different from what I expected. Back in the 1990s I expected the web to be a permanent form of publishing, but the opposite is happening. Everything gets erased. Even when the companies are large and have substantial resources and tech teams, everything gets erased. My own experience certainly reflects this: all of the big media companies I have worked at have changed their URLs and lost their old articles.

I suspect that this problem is unfixable. The very forces that make it inexpensive to publish on the web also make the web impermanent.

In this particular case, because the turmoil was high-profile, someone created a spreadsheet that links all of the deleted posts to posts in the Wayback Machine:

https://spreadsheets.google.com/pub?key=pzVyO44trg7yCes1ugr7DFg

However, this is extremely fragile: the spreadsheet belongs to someone, and I assume their account will be shut down someday, as everything ends at some point, and the spreadsheet will disappear with it.

Also, there is the problem that, even on the Wayback Machine, most of the image links are broken:

http://web.archive.org/web/20060219113007/http://www.wired.com/news/culture/0,69907-0.html?tw=wn_tophead_1

Please note, the problem goes far beyond a few people being petty, or a few corporations cutting corners; it seems endemic to the hypertext protocol: URLs change all the time. Consider the irony: I was, just now, trying to learn more about Xeni Jardin deleting all of the posts of Violet Blue from BoingBoing, and I happened upon this old post of Violet Blue’s, which apparently once showed photos of Xeni Jardin and Violet Blue together:

http://www.tinynibbles.com/blogarchives/2006/10/at-kinkcom-with-xeni.html

All the images are gone. The image URLs look like this:

http://www.tinynibbles.com/DSC06226-thumb.jpg

This is seriously post-modern: Xeni Jardin erased all posts mentioning Violet Blue, and on Blue’s own blog, all the old image links eventually broke.

As an engineer, my first instinct is to think about ways this can be fixed. But the rational side of my brain is telling me that this cannot be fixed; rather, this is inherent to IP and TCP and HTTP: everything on the web will eventually disappear. To a very limited extent, a new protocol could fix some of the issues. In particular, a new protocol that inlines all images (that is, embeds the binary data for images inside the file, rather than referencing the image with a URL) would help keep images with their pages, but the problem of broken URLs would remain. The only way to fix link rot would be a new protocol that automatically establishes a second URL for every URL, with a text copy of the page stored there. The goal would be something like what the Wayback Machine does, but with the activity made a formal part of the transmission and routing protocols.
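As it happens, today’s HTML already has a crude approximation of the inlining idea: data URIs, which embed the image bytes, base64-encoded, directly in the page. Here is a rough sketch in Python; the file name post.html is hypothetical, and the regex is deliberately naive (a real tool would use an HTML parser).

```python
# A sketch of inlining images as data: URIs, so a page carries its
# own image bytes instead of referencing URLs that will someday 404.
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html_path):
    html = Path(html_path).read_text(encoding="utf-8")

    def embed(match):
        src = match.group(1)
        img = Path(src)
        if not img.exists():  # remote or already-rotted image; leave it
            return match.group(0)
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        data = base64.b64encode(img.read_bytes()).decode("ascii")
        return f'src="data:{mime};base64,{data}"'

    # Naive pattern, for illustration only.
    return re.sub(r'src="([^"]+)"', embed, html)

print(inline_images("post.html"))
```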

Also, this page:

http://www.tinynibbles.com/blogarchives/2006/10/jadore.html

has this text:

Xeni Jardin, Jonathan Moore, Haight and Masonic this afternoon. It looks like they’re plotting something — they are. I love this photo. Our conversation ranged from dating to cryptography, and far beyond. What a wonderful way to spend the afternoon. My other favorite photo is after the jump.

Aside from being a reminder that love does not last, the page also reminds us that the Web does not last: all the image links are broken.

There is some risk that my tone of voice will be misunderstood, so I should clarify: I think people who fight to maintain the integrity of links on the Web are heroes. I think everyone should do what they can to fight against link rot. But, at this point, I am convinced that the fight is almost entirely futile.

The first place I encountered the philosophy of “links on the web should be eternal” was the book Philip & Alex’s Guide to Web Publishing, written by Philip Greenspun in 1998. And, a few years later, that book probably gave me my first experience of how fragile this philosophy was. In the book Greenspun talks about the creation of Photo.net (I have no idea if the current site has anything in common with the site Greenspun built in the mid 90s), and he sounded confident about keeping the site up forever, because running a big server was cheap, especially if you had to run that big server anyway for some kind of business. But then around 2001/2002 Greenspun’s business went bankrupt, and it looked like he would have to close Photo.net. I think in the end he sold it to someone.

Many others have promoted the cause of link integrity, and I have been on the bandwagon too, since the 1990s. Erosblog commented on the Jardin/Blue fight using language that I think became an accepted belief for some of us, for a long time:

I’m on record as being something of a grump and a curmudgeon about the value of internet links — I think they’re valuable even when they’re trivial, and I get pissed when people smash them needlessly and in job lots. Apparently this idea of “links as valuable structure” is incomprehensible to plenty of smart people; that seems to be why I got such a negative reaction to my “vandals” post, and also to be why I got treated as a troll during the great Xeni Deletes Violet Blue kerfluffle. In that latter case, my expressed disappointment at the wholesale smashing of links was apparently just not believed by the Boing Boing moderator — and since it was assumed that I was raising arguments I didn’t believe in, the natural explanation was that I was trolling and/or taking sides in the bizarre personal fight that was going on behind the scenes.

and includes this quote from Cory Doctorow:

You and me and anyone who’s ever made a link between two web pages helped to create an underlying structure to the Internet – a citational structure that Google and other search engines come along and hoover up, and then analyse to see who links to which pages, which pages are most linked-to and therefore thought to be most authoritative, where those pages link to and how they’ve had their authority conferred on them. This sounds familiar to anyone who’s an academic – it’s more or less how citations work if you’re trying for a better job at the university, and of course Google was founded by a couple of PHD candidates; when all you’ve got is a hammer, everything looks like a nail.

Those are beautiful sentiments about the importance of URLs. We should all fight against link rot. However, the IP and TCP and HTTP protocols all work against us. We would need a new set of protocols if we actually cared about preservation of the web, and I am not sure the new protocols would be economical. The greatest thing about IP/TCP/HTTP is that it is very cheap.

DNS (the domain name system) currently forces anyone who wants a domain name on the Internet to sign up with a name registrar. It might be that we need something similar for URL 404s: a new protocol, replacing HTTP, that establishes a default fallback for every URL. If you wanted to publish via this new protocol, you would need to sign up with a registrar who promises to serve your content even when you are unable to. I imagine current CDNs would be happy to take the extra business. Again, I have doubts about whether this would ever be economical, and therefore I doubt there is any fix for the problem of impermanence on the web. (A client-side approximation of the fallback idea is sketched below.)
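Until such a protocol exists, the fallback idea can be approximated on the client side with today’s tools. Here is a minimal sketch, in Python, that tries the live URL first and, on failure, asks the Internet Archive’s public availability API for the closest Wayback Machine snapshot. The API endpoint is real; the error handling is deliberately minimal, and a real tool would need more care.

```python
# A sketch of the "default fallback for any URL" idea, approximated
# with today's tools: if the live URL fails, ask the Internet
# Archive's availability API for the closest Wayback Machine snapshot.
import json
import urllib.error
import urllib.parse
import urllib.request

def fetch_with_fallback(url):
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.URLError:
        pass  # live copy is gone (404, DNS failure, etc.); try the archive
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api) as resp:
        info = json.loads(resp.read().decode("utf-8"))
    snapshot = info.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        with urllib.request.urlopen(snapshot["url"]) as resp:
            return resp.read()
    raise LookupError("no live copy or archived snapshot of " + url)
```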
