Saving the web for posterity

I posted here about how knowledge on the web, and on digital media generally, disappears – risking the impoverishment of future historical research.

Just before I could post this follow-up, Jessica anticipated me and commented that I should try Archive.org. Well, guess what – this is all about that.

A recent interview with British Library chief Lynne Brindley in The Guardian discussed some positive efforts to archive the web, notably the San Francisco-based Internet Archive.

In San Francisco, the non-profit Internet Archive automatically scrapes parts of the web and its Wayback Machine allows people to surf back in time to see what their favourite sites looked like as far back as 1996. It already contains three petabytes of data, which equates to more than three million gigabytes.

All well and good. But what it doesn’t mention is that the Internet Archive itself is losing its digital information.

Way back, I used to know someone called Tim Worman, who became better known as Tim Polecat – lead singer of the UK rockabilly band The Polecats. We lost touch, obviously (I don’t really move in pop star circles), but about 10 years ago, I thought I’d see if he was on the web. 

He was! He had a fun site with all the usual stuff about his interests and current news – which also benefited from the fact that he was a good artist and designer, so it looked pretty cool.

I checked back every so often, but then a few years later was disappointed to find it no longer seemed to exist. Aha – but no: there it was. Archived by the Internet Archive and accessible through its Wayback Machine (though sadly without some of the graphics and MP3 downloads).

I visited occasionally and then – guess what? Yes – his site had vanished from the Internet Archive too. 

The obvious question, then, is what use is an internet archive that just archives for a few years? If Tim Polecat’s site was valuable at all, surely it should be kept in perpetuity. If it’s not actually valuable, then why keep it at all – for any length of time? 

Maybe the Internet Archive scrapes the web automatically and then real people wade through the content it stores to decide what’s valuable and what isn’t – a process that would obviously take a while. So perhaps his site was only archived until someone got a chance to have a look at it and then decide it was of no use. 

But that undermines the very principle of archiving ephemera that the British Library is so concerned about. After all, it is from some of the most trivial material that we gain some of our most important insights into the lives of ancient peoples. What they considered important at the time is not necessarily what concerns historians today – and we have no idea what future historians will want to know about us.


  • You pointed out some important things. But I also think that the present is way more … present to people than the future. And some might think of the future, but a lot of people only care for their immediate needs. I don’t say that’s good, I only describe my observations.

    History is always one way of showing things. It’s a problem with virtual data, they *are* ephemeral.

  • Just another thought, couldn’t it be deleted by accident? There seems to be no rating system… http://www.archive.org/about/about.php

    Or could this be an explanation:
    “How can I remove my site’s pages from the Wayback Machine?

    The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.”
    (quoted from http://www.archive.org/about/faqs.php#The_Wayback_Machine, 21 April 2009, 16:38)
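For anyone curious what that exclusion might look like in practice, here is a minimal sketch rather than anything official. It assumes the Internet Archive's crawler identifies itself as `ia_archiver` (an assumption, not stated in the FAQ excerpt above), and the `example.com` URL and the `is_allowed` helper are purely illustrative. It uses Python's standard `urllib.robotparser` to check how a rule-following crawler would read such a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one crawler to stay away from the whole site.
# "ia_archiver" is ASSUMED to be the Internet Archive crawler's user-agent name.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named crawler is excluded; a crawler with no matching rule is not.
archive_ok = is_allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html")
print(archive_ok, other_ok)
```

If the FAQ is read literally, a file like this, perhaps added by a later owner of the domain, could be enough to hide pages that were archived years earlier.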

  • freelanceunbound
    April 21st, 2009 at 3:36 pm

    Interesting – I think the owner of the site concerned in this case would not have noticed it was archived. I bet he wouldn’t have made an effort to remove it. But I could be wrong…

  • This is a fascinating subject. Have you found any good solutions for those who would like to preserve such sites? Do you hope your work on http://www.freelanceunbound.com/ will be preserved in such a manner or have you taken any steps to see that this is accomplished?

    The best I’ve been able to come up with so far is to use Google’s free services for things such as my blog. See the post at: http://timoey.blogspot.com/2006/04/longevity-durability-and-death.html

    While I don’t expect Google to be around forever, I think it has a good chance to be around a very long time. Plus I suspect that if Google wanes, whoever comes after will acquire Google’s assets. Google has established an amazing mission “to organize the world’s information and make it universally accessible and useful.”

    That goal is scary but also beneficial. So I’ve hooked my digital archives to Google in the hopes that this will preserve my own work for a very long time.


  • I had to smile at this – I find the idea that Freelance Unbound should be worth reading at all slightly bemusing, let alone be worth preserving for posterity. But it is, as you say, a fascinating subject – and very significant for future historians.

    Paradoxically, as information becomes ever more central to the modern world, it becomes far more ephemeral. A thousand years ago or more, writing was preserved largely because there was a religious imperative to do so. Writing was scarce and hence focused on a culture’s most important aspects (essentially, God). People went to great lengths to preserve sacred texts. Now, of course, there’s so much around that we don’t value it at all (we certainly don’t want to pay to read it much).

    But more importantly, there is so much information around that we don’t consider the possibility that it is vulnerable. We swim in the stuff, and we don’t notice that it may not be the same stuff that was around a few years ago.

    Frankly, if it’s the choice between information being preserved by monks because they believe they have a sacred duty to do so, or being preserved by a tech company, I know which I’d choose. Even if Google does think it is on a mission…

  • Jeff here at archive.org. Someone brought your comments to my attention.

    Please respond with the URL of his site so that I can do a quick check to see if it can be located in the archives.

    Retaining the immense number of sites created since 1996 presents many complexities, some technical, some legal. I can assure you that the Internet Archive does not delete any sites from the archive based on the merits of the content. But often there are legal considerations that come into play.

    I can’t promise I can find the site but I’ll certainly take a look.

    Best, Jeff

