The ever-expanding task of preserving the internet’s back pages

Within the walls of a beautiful former church in San Francisco’s Richmond neighborhood, racks of computer servers hum and flicker with activity. They contain the internet. Well, a very large amount of it.

The Internet Archive, a non-profit organization, has been collecting web pages since 1996 for its famous and beloved Wayback Machine. In 1997, the collection amounted to 2 terabytes of data. Colossal at the time; today you could fit it on a $50 thumb drive.

Today, archive founder Brewster Kahle tells me, the project is about to exceed 100 petabytes, about 50,000 times larger than it was in 1997. It contains more than 700 billion web pages.

The job isn’t getting any easier. Websites today are highly dynamic and change with every update. Walled gardens like Facebook are a source of great frustration for Kahle, who fears much of the political activity that has taken place on the platform may be lost to history if not properly captured. In the name of privacy and security, Facebook (and others) make scraping difficult.

News organization paywalls (like the FT’s) are also “problematic,” says Kahle. Preserving news used to be taken very seriously, but a change of ownership or even just a site redesign can mean content disappears. Technology journalist Kara Swisher recently complained that some of her early work at the Wall Street Journal has “gone to hell” after the paper refused to sell her the material several years ago.

As we begin to explore the possibilities of the metaverse, the work of the Internet Archive will only get more complex. Its mission is to “provide universal access to all knowledge” by archiving audio, video, video games, books, magazines and software. It is currently working to preserve the work of independent news organizations in Iran and is archiving Russian TV news. Sometimes keeping things online can be an act of justice, protest, or accountability.

Yet some question whether the Internet Archive has the right to provide the material. It is currently being sued by several major book publishers over its Open Library ebook lending platform, which allows users to borrow a limited number of ebooks for up to 14 days. Publishers say it is hurting their revenue.

Kahle says that’s ridiculous. He likes to describe the archive’s task as no different from that of a traditional library. But while a book won’t disappear from a shelf if its publisher goes bankrupt, digital content is more vulnerable. You cannot own a Netflix show. News articles are only available for as long as publishers want them to be. Even the songs we pay to download are rarely ours; they are merely licensed.

Determined not to rely on anyone else, the Internet Archive built its own server infrastructure, much of it hosted within the church, rather than using a third-party host like Amazon or Google. All of this comes at a cost of $25 million a year. A steal, says Kahle, noting that San Francisco’s public library system alone costs $171 million.

Unless we think today’s first draft of history isn’t worth preserving, the internet’s disappearing acts should concern us all. Consider how hollow the coverage of Queen Elizabeth’s death would have been had it not been illustrated with deep archival material.

Can we confidently say that the journalism produced around her death will be just as accessible in 20 years’ time? And what about all the social media posts made by ordinary people? We will come to regret not having competently preserved “everyday” life on the internet.

Dave Lee is an FT correspondent in San Francisco