[ed. Excerpt from an interesting article describing the massive effort involved in archiving the web.]
by Ariel Bleicher
The task of preserving what's put online has proved, to no one's surprise, monumental. And it's only getting more so as the Internet expands, as Web sites become more dynamic, and as concern grows over online privacy. Increasingly, much of what people put online is being diffused across social networks and distributed through personalized apps on smartphones and tablet computers. The classic Web site, it seems, is already starting to slide toward obsolescence. "I'm convinced the Web as we know it will be gone in a few years' time," Illien says. "What we're doing in this library is trying to capture a trace of it." But to do even that is requiring engineers to build a new, more sophisticated generation of software robots, known as crawlers, to trawl the Web's vast and varied content.
Illien sees himself as a steward of an ancient tradition; he believes he is helping pioneer a revolution in the way society documents what it does and how it thinks. He points out that since the end of the 19th century, the French National Library has been storing sales catalogs from big department stores, including the famous Galeries Lafayette. "Today," he says, "this exceptional collection…is the best record we have of how people dressed back then and who was buying what." One day, he insists, the archives of eBay will be just as valuable. Capturing them, however, is a task that's very different from anything archivists have ever done.
The Web is regularly accessed and modified by as many as 2 billion people, in every country on Earth. It's a wild bazaar of scripting languages, file formats, media players, search interfaces, hidden databases, pay walls, pop-up advertisements, untraceable comments, public broadcasts, private conversations, and applications that can be navigated in an infinite number of ways. Finding and capturing even a substantial portion of it all would require development teams and computing resources as large as, or probably larger than, Google's.
But Google, aside from saving previously indexed pages for caching, has mostly abandoned the Webs of the past—the complete set of Web pages as they existed a month, six months, a year ago, and so on, back to a site's origins. Thus the job of preserving them has fallen to nonprofit foundations and small, overworked teams of engineers and curators at national libraries. Illien, for example, manages a group of nine.
Part of the difficulty in fetching the contents of the Web is that no one really knows how much is out there to be fetched. Brewster Kahle, a U.S. computer engineer who in the late 1980s invented the Wide Area Information Servers, a pre-Web publishing system, paid a visit to AltaVista's offices in Palo Alto, Calif., in 1995. He was shocked to see that the then-popular search engine had indexed 16 million Web pages "on a set of machines that were the size of two large Coke machines," he recalls. "You could actually wrap your arms around the Web."
The apparent compactness of the Web inspired Kahle to found, in San Francisco in 1996, the nonprofit Internet Archive. Wary of infringing on copyrights, AltaVista made sure to delete old pages in its cache. But the Internet Archive, emboldened by its status as a trustworthy nonprofit, was willing to be brazen. "We have an opportunity to one-up the Greeks," Kahle says, referring to the ancient philosophers who collected hundreds of thousands of papyrus scrolls in the great Library of Alexandria. The invention of the Internet, he argues, has made it possible to create an archive of human knowledge that anyone can access from anywhere on the planet. And Kahle, for one, wasn't going to let a bunch of lawyers talk him out of it.
By March 1997, he had compiled what was arguably the first true time capsule of the global Web. In fact, a substantial portion of the French National Library's electronic archive was simply bought from Kahle's Internet Archive. One of the archive's major successes has been its online access interface, called the Wayback Machine, which lets anyone who knows the address of a Web site see archived versions of its pages. Today the Internet Archive stores more than 2 petabytes of Web data in a portable Sun Microsystems (now Oracle America) data center built into a shipping container. Back in 1997, Kahle had captured nearly 2 terabytes, which he calculated was about a tenth the amount of text stored in the entire U.S. Library of Congress. It was a substantial collection of the Web of the time, but it wasn't nearly everything.
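For readers who want to poke at the archive programmatically, the Internet Archive also exposes a public "availability" endpoint that returns the capture of a page closest to a given date. The sketch below, in Python using only the standard library, is an illustration based on that publicly documented endpoint rather than anything described in this article; the response field names are assumptions drawn from the public API and may change.

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp="19970101"):
    """Ask the Wayback Machine's availability API for the archived
    capture of `url` closest to `timestamp` (YYYYMMDD...)."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(
        "https://archive.org/wayback/available?" + query
    ) as resp:
        data = json.load(resp)
    # The API returns an "archived_snapshots" object; "closest" is present
    # only when at least one capture of the page exists.
    return data.get("archived_snapshots", {}).get("closest")

snap = closest_snapshot("example.com", "19970401")
if snap:
    print(snap["timestamp"], snap["url"])
else:
    print("No capture found")
```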
Kahle knew there were still hundreds of thousands of sites and perhaps millions of "hidden" documents, images, and audio clips that his crawler program missed. It couldn't access password-protected sites, for example, or isolated pages with few, if any, inbound hyperlinks, such as outdated product postings on eBay. More troubling, it couldn't probe "form-fronted" databases, which require typing keywords into search boxes to call up information (such databases include those at the National Climatic Data Center in the United States and the British Census). Still, Kahle believed that with the right tools and enough human curators to guide the crawlers, it was possible to get almost all online data. The Web may have been big, but ultimately it was manageable.
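To see why such pages slip through, consider a deliberately minimal crawler sketch. This is hypothetical Python (the third-party requests and BeautifulSoup libraries are assumed to be installed), not the Internet Archive's actual crawler, but it captures the core limitation: a page is only ever discovered by following a hyperlink from a page already fetched, so anything behind a login, behind a search form, or with no inbound links simply never enters the queue.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl that only discovers pages via hyperlinks."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    archived = {}

    while queue and len(archived) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        # Pages behind logins or form-fronted databases never get here:
        # the crawler has no credentials and never submits search forms.
        content_type = resp.headers.get("Content-Type", "")
        if resp.status_code != 200 or "text/html" not in content_type:
            continue
        archived[url] = resp.text

        # Discovery happens only through <a href> links on fetched pages,
        # so isolated pages with no inbound links are invisible.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return archived
```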
That is no longer the case. The part of the Web indexed by search engines such as Google has ballooned from some 50 million unique URLs in 1997 to about 3 trillion today, according to the latest update last November by Majestic SEO, a search optimization service. A URL, or uniform resource locator, designates a single document, such as a JPEG image or an HTML text file. Those files, however, are just a tiny piece of the Internet. By some estimates, the total "surface" Web visible to crawlers is six times the size of the indexed Web, and the "deep" Web of hidden pages and databases is some 500 times larger still.
Counting URLs, though, has become a fairly pointless exercise. For instance, it's increasingly common for a single site to generate vast numbers of unique URLs that all point to the same content: advertisements or pornography, typically. Though engineers have devised tricks for steering crawlers away from such spam clusters, even Google's crawlers still occasionally capture billions of unique URLs that resolve to the same place.
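What those tricks look like in practice varies, but a common building block is to normalize URLs and fingerprint page content, so the crawler stops expanding a site whose fresh URLs keep resolving to bodies it has already stored. The sketch below is a simplified, hypothetical illustration; the parameter names and the choice of SHA-256 are assumptions, not a description of Google's or any archive's actual defenses.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url, junk_params=("utm_source", "utm_medium", "sessionid")):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    # Drop tracking/session parameters and sort the rest so that URLs
    # differing only in parameter order or junk parameters collapse together.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in junk_params]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(sorted(query)), ""))

def is_duplicate(body, seen_fingerprints):
    """Fingerprint the page body; report True if identical content is stored."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest in seen_fingerprints:
        return True
    seen_fingerprints.add(digest)
    return False
```

In a crawler, `canonicalize` would be applied before a URL enters the queue and `is_duplicate` before a fetched page is archived, which together blunt the effect of a site that mints endless URLs for the same content.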
"In reality, the Web is infinite in all the wrong ways," laments Julien Masanès, who introduced Web archiving at the French National Library in 2002 and managed the collection until 2004, when he left to start what is now the nonprofit Internet Memory Foundation, headquartered in Amsterdam and Paris.