DIY Web Archiving

When it comes to online experiences, the technology for creating them is often more advanced than the technology for documenting and capturing them. But art collective (kind of?) Rhizome has a new tool for capturing online experiences, particularly social media interactions. The goal is to create a contextual archive that feels more like the original experience. You can think of these archives as something like video game emulators that let you play old Nintendo games in a web browser, the interactive game collection at MoMA, or even the emulated version of Salman Rushdie’s Macintosh Performa 5400/180. Rhizome focuses on art and digital media, but the idea of archiving online experiences has applications for digital culture more broadly.

Image by Nina Frazier, Mashable. http://on.mash.to/1z7Ybpq

This is something of a shift from the way digital or online documents have traditionally been archived. Web crawlers like Heritrix (used by Internet Archive, among others) treat websites as somewhat static objects, or giant data sets, to be crawled one file at a time. Rhizome’s Colloq tool is more like a digital recorder: you capture sites by navigating through them. It’s based on the same open source Python tools that Webrecorder.io uses, which you can check out on GitHub (especially now that we all have a pretty good primer on GitHub!).
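To make the "digital recorder" idea a bit more concrete, here is a toy sketch in Python. It is not how Colloq or Webrecorder.io actually work under the hood; the `Recorder` class, the `LIVE_SITE` dictionary, and the page names are all made up for illustration, and the "site" is just a dictionary rather than anything fetched over the network. The point is only that a recorder stores what the visitor actually loaded, in the order they loaded it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical "live" site: three pages served from a dict, not the network.
LIVE_SITE = {
    "/": "home page",
    "/feed": "feed with comments",
    "/archive": "older posts",
}


@dataclass
class Recorder:
    """Toy recorder: keeps a copy of every page the visitor loads."""
    fetch: Callable[[str], str]                      # assumed: url -> page body
    captures: Dict[str, str] = field(default_factory=dict)
    path: List[str] = field(default_factory=list)    # order of navigation

    def visit(self, url: str) -> str:
        body = self.fetch(url)
        self.captures[url] = body   # store the response as it was served
        self.path.append(url)       # remember how the visitor got here
        return body

    def replay(self, url: str) -> Optional[str]:
        # Replay only knows about pages seen during the recording session.
        return self.captures.get(url)


recorder = Recorder(fetch=LIVE_SITE.__getitem__)
recorder.visit("/")
recorder.visit("/feed")

print(recorder.path)                # the session, in order
print(recorder.replay("/feed"))     # captured, so it can be replayed
print(recorder.replay("/archive"))  # never visited, so nothing to replay
```

Notice that the unvisited page simply isn't in the archive. That is the trade-off of recording a session instead of crawling a site.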

Emory University’s emulation of Rushdie’s Macintosh

Another thing that differs dramatically from traditional digital archival practice is the emphasis (or lack thereof) on file formats and embedded metadata. You don’t have to be well acquainted with archival practices to want to preserve online interactions, and many of the platforms we use today are not conducive to capturing this kind of data anyway. Web crawlers are generally unable to capture things like Vimeo video, time-based Flash media, or some kinds of complex scripting in a way that’s faithful to the original interactions. Web recorders don’t always capture the same data that is available on a live site, because they record the experience of going through a site rather than crawling its documents. But an experience like Amalia Ulman’s Instagram performance isn’t really about the documents. It’s a performance, so the interaction, not the photography, is what matters most. Capturing the EXIF data (which Instagram strips anyway) of isolated JPEGs is kind of beside the point.

Amalia Ulman’s digital art

If you’re not working with art collections (or you’re unable to work with Rhizome or their specific tool), you can do some DIY captures yourself using essentially the same suite of free tools. There are ways to automate the process if you’re willing and able to do some coding, but web archiving with a recorder takes a little more human effort during the actual capture. You can’t “set it and forget it” the way you can with a crawler, but you avoid some of the logical traps that crawlers fall into (endless calendar crawling, pages missed because of client-side scripting, etc.), and you end up with a version of the web that feels much more human.
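If "endless calendar crawling" sounds abstract, here is a small Python sketch of the trap. Everything in it is hypothetical: the `links_for` function stands in for a made-up site whose calendar pages always link to the next month, and `crawl` is a bare-bones breadth-first crawler, not Heritrix or any real tool. It only illustrates why a crawler needs an artificial page budget on a site like this, while a person clicking through would just stop.

```python
from collections import deque


def links_for(url):
    """Hypothetical site: each calendar page links to the next month, forever."""
    if url == "/":
        return ["/about", "/calendar/2015-01"]
    if url.startswith("/calendar/"):
        year, month = map(int, url.rsplit("/", 1)[1].split("-"))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        return [f"/calendar/{year}-{month:02d}"]
    return []


def crawl(start, max_pages):
    """Breadth-first crawl that stops only when the page budget runs out."""
    seen, queue, visited = {start}, deque([start]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in links_for(url):
            if link not in seen:   # de-duplication doesn't help: every
                seen.add(link)     # calendar URL is technically new
                queue.append(link)
    return visited


pages = crawl("/", max_pages=50)
calendar_pages = [p for p in pages if p.startswith("/calendar/")]

# A person recording the same site would likely look at one month and move on.
human_session = ["/", "/about", "/calendar/2015-01"]

print(len(pages), len(calendar_pages))   # 50 pages fetched, 48 of them calendar
print(pages[-1])                         # /calendar/2018-12
```

With a budget of 50 pages, the crawler spends 48 of them on calendar months. Raise the budget and it just keeps walking into the future.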

Note: It’s worth mentioning that the Heritrix crawler is also open source. You don’t have to subscribe to Archive-It’s services to use the web crawler, but a local install will have similar limitations in its approach toward online data.

Further Reading:

First look: Amalia Ulman

A dynamic new tool to preserve the Friendsters of the future

New Rhizome tool preserves net art for future generations

Colloq at Rhizome: Preserving social media

The many uses of Rhizome’s new social media preservation tool

The digital life of Salman Rushdie

Salman Rushdie’s Macintosh

The digital archives of Salman Rushdie (overview of access)

Snooping through Salman Rushdie’s computer

The artifactual elements of born digital records, part 1

The artifactual elements of born digital records, part 2

Information or artifact: Digitizing a book

Society of American Archivists Standards Portal