Old news: digital newspaper archives at DH2014

(Originally posted on the HyperStudio blog October 27, 2014)


Old News: Digital newspaper archives at DH2014

Books and manuscripts are an archivist’s bread and butter, respectively. Librarians have honed techniques for storing, maintaining, and retrieving their contents for millennia—go into any stack in the library, organized by call number, for ample evidence. But newer media artifacts often don’t fit into old ways of storing and finding information. Digital media brings this problem into full relief, but centuries ago, the newspaper might have been the modern archivist’s first challenge.

Today, archives face the challenge of digitizing their collections, an issue of particular importance for us at HyperStudio, as our research focuses on the potential for digital archives to provide new opportunities for collaborative knowledge creation. For archivists, the digitization of newspapers raises unique questions when compared to their traditional stock. At the DH2014 conference in Lausanne, Switzerland, one panel in particular addressed historical newspaper digitization head-on.

Newspapers are rich archival documents, because they store both ephemera and history. The saying goes that “newspapers are the first draft of history,” but not all news becomes history. In a typical paper, you might find today’s weather sitting next to a long story summarizing a major historic event; historians have traditionally been more interested in the latter. Journalists sometimes divide these types of news into “stock” and “flow”. Flow is the constant stream of new information, meant for right now (think of your Twitter feed). Stock is the “durable stuff,” built to stand the test of time (for instance, a New Yorker longread).

For archivists, everything must be considered “stock”: stored forever. Some historians may be in search of ephemera in an effort to glean insight from fragments of local news snippets, advertisements or classifieds—so everything is of potential historical importance. The Europeana Newspapers project has digitized over 2 million pages with the help of a dozen key partner libraries around Europe, but by their calculations, 90% of European culture is not digitized. The project anticipates reaching 10 million records by 2015, along with metadata for millions more, but it is still a small fraction of Europe’s newspapers.

It is also no surprise that many biases exist even in this wide net of 10 million records. The 10% of culture that is digitized generally consists of culture’s most well-known and well-funded fragments. The lamentable quality of OCR (Online Character Recognition—a technology that turns scans into searchable text) likewise means that better image scans lead to better discovery. Moreover, groups like Europeana must work across dozens of countries, languages, and copyright laws; some of these will inevitably be better represented and better funded than others. So it seems you’re much more likely to find a major piece in a highbrow English paper than a blurb in the sports section of an obscure Polish daily.

Even taking as a given that everything is potentially important, newspapers present a unique metadata challenge for archivists. A newspaper is a very complex design object with specific affordances; Paul Gooding, a researcher at University College London, sees digitized newspapers as ripe for analysis due to their irregular size and their seriality. A paper’s physical appearance and content are closely linked together, so simply “digitizing” a newspaper changes it massively, reshaping a great deal of context.

Seriality and page placement also extend the ways in which researchers might want to query the archive. For some researchers, placement will be important (was an article’s headline on the first page? Above or below the fold? Was there an image, or a counterpoint article next to it?). Others could be examining the newspaper itself over time, rather than the contents within (for instance, did a paper’s writing style or ad placement change over the course of a decade?) Still others may be hoping to deep-dive into a particular story across various journals. Each of these modes of research requires different data, some of which is remarkably difficult to code and store.

In order to learn more about how people use digitized newspaper archives, Gooding analyzed user web logs from Welsh Newspapers Online, a newspaper portal maintained by the National Library of Wales, hoping to gain insight from users’ behavior. He found that most researchers were not closely reading the newspapers page by page, but instead searching and browsing at a high level before diving into particular pages. He sees this behavior as an accelerated version of the way people used to browse through archives—when faced with boxes of archived newspapers, most researchers do not flip through pages, but instead skip through reams of them before delving in. So while digital newspapers do not replace the physical archive, they do mostly mimic the physical experience of diving into an archive; in Gooding’s words, “digitized newspapers are amazing at being digitized newspapers.” Portals like Welsh Newspapers Online are not fundamentally rethinking archive access, but they certainly let more people access it.

The TOME project at Georgia Tech is aiming to rethink historical newspaper analysis from a different angle. Instead of providing an interface for qualitative researchers to dive in, TOME hopes to facilitate automatic topic modeling and entity recognition, to quickly get a high-level glance of a vast archive with quantitative methods. They are beginning with a set of 19th-century American newspaper archives focused on abolition. The project simplifies statistical analysis tools into a visually compelling interface, but at the risk of losing the context that seriality and page placement provide.

Perhaps the biggest challenge is how to present such a vast presence — and such a vast absence — to historians, curious researchers and individuals, all of whom may be after something slightly different. Where Gooding divided queries into three types — “search,” “browse,” and “content” — the TOME group follows John Tukey’s divide between “exploration” and “investigation”—or those who know what they want, and those who are looking for what they want. A good portal into a newspaper archive requires all of these avenues to be covered, but it remains to be decided how best to turn news into data, to visualize troves of ephemera, and to represent absence and bias.

Important books and manuscripts — the “great works” that line history books — tend to present a polished and completed version of events. Newspapers offer another angle into history, where routines, patterns, and debates are incidentally documented forever. Where a book is usually written for posterity, the newspaper is always written for today, reminding the archive diver of history’s unprepared chances and contingencies. The historians who mine old newspapers — and the archivists who enable them — have many new digital tools at their disposal to unearth promising archives, but much effort remains to fairly represent news archives, and determine how we might best use them.