Author Archives: landrew

Saved you a click

Presented at DjangoCon 2017 in Spokane, WA.

Saved you a click (or three): Supercharging the Django admin with actions and views.

The goal of this talk is to “save you a click”: while this is a pejorative phrase in The Texas Tribune’s newsroom, we love saving our reporters and editors clicks in the Django admin when they’re filing a breaking story. We’ve combined several helpful Django features, libraries, patterns and customizations – ranging from the obvious to the advanced – to supercharge our admin, allowing for live previews, shortcuts, and dynamic updates. Taken together, these tools suggest a framework for developing new admin features quickly and maintainably.



Switching CMSes

Co-facilitated session with Pattie Reaves at SRCCON 2017.

Is your newsroom moving to WordPress? Moving away from WordPress? Moving to your parent company’s CMS? Moving to Arc? Building a new CMS from scratch using Node? Rails? Django? (Are you using Django-CMS or Mezzanine or Wagtail?) Going headless with WordPress? Going headless with React?

…or is your newsroom paralyzed by the sheer magnitude of the task of choosing and migrating to a new CMS, let alone upgrading your current one?

This session explored the why and how of migrating content to new systems. When is it time to change up your CMS, and why? When is it better to repair your ship instead of jumping off? What does the transition process look like? For instance, how do you handle your archival stories, or make sure your frontend and backend features are in sync? How do you pull it off (technically)? How do you pull it off (organizationally)? Most importantly: is it worth it?

Notes & Resources

News Shaman

News Shaman logo

News Shaman aims to highlight the emotional impact and rhythmic flow of reading a collection of news stories. The tool offers a quick view of the emotional analytics of a given story or list of stories: a sort of “mood ring” for your content. This allows a publisher to gauge whether their feature series or newsletter digest is setting an emotional tone that fits their readers, and begin to iterate on the best flow for their content. Is it better to open with a downer and finish with an inspiring glimmer of hope, or start irreverent and end with a sober reflection? Combined with analytics tools, News Shaman can begin to answer these questions, and help publishers gain editorial insight into how readers are reading and reacting to stories.

This project was built by Liam Andrew, Kathryn Beaty, and Ben Hasson at The Huffington Post and Editors Lab hackathon in NYC on April 8-9, 2016. Code is available on GitHub.

The ideas behind News Shaman were developed further at the Coral Project’s Communities of Data hackathon in Washington, DC, May 2016. Using a shared dataset of Washington Post reader comments, we built tools and visualizations to analyze and predict the emotional impact of stories. Read more about the results on the Coral Project community page.

Screenshot of News Shaman

Don’t let the bots win

Presented with Daniel Craigmile at NICAR 2016.

Whether they’re built by data journalists, search engines, web archivists or malicious troublemakers, bots are substantial users of news websites. At The Texas Tribune, bots account for half of our site’s traffic, and they’re responsible for our search presence and our archival legacy as well as site attacks and performance problems. In this session, we dove into the world of bot users, and shared some tips for identifying and managing these crawlers and scrapers, helping the “good” bots do their work, and keeping the “bad” ones from wreaking havoc.

We approached the topic through the lens of a mysterious site that was serving an exact mirror of The Texas Tribune for several weeks, wreaking havoc on our servers and analytics. In our effort to identify the source and method for this mystery site, we enlisted the aid of investigative reporters, business staff, and systems engineers. In the process we learned a great deal about bots, and how much their face has changed and their numbers have increased in recent years. We believe that closer tracking of bot users could lead to new stories and insights for data and tech reporting.





LitCity logo

Imagine walking down a city street and feeling that familiar buzz of a push notification. But instead of it being a notification on Twitter or a restaurant recommendation, it’s a beautiful passage from a work of literature with a tie to that place. In Paris, it could be walking past Café de Flore and receiving a sample from James Baldwin or Richard Wright. In Washington, DC, it could be a sample of an Alex Cross novel. In Japan, it could be one of Miyuki Miyabe’s mystery novels. In Chicago, it could be a bit from The Devil in the White City. LitCity ties literature and place, injecting a little bit of romance and geographic discovery into books. Not only is the reader given a beautiful prompt to reflect upon (contributing to the mental environment), but the passage is also a reminder that literature lives wherever we are.

The LitCity prototype was developed at the Codex MIT hackathon as a mobile web application, using a combination of natural language processing, scraping, and web APIs to create and connect books, places, and quotations, along with a Django-based admin interface that allows editors to approve and adjust the automated data. The prototype was developed for Boston and London. The source code is available on GitHub.


The Gist

The Gist logo

Existing news topic pages are static, usually just a reverse-chronological list of stories about a given topic or category. The Gist remakes the traditional news website’s topic page into an interactive, human-curated hub that summarizes current events by assembling content and conversations from across the web (social media, the homepage, and other news sources). A curated hub gives reporters more ownership over the topic, and allows the audience to drive the conversation and highlight story points that might otherwise be drowned out.

The Gist prototype was developed at SNDMakes Austin. Read more from teammate Adam Schweigert at the INN Nerds blog.


The Gist demo homepage

The missing links: an archaeology of digital journalism


Master’s thesis in Comparative Media Studies, published by Massachusetts Institute of Technology, 2015.

Advisors: William Uricchio, Kurt Fendt, Ethan Zuckerman. Download from DSpace or check out the blog.


As the pace of publishing and the volume of content rapidly increase on the web, citizen journalism and data journalism have threatened the traditional role of institutional newsmaking. Legacy publishers, as well as digital-native outlets and aggregators, are beginning to adapt to this new news landscape, in part by drawing newfound value from archival stories and reusing older works. However, this trend’s potential remains limited by technical challenges and institutional inertia. In this thesis I propose a framework for considering the news institution of the digital era as a linked archive: equal parts news provider and information portal, the linked archive places historical context on the same footing as new content, and emphasizes the journalist’s role as news explainer and verifier. Informed by a theoretical, historical, and technical understanding of the web’s structural affordances and limitations, and especially by the untapped networking power of the hyperlink, I suggest how publishers can offer an archive-oriented model of structured, sustainable, and scalable journalism. I draw from concepts and lessons learned in library and computer science, such as link analysis, network theory, and polyhierarchy, to offer an archivally-focused journalistic model that can save time for reporters and improve the research and reading process for journalists and audiences alike. This allows for a treatment of news items as part of a dynamic conversation rather than a static box or endless feed, revitalizing the news archive and putting the past in fuller and richer dialogue with the present.



The backstories behind breaking news can be complex, and incredibly difficult to understand if you haven’t been following a story all along. Fortunately, some smart people in television have been thinking about this problem: they’ve created the recap sequence (think “Previously on…”). It helps you get up to speed on the facts so you can jump into a new episode. Explainer stories are the news equivalent to “Previously on…”; they are comprehensive, evergreen, and search-optimized, but they’re focused on seekers (rather than casual browsers), difficult to make, and quickly out of date when a situation is changing frequently. How could you make a quick, flexible recap sequence for news without having to create new content?

Inspired by TV recap sequences, Backstories remixes structured data from previous stories (leveraging your archive) to create an explainer video for a reader to get up to speed. Backstories videos are composed of headlines and key images from previous stories, and background music. The videos are automatically generated, but journalists can fine-tune the content to make it more coherent.

Backstories was created for Ethan Zuckerman’s Future of News and Participatory Media class at the MIT Media Lab, and has been featured on Nieman Lab and the AIR newsletter. The prototype runs on Flask and the code is available on GitHub.

Role: Co-lead, prototype developer

Link profiles of the New York Times and the Guardian

Using the Media Cloud API I gathered all of the stories from The New York Times and The Guardian in January 2014. It was about 10,000 stories for each. For each story, I extracted some data about the inline hyperlinks in each story — its place on the page, its anchor text, and whether it was an inlink or outlink. I started with two research questions:

  1. are there patterns in the ways that different sections or desks of the newsrooms use internal hyperlinks?
  2. could the hyperlinks created within news stories at either organization constitute new categories of their own?
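A minimal sketch of the per-link extraction described above, using only the Python standard library. The class and function names, the sample HTML, and the example.com URLs are illustrative assumptions, not the code used for this analysis. It also treats "same host" as the test for an inlink, which would misclassify links to a publication's own subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect each <a href> with its anchor text, in page order."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished links, one dict per <a>
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({
                "href": self._href,
                "text": "".join(self._text).strip(),
                "position": len(self.links),  # 0 for the first link, and so on
            })
            self._href = None


def link_profile(story_url, body_html):
    """Label each inline link as an inlink (same host) or an outlink."""
    collector = LinkCollector()
    collector.feed(body_html)
    home = urlparse(story_url).netloc
    for link in collector.links:
        absolute = urljoin(story_url, link["href"])  # resolve relative hrefs
        link["href"] = absolute
        same_host = urlparse(absolute).netloc == home
        link["kind"] = "inlink" if same_host else "outlink"
    return collector.links


# Hypothetical story body with one internal and one external link.
sample = (
    '<p>See our <a href="/topic/film">film coverage</a> and '
    '<a href="https://other.example.org/report">this report</a>.</p>'
)
profile = link_profile("https://news.example.com/arts/2014/01/05/story.html", sample)
for link in profile:
    print(link["position"], link["kind"], link["text"])
```

A real pipeline would first isolate the story body from navigation and footer markup, since those links would otherwise swamp the inline ones.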

Q1: Linking by category

I treated each story’s URL path as a proxy for the category or desk that it is in. Using regular expressions, I queried for metadata about the “link profile” of the stories in each category of the Times and the Guardian.
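As a rough illustration of the URL-as-category idea, the sketch below pulls a category out of each URL with a regular expression and averages inlink counts per category. The URL shape (category as the first path segment), the example.com addresses, and the counts are all assumptions made for the example; real Times and Guardian URLs are structured differently and would need their own patterns.

```python
import re
from collections import Counter

# Assumed URL shape: the first path segment names the desk or category.
CATEGORY = re.compile(r"^https?://[^/]+/([a-z-]+)/")


def category_of(url):
    """Return the category segment of a story URL, or None if it has none."""
    match = CATEGORY.match(url)
    return match.group(1) if match else None


def inlinks_by_category(stories):
    """Average number of inlinks per story, grouped by URL category."""
    totals, counts = Counter(), Counter()
    for url, inlinks in stories:
        category = category_of(url) or "unknown"
        totals[category] += inlinks
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in counts}


# Made-up (url, inlink count) pairs, for illustration only.
stories = [
    ("https://news.example.com/world/2014/01/02/a.html", 1),
    ("https://news.example.com/world/2014/01/03/b.html", 0),
    ("https://news.example.com/movies/2014/01/04/c.html", 5),
    ("https://news.example.com/movies/2014/01/09/d.html", 3),
]
averages = inlinks_by_category(stories)
print(averages)
```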

The first striking difference was in the breakdown of categories. Here’s the Guardian’s distribution of types of stories:

And the Times’:

The Guardian seems to have a more balanced breakdown of stories — two Times categories take up nearly 50% of the pie — but then again, the Guardian has two separate categories for “sports” and “football”, so it’s not to be completely trusted.

I then looked at the average rate of internal linking in each newsroom’s stories, broken down by category. Here’s the Times’s top 20 desks for inlinking:

Blogs are just 13% of total stories, but account for 31% of the links. However, these skew towards outlinks and only 2% of them go to Times topic pages. Meanwhile, half of all inlinks on traditional Times stories go towards topic pages. Blogs seem to be a site of greater outlinking, but the same level of inlinking as traditional Times stories.

I was especially surprised by the low number of inlinks on “world”, a category which would seem to require more context to explain complex and distant events. “world” and “sports”, the most populous categories, have the lowest link levels. These categories do not support the idea of a link-based classification scheme.

The Guardian was a different story, with 20 categories at higher than 3 inlinks apiece:

Lastly, I matched up all of the categories I could between the Times and the Guardian (for instance, each one has a “science” section), then I ran a side-by-side comparison of the two papers. Blue is Guardian and Gray is NYT, obviously.

The Guardian beats the pants off of the Times, in nearly every category. The “.+” category is particularly important, as it is the global figure (“.+” is a regular expression that matches any URL). The difference is stark.

It’s interesting that “movies” does particularly well for both publications, as does “theater”; presumably, these stories link often to topic pages about films, actors, and directors. Such stories are also usually slower-paced and less tied to breaking news, allowing more time for background research and context.

Q2: Network formation

My next question was whether these internal links could start to self-organize. This required examining a subset of each archive as a network graph, homing in on who was linking to whom at a story level. I selected the Times’ “arts” section and the Guardian’s “arts and culture” section, each containing around 400-500 stories and 4,000 links.
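One simple way to quantify how “networked” a set of stories is would be to group them into connected clusters of the link graph. The sketch below does this with a plain depth-first search; the story identifiers are invented, and this is not necessarily how the graphs in this post were produced.

```python
from collections import defaultdict


def clusters(links):
    """Group stories into connected components, ignoring link direction."""
    neighbors = defaultdict(set)
    for source, target in links:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        components.append(component)
    # Largest clusters first.
    return sorted(components, key=len, reverse=True)


# Hypothetical story-to-story links: (linking story, linked story).
links = [
    ("review-1", "profile-1"),
    ("review-2", "profile-1"),
    ("interview-1", "review-2"),
    ("obit-1", "obit-2"),
]
groups = clusters(links)
print([len(group) for group in groups])
```

A few large components suggest coherent clusters, while many small ones correspond to the isolated pods. Stories with no story-level links never enter the graph here and would have to be counted separately.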

The Guardian’s graph is as follows:

And the Times’ graph:

It is clear from a quick review of each graph that the Guardian’s appears more “networked,” with more coherent clusters forming rather than isolated pods. This suggests new possible ways to organize topic pages, suggest related stories, and organize archival material.

This is an early experiment and this research question needs more fleshing out. Future inquiries will attempt to “spider” out from links to get stories outside of January 2014, and will also attempt to cross categories in order to examine the networks formed between different desks at a news organization.

Playful engineering: designing and building art discovery systems

(Originally submitted for Museums and the Web 2015. Co-authored with Desi Gonzalez and Kurt Fendt.)


How can we engineer the discovery of art? HyperStudio, MIT’s digital humanities laboratory, has been tackling this question through the development of Artbot, a mobile website that encourages meaningful, sustained relationships to art museums in the Boston area. Artbot combines two strategies to enable users to discover cultural happenings in the city: a serendipitous approach that allows users to explore via linkages between art events, and a recommendation system that suggests events based on a user’s interests.

Using Artbot as its primary case study, this paper will examine the design and building of art discovery systems. First, it will survey other examples of art recommendation and discovery systems, such as Magic Tate Ball, Serendip-o-matic, Artsy, and the Powerhouse’s OPAC 2.0 collection project. Then we will discuss the front-end design and back-end technologies behind Artbot’s discovery engine and consider how other cultural institutions can implement these approaches.

1. Introduction

At HyperStudio, MIT’s digital humanities center, we research, conceptualize, and develop projects in support of scholarship and education in the humanities. Each project starts with a clearly defined scholarly and/or educational need, often in partnership with MIT faculty and local institutions. We then expand on existing digital humanities research to develop tools that can be applied to other humanities fields.

In the fall of 2013, HyperStudio began work on Artbot, a mobile website that allows users to discover art in the Boston area. We wanted to go beyond a listings website or a tourist guide. Instead, Artbot aims to create deeper connections to art through personalization and by unearthing hidden gems within Boston’s cultural landscape. As such, we targeted an audience of Boston-area residents who are interested in art but may not be aware of all of the city’s cultural happenings; we especially aim to reach Boston’s semi-transient student and researcher population. Artbot aims to accomplish three goals:

  1. To encourage a meaningful and sustained relationship to art
  2. To do so by getting users physically in front of works of art
  3. To reveal the rich connections among holdings and activities at cultural institutions in Boston

During Artbot’s ideation process, one advisor likened Boston as a whole to a museum: from attending lectures at Tufts University to concerts at the Isabella Stewart Gardner Museum, an individual could enjoy a wealth of educational and cultural experiences within the city. But many people don’t know how to find out about new happenings in Boston’s art scene, and while they can turn to email digests and newspaper listings, few tools provide personalized and playful recommendations for arts and culture.

We endeavored to accomplish these goals by building a tool that addressed a research question: How can we engineer the discovery of art? Beyond this core question, what audience would be most interested in such a tool, and how can we best reach them? Finally, how can we balance a smart, engaging, scalable, and sustainable discovery system with time and budget limitations?

This paper examines the research and process of developing a discovery system for visual art events and exhibitions in the Boston area. First, we review the challenges to algorithmic recommendation and discovery in the cultural sector, assessing existing recommendation models. For Artbot, we opted for two modes of discovery: a serendipitous and playful approach that allows users to explore via linkages between art events, and a recommendation system that suggests events based on a user’s interests. The following section addresses our process and approach while designing and developing Artbot. Next, we describe the app’s interface design and the ways it reflects Artbot’s two modes of discovery. We then outline the back-end system, which is built around a suite of modular components, services, and open-source tools; it combines Web scraping and natural-language processing tools to semi-automate the data sourcing and tagging process. After outlining our next steps, we conclude by considering what other cultural institutions might learn from this project.

2. Cultural recommendation and discovery

Recommendation and discovery systems are plentiful, and many of us use them on a daily basis. When a user opens Netflix, it might recommend that she watch The IT Crowd and All About Eve because she previously binged on 30 Rock and tends to watch classic films with strong female heroines. For another individual, an app called Zite bookmarks news articles based on topics he’s chosen to follow, such as “museums,” “digital humanities,” and “politics.” And for yet another person, an online retailer might recommend a Columbia fleece vest because she was browsing a Patagonia jacket.

Cultural institutions can employ recommendation systems, but for different ends than their for-profit counterparts. For museums, cultural organizations, and university-affiliated research groups like HyperStudio, our goals are more intangible: rather than increasing the bottom line, we aim to increase engagement with culture at large. In The Participatory Museum (2010), Nina Simon advocates for personalized recommendation systems as a way to frame “the entry experience in a way that makes visitors feel valued,” provide “opportunities to deepen and satisfy their pre-existing interests,” and give “people confidence to branch out to challenging and unfamiliar opportunities.” Barry Schwartz (2008) discusses how recommendation in the cultural space, unlike the commercial space, is not an either-or situation. You can choose both the Philip Roth and the Stephen King novels to read on a cross-country flight, he explains. In fact, diversity in cultural consumption might lead to further interest in culture: “Perhaps because culture is an experienced good, participating in cultural events may whet the appetite for more participation. Doing culture may stimulate the demand for more culture” (Schwartz, 2008). Richard A. Peterson and Gabriel Rossman (2008) call an interest in a variety of culture, from the highbrow to popular culture, “omnivorousness” and argue that omnivorousness might be increasingly important, more so than his or her “brow level of taste,” in predicting whether a person will participate in cultural events.

But while diversity of choice is a good thing—and something we certainly hope to emphasize in Artbot—too much choice can be a bad thing. Schwartz talks of the problem of “choice overload” and the “paradox of choice,” in which a surplus of cultural options can lead to people participating less. He suggests that having too much choice can result in dissatisfaction with a final decision, or a feeling of paralysis in the face of so many options; often, people will choose the same things that they are used to or “choose not to choose at all.” In a focus group we conducted during the conceptualization phase of Artbot, participants expressed frustration with massive lists of cultural options. Rather than sifting through comprehensive event feeds or digests, they tended to find out about events through their social networks or directed emails. Schwartz comes to a similar conclusion: he advocates for cultural institutions to focus “more on filtering diversity than creating it.” In other words, now that the Internet allows access to so many artworks, films, books, and music, how can we build tools to help users make sense of such a wealth of culture?

These are considerations we have kept in mind during the development of Artbot. We hope to expand users’ cultural purview by suggesting hidden gems they may not have known they were interested in; at the same time, we want users to feel like the cultural events and exhibitions recommended to them align with their own interests and sense of identity.

In designing Artbot’s discovery engine, we looked to existing models of cultural recommendation systems. Recommendation systems can be divided into two main approaches: collaborative filtering and content-based filtering. Collaborative systems take on a social approach, giving recommendations based on users’ behavior. Amazon, for example, employs item-to-item collaborative filtering, showing that people who buy product X also bought product Y. Netflix recommends movies and TV shows that you might like based on other users who have similar viewing habits. In the museum realm, the Powerhouse Museum’s OPAC2.0 collection search interface integrates what Seb Chan (2007) has called “frictionless serendipity.” OPAC2.0 provides users suggestions for collection objects according to the behavior of site visitors; as Chan explains, a search “for ‘minton’ currently gets suggestions for other searches of ‘mintons’, ‘bone china’, ‘british’, ‘porcelain’ and ‘peacock’, based on the terms other searchers of the term ‘minton’ have used and the objects they have viewed.”
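To make the “people who chose X also chose Y” idea concrete, here is a deliberately simplified sketch of item-to-item filtering based on raw co-occurrence counts. The user names, item names, and helper functions are invented for illustration. Production systems typically normalize these counts (for example with a similarity measure) instead of using them directly, and nothing here reflects how Amazon, Netflix, or OPAC2.0 are actually implemented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrence(histories):
    """Count how often each pair of items appears in the same user's history."""
    together = defaultdict(Counter)
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            together[a][b] += 1
            together[b][a] += 1
    return together


def also_liked(item, together, limit=3):
    """Items most often seen alongside `item`, most frequent first."""
    return [other for other, _ in together[item].most_common(limit)]


# Hypothetical viewing histories: user -> exhibitions they saved.
histories = {
    "ana": ["goya", "hokusai", "sargent"],
    "ben": ["goya", "hokusai"],
    "cy": ["goya", "sargent", "calder"],
    "dee": ["hokusai", "goya"],
}
together = co_occurrence(histories)
print(also_liked("goya", together))
```

Note that an item nobody has interacted with yet has no co-occurrence counts at all, so it can never be suggested by this method.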

One of the drawbacks of collaborative filtering, however, is that it can limit rather than expand a user’s purview. Early utopian language surrounding collaborative filtering championed its power to break down rigid demographic groups, but as Nick Seaver (2012) points out, collaborative filtering merely redraws borders based on consumer taste (which often falls along traditional demographic lines), and its dynamic, shifting nature can make it seem invisible and immune to scrutiny. Eli Pariser (2011) coined the term “filter bubble” to describe the phenomenon that occurs when an algorithm, guessing what a user might want, isolates said user from content that might differ from his or her viewpoints. In order to combat such filter bubbles, Ethan Zuckerman (2013) advocates for the building of digital tools that infuse serendipity and a diversity of voices.

Collaborative filtering also suffers from the “cold start” problem (Schein et al., 2002): a system doesn’t know how to evaluate items that no user has rated yet. It thus requires a strong user base at the outset in order to be effective—no recommender system uses collaborative filtering exclusively—and only gets better with more and more users. While this approach works for Netflix or Amazon, we opted against implementing collaborative filtering at an early stage for the reasons outlined above.

Content-based systems look to the properties of the items themselves, rather than the users, for recommendation signals. The Powerhouse Museum’s OPAC2.0 also incorporates content-based recommendation strategies, taking advantage of the object and subject taxonomies built into the museum’s collection management system while also allowing users to add tags to objects in the collection. Trill, a Boston-based service for discovering music and performance events, allows users to explore via genre tags. Trill also suggests curated picks from local music and performance experts. Artsy’s “Art Genome Project” is a more nuanced tagging system; artworks are tagged by “genes” that are assigned priority on a scale from 0 to 100. Like collaborative discovery tools, tag-based recommendation systems have disadvantages. Whether tags are user generated or curated by a selected team, generating and maintaining a taxonomy can be time intensive. Additionally, tags are only as good as their taggers, often leading to rigid classifications that don’t allow for happy accidents in the discovery process.
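One common way to compare items described by weighted tags is cosine similarity over their tag vectors. The sketch below assumes weights on a 0 to 100 scale like the one described above; the works, tags, and weights are invented, and the choice of cosine similarity is our assumption, not a description of how Artsy or any other service ranks items.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {tag: weight} dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[tag] * b[tag] for tag in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(name, catalog):
    """Other items in the catalog, ordered from most to least similar."""
    target = catalog[name]
    scored = [
        (cosine(target, tags), other)
        for other, tags in catalog.items()
        if other != name
    ]
    return [other for _, other in sorted(scored, reverse=True)]


# Hypothetical catalog: item -> weighted tags.
catalog = {
    "work_a": {"portraiture": 90, "spain": 70},
    "work_b": {"portraiture": 80, "spain": 40, "satire": 30},
    "work_c": {"landscape": 100, "japan": 60},
}
print(most_similar("work_a", catalog))
```

Because items that share no tags always score zero, a purely tag-based ranking like this one cannot surface the “happy accidents” mentioned above without some added randomness or looser matching.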

Serendip-o-matic and Magic Tate Ball are two content-based recommendation projects that hope to infuse serendipity into the discovery process. Serendip-o-matic, a website created in August 2013 and geared toward scholarly researchers, asks users to input a text; the tool identifies key words within the text and yields images from online humanities collections. The results can serve as inspiration for a new research project or unexpected primary sources that can enliven a project. Magic Tate Ball is a mobile app that provides a fun way to discover artworks within the Tate’s collection. Based on factors such as GPS location, time of day, current weather conditions, and ambient noise levels, the “eight ball” yields one artwork, providing an explanation as to why the object was selected and additional educational content.

In developing Artbot, we wanted to emulate Serendip-o-matic and Magic Tate Ball’s sense of discovery and fun. We initially considered many search strategies: most popular events, events happening or exhibitions closing soonest, curated lists, and so on. Ultimately, we opted for a content-based approach that combines two main modes of discovery: personalized recommendations based on favorites and user interests, and serendipitous connections between events, based on automatically generated tags.

3. Process

HyperStudio is a research group within an academic unit, consisting of graduate research assistants and technical and academic staff. Besides digital humanities research and education, we develop a variety of grant-funded projects. As such, our application needed to be buildable and sustainable within this framework. The project’s small team and limited time frame informed both the scope and the process of building Artbot, as we aimed to balance research interests with practical concerns.

We began the process with extensive background research into existing museum, event recommendation, and mobile discovery apps, as well as conversations and focus groups with potential target users in the MIT community. These confirmed that people felt overwhelmed by the plethora of events in Boston and wanted help in sorting through them to find interesting happenings.

With this less-is-more approach in mind, we started small, sourcing our data from a handful of museums in Boston rather than expanding too quickly to other venues or types of events. We also decided to home in on event and exhibition data, rather than collections, in part because of access and rights issues; museum collection data is often difficult to access, while event information is freely available on the Web. By starting small, we were able to keep our focus on providing rich and nuanced recommendations rather than worrying about quantity and scale. The decision to work with events and exhibitions also helped to focus our app’s goals: rather than educating users from afar, the app would aim to encourage users to attend the museums themselves.

After the initial research phase, we began designing and prototyping Artbot in spring 2014. We opted for a mobile website, as we wanted to design for a mobile audience without adding the technical demands and overhead of a native mobile application. This allows the app to work responsively on any device, which is particularly helpful at our stage for user testing. Future versions of Artbot could also build on this backbone for new, mobile-specific features and integration with app stores.

Figure 1: early paper prototype

We developed the core features of Artbot between April and October 2014. In the process, we extensively leveraged third-party and open-source technologies and services. We also supplemented in-house development with contract work from outside consultants, working primarily in a series of intensive sprints. We collaborated with design firm Clearbold on the front-end execution and user experience, and with the consulting firm Thoughtbot on aspects ranging from visual design to application programming interface (API) structure.

4. Front-end design

Artbot’s front-end design reflects the two modes of discovery outlined above. The landing page, also called the “Discover” page, reflects the personalized-recommendation aspect of the application. On the top two-thirds of the screen, a signed-in user sees one event or exhibition at a time based on his or her preferences. The user can swipe left or right to browse up to ten other recommendations. The decision to show only one event at a time was influenced by the Magic Tate Ball app: by displaying only one recommendation at a time, we make that recommendation all the more special and avoid overloading the user with too many options. On the bottom third of the same screen, a user can access the events and exhibitions they’ve saved via the “My Favorites” carousel.

Figure 2: the Discover page

Event and exhibition pages are designed to reveal the serendipitous connections between events. The bottom carousel—mirroring the “My Favorites” navigation of the “Discover” page—recommends other events and exhibitions based on related tags. Users can tap on the “refresh” icon to load another tag related to the event on view. Pairing specific event information with other recommendations emphasizes the connections between events across the Boston area, visualized on a single screen.

Figure 3: a sample event page

While personalized recommendations and serendipitous connections are the primary modes of discovery on Artbot, we recognize that users are also interested in finding happenings that are convenient for their busy schedules. With this in mind, we also allow users to filter events by date and by location. However, we purposefully excluded search functionality from the app. By limiting its design to only a few modes of discovery, we hope to develop a simple but elegant tool that prioritizes serendipitous discovery above all.

Figure 4: “By Date” and “By Location” pages

5. Technology pipeline

The back end of Artbot’s technology stack is designed to support a unique and nuanced recommendation engine that can complement and reinforce these playful, serendipitous elements of user discovery. We also endeavor to make Artbot a suite of modular, easily customizable components, so that particular elements can be scaled and developed further and reused in future projects.

Artbot’s back-end ecosystem consists of three major components that will be described in the following sections:

  1. A series of Web scrapers to obtain the event data
  2. A Web-based natural language parser and named entity recognition service
  3. A back-end API with user and admin management capabilities

Figure 5: diagram of Artbot’s applications and features

  1. Scrapers

One of the biggest challenges in building and maintaining an event recommendation application is the sourcing of event data. Events come in many forms and from many sources, so finding an optimal approach for adding and classifying every event is a particular challenge. Some event applications ask partners (such as museums, galleries, or marketers) to enter data into an online form; others rely on staff or interns who actively scour the Web and manually enter data, while still others crowdsource submissions from the public.

Artbot’s event data comes directly from museum websites via scraping. A scraper is an automated script that fetches a Web page in order to extract and store certain data from it. For instance, the Web page for the Museum of Fine Arts’ Goya: Order and Disorder exhibition contains a plethora of useful metadata: start and end dates, gallery location, images, price, and most importantly for Artbot, a four-hundred-word description of the exhibition complete with associated artists, movements, themes, and locations.

The scraper approach allows Artbot to use a “pull” model that automatically grabs event data, rather than a “push” model in which administrators manually find and enter event information. The primary benefit of scrapers lies in their flexibility; we do not require museums to directly supply data in a certain format, repeating work that they might do elsewhere. Scrapers also obviate the need for our staff to scour museum websites for new or updated events, as the scraping scripts do so for us.

We have built scrapers for seven of the eight museums in our system. While every museum structures its website differently, each of these museums follows a certain formula for event pages and listings. The museum homepage tends to lead to an event or exhibition index page, which our scrapers are able to locate and fetch in order to collect a full and updated event listing. The conventions and patterns in museum websites also allow us to add most conventional museum websites efficiently, with just a few lines of code. The scraper then visits each event page in the listing and gathers the exhibition’s URL, name, dates, location, image, and full-text description. It yields a list of these events in the JSON data format for storage in a database.
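The index-page step described above can be sketched roughly as follows. This is a minimal illustration, not Artbot’s actual scraper code: the sample markup, the `event` class name, the `museum.example` domain, and the `scrape_index` helper are all assumptions, and the sketch parses an in-memory string rather than fetching a live page.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup standing in for one museum's exhibition index page.
SAMPLE_INDEX = """
<ul class="exhibitions">
  <li><a class="event" href="/exhibitions/goya">Goya: Order and Disorder</a></li>
  <li><a class="event" href="/exhibitions/hokusai">Hokusai</a></li>
</ul>
"""


class EventLinkParser(HTMLParser):
    """Collects the URL and name of every <a class="event"> link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.events = []
        self._current_href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag == "a" and attrs.get("class") == "event" and href:
            self._current_href = href
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an event link.
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.events.append({
                "url": self.base_url + self._current_href,
                "name": "".join(self._text).strip(),
            })
            self._current_href = None


def scrape_index(html, base_url):
    """Return a list of event dicts found on an index page."""
    parser = EventLinkParser(base_url)
    parser.feed(html)
    return parser.events


events = scrape_index(SAMPLE_INDEX, "https://museum.example")
print(json.dumps(events, indent=2))  # JSON, ready for storage
```

A fuller scraper would then visit each collected URL to gather dates, location, image, and description; only the class name and URL pattern would need to change from one museum to the next.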

We currently run each of the seven scrapers once a day to check for new and updated events. Any changes are handled automatically by cross-referencing the exhibition’s URL with the corresponding URL in our system. If the event’s metadata (such as its dates or description) has changed, the system automatically updates the event with the new data and alerts admins to the change.
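The daily cross-referencing step might look something like the sketch below, which treats the event URL as the unique key. The `sync_events` function, the dictionary store, and the sample records are illustrative assumptions; the real system stores events in a database.

```python
def sync_events(stored, scraped):
    """Merge freshly scraped events into `stored`, a dict keyed by URL.

    Returns a list of (url, status) pairs, where status is "new" or
    "updated", which could be used to alert admins.
    """
    changes = []
    for event in scraped:
        url = event["url"]
        existing = stored.get(url)
        if existing is None:
            stored[url] = dict(event)
            changes.append((url, "new"))
        elif existing != event:
            # Same URL but different metadata: overwrite with the new data.
            stored[url] = dict(event)
            changes.append((url, "updated"))
    return changes


# Hypothetical records for illustration only.
stored = {
    "https://museum.example/goya": {
        "url": "https://museum.example/goya",
        "name": "Goya",
        "end_date": "2015-01-19",
    },
}
scraped = [
    {"url": "https://museum.example/goya", "name": "Goya", "end_date": "2015-02-01"},
    {"url": "https://museum.example/hokusai", "name": "Hokusai", "end_date": "2015-08-09"},
]

changes = sync_events(stored, scraped)
print(changes)
```

Running the same sync a second time with unchanged data yields an empty change list, so admins are only alerted when something actually differs.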

The scraper approach, while successful and useful for our purposes, has certain drawbacks and limitations. First, a website must be well structured in order to retrieve the data, which makes the approach better suited to institutions with a well-developed Web presence. Moreover, when a museum redesigns its website, the scraper needs to be rewritten. Second, the approach necessitates a “closed” ecosystem of venues, with a bespoke scraper for each venue. Still, we see this method as more consistent and sustainable than manually adding and maintaining all of the venue’s events.

Although the scraping approach does not lend itself as well to smaller, more ad hoc events with less of a Web presence, it can easily be used in tandem with other sourcing methods, such as manually entering event data into the database. For our purposes, the scrapers have proven helpful for maintaining a small but sustainable set of data with the potential for rich interconnections.

  2. Parsers

Most museum event pages include a description of the event to entice users to attend. These descriptions are replete with rich and useful information, such as artists, movements, mediums, and geographic locations that make up a web of connections and influences. In order to tie these influences to collections, events, and exhibitions at other museums, we utilize a suite of third-party natural-language processing tools, which help with tasks like named entity recognition and entity linking. Named entity recognition (NER) is a process that identifies and classifies text elements, such as the names of people, locations, and organizations. Entity linking takes these names and connects them to a broader knowledge base, such as Wikipedia. Together, these services, which we call “parsers,” extract proper names and broader themes from a given text (such as an event description), allowing for automated categorization, classification, and linking of events and their contents.

The resulting application, written in Python’s Flask framework, allows a user to supply any text (i.e., a museum event description) to the suite of parsers. The parsers will then return a list of entities (proper names) and tags (descriptions and themes) that are associated with the text. We have so far implemented four different services in tandem: the Stanford Named Entity Recognizer, DBpedia, OpenCalais, and Zemanta.

We found that the Stanford Named Entity Recognizer was the most effective of the four services at identifying the presence of entities (people, places, and organizations) in event descriptions. However, unlike the other services, Stanford does not automatically link or disambiguate these entities. To compensate, we rely on linked-data service DBpedia to match the recognized text to a specific entity with contextual meaning. So for instance, if the string “Sol LeWitt” is in a given event description, the Stanford NER will recognize that Sol LeWitt is a person, and DBpedia will link it to Sol LeWitt’s Wikipedia page, which leads in turn to additional insight (e.g., that Sol LeWitt is a Conceptual artist, an American, and so on).

While the Stanford NER is effective at finding the proper names in a given text, it is not built to locate or infer broader topics and themes. To shore up this aspect, we rely on third-party API services OpenCalais and Zemanta. Like DBpedia, these services find linked entities and tags in free text. They are primarily targeted towards news and blog classification, so they sometimes provide superfluous or irrelevant tags to art events; however, since we combine multiple linked-data services, we are able to triangulate between these tags in order to vet or confirm their accuracy.
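One simple way to picture this triangulation is a vote across services: a tag is kept only if more than one service returns it. The sketch below assumes that rule, which the text does not spell out. The `triangulate` function, the `min_votes` threshold, and the sample responses are invented for illustration, and no real service API is called.

```python
from collections import Counter


def triangulate(tags_by_service, min_votes=2):
    """Keep tags that at least `min_votes` distinct services agree on."""
    votes = Counter()
    for tags in tags_by_service.values():
        # Normalize case and de-duplicate so each service votes once per tag.
        votes.update({tag.strip().lower() for tag in tags})
    return sorted(tag for tag, count in votes.items() if count >= min_votes)


# Made-up responses standing in for what each service might return.
tags_by_service = {
    "dbpedia": ["Conceptual art", "American artists"],
    "opencalais": ["conceptual art", "Banking"],
    "zemanta": ["Conceptual Art", "American artists", "Philanthropy"],
}

confirmed = triangulate(tags_by_service)
print(confirmed)
```

Tags that only one service proposes, such as the news-oriented “Banking” here, drop out, while tags that several services agree on survive.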

Figure 6: diagram of the data ingestion and tagging process, from event page to database

Our integrated approach, combining multiple services in a single Web application, allows us to balance the strengths and weaknesses of each service. It also allows us to easily add new parsers to the same application, rather than building different applications for each service. For instance, certain Getty Vocabularies, like the Union List of Artist Names, would be a promising addition for deeper dives into the connections between artists and art movements. The open-source Stanford NER is also highly customizable, allowing for the long-term possibility of creating a custom entity recognizer built specifically for museum exhibits and descriptions.

  3. Core application and API

After the scraping and parsing process, the event data is sent to Artbot’s core back-end and API, which is written in the Ruby on Rails framework. Along with housing the database and its corresponding data models, the core application contains some of the pre-ingestion data management required to format the data for use, such as a date parser and an entity linker. It also provides API endpoints for use by a front-end presentation system, as well as admin management and email features.

  • Entity linker

The entity linker manages the parsers’ entities and tags, and makes smart and dynamic connections between these entities for future use. Due to the often-noisy data coming from the parsers, the entity linker blocks any tags except those matching a given whitelist of regular expressions. Terms like “artist” and “sculpture” will be recognized as relevant tags and therefore applied, while less-relevant tags like “banking” or “philanthropist” are thrown out. These regular expressions also allow for tagging on dynamic contexts, such as “movement,” “medium,” or “era”; therefore, the system will understand that an artist who is tagged as “baroque” and “Spanish” is being tagged according to an era and a location, respectively. This allows for more nuanced recommendations that incorporate context and source variety in the selection algorithms.
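A whitelist of this kind could be sketched as a list of regular expressions, each paired with the context it implies. Artbot’s core application is written in Ruby on Rails, so the Python below is purely illustrative; the patterns, context names, and `filter_tags` helper are assumptions rather than the project’s actual whitelist.

```python
import re

# Hypothetical whitelist: each pattern maps to the context it implies.
WHITELIST = [
    (re.compile(r"\b(baroque|renaissance)\b", re.IGNORECASE), "era"),
    (re.compile(r"\b(spanish|american|japanese)\b", re.IGNORECASE), "location"),
    (re.compile(r"\b(sculpture|painting|photography)\b", re.IGNORECASE), "medium"),
    (re.compile(r"\bartists?\b", re.IGNORECASE), "general"),
]


def filter_tags(raw_tags):
    """Return (tag, context) pairs for tags that match the whitelist."""
    accepted = []
    for tag in raw_tags:
        for pattern, context in WHITELIST:
            if pattern.search(tag):
                accepted.append((tag, context))
                break  # the first matching pattern decides the context
    return accepted


accepted = filter_tags(["Baroque", "Spanish", "banking", "philanthropist", "sculpture"])
print(accepted)
```

Anything that matches no pattern is silently dropped, which is what keeps noisy tags like “banking” out of the recommendations.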

Figure 7: diagram of the database schema, demonstrating how events connect to other events and users

This approach is more sustainable than directly tagging events themselves; by instead tagging the entities associated with the event, the system learns how to tag future events containing that entity. For instance, if the artist Marina Abramovic is tagged with “performance art,” any future event mentioning Abramovic will be automatically tagged as such, rather than requiring new tags on a per-event basis.
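The difference between tagging events and tagging entities can be shown with a small sketch. The data structures and the `tags_for_event` helper are hypothetical; the point is only that an event’s tags are derived from its entities at lookup time, so a tag added once to an entity reaches every event that mentions it.

```python
def tags_for_event(event_entities, entity_tags):
    """Derive an event's tags from the tags of the entities it mentions."""
    tags = set()
    for entity in event_entities:
        tags |= entity_tags.get(entity, set())
    return tags


# Tags live on entities, not on events.
entity_tags = {"Marina Abramovic": {"performance art"}}

# A hypothetical event that mentions two artists.
event_entities = ["Marina Abramovic", "Sol LeWitt"]
before = tags_for_event(event_entities, entity_tags)

# An admin tags Sol LeWitt once; every event mentioning him now inherits it.
entity_tags.setdefault("Sol LeWitt", set()).add("conceptual art")
after = tags_for_event(event_entities, entity_tags)

print(before, after)
```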

  • API (Application Programming Interface)

The application programming interface (API) allows the separate parts of Artbot to talk to each other by sending and receiving dynamic and customizable data via JSON, a lightweight data format that allows for easy integration with any front-end interface. The API has built-in functionality for event and location search and filtering (such as by date, location, user interest, and event relevancy) as well as user authentication and management.

The core API endpoints correspond to Artbot’s two main approaches to discovery: personal recommendations (used on the user “Discover” page), and serendipitous cross-event connections (used on the event and venue pages).

The “Discover” API takes into account a user’s interests (gathered when the user signs up, or when they go to their “My Interests” page), as well as a user’s favorited events. The algorithm first searches for all current events tagged with a user’s interests, as well as all the tags associated with a user’s favorite events. It then withholds any events that have already been favorited by the user. Finally, it orders the events based on heuristics such as tag quantity, event location, and contextual variety. If no events are found, or if the user is not signed in, it defaults to ordering by date, recommending events and exhibitions that close soon.
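The steps of the “Discover” algorithm can be sketched as below. This is an illustration under stated assumptions, not the Rails implementation: the `discover` function and the event and user dictionaries are invented, and the ordering heuristic is reduced to tag overlap alone, leaving out event location and contextual variety.

```python
from datetime import date


def discover(events, user=None, today=None):
    """Recommend current events for a user, or fall back to closing-soon order."""
    today = today or date.today()
    current = [e for e in events if e["start"] <= today <= e["end"]]
    closing_soon = sorted(current, key=lambda e: e["end"])
    if user is None:
        return closing_soon  # not signed in: order by closing date

    favorites = set(user["favorites"])
    # Start from stated interests, then add tags from favorited events.
    wanted = set(user["interests"])
    for e in events:
        if e["id"] in favorites:
            wanted |= e["tags"]

    # Withhold already-favorited events and keep those sharing a wanted tag.
    matches = [e for e in current
               if e["id"] not in favorites and e["tags"] & wanted]
    if not matches:
        return closing_soon
    # Simplified heuristic: more shared tags first, then earlier closing date.
    return sorted(matches, key=lambda e: (-len(e["tags"] & wanted), e["end"]))


# Hypothetical events for illustration.
events = [
    {"id": 1, "tags": {"baroque", "painting"}, "start": date(2015, 1, 1), "end": date(2015, 3, 1)},
    {"id": 2, "tags": {"photography"}, "start": date(2015, 1, 10), "end": date(2015, 2, 1)},
    {"id": 3, "tags": {"baroque", "spanish", "painting"}, "start": date(2015, 1, 5), "end": date(2015, 4, 1)},
    {"id": 4, "tags": {"sculpture"}, "start": date(2014, 1, 1), "end": date(2014, 6, 1)},
    {"id": 5, "tags": {"painting"}, "start": date(2015, 1, 12), "end": date(2015, 2, 15)},
]
user = {"interests": {"baroque"}, "favorites": [1]}
today = date(2015, 1, 15)

personal = [e["id"] for e in discover(events, user, today)]
anonymous = [e["id"] for e in discover(events, None, today)]
print(personal, anonymous)
```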

The cross-event API, which populates the bottom carousel of the event page, looks for signals in the properties of the events themselves, rather than the preferences of the user requesting the event. It utilizes a mix of approaches to determine relevancy and ultimately display and sort items for the user. This includes direct entity relationships (such as a certain artist appearing in two different events), as well as tags associated with these entities (such as an art movement that is associated with two artists mentioned in different events).
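A rough sketch of this cross-event scoring follows. It assumes a simple weighted sum in which a shared entity counts for more than a shared tag; the weights, the `related_events` function, and the placeholder artist names are all assumptions, since the text does not specify how the signals are combined.

```python
def related_events(target, events, entity_tags, entity_weight=2, tag_weight=1):
    """Rank other events by shared entities and shared entity tags."""

    def tags_of(event):
        tags = set()
        for entity in event["entities"]:
            tags |= entity_tags.get(entity, set())
        return tags

    target_tags = tags_of(target)
    scored = []
    for other in events:
        if other["id"] == target["id"]:
            continue
        shared_entities = target["entities"] & other["entities"]
        shared_tags = target_tags & tags_of(other)
        score = entity_weight * len(shared_entities) + tag_weight * len(shared_tags)
        if score > 0:
            scored.append((score, other["id"]))
    # Highest score first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [event_id for _, event_id in scored]


# Placeholder entities and tags for illustration.
entity_tags = {
    "Artist A": {"conceptual art"},
    "Artist B": {"conceptual art"},
    "Artist C": {"baroque"},
}
events = [
    {"id": 1, "entities": {"Artist A"}},
    {"id": 2, "entities": {"Artist A", "Artist C"}},
    {"id": 3, "entities": {"Artist B"}},
    {"id": 4, "entities": {"Artist C"}},
]

related = related_events(events[0], events, entity_tags)
print(related)
```

Event 2 ranks first because it shares an artist with the target, event 3 follows because its artist shares a movement tag, and event 4 is left out because it shares nothing.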

Figure 8: diagram of the query process. Events are linked via both entities and tags.

To support the need for users to find events by date and location, the API can also return a list of events in a given year, month, or specific day, or events occurring within a given radius of a latitude and longitude.
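Date and radius filtering of this kind is commonly done with an interval check and a great-circle distance. The sketch below uses the haversine formula, which is an assumption about the technique rather than a description of Artbot’s Rails code; the function names, venue names, and coordinates are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def events_within(events, lat, lon, radius_km):
    """Events whose venue lies within `radius_km` of the given point."""
    return [e for e in events
            if haversine_km(lat, lon, e["lat"], e["lon"]) <= radius_km]


def events_on(events, day):
    """Events that are open on a specific day."""
    return [e for e in events if e["start"] <= day <= e["end"]]


# Hypothetical venues with illustrative coordinates.
events = [
    {"name": "Venue A", "lat": 42.3394, "lon": -71.0940,
     "start": date(2015, 1, 1), "end": date(2015, 3, 1)},
    {"name": "Venue B", "lat": 42.3601, "lon": -71.0942,
     "start": date(2015, 2, 1), "end": date(2015, 4, 1)},
    {"name": "Venue C", "lat": 42.5000, "lon": -71.5000,
     "start": date(2015, 1, 1), "end": date(2015, 12, 31)},
]

nearby = [e["name"] for e in events_within(events, 42.34, -71.09, 5)]
open_today = [e["name"] for e in events_on(events, date(2015, 1, 15))]
print(nearby, open_today)
```

Filtering by month or year follows the same pattern, with the single day replaced by the first and last day of the period.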

  • Admin management

The scrapers, parsers, and API are built to minimize the need for human maintenance and interference, but Artbot is built on the assumption that human curation is important to the recommendation system; the automated elements exist primarily to supplement and streamline, rather than entirely replace, the work needed to maintain it. As such, the application includes an interface for admin users to manage and connect events, entities, and locations in Artbot. One of the more notable custom functions in the Artbot admin interface allows admins to manually run the natural language parsers on a new or changed event. This gives admins the ability to manually add events without the need for a scraper, but still take advantage of the automated natural language tools.

6. Next steps

In spring 2015, we will be conducting extensive beta testing of Artbot. We hope to investigate in which situations people use Artbot, how often, how long they access the app, and how they take advantage of profile-specific functionalities such as favorites and history. We’re also hoping that user research might begin to answer our larger questions: Can a digital tool like Artbot foster sustained engagement with the arts? Can we engineer the discovery of art through content-based filtering and a dash of serendipity?

In the future, we plan to further modularize parts of the application in order to prepare them for open-source use. For example, the parser technology could be available so that cultural institutions may take advantage of automatic tagging and linking for their own projects, without having to use other parts of Artbot. An open-source Artbot could also potentially be adopted as a whole; this would allow anyone to build their own Artbot with a customized ecosystem of events and venues.

7. Conclusion

At HyperStudio, we collaborate closely with scholars, educators, and students, often building open-source tools that can be reimagined for many contexts. In developing Artbot, we are excited to build a better cultural discovery system, but we also hope the research we have conducted in the process can aid others as they build their own personalization and recommendation tools. Cultural agents, such as local arts councils or consortiums of museums, can adopt the strategies outlined above to reveal connections across institutions. Individual museums, too, can repurpose many of the same strategies to build recommendation systems within their own institutions. For example, a museum might adopt an Artbot-like approach to reveal a web of influences between artists in an exhibition. Museums can apply Artbot’s parser system—a combination of named entity recognition, natural language processing, and linked-data services—to collection objects, so long as collection objects have rich enough data/descriptions.

We hope that the Artbot case study illuminates three main takeaways that cultural organizations on a budget can apply to their own projects:

  1. Design for diversity, but don’t overload your user with too many options. Barry Schwartz (2008) introduces the “paradox of choice,” in which too much choice can lead to decreased participation with culture. When a user is faced with a surplus of options, he or she may be less satisfied with the ultimate decision, or may choose to not choose at all. When designing Artbot, we wanted to build a system that would surprise users with options they may not have been aware of, but we also wanted to avoid overloading them with too many choices. Our interface is designed to limit the recommendations a user sees at any one moment, while allowing him or her to dig deeper into the connections between events and exhibitions. The back-end system and data schema are built to support this, allowing for a wide variety of diverse and sometimes unconventional recommendations.
  2. Build a system that allows for a hybrid of automation and curation. Populating event and exhibition data and maintaining a robust tagging system tend to be time intensive. At HyperStudio, we’re a small staff working on multiple projects; when developing Artbot, we needed to devise a recommendation system that wouldn’t require staff to spend the majority of their time on the upkeep of content. By writing scrapers that retrieve event information from museum websites and running parsers to automatically categorize, classify, and link events, we were able to build a tool that does much of the leg work for us. However, the best discovery systems allow room for human curation. Artbot’s admin interface lets our team add tags manually; by associating the tags with entities rather than specific events or exhibitions, the system learns from the human curation to similarly tag future events containing those entities. A hybrid automated/curated approach affords Artbot the nuance of recommendation apps like Artsy while saving on time and staff resources.
  3. Utilize existing APIs and tools. Like all good software developers, we didn’t reinvent the wheel. Instead, Artbot is built on a variety of open-source frameworks, libraries, gems, and services used for Web scraping, testing, admin management, and many other features. Notably, we used four free services—the Stanford Named Entity Recognizer, DBpedia, OpenCalais, and Zemanta—as the core of our automated tagging system. By combining the four services into our recommendation system, we were able to take advantage of the strengths of each.

In our introduction, we identified Artbot’s audience as Boston locals who want to learn more about the visual arts in the area. While this remains true, we hope that Artbot will be useful beyond this community: not just to the end users, but also to designers and technologists in the cultural sector who could learn from our research, strategies, and process.


Artbot was developed as a collaboration among many individuals and groups surrounding HyperStudio, all of whom have been instrumental in Artbot’s design and implementation. Special thanks to Kurt Fendt, Jamie Folsom, Mark Reeves and Clearbold, Daniel Collins-Puro and Thoughtbot, Hannah Pang, Gabriella Horvath, Rachel Schnepper, Andy Stuhl, and the HyperStudio team.


Chan, S. (2007). “Tagging and Searching: Serendipity and Museum Collection Databases.” In J. Trant & D. Bearman (eds.). Museums and the Web 2007: Proceedings. Toronto: Archives & Museum Informatics.

Pariser, E. (2011). The Filter Bubble: What the Internet is Hiding From You. London: Penguin UK.

Peterson, R. A., & G. Rossman. (2008). “Changing Arts Audiences: Capitalizing on Omnivorousness.” In W. Ivey & S. J. Tepper (eds.). Engaging Art: The Next Great Transformation of America’s Cultural Life. New York: Routledge.

Schein, A. I., A. Popescul, L. H. Ungar, & D. M. Pennock. (2002). “Methods and Metrics for Cold-Start Recommendations.” Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). New York: Association for Computing Machinery. 253–260.

Schwartz, B. (2008). “Can There Ever Be Too Many Flowers Blooming?” In W. Ivey & S. J. Tepper (eds.). Engaging Art: The Next Great Transformation of America’s Cultural Life. New York: Routledge.

Seaver, N. (2012). “Algorithmic Recommendations and Synaptic Functions.” Limn 2, 44–47.

Simon, N. (2010). The Participatory Museum. Santa Cruz, CA: Museum 2.0.

Zuckerman, E. (2013). Rewire: Digital Cosmopolitans in the Age of Connection. New York: WW Norton & Company.



Artbot is a mobile website that encourages meaningful, sustained relationships to art through the discovery of cultural events and exhibitions in the Boston area. The application’s interface allows for two main modes of discovery: a serendipitous approach that allows users to explore linkages between events and exhibitions, and a recommendation system that suggests events based on a user’s interests. Recognizing that users often search for cultural happenings based on convenience, Artbot also allows users to find events and exhibitions by date and by location.

On the backend, events and exhibitions are linked through a symbiosis of automated and human tagging and selection. Custom-designed scrapers automatically pull and update event information from museum websites. Automatic topic modeling and entity recognition tools capture event descriptions and identify the artists, mediums, art movements, and geographic locations that these events represent; these entities are then linked to other events with related entities and concepts. Human curators may also intervene and link artists to movements, or locations to concepts, which serves to improve the automated linking system in turn. Like the app’s frontend design, this flexible tagging structure aims to inject fresh, unconventional, and playful connections between events.

Artbot’s technology stack consists of three interlocking applications. The core database and API are written in Ruby on Rails, while the data scraping and natural language processing tools are in Python. Artbot’s technology stack also leverages free and open source services like the Stanford NER, DBpedia, OpenCalais and Zemanta. Forthcoming plans include releasing components as open source and integrating with MIT-based arts discovery startup Trill. Read more about the project at MIT News.

Roles: Co-product and technical lead; lead developer (Rails, Python)

Annotation Studio


Annotation Studio is an open source web application under development at MIT HyperStudio. It is a collaborative, multimedia annotation platform that engages students in close reading and textual interpretation. It integrates a powerful set of tools with an interface that makes using those tools intuitive. Building on students’ new media literacies, Annotation Studio develops traditional humanistic skills including close reading, persuasive writing, and critical thinking. Features of the initial Annotation Studio implementation, supported by an NEH Start-Up Grant, include aligned multimedia annotation of written texts, user-defined sharing of annotations, and grouping of annotations by self-defined tags to support interpretation and argument development. Annotation Studio allows students to act as “novice scholars,” discovering how literary texts can be opened up through the exploration of sources, influences, editions, and adaptations.

Annotation Studio is built in Ruby on Rails, extensively utilizing the Annotator.js library and participating in the surrounding Open Annotation community. Its custom data store is built in Node.js on top of MongoDB. See it on GitHub.

Roles: Researcher, developer (Rails, JavaScript, Node.js, MongoDB)



Bonfire is a forthcoming open-source tool that is the result of a collaboration between the Nieman Journalism Lab and the Northwestern University Knight Lab.

It is based on Fuego, Nieman Lab’s Twitter-tracking bot. Every hour, Fuego pulls in the links that the future-of-news crowd is discussing, analyzes them for popularity and freshness, does a little math, and determines which links are at the center of the conversation.

Bonfire builds on Fuego by storing and faceting additional metadata from the links, as well as by giving users control over the Twitter communities it tracks. By allowing anyone to spin up his or her own Fuego, Bonfire gives users the ability to tweak their news universe and gain more insight into how Twitter networks circulate information.

Bonfire is written in Python and runs on Elasticsearch.

Roles: Co-lead developer (Python)



DINOWALRUS is a psychedelic synth-rock band from Brooklyn, NY, comprised of vocalist/guitarist Pete Feigenbaum, synth and bass wizard Liam Andrew, and veteran drummer Max Tucker.

Pete started DINOWALRUS in 2008 and they quickly grabbed attention with a long run of local and regional shows. In January 2010, DINOWALRUS put out their debut, %, on Kanine Records. In March 2012, they released their sophomore album BEST BEHAVIOR on Old Flame (US) and Heist or Hit (UK). In summer 2014, COMPLEXION was released on Personal Projects, a part of the Frenchkiss Label Group.

DINOWALRUS has toured and played over 200 shows with the likes of Titus Andronicus, Fujiya & Miyagi, Gauntlet Hair, Crystal Stilts, the Thermals, Sun Airway, Everything Everything, Zombie Zombie, SUUNS, and many others.


What Now


Controlled chaos: the convergence of documentary film and journalism

(Originally posted at Nieman Lab October 29, 2014)


Controlled chaos: As journalism and documentary film converge in digital, what lessons can they share?

Old and new media types from journalism, documentary, and technology backgrounds gathered at MIT to share practices and discuss mutual concerns.

Documentary film and journalism are, in many ways, rooted in the same traditions. Though focus on narrative often differentiates film from traditional journalism, it helps to remember that the earliest films were straightforward recordings of real life, such as trains pulling into stations.

Decades after L’arrivée d’un train en gare de La Ciotat, journalists like Edward R. Murrow made activist films that helped shape the documentary’s focus on social issues, while 1960s direct cinema filmmakers played with a journalistic sense of objectivity and realism.

Today, more and more documentaries are finding news publishers to be the ideal platforms for their work — especially interactive documentaries, like those mapped by Docubase. Meanwhile, journalism schools increasingly offer courses in software development and multimedia production. As both practices migrate into the digital space, they have a lot to learn from one another.

To further explore this convergence, earlier this month MIT’s Open Documentary Lab and the MacArthur Foundation hosted a daylong event called “The New Reality.” (Disclosure: I’m a graduate student in MIT’s Comparative Media Studies program, which houses the Open Documentary Lab.) Participants represented old stalwarts with large audiences like The New York Times, The Guardian, and Frontline, younger upstarts like Vox and Storyful, documentary fixtures from Tribeca and Sundance, and a range of academics studying digital journalism and interactive media. The goal was to explore the synergies and fissures at the crossroads of interactive documentary and digital journalism; here’s a brief overview of what was discussed, what remains unsolved, and what went unsaid.

The forms and platforms are converging

Journalists and filmmakers are increasingly using the same tools to tell stories, and they’re releasing them on the same platforms. Two panels at “The New Reality” — “Documentary Forms and Processes” and “Technologies in a Changing Media Landscape” — focused on these issues. Recurring examples of this technical merging were the many docs released by news entities, such as Katerina Cizek’s Highrise project produced by the National Film Board of Canada and published with the Times.

News organizations already have a built-in audience with stakes in social issues, an ideal springboard for a documentary filmmaker. In addition, entities like the Times and the Guardian have rich archives and technological firepower, allowing filmmakers to continue to push the boundaries of their form.

At the outset, Frontline’s Raney Aronson, a panelist, asked when a documentary should be interactive instead of linear. Panelists explored the tension between immersion and play, and the balance of experimentation with cohesion; web-native documentaries can take endless forms, each with endless capacity, but nobody wants to see a sprawling, sloppy product. The interactive form often requires the viewer to be an active and interested participant in the topic.

Cizek mentioned her favorite line, “I came for the technology, I stayed for the story,” but many storytellers are looking for a broader audience than activists and doc enthusiasts.

The unique form of each interactive doc also makes critical comparison and audience literacy difficult. Most agreed that projects should start with the story and build the form around it, but templates can serve as shortcuts to start developing a language for interactive features. Gabriel Dance of The Marshall Project called each story “a beautiful delicate flower…there is no template, there is no tool,” and AIR’s Sue Schardt stressed that it’s important to find the language before the funding models.

But too much experimentation may also keep the field from legitimizing. Some documentaries, like 18 Days in Egypt or Rachel Falcone and Michael Premo’s Sandy Storyline, are about process and participation too; how can we judge these works critically? How will they be assessed for potential funding? And do they have a place in the newsroom, as CUNY’s new social journalism master’s degree might suggest?

There was also more practical discussion around technologies and platforms, and the challenge of balancing readymade templates and customized tools and code. Standardizing forms would also mean standardizing technologies and frameworks, which would streamline the process and reduce costs, but risk some of the creative experimentation. For now, storytellers are limited by the small screens of mobile devices and minimal capacity for interaction; the most exciting content-sharing platforms are too complex for mass audiences and commercial viability. Having conceded to Facebook and YouTube as the primary interaction and communication platforms, the trick might be to build tools that creatively remix them, though APIs may be unstable and engineers would end up taking on editorial responsibilities.

Audiences, participants, and publics are in transition

Journalists and documentarians have always cared about the impact of their work, but now they can see, measure, and interact with it. Digital metrics have changed what constitutes a successful project, which increasingly contributes to choices made by the creators (and some argued that it certainly should). Moreover, the web has created new opportunities for crowdsourced and participatory works — journalists use their audience to land scoops, source data, and fund projects. At MIT, the depth of potential audience interaction was discussed on panels such as “Rethinking Participation: What Can We Learn from Documentaries?” and “Audience Engagement & Impact.”

But “the audience” and “the public” are two very different groups, as the Times’ Lexi Mainland pointed out. Times readers represent a limited demographic, and will only be able to contribute to a small subset of the paper’s journalism; this is even more true for the niche audiences at small startups and trade journals. Tapping into the web’s communication channels without falling into the audience bubble will be crucial as storytellers hunt for stories worth telling and look for compelling ways to present them.

Some panelists claimed to have a clear picture of their audience, but none have a solid grasp on impact. This is unsurprising, given that even the audience turns out to be slippery — public institutions are there to serve the public, of course, but their viewership and donors must be a priority. Older demographics still reach for TV and traditional forms, while digital and interactive viewers will skew younger. We can measure some behaviors, but they’re continuously shifting. For example, panelist Kamal Sinclair of Sundance pointed out that, while nobody expected millennials to sit and watch a 45-minute video on mobile, Vice has proven that they will.

What does that mean for the definition of a “successful” video project, as compared to a few years ago? Panelist and Rutgers professor Philip Napoli suggested that time spent was a dangerous measure of quality, too, calling attention “the last bottleneck” for the media world. There was general agreement that while metrics for documentary skew towards qualitative and personal impact measurement, journalism skews more towards the quantitative and aggregative. A blurring of these lines seems healthy as the forms collide.

Another concern around audience was the necessity of closing the feedback loop with creators. Participant and USC professor Henry Jenkins championed networked “circulation” over traditional top-down “distribution,” saying it would afford a better afterlife to projects and inform newsroom processes and practices.

The traditions, standards, and institutions remain divergent

Finally, a panel called “Journalistic Standards in Transition” focused on the balance between aesthetics and ethics in documentary and in journalism. For better or worse, journalism is a more codified institution than documentary, with its own degrees and standards about what journalism “is” or should be. Documentary is a more ramshackle affair, with its share of festivals and awards but less unified and established conventions.

The panel started with Aronson asking panelists to define journalism, which set the tone for complex questions: how do you deal with bias or media with an agenda, like an ISIS propaganda video? How many cameras need to be present to “verify” an event? Is it wrong for journalists to manipulate footage, even to add sound effects or music?

The current trend towards advocacy journalism can borrow ideas from documentary, but Jason Spingarn-Koff of the Times’ Op-Docs reiterated the need for fact-checking in order to maintain journalistic rigor. “We shouldn’t make everyone adhere to being journalists, but we do have journalistic standards at the Times,” he said.

But outside the Times, the line grows ever blurrier — there is no journalism, only “acts of journalism,” as Jeff Howe said, reiterating a line of Jay Rosen’s. Some journalistic outfits, like the Center for Investigative Reporting, are making graphic novels and rap videos; Ariane Wu asked when this stopped being journalism and became something more like art. On the one hand, this is a question of semantics, but on the other hand, the question has major consequences for how nonfiction video and interactive projects get made, structured and funded.

Another major difference is that, while docs can take years to create, news is inherently fast-paced. Longform works emerge between these time scales, of course, and can be crucial for bringing the public’s attention to complex story arcs; this type of storytelling helps the audience place newsworthy events in the context of larger historical phenomena. Interactive features might have form and marketing challenges, but they can play a crucial role in balancing the time scale of the news cycle.

What’s next — and what’s missing

While a few participants expressed relief at avoiding state-of-the-industry and revenue model discussions, such conversation was sometimes unavoidable. Beyond lamenting the lack of platform innovation in a crowded market, Larry Birnbaum of Narrative Science reminded attendees that advertisers lurk just around the corner of every new media innovation: there are people with much more money and much clearer goals who are eager for these tools and forms to be developed.

Looking further into the future, new platforms will mean new responsibilities for storytellers. Oculus Rift was cited as an example of a technology that raises the stakes, as do 3-D and tactile media. These platforms, like any others, have the potential to manipulate viewers and spread propaganda, but Birnbaum suggested that while computers can provide us with live data, immersive graphics and interactivity, they are still very far away from the higher-level field of complex storytelling.

Overall, “storytelling” was the word of the day. Participants preferred to self-identify as “storytellers” and “story-makers” rather than the platform-stereotyped “journalist” or “filmmaker.” It’s also telling that while everyone wants to be a storyteller, no one wants to be maligned as a “content creator.”

On the other end of the spectrum, Cizek spoke of “the people formerly known as subjects,” a phrase that resonated with many. I can’t help but wonder, though, whether we haven’t replaced “subjects” with “users,” a term that comes from the tech industry, which has fashioned better techniques for understanding its audience than the journalism or media industries. There could have been, I think, more discussion of these terms and who owns their histories.

Caught between advertisers and aggregators, journalists are not as in control of their message as storytellers typically like to be. In the age of the attention economy, gaining eyeballs often means producing work that triggers an emotional response, new ground for traditionalists. Is this journalism or documentary? Birnbaum, and others, called it loosely controlled chaos.

“Live with it,” he said. “It’s a haphazard field.”

Photo by Michael Saechang used under a Creative Commons license.

I’m feeling lucky: can algorithms engineer serendipity?

(Originally posted at Nieman Lab July 16, 2014)


I’m feeling lucky: Can algorithms better engineer serendipity in research — or in journalism?

Some historical collections are aiming to enable serendipitous content discovery, peering beyond the current limitations of search to capture happy accidents.

Let’s say you have a research topic, and maybe even an angle. You dive in by reading the canonical classics, all of which seem to cite one another, and maybe some of the most recent debates. Now what? Or perhaps you’ve been studying the same topic for years and feel stuck. How can you find a fresh take on a stale debate?

By this point, you might have exhausted the help that discovery platforms like Google and Facebook can provide. Google will reveal the most-cited works (especially on the more specialized Google Scholar or Google News), and Facebook might yield the ones your friends or subject experts value — but there’s no easy way to break out of the networks that define these platforms. Libraries provide content-based discovery portals, which offer one way out, but they often give you too much to wade through, with clunky interfaces and varying levels of relevance.

These limitations are not exclusive to serious researchers. News consumers frequent the same platforms, and they are subsequently directed to the most cited, the most retweeted, and the most relevant keywords. Network-based, big data methods for sorting the wheat from the chaff carry promise, but they rely on their own assumptions about value (mostly based on what’s already popular or viral), and they risk boxing out hidden gems and chance encounters in the process. In other words, the filter bubble affects history scholars as much as casual news browsers — and scholars’ careers often depend on unearthing something rare and different.

As a result, some researchers in the humanities and library worlds are looking for possible paths out of the research bubble for historians and scholars. By looking towards existing browsing and searching habits in both physical and digital environments, they hope to help scholars never miss the information they need — a problem that carries great weight in the news world as well.

The goal, in effect, is to increase the role of serendipitous discovery in online research. Old-school types are nostalgic for the days of walking into the library stacks and seeing what books catch one’s eye; digital tools often have trouble enabling this sort of accidental discovery, where a user finds something valuable that they didn’t even know they wanted.

But serendipitous encounters don’t have to be analog; if anything, digital tools should be able to foster more serendipity, since they can effortlessly reorder categories, effectively rearranging stacks based on the researcher’s avenue of inquiry. But how would one engineer serendipity — and can we even call something serendipitous if it was engineered?

What is serendipitous?

Serendipity can be loosely defined as a chance encounter or an accidental discovery that leads to added insight or value. It seems random, but this definition goes beyond merely injecting randomness into an algorithm. One definition proposed by Gary Fine and James Deegan is “the unique and contingent mix of insight coupled with chance.” The “insight” part is crucial. Serendipity requires a user who is ready to make connections that aren’t obviously there — making it a particularly difficult problem for a computer.

In attempting to classify serendipity, Stephann Makri and Ann Blandford see three facets: how unexpected was the encounter or connection; how much insight did it require from the person making it; and how much value did it give them? Whether or not this works for every instance, it shows the variety of ways in which one can define an encounter as serendipitous, and how often a seemingly lucky event was in fact somewhat directed. Finding a fortuitous article on Facebook or making an important contact at a conference still requires following the right person or attending the right conference.

Anabel Quan-Haase, a professor at the University of Western Ontario, has been researching the role of serendipity in the research process for humanities scholars. She sees the process of serendipitous discovery as a function of a researcher’s “prepared mind,” the first step for serendipity. In order for any accidental connection to occur, the user must be ready for it, which makes the timing of the encounter crucial. This may set back digital tools that are geared towards highly targeted search, where a user is already in the mindset of looking for a specific item.

Beyond a prepared mind, a user must notice the find, stop to review it, extract the information, and finally return to it for future use. In each of these steps, design and user experience play a crucial role — even beyond the initial question of engineering a random-but-relevant encounter.

Engineering serendipity

A user must be mentally prepared for an accidental insight, but it’s unlikely that they’re thinking “I’m feeling serendipitous today.” So standalone platforms that encourage random discoveries could limit the ways in which serendipity can integrate into our digital lives.

But Quan-Haase says that adding serendipity to targeted search would make little sense, given how long we have spent honing the search experience for specific results. Instead, perhaps, we could augment rather than replace search — something like a “serendipity widget,” for example, in the sidebar of a search interface.

Such a widget could display articles at random pertaining to the keyword — or perhaps it could target a little further. One could envision a system that looks at your past searches and attempts to blend them with your present one, or grabs exclusively from sources you don’t normally peruse. You might call this a separate facet for targeted discovery rather than a truly serendipitous encounter (again, there are levels of serendipity), but it could serve the goal of finding what you didn’t know you wanted.
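To make the idea concrete, here is a minimal sketch of one version of such a widget: it picks random articles that match the keyword but come only from sources the user does not normally read. The function name, the article fields ("title", "source") and the notion of a "usual sources" set are all assumptions made for illustration, not part of any existing tool.

```python
import random


def serendipity_picks(articles, keyword, usual_sources, k=3, seed=None):
    """Return up to k random articles that mention `keyword` in the title
    but come from sources outside the user's usual reading (hypothetical)."""
    rng = random.Random(seed)  # seedable so the behavior can be reproduced
    keyword = keyword.lower()
    pool = [
        article for article in articles
        if keyword in article["title"].lower()
        and article["source"] not in usual_sources
    ]
    rng.shuffle(pool)  # the "random" half of random-but-relevant
    return pool[:k]


# Hypothetical sample data for illustration only.
articles = [
    {"title": "Serendipity in the stacks", "source": "Library Journal"},
    {"title": "Serendipity and search", "source": "Daily Habit"},
    {"title": "Engineering serendipity", "source": "Far Afield"},
]
for pick in serendipity_picks(articles, "serendipity", {"Daily Habit"}):
    print(pick["source"], "-", pick["title"])
```

The keyword filter keeps the results relevant, while the shuffle and the source exclusion supply the chance element; blending in past searches would be a further signal layered on top of the same pool.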

Many users in Quan-Haase’s studies cite Twitter as a serendipitous platform. I for one have found many of my most useful sources while randomly browsing Twitter, sometimes after hours of fruitless searching in specialized databases. I know I don’t see every tweet by everyone I follow, but I also know that some of the most inspiring tweets or links won’t be found by simple heuristics like most-retweeted or most-favorited — so I often follow my firehose in hopes of a nugget of gold, and quite often I am not disappointed.

This suggests that Twitter might be a more serendipitous platform than Facebook or Google, which emphasize more targeted customization and personalization. It, along with the Twitter API’s ease of use, also might explain why many organizations take advantage of Twitter to create whimsical bots that inject a bit of randomness into your feed.

For instance, the Digital Public Library of America’s DPLA Bot grabs a random noun and uses its API to share the first result it finds. Lamenting that “the API has no means of calling up totally random items,” the DPLA Bot aims to “infuse what we all love about libraries — serendipitous discovery — into the DPLA.” For now though, this random dive into digital stacks is not personalized, which means you could be in the wrong section of the library.

The British Library’s Mechanical Curator similarly posts random resources with no customization, but its special focus on images in the library’s 17th- to 19th-century collections gives it a lighter and more visual feel. The bot is aimed more at curiosity seekers than serious researchers; the library suggests on its blog that “the pursuit of knowledge is not the point.”

The TroveNewsBot, built on the National Library of Australia’s 370 million resources, features more interactivity. Send the bot any text, and it will dig through the Trove API for a matching result.

It doesn’t stop there: adding #earliest gives the first result in their collection, #latest the most recent; you can also limit the query by year and location. Give the bot a URL and it will fetch the link’s keywords and query the API with them, allowing TroveNewsBot to “respond” to any article on the web. The bot strikes a nice balance between targeted search and random luck, although your luck starts to run out if your interests lie far from Trove’s collections (primarily, Australian newspapers published between 1803 and 1954). Regardless, it’s good fun, as exemplified by the TroveNewsBot’s guide to child rearing.
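As a rough illustration of how a bot might interpret such a message, the sketch below separates the hashtag modifiers from the search text. This is not TroveNewsBot's actual code: the function name, the returned dictionary and the "relevance" default are assumptions, and the year and location filters are left out because their syntax is not described here.

```python
def parse_bot_message(message):
    """Split a message into search text and a sort preference (hypothetical)."""
    sort = "relevance"  # assumed default when no modifier is given
    words = []
    for token in message.split():
        tag = token.lower()
        if tag == "#earliest":
            sort = "earliest"
        elif tag == "#latest":
            sort = "latest"
        else:
            words.append(token)  # everything else is treated as query text
    return {"query": " ".join(words), "sort": sort}


print(parse_bot_message("child rearing #earliest"))
```

A real bot would then pass the resulting query and sort preference on to the archive's search API and reply with the first hit.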

Designing for serendipity

Veering away from Twitter, one tool that seems to get serendipity right is Serendip-o-matic, a project of the One Week | One Tool initiative. Brian Croxall explains that due to the project’s one-week time frame, experimentation and play were baked into the development process, and emphasized at the outset over feature-complete engineering marvels. Rather than using language like “select” or “upload,” they suggest that you “grab some text.” When you hit “Make some magic!” the tool peruses digital collections from the DPLA, Trove, Europeana, and Flickr, returning a series of multimedia documents that hopefully broaden your horizons to the topic at hand.

As might be expected, some results are more serendipitous than others. It’s also hard to know why a certain image or document was selected, which could otherwise be helpful in directing future searches. All the same, Serendip-o-matic’s playful setup and language prime the user well for making accidental discoveries.

These tools (along with others, such as the EuropeanaBot) are primarily targeting digital humanists and historians who are in a rut, but they each have their own insights about what is serendipitous versus simply random. It is difficult to plan for unplanned discoveries, especially so for a computer. Events are only serendipitous in hindsight, consisting of varying levels of planning versus dumb luck. But it seems quite possible to design for serendipitous discoveries, and to help put a user in the mindset for it.

Imagine a “serendipity widget” in your Facebook or Twitter feed, or on the sidebar of a New York Times article. The number and variety of signals that could go into it are endless, and many would bring their own biases. All the same, it would at least offer another pathway into news that relies on different assumptions, adds a sense of playfulness, and reminds a user that there’s more than one way to slice content.

Injecting randomness and play into recommendation systems could be valuable in its own right, but it seems especially timely given the current moment’s intense focus on content personalization. We all want relevant information, but perhaps you want to see something that users unlike you liked, or something no one has ever stumbled across before. Controlled randomness could be one small way to push back on hyper-curation.

Photo by Bob Gaffney used under a Creative Commons license.

Old news: digital newspaper archives at DH2014

(Originally posted on the HyperStudio blog October 27, 2014)


Old News: Digital newspaper archives at DH2014

Books and manuscripts are an archivist’s bread and butter, respectively. Librarians have honed techniques for storing, maintaining, and retrieving their contents for millennia—go into any stack in the library, organized by call number, for ample evidence. But newer media artifacts often don’t fit into old ways of storing and finding information. Digital media brings this problem into full relief, but centuries ago, the newspaper might have been the modern archivist’s first challenge.

Today, archives face the challenge of digitizing their collections, an issue of particular importance for us at HyperStudio, as our research focuses on the potential for digital archives to provide new opportunities for collaborative knowledge creation. For archivists, the digitization of newspapers raises unique questions when compared to their traditional stock. At the DH2014 conference in Lausanne, Switzerland, one panel in particular addressed historical newspaper digitization head-on.

Newspapers are rich archival documents, because they store both ephemera and history. The saying goes that “newspapers are the first draft of history,” but not all news becomes history. In a typical paper, you might find today’s weather sitting next to a long story summarizing a major historic event; historians have traditionally been more interested in the latter. Journalists sometimes divide these types of news into “stock” and “flow”. Flow is the constant stream of new information, meant for right now (think of your Twitter feed). Stock is the “durable stuff,” built to stand the test of time (for instance, a New Yorker longread).

For archivists, everything must be considered “stock”: stored forever. Some historians may be in search of ephemera in an effort to glean insight from fragments of local news snippets, advertisements or classifieds—so everything is of potential historical importance. The Europeana Newspapers project has digitized over 2 million pages with the help of a dozen key partner libraries around Europe, but by their calculations, 90% of European culture is not digitized. The project anticipates reaching 10 million records by 2015, along with metadata for millions more, but it is still a small fraction of Europe’s newspapers.

It is also no surprise that many biases exist even in this wide net of 10 million records. The 10% of culture that is digitized generally consists of culture’s most well-known and well-funded fragments. The lamentable quality of OCR (Optical Character Recognition, a technology that turns scans into searchable text) likewise means that better image scans lead to better discovery. Moreover, groups like Europeana must work across dozens of countries, languages, and copyright laws; some of these will inevitably be better represented and better funded than others. So it seems you’re much more likely to find a major piece in a highbrow English paper than a blurb in the sports section of an obscure Polish daily.

Even taking as a given that everything is potentially important, newspapers present a unique metadata challenge for archivists. A newspaper is a very complex design object with specific affordances; Paul Gooding, a researcher at University College London, sees digitized newspapers as ripe for analysis due to their irregular size and their seriality. A paper’s physical appearance and content are closely linked together, so simply “digitizing” a newspaper changes it massively, reshaping a great deal of context.

Seriality and page placement also extend the ways in which researchers might want to query the archive. For some researchers, placement will be important (was an article’s headline on the first page? Above or below the fold? Was there an image, or a counterpoint article next to it?). Others could be examining the newspaper itself over time, rather than the contents within (for instance, did a paper’s writing style or ad placement change over the course of a decade?). Still others may be hoping to deep-dive into a particular story across various journals. Each of these modes of research requires different data, some of which is remarkably difficult to code and store.

In order to learn more about how people use digitized newspaper archives, Gooding analyzed user web logs from Welsh Newspapers Online, a newspaper portal maintained by the National Library of Wales, hoping to gain insight from users’ behavior. He found that most researchers were not closely reading the newspapers page by page, but instead searching and browsing at a high level before diving into particular pages. He sees this behavior as an accelerated version of the way people used to browse through archives—when faced with boxes of archived newspapers, most researchers do not flip through pages, but instead skip through reams of them before delving in. So while digital newspapers do not replace the physical archive, they do mostly mimic the physical experience of diving into an archive; in Gooding’s words, “digitized newspapers are amazing at being digitized newspapers.” Portals like Welsh Newspapers Online are not fundamentally rethinking archive access, but they certainly let more people access it.

The TOME project at Georgia Tech is aiming to rethink historical newspaper analysis from a different angle. Instead of providing an interface for qualitative researchers to dive in, TOME hopes to facilitate automatic topic modeling and entity recognition, to quickly get a high-level glance of a vast archive with quantitative methods. They are beginning with a set of 19th-century American newspaper archives focused on abolition. The project simplifies statistical analysis tools into a visually compelling interface, but at the risk of losing the context that seriality and page placement provide.

Perhaps the biggest challenge is how to present such a vast presence (and such a vast absence) to historians, curious researchers and individuals, all of whom may be after something slightly different. Where Gooding divided queries into three types (“search,” “browse,” and “content”), the TOME group follows John Tukey’s divide between “exploration” and “investigation,” or those who are looking for what they want and those who know what they want. A good portal into a newspaper archive requires all of these avenues to be covered, but it remains to be decided how best to turn news into data, to visualize troves of ephemera, and to represent absence and bias.

Important books and manuscripts — the “great works” that line history books — tend to present a polished and completed version of events. Newspapers offer another angle into history, where routines, patterns, and debates are incidentally documented forever. Where a book is usually written for posterity, the newspaper is always written for today, reminding the archive diver of history’s unprepared chances and contingencies. The historians who mine old newspapers — and the archivists who enable them — have many new digital tools at their disposal to unearth promising archives, but much effort remains to fairly represent news archives, and determine how we might best use them.

Rethinking recommendations: digital tools for art discovery

(Abstract submitted to the DH2014 conference in Lausanne, Switzerland, presented July 11, 2014. Co-authored with Desi Gonzalez)

Automatic discovery and recommendation systems are often designed with one of two audience groups in mind: in academia, the target is the dedicated researcher who actively seeks out particular sources, whereas companies like Amazon or Netflix design recommendations for the casual, passive browser, with convenience as the top priority. Often, however, a user is both browser and researcher in separate tabs; while diving into research in a scholarly database, a user can simultaneously peruse news aggregators or Amazon. For-profit companies often recommend cultural products such as books and movies, but do so with a single goal—increasing the company’s profit. As digital humanists, we should rethink the structure of recommendation algorithms to make them more appropriate for audiences interested in deeper explorations of cultural heritage.

At HyperStudio, we are investigating how digital tools can encourage discovery and serendipity in the humanities, with a focus on art objects and museum collections. For this short paper session, we propose to share our research on the process of discovery, assessing algorithms used in research and recommendation tools on both scholarship and industry platforms. We will survey existing projects that allow scholars and casual users alike to discover new art. We will also discuss a tool that we are building, tentatively titled ArtX, that empowers users to discover cultural events, exhibitions, and art objects in the Boston area. Informed by our theoretical research into cultural recommendation systems, we are prototyping and testing this tool this spring and will be sharing our results at DH2014.

Recommendation systems are typically divided into two approaches: collaborative filtering and content-based filtering.1 While many digital tools use these in combination, here we outline the approaches and their limitations separately. Content-based filtering approaches, such as traditional tagging systems, look at the properties of the content rather than the user. Whether human- or machine-powered, tagging involves inferring what an object is “about” and how one might search for it, and assigning keywords of names, topics, or entities. The act of classifying culture is by its nature restrictive; when an art object is called “surrealist” or “American,” it is placed in a particular discourse and others are implicitly excluded. Even outside-the-box descriptions such as “hazy” are just different boxes. Artsy’s “Art Genome Project” offers a more nuanced approach to tagging (with gradients from 0 to 100, rather than 0 to 1), but this runs into the same problem.2 When an authoritative institution such as a museum produces tags, the tagging system lacks dynamism. User-generated tagging systems, or folksonomies, add a dynamic element but require that users actively and continually contribute to building up the tags, a process that is difficult to maintain.

Collaborative filtering attempts to sidestep these limitations, focusing instead on the user and their online behavior, similar users, and social networks. User history-based approaches like Amazon’s maximize efficiency at the expense of variety, assuming that a user has no desire to try something new. Social curation tools such as Curiator, ArtStack, Pinterest and Tumblr allow users to build their own collections and share with others, but they perpetuate what is already popular or the most reblogged. Collaborative filtering may work when shopping for a product, but risks creating a filter bubble for art. It shepherds audiences into identical routes of understanding, stifling productive conversation and undiscovered treasures in the process. At the heart of these approaches is the notion that more personalization leads to higher quality, and that existing networks and canons should be reinforced; these are meaningful signals, but they should not be the only ones.
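The contrast between the two approaches can be sketched in a few lines. This is a toy illustration under stated assumptions, not any platform's actual algorithm: the artwork names, tag weights (on a 0 to 100 scale) and user names are invented, cosine similarity stands in for whatever content comparison a real system uses, and a simple overlap count stands in for collaborative scoring.

```python
import math


def cosine(a, b):
    """Cosine similarity between two tag-weight dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def content_based(target, catalog):
    """Rank the other artworks by how closely their tags match the target's."""
    scores = {
        name: cosine(catalog[target], tags)
        for name, tags in catalog.items() if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)


def collaborative(user, likes):
    """Rank unseen artworks by how strongly like-minded users favor them."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        overlap = len(mine & theirs)
        if other == user or overlap == 0:
            continue  # users with no shared taste contribute nothing
        for item in theirs - mine:
            scores[item] = scores.get(item, 0) + overlap
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data for illustration only.
catalog = {
    "A": {"surrealist": 90, "hazy": 40},
    "B": {"surrealist": 80, "hazy": 10},
    "C": {"american": 100},
}
likes = {"ana": {"A", "B"}, "ben": {"A", "B", "C"}, "cy": {"D"}}

print(content_based("A", catalog))
print(collaborative("ana", likes))
```

In this toy data, artwork D is liked only by a user who shares nothing with "ana", so the collaborative ranking never surfaces it, and artwork C scores zero against A because the two share no tags. Both are small-scale versions of the limitations described above.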

One alternate approach is to include a serendipitous chance in the discovery process. The role of serendipity in scholarly research has been a growing topic of investigation in recent years.3 Serendipity has historically played a significant role in science, mathematics, and the humanities. As resources are increasingly digitized, an oft-cited lament is the lack of serendipity, with scholars yearning for the days when one would go to the library stacks looking for one book and happen upon another that sparks his or her thinking in new directions.

While serendipity is chance-based and cannot be controlled, perhaps it can be engineered. A few existing digital humanities and cultural heritage projects experiment with engineering serendipity. Serendip-o-matic, launched in August 2013, aims to re-incorporate chance into the scholarly research process. On the website, users input a text; the tool identifies key words in the text and responds with primary source images from several online collections. The goal of Serendip-o-matic is to yield happy accidents for a wide range of users, whether students in search of inspiration for a paper topic or scholars looking for materials to enliven a current project.4 Another example is Magic Tate Ball, a mobile application designed by digital studio Thought Den to encourage a general audience to discover works of art in the Tate’s collection. Using GPS location, time of day, weather, and analysis of ambient noise, the application returns an artwork, explaining why this work was selected and providing content that allows the user to learn more.5 Magic Tate Ball enables users to engage with works they would not have sought out otherwise while infusing play in the discovery process.

At HyperStudio, we hope to incorporate a similar sense of serendipity in ArtX. Serendipity has the dual advantage of skirting traditional boundaries and adding a playful element to the user experience, which serves both browser and researcher. As we aim to make meaningful and creative connections between the art objects that comprise our past and the events of the present, we believe we can incorporate both audience groups without sacrificing archival rigor. To do so, we will need a holistic, audience-centered approach to digital curation and recommendation.

To achieve this goal, we plan to start small. Through specific partnerships with museums in Boston, we are building a closed and controlled system that can serve as a testing ground for new models of recommendation. Free from industry demands such as growth and scale, we can perfect our schemas and our assumptions before expanding to other institutions. We are also hopeful about creating a collaborative, open-source approach to art recommendation, particularly given the close secrecy with which proprietary recommendation algorithms are guarded. By encouraging open conversation around the ways we recommend art, we may find unique approaches and ways in which current recommendation systems are insufficient or misleading.

We have many questions and challenges ahead. It will be important to understand our audience: How much control over the discovery process do users want, and how can we best balance the sliding scale between browser and researcher? We expect our primary audience to be Boston-area residents and university communities—a casual but informed audience that bridges aspects of both. We hope to instill a scholar’s depth of interest and rigor in the casual user and we hope scholars too can employ the tool as serendipitous inspiration for their own work. But how transparent can we be about the logic behind our recommendations? How can we scale such a strategy, connecting artworks to books, lectures, music, movements and ideas?

Perhaps most importantly, while we have explained “why serendipity,” we must address the “how.” Serendipity involves more than simply selecting objects at random, but what signals are important? How can we prime a user for the mindset of serendipitous discovery, rather than rote research? Moreover, is it truly serendipitous if we are closely engineering the suggestion? We look forward to addressing these questions, but with care to not create our own faulty algorithms. One of the challenges in this process is to avoid reducing cultural objects to the level of products, and museum audiences to consumers. Looking past the current limitations of discovery will be vital for generating new connections and ideas.


1. A.A. Kardan and M. Ebrahimi, A novel approach to hybrid recommendation systems, Information

2. Interview: Matthew Israel on The Art Genome Project, September 21, 2013, Museum Geek,

3. Scholarship includes Allen Edward Foster and Nigel Ford, “Serendipity and Information Seeking: An Empirical Study,” Journal of Documentation, 59 (2003): 3, pp. 321-340; Sebastian Chan, “Tagging and Searching – Serendipity and museum collection databases” (paper presented at the annual meeting for Museums and the Web, San Francisco, California, April 11-14, 2007); and Anabel Quan-Haase and Kim Martin, “Digital Humanities: The Continuing Role of Serendipity in Historical Research” (paper presented at the annual meeting for iConference, Toronto, Canada, February 7-10, 2012).

4. One Week | One Tool Team Launches Serendip-o-matic, Roy Rosenzweig Center for History and New Media, Friday, August 2, 2013, matic.

5. Ben Templeton (2012), Mobile Culture and the Magic Tate Ball, The Guardian, July 16, blog/2012/jul/16/mobile-culture-magic-tate-ball-app.

FOLD wants to keep you from tumbling down link rabbit holes

(Originally posted at Nieman Lab July 2, 2014)

Two MIT Media Labbers are developing a “context creation platform” that aims to explain what you need to know without taking you out of the reading experience.

Journalists are in the business of creating both content and context. The rise of “explainer journalism” outlets and topic rundowns like Vox’s card stacks demonstrates an increasing interest in news that takes a step back from a specific event, picks up the myriad fragments of information it leaves behind, and distills them for a curious audience.

However, some stories (and some readers) need more context than others, and links don’t always do the trick. I just linked to Vox’s card stacks, but any reader who followed that link might have noticed that there is (ironically) no explanation of what the cards are. It might have taken you a moment to figure it out — and in the meantime, you left this article and might never return, buried instead in Vox’s impressive card catalog. The link, which I intended to be helpful, may have confused or even completely derailed you. How can a writer balance useful context with a sense of control?

Enter FOLD, a “context creation platform” under development by Alexis Hope and Kevin Hu of the MIT Media Lab. FOLD allows storytellers to add contextual elements to a story, with a clever design that places the context to the side horizontally, offset against the vertical scroll of the content. They’ve made it easy to include rich multimedia as context, with quick shortcuts for embedding YouTube videos, maps, Storify stories, and (of course) GIFs, all annotatable by the writer.

The idea: A storyteller can follow the “do what you do best, link to the rest” dictum without worrying about losing readers to other sites, topics, or tangents. A writer could add background context to an article about a complex and ongoing topic, using only the contextual elements that matter (rather than, for instance, linking to an extensive Wikipedia page). From the reader’s perspective, curated context could make it easier to delve into new and unfamiliar topics with more confidence.

FOLD was born out of Ethan Zuckerman’s News and Participatory Media course at the Media Lab, where Hope and Hu were inspired by Zuckerman’s metaphor of “unfolding” a story to get more or less detail from it. They were also intrigued by the challenge of improving the explainer journalism model.

Although it was hatched in a news class, the platform’s design is flexible and could be taken in unforeseen directions. Hope seems ready for that, and she is careful to call FOLD’s writers “storytellers” rather than “journalists,” knowing that this could see use outside of newsrooms (such as in classrooms).


Hope is also ambitious about potential new features and futures. Suggesting that journalists won’t always have the time to curate the ideal tangents to their stories, she envisions ways in which FOLD can start to automatically suggest the best contexts for new content. Hu’s background as a researcher in the Media Lab’s Macro Connections group primes them well for finding links between the tangents and providing automated recommendations. (Disclosure: I’m a graduate student in MIT’s Comparative Media Studies program.)

Crowdsourcing is another clear option, one that starts to overlap with the burgeoning annotation community. (Indeed, Quartz, one of the more celebrated adopters of annotations, is a major design inspiration.) But FOLD goes beyond annotations, building in context as a first-class citizen of the platform, and Hope envisions collaborative context curation that could inform new editions of the story itself. She is also excited about the potential for sharing, reusing, and remixing the contextual information, whether by publishers or the public.

This raises the question of whether FOLD would be better served (or monetized) as a standalone publishing platform that writers will come to and build a presence on (think Medium-as-platform as opposed to Medium-as-publisher) or as a plugin or tool that can work with a newsroom’s existing CMS. It is launching as the former, but Hope and Hu are also exploring the latter. Publishers might be understandably reluctant to let their writers publish under a different brand; on the other hand, the design may not be flexible enough to seamlessly integrate into another publisher’s site, and as Hope says, “We don’t want to build a content management system for newsrooms.”

One idea would be to partner with specific, forward-thinking publishers that might be more ready to compromise on their design for the sake of richer storytelling. FOLD expects to start here by trying to work with local publishers and established reporters, while they dive deeper into researching the ways that newsrooms are currently explaining their stories and educating their readership.

Regardless of monetization, the FOLD creators hope that by bypassing the loaded hyperlink, news sites will be more apt to share better contextual information, whether sourced from content creators, computers, or communities, from which readers can enjoy richer and more active reading experiences. Personally, I would be thrilled if you could reach the end of this article without ending up in a Wikipedia k-hole (but now you’re about to).

Mike Bostock wants us to visualize algorithms

(Originally posted at Nieman Lab June 26, 2014)

Mike Bostock wants us to visualize algorithms, not just the data that feeds into them

Mike Bostock is one of data visualization’s leading lights. As creator of the hugely popular visualization library D3.js and editor in The New York Times’ graphics department, he has had a hand (visibly and invisibly) in most of the widely shared interactives on the web.

Today Bostock posted an adaptation of a celebrated talk he gave at Eyeo 2014 about visualizing algorithms. Full of ideas and gorgeous patterns, it elegantly flips the script on the typical data visualization.

Computing is sometimes conceptually divided into data structures and algorithms, and we usually visualize the data while ignoring the processes that manipulate it. But Bostock argues that “visualization is more than a tool for finding patterns in data.”

He breaks down various methods for sampling, shuffling, sorting, and making mazes, ably explaining (via text and gorgeous graphics) why there are different types of randomness, for example, or how to most effectively sort a list.


Bostock is interested in the value of visualizing algorithms for learning about and understanding complex processes. A novice could use a visualization to peer into an algorithm’s black box; an expert algorithm builder might visualize in order to debug and reframe it.

He classifies algorithm visualizations based on the level of introspection they give into the data — some only show the output, while others let you peer fully into how data points are being manipulated.

The goal here is to study the behavior of an algorithm rather than a specific dataset. Yet there is still data, necessarily — the data is derived from the execution of the algorithm. And this means we can use the type of derived data to classify algorithm visualizations.
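To make the idea of “derived data” concrete, here is a minimal sketch in Python (not Bostock’s own code, which is written in JavaScript). It runs a Fisher-Yates shuffle, one of the shuffling methods of the kind he discusses, and records every swap. The function name `shuffle_with_trace` and the shape of the trace are my own assumptions for illustration.

```python
import random


def shuffle_with_trace(items, seed=None):
    """Fisher-Yates shuffle that also records each swap it performs.

    Returns the shuffled list plus a trace of (i, j, snapshot) tuples.
    The trace is the "derived data" a visualization could draw from.
    """
    rng = random.Random(seed)  # a seed makes the run repeatable
    result = list(items)
    trace = []
    # Walk from the last position down, swapping each element with a
    # randomly chosen element at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
        trace.append((i, j, tuple(result)))
    return result, trace


shuffled, trace = shuffle_with_trace("ABCDEF", seed=42)
for i, j, snapshot in trace:
    print(f"swap {i}<->{j}: {''.join(snapshot)}")
```

Plotting only `shuffled` would show the output; plotting `trace` shows the process, which is the distinction the classification above turns on.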

Using his work on the Times’ revamped rent-versus-buy calculator as an example, he shows how opening up the algorithm allows for new questions:

To output an accurate answer, the calculator needs accurate inputs. While some inputs are well-known (such as the length of your mortgage), others are difficult or impossible to predict. No one can say exactly how the stock market will perform, how much a specific home will appreciate or depreciate, or how the renting market will change over time.

We can make educated guesses at each variable — for example, looking at Case–Shiller data. But if the calculator is a black box, then readers can’t see how sensitive their answer is to small changes.

To fix this, we need to do more than output a single number. We need to show how the underlying system works.
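As a rough illustration of that sensitivity, consider the toy sketch below. It is emphatically not the Times’ calculator: the cost model (an interest-only loan, no taxes, fees, or maintenance), the function names, and every default number are hypothetical assumptions chosen only to show how sweeping one uncertain input can flip the answer.

```python
def net_cost_of_buying(price, years, appreciation,
                       mortgage_rate=0.04, down=0.20):
    """Toy model: interest paid on the loan minus the change in home value."""
    loan = price * (1 - down)
    interest_paid = loan * mortgage_rate * years  # interest-only simplification
    value_change = price * ((1 + appreciation) ** years - 1)
    return interest_paid - value_change


def net_cost_of_renting(price, monthly_rent, years,
                        rent_growth=0.03, investment_return=0.05, down=0.20):
    """Toy model: rent paid minus gains from investing the down payment."""
    rent_paid = sum(monthly_rent * 12 * (1 + rent_growth) ** year
                    for year in range(years))
    investment_gain = price * down * ((1 + investment_return) ** years - 1)
    return rent_paid - investment_gain


def sweep_appreciation(rates, price=300_000, monthly_rent=1_500, years=10):
    """Return (rate, buying cost minus renting cost) for each rate."""
    rent = net_cost_of_renting(price, monthly_rent, years)
    return [(rate, net_cost_of_buying(price, years, rate) - rent)
            for rate in rates]


rates = [r / 100 for r in range(-4, 5)]  # -4% to +4% per year
for rate, difference in sweep_appreciation(rates):
    verdict = "renting is cheaper" if difference > 0 else "buying is cheaper"
    print(f"{rate:+.0%} appreciation: {difference:>12,.0f}  ({verdict})")
```

Showing the whole sweep, rather than a single number, lets a reader see where the answer changes and how close their own guess sits to that point.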


Some of the examples are fairly technical and outwardly trivial — in a sense, what are the social implications of a sorting algorithm as long as the sorting happens? But they do demonstrate the sheer number of ways to solve a seemingly simple problem, and in the case of some of these examples (such as sampling algorithms), the results matter immensely.

The examples also demonstrate an opportunity to rethink what a visualization can tell us. Whether static or dynamic, or whether describing a state or a process, a visualization can show and hide as much as it needs.

Recommending art, suggesting culture

(Originally posted on the HyperStudio blog November 25, 2013)


Think of the word “algorithm” and you might picture a data scientist crunching numbers in front of a terminal, analyzing functions and equations that you can’t begin to understand. If they’re building models of weather systems, you might be right (I can’t help you there). But recommendation systems are another story. The algorithms can be complex, but the output tends to be very simple: a list of some news articles, movies, or other content you might like.

Recommendation science is not rocket science, but it has always been the realm of the engineer. There’s no doubt that a good engineer can hone and chisel a recommendation algorithm to near-perfection (whatever that means), but first you have to choose the type of stone. Each recommendation system has certain methods and assumptions baked into it, and determining what kinds of inputs belong can be more of an art than a science.

Recommendation systems are generally divided into two types: collaborative filtering and content-based filtering. Content-based filtering focuses on the product itself, like a traditional library classification system. Netflix provides one example: after culling their records down to several thousand movies and TV shows, they hire freelance film buffs to tag content with delightfully contrived categories like “Mind-Bending Romantic Foreign Movies” or “Understated Detective TV Shows” (though they also match you to similar users). While these are more fun than generic, automated tags (and Netflix deserves credit for using humans), these categories are still boxes; they place cultural products into certain discourses and implicitly exclude others. Tagging systems are inherently stale and lacking in dynamism, and folksonomies aren’t always feasible and come with their own problems.
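A content-based recommender can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not a description of how Netflix works: the catalog, the tags, and the choice of Jaccard similarity (shared tags divided by all tags) are hypothetical.

```python
def jaccard(a, b):
    """Share of tags two items have in common, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalog: every title and tag here is made up.
CATALOG = {
    "Film A": {"mind-bending", "romantic", "foreign"},
    "Film B": {"mind-bending", "foreign", "thriller"},
    "Show C": {"understated", "detective"},
    "Show D": {"understated", "detective", "foreign"},
}


def similar_items(title, catalog, top_n=2):
    """Rank the other items by how many tags they share with `title`."""
    target = catalog[title]
    scored = [(jaccard(target, tags), other)
              for other, tags in catalog.items() if other != title]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, round(score, 2)) for score, other in scored[:top_n]]


print(similar_items("Film A", CATALOG))
```

Everything the system knows about a work lives in its tags, so anything the taggers did not think to label is invisible to it. That is the “boxes” problem in miniature.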

One attempt to sidestep classification is via collaborative filtering, which looks at the user, their past behavior, and similar users or social networks for clues into what the user might like. Consider Amazon, which uses this model extensively; given the massive scale of products on offer, many from third parties, it is more manageable to leverage machine-learning algorithms that watch what you buy and browse, rather than attempting to infer the properties of thousands of new products a day. A user-history based approach maximizes efficiency but at the expense of variety, assuming that you want to keep seeing more of the same.

If a system looks to your social networks, it feeds you what’s already most popular, stifling individual preference and shepherding audiences into identical routes. If you were to sit around a table with “users like you” and start a conversation, would you rather have seen everything they’ve seen, or something a little different? If you’re all on the same page, how would anyone bring anything new to a discussion? What about all the possible treasures out there that haven’t been discovered yet? Collaborative filtering may work when shopping for consumer products, but it creates a filter bubble for art and culture.
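The bubble is easy to reproduce in a toy user-based collaborative filter. The sketch below is an assumption-laden illustration, not any company’s algorithm: the users, the viewing histories, and the overlap-weighted voting scheme are all hypothetical.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of objects they have viewed.
HISTORY = {
    "ana": {"painting-1", "painting-2", "sculpture-1"},
    "ben": {"painting-1", "painting-2", "painting-3"},
    "cy": {"painting-2", "painting-3", "photo-1"},
    "dee": {"textile-1"},
}


def recommend(user, history, top_n=3):
    """Suggest unseen items, weighted by overlap with similar users."""
    seen = history[user]
    votes = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            votes[item] += overlap
    # Most votes first; ties broken alphabetically for stable output.
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("ana", HISTORY))
```

In this sketch, `textile-1` can never reach `ana`, because the only person who has seen it shares nothing with her. The undiscovered treasure stays undiscovered.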

The challenges and limitations behind each of these models are very different, and they point to a need to focus on the objects and users of recommendation systems, rather than the algorithms. As taste plays more of a factor in recommendation, this becomes even more crucial. If you click away from a product on Amazon, maybe it’s because you didn’t like the price, the quality, or the reviews. Regardless, the company assumes something about you. It gets even more complicated with art or music. Art has ever-changing material, cultural, and discursive properties; which ones are most important to a given viewer? Do they like being challenged and broadening their horizons, or prefer staying in their comfort zone within a certain style or mood? If the latter, is it worth trying to change their mind?

This brings up another variable that goes underserved when the focus is on the algorithm: what is the metric for a “successful” recommendation system, and how does that change from company to institution? Industry tends to give people more of what they want, with the end goal of a click or purchase. Cultural institutions like ours can break convention here, and at HyperStudio we hope to challenge users by making unique connections and new introductions, so long as users are ultimately delighted and informed. Unencumbered by industry demands like growth and scale, we can maintain a very different metric for success, and it could be unique to each project or user.

We also hope to open up further discussion about these systems and their limitations. The extreme secrecy behind companies’ proprietary algorithms and the domain’s traditional engineering bent make the recommendation system something of a black box. Given the extent to which automatic recommendations affect what we read or hear about, it’s important to understand what can go into them. HyperStudio and other open-source initiatives can play a role in making them more transparent: creation and discussion of recommendation algorithms could lead to insight about the decisions computers are making for us and their assumptions about what we want. Perhaps in our effort to improve them, we’ll discover some ways in which proprietary algorithms are failing us.

Regardless, I should hope that people have a different relationship to art than to a product or piece of information, and cultural institutions should treat their audiences differently from companies. It’s important to devise recommendation systems that avoid reducing cultural objects to the level of products, and museum audiences to consumers. At the heart of the collaborative and content-based systems is the notion that more personalization leads to higher quality, and that existing networks, discussions and canons are there to be reinforced. These are meaningful and important signals, but they need not be the only ones. While quality is always important, categories should be fuzzy, as should networks of people; the most important signals are often the nodes that link them, and we hope to surface these new connections. When it comes to art and culture, looking past the current limitations of discovery will be vital for generating new ideas and conversations.