Using the Media Cloud API, I gathered all of the stories from The New York Times and The Guardian in January 2014, about 10,000 stories for each. For each story, I extracted data about its inline hyperlinks: each link’s place on the page, its anchor text, and whether it was an inlink or outlink. I started with two research questions:

Are there patterns in the ways that different sections or desks of the newsrooms use internal hyperlinks?
Could the hyperlinks created within news stories at either organization constitute new categories of their own?
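
The inlink/outlink distinction used throughout can be drawn by comparing each link's host with the publication's own domain. The following is a minimal sketch of that idea, not the script used for this study; the function name `classify_link` and the choice to count relative links as inlinks are assumptions made for illustration.

```python
from urllib.parse import urlparse


def classify_link(href, site_domain):
    """Label a hyperlink as 'inlink' or 'outlink' relative to site_domain.

    Assumption: relative links (no scheme, no host) stay on the same site,
    so they are counted as inlinks.
    """
    parsed = urlparse(href)
    host = (parsed.hostname or "").lower()
    if not parsed.scheme and not host:
        return "inlink"  # relative link such as "/2014/01/05/story.html"
    if host == site_domain or host.endswith("." + site_domain):
        return "inlink"  # same domain or one of its subdomains
    return "outlink"


print(classify_link("http://topics.nytimes.com/top/reference/index.html", "nytimes.com"))
print(classify_link("http://example.com/some-page", "nytimes.com"))
```

Matching on the domain suffix means subdomains such as topic pages or blogs still count as internal, which matters when blogs are compared with traditional stories.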

Q1: Linking by category

I treated each story’s URL path as a proxy for the category or desk that it is in. Using regular expressions, I queried for metadata about the “link profile” of the stories in each category of the Times and the Guardian.
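
As a rough illustration of using the URL path as a category proxy, the sketch below pulls the first meaningful path segment out of a story URL with a regular expression. The URL layouts it assumes (an optional `/YYYY/MM/DD/` prefix followed by the section name) and the names `CATEGORY_RE` and `category_of` are hypothetical, chosen for the example rather than taken from the original analysis.

```python
import re
from urllib.parse import urlparse

# Assumed layout: an optional date prefix, then the section as the next segment.
CATEGORY_RE = re.compile(r"^/(?:\d{4}/\d{2}/\d{2}/)?(?P<category>[a-z-]+)/")


def category_of(url):
    """Return the section/desk segment of a story URL, or None if absent."""
    path = urlparse(url).path.lower()
    match = CATEGORY_RE.match(path)
    return match.group("category") if match else None


print(category_of("http://www.nytimes.com/2014/01/15/world/europe/some-story.html"))
print(category_of("http://www.theguardian.com/football/2014/jan/15/some-story"))
```

A URL with no section segment yields `None`, so such stories can be counted separately instead of being forced into a category.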

The first striking difference was in the breakdown of categories. Here’s the Guardian’s distribution of types of stories:

And the Times’:

The Guardian seems to have a more balanced breakdown of stories (two Times categories take up nearly 50% of the pie), but the Guardian also splits “sports” and “football” into two separate categories, so the comparison is not to be completely trusted.

I then looked at the average rate of internal linking in each newsroom’s stories, broken down by category. Here are the Times’ top 20 desks for inlinking:

Blogs are just 13% of total stories but account for 31% of the links. However, blog links skew towards outlinks, and only 2% of them go to Times topic pages. Meanwhile, half of all inlinks on traditional Times stories go to topic pages. Blogs seem to be a site of greater outlinking, with about the same level of inlinking as traditional Times stories.

I was especially surprised by the low number of inlinks on “world”, a category which would seem to require more context to explain complex and distant events. “world” and “sports”, the most populous categories, have the lowest link levels. These categories do not support the idea of a link-based classification scheme.

The Guardian was a different story, with 20 categories averaging more than 3 inlinks per story:

Lastly, I matched up all of the categories I could between the Times and the Guardian (for instance, each one has a “science” section) and ran a side-by-side comparison of the two papers. Blue is the Guardian and gray is the Times.

The Guardian beats the pants off of the Times, in nearly every category. The “.+” category is particularly important, as it is the global figure (“.+” is a regular expression that matches any URL). The difference is stark.
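
To make the role of the “.+” pattern concrete, here is a small sketch of how an average inlink rate per category pattern could be computed. The `stories` list is made-up sample data, and the function name `inlink_rate` and the record fields `url` and `inlinks` are assumptions for illustration only; none of these numbers come from the study.

```python
import re


def inlink_rate(stories, pattern):
    """Average inlinks per story among stories whose URL matches pattern."""
    regex = re.compile(pattern)
    counts = [s["inlinks"] for s in stories if regex.search(s["url"])]
    return sum(counts) / len(counts) if counts else 0.0


# Made-up sample records, for illustration only.
stories = [
    {"url": "http://example.com/movies/a", "inlinks": 6},
    {"url": "http://example.com/movies/b", "inlinks": 4},
    {"url": "http://example.com/world/c", "inlinks": 1},
    {"url": "http://example.com/sports/d", "inlinks": 1},
]

for pattern in (r"/movies/", r"/world/", r".+"):
    print(pattern, inlink_rate(stories, pattern))
```

Because “.+” matches every URL in the list, its rate is simply the average over all stories, which is why it serves as the global baseline next to the per-category figures.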

It’s interesting that “movies” does particularly well for both publications, as does “theater”; presumably, these stories link often to topic pages about films, actors, and directors. Such stories are also usually slower-paced and less driven by breaking news, allowing more time for background research and context.

Q2: Network formation

My next question was whether these internal links could start to self-organize. This required examining a subset of each archive as a network graph, homing in on who was linking to whom at a story level. I selected the Times’ “arts” section and the Guardian’s “arts and culture” section, each containing around 400-500 stories and 4000 links.

The Guardian’s graph is as follows:

And the Times’ graph:

It is clear from a quick review of each graph that the Guardian’s appears more “networked,” with more coherent clusters forming rather than isolated pods. This points to possible new ways to organize topic pages, recommend related stories, and arrange archival material.
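
One way to put a number on “clusters” versus “isolated pods” is to count the connected components of the story-to-story link graph. The sketch below does this with a breadth-first search over undirected neighbors; it is a swapped-in simplification rather than the method behind the graphs above, and the `edges` list and the `components` helper are hypothetical.

```python
from collections import defaultdict, deque


def components(edges):
    """Group stories into connected components, ignoring link direction.

    Stories with no links at all never appear in edges, so they are not counted.
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), set()
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return sorted(groups, key=len, reverse=True)


# Made-up (source story, linked story) pairs, for illustration only.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
groups = components(edges)
print(len(groups), [sorted(g) for g in groups])
```

Under this measure, a graph with a few large components would read as more “networked” than one split into many small ones, though component counts alone say nothing about how dense each cluster is.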

This is an early experiment and this research question needs more fleshing out. Future inquiries will attempt to “spider” out from links to get stories outside of January 2014, and will also attempt to cross categories in order to examine the networks formed between different desks at a news organization.

Link profiles of the New York Times and the Guardian
