Presented with Daniel Craigmile at NICAR 2016.
Whether they’re built by data journalists, search engines, web archivists or malicious troublemakers, bots are substantial users of news websites. At The Texas Tribune, bots account for half of our site’s traffic, and they’re responsible for our search presence and our archival legacy, as well as for site attacks and performance problems. In this session, we dove into the world of bot users and shared some tips for identifying and managing these crawlers and scrapers: helping the “good” bots do their work and keeping the “bad” ones from wreaking havoc.
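A minimal sketch of the kind of identification work the session covered (not code from the talk): scan a combined-format access log and tally requests whose User-Agent matches a few well-known crawler signatures. The log path and the signature list here are illustrative assumptions.

```python
import re
from collections import Counter

# Tokens that well-behaved crawlers commonly include in their User-Agent headers.
BOT_SIGNATURES = ("googlebot", "bingbot", "yandex", "baiduspider",
                  "ia_archiver", "archive.org_bot", "facebookexternalhit")

# In combined-format logs, the User-Agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def classify(line):
    """Return 'bot', 'human', or 'unknown' for one access-log line."""
    match = UA_PATTERN.search(line)
    if not match:
        return "unknown"
    ua = match.group(1).lower()
    return "bot" if any(sig in ua for sig in BOT_SIGNATURES) else "human"

if __name__ == "__main__":
    counts = Counter()
    with open("access.log") as log:  # hypothetical log path
        for line in log:
            counts[classify(line)] += 1
    total = sum(counts.values()) or 1
    for label, n in counts.most_common():
        print(f"{label:>8}: {n:>8} ({n / total:.1%})")
```

Note that this only catches bots that announce themselves; scrapers that spoof a browser User-Agent need other signals, such as request rate or IP reputation.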
We approached the topic through the lens of a mysterious site that served an exact mirror of The Texas Tribune for several weeks, straining our servers and skewing our analytics. In our effort to identify the source and method behind this mystery site, we enlisted the aid of investigative reporters, business staff, and systems engineers. Along the way we learned a great deal about bots, and how much they have changed and multiplied in recent years. We believe that closer tracking of bot users could lead to new stories and insights for data and tech reporting.