When the data set you need doesn't exist: creative ways to find and collect data from #NICAR14

by Maite Fernández
Mar 6, 2014 in Data Journalism

After a source complained to the San Jose Mercury News that an elite group of criminal court judges was spending Fridays golfing instead of in the courtroom, the paper assigned a team of four reporters and a photographer to the story.

At the Santa Clara County courthouse, they noticed reserved parking spaces with the names of the judges. They wrote down the license-plate numbers and tracked the cars. They also got the judges' complete vacation schedule, as well as their golfing records, including their handicaps.

After a five-month investigation, they found that on Fridays the judges, who were avid golfers, spent most of the day on the links, which contributed to a backlog of cases in the courthouse. After they published their investigation ("The Judge Club"), the Friday golf days came to an end.

Sometimes the data you need to report a story just aren't there, either because they don’t exist or aren’t being collected properly. What’s a reporter to do?

You can start by collecting data in creative ways. A panel led by New York Times reporter Sarah Cohen at this year’s National Institute for Computer-Assisted Reporting (NICAR) conference in Baltimore shared a few ideas.

The Wall Street Journal recently published “A tale of two prices,” an investigation into how the office supply company Staples varies the prices of some of its items depending on people’s IP addresses and, therefore, ZIP codes. For that, the newspaper tested the prices of items a number of times using different ZIP codes.

The WSJ found that Staples charged customers in some ZIP codes US$14.29 for a basic Swingline stapler, while others paid as much as US$15.79.
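For readers who want to try a similar comparison on prices they have logged themselves, here is a minimal sketch in Python. It is not the WSJ's method or code: the ZIP codes are placeholders, the function name is invented for illustration, and only the two prices come from the reporting described above. It assumes the prices were recorded by hand (or by a script the site's terms allow) as rows of ZIP code, item and price.

```python
from collections import defaultdict

# Hypothetical log of observed prices: (zip_code, item, price_usd).
# The ZIP codes are placeholders; the two prices are the ones quoted above.
observations = [
    ("00001", "Swingline stapler", 14.29),
    ("00001", "Swingline stapler", 14.29),
    ("00002", "Swingline stapler", 15.79),
    ("00002", "Swingline stapler", 15.79),
]

def price_spread(rows):
    """Return {item: (lowest, highest, difference)} across all ZIP codes."""
    by_item = defaultdict(list)
    for _zip_code, item, price in rows:
        by_item[item].append(price)
    return {
        item: (min(prices), max(prices), round(max(prices) - min(prices), 2))
        for item, prices in by_item.items()
    }

spread = price_spread(observations)
for item, (low, high, diff) in spread.items():
    print(f"{item}: ${low:.2f} to ${high:.2f} (difference ${diff:.2f})")
```

Repeating each test several times per ZIP code, as the newspaper did, helps separate a consistent price gap from a one-off glitch.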

For that investigation, WSJ reporters carefully read the Terms of Service on the Staples website and took screen grabs in case the company decided to change them. They also consulted with their news organization’s ethics editor to make sure they weren’t engaging in ethically or legally questionable tactics.

For its project “Dollar Politics,” National Public Radio called on the wisdom of the crowd.

During the 2009 session of the Senate Committee on Health, Education, Labor and Pensions discussing the new healthcare law, NPR noticed a horde of lobbyists present after the sessions. For this piece, the NPR reporters wanted to know who the attendees were and what influence they had. So they turned the camera around, took photos of the lobbyists and asked the crowd to help identify them. Once an attendee was identified, NPR labeled the photo with the person’s name and job title.

Building a database from scratch

“Behind the bloodshed,” a USA Today investigation into mass killings in the U.S., started basically from scratch, since there wasn’t a dataset that centralized all the cases.

The database team started with FBI data, but they quickly noticed that the data set was not comprehensive. (By the end of the investigation, they had found that the FBI data had a 61 percent accuracy rate because of erroneous cases and missing incidents.) Some states, like Florida, don’t share their data with the public, while others took their time sharing it.

After that, they compiled data they found by searching Google News and LexisNexis. Using the FBI’s definition of a mass killing (four or more people killed), they ended up finding 236 incidents since 2006.

“Building your own database takes serious time and effort,” said panelist and USA Today data journalist Meghan Hoyer. But it pays off, she believes, since it allows reporters to design the database the way they want it. “You are the captain of your own ship.”

Hoyer’s experience building databases from scratch taught her a few lessons. She recommends that a reporter plan ahead; define the topic; bounce ideas off a partner; and introduce an error-checking system.
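What an error-checking system looks like will depend on the project. As one possible illustration, here is a minimal Python sketch, not USA Today's actual system. The field names, the sample records and the specific checks are assumptions made for the example; the only rule taken from the reporting above is the four-victim threshold.

```python
from datetime import date

MIN_VICTIMS = 4  # the FBI definition of a mass killing cited above

def check_record(rec):
    """Return a list of problems found in one incident record."""
    problems = []
    if rec.get("victims", 0) < MIN_VICTIMS:
        problems.append("fewer than four victims")
    try:
        date.fromisoformat(str(rec.get("date") or ""))
    except ValueError:
        problems.append("bad or missing date")
    if not rec.get("sources"):
        problems.append("no source listed")
    return problems

def find_duplicates(records):
    """Return index pairs of records that share a date and city."""
    seen = {}
    dupes = []
    for i, rec in enumerate(records):
        key = (rec.get("date"), str(rec.get("city", "")).strip().lower())
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes

# Hypothetical records, invented purely to exercise the checks.
records = [
    {"date": "2010-01-15", "city": "Example City", "victims": 4, "sources": ["news clip"]},
    {"date": "2010-01-15", "city": "example city ", "victims": 4, "sources": ["police report"]},
    {"date": "15/01/2010", "city": "Sampletown", "victims": 3, "sources": []},
]

for i, rec in enumerate(records):
    print(i, check_record(rec))
print("possible duplicates:", find_duplicates(records))
```

Flagged records still need a human to look at them; the point of a script like this is to decide where to look first.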

Release the drones

And, if the data don't exist, you can always use hardware to collect your own. Newsrooms across the world (like WNYC in New York and El Salvador’s La Prensa Gráfica) have been increasingly experimenting with drones and sensors to collect specific data.

Panelist Matt Waite, aka the “drone king,” showcased a few experiments he’s been doing at the Drone Journalism Lab at the University of Nebraska-Lincoln's College of Journalism and Mass Communications. Since depending on the government to release data can limit journalists, Waite recommended that reporters collect their own.

Waite has been experimenting with drones for reporting for some time now, using them to cover the Nebraska drought of 2012, for example.

He recently experimented with Arduino sensors plugged into microphones, which he used to detect the noise levels of the journalism school lobby at the University of Nebraska. Such sensors could be used to gather noise levels of an entire neighborhood or city, he said, while drones can be used for digital mapping or, with a few tweaks, as infrared cameras.
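To give a sense of how raw microphone readings might become a noise level, here is a minimal Python sketch. It is not Waite's code. It assumes the sensor delivers 10-bit readings (0 to 1023) that rest around 512 when the room is silent, and it reports loudness relative to the sensor's maximum rather than calibrated decibels; the function names and sample values are made up for the example.

```python
import math

def rms(samples, midpoint=512):
    """Root-mean-square amplitude of readings around the silent midpoint."""
    centered = [s - midpoint for s in samples]
    return math.sqrt(sum(c * c for c in centered) / len(centered))

def level_db(samples, midpoint=512, full_scale=512):
    """Loudness in decibels relative to the largest swing the sensor can report."""
    value = rms(samples, midpoint)
    if value == 0:
        return float("-inf")  # perfectly silent window
    return 20 * math.log10(value / full_scale)

# Hypothetical readings: a signal swinging halfway to the sensor's limit.
window = [512 + 256, 512 - 256] * 4
print(f"{level_db(window):.1f} dB relative to full scale")
```

Turning these relative numbers into real-world decibel readings would require calibrating the microphone against a reference sound level meter.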

Image credits: First image: Screengrab from the WSJ's "A Tale of Two Prices."

Maite Fernández is IJNet’s managing editor. She is bilingual in English and Spanish and has an M.J. in multimedia journalism from the University of Maryland.