This report summarizes notes from the first workshop of the School of Data Journalism, organized by the European Journalism Centre, the Open Knowledge Foundation and the International Journalism Festival. The session was led by Steve Doig, Knight Chair in Journalism, who specializes in computer-assisted reporting: the use of computers and social science techniques to help journalists do their jobs better.
You can download Steve’s presentation here.
Why do data journalism at all?
Steve Doig believes that data journalism allows journalists to go beyond anecdotes and base their stories on facts and evidence. With data in hand, a reporter can pick out the strongest examples, the ones that best illustrate the story.
So, how can journalists find data story ideas? First, try to look at topics you already report on such as sports, elections, disasters, crime investigations, money flows, etc. Almost all subjects that journalists typically cover produce data which can be analyzed. Other places to get ideas for data journalism stories include:
- Seeing what other journalists are doing. If something is going on in one city, chances are it’s happening in your city too
- Taking a look at featured projects on DataDrivenJournalism.net
- Checking out IRE's Extra Extra feed
- Following The Guardian Datablog
- Reading documents produced by government agencies and academics who collect large amounts of data. Pay attention to footnotes and bibliographies, which can lead to interesting data sources!
How do you get from an idea to a story?
Work backwards from your idea:
Think of the statements you want to make
Start with a hypothesis such as "crime is getting worse in my area." For this hypothesis, you might want to make statements such as: crime has increased by x amount, the amount of crime per 1000 people in such and such city is the greatest in our area, etc.
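The two example statements above come down to simple arithmetic: a percent change and a rate per 1,000 residents. Here is a minimal sketch in Python; the function names and all the figures are invented for illustration and do not come from the workshop.

```python
def percent_change(old: float, new: float) -> float:
    """Percent change from an earlier value to a later one."""
    return (new - old) / old * 100


def rate_per_1000(crimes: int, population: int) -> float:
    """Number of crimes per 1,000 residents."""
    return crimes / population * 1000


# Hypothetical figures, for illustration only
crimes_last_year = 400
crimes_this_year = 460
population = 50_000

change = percent_change(crimes_last_year, crimes_this_year)
rate = rate_per_1000(crimes_this_year, population)

print(f"Crime changed by {change:.1f}% compared with last year")
print(f"That is {rate:.1f} crimes per 1,000 residents")
```

Using a rate rather than a raw count is what makes it fair to compare a large city with a small one.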
Think of what variables you need to make the statements
Think in terms of a table of information (columns are variables, and rows are the individual data points).
There are two kinds of variables:
- Categorical variables: labels, such as gender, type of crime or zip code.
- Numerical variables: counts, such as the number of crimes, accidents or arrests.
An example of these variables working together would be: Type of crime, population of the places where crime is happening, date of crime, time, location, number of victims, was an arrest made (y/n?).
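One way to picture that table is as a list of records, where each record is a row and each field is a variable. The sketch below is only an illustration: the `CrimeRecord` name, its fields and the sample rows are all hypothetical. The population of each place would normally live in a separate table keyed by location, so it is left out here.

```python
from dataclasses import dataclass
from datetime import date, time


@dataclass
class CrimeRecord:
    """One row of the table: a single reported crime."""
    crime_type: str    # categorical: a label such as "burglary"
    location: str      # categorical: a zip code or neighborhood
    crime_date: date   # date of the crime
    crime_time: time   # time of the crime
    victims: int       # numerical: a count
    arrest_made: bool  # categorical: yes or no


# Hypothetical rows, invented for illustration only
records = [
    CrimeRecord("burglary", "85004", date(2013, 3, 1), time(22, 15), 1, False),
    CrimeRecord("assault", "85004", date(2013, 3, 2), time(1, 40), 2, True),
    CrimeRecord("burglary", "85281", date(2013, 3, 5), time(14, 5), 1, True),
]

# Categorical variables are used to group or filter rows
burglaries = [r for r in records if r.crime_type == "burglary"]

# Numerical variables are used to count, sum and average
total_victims = sum(r.victims for r in records)
arrest_share = sum(r.arrest_made for r in records) / len(records)

print(len(burglaries), total_victims, f"{arrest_share:.0%}")
```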
Think who collects the data
Once we know our variables, check who collects them. Agencies and organizations such as government, corporations, etc. are collecting lots of information, so we don’t have to collect the data ourselves most of the time.
Get the data from there
Then we face the problem of getting the data. In the U.S., there are relatively strong public record laws. In Europe as well, most countries have Freedom of Information laws, or an official way to request data from public agencies.
Do not be intimidated by different formats. Know how you will want to work with the data, for example in Excel. You don’t need to get the data in .xls format, because you can use programs to translate data from one format to another. Find a data nerd who can help you! One place to find good nerds is on forums or email lists, for example:
- NICAR-L, a mailing list in the United States where data journalists talk to each other
- School of Data
One format you should try to avoid getting is PDF, because it does not convert well to other formats. If you are only able to obtain a PDF, there are tools, such as Tabula, that can extract its tables into other formats.
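As an illustration of translating data from one format to another, here is a minimal sketch that turns tab-delimited text into CSV, which any spreadsheet program can open. It uses only Python's standard `csv` module; the `tsv_to_csv` helper and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited data, as an agency might send it
tsv_text = "city\tcrimes\nSpringfield\t460\nShelbyville\t120\n"


def tsv_to_csv(text: str) -> str:
    """Convert tab-delimited text to comma-separated text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)  # copy every row, changing only the delimiter
    return out.getvalue()


csv_text = tsv_to_csv(tsv_text)
print(csv_text)
```

With real files, the same reader and writer would wrap file objects opened with `open(..., newline="")` instead of the in-memory strings used here.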
Clean the data
Data is sometimes messy. A classic example is campaign finance information that has been typed in by volunteers: the names of the cities are often misspelled! In this case, you need to find all the misspelled city names and correct them so that you can say, for example, how much was collected from a single city. People who collect data are often doing so for bureaucratic purposes, and it does not really matter to them how clean the data is. People who use data for analysis require more precision, and thus must clean the data. Some tools for data cleaning include:
- OpenRefine
- Notepad++ or other good text editors (for features such as search and replace)
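The city-name problem can also be tackled with a few lines of code. The sketch below is one possible approach, not the method shown in the workshop: it matches each typed name against a list of known cities using `difflib.get_close_matches` from Python's standard library, then totals the donations per city. The city list, the donation amounts, the `clean_city` helper and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib
from collections import defaultdict

# Hypothetical reference list of correctly spelled cities
KNOWN_CITIES = ["Phoenix", "Tucson", "Flagstaff"]

# Hypothetical donations as typed in by volunteers: (city, amount)
donations = [
    ("Phoenix", 100),
    ("Pheonix", 50),
    ("phoenix ", 25),
    ("Tuscon", 40),
    ("Tucson", 60),
    ("Flagstaf", 10),
]


def clean_city(raw: str, known=KNOWN_CITIES, cutoff: float = 0.8) -> str:
    """Return the closest known city name, or the tidied input if none is close."""
    tidy = raw.strip().title()  # fix stray spaces and capitalization first
    matches = difflib.get_close_matches(tidy, known, n=1, cutoff=cutoff)
    return matches[0] if matches else tidy


totals = defaultdict(int)
for city, amount in donations:
    totals[clean_city(city)] += amount

for city, total in sorted(totals.items()):
    print(city, total)
```

Fuzzy matching can merge names that are genuinely different places, so it is worth checking the corrections by hand before publishing any totals.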
Once you have clean data, what do you do with it?
Look for patterns! Highs, lows, maximums, minimums, averages, etc. Get a sense of the shape of the data and look for outliers: anything in your data that seems weird and stands out. Remember that many stories have been discovered through simple steps such as sorting. Some tools that may help:
- Simple spreadsheet features such as sort, filter, formulas and pivot tables
- Your brain: a little math and statistics helps, but most of it is about as simple as 1+1 = 2! A resource for math is: http://t.co/CaZg5qS0jM
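The same sort, filter and pivot operations that a spreadsheet offers can be sketched in a few lines of Python using only the standard library. The rows below are invented for illustration, and the field names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical rows: reported crimes by neighborhood and type
rows = [
    {"area": "Northside", "type": "theft", "count": 120},
    {"area": "Northside", "type": "assault", "count": 30},
    {"area": "Southside", "type": "theft", "count": 80},
    {"area": "Southside", "type": "assault", "count": 25},
    {"area": "Downtown", "type": "theft", "count": 400},
    {"area": "Downtown", "type": "assault", "count": 45},
]

# Sort: largest counts first, so extremes float to the top
by_count = sorted(rows, key=lambda r: r["count"], reverse=True)

# Filter: keep only one category
thefts = [r for r in rows if r["type"] == "theft"]

# Pivot: total count per area
totals_by_area = Counter()
for r in rows:
    totals_by_area[r["area"]] += r["count"]

# Shape of the data: highs, lows and averages
counts = [r["count"] for r in rows]
summary = {
    "max": max(counts),
    "min": min(counts),
    "mean": mean(counts),
    "median": median(counts),
}

print(by_count[0])
print(dict(totals_by_area))
print(summary)
```

A mean far above the median, as in this made-up data, is a quick hint that one or two rows are outliers worth a closer look.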
Finally, it’s important to remember that data journalism stories are best accomplished in teams. There are many roles to cover, including: reporters, editors, graphic artists, photographers, videographers, page designers, web designers, app developers, etc.
Antoine Laurent is a strategist and senior project manager at the European Journalism Centre. Formerly, he was the deputy director of the Global Editors Network and founder and director of the Data Journalism Awards, the Editors' Lab Hackdays and Startups for News.
_This post originally appeared on DataDrivenJournalism.net and is posted on IJNet with permission under a Creative Commons Attribution-NonCommercial license. Created by the European Journalism Centre, this data journalism initiative is aimed at enabling more journalists around the world to use data to improve reportage._
Image CC-licensed on Flickr via Intel Free Press.