So after months of phone calls, emails and a lot of patience, you finally got your data set from a government agency.
Before daydreaming about how you’re going to win a Pulitzer with the amazing stories you’ll write, you’ll notice the data set is messy and hard to figure out.
Google Refine allows you to clean the data before you actually start working on it and drawing conclusions.
First you need to download Google Refine. As explained in this tutorial, although you use it through your web browser, Refine is a desktop application so you don’t have to worry about uploading sensitive data to the web. The program is compatible with Windows, Mac and Linux and, best of all, it’s free.
After you give your project a name and hit the “create project” button, you’ll see the data displayed in Google Refine.
Now you’re ready to start working on your project. However, you’ll notice in your data set that the information can be formatted differently. So, for example, a date can be entered in different ways: Oct. 6, 2011, 10/06/2011, etc. This can also happen with numbers and text fields (D.C. can be entered as DC, District of Columbia, etc.).
The problem is that if you do searches to find patterns in the data and the information is not formatted in the same way, your findings will leave entries out, yielding unreliable results that can’t be used in a story.
You can fix this by using the text or numeric facet feature in Google Refine, which combines identical cells and shows you how many times they are repeated. For example, it can show how many fields are entered as DC and how many as District of Columbia. (You can learn more about this feature in this video tutorial). This will help you find out if there are fields that could refer to the same thing but are entered with different names. For example, in this altered data set about U.S. economic assistance around the world, the data referring to Cameroon is entered in two different ways. You can also edit the field names right there to fix the problem.
Another interesting related feature is “clustering,” which tries to find groups of different cell values that might refer to the same thing. You can read more about clustering and how it works here.
Google Refine also gives you the possibility to eliminate white space to make the data more uniform. You might not notice if a name has an extra white space but it’ll be problematic when you analyze the data. You can eliminate this by using the “trim leading and trailing white space” option, as explained in this guide created by Paul Bradshaw.
Also, if you make any mistakes in your edits, you can undo them easily. Google Refine tracks every change you made to your data set so it’s easy to change it back.
You can download Refine here, and there are several helpful tutorials that will get you started. Refine also includes a few sample data sets to try the program.
There's also a very useful guide on how ProPublica used it in their Pulitzer Prize winning piece “Dollars for Docs. How Industry Dollars Reach Your Doctors” and you can check out Bradshaw’s guide here.