How journalists can find, extract and use open data

by Juan Manuel Casanueva
Oct 30, 2018 in Data Journalism

Ever since information became readily available on the web, transparency advocates have pushed for public open-data formats. Groups like ILDA in Latin America and the international OGP Open Data working group strive to not only research and analyze the openness of public data, but also assess how infomediaries are using open data.

Journalists, civic hackers, academics and civil society organizations are some of the most active open-data users, transforming data into consumable information for the public. Reporters in Latin America have been especially creative in reaching different data sources for their research and stories, but since the open-data movement is relatively young in the region, finding clean, usable and available data sets can still be a challenge.

As an ICFJ Knight Fellow and co-promoter of Escuela de Datos and open data communities in Latin America, I have seen firsthand the needs journalists have, and the tricks they rely on, while searching for and using data. I led a workshop on this topic at Media Party Miami, a two-day event in Florida organized by ICFJ Knight Fellow Mariano Blejman that brought together journalists, hackers, academics and students invested in media innovation across the U.S. and Latin America. Here’s a recap:

Where is the data?

- Very few countries and cities in Latin America have open-data portals (see Chile and Buenos Aires as top regional references). So, unlike in the U.S. or some European countries, anyone who needs open data will have to cherry-pick it from incipient government or civil society portals. In some countries, such as Mexico and Peru, journalists, civic hackers and civil society organizations have freed much of the data and made it available in citizen portals.

- The open-data movement has proven that data is everywhere and team efforts can liberate key data for cities or countries. So if data is available but has not been opened, journalists can always follow La Nación Data’s collaborative project Voz Data as a guide.

- But do remember we’re living in an information age, and gathering that info is becoming easier every day with the use of mobile apps, wearables and data collection programs from a wide variety of sources, ranging from one’s vital signs to social media streams.

How can I extract and clean data?

- If data is available, it will most probably be in closed or semi-closed formats, such as PDF. In such cases, it's very important that journalists develop scraping skills and become savvy with different tools for importing data from websites, PDFs and scanned documents. A list of tutorials and tools is available at

- Cleaning and standardizing data is another basic skill journalists need. Tools like OpenRefine or even smart use of spreadsheets can help you get rid of duplicate records, merge variables and combine datasets.
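
For readers who prefer code to OpenRefine or spreadsheets, the cleaning steps mentioned above (standardizing values, removing duplicates and combining datasets) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the two CSV samples, their column names and all figures are invented for the example, and the helper functions are hypothetical names, not part of any tool cited in this article.

```python
import csv
import io

# Hypothetical sample data, invented for illustration (not real figures).
BUDGETS_CSV = """city,budget
 Buenos Aires ,100
buenos aires,100
Santiago,80
"""

POPULATION_CSV = """city,population
Buenos Aires,3000000
Santiago,5000000
"""


def normalize(name):
    """Trim and collapse whitespace, then apply title case."""
    return " ".join(name.split()).title()


def load_rows(text):
    """Read CSV text into a list of dicts with a standardized 'city' value."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["city"] = normalize(row["city"])
        rows.append(row)
    return rows


def drop_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def combine(left, right, key):
    """Join two row lists on a shared column, keeping only matching rows."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


budgets = drop_duplicates(load_rows(BUDGETS_CSV))
population = load_rows(POPULATION_CSV)
combined = combine(budgets, population, "city")

for row in combined:
    print(row)
```

Standardizing the city names first is what lets the two "Buenos Aires" rows be recognized as duplicates and matched against the second dataset. Real data usually needs more careful rules (accents, abbreviations, misspellings), which is where a tool like OpenRefine helps.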

How can I use open data?

- Data is primarily used for analysis, but approaches to data analysis vary. For instance, more narrative users tend to set out to prove a fixed hypothesis (commonly a news or story lead) by analyzing the data. This approach can be very effective if the journalist’s overview of the data is accurate and there is a high probability that they will find the answers they need in the data set. On the other hand, more analytical users (coders or data scientists) take a more agnostic approach to data. They analyze all variables, continually forming and testing hypotheses that the data itself suggests.

- Data analysis can be challenging, but processing and investigative challenges are much better addressed by a team that combines narrators and techies. You can see a product of team collaboration in this La Nación story.

- Not all Latin American journalists agree on how data is used in storytelling. But without heading into data-viz debates, an approach that has been helpful to some journalists is to see data as the source for arguments and story milestones. Many of the conclusions that data brings should be used to strengthen key parts of the story, rather than dictate the story.
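
The contrast between the two analysis styles described above, testing one hypothesis fixed in advance versus summarizing every variable and seeing what stands out, can be illustrated with a small Python sketch. The records, field names and the hypothesis itself are all invented for the example; this is an assumption-laden illustration, not a description of how any newsroom mentioned here works.

```python
from statistics import mean

# Hypothetical records, invented for illustration only.
records = [
    {"district": "North", "spending": 120, "schools": 10},
    {"district": "South", "spending": 80, "schools": 12},
    {"district": "East", "spending": 100, "schools": 8},
]


def north_spends_more(rows):
    """Narrative approach: check a single claim decided before looking."""
    north = next(r for r in rows if r["district"] == "North")
    others = [r["spending"] for r in rows if r["district"] != "North"]
    return north["spending"] > mean(others)


def summarize_all(rows):
    """Agnostic approach: summarize every numeric variable in the data."""
    numeric = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    return {
        k: {
            "min": min(r[k] for r in rows),
            "max": max(r[k] for r in rows),
            "mean": mean(r[k] for r in rows),
        }
        for k in numeric
    }


print(north_spends_more(records))
print(summarize_all(records))
```

The first function answers only the question the reporter already had in mind. The second surfaces a summary of every column, which may suggest questions nobody thought to ask.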

Image CC-licensed courtesy of