Big data is for machines: how to use small data sets for impactful stories

by Juan Pablo Marín Díaz
Oct 30, 2018 in Data Journalism

Updated at 2:58 p.m. on March 15, 2018

There is a lot of hype around big data in every industry, and journalism is not an exception. The Panama Papers and the subsequent Pulitzer Prize awarded to the team behind the project marked a milestone that proved how technology, collaboration and data can create impactful stories.

At Datasketch, we are helping journalists make sense of data by providing them with easy-to-use tools so they can improve their data-driven storytelling.

One of the first challenges to tackle is demystifying big data and what it means in the context of data-driven investigations.

In the specific case of the Panama Papers, the total amount of data leaked was around 2.6 terabytes (TB). However, only 22 megabytes (MB) made it to the final database that was used in most publications.

To put this into perspective, let's imagine that 1MB is worth a penny; 1TB would be the equivalent of US$10,000. Therefore, of the US$26,000 of information available, only 22 cents' worth was actually published in the database. In other words, almost none of the leaked data made it into the published database.
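The penny analogy is easy to check with a few lines of Python. This is only a sketch of the arithmetic: it assumes decimal units (1TB = 1,000,000MB) and rounds the sizes to the figures quoted above. The names `value_in_dollars`, `leaked_mb` and `published_mb` are made up for this illustration.

```python
# Decimal units, as in the analogy: 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000
CENTS_PER_MB = 1  # the analogy's assumption: 1 MB is worth one penny


def value_in_dollars(megabytes):
    """Convert a data size in MB to dollars at one cent per MB."""
    return megabytes * CENTS_PER_MB / 100


leaked_mb = 2_600_000  # about 2.6 TB leaked
published_mb = 22      # about 22 MB in the final database

one_tb_value = value_in_dollars(MB_PER_TB)      # value of a single terabyte
leaked_value = value_in_dollars(leaked_mb)      # value of the whole leak
published_value = value_in_dollars(published_mb)
share_published = published_mb / leaked_mb      # fraction of the leak published

print(f"1 TB is worth US${one_tb_value:,.2f}")
print(f"Leaked: US${leaked_value:,.2f}, published: US${published_value:.2f}")
print(f"Share published: {share_published:.4%}")
```

Under these assumptions, the published database comes to less than a thousandth of one percent of the leak.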

Unraveling truths requires pulling together multiple sources of information and organizing them into smaller chunks that comprise a story. For any data-driven piece, each source could be a path to explore a unique story. This is why, even though journalism has indeed benefited from big data analysis tools, it is still rather difficult to use technology to support scalable data journalism.

The advent of big small data

How big is big data? It depends on who you ask. Some say data is big if its size is larger than 1TB (the equivalent of 2 million photos).

I prefer to use another rule of thumb: "Big data is something that does not fit into a spreadsheet."
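That rule of thumb can be turned into a quick check. The sketch below counts the rows and columns of a CSV file against Microsoft Excel's worksheet limits (1,048,576 rows by 16,384 columns). Other spreadsheet tools have different limits, and the function name `fits_in_spreadsheet` and the sample data are invented for this example.

```python
import csv
import io

# Excel's worksheet limits; other spreadsheet tools differ.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_spreadsheet(lines, max_rows=EXCEL_MAX_ROWS, max_cols=EXCEL_MAX_COLS):
    """Return True if CSV text stays within the row and column limits.

    `lines` is any iterable of text lines, such as an open file.
    Reading stops as soon as a limit is exceeded.
    """
    row_count = 0
    for row in csv.reader(lines):
        row_count += 1
        if row_count > max_rows or len(row) > max_cols:
            return False
    return True


# Made-up sample data: a header plus two rows.
sample = io.StringIO("name,value\na,1\nb,2\n")
small_enough = fits_in_spreadsheet(sample)
print("Fits in a spreadsheet:", small_enough)
```

To check a real file, pass an open file object instead of the in-memory sample, for example `open("data.csv", newline="")`, where `data.csv` is a placeholder name.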

As noted above, even in a project as large as the Panama Papers, the final data, which was used to build the stories, was actually collections of small data files. Humans need to digest pieces of information that are accessible, aggregated and informative. No matter how big your data-driven journalism piece is, chances are that you will be using multiple, small data sets.

Let's not fool ourselves: big data is for machines. Rather than focusing on big data sets, we need to focus on becoming masters of using small data sets in journalism. Picture a couple of spreadsheets with, at most, a couple thousand rows of aggregated information. This school of thinking, that big data is for machines and small data is for people, was first sparked by Allen Bonde, VP of marketing at Repsly.

Many journalists still lack a way to easily collect and find small data sets, and a way to explore them and combine them into stories.

Collecting small data sets

One of my favorite places to find small data sets is data.world. The site uses semantic web technologies and community contributions to publish open data sets in different formats, together with visual tools to analyze them. Statista is another site that collects millions of statistics about different projects, and is particularly useful for visualizing market and business trends.

One innovative way to improve the collection of small data sets is to watch for data curated by citizens. This citizen-driven small data is very powerful not only as a data source, but also as a way to engage with readers and to find interesting topics for stories. There is an increasing number of citizens who are using social media to publish factual data on different topics that interest them.

Combining small data sets

In terms of combining small data sets into stories, the best way is to use point-and-click data visualization tools like Datawrapper or Flourish. Another tool, initially conceived as a way for scientists to share data and graphs, is Figshare, which now holds a lot of useful information for any researcher.
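For readers who would rather script the combining step before moving to a visualization tool, here is a minimal sketch of joining two small tables on a shared column, using only Python's standard library. The two tables, the `region` column and the helper names `read_rows` and `join_on` are all made up for illustration; they do not come from any real data set.

```python
import csv
import io

# Two tiny, made-up tables that share a "region" column.
population_csv = "region,population\nNorth,1200\nSouth,800\n"
reports_csv = "region,reports\nNorth,6\nSouth,2\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on(left, right, key):
    """Inner join: keep only rows whose key appears in both tables."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


combined = join_on(read_rows(population_csv), read_rows(reports_csv), "region")

# Derive a comparable rate from the two combined columns.
for row in combined:
    row["reports_per_1000"] = 1000 * int(row["reports"]) / int(row["population"])

for row in combined:
    print(row["region"], row["reports_per_1000"])
```

The combined table is still small enough to read in a spreadsheet, and the derived rate column is the kind of aggregated figure that can go straight into a chart.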

Feel free to explore more data journalism resources in our data journalism portal for Latin America, Datasketch.

Using small data for impactful stories

Mastering the use of data can open the door for new, innovative forms of journalism that create tangible results. Last year, journalists at Datasketch connected with a Twitter user who was collecting information on femicides, which led to a report about violence against women in Colombia.

Together we built the most complete database on femicides in Colombia using different sources, such as freedom of information requests (FOIR) filed through our platform QueremosDatos, custom-built data sets, online surveys and more.

The result helped shape the final report, in which we made over 30 small data sets available online. This work not only told a story of violence, but also pointed toward change: we staged a physical intervention built on the femicide data we had collected, pushing the Colombian government to act on the issue.

Main image CC-licensed by Pexels via Kevin Ku.