Platform helps journalists automatically analyze thousands of documents

By Mariano Blejman
Oct 30, 2018 in Miscellaneous

The holy grail of computer-assisted reporting is the systematic organization, cross-checking and extraction of unstructured data.

It's fairly easy to analyze data provided to us in a spreadsheet. But when we need to analyze thousands of documents, the analysis becomes more complex, even if it involves the same process of finding names, places, dates, organizations and verbs, and figuring out the relationships among them.

It might sound like mission impossible, but two years ago, at a Hacks/Hackers Buenos Aires event, Martín Sarsale and I began creating a way to more easily analyze data from many documents at once. We then built the team that would create the platform, which performs semantic analysis of documents, detects entities and lets users compare the resulting lists of entities.

The project began as Mapa76, an examination of data from the last Argentine military dictatorship, which was in power between 1976 and 1983. On March 24, 2012, the 36th anniversary of the military coup in the country, we released and analyzed data related to thousands of people who disappeared during the dictatorship.

The project appears as a case study in the Data Journalism Handbook. Out of Mapa76 grew the platform, which is now beginning to take on a life of its own.

The system detects the data points, lists them and shows the links among them. It currently analyzes documents in Spanish, but we plan to make it usable in several other languages during the coming months. The platform differs from most products that analyze data, which are complicated APIs intended for developers and not meant to be widely used by mere mortals. If DocumentCloud is a fantastic document manager and Overview knows how to convert data into points, this platform is a unique, usable product that makes documents “talk to each other” and finds concrete relationships among data points.

The version available on the website allows users to upload documents (PDF, TXT and DOC files) so the system can detect the data points. It also organizes them in a database and allows them to be downloaded. These databases can be combined with other applications to generate graphs, timelines, maps and other visuals.
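To make the idea concrete, here is a minimal Python sketch of that flow: detect entities in each document, count which entities appear together, and export the links as a CSV file that other tools could read. It is an illustration only, not the platform's code. The function names and the capitalized-phrase rule are assumptions made for this example, and a simple regular expression stands in for real linguistic analysis.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations

# Assumption for illustration: an "entity" is a run of two or more
# capitalized words. Real entity detection is far more sophisticated.
ENTITY_PATTERN = re.compile(r"[A-ZÁÉÍÓÚÑ]\w+(?:\s+[A-ZÁÉÍÓÚÑ]\w+)+")


def detect_entities(text):
    """Return the set of multi-word capitalized phrases found in the text."""
    return set(ENTITY_PATTERN.findall(text))


def link_entities(documents):
    """Count how many documents mention each pair of entities together."""
    links = Counter()
    for text in documents:
        for pair in combinations(sorted(detect_entities(text)), 2):
            links[pair] += 1
    return links


def links_to_csv(links):
    """Serialize the entity links as CSV text, one pair per row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["entity_a", "entity_b", "shared_documents"])
    for (entity_a, entity_b), count in sorted(links.items()):
        writer.writerow([entity_a, entity_b, count])
    return out.getvalue()


if __name__ == "__main__":
    sample_documents = [
        "Juan Pérez met Ana Gómez in Buenos Aires.",
        "Ana Gómez wrote to Juan Pérez.",
    ]
    print(links_to_csv(link_entities(sample_documents)))
```

Pairs that show up together in many documents are the ones a reporter would look at first. The CSV output can be loaded into a spreadsheet or a graphing tool to draw the network.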

The software is being developed with Ruby on Rails, Freeling, DocSplit, Resque, Elasticsearch and MongoDB. Our great development team is based in Buenos Aires and is joined by Marcos Vanetta, a Knight-Mozilla OpenNews fellow who just moved to Austin, Texas, to work for the Texas Tribune. The project has an ambitious goal: to develop a set of algorithms and functions that find events in the texts and relate them to other events of interest to the investigator. There are several related experiments, many of which you can read about on the amazing Untangled blog from the Knight Lab at Northwestern University.

The analysis of social media for power relations and networks of influence is an emerging trend within investigative journalism, but the media have not yet taken full advantage of it. Instead, the industry has so far focused its natural language processing research on gauging consumers' feelings, because brands are willing to pay to know what people think of them. Until now, no platform existed to automatically analyze facts. This platform is here to fill that gap.

This post was originally written in Spanish and translated into English by Andrea Arzaba.
