Updated Sept. 13, 2013, 11:42 a.m. EST.
But when data are trapped in a PDF, extracting and analyzing them so they can be presented in a user-friendly way for news consumers can be a headache.
Governments and organizations that release data in PDFs are “doing it out of either ignorance or malice,” Manuel Aristarán, a web developer and Knight-Mozilla OpenNews Fellow in Argentina, told IJNet during the recent Hacks/Hackers Buenos Aires Media Party. Because data in PDF files can’t easily be copied and pasted, he notes, even the most exciting data are often left unused.
To fix this problem, Aristarán began to develop a web app late last year that extracts data from tables in PDF files. Soon after, he linked up with nonprofit news organization ProPublica, whose developers had created their own internal system for extracting data trapped in PDF files.
The result is Tabula, which lets users upload a (text-based) PDF file into a simple web interface and then pull tabular data into CSV format for use. Directions for how developers can use and run Tabula are available on GitHub. The app is free and available under the open-source MIT License.
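Once a table has been pulled out as CSV, it can be read with ordinary tools. The following is a minimal Python sketch, not part of Tabula itself, that parses a small exported table using only the standard library. The sample data and the column names `offense` and `count` are hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical sample of what an exported CSV might contain.
# These column names and values are illustrative, not real data.
SAMPLE_CSV = """offense,count
Burglary,12
Robbery,7
"""


def read_table(text):
    """Parse CSV text into a list of dicts, converting 'count' to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["count"] = int(row["count"])
        rows.append(row)
    return rows


def total_count(rows):
    """Sum the 'count' column across all parsed rows."""
    return sum(row["count"] for row in rows)
```

Calling `total_count(read_table(SAMPLE_CSV))` adds up the numeric column, which is the kind of quick check that is impractical while the figures are still locked in a PDF.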
The Minneapolis Police Department announced this month that it would no longer publish crime data in Excel files, which make for simpler data extraction. Instead, it would make them available only in PDF form. Journalists at the MinnPost used Tabula to extract the data. The news organization was then able to update its Minneapolis crime app with the new info. (The police department later decided to go back to releasing crime data in the more accessible Excel format.)
The New York-based ProPublica is using Tabula to power the database behind its Dollars for Docs project. The project draws on pharmaceutical companies' required disclosures of their payments to doctors, other medical providers and healthcare institutions. The data are compiled so that patients can search for their physician or medical center and receive a listing of all matching payments. This lets patients see, for example, how much money their doctors are paid for speaking engagements or conference participation.
Each company publishes the data differently, and most of them publish in PDF files. Some companies report on a monthly basis, others report quarterly, and still others on a rolling schedule. For instance, in the third quarter of 2010, the firm Merck filed an 86-page report representing thousands of rows. And that is just one of hundreds -- if not thousands -- of files that may need to be processed to get a full picture of the data.
ProPublica uses tabula-extractor -- the "engine" behind Tabula -- plus some of its own internal code to process these files and import the data into the database that powers the Dollars for Docs site.
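To give a rough sense of what such an import step can involve, here is a minimal Python sketch that loads a folder of already-extracted CSV files into a SQLite database and looks up payments by recipient. It is an illustration only and is not ProPublica's code: the table layout, the column names (`company`, `recipient`, `amount`) and every function name below are assumptions made for this example.

```python
import csv
import sqlite3
from pathlib import Path


def create_schema(conn):
    """Create the hypothetical payments table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "company TEXT, recipient TEXT, amount REAL)"
    )


def import_csv_file(conn, path):
    """Load one CSV file into the payments table; return rows inserted."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [
            (r["company"], r["recipient"], float(r["amount"]))
            for r in csv.DictReader(f)
        ]
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    return len(rows)


def import_directory(conn, directory):
    """Import every *.csv file in a directory; return total rows inserted."""
    create_schema(conn)
    total = 0
    for path in sorted(Path(directory).glob("*.csv")):
        total += import_csv_file(conn, path)
    conn.commit()
    return total


def payments_for(conn, recipient):
    """Return (company, amount) pairs for an exact recipient name."""
    cur = conn.execute(
        "SELECT company, amount FROM payments "
        "WHERE recipient = ? ORDER BY company",
        (recipient,),
    )
    return cur.fetchall()
```

With a connection from `sqlite3.connect(...)`, `import_directory(conn, "some_folder")` would load each file in turn, and `payments_for(conn, name)` would list the matching payments. A real pipeline would also need to reconcile the different layouts and reporting schedules each company uses, which this sketch does not attempt.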
“Without Tabula, I'm fairly sure a project like Dollars for Docs would simply be impossible,” says Mike Tigas, Knight-Mozilla OpenNews Fellow at ProPublica. “The effort and man-hours required to pull that much data out of PDFs just wouldn't make sense to try.”
Image courtesy of Flickr user Sybren A. Stüvel under a Creative Commons license.