How to search the deep web: data extraction
Knowing how to find data is becoming an increasingly valuable skill in journalism.
In the first part of this series, we learned how to search the deep web through advanced searches. Here, you’ll find techniques and tools for searching and retrieving data.
The simplest example of retrieving data is to extract the contents of a table from a PDF file and import it into an Excel spreadsheet. There are a lot of paid options for this, but you can also try converters like Zamzar.com, which is free and requires no subscription.
Remember that tables and graphics are often uploaded to the web in image format, so your data hunt should include searches on platforms such as Flickr or Google Images. Optical character recognition software is a big help; a simple, free one is Free Ocr.
• Other Google tools:
Explore Google Public Data.
Similarly, Google Books and Google Blogs are useful because they let you filter results by date. Example: this post published on SoloLocal was based on a Google Books search that used geographic positioning and a selected timeline, limiting the results to books published in the last three years.
• Try semantic web resources, such as Wolfram|Alpha.
• Use the free version of Copernic. This powerful search tool allows you to define searches by categories such as "U.S. government documents." (Warning: it works only on Windows.)
• Find data about your country outside your country. For example, the U.S. Census database contains updated information about U.S. imports from around the world. (The country list is somewhat buried, but you can find it here.) The data spans 2002 to 2011, which allows you to study variables over time and compare countries.
• Retrieve data that may have been deleted from the live version of a site but was "cached" or saved as a screenshot. Try the Internet Archive and its "Wayback Machine" feature.
• Drill down to the parent directory index of a site. For example, this link http://www.justiciachaco.gov.ar/listas/C_A_Civ_y_Com_Sala_II_Pro/Cam_Civ_Sala_II_Pro_2009-11-13.Txt can become this one: http://www.justiciachaco.gov.ar/listas/.
• Monitor social networks (shared documents, comments) using tools like SocialMention or 48ers, or search Twitter in real time with Twitterfall, which lets you run a geo-referenced search or search by name (the two more specific options) or by subject (less specific).
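Two of the tips above, walking up to a parent directory and checking the Internet Archive for a saved copy, can be sketched in a few lines of Python. This is only an illustration, not a finished tool: the function names are hypothetical, the Wayback Machine address and the shape of its JSON reply are assumptions based on the Internet Archive's public availability endpoint, and the code only builds addresses and reads a reply you have already fetched.

```python
from urllib.parse import urlsplit, urlunsplit, urlencode
import posixpath

# Assumed address of the Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def parent_directories(url):
    """Return the parent directory URLs of `url`, nearest first."""
    parts = urlsplit(url)
    path = parts.path
    parents = []
    # Strip one path segment at a time until only the site root is left.
    while path not in ("", "/"):
        path = posixpath.dirname(path.rstrip("/"))
        directory = path if path.endswith("/") else path + "/"
        parents.append(
            urlunsplit((parts.scheme, parts.netloc, directory, "", ""))
        )
    return parents


def wayback_query_url(url, timestamp=None):
    """Build a query asking the archive for the snapshot closest to a date.

    `timestamp` is an optional string such as "20091113" (YYYYMMDD).
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY + "?" + urlencode(params)


def closest_snapshot(reply):
    """Pick the archived URL out of an already-decoded JSON reply.

    Assumes the reply looks like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}.
    Returns None when no snapshot is reported.
    """
    closest = reply.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


if __name__ == "__main__":
    link = ("http://www.justiciachaco.gov.ar/listas/"
            "C_A_Civ_y_Com_Sala_II_Pro/Cam_Civ_Sala_II_Pro_2009-11-13.Txt")
    for directory in parent_directories(link):
        print(directory)
    print(wayback_query_url(link, "20091113"))
```

Run against the example link from the list, `parent_directories` yields the subfolder, then http://www.justiciachaco.gov.ar/listas/, then the site root, so you can try each one in a browser and see which of them exposes a file index.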
This is the second and final installment of a series of posts with advice for finding information on the web. The original version of this post appeared on IJNet Spanish. It was translated into English by Maite Fernandez.
Sandra Crucianelli is a Knight International Journalism Fellow, an investigative journalist and instructor, specializing in digital resources and data journalism. She is the founder and editor of Sololocal.info, an online magazine that provides hyperlocal news from Bahía Blanca City, Argentina.
Image: CC-licensed by altemark in Flickr.