The abrupt shutdown of the Gothamist and DNAinfo local news networks earlier this month was a stark reminder to digital journalists who want archives of their stuff: Back it up! Back it up!
It isn’t just that news apps and digital interactives won’t last forever; as my colleague Shan Wang wrote in September in a look at broader archiving efforts, “so many pioneering works of digital journalism no longer exist online, or exist only as a shadow of their former selves.” The problem is also that digital journalists, who will someday be looking for new jobs, will probably need to share samples of their previous work with prospective employers, and that’s tough to do if the site you were working for is gone. Even if you’re not job-hunting, you may want evidence, years down the line, that you, you know — produced something.
Luckily for journalists who haven’t had the foresight to be saving all along (i.e., most of us), a few solutions have emerged.
Save My News, launched this month by Ben Welsh, editor of the Data Desk team at the Los Angeles Times, lets journalists (about 300 of them so far) save their links to the Internet Archive and WebCite. You can download all of the clips and archive links as an Excel spreadsheet.
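For journalists comfortable with a little scripting, the same idea can be sketched in a few lines of Python. This is not Save My News’s code. It assumes the Wayback Machine’s public “Save Page Now” address (https://web.archive.org/save/ followed by the page URL), and the function names and spreadsheet columns are made up for illustration.

```python
import csv
import io
import urllib.request

# Assumption: the Wayback Machine's public "Save Page Now" address.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(clip_url):
    """Return the address that asks the Wayback Machine to capture clip_url."""
    return SAVE_ENDPOINT + clip_url.strip()


def request_capture(clip_url, timeout=60):
    """Ask the Wayback Machine to capture one clip (makes a network call).

    Returns the final URL after any redirects; may raise urllib.error.URLError.
    """
    request = urllib.request.Request(
        build_save_url(clip_url),
        headers={"User-Agent": "clip-backup-sketch"},  # hypothetical label
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def clips_to_csv(rows):
    """Turn (clip_url, archive_url) pairs into CSV text a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["clip_url", "archive_url"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `request_capture` on each clip and passing the results to `clips_to_csv` would produce a file that opens in Excel. A real script would also want to pause between requests and handle failed captures.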
Welsh created the tool on November 6, two days after DNAinfo’s shutdown. “On social media, I saw a lot of my peers panicking and outraged,” he said. “It seemed like a powerful opportunity to raise people’s awareness about the fragility of their work — that all the journalism they pour so much effort into could disappear from the internet, poof.” Save-your-work services aren’t new, Welsh pointed out. The Internet Archive has tools that let journalists preserve their work, but many people simply don’t know about them. “So many people who work professionally on the internet really don’t know, until too late, that their work is this fragile,” Welsh said. “And it doesn’t take a villainous owner to lose your legacy. It could just be a site redesign.”
Welsh plans to integrate more archiving services into Save My News, but he’s intent on keeping the service simple: “I’m hosting it for free on Heroku, and there’s not any business model attached to this. I’m trying to avoid broadening the scope too far.”
When it looked as if the sites might be gone for good and archives would have to be recompiled through services like the Internet Archive, Parker Higgins, director of special projects at the Freedom of the Press Foundation, put out a call on Twitter offering to help Gothamist and DNAinfo journalists preserve their work. He began writing code to do that, but then the sites came back online, and he was able to build a faster, more robust tool that could get people PDF archives of everything they’d published on those sites within a couple of hours.
That tool, “Gotham Grabber,” is now available as open source code on GitHub. “With a few alterations, many journalists can use this tool to create an archived version of their entire portfolio,” Higgins wrote in a Freedom of the Press blog post. That requires some coding knowledge, and “most of the journalists I’ve talked to who were working at these places aren’t coders,” Higgins told me. “But I hope to see people adapting this code to other sites.” He’s now scraped more than 50,000 articles, each as a PDF.
Gotham Grabber returns work as PDFs because most of the journalists who’d contacted Higgins wanted it that way. “When you have a portfolio and you’re sending an attachment for a new job, PDF is the preferred format. There are better ways to archive if you’re talking about archiving for readers or long-term storage, and ultimately, these pages are meant to be HTML and served from a database,” Higgins explained. “I hope this prompts people to think about longer-term storage and access. I mean, man, with some of these Gothamist and DNAinfo sites, in particular, these are the only records of local events in a lot of cases.”
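For anyone tempted to adapt a tool like this to another site, the first step is usually collecting the article links from an author page. Here is a minimal sketch of that step, using only Python’s standard library. It is not Gotham Grabber’s actual code, and the `/articles/` path filter and the class and function names are hypothetical; each site structures its links differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect absolute links whose href contains a given marker."""

    def __init__(self, base_url, marker="/articles/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical: varies from site to site
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # skip duplicates, keep page order
                self.links.append(absolute)


def extract_article_links(html, base_url, marker="/articles/"):
    """Return the article links found in one author page's HTML."""
    parser = ArticleLinkParser(base_url, marker)
    parser.feed(html)
    return parser.links
```

Turning each collected link into a PDF is a separate step, typically handled by a headless browser, and is not shown here.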
If you prefer to set it and forget it, there’s Authory, a Hamburg-based service that launched in beta last year with a grant from the Google Digital News Initiative and opened to the public over the summer. (Disclosure: I started using Authory for free in exchange for providing feedback when it was in beta.) Eric Hauch, its founder and CEO, worked for Axel Springer and Financial Times Deutschland and started thinking that it wasn’t easy enough to find out when his favorite journalists had published new stories.
When he started talking to other journalists about a tool that might help with this, “they told me they didn’t only have issues with updating their readers. They also had trouble keeping track of their articles themselves. It made a lot of sense to combine these new things.” After initial setup, Authory automatically backs up all of a journalist’s articles (no matter what site they’re published on) and also lets readers “subscribe” to journalists, so that they can receive email notifications when one publishes something new. (Muck Rack for Journalists performs a similar function on the notification side, but doesn’t back up journalists’ work. Byliner used to offer a similar journalist-following function.) “The idea of creating the backups for the journalists was basically something we added on top, but now it’s the center of what we do,” Hauch said.
A two-week trial is free, and after that the service is $7 a month or $70 a year. Authory is in very early stages; right now, it has under 1,000 active users, mostly in the U.S. and U.K. It can scrape content from sites with soft and metered paywalls, as well as some sites with hard paywalls, like The Wall Street Journal, when journalists log in. In the future, it will support more hard paywalls.
Users can access the full text of their articles on their Authory page, or email email@example.com to request the export of some or all of their articles as an XML or HTML file. Eventually, they’ll be able to download their archives with a single click and will also be able to download individual PDFs.
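A portable export of this kind doesn’t have to be elaborate. As a rough sketch of what a self-made XML archive could look like, using Python’s standard library: the element names and the dictionary keys below are invented for illustration and are not Authory’s export format.

```python
import xml.etree.ElementTree as ET


def articles_to_xml(articles):
    """Serialize a list of article dicts into one XML string.

    Each dict is assumed to have "url", "title", "published" and "body" keys.
    """
    root = ET.Element("articles")
    for article in articles:
        item = ET.SubElement(root, "article", url=article["url"])
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "published").text = article["published"]
        ET.SubElement(item, "body").text = article["body"]
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")
```

Because the library escapes characters like `&` and `<` on the way out, the resulting file can be read back by any XML parser, whatever happens to the service that produced the original pages.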
This means “you don’t need to rely on us to be around forever,” Hauch said. “Some people are a little bit scared that we might go bust, which we don’t intend to do at all.” But — well — this is the Internet, and so, you never know.
Main image CC-licensed by Marcin Wichary via Flickr.