08.08.2013 Views

Data tools tipsheet - Investigative Reporters and Editors


FREE AND CHEAP TOOLS FOR WRANGLING DATA

T. Christian Miller, ProPublica, @txtianmiller // Tyler Dukes, WRAL.com, @mtdukes

IRE 2013 // June 20, 2013 // San Antonio, Texas

Google Drive

http://drive.google.com

Price: Free

An inexpensive alternative to Microsoft Office hosted in the cloud, Google Drive offers its own versions of Word, Excel, PowerPoint and more with much of the same functionality. Documents can also be embedded to share with readers.

Cometdocs

http://cometdocs.com

Price: Free

This free, Web-based file converter is useful for transforming PDFs into more useful formats, like text or Excel tables. IRE members can get a free subscription to Cometdocs Premium, which expands its limits on document sizes and includes optical character recognition for scanned PDFs.

DocumentCloud

http://documentcloud.org

Price: Free

This Web-based application offers a suite of tools to organize, annotate and analyze documents as well as share them with both colleagues and audiences. Every document uploaded to DocumentCloud is automatically processed with optical character recognition, allowing for accurate searches, as well as entity extraction to pull out and track things like names and dates. To get a free DocumentCloud account, you must be a working journalist.

Fusion Tables

http://tables.googlelabs.com

Price: Free

A beta project from Google, Fusion Tables allows users to mash up their data with other existing tables, like maps of political boundaries or poverty levels, using simple joins. Users can map the results and embed them in their websites quickly and easily.
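For readers unfamiliar with joins, the idea can be sketched in a few lines of Python. This is only an illustration of joining two tables on a shared column, not how Fusion Tables works internally; the table contents, the column names and the `inner_join` helper are all made up for the example.

```python
# Hypothetical sample tables: every name and number here is invented.
poverty = [
    {"county": "Alpha", "poverty_rate": 12.5},
    {"county": "Beta", "poverty_rate": 18.0},
    {"county": "Gamma", "poverty_rate": 9.3},
]
boundaries = [
    {"county": "Alpha", "region_id": "R1"},
    {"county": "Beta", "region_id": "R2"},
]


def inner_join(left, right, key):
    """Combine rows from two tables wherever the key column matches."""
    # Index the right-hand table by key so each lookup is fast.
    index = {row[key]: row for row in right}
    # Keep only left-hand rows that have a partner in the right-hand table.
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


merged = inner_join(poverty, boundaries, "county")
for row in merged:
    print(row)
```

"Gamma" has no matching boundary row, so it drops out of the result. Unmatched rows like this are worth checking whenever two datasets spell the same place differently.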

Timeflow

http://reporterslab.org/timeflow

Price: Free

Timeflow is a visual tool for reporters looking to organize and analyze historical data on long-term stories. The downloadable Java app makes it easier for reporters to tell in-depth narratives by tracking trends and making sense of temporal data.

Overview

https://www.overviewproject.org/

Price: Free

This Web-based application automatically sorts thousands of documents into topics and subtopics by reading the full text of each one. Its interface also links with DocumentCloud, allowing reporters to easily import, tag and scan these large document sets.


OpenRefine

http://openrefine.org/

Price: Free

OpenRefine (formerly Google Refine) allows rapid cleaning of data with a combination of Excel-like formulas and text faceting/clustering. The downloadable application, which runs in a Web browser, groups similar words together based on multiple algorithms and allows users to quickly standardize names, businesses and other data.
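To give a feel for what clustering does, here is a minimal Python sketch of one key-based approach: reduce each value to a normalized "fingerprint" and group values that share it. It is loosely modeled on the fingerprint idea and is not OpenRefine's actual code; the function names and sample values are assumptions for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key (a simplified sketch)."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into words, remove duplicates, sort and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical messy business names.
names = ["Acme Corp.", "acme corp", "Corp, Acme", "Widget LLC", "ACME  CORP"]
clusters = cluster(names)
print(clusters)
```

All four spellings of the made-up "Acme Corp" land in one group, while "Widget LLC" stands alone and is left out. A key like this will miss misspellings, which is one reason a tool would offer more than one algorithm.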

MySQL

http://dev.mysql.com/downloads/installer/5.6.html

Price: Free

Although it’s a powerful (and free) tool for building databases, MySQL isn’t particularly user-friendly. Its open-source community consists mostly of hardcore developers.
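For a sense of what working in SQL looks like, the sketch below runs a basic query from Python. It uses Python's built-in sqlite3 module as a stand-in so it runs without installing anything; it does not connect to MySQL, and the `contributions` table and its rows are invented for the example. The SELECT statement itself uses standard SQL that should look much the same in MySQL.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table of campaign contributions.
conn.execute("CREATE TABLE contributions (donor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("Donor A", 500.0), ("Donor B", 250.0), ("Donor A", 1000.0)],
)

# Total giving per donor, largest first.
totals = conn.execute(
    "SELECT donor, SUM(amount) AS total "
    "FROM contributions "
    "GROUP BY donor "
    "ORDER BY total DESC"
).fetchall()
conn.close()

for donor, total in totals:
    print(donor, total)
```

Grouping and summing like this is the kind of question that gets slow in a spreadsheet once the data runs to hundreds of thousands of rows.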

Navicat

http://www.navicat.com/

Price: $100

Navicat’s $100 price tag can be worth it if you work with MySQL regularly. It provides a user-friendly front end for the database service and reduces how much SQL you need to know. A free trial is available.

Muse

http://mobisocial.stanford.edu/muse/

Price: Free

This experimental research tool from a Stanford computer scientist was built to help users browse large email archives. Although it was originally meant for people to browse their own archives, it’s been adapted to import mailbox files from Outlook and other clients.

OTHER COOL STUFF

Mr. Data Converter

http://shancarter.github.io/mr-data-converter/

Price: Free

This open-source tool, built by Shan Carter, converts Excel data into one of several web-friendly structured formats, including HTML, JSON and XML.
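The same kind of conversion can be sketched in Python for anyone curious about what happens under the hood. This is not Mr. Data Converter's code; it assumes cells copied from a spreadsheet arrive as tab-separated text with a header row, and the `tsv_to_json` helper and sample rows are made up for illustration.

```python
import csv
import io
import json


def tsv_to_json(pasted):
    """Convert tab-separated text with a header row into a JSON string."""
    # DictReader uses the first row as keys for every following row.
    reader = csv.DictReader(io.StringIO(pasted), delimiter="\t")
    return json.dumps(list(reader), indent=2)


# Hypothetical cells as they might look when pasted from a spreadsheet.
sample = "name\tcity\nAda\tSan Antonio\nBen\tRaleigh\n"
print(tsv_to_json(sample))
```

Every value comes out as a string, so numeric columns would need converting before they are charted or summed.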

Data Science Toolkit

http://www.datasciencetoolkit.org/

Price: Free

This toolkit features an entire suite of easy-to-use Web apps for doing all kinds of cool things to data, like converting PDFs to plain text and converting street addresses to coordinates. It also features an open API for more advanced users.

Jigsaw

http://www.cc.gatech.edu/gvu/ii/jigsaw/

Price: Free

Another experimental tool born out of academia, this Java application helps users make sense of large collections of documents with the help of text analysis algorithms. It features a variety of ways to look at the documents, from topic clustering to entity extraction.


TOOLS TO WATCH

Tabula

https://github.com/jazzido/tabula

Price: Free

A free, open-source application to convert tabular information locked inside PDFs into CSV data, which you can import easily into spreadsheets. Check out the team’s session Thursday at 3:40 p.m.

DocHive

https://github.com/raleighpublicrecord/dochive

Price: Free

This Web-based application helps you convert scanned forms like campaign finance records and 990s into spreadsheets using built-in optical character recognition and an easy template builder.

FOIA Machine

http://www.foiamachine.org/

Price: Free

An online tool that helps you submit, organize and track public records requests at the state and federal level. Check out the team’s session Friday at 9:30 a.m.

PLACES TO LOOK

Reporters’ Lab

http://reporterslab.org

Tools and techniques for public affairs reporting. Includes a Consumer Reports-style review site for a number of different tools.

Nieman Journalism Lab

http://niemanlab.org

An attempt to help journalism figure out its future in an Internet age. Often features new and emerging tools for journalists.

Source

http://source.mozillaopennews.org/en-US/

Source is a Knight-Mozilla OpenNews project designed to amplify the impact of journalism code and the community of developers, designers, journalists and editors who make it.

Knight Lab

http://knightlab.northwestern.edu/

A team of technologists, journalists, designers and educators working to advance news media innovation through exploration and experimentation. The team often develops and releases its own tools for journalists.

ProPublica Nerd Blog

http://www.propublica.org/nerds

Secrets for data journalists and newsroom developers from the news app team at ProPublica. Features an entire collection of open-source tools available to journalists.

Data Driven Journalism

http://datadrivenjournalism.net

Dedicated to providing anyone interested in data-driven journalism with a collection of learning resources, including relevant events, tools, tutorials, interviews and case studies.
