OpenNews: Code Sprints do some spring cleaning on data

Data is a buzzword nowadays. Whether it’s sifting Big Data to influence business, or the promise of Open Data to transform government, or Data Analytics winning elections, data is constantly in the news. But one thing that gets glossed over in all the buzz is that data is hard. Really, really hard. One of the hardest parts is cleaning, standardizing, and formatting data in a way that journalists and others can start to work with. These are real challenges faced by newsrooms and we’re hoping to make some of that a little easier with two new Code Sprints we’re happy to announce today.

First up: Dedupe

One of the biggest problems with data sets is figuring out if information in one set of data is the same as information in another. When you have a small set of data, the work is pretty straightforward. But as your rows increase, the work becomes daunting. Derek Eder and Forest Gregg at Chicago’s DataMade have been working on an automated process for deduplication of data, and we’re happy to help get it to a state where running it on huge datasets is as simple as a few calls from the command line.
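To make the problem concrete, here is a minimal sketch of fuzzy record matching using only Python's standard library. It is not how Dedupe works internally; the donor names, the `normalize` and `find_duplicates` helpers, and the 0.85 threshold are all made up for illustration. It also shows why the work gets daunting as rows increase: every record is compared against every other one.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs of records that look alike.

    Compares every pair, so the cost grows with the square of the
    number of rows. The threshold is an arbitrary illustrative value.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Invented sample names, not real campaign finance data.
donors = ["John A. Smith", "john a smith", "Jane Doe", "Jon A Smith"]
print(find_duplicates(donors))
```

With four rows this is six comparisons; with a million rows it is roughly half a trillion, which is the kind of scaling problem a dedicated tool has to engineer around.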

A clear early use for the tool is in deduplicating campaign finance records, which can often be a slog. We’ve recruited the help of Derek Willis and others from the New York Times, who know something about …

The DataMade team has already done a great deal of heavy lifting (“we’ve solved most of the major engineering challenges of scaling up on large datasets,” DataMade’s Eder says), but lowering the barrier to entry for the tool is time and money well spent. If you can program in Python, you can fork and start running Dedupe today. If you want to wait for the simplified version, we’re expecting development to wrap up early this summer.

Next up: FMS Parser

The US Treasury releases a statement of, essentially, the Federal Government’s checkbook every day at 4pm EST. Unhelpfully, they release it as a straight-up text file or a PDF. Newsroom developers and info-hackers Cezary Podkul, Burton DeWilde, Thomas Levine, Jake Bialer, Brian Abelson, and Michael Keller started work on scraping and parsing that daily statement at the Bicoastal Datafest earlier this year.

The team got far enough along at the Datafest that they approached us about helping to turn it into an open API that any newsroom developer can access. With our Code Sprint grant, the team will take this once nearly inaccessible dataset and transform it into an easily accessible API that returns machine-readable JSON. In this time of cutbacks and budget wrangling, the FMS Parser should offer developers and journalists a new way to dive deeply into governmental spending.
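As a rough illustration of the text-to-JSON step, here is a minimal sketch in Python. It is not the FMS Parser's code: the sample text, its column layout, the figures, and the `parse_statement` helper are all invented for this example, and the real statement is considerably messier.

```python
import json
import re

# Invented excerpt for illustration only; the real statement's layout
# and figures are different.
SAMPLE = """\
Deposits                      Today    This month
Individual Income Taxes       1,234        5,678
Corporation Income Taxes        321        4,000
"""

# A row is a label, a gap of two or more spaces, then two numeric columns.
ROW = re.compile(r"^(?P<item>.+?)\s{2,}(?P<today>[\d,]+)\s+(?P<mtd>[\d,]+)\s*$")


def to_int(text):
    """Convert a string such as '1,234' to the integer 1234."""
    return int(text.replace(",", ""))


def parse_statement(text):
    """Turn each data row of the text table into a dictionary."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip headers, blank lines, anything non-tabular
        rows.append({
            "item": match.group("item").strip(),
            "today": to_int(match.group("today")),
            "month_to_date": to_int(match.group("mtd")),
        })
    return rows


print(json.dumps(parse_statement(SAMPLE), indent=2))
```

Once rows are plain dictionaries like these, serving them as JSON from an API is the easy part; most of the effort in a project like this goes into handling the irregularities of the source text.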

The tool should see some immediate use too, as the team of developers working on it includes newsroom developers at Reuters, the Daily Beast, and the Huffington Post (along with our Knight-Mozilla Fellow at the New York Times). While it’s still being developed, you can fork and follow at the FMS Parser GitHub repo.

Onward

A month ago I announced a reimagined Code Sprint application process, and we’re excited to help tools like this get the funding and attention they need through it. We’re always looking for developers and newsrooms with great ideas they want to build (along with newsrooms that want to beta-test them), so please drop a line. Let’s do this!