The Big Clean

A "BIG CLEAN" logo that looks like a logo for soap

I’m just about to return from Prague, Czech Republic, where I gave a workshop at the Big Clean. What a nice little conference this was!

It had two tracks: talks and the workshop. So I didn’t get to see many of the talks :(. But this meant I had the whole day to teach people about cleaning up data.

We started with some overview thoughts on cleaning up data, and then I went through the architecture of an analog data-cleaning process, as it might have been done 30 years ago and as it is still done more often than you’d realize. I work with computers enough that I can’t stand them, so I drew out a diagram on paper instead of using slides.

The fun part happens when we realize that the architecture is the same when we digitize the process. Once we realize this, the process seems less magic; it’s just a faster version of what people would do. Also, when you break up the project like this, it’s easier to work on it in stages.

I went through the writing of a simple web scraper script, then we broke for lunch, which was prepared by HotKarot using big data experimental social media crowdsourced realtime open-source catering methodologies.

Screenshot of a webpage with a network graph diagram covering most of the screen and with #bigcleancz tweets on the right

After stuffing ourselves at lunch, we worked on some of the participants’ projects/ideas.

  1. We added a column to a spreadsheet of Czech municipality characteristics by finding municipality areas on another website.
  2. We talked about various approaches to parsing PDF documents for one of Juha’s projects.
  3. We pulled the song-play history out of Last.fm. (I unfortunately don’t recall whose account we were looking at.) Last.fm records and exposes loads of data about your activities through its surprisingly convenient API, and this gets interesting if you’ve been using Last.fm for seven years.
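The first of these, adding a column to a spreadsheet from values found elsewhere, boils down to a lookup by municipality name. Here is a minimal sketch of that idea in Python, not the script we actually wrote at the workshop. The column names, the `add_area_column` helper and the sample rows are all made up for illustration, and it assumes the areas have already been scraped into a dictionary keyed by name.

```python
import csv
import io


def add_area_column(rows, areas, key="municipality", new_column="area_km2"):
    """Return copies of the rows with an area column looked up by name."""
    result = []
    for row in rows:
        enriched = dict(row)
        # Normalise the name so that " Town A " and "town a" hit the same entry.
        name = row[key].strip().lower()
        # Leave the cell empty rather than guess when there is no match.
        enriched[new_column] = areas.get(name, "")
        result.append(enriched)
    return result


# Placeholder data: these are not real municipalities or real areas.
spreadsheet = io.StringIO(
    "municipality,population\n"
    "Town A,1200\n"
    "Town B,560\n"
    "Town C,90\n"
)
# Hypothetical result of the scraping step, keyed by lower-cased name.
scraped_areas = {"town a": "14.2", "town b": "7.9"}

rows = list(csv.DictReader(spreadsheet))
enriched = add_area_column(rows, scraped_areas)

for row in enriched:
    print(row)
```

Rows with no match keep an empty cell, which makes it easy to see afterwards which municipalities still need checking by hand.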

Also,

2 Responses to “The Big Clean”

  1. Ton Zijlstra November 6, 2012 at 3:19 pm #

    Hi Thomas, it was good to meet you in Prague this weekend. As to wobbing: journalists from around Europe have created their own site on FOI and investigative journalism at the URL wobbing.eu 🙂

  2. tonzijlstra November 6, 2012 at 3:20 pm #

    Hi Thomas, it was good to meet you in Prague this weekend. As to WOB: A group of European journalists have named their website on FOI and investigative journalism wobbing.eu 🙂

    Indeed a much better sounding / stickier term than FOIA.
