Party

I went to a three-day party in Buenos Aires this past month. The first two days were talks and workshops; I gave a talk on how awesome I am and a workshop on cleaning data. The latter involved no computers and no slides, so I held it outside! I modeled an analog version of the […]

On-line directory tree webscraping

As you surf around the internet, particularly in the old days, you may have seen web pages like this: or this: The former image is generated by an Apache SVN server, and the latter is the plain directory view generated for UserDir on Apache. In both cases you have a very primitive page that allows […]

DumpTruck 0.0.3

I’ve added some new features to DumpTruck. Changes. Dictionary case sensitivity: I removed the dictionaries with case-insensitive keys because that just seemed to be delaying the conversion to case sensitivity. Ordered dictionaries: DumpTruck.execute now returns a collections.OrderedDict for each row rather than a dict for each row. Also, order is respected on insert, so you […]

Do all “analysts” use Excel?

We were wondering how common spreadsheets are as a platform for data analysis. It’s not something I’ve really thought about in a while; I find it way easier to clean numbers with real programming languages. But we suspected that virtually everyone else used spreadsheets, and specifically Excel, so we did a couple of things […]

Digging Olympic Data at Londinium MMXII

This is a guest post by Makoto Inoue, one of the organisers of this weekend’s Londinium MMXII hackathon. The Olympics! Only a few days to go until seemingly every news camera on the planet is pointed at the East End of London, for a month of sporting coverage. But for data diggers everywhere, this is […]

The state of Twitter: Mitt Romney and Indonesian Politics

It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can […]

Mapping deaths in the Italian prison system.

This is a guest post by Jacopo Ottaviani, Italian freelance journalist and developer. The story it tells was published in the Italian newspaper Il Fatto Quotidiano. Currently in Italy many prisoners die every month in jail. According to an independent dossier by the Italian non-profit association Ristretti Orizzonti (lit. Narrow Horizons), almost one thousand deaths were registered in the […]

Three hundred thousand tonnes of gold

On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number overflows the double precision floating point notation in the database, and this same number gets expressed as […]

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna’s basic Twitter scraper into a library. Here’s how you use it. Import it. (It only works on ScraperWiki, unfortunately.)
    from scraperwiki import swimport
    search = swimport('twitter_search').search
Then search for terms.
    search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])
A separate search will […]

Software Archaeology and the ScraperWiki Data Challenge at #europython

There’s a term in technical circles called “software archaeology” – it’s when you spend time studying and reverse-engineering badly documented code, to make it work, or make it better. Scraper writing involves a lot of this stuff. ScraperWiki’s data scientists are well accustomed to a bit of archaeology here and there. But now, we want […]

ScraperWiki
