There’s More Than One Way to Scrape a Site

A request came in to ScraperWiki to scrape information on the Members of the European Parliament. I put it out on Twitter and Facebook, hoping a kind member of the ScraperWiki community would have spent so much time on the computer that he or she had no life at all. As it turned out, I had to turn people away!

Within minutes, two tweeters wanted to give it a go and I got a reply on Facebook. In fact, Tim Green had already scraped the names and URLs of MEPs by the time I got back to him to say it had been claimed on Twitter by Pall Hilmarsson.

Although both scrapers look at the same site, Tim’s is fewer than 20 lines of code and, with only 8 revisions, a very quick scrape. Pall’s went for the full shebang, scraping opinions and speeches and generally drilling down into the data a whole lot more. Hence the nearly 200 lines of code!
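To give a feel for what a quick names-and-URLs scrape involves, here is a minimal sketch in Python. It is not Tim's or Pall's code: the sample markup, the `BASE_URL` placeholder, and the names `LinkCollector` and `scrape_names_and_urls` are all made up for illustration, and the real MEP pages will be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample markup, invented for this sketch.
# The real MEP listing page will not look like this.
SAMPLE_HTML = """
<ul class="meps">
  <li><a href="/members/1001">Jane Example</a></li>
  <li><a href="/members/1002">John Placeholder</a></li>
</ul>
"""

BASE_URL = "https://example.org"  # placeholder, not the real site


class LinkCollector(HTMLParser):
    """Collects the text and absolute URL of every link in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []
        self._href = None  # URL of the link currently being read, if any
        self._text = []    # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn a relative link into an absolute one.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.records.append({"name": name, "url": self._href})
            self._href = None


def scrape_names_and_urls(html, base_url=BASE_URL):
    """Return a list of {"name", "url"} dicts, one per link in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    for record in scrape_names_and_urls(SAMPLE_HTML):
        print(record["name"], record["url"])
```

A deeper scrape in the spirit of Pall's would presumably follow each collected URL and pull more fields from every member's page, which is where the extra lines of code come from.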

So if you’re a code junkie, take a look at what it takes to scrape, and then to scrape further, by comparing scrapers/meps with scrapers/meps_2. Also, Tim kindly scraped the next request: the National Historic Ships Register. To Tim and Pall I say: if the ScraperWiki digger were capable of emotion, you would both be receiving a diesel-greasy kiss!

European Parliament Members and National Historic Ships – you’ve been ScraperWikied! (with help from your friendly neighbourhood programmers)

Tags: European Parliament, National Historic Ships Register, Pall Hilmarsson, Tim Green
