julian todd – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Spot and Normalize Inconsistent Measures
https://blog.scraperwiki.com/2011/02/spot-and-normalize-inconsistent-measures/
Thu, 10 Feb 2011 17:07:28 +0000

Here’s an example of why you have to be very careful when scraping,
and why your normal run-of-the-mill technology that makes assumptions
won’t cut it:

One of our super-users, Julian Todd, decided to scrape the Vehicle Certification Agency (VCA) website for its figures on new car fuel consumption and exhaust emissions. And he spotted this:

And another search resulted in this:

Yes, that’s a change from milligrams per km to grams per km, noted
only in the header.

In ScraperWiki we can normalize this in standard Python code:

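# Iterate over a copy of the keys, since the loop renames entries in data.
for key in list(data.keys()):
    if key[-6:] == " mg km":
        nkey = key[:-6] + " g km"
        v = data.pop(key)
        if v is None:
            data[nkey] = None
        else:
            # Convert milligrams to grams.
            data[nkey] = float(v) / 1000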
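
To try the idea outside a scraper, here is a minimal self-contained sketch that wraps the same conversion in a function. The function name `normalize_mg_to_g` and the sample row are assumptions made up for illustration; only the " mg km" and " g km" key suffixes come from the snippet above. Unlike the in-place loop, this version builds a new dict, which avoids changing a dict while iterating over it.

```python
def normalize_mg_to_g(record):
    """Return a copy of record with every '... mg km' key converted to '... g km'."""
    suffix_mg = " mg km"
    suffix_g = " g km"
    normalized = {}
    for key, value in record.items():
        if key.endswith(suffix_mg):
            new_key = key[:-len(suffix_mg)] + suffix_g
            # Missing readings stay missing; other values are scaled from mg to g.
            normalized[new_key] = None if value is None else float(value) / 1000
        else:
            # Keys without the mg suffix are copied through unchanged.
            normalized[key] = value
    return normalized


# Hypothetical row, invented for illustration (not real VCA data).
row = {"Model": "Example 1.6", "CO mg km": "250", "NOx mg km": None}
print(normalize_mg_to_g(row))
```

Running every scraped row through one function like this means all stored values share a single unit, whichever header the source page happened to use.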

This is from the scraper:

What could a journalist do with ScraperWiki? A quick guide
https://blog.scraperwiki.com/2010/07/what-could-a-journalist-do-with-scraperwiki-a-quick-guide/
Fri, 16 Jul 2010 11:17:25 +0000

For non-programmers, a first look at ScraperWiki’s code could be a bit scary, but we want journalists and researchers to make use of the site, so we’ve set up a variety of initiatives to do that.

Firstly, we’re setting up a number of Hacks and Hackers Days around the UK, with Liverpool as our first stop outside of London. You can follow this blog or visit our Eventbrite page to find out more details.

Secondly, our programmers are teaching ScraperWiki workshops and classes around the UK.

Anna Powell-Smith took ScraperWiki to the Midlands, and taught Paul Bradshaw’s MA students at Birmingham City University the basics. Paul has written up some notes at this link.

Julian Todd ran a ‘Scraping 101’ session at the Centre for Investigative Journalism summer school last weekend. He ran through the basics of ScraperWiki and showed how he was using it to map and track offshore oil wells in the UK.

You can see his slides at this link.

Julian explained just why ScraperWiki is useful…

Your options for webscraping

1. Do the coding yourself

2. Get someone else to code it for you

3. Have it done already!

Number 3 is where ScraperWiki, a place for sharing scrapers, comes in.

Last month, ScraperWiki gave a talk and ran a stall at Journalism.co.uk’s news:rewired event. You can read a write-up of Francis Irving’s presentation by Journalism.co.uk’s Rachel McAthy here:

The presentations were concluded by Francis Irving, developer for ScraperWiki, who outlined how they can help journalists transform confusing data into a newsworthy story. He showed two examples of datasets the company can ‘scrape’ data from, producing more accessible tables or even visualisations such as maps, saving journalists’ time.

(Some more general points from the session can be read here)

Meanwhile, Jon Jacob from the BBC College of Journalism caught Francis on video…

If you have any questions about ScraperWiki or our Hacks and Hackers events please contact Aine McGuire: aine [at] scraperwiki [dot] com.
