Spot and Normalize Inconsistent Measures
ScraperWiki blog, Thu, 10 Feb 2011
https://blog.scraperwiki.com/2011/02/spot-and-normalize-inconsistent-measures/

Here’s an example of why you have to be very careful when scraping,
and why your normal run-of-the-mill technology that makes assumptions
won’t cut it:

One of our super-users, Julian Todd, decided to scrape the Vehicle Certification Agency (VCA) website on new car fuel consumption and exhaust emissions figures. And he spotted this:

And another search resulted in this:

Yes, that’s a change from milligrams per km to grams per km, noted
only in the header.

In ScraperWiki we can normalize this in standard Python code:

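# Convert every "... mg km" field to "... g km" so all records share one unit.
for key in list(data.keys()):  # copy the keys: the dict changes inside the loop
    if key.endswith(" mg km"):
        nkey = key[:-6] + " g km"
        v = data.pop(key)
        if v is None:
            data[nkey] = None
        else:
            data[nkey] = float(v) / 1000  # milligrams to grams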

This is from the scraper:
http://scraperwiki.com/scrapers/vca-car-fuel-data/
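
To try the same idea outside the scraper, here is a minimal self-contained sketch. The function name normalize_mg_to_g and the sample records are made up for illustration and are not taken from the VCA site or from Julian's scraper; the only assumption carried over is that the unit appears as a " mg km" or " g km" suffix on the field name.

```python
def normalize_mg_to_g(record):
    """Return a copy of record with each '<name> mg km' field
    renamed to '<name> g km' and its value divided by 1000."""
    suffix_mg = " mg km"
    normalized = {}
    for key, value in record.items():
        if key.endswith(suffix_mg):
            new_key = key[:-len(suffix_mg)] + " g km"
            # Keep missing values as None rather than guessing a number.
            normalized[new_key] = None if value is None else float(value) / 1000
        else:
            normalized[key] = value
    return normalized


# Hypothetical records: field names and figures are invented for this example.
row_mg = {"Model": "Example A", "CO mg km": "250", "NOx mg km": None}
row_g = {"Model": "Example B", "CO g km": 0.31}

print(normalize_mg_to_g(row_mg))
print(normalize_mg_to_g(row_g))
```

Building a new dict instead of popping keys in place avoids changing the dict while looping over it, and records that are already in grams per km pass through unchanged.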
