Spot and Normalize Inconsistent Measures
Here’s an example of why you have to be very careful when scraping,
and why run-of-the-mill tools that make assumptions about the data
won’t cut it:
One of our super-users, Julian Todd, decided to scrape the Vehicle Certification Agency (VCA) website for its figures on new car fuel consumption and exhaust emissions. He spotted this:
And another search resulted in this:
Yes, that’s a change from milligrams per km to grams per km, noted
only in the header.
In ScraperWiki we can normalize this in standard Python code:
    # Copy the keys first, since the loop modifies the dict as it goes
    for key in list(data.keys()):
        if key[-6:] == " mg km":
            nkey = key[:-6] + " g km"
            v = data.pop(key)
            if v is None:
                data[nkey] = None
            else:
                # milligrams to grams
                data[nkey] = float(v) / 1000
This is from the scraper:
http://scraperwiki.com/scrapers/vca-car-fuel-data/
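To try the same idea outside the scraper, here is a minimal self-contained sketch. The function name normalize_units, its parameters, and the sample row are hypothetical and only for illustration; the real VCA column names may differ. It assumes, as in the snippet above, that the unit appears as a suffix of the column name.

```python
def normalize_units(record, old_suffix=" mg km", new_suffix=" g km", factor=1000.0):
    """Return a copy of record with milligram columns converted to grams.

    Hypothetical helper: assumes the unit is the trailing part of each key.
    """
    normalized = {}
    for key, value in record.items():
        if key.endswith(old_suffix):
            # Rename the column and rescale the value, keeping missing values as None
            new_key = key[:-len(old_suffix)] + new_suffix
            normalized[new_key] = None if value is None else float(value) / factor
        else:
            normalized[key] = value
    return normalized


# Made-up row for illustration, not real VCA data
row = {"Model": "Example", "CO mg km": "250", "NOx mg km": None}
print(normalize_units(row))
```

Building a new dict rather than changing the original in place avoids modifying a dictionary while looping over it, and leaves the scraped row untouched in case you need to compare before and after.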