scraping – ScraperWiki Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000 en-US hourly 1 58264007 Scraping Spreadsheets with XYPath Wed, 12 Mar 2014 17:11:12 +0000 Spreadsheets are great. They’re ubiquitously available, beaten only by the web pages and the word processor documents.

Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated databases, they don’t impose a strong sense of structure, they’re easy to print out and they can be imported into different pieces of software.

And so people pass their data around in Excel files and CSV files, or they replicate the look-and-feel of a spreadsheet in the browser with a table, often creating a large number of tables.

But the very traits that made spreadsheets so simple for the user creating the data, hamper us when we want to reuse the data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.



So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python: it doesn’t really care whether the file it’s loading is an XLS, CSV, a HTML page or a ZIP containing CSVs, it gives us a uniform interface to all these table-containing filetypes.

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the word “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries.  Beneath the years, and to the right of the countries are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

region_cell = pop_table.filter("Country Code").assert_one()
years = region_cell.fill(RIGHT)
countries = region_cell.fill(DOWN)
print list(years.junction(countries))

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to talk about any groups of cells, even those that aren’t a rectangular block.

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in either harnessing our technical expertise at understanding spreadsheets, or if you’ve any questions about the code!

Try Table Xtract or Call ScraperWiki Professional Services

Digging Olympic Data at Londinium MMXII Tue, 24 Jul 2012 09:50:23 +0000 This is a guest post by Makoto Inoue, one of the organisers of this weekend’s Londinium MMXII hackathon.

The Olympics! Only a few days to go until seemingly every news camera on the planet is pointed at the East End of London, for a month of sporting coverage. But for data diggers everywhere, this is also a gigantic opportunity to analyse and visualise whole swathes of sporting data, as well as create new devices and apps to amplify, manage and make sense of the data in interesting ways.

Remapping past Olympic results into London 2012 schedule to predict the medal ranking leader board

I’m organising the Londinium MMXII Hackathon which happens the day after the opening of the Olympics so that the participants can do cool hacks using real time data. But while you can use Twitter and Facebook to gather social buzz, or TfL, Google Maps and Foursquare to do geo mashups, it turns out the one dataset we’re missing is real time game results. I spent a long time trying to find out if there are publicly available data APIs but in the end it looked like we were out of luck!

Out of luck, that was, until we found out about ScraperWiki. Rather than waiting for the data to come to us, ScraperWiki lets us go grab the freshest data ourselves – after all, there will be tons of news sites publishing the Olympic schedule, and many (like as the BBC) are well structured enough to reliably scrape. Since the BBC publishes the schedule (and, from the look of it, the result) of each event, including most importantly, the exact time of each sport, we can easily set periodic scheduler jobs to scrape the latest data as it is announced. Perfect!

I’ve already written one scraper while writing an Olympic medal rivalry article, so feel free to copy the scraper as your own starting point. Setting an hourly cronjob on ScraperWiki is normally a premium service, but the guys at ScraperWiki are so keen to see what data the Londinium MMXII hackers can come up with, they’re allowing all participants free access to set an hourly cron, for the duration of the hackathon (thanks ScraperWiki!). So let’s join the hackathon and hack together!!

Three hundred thousand tonnes of gold Wed, 04 Jul 2012 20:17:18 +0000 On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number overflows the double precision floating point notation in the database, and this same number gets expressed as 1.58887418588e+13.

Doesn’t matter for now. Let’s look at the graph over time:

It’s not very exciting, unless you happen to be interested in phrases such as “debasing our fiat currency” and “return to the gold standard”. In truth, one should really divide the values by the GDP, or the national population, or the cumulative inflation over the time period to scale it properly.

Nevertheless, I decided also to look at the gold price, which can be seen as a graph (click the [Graph] button, then [Group Column (x-axis)]: “date” and [Group Column (y-axis)]: “price”) on the Data Hub. They give this dataset the title: Gold Prices in London 1950-2008 (Monthly).

Why does the data stop in 2008 just when things start to get interesting?

I discovered a download url in the metadata for this dataset:

which is somewhere within the githubtm as part of the repository in which there resides a 60 line Python scraper known as

Aha, something I can work with! I cut-and-pasted the code into ScraperWiki as scrapers/gold_prices and tried to run it. Obviously it didn’t work as-is — code always requires some fiddling about when it is transplanted into an alien environment. The module contained three functions: download(), extract() and upload().

The download() function didn’t work because it tries to pull from the broken link:

This is one of unavoidable failures that can befall a webscraper, and was one of the motivations for hosting code in a wiki so that such problems can be trivially corrected without an hour of labour checking out the code in someone else’s favourite version control system, setting up the environment, trying to install all the dependent modules, and usually failing to get it to work if you happen to use Windows like me.

After some looking around on the Bundesbank website, I found the Time_series_databases (Click on [Open all] and search for “gold”.) There’s Yearly average, Monthly average and Daily rates. Clearly the latter is the one to go for as the other rates are averages and likely to be derivations of the primary day rate value.

I wonder what a “Data basket” is.

Anyways, moving on. Taking the first CSV link and inserting it into that code hits a snag in the extract() function:

downloaded = 'cache/bbk_WU5500.csv'
outpath = 'data/data.csv'
def extract():
    reader = csv.reader(open(downloaded))
    # trim junk from files
    newrows = [ [row[0], row[1]] for row in list(reader)[5:-1] ]

    existing = []
    if os.path.exists(outpath):
        existing = [ row for row in csv.reader(open(outpath)) ]

    starter = newrows[0]
    for idx,row in enumerate(existing):
        if row[0] == starter[0]:
            del existing[idx:]

    # and now add in new data
    outrows = existing + newrows
    csv.writer(open(outpath, 'w')).writerows(outrows)

ScraperWiki doesn’t have persistent files, and in this case they’re not helpful because all these lines of code are basically replicating the features through use of the following two lines:

    ldata = [ { "date":row[0], "value":float(row[1]) }  for row in newrows  if row[1] != '.' ]["date"], ldata)

And now your always-up-to-date gold price graph is yours to have at the cost of select date, value from swdata order by date –> google annotatedtimeline.

But back to the naked github disclosed code. Without its own convenient feature, this script must use its own upload() function.

def upload():
    import datastore.client as c
    dsurl = ''
    client = c.DataStoreClient(dsurl)

Ah, we have another problem: a dependency on the undeclared datastore.client library, which was probably so seamlessly available to the author on his own computer that he didn’t notice its requirement when he committed the code to the github where it could not be reused without this library. The library datastore.client is not available in the github/datasets account; but you can find it in the completely different github/okfn account.

I tried calling this code by cut-and-pasting it into the ScraperWiki scraper, and it did something strange that looked like it was uploading the data to somewhere, but I can’t work out what’s happened. Not to worry. I’m sure someone will let me know what happened when they find a dataset somewhere that is inexplicably more up to date than it used to be.

But back to the point. Using the awesome power of our genuine data-hub system we can take the us_debt_to_the_penny, and attach the gold_prices database to perform the combined query to scales ounces of gold into tonnes:

    AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata as gold
  ON =
WHERE is not null

and get the graph of US government debt expressed in terms of tonnes of gold.

So that looks like good news for all the gold-bugs, the US government debt in the hard currency of gold has been going steadily down by a factor of two since 2001 to around 280 thousand tonnes. The only problem with that there’s only 164 thousand tonnes of gold in the world according to the latest estimates.

Other fun charts people find interesting such as gold to oil ratio can be done once the relevant data series is loaded and made available for joining.

]]> 3 758217195
Software Archaeology and the ScraperWiki Data Challenge at #europython Fri, 29 Jun 2012 09:24:27 +0000 There’s a term in technical circles called “software archaeology” – it’s when you spend time studying and reverse-engineering badly documented code, to make it work, or make it better. Scraper writing involves a lot of this stuff. ScraperWiki’s data scientists are well accustomed with a bit of archaeology here and there.

But now, we want to do the same thing for the process of writing code-that-does-stuff-with-data. Data Science Archaeology, if you like. Most scrapers or visualisations are pretty self-explanatory (and an open platform like ScraperWiki makes interrogating and understanding other people’s code easier than ever). But working out why the code was written, why the visualisations were made, and who went to all that bother, is a little more difficult.

ScraperWiki Europython poster explaining the Data Challenge about European fishing boats

That’s why, this Summer, ScraperWiki’s on a quest to meet and collaborate with data science communities around the world. We’ve held journalism hack days in the US, and interviewed R statisticians from all over the place. And now, next week, Julian and Francis are heading out to Florence to meet the European Python community.

We want to know how Python programmers deal with data. What software environments do they use? Which functions? Which libraries? How much code is written to ‘get data’ and if it runs repeatedly? These people are geniuses, but for some reason nobody shouts about how they do what they do… Until now!

And, to coax the data science rock stars out of the woodwork, we’re setting a Data Challenge for you all…

In 2010 the BBC published the article about the ‘profound’ decline in fish stocks shown in UK Records. “Over-fishing,” they argued, “means UK trawlers have to work 17 times as hard for the same fish catch as 120 years ago.” The same thing is happening all across Europe, and it got us ScraperWikians wondering: how do the combined forces of legislation and overfishing affect trawler fleet numbers?

We want you to trawl (ba-dum-tsch) through this EU data set and work out which EU country is losing the most boats as fisherman strive to meet the EU policies and quotas. The data shows you stuff like each vessel’s license number, home port, maintenance history and transfer status, and a big “DES” if it’s been destroyed. We’ll be giving away a tasty prize to the most interesting exploration of the data – but most of all, we want to know how you found your answer, what tools you used, and what problems you overcame. So check it out!!


PS: #Europython’s going to be awesome, and if you’re not already signed up, you’re missing out. ScraperWiki is a startup sponsor for the event and we would like to thank the Europython organisers and specifically Lorenzo Mancini for his help in printing out a giant version of the picture above, ready for display at the Poster Session.

]]> 2 758217302
Local ScraperWiki Library Thu, 07 Jun 2012 15:24:28 +0000 It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use

pip install scraperwiki_local 

A dump truck dumping its payload
You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck


DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values

What happens if you do this?

import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw'][], {'shopping_list': shopping_list})

On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:

[u'carrots', u'orange juice', u'chainsaw'] 

In the local version, it is encoded to JSON, so it looks like this:

["carrots","orange juice","chainsaw"] 

And if it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names

SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]:[], {'shopping_list': shopping_list})
In [3]:[], {'sHOpPiNg_liST': shopping_list})
In [4]:'* from swdata')
Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data ='* from swdata')
In [6]: print([row['shopping_list'] for row in data])
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: print(data[0]['Shopping_List'])
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, returns a list of ordinary dictionaries. In the local version, returns a list of special dictionaries that have case-insensitive keys.

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted. For another lot of things, there will be differences and the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!

]]> 5 758217004
Fine set of graphs at the Office of National Statistics Thu, 22 Mar 2012 11:47:01 +0000 It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office of National Statistics (UK).

If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through?

A Population Pyramid is a straightforward way to visualize the data, like so:

This image is sufficient for determining what needs to be supplied (eg more children means more schools and toy-shops), but it doesn’t explain why.

The “why?” and “what’s going on?” questions are much more interesting, but are pretty much guesswork because they refer to layers in the data that you cannot see. For example, the number of people in East Devon of a particular age is the sum of those who have moved into the area at various times, minus those who have moved away (temporarily or permanently), plus those who were already there and have grown older but not yet died. For any bulge, you don’t know which layer it belongs to.

In this 2015 population pyramid there are bulges at 28, 50 and a pronounced spike at 68, as well as dips at 14 and 38. In terms of birth years, these correspond to 1987, 1965 and 1947 (spike), and dips at 2001 and 1977.

You can pretend they correspond to recessions, economic boom times and second wave feminism, but the 1947 post-war spike when a mass of men-folk were demobilized from the military is a pretty clean signal.

What makes this data presentation especially lovely is that it is localized, so you can see the population pyramid per city:

Cambridge, as everyone knows, is a university town, which explains the persistent spike at the age 20.

And, while it looks like there is gender equality for 20 year old university students, there is a pretty hefty male lump up to the age of 30 — possibly corresponding folks doing higher degrees. Is this because fewer men are leaving town at the appropriate age to become productive members of society, or is there an influx of foreign grad students from places where there is less of a gender equality? The data set of student origins and enrollments would give you the story.

As to the pyramid on the right hand side, I have no idea what is going on in Camden to account for that bulge in 30 year olds. What is obvious, though, is that the bulge in infants must be related. In fact, almost all the children between the ages of 0 and 16 years will have corresponding parents higher up the same pyramid. Also, there is likely to be a pairwise cross-gender correspondence between individuals of the same generation living together.

These internal links, external data connections, sub-cohorts and new questions raised the more you look at it means that it is impossible to create a single all-purpose visualization application that could serve all of these. We can wonder as to whether an interface which worked via javascript-generated SQL calls (rather than flash and server-side queries) would have enabled someone with the right skills to roll their own queries and, for example, immediately find out which city and age group has the greatest gender disparity, and whether all spikes at the 20-year-old age bracket can be accounted for by universities.

For more, see An overview of ONS’s population statistics.

As it is, someone is going to have to download/scrape, parse and load at least one year of source data into a data hub of their choice in order to query this (we’ve started on 2010’s figures here on ScraperWiki – take a look). Once that’s done, you’d be able to sort the cities by the greatest ratio between number of 20 year olds and number of 16 year olds, because that’s a good signal of student influx.

I don’t have time to get onto the Population projection models, where it really gets interesting. There you have all the clever calculations based on guestimates of migration, mortality and fertility.

What I would really like to see are these calculations done live and interactively, as well as combined with economic data. Is the state pension system going to go bankrupt because of the “baby boomers”? Who knows? I know someone who doesn’t know: someone who’s opinion does not rely (even indirectly) on something approaching a dynamic data calculation. I mean, if the difference between solvency and bankruptcy is within the margin of error in the estimate of fertility rate, or 0.2% in the tax base, then that’s not what I’d call bankrupt. You can only find this out by tinkering with the inputs with an element of curiosity.

Privatized pensions ought to be put into the model as well, to give them the macro-economic context that no pension adviser I’ve ever known seems capable of understanding. I mean, it’s evident that the stock market (in which private pensions invest) does happen to yield a finite quantity of profit each year. Ergo it can support a finite number of pension plans. So a national policy which demands more such pension plans than this finite number is inevitably going to leave people hungry.

Always keep in mind the long term vision of data and governance. In the future it will all come together like transport planning, or the procurement of adequate rocket fuel to launch a satellite into orbit; a matter of measurements and predictable consequences. Then governance will be a science, like chemistry, or the prediction of earthquakes.

But don’t forget: we can’t do anything without first getting the raw data into a usable format. Dave McKee’s started on 2010’s data here … fancy helping out?

How to get along with an ASP webpage Wed, 09 Nov 2011 12:14:02 +0000 Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats.

Unfortunately, the one set of Open Data that was difficult to obtain, was the list of sets of open data. That’s because the list was separated into four separate pages.

The important thing to observe is that Next >> link is no ordinary link. You can see something is wrong when you hover your cursor over it. Here’s what it looks like in the HTML source code:

<a id="lnkNext" href="javascript:__doPostBack('lnkNext','')">Next >></a>

What it does (instead of taking the browser to the next page) is execute the javascript function __doPostBack().

Now, this could take a long time to untangle by stepping through the javascript code to the extent that it would be a hopeless waste of time, but for the fact that this is code generated by Microsoft and there are literally millions of webpages that work in exactly the same way.

This __doPostBack() javascript function is always on the page and it’s always the same, if you look at the HTML source.

<script type="text/javascript">
var theForm = document.forms['form1'];
if (!theForm) {
     theForm = document.form1;
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;


So what it’s doing is putting the two arguments from the function call (in this example ‘lnkNext’ and ”) into two values of the hidden form called “form1” and then submitting the form back to the server as a POST request.

Let’s try to look at the form. Here is some Python code which ought to do it.

import mechanize
br = mechanize.Browser()"")
print br.form

Unfortunately this doesn’t work, because the form is has no name. Here is how it appears in the HTML:

<form method="post" action="" id="form1">
<div class="aspNetHidden">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMjA4MT... insanely long ascii string />
...the entire rest of the webpage...

The javascript is selecting it by the id which, unfortunately, mechanize doesn’t allow. Fortunately there is only one form in the whole page, so we can select it as the first form on the page:

import mechanize
br = mechanize.Browser()"")
print br.form

What do we get?

 <POST application/x-www-form-urlencoded
<HiddenControl(__VIEWSTATE=/wEPDwUKMjA4...  and so on ) (readonly)>
<HiddenControl(__EVENTVALIDATION=/wEWVQK...  and so on ) (readonly)>
<TextControl(txtSearch=Search DataSets)>
<TextControl(txtSearch=Search DataSets)>
&ltSubmitControl(btnSearch=Search) (readonly)>
<SelectControl(ddlOrder=[*Title, Agency, Rating])>>

Oh dear. What has happened to the __EVENTTARGET and __EVENTARGUMENT which I am going to have to put values in when I am simulating the __doPostBack() function?

I don’t really know.

What I do know is that if you insert the following line:

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

just before the line that says to include some headers that are recognized by the Microsoft server software, then you get them back:

<POST application/x-www-form-urlencoded
<HiddenControl(__EVENTTARGET=) (readonly)>
<HiddenControl(__EVENTARGUMENT=) (readonly)>
<HiddenControl(__LASTFOCUS=) (readonly)>

Right, so all we need to do to get to the next page is fill in their values and submit the form, like so:

br["__EVENTTARGET"] = "lnkNext"
br["__EVENTARGUMENT"] = ""
response = br.submit()

Woops, that doesn’t quite work, because those two controls are readonly. Luckily there is a function in mechanize to make this problem go away, which looks like:


So let’s put this all together, including pulling out those special values from the __doPostBack() javascript to make it more general and putting it into a loop.

How about it?

import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response ="")

for i in range(10):
    html =
    print "Page %d :" % i, html

    print br.form
    mnext ="""<a id="lnkNext" href="javascript:__doPostBack('(.*?)','(.*?)')">Next >>""", html)
    if not mnext:
    br["__EVENTTARGET"] =
    br["__EVENTARGUMENT"] =
    response = br.submit()

It still doesn’t quite work! This stops at two pages, but you know there are four.

What is the problem?

The problem is this SubmitControl in the list of controls in the form:

<SubmitControl(btnSearch=Search) (readonly)>

You think you are submitting the form, when in fact you are clicking on the Search button, which then takes you to a page you are not expecting that has no Next >> link on it.

If you disable that particular SubmitControl before submitting the form

br.find_control("btnSearch").disabled = True

then it works.

From here on it’s plain sailing. All you need to do is parse the html, follow the normal links, and away you go!

In summary

1. You need to use mechanize because these links are operated by javascript and a form

2. Clicking the link involves copying the arguments from __doPostBack() into the __EVENTTARGET and __EVENTARGUMENT HiddenControls.

3. You must set readonly to False so you can even write to those values.

4. You must set the User-agent header or the server software doesn’t know what browser you are using and returns something that can’t possibly work.

5. You must disable all extraneous SubmitControls in the form before calling submit()

Some of these tricks have taken me a day to learn and resulted in me almost giving up for good. So I am passing on this knowledge in the hope that it can be used. There are other tricks of the trade I have picked up regarding ASP pages that there is no time to pass on here because the examples are even more involved and harder to get across.

What we need is an ASP pages working group among the ScraperWiki diggers who take on this type of work. Anyone who is faced with one of these jobs should be able to bring it to the team and we’ll take a look at it as a group of experts with the knowledge. I expect problems to be disposed of within half an hour that would take someone who hasn’t done it before to take a week or give up before they’ve even got started.

This is how we can produce results.

]]> 7 758215794
Scraping guides: Excel spreadsheets Wed, 14 Sep 2011 15:55:29 +0000 Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to them from the documentation page.

The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page.

As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?

The format of Excel files can varies a lot – how columns are arranged, where tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not formatted in cells – entire rows in one cell, or data stored in notes.

We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.

Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.

This merchant oil shipping scraper does a few regular expressions to parse some text in one of the columns.

Next time – parsing HTML with CSS selectors.

Access government in a way that makes sense to you? Surely not! Wed, 11 May 2011 15:07:22 +0000 uses Scraperwiki, a cutting edge data-gathering tool, to deliver the results that citizens want. And radically for government, rather than tossing a finished product out onto the web with a team of defenders, this is an experiment in customer engagement.

If you’re looking to renew your passport, find out about student loans or how to complete tax returns, it’s usually easier to use Google than navigate through government sites.  That was the insight for director of the Alphagov project Tom Loosemore, and his team of developers.  This is a government project run by government. is not a traditional website, it’s a developer led but citizen focused experiment to engage with government information.

It abandons the approach of forcing you to think the way they thought, instead it provides a simple “ask me a question” interface and learns from customer journeys, starting with the first 80 of the most popular searches that led to a government website.

But how would they get information from all those Government website sites into the new system?

I spoke to Tom about the challenges behind the informational architecture of the site and he noted that: “Without the dynamic approach that Scraperwiki offers we would have had to rely on writing lots of redundant code to scrape the websites and munch together the different datasets. Normally that would have taken our developers a significant amount of time, would have been a lot of hassle and would have been hard to maintain. Hence we were delighted to use ScraperWiki, it was the perfect tool for what we needed.  It avoided a huge headache .”

Our very own ScraperWiki CEO Francis Irving says “It’s fantastic to see Government changing its use of the web to make it less hassle for citizens. Just as developers need data in an organised form to make new applications from it, so does Government itself. ScraperWiki is an efficient way to maintain lots of converters for data from diverse places, such as have here from many Government departments. This kind of data integration is a pattern we’re seeing, meeting people’s expectations for web applications oriented around the user getting what they want done fast. I look forward to seeing the project rolling out fully – if only so I can renew my passport without having to read lots of text first!”

Just check out the AlphaGov tag. Because government sites weren’t built to speak to one another there’s no way their data would be compatible to cut and paste into a new site. So this is another cog in the ScraperWiki machine: merging content from systems that cannot talk to each other. is an experimental prototype (an ‘alpha’) of a new, single website for UK Government, developed in line with the recommendations of Martha Lane Fox’s Review. The site is a demonstration, and whilst it’s public it’s not permanent and is not replacing any other website.  It’s been built in three months by a small team in the Government Digital Service, part of the Cabinet Office.

]]> 5 758214754
ScraperWiki: A story about two boys, web scraping and a worm Thu, 05 May 2011 10:06:44 +0000 Spectrum game

“It’s like a buddy movie.” she said.

Not quite the kind of story lead I’m used to. But what do you expect if you employ journalists in a tech startup?

“Tell them about that computer game of his that you bought with your pocket money.”

She means the one with the risqué name.

I think I’d rather tell you about screen scraping, and why it is fundamental to the nature of data.

About how Julian spent almost a decade scraping himself to death until deciding to step back out and build a tool to make it easier.

I’ll give one example.

Two boys

In 2003, Julian wanted to know how his MP had voted on the Iraq war.

The lists of votes were there, on the website. But buried behind dozens of mouse clicks.

Julian and I wrote some software to read the pages for us, and created what eventually became TheyWorkForYou.

We could slice and dice the votes, mix them with some knowledge from political anaroks, and create simple sentences. Mini computer generated stories.

“Louise Ellman voted very strongly for the Iraq war.”

You can see it, and other stories, there now. Try the postcode of the ScraperWiki office, L3 5RF.

I remember the first lobbiest I showed it to. She couldn’t believe it. Decades of work done in an instant by a computer. An encyclopedia of data there in a moment.

Web Scraping

It might seem like a trick at first, as if it was special to Parliament. But actually, everyone does this kind of thing.

Google search is just a giant screen scraper, with one secret sauce algorithm guessing its ranking data.

Facebook uses scraping as a core part of its viral growth to let users easily import their email address book.

There’s lots of messy data in the world. Talk to a geek or a tech company, and you’ll find a screen scraper somewhere.

Why is this?

It’s Tautology

On the surface, screen scrapers look just like devices to work round incomplete IT systems.

Parliament used to publish quite rough HTML, and certainly had no database of MP voting records. So yes, scrapers are partly a clever trick to get round that.

But even if Parliament had published it in a structured format, their publishing would never have been quite right for what we wanted to do.

We still would have had to write a data loader (search for ‘ETL’ to see what a big industry that is). We still would have had to refine the data, linking to other datasets we used about MPs. We still would have had to validate it, like when we found the dead MP who voted.

It would have needed quite a bit of programming, that would have looked very much like a screen scraper.

And then, of course, we still would have had to build the application, connecting the data to the code that delivered the tool that millions of wonks and citizens use every year.

Core to it all is this: When you’re reusing data for a new purpose, a purpose the original creator didn’t intend, you have to work at it.

Put like that, it’s a tautology.

A journalist doesn’t just want to know what the person who created the data wanted them to know.

Scrape Through

So when Julian asked me to be CEO of ScraperWiki, that’s what went through my head.

Secrets buried everywhere.

The same kind of benefits we found for politics in TheyWorkForYou, but scattered across a hundred countries of public data, buried in a thousand corporate intranets.

If only there was a tool for that.

A Worm

And what about my pocket money?

Nicola was talking about Fat Worm Blows a Sparky.

Julian’s boss’s wife gave it its risqué name while blowing bubbles in the bath. It was 1986. Computers were new. He was 17.

Fat Worm cost me £9.95. I was 12.

[Loading screen]

I was on at most £1 a week, so that was ten weeks of savings.

Luckily, the 3D graphics were incomprehensibly good for the mid 1980s. Wonder who the genius programmer is.

I hadn’t met him yet, but it was the start of this story.