Table Scraping Is Hard
The Problem
NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency; A well known B2B publisher came to us to aggregate that data and provide them with information spanning across the hundreds of different trusts, such as: who are the biggest contractors across the NHS?
It’s a common problem – there’s lots of data out there which isn’t in a nice, neat, obviously usable format, for various reasons. What we’d like to have is all that data in a single database so we can slice it by time and place, and look for patterns. There’s no magic bullet for this, yet; we’re having to solve issues on each step of the way.
Where’s the data?
A sizable portion of the data is stored in data.gov.uk, but significant chunks were stored elsewhere on individual NHS trust websites. Short of spidering every NHS trust website, we wouldn’t be able to find the spreadsheets, so we automated finding the spreadsheets through Google.
Google make it difficult to scrape them – ironic, given that that’s what they do to everyone else – building a search engine means scraping websites and indexing the content. We also found that the data in the tables was held in a variety of formats and whilst most of this spending data was in spreadsheets, we also found tables in web pages and PDFs. Usually these would each need separate software to understand the tables, but we’ve been building new tools to make this easier to let us extract tables from all these different formats so we don’t need to worry about where the table originally came from.
The requirement from central government to provide this spending data has led to some consistency in the types of information provided; but there are still difficulties in matching up the different columns of data: Dept Family, Department family, and Departmental Family are all obviously the same thing to a human, but it’s more difficult to work out how to describe such things to a computer. Where one table has both “Gross” and “Net”, which should be matched up with another table’s “Amount”?
Worse still, it’s possible for the data in the tables to need to be matched, such as company names; where an entry exists for “BT” it needs to be matched to “British Telecommunications PLC”, rather than “BT Global Services”. Quite how to do this reliably, even with access to Companies House data is still not as easy as it should be. Hopefully projects such as OpenCorporates which also uses Scraperwiki will in the future make this an easier job.
To handle the problem of providing a uniform interface to tables in PDF, on web pages and in Excel files we made a library which we built into our recently launched product, Table Xtract.
