Table Scraping Is Hard

David McKee — Tue, 05 Nov 2013 15:25:43 +0000

The Problem

NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency; A well known B2B publisher came to us to aggregate that data and provide them with information spanning across the hundreds of different trusts, such as: who are the biggest contractors across the NHS?

It’s a common problem – there’s lots of data out there which isn’t in a nice, neat, obviously usable format, for various reasons. What we’d like to have is all that data in a single database so we can slice it by time and place, and look for patterns. There’s no magic bullet for this, yet; we’re having to solve issues on each step of the way.

Where’s the data?

A sizable portion of the data is stored in data.gov.uk, but significant chunks were stored elsewhere on individual NHS trust websites. Short of spidering every NHS trust website, we wouldn’t be able to find the spreadsheets, so we automated finding the spreadsheets through Google.

Google make it difficult to scrape them – ironic, given that that’s what they do to everyone else – building a search engine means scraping websites and indexing the content. We also found that the data in the tables was held in a variety of formats and whilst most of this spending data was in spreadsheets, we also found tables in web pages and PDFs. Usually these would each need separate software to understand the tables, but we’ve been building new tools to make this easier to let us extract tables from all these different formats so we don’t need to worry about where the table originally came from.

The requirement from central government to provide this spending data has led to some consistency in the types of information provided; but there are still difficulties in matching up the different columns of data: Dept Family, Department family, and Departmental Family are all obviously the same thing to a human, but it’s more difficult to work out how to describe such things to a computer. Where one table has both “Gross” and “Net”, which should be matched up with another table’s “Amount”?

Worse still, it’s possible for the data in the tables to need to be matched, such as company names; where an entry exists for “BT” it needs to be matched to “British Telecommunications PLC”, rather than “BT Global Services”. Quite how to do this reliably, even with access to Companies House data is still not as easy as it should be. Hopefully projects such as OpenCorporates which also uses Scraperwiki will in the future make this an easier job.

To handle the problem of providing a uniform interface to tables in PDF, on web pages and in Excel files we made a library which we built into our recently launched product, Table Xtract.

Try Table Xtract for free!

Quickly get an HTML table

Francis Irving — Wed, 06 Jul 2011 10:37:17 +0000

As I said before the Julian sometimes adds simple things to ScraperWiki, and nobody even notices them.

He obviously learnt, as he ticketed in BitBucket that we needed to release this one.

The External API can output data in several formats. jsondict is the standard JSON format using a dictionary for each row, jsonlist an alternative JSON format which has a simple array for each row. csv is for loading into spreadsheets, or if you want just a simple row based format.

The new one is called htmltable. When you use it in the API explorer you get a convenient easy to read table. You can also open the API URl directly in your browser.

Here are a couple of examples from Nike’s innovative GreenXChange database (scraped here), which is a place for large companies to share patents that improve sustainability.

This API result page shows how many entries there are in the database for each contributing company. And this is all the innovations about gloves. Notice that both pages are just URLs from the API, and contain editable SQL queries.

tables – ScraperWiki

Table Scraping Is Hard

The Problem

Where’s the data?

Quickly get an HTML table