David McKee – ScraperWiki
https://blog.scraperwiki.com – Extract tables from PDFs and scrape the web

Open Data Camp 2
https://blog.scraperwiki.com/2015/10/open-data-camp-2/ – Tue, 13 Oct 2015

I’m back from Open Data Camp 2, and I’m finding it difficult to make a coherent whole of it all.

Perhaps that’s down to the unstructured nature of an un-conference. Maybe the different stakeholders across the open data community’s various hierarchies have a common aim but different levers to pull: the minister with the will to make changes; the digital civil servants with great expectations but not great budgets; the hacker who tries to consume open data in their spare time, and creates new standards and systems for data in their day job.

There seemed to be a few themes which echoed through the conference:


There’s a recognition that improving people’s skills, and recognising the skills people already have, is critical – whether it’s people crafting Linked Data in Microsoft Word and wondering why it doesn’t work, or getting local authorities to search internally for their invisible data ninjas.

Sometimes those difficulties occur due to differences in the assumed culture for different types of data – it seems everyone working in GIS would know what was meant by an .asc file and how to process it, but this information isn’t obvious to someone fresh to the data. Is there a need for improved documentation, linked to from datasets? Or the ability to ask, in comments, how other people interested in the same datasets interpret and process them?


How do you know if your data is useful to people? Blogs have a useful feature called pingback: the referencing blog sends a message to the linked blog to let them know they’ve been linked to. There was quite a bit of discussion as to whether similar functionality would be useful for data, particularly for letting consumers know when breaking changes to the data might occur.

Also, when data sits around not being used, people don’t notice problems with it. When things break noisily and publicly — like taking down a cafeteria’s menu system — it’s a bit embarrassing, but it does get the problem fixed quickly!

Core Reference Data

One of the highlights of the weekend was a talk on the Address Wars: the financial value of addresses, the fight to monetise them and their locations, and the problems caused for the 2001 census by not being able to afford a product from the Royal Mail and the Ordnance Survey, both of which were wholly government-owned at the time.

It highlighted how much core reference data – lists of names and IDs of things – is critical as the glue which allows different data to be joined and understood. Apparently there are 20 different definitions of ‘Scotland’ and 13 different ways of encoding gender (almost all of which are male or female). There’s no definitive list of hospitals, and seven people claim to be in charge of the canonical list of business names and addresses. Hence there’s a big push from GDS at the moment to create single canonical registers.

But there are other items that need standardised encodings. The DCLG have been working on standardised reasons why bins don’t get emptied – one of the most common interactions people have with their council. There’s a lot more work to be done across the myriad things government does, and it’s not quite clear where it should happen: councils are looking for leadership from central government, while central government wants councils to work together on it, possibly with the Local Government Association. This only gets more complicated when dealing with devolved matters or finding appropriate international standards to use.

Meeting people

I’m also really happy to have met Chris Gutteridge who was showing off some of the things he’s been working on.

Equipment.data.ac.uk brings together equipment held by various UK universities in a federated, discoverable fashion, using well-known URLs that point to well-formatted data on each individual website. Each organisation stays in control of its data and remains the authoritative source for it, and the whole approach builds on having a single place to start discovering linked data about an organisation. It’s the first time I’ve actually seen linked data in the wild joining across the web the way Tim Berners-Lee intended!

On a more frivolous level, using the OpenDefra LIDAR data to turn Ventnor sea-front into a Minecraft level is inspired, and the hand-crafted version looks stunning as well!

Spreadsheets are code: EuSpRIG conference
https://blog.scraperwiki.com/2015/07/eusprig/ – Thu, 16 Jul 2015

I’m back from presenting a talk on DataBaker at the EuSpRIG conference. It’s amazing to see a completely different world of how people use Excel – I’ve been busy tearing the data out of spreadsheets for the Office for National Statistics, and using macros to open PDF files in Excel directly using PDFTables. Having been thinking of spreadsheets purely as sources of raw data, it’s easy to forget how everyone else uses them. The conference reminded me of one simple fact about spreadsheets that often gets ignored:

Spreadsheets are code.

And spreadsheets are a way of writing code which hasn’t substantially changed since the days of VisiCalc in 1978 (the same year the book that defined the C programming language was published).

Programming languages have changed enormously in this time, promoting higher-level concepts like object orientation, whilst the core of the spreadsheet has remained the same. Certainly, there’s a surprising number of new features in Excel, but few of these help with the core tasks of programming within the spreadsheet.

Structure and style are important: it’s easy to write code which is a nightmare to read. Paul Mireault spoke of his methodology for reducing the complexity of spreadsheets by adhering to a strict set of rules involving copious use of array formulae and named ranges. It also involves working out your model before you start work in Excel, which culminates in a table of parameters, intermediate equations, and outputs.

And at this point I’m silently screaming: STOP! You’re done! You’ve got code!

Sure, there’s the small task of identifying which of these formulae are global and which are regional, and adding appropriate markup; but at this stage the hard work is done: converting that into your language of choice (including Excel) should be straightforward. Excel makes this process overly complicated, but at least Paul’s approach gives clear instructions on how best to handle the conversion (although his use of named ranges is as contentious as your choice of football team or, for programmers, editor).

Tom Grossman’s talk on reusable spreadsheet code was a cry for help: is there a way of being able to reuse components in a reliable way? But Excel hampers us at every turn.

We can copy and paste cells, but there is so much magic involved. We’re implicitly writing formulae of the form “the cell three to the left” – but we never explicitly say that: instead we read a reference to G3 in cell J3. And we can’t easily replace these implicit references when copy-pasting formula snippets; we need to paste into exactly the right cell of the spreadsheet.
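That implicit rewriting can be made explicit in a few lines of Python. This is a sketch of the semantics only, not of Excel’s internals:

```python
# A sketch of what copy-paste does to a relative reference: "G3 seen
# from J3" really means "three columns to the left", and pasting
# re-anchors that offset at the new cell.

import re

def a1_to_rowcol(ref):
    """Convert an A1-style reference like 'G3' to (row, col), 1-based."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", ref)
    letters, digits = m.group(1), m.group(2)
    col = 0
    for ch in letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), col

def rowcol_to_a1(row, col):
    """Convert (row, col) back to an A1-style reference."""
    letters = ""
    while col:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row}"

def repaste(ref, source, target):
    """Re-anchor a relative reference when a formula is copied from
    `source` to `target`, as Excel does implicitly."""
    r, c = a1_to_rowcol(ref)
    sr, sc = a1_to_rowcol(source)
    tr, tc = a1_to_rowcol(target)
    return rowcol_to_a1(r + (tr - sr), c + (tc - sc))

# "G3" in J3 means "three to the left"; copied down to J10 it becomes G10.
print(repaste("G3", "J3", "J10"))  # G10
```

A textual language would let us write the offset itself; Excel only ever shows us the rewritten result.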

In most programming languages, we know exactly what we’ll get when we copy and paste within our source code: a character-by-character identical copy. But copy-and-paste programming is considered a bad ‘smell’: we should be writing reusable functions. Yet without stepping into the realm of macros, each invocation of what would be a function needs its own separate set of cells. There are ways of making this work with custom macro functions or plugins – but many people can’t use spreadsheets containing macros, or won’t have those plugins installed. It’s a feature missing from the very core of Excel, and it makes working there far more difficult and long-winded.

Not having these abstractions leads to errors. Ray Panko spoke of the errors we never see: how base error rates of a few percent are endemic across all fields of human endeavour. These error rates apply per instruction, at the time the code is first written. We can hope to reduce them through testing, peer review and pairing. Excel hinders testing and promotes copy-paste repetition, increasing the number of operations and the potential for errors. Improving code reuse would also help enormously: the easiest code to test is the code that isn’t there.

A big chunk of the problem is that people think about Excel the same wrong way they think about Word. In Word that attitude isn’t a major problem, so long as you don’t need to edit the document: if it looks correct, that might be good enough, even if the slightest change breaks the formatting. That’s simply not true of spreadsheets, where a number can look right but be entirely wrong.

Maria Csernoch’s presentation of Sprego – Spreadsheet Lego – described an approach for teaching programming through spreadsheets which is designed to get people thinking about solving the problems they face methodically, from the inside out, rather than repeatedly trying a ‘trial-and-error, wizard-based’ approach with minimal understanding.

It’s interesting to note the widespread use of array formulae across a number of the talks – if you’re making spreadsheets and you don’t know about them, it might be worth spending a while learning about them!

In short, Excel is broken. And I strongly suspect it can’t be fixed. Yet it’s ubiquitous and business critical. We need to reinvent the wheel and change all four whilst the car is driving down the motorway — and I don’t know how to do that…

DataBaker – making spreadsheets machine-readable
https://blog.scraperwiki.com/2015/03/databaker-making-spreadsheets-usable/ – Thu, 26 Mar 2015

Spreadsheets are often the format of choice for publishing data. They look great, are understandable by people who don’t use databases, and with judicious use of formatting you can represent complicated datasets in a way people can understand.

The downside is that machines can’t understand them. Sure, you can export the file as CSV, but that doesn’t give you the nicely structured file with a single header row that other software needs in order to consume it.

This is a problem that we’ve encountered a few times. We started working on this problem for the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), creating a library called XYPath to parse all sorts of spreadsheets using Python, as I’ve previously written about.

The Office for National Statistics currently publishes some of their data as spreadsheets, and they want that data in their new Data Explorer to make it far easier for people to analyse it. This is how the spreadsheets start out:

(Not shown: splits by age range, seasonal adjustments, most of the rest of the data, or the whole pile of similar – but not quite similar enough – spreadsheets)

We’ve written databaker, which simplifies the entire process of converting spreadsheets like the one shown above to nicely structured CSV files. The recipe for this one is as follows:

  • Use XYPath expressions to select the numbers you want, then select every set of headers (e.g. Male and Female; all the different dates).
  • Add two words to describe where the headers are relative to the values: the dates are DIRECTLY LEFT of the values, and we want the CLOSEST gender label which is ABOVE the value.
  • Finally, tell it which tabs to run on.
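The two positional rules in that recipe can be illustrated in plain Python. This sketches the semantics only, not databaker’s actual API, and the toy sheet is invented:

```python
# Plain-Python illustration of the two header-lookup rules:
# DIRECTLY LEFT takes the header in the same row as the value;
# CLOSEST ABOVE takes the nearest header at or above the value.

# A toy sheet keyed by (row, col): gender labels on row 1,
# dates in column 0, observation values in columns 1 and 2.
sheet = {
    (1, 1): "Male", (1, 2): "Female",
    (2, 0): "2014 Q1", (2, 1): 103, (2, 2): 97,
    (3, 0): "2014 Q2", (3, 1): 105, (3, 2): 99,
}

def directly_left(row, header_col=0):
    """DIRECTLY LEFT: the header cell in the same row (here, column 0)."""
    return sheet[(row, header_col)]

def closest_above(row, col, header_cells):
    """CLOSEST ABOVE: the nearest header in the same column, at or above."""
    rows_above = [r for (r, c) in header_cells if c == col and r <= row]
    return sheet[(max(rows_above), col)]

headers = [(1, 1), (1, 2)]  # positions of the gender labels

# Emit one (date, gender, value) row per observation cell.
for (r, c), value in sorted(sheet.items()):
    if isinstance(value, int):
        print(directly_left(r), closest_above(r, c, headers), value)
```

Each observation cell picks up its own date and gender, which is exactly the flat structure a CSV needs.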

These ‘recipes’ are succinct for the simple cases, and the complicated cases are made possible, either through small snippets of Python or combinations of XYPath expressions.

It’s all Python, openly licensed, and available on GitHub.

And it’s always nice to have happy customers!

Scraping Spreadsheets with XYPath
https://blog.scraperwiki.com/2014/03/scraping-spreadsheets/ – Wed, 12 Mar 2014

Spreadsheets are great. They’re ubiquitously available, beaten only by web pages and word processor documents.

Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated databases, they don’t impose a strong sense of structure, they’re easy to print out and they can be imported into different pieces of software.

And so people pass their data around in Excel files and CSV files, or they replicate the look-and-feel of a spreadsheet in the browser with a table, often creating a large number of tables.

But the very traits that make spreadsheets so simple for the user creating the data hamper us when we want to reuse that data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.
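In code, the difference between absolute and label-based addressing looks something like this (a minimal sketch over invented data, not the actual World Bank file):

```python
# Look up a value by the labels of its row and column, rather than by a
# hard-coded cell address like F12. The figures below are invented.

rows = [
    ["Country Name", "2013", "2014"],
    ["France", 66_000_000, 66_300_000],
    ["United Kingdom", 64_100_000, 64_600_000],
]

def cell_at(rows, row_label, col_label):
    """Value at the intersection of a labelled row and column."""
    col_idx = rows[0].index(col_label)  # find the column by its header
    for row in rows[1:]:
        if row[0] == row_label:         # find the row by its first cell
            return row[col_idx]
    raise KeyError((row_label, col_label))

print(cell_at(rows, "United Kingdom", "2014"))  # 64600000
```

If a blank row is inserted or the columns are reordered, this lookup still works; the absolute address breaks.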

Source: http://data.worldbank.org/indicator/SP.POP.TOTL

So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python: it doesn’t really care whether the file it’s loading is an XLS, a CSV, an HTML page or a ZIP containing CSVs; it gives us a uniform interface to all these table-containing filetypes.

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the word “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries.  Beneath the years, and to the right of the countries are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

region_cell = pop_table.filter("Country Code").assert_one()
years = region_cell.fill(RIGHT)       # every cell to its right: the year headers
countries = region_cell.fill(DOWN)    # every cell below it: the country names
print(list(years.junction(countries)))

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to describe any group of cells, even one that isn’t a rectangular block.

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in either harnessing our technical expertise at understanding spreadsheets, or if you’ve any questions about the code!

Try Table Xtract or Call ScraperWiki Professional Services

Table Scraping Is Hard
https://blog.scraperwiki.com/2013/11/table-scraping-is-hard/ – Tue, 05 Nov 2013

The Problem

NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency. A well-known B2B publisher came to us to aggregate that data and provide information spanning the hundreds of different trusts, answering questions such as: who are the biggest contractors across the NHS?

It’s a common problem – there’s lots of data out there which isn’t in a nice, neat, obviously usable format, for various reasons. What we’d like to have is all that data in a single database so we can slice it by time and place, and look for patterns. There’s no magic bullet for this, yet; we’re having to solve issues on each step of the way.

Where’s the data?

A sizable portion of the data is stored in data.gov.uk, but significant chunks were stored elsewhere on individual NHS trust websites. Short of spidering every NHS trust website, we wouldn’t be able to find the spreadsheets, so we automated finding the spreadsheets through Google.

Google make it difficult to scrape their results – ironic, given that scraping websites and indexing their content is exactly how you build a search engine. We also found that the data was held in a variety of formats: whilst most of this spending data was in spreadsheets, we also found tables in web pages and PDFs. Each of these would usually need separate software to understand the tables, so we’ve been building new tools that extract tables from all these different formats, so we don’t need to worry about where a table originally came from.
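The idea of a uniform interface can be sketched with the standard library alone for two of those formats. Real tools (like our Table Xtract) handle far more, and these parser details are purely illustrative:

```python
# A sketch of a uniform table-loading interface: whichever format a
# table arrives in, callers see the same list-of-rows structure.
# Only CSV and HTML are shown here; the data is invented.

import csv
import io
from html.parser import HTMLParser

class _TableParser(HTMLParser):
    """Collect <tr>/<td>/<th> contents into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = ""
    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def load_table(text, fmt):
    """Return a list of rows regardless of the source format."""
    if fmt == "csv":
        return [row for row in csv.reader(io.StringIO(text))]
    if fmt == "html":
        parser = _TableParser()
        parser.feed(text)
        return parser.rows
    raise ValueError(f"unsupported format: {fmt}")

csv_rows = load_table("Supplier,Amount\nBT,1200", "csv")
html_rows = load_table(
    "<table><tr><th>Supplier</th><th>Amount</th></tr>"
    "<tr><td>BT</td><td>1200</td></tr></table>", "html")
print(csv_rows == html_rows)  # True
```

Everything downstream (matching columns, matching companies, loading a database) then only has to deal with one shape of data.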

The requirement from central government to provide this spending data has led to some consistency in the types of information provided, but there are still difficulties in matching up the different columns of data: Dept Family, Department family and Departmental Family are all obviously the same thing to a human, but it’s much harder to describe that to a computer. And where one table has both “Gross” and “Net”, which should be matched up with another table’s “Amount”?
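One possible approach (an illustration, not necessarily the method we used) is to normalise each raw header and then fuzzy-match it against a canonical list; the canonical names and the abbreviation rule below are assumptions:

```python
# Map raw header variants onto canonical column names: normalise the
# text, then fuzzy-match with the standard library's difflib.
# The canonical list and the "dept" expansion are illustrative.

import difflib
import re

CANONICAL = ["department family", "entity", "date", "expense type",
             "supplier", "amount"]

def normalise(header):
    """Lowercase, collapse whitespace, expand one known abbreviation."""
    h = re.sub(r"\s+", " ", header.strip().lower())
    return re.sub(r"\bdept\b", "department", h)

def match_header(header, cutoff=0.6):
    """Best canonical name for a raw header, or None if nothing is close."""
    matches = difflib.get_close_matches(normalise(header), CANONICAL,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["Dept Family", "Department family", "Departmental Family"]:
    print(raw, "->", match_header(raw))  # all -> department family
```

This handles spelling variants; it does nothing for the Gross/Net vs Amount problem, which is a question of meaning rather than spelling.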

Worse still, the data inside the tables may need matching too, such as company names: an entry for “BT” needs to be matched to “British Telecommunications PLC”, rather than “BT Global Services”. Doing this reliably, even with access to Companies House data, is still not as easy as it should be. Hopefully projects such as OpenCorporates, which also uses ScraperWiki, will make this an easier job in future.
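One conservative sketch of entity matching uses a hand-maintained alias table and deliberately avoids fuzzy guessing, since with company names a near match is often a different company. The aliases here are illustrative only:

```python
# Exact alias lookup after normalisation. "BT Global Services" must NOT
# collapse into "BT", so there is no fuzzy fallback: unknown names pass
# through unchanged for a human to review. Aliases are invented examples.

ALIASES = {
    "bt": "British Telecommunications PLC",
    "british telecom": "British Telecommunications PLC",
    "bt global services": "BT Global Services",
}

def canonical_name(raw):
    """Canonical company name for a raw entry, or the raw name if unknown."""
    key = " ".join(raw.lower().split())  # normalise case and whitespace
    return ALIASES.get(key, raw)

print(canonical_name("BT"))                  # British Telecommunications PLC
print(canonical_name("BT Global Services"))  # BT Global Services
```

The cost of this approach is maintaining the alias table by hand, which is exactly where a register like Companies House or OpenCorporates would help.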

To handle the problem of providing a uniform interface to tables in PDFs, on web pages and in Excel files, we made a library which we built into our recently launched product, Table Xtract.

Try Table Xtract for free!