Scraping Spreadsheets with XYPath

Spreadsheets are great. They’re ubiquitously available, beaten only by the web pages and the word processor documents.

Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated databases, they don’t impose a strong sense of structure, they’re easy to print out and they can be imported into different pieces of software.

And so people pass their data around in Excel files and CSV files, or they replicate the look-and-feel of a spreadsheet in the browser with a table, often creating a large number of tables.

But the very traits that made spreadsheets so simple for the user creating the data, hamper us when we want to reuse the data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.

Source: http://data.worldbank.org/indicator/SP.POP.TOTL

So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python: it doesn’t really care whether the file it’s loading is an XLS, CSV, a HTML page or a ZIP containing CSVs, it gives us a uniform interface to all these table-containing filetypes.

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the word “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries. Beneath the years, and to the right of the countries are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

region_cell = pop_table.filter("Country Code").assert_one()
years = region_cell.fill(RIGHT)
countries = region_cell.fill(DOWN)
print list(years.junction(countries))

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to talk about any groups of cells, even those that aren’t a rectangular block.

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in either harnessing our technical expertise at understanding spreadsheets, or if you’ve any questions about the code!

Try Table Xtract or Call ScraperWiki Professional Services

Tags: data, data science, navigation, scraping, spreadsheet, XYPath

ScraperWiki

Extract tables from PDFs and scrape the web

Blog

Scraping Spreadsheets with XYPath

But the very traits that made spreadsheets so simple for the user creating the data, hamper us when we want to reuse the data.

So we wrote XYPath!