Scraping Spreadsheets with XYPath
https://blog.scraperwiki.com/2014/03/scraping-spreadsheets/
Wed, 12 Mar 2014

Spreadsheets are great. They’re ubiquitous, beaten in availability only by web pages and word processor documents.

Like a word processor, a spreadsheet is easy to use and gives the user a blank page, but it divides the page up into cells so that the columns and rows all line up. And unlike more complicated databases, spreadsheets don’t impose a strong sense of structure, are easy to print out, and can be imported into all sorts of other software.

And so people pass their data around in Excel and CSV files, or replicate the look and feel of a spreadsheet in the browser with HTML tables – often a great many of them.

But the very traits that make spreadsheets so simple for the user creating the data hamper us when we want to reuse that data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.

[Chart: total population by country and year. Source: http://data.worldbank.org/indicator/SP.POP.TOTL]

So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python: it doesn’t really care whether the file it’s loading is an XLS, a CSV, an HTML page or a ZIP containing CSVs; it gives us a uniform interface to all of these table-containing file types.
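Loading the World Bank sheet might look something like this – a minimal sketch, assuming Messytables’ any_tableset format-guesser and XYPath’s Table.from_messy constructor, with a hypothetical filename:

import messytables
import xypath

# Guess the file format and pull out the first table it contains.
# The filename here is hypothetical.
f = open("world_bank_population.xls", "rb")
table_set = messytables.any_tableset(f)
pop_table = xypath.Table.from_messy(table_set.tables[0])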

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the word “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries.  Beneath the years, and to the right of the countries are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

from xypath import DOWN, RIGHT  # direction constants used by fill()

region_cell = pop_table.filter("Country Code").assert_one()  # exactly one anchor cell
years = region_cell.fill(RIGHT)      # the year headers to its right
countries = region_cell.fill(DOWN)   # the country names beneath it
print(list(years.junction(countries)))
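Each item in that list is a triple of cells – a year header, a country header, and the value at their intersection – so (assuming each cell exposes its contents as .value) turning the triples into tidy rows is just a loop:

for year, country, value in years.junction(countries):
    print(year.value, country.value, value.value)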

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to talk about any group of cells, even one that isn’t a rectangular block.

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in harnessing our technical expertise at understanding spreadsheets, or if you have any questions about the code!

Try Table Xtract, or call ScraperWiki Professional Services.

Scraping guides: Excel spreadsheets
https://blog.scraperwiki.com/2011/09/scraping-guides-excel-spreadsheets/
Wed, 14 Sep 2011

Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to all the guides from the documentation page.

The Excel scraping guide is available in Ruby, Python and PHP. As with all our documentation, you can choose the language at the top right of the page.

As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?

The format of Excel files can vary a lot – how the columns are arranged, where the tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not laid out in cells – entire rows crammed into one cell, or data stored in notes.
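To give a flavour of the Python side, reading every row of a workbook’s first sheet with the xlrd library looks something like this – a minimal sketch, with a hypothetical filename:

import xlrd

# Open the workbook and walk every row of its first sheet.
# The filename is hypothetical.
book = xlrd.open_workbook("brownfield_sites.xls")
sheet = book.sheet_by_index(0)
for row_number in range(sheet.nrows):
    print(sheet.row_values(row_number))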

We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.

Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.

This merchant oil shipping scraper uses a few regular expressions to parse the text in one of the columns.
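We won’t reproduce that scraper’s exact patterns here, but the technique looks something like this – an invented example, where both the column text and the pattern are hypothetical:

import re

# Pull a tonnage figure out of free text found in a spreadsheet column.
text = "MV Example, 45,000 DWT, crude oil"
match = re.search(r"([\d,]+)\s*DWT", text)
if match:
    deadweight_tonnes = int(match.group(1).replace(",", ""))
    print(deadweight_tonnes)  # 45000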

Next time – parsing HTML with CSS selectors.

Scraping guides: Values, separated by commas
https://blog.scraperwiki.com/2011/08/scraping-guides-values-separated-by-commas/
Thu, 25 Aug 2011

When we revamped our documentation a while ago, we promised guides to specific scraper libraries, such as lxml, Nokogiri and so on.

We’re now starting to roll those out. The first one is simple, but a good one. Go to the documentation page and you’ll find a new section called “scraping guides”.

The CSV scraping guide is available in Ruby, Python and PHP. As with all our documentation, you can choose the language at the top right of the page.

“CSV” stands for “comma-separated values”. It’s a basic but quite common format for publishing spreadsheet files. Take a look at the scrapers tagged CSV on ScraperWiki – lots of ministerial meetings and government spending records.
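If you haven’t met the format before, a CSV file is just plain text: one row per line, with commas between the fields. An invented two-line example:

date,department,supplier,amount
2011-07-01,Cabinet Office,Acme Stationery Ltd,1200.00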

The CSV scraping guide shows you how to download and parse a CSV file, and how to easily save it into the datastore.
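In Python, the heart of it looks something like this – a minimal sketch using the scraperwiki library’s datastore call, where the URL and the unique-key column name are both hypothetical:

import csv
import io
import urllib.request

import scraperwiki

# Download the CSV, parse it, and save each row to the datastore.
# The URL and the "Meeting ID" column are hypothetical.
url = "http://example.com/ministerial-meetings.csv"
raw = urllib.request.urlopen(url).read().decode("utf-8")
for row in csv.DictReader(io.StringIO(raw)):
    scraperwiki.sqlite.save(unique_keys=["Meeting ID"], data=dict(row))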

Why write a scraper just to load in a CSV file? First, there are quirks you’ll inevitably find – inconsistencies in fields, extra rows and columns, dates that need reformatting, and so on. Secondly, you might want to load in multiple files and merge them together. Finally, you get the data into ScraperWiki, giving you a JSON API and letting you make views.

You can do quite funky things even with just CSV scraping. For example, see Nicola’s @scraper_no10 Twitter bot.

Next time, Excel files.
