Scraping guides: Excel spreadsheets
Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to them from the documentation page.
The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page.
As with CSV files, at first it seems odd to be scraping Excel spreadsheets, when they’re already at least semi-structured data. Why would you do it?
The format of Excel files can varies a lot – how columns are arranged, where tables appear, what worksheets there are. There can be errors and inconsistencies that are easiest to fix in code. Sometimes you’ll find the data is there but not formatted in cells – entire rows in one cell, or data stored in notes.
We used an Excel scraper that pulls together 9 spreadsheets into one dataset for the brownfield sites map used by Channel 4 News.
Dave Hughes has one that converts a spreadsheet from an FOI request, making a nice dataset of temperatures in Cambridge’s botanical garden.
This merchant oil shipping scraper does a few regular expressions to parse some text in one of the columns.
Next time – parsing HTML with CSS selectors.