spreadsheets – ScraperWiki https://blog.scraperwiki.com Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000 en-US hourly 1 https://wordpress.org/?v=4.6 58264007 Spreadsheets are code: EuSpRIG conference. https://blog.scraperwiki.com/2015/07/eusprig/ https://blog.scraperwiki.com/2015/07/eusprig/#comments Thu, 16 Jul 2015 09:35:28 +0000 https://blog.scraperwiki.com/?p=758223366 EuSpRIG logo

I’m back from presenting a talk on DataBaker at the EuSpRIG conference. It’s amazing to see a completely different world of how people use Excel – I’ve been busy tearing the data out of spreadsheets for the Office of National Statistics and using macros to open PDF files in Excel directly using PDFTables. So whilst I’ve been thinking of spreadsheets as sources of raw data, it’s easy to forget how everyone else uses spreadsheets. The conference reminded me particularly of one simple fact about spreadsheets that often gets ignored:

Spreadsheets are code.

And spreadsheets are a way of writing code which hasn’t substantially changed since the days of Visicalc in 1978 (the same year as the book which defined the C programming language came out).

Programming languages have changed enormously in this time, promoting higher-level concepts like object orientation, whilst the core of the spreadsheet has remained the same. Certainly, there’s a surprising number of new features in Excel, but few of these help with the core tasks of programming within the spreadsheet.

Structure and style are important: it’s easy to write code which is a nightmare to read. Paul Mireault spoke of his methodology for reducing the complexity of spreadsheets by adhering to a strict set of rules involving copious use of array formulae and named ranges. It also involves working out your model before you start work in Excel, which culminates in a table of parameters, intermediate equations, and outputs.

And at this point I’m silently screaming: STOP! You’re done! You’ve got code!

Sure, there’s the small task of identifying which of these formulae are global, and which are regional and adding appropriate markup; but at this stage the hard work is done; converting that into your language of choice (including Excel) should be straightforward. Excel makes this process overly complicated, but at least Paul’s approach gives clear instructions on how best to handle this conversion (although his use of named ranges is as contentious as your choice of football team or, for programmers, editor.)

Tom Grossman’s talk on reusable spreadsheet code was a cry for help: is there a way of being able to reuse components in a reliable way? But Excel hampers us at every turn.

We can copy and paste cells, but there is so much magic involved. We’re implicitly writing formulae of the form “the cell three to the left” — but we never explicitly say that: instead we read a reference to G3 in cell J3. And we can’t easily replace these implicit references if we’re copy-pasting formulae snippets; we need to be pasting into exactly the right cell in the spreadsheet.

In most programming languages, we know exactly what we’ll get when we copy-and-paste within our source code: a character-by-character identical copy. But copy-and-paste programming is considered a bad ‘smell’: we should be writing reusable functions: but without stepping into the realm of macros each individual invocation of what would be a function needs to be a separate set of cells. There are possibilities of making this work with custom macro functions or plugins – but so many people can’t use spreadsheets containing macros or won’t have installed those plugins. It’s a feature missing from the very core of Excel which makes it so much more difficult and longwinded to work in it.

Not having these abstractions leads to errors. Ray Panko spoke of the errors we never see; how base error rates of a few percent are endemic across all fields of human endeavour. These error rates are at the time of writing the code the first time, and per instruction. We can hope to reduce these error rates through testing, peer review and pairing. Excel hinders testing and promotes copy-paste repetition, increasing the number of operations and the potential for errors. Improving code reuse would also help enormously: the easiest code to test is the code that isn’t there.

A big chunk of the problem is that people think about Excel the same wrong way they think about Word. In Word, it’s not a major problem, so long as you don’t need to edit the document: so long as it looks correct, that might be good enough, even if the slightest change breaks the formatting. That’s simply not true of spreadsheets where a number can look right but be entirely wrong.

Maria Csernoch’s presentation of Sprego – Spreadsheet Lego – described an approach for teaching programming through spreadsheets which is designed to get people thinking about solving the problems they face methodically, from the inside out, rather than repeatedly trying ‘Trial-and-Error Wizard-based’ approach with minimal understanding.

It’s interesting to note the widespread use of array formulae across a number of the talks – if you’re making spreadsheets and you don’t know about them, it might be worth spending a while learning about them!

In short, Excel is broken. And I strongly suspect it can’t be fixed. Yet it’s ubiquitous and business critical. We need to reinvent the wheel and change all four whilst the car is driving down the motorway — and I don’t know how to do that…

]]>
https://blog.scraperwiki.com/2015/07/eusprig/feed/ 3 758223366