As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we’re interested, other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is with the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has it’s limitations. There are the technical limitations of something like a million rows maximum size but well before this size Excel becomes a pain to use.
There is another path – the programming route. As a physical scientist of moderate age I’ve followed these two data analysis paths in parallel. Excel for the quick look see and some presentation. Programming for bigger tasks, tasks I want to do repeatedly and types of data Excel simply can’t handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens, plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.
Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers, other sciences and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient such as gender and their symptoms. Databases, and analysis packages such as R and SAS are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database shaped data. It provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.
Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk and partly a guide to McKinney’s own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes which can be numeric but are typically text. The pandas library provides a great deal of functionality to process Dataframes, in particular enabling filtering and grouping calculations which are reminiscent of the SQL database workflow. The indexes can be hierarchical. As well as the 2-dimensional Dataframe, pandas also provides 1-dimensional Series and a 3-dimensional Panel data structures.
I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.
Outside of pandas the key technologies McKinney introduces are the ipython interactive console and the NumPy library. I mentioned the ipython notebook in my previous book review. ipython gives Python the interactive analysis capabilities of systems like Matlab. The NumPy library is a high performance library providing simple multi-dimensional arrays, comforting those who grew up with a FORTRAN background.
Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it’s cost, the pricing model for Matlab has a moderately expensive core (i.e. $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated tool boxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high quality toolboxes, at it’s core it’s a respectable programming language with all the features and tooling that brings. If my code doesn’t run it’s because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network license for a key toolbox. R still offers statistical analysis with greater gravitas and some really nice, publication quality plotting but it does not have the air of a general purpose programming language.
The parts of Python for Data Analysis which I found most interesting, and engaging, were the examples of pandas code in “live” usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on showing code snippets which is distracting from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an Ipython Notebook!