Book review: Python for Data Analysis by Wes McKinney
https://blog.scraperwiki.com/2014/01/book-review-python-data-analysis-wes-mckinney/ – Thu, 30 Jan 2014

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we’re interested; other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is with the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has its limitations. There are technical limits, such as the maximum of roughly a million rows, but Excel becomes a pain to use well before you reach that size.

There is another path – the programming route. As a physical scientist of moderate age I’ve followed these two data analysis paths in parallel: Excel for the quick look-see and some presentation; programming for bigger tasks, tasks I want to do repeatedly, and types of data Excel simply can’t handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming, born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages, but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens, plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.

Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers; other sciences and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient, such as gender and symptoms. Databases, and analysis packages such as R and SAS, are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database-shaped data and provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.

Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk, and partly a guide to McKinney’s own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes, which can be numeric but are typically text. The pandas library provides a great deal of functionality to process dataframes, in particular enabling filtering and grouping calculations which are reminiscent of the SQL database workflow. The indexes can be hierarchical. As well as the 2-dimensional dataframe, pandas also provides a 1-dimensional Series and a 3-dimensional Panel data structure.
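To give a flavour of what that looks like in practice, here’s a minimal sketch of the filtering and grouping style pandas encourages; the table and column names are invented for illustration rather than taken from the book.

[sourcecode language="python"]
import pandas as pd

# An invented drug-trial style table mixing numeric and categorical columns.
df = pd.DataFrame({
    "patient": ["p1", "p2", "p3", "p4"],
    "gender": ["F", "M", "F", "M"],
    "dose": [10.0, 20.0, 10.0, 20.0],
    "response": [0.8, 0.5, 0.9, 0.4],
})

# Filtering, in the spirit of a SQL WHERE clause.
high_dose = df[df["dose"] > 10.0]

# Grouping and aggregating, in the spirit of SQL GROUP BY.
mean_response = df.groupby("gender")["response"].mean()

print(high_dose)
print(mean_response)
[/sourcecode]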

I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.
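That part of my workflow looks roughly like the sketch below; the file, table and column names are purely illustrative, not from a real project.

[sourcecode language="python"]
import sqlite3

import pandas as pd

# Read a (hypothetical) CSV export and reshape it from wide to long form,
# which suits both SQL queries and Tableau.
df = pd.read_csv("survey_results.csv")
long_form = pd.melt(df, id_vars=["respondent_id"],
                    var_name="question", value_name="answer")

# Push the reshaped table into a SQLite database for Tableau to pick up.
with sqlite3.connect("analysis.sqlite") as conn:
    long_form.to_sql("responses", conn, if_exists="replace", index=False)
[/sourcecode]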

Outside of pandas the key technologies McKinney introduces are the IPython interactive console and the NumPy library. I mentioned the IPython Notebook in my previous book review. IPython gives Python the interactive analysis capabilities of systems like Matlab. NumPy is a high-performance library providing simple multi-dimensional arrays, comforting for those who grew up with a FORTRAN background.
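For those who haven’t met NumPy, here is a tiny sketch of its flavour: whole-array operations stand in for the explicit loops a FORTRAN programmer would once have written.

[sourcecode language="python"]
import numpy as np

# A 2-dimensional array of floats.
a = np.arange(12, dtype=float).reshape(3, 4)

row_means = a.mean(axis=1)               # one mean per row, no loop needed
standardised = (a - a.mean()) / a.std()  # element-wise arithmetic on the whole array

print(row_means)
print(standardised.shape)
[/sourcecode]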

Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it’s cost: Matlab’s pricing model has a moderately expensive core (around $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated toolboxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high-quality toolboxes, and at its core it’s a respectable programming language with all the features and tooling that brings. If my code doesn’t run it’s because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network license for a key toolbox. R still offers statistical analysis with greater gravitas and some really nice, publication-quality plotting, but it does not have the air of a general-purpose programming language.

The parts of Python for Data Analysis which I found most interesting, and engaging, were the examples of pandas code in “live” usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on code snippets, which distracts from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an IPython Notebook!

A sea of data
https://blog.scraperwiki.com/2013/04/a-sea-of-data/ – Tue, 30 Apr 2013

My friend Simon Holgate of Sea Level Research has recently “cursed” me by introducing me to tides and sea-level data. Now I’m hooked. Why are tides interesting? When you’re trying to navigate a super-tanker into San Francisco Bay and you only have a few centimetres of clearance, whether the tide is in or out could be quite important!

The French port of Brest has the longest historical tidal record. The Joint Archive for Sea Level has hourly readings from 1846. Those of you wanting to follow along at home should get the code:

[sourcecode language="text"]
git clone git://github.com/drj11/sea-level-tool.git
cd sea-level-tool
virtualenv .
. bin/activate
pip install -r requirements.txt
[/sourcecode]

After that lot (phew!), you can get the data for Brest by going:

[sourcecode language="text"]
code/etl 822a
[/sourcecode]

The sea level tool is written in Python and uses our scraperwiki library to store the sea level data in a sqlite database.
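The heart of that storage step looks something like the sketch below. The obs table and the jaslid and t columns appear in the query further on; the z column name and the exact call are my reading of the standard scraperwiki.sqlite.save interface, not a quote from the tool’s source.

[sourcecode language="python"]
import scraperwiki

# Store one hourly observation: station id, timestamp and height.
# Column names are illustrative; see the sea-level-tool source for the real schema.
scraperwiki.sqlite.save(
    unique_keys=["jaslid", "t"],
    data={"jaslid": "h822a", "t": "1999-01-01T00:00:00", "z": 3520},
    table_name="obs",
)
[/sourcecode]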

Tide data can be surprisingly complex (the 486 pages of [PUGH1987] are testimony to that), but in essence we have a time series of heights, z. Often even really simple analyses can tell us interesting facts about the data.

As Ian tells us, R is good for visualisations. And it turns out it has an installable RSQLite package that can load R dataframes from a sqlite file. And I feel like a grown-up data scientist when I use R. The relevant snippet of R is:

[sourcecode language="r"]
library(RSQLite)
db <- dbConnect(dbDriver('SQLite'), dbname='scraperwiki.sqlite', loadable.extensions=TRUE)
bre <- dbGetQuery(db, 'SELECT * FROM obs WHERE jaslid == "h822a" ORDER BY t')
[/sourcecode]

I’m sure you’re all aware that the sea level goes up and down to make tides and some tides are bigger than others. Here’s a typical month at Brest (1999-01):

[Figure: hourly sea levels at Brest, January 1999]

There are well over 1500 months of data for Brest. Can we summarise the data? A histogram works well:

[Figure: histogram of hourly sea level observations at Brest]

Remember that this is a histogram of hourly sea level observations, so the two humps show the sea level heights that appear most frequently in the hourly series. The observations cluster around two heights: the mean low tide and the mean high tide. The range, the distance between mean low tide and mean high tide, is about 2.5 metres (big tides, big data!).
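You can pull roughly the same numbers out without a plot. Here’s a rough Python sketch; it assumes the obs table from the R query above, with the heights in a column called z, which is my guess at the schema.

[sourcecode language="python"]
import sqlite3

import numpy as np
import pandas as pd

# Load the Brest observations, as in the R snippet above.
with sqlite3.connect("scraperwiki.sqlite") as conn:
    bre = pd.read_sql_query(
        "SELECT * FROM obs WHERE jaslid = 'h822a' ORDER BY t", conn)

# Histogram the hourly heights and take the most populated bin in each half
# as a crude estimate of the two modes (mean low tide and mean high tide).
counts, edges = np.histogram(bre["z"].dropna(), bins=50)
mid = len(counts) // 2
low_mode = edges[counts[:mid].argmax()]
high_mode = edges[mid + counts[mid:].argmax()]
print("approximate tidal range:", high_mode - low_mode)
[/sourcecode]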

This is a comparatively large range, certainly compared to a site like St Helena (where the British imprisoned Napoleon after his defeat at Waterloo). Let’s plot St Helena’s tides on the same histogram as Brest, for comparison:

[Figure: histogram of hourly sea level observations at St Helena, overlaid on Brest]

Again we have a mean low tide and a mean high tide, but this time the range is about 0.4 metres, and the entire span of observed heights including extremes fits into 1.5 metres. St Helena is a rock in the middle of a large ocean, and this small range is typical of oceanic tides. It’s the shallow waters of a continental shelf, and the complex basin dynamics of northwest Europe (and Kelvin waves, see Lucy’s IgniteLiverpool talk for more details), that give ports like Brest a high tidal range.

Notice that St Helena has some negative sea levels. Sea level is measured to a 0-point that is fixed for each station but varies from station to station. It is common to pick that point as being the lowest sea level (either observed or predicted) over some period, so that almost all actual observations are positive. Brest follows the usual convention: almost all the observations are positive (you can’t tell from the histogram, but there are a few negative ones). It is not clear what the 0-point on the St Helena chart is (it’s clearly not a low low water, and doesn’t look like a mean water level either), and I have exhausted the budget for researching the matter.

Tides are a new subject for me, and when I was reading Pugh’s book, one of the first surprises was the existence of places that do not get two tides a day. An example is Fremantle, Australia, which instead of getting two tides a day (semi-diurnal) gets just one tide a day (diurnal):

[Figure: hourly sea levels at Fremantle, showing a single tide each day]

The diurnal tides are produced predominantly by the effect of lunar declination. When the moon crosses the equator (twice a nodical month), its declination is zero, the effect is reduced to zero, and so are the diurnal tides. This is in contrast to the twice-daily tides: while they exhibit large (spring) and small (neap) tides, we still get tides whatever the time of the month. Because of the modulation of the diurnal tide there is no “mean low tide” or “mean high tide”; tides of all heights are produced, and we get a single hump in the distribution (adding the Fremantle data in red):

[Figure: histogram with the Fremantle observations added in red]

So we’ve found something interesting about the Fremantle tides from the kind of histogram which we probably learnt to do in primary school.

Napoleon died on St Helena, but my investigations into St Helena’s tides will continue on the ScraperWiki data hub, using a mixture of standard platform tools, like the summarise tool, and custom tools, like a tidal analysis tool.

Image “Napoleon at Saint-Helene, by Francois-Joseph Sandmann,” in Public Domain from Wikipedia

‘Big Data’ in the Big Apple
https://blog.scraperwiki.com/2011/09/big-data-in-the-big-apple/ – Thu, 29 Sep 2011

My colleague @frabcus captured the main theme of Strata New York #strataconf in his most recent blog post. This was our first official speaking engagement in the USA as a Knight News Challenge 2011 winner. Here is my twopence worth!

At first we were a little confused at the way in which the week-long conference was split into three consecutive mini-conferences with what looked like repetitive content. The reality was that the one-day Strata Jump Start was like an MBA for people trying to understand the meaning of ‘Big Data’. It gave a 50,000-foot view of what is going on and made us think about the legal stuff, how it will impact the demand for skills, and how the pace with which data is exploding will dramatically change the way in which businesses operate – every CEO should attend or watch the videos and learn!

The following two days, called the Strata Summit, were focused on what people need to think about strategically to get business ready for the onslaught. In his welcome address, Edd Dumbill, program chair for O’Reilly, said: “Computers should serve humans… we have been turned into filing clerks by computers… we spend our day sifting, sorting and filing information… something has gone upside down. Fortunately the systems that we have created are also part of the solution… big data can help us… it may be the case that big data has to help us!”

[Image: Big Apple… and small products]

To use the local lingo, we took a ‘deep dive’ into various aspects of the challenges. The sessions were well choreographed and curated. We particularly liked the session ‘Transparency and Strategic Leaking’ by Dr Michael Nelson (Leading Edge Forum – CSC), where he talked about how companies need to be pragmatic in an age when it is impossible to stop data leaking out of the door. Companies, he said, ‘are going to have to be transparent’ and ‘are going to have to have a transparency policy’. He referred to a recent article in the Economist, ‘The Leaking Corporation’, and its assertion that corporations that leak their own data ‘control the story’.


Simon Wardley’s (Leading Edge Forum – CSC) ‘Situation Normal Everything Must Change’ segment made us laugh, especially the philosophical little quips that came from his encounter with a London taxi driver – he conducted it at lightning speed, and his explanation of ‘ecosystems’ and how big data offers a potential solution to the ‘Innovation Paradox’ was insightful. It was a heavy-duty session but worth it!


There were tons of excellent sessions to peruse. We really enjoyed Cathy O’Neil’s ‘What kinds of people are needed for data management’, which talked about data scientists and how they can help corporations to discern ‘noise’ from signal.

Our very own Francis Irving was interviewed about how ScraperWiki relates to Big Data and Investigative Journalism.


Unfortunately we did not manage to see many of the technology exhibitors #fail. However, we did see some very sexy ideas, including a wonderful software start-up called Kaggle.com – a platform for data prediction competitions – whose Chief Data Scientist, Jeremy Howard, gave us some great ideas on how to manage ‘labour markets’.

Oh yes, and we checked out why it is called Strata…

We had to leave early to attend the Online News Association (#ONA) event in Boston, so we missed part III, the two-day Strata Conference itself – it is designed for people at the cutting edge of data – the data scientists and data activists! I just hope that we manage to get to Strata 2012 in Santa Clara next February.

In his closing address, ‘Towards a global brain’, Tim O’Reilly gave a list of 10 scary things that are leading into the perfect humanitarian storm, including… Climate Change, Financial Meltdown, Disease Control, Government inertia… so we came away thinking of a T-shirt theme… Hmm, we’re f**ked so let’s scrape!!!

