Making a ScraperWiki view with R
https://blog.scraperwiki.com/2013/08/making-a-scraperwiki-view-with-r/
Fri, 23 Aug 2013 17:52:34 +0000

In a recent post I showed how to use the ScraperWiki Twitter Search Tool to capture tweets for analysis. I demonstrated this using a search on the #InspiringWomen hashtag, using Tableau to generate a visualisation.

Here I’m going to show a tool made using the R statistical programming language which can be used to view any Twitter Search dataset. R is very widely used in both academia and industry to carry out statistical analysis. It is open source and has a large community of users who are actively developing new libraries with new functionality.

Although this viewer is a trivial example, it can be used as a template for any other R-based viewer. To break the suspense, this is what the output of the tool looks like:

[Image: R-view, the output of the R viewer tool]

The tool updates when the underlying data is updated; the Twitter Search tool checks for new tweets on an hourly basis. The tool shows the number of tweets found and a histogram of the times at which they were tweeted. To limit the time taken to generate a view, the number of tweets is capped at 40,000. The histogram uses bins of one minute, so the vertical axis shows tweets per minute.

The code can all be found in this BitBucket repository.

The viewer is based on the knitr package for R, which generates reports in specified formats (HTML, PDF, etc.) from a source template file containing R commands that are executed to generate content. In this case we use Rhtml, rather than the alternative Markdown, because it enables us to specify custom CSS and JavaScript to integrate with the ScraperWiki platform.
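As an illustration of the pattern (this is a hedged sketch, not the actual view.Rhtml used by this tool), an Rhtml template is ordinary HTML with R chunks delimited by knitr’s begin.rcode/end.rcode markers:

[sourcecode language="html"]
<!-- Sketch of an Rhtml template: plain HTML with embedded R chunks.
     The source() call below is illustrative, not the real template. -->
<html>
  <body>
    <h1>Tweet summary</h1>
    <!--begin.rcode echo=FALSE
    source("view-source.R")
    end.rcode-->
  </body>
</html>
[/sourcecode]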

ScraperWiki tools live in their own UNIX accounts called “boxes”; the code for the tool lives in a subdirectory, ~/tool, and web content placed in the ~/http directory is what gets displayed. In this project the http directory contains a short JavaScript file, code.js, which, by the magic of jQuery and some messy bash shell commands, puts the URL of the SQL endpoint into a file in the box. It also runs a package installation script once, after the tool is first installed; the only package not already installed is ggplot2.

View the code on Gist.

The ScraperWiki platform has an update hook: simply an executable file called update in the ~/tool/hooks/ directory, which is executed when the underlying dataset changes.

This brings us to the meat of the viewer: the knitrview.R file calls the knitr package to take the view.Rhtml file and convert it into an index.html file in the http directory. The view.Rhtml file contains calls to some functions in R which are used to create the dynamic content.

View the code on Gist.
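The essence of that conversion step is small. A minimal sketch (the paths here are assumptions based on the directory layout described above; the real knitrview.R is in the Gist):

[sourcecode language="r"]
# Render the Rhtml template into the web-visible http directory.
# Paths are illustrative; the actual script is in the Gist above.
library(knitr)
knit(input = "~/tool/view.Rhtml", output = "~/http/index.html")
[/sourcecode]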

Code for interacting with the ScraperWiki platform is in the scraperwiki_utils.R file, which contains:

  • a function to read the SQL endpoint URL which is dumped into the box by some JavaScript used in the Rhtml template.
  • a function to read the JSON output from the SQL endpoint – this is a little convoluted since R cannot natively use https, and solutions to read https are different on Windows and Linux platforms.
  • a function to convert the imported JSON into a clean dataframe. The data structure returned by the rjson package consists of lists of lists and needs reprocessing into the preferred vector-based dataframe format (a rough sketch of these helpers follows below).
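A hedged sketch of what these helpers might look like (the function names, the file that code.js writes to, and the endpoint’s query-string format are all assumptions; the real code is in scraperwiki_utils.R in the repository):

[sourcecode language="r"]
# Illustrative versions of the helpers described above; names, paths and the
# query-string format are assumptions, not the actual scraperwiki_utils.R code.
library(RCurl)  # R could not read https natively, hence RCurl
library(rjson)

# Read the SQL endpoint URL that code.js drops into a file in the box
ReadSqlEndpoint <- function(path = "~/sw_url.txt") {
  readLines(path, n = 1)
}

# Fetch JSON from the endpoint and flatten rjson's list-of-lists into a dataframe
QueryToDataFrame <- function(endpoint, query) {
  raw <- getURL(paste0(endpoint, "?q=", URLencode(query, reserved = TRUE)))
  records <- fromJSON(raw)
  do.call(rbind, lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE)))
}
[/sourcecode]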

Functions for generating the view elements are in view-source.R, which means that the R code embedded in the Rhtml template consists of simple function calls. The main plot is generated using the ggplot2 library.

View the code on Gist.
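The main plot boils down to something like the following sketch (the timestamp column name and format are assumptions about the Twitter Search tool’s output; the real function is in view-source.R, linked above):

[sourcecode language="r"]
# Histogram of tweet times in one-minute bins, as described above.
# Column names are assumptions for illustration only.
library(ggplot2)

PlotTweetTimes <- function(tweets) {
  # 'created_at' is assumed to hold the tweet timestamp as text
  tweets$time <- as.POSIXct(tweets$created_at,
                            format = "%a %b %d %H:%M:%S +0000 %Y", tz = "UTC")
  ggplot(tweets, aes(x = time)) +
    geom_histogram(binwidth = 60) +  # 60-second bins, i.e. tweets per minute
    xlab("Time") + ylab("Tweets per minute")
}
[/sourcecode]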

So there you go – not the world’s most exciting tool, but it shows the way to make live reports on the ScraperWiki platform using R. Extensions to this would be to allow some user interaction, for example letting users adjust the axis limits. This could be done either using JavaScript and vanilla R, or using Shiny.

What would you do with R in ScraperWiki? Let me know in the comments below or by email: ian@scraperwiki.com

A sea of data
https://blog.scraperwiki.com/2013/04/a-sea-of-data/
Tue, 30 Apr 2013 11:14:29 +0000

[Image: Napoleon at Saint-Helena]

My friend Simon Holgate of Sea Level Research has recently “cursed” me by introducing me to tides and sea-level data. Now I’m hooked. Why are tides interesting? When you’re trying to navigate a super-tanker into San Francisco Bay and you only have a few centimetres of clearance, whether the tide is in or out could be quite important!

The French port of Brest has the longest historical tidal record. The Joint Archive for Sea Level has hourly readings from 1846. Those of you wanting to follow along at home should get the code:

[sourcecode language="text"]
git clone git://github.com/drj11/sea-level-tool.git
cd sea-level-tool
virtualenv .
. bin/activate
pip install -r requirements.txt
[/sourcecode]

After that lot (phew!), you can get the data for Brest by going:

[sourcecode language="text"]
code/etl 822a
[/sourcecode]

The sea level tool is written in Python and uses our scraperwiki library to store the sea level data in a sqlite database.

Tide data can be surprisingly complex (the 486 pages of [PUGH1987] are testimony to that), but in essence we have a time series of heights, z. Often even really simple analyses can tell us interesting facts about the data.

As Ian tells us, R is good for visualisations. And it turns out it has an installable RSQLite package that can load R dataframes from a sqlite file. And I feel like a grown-up data scientist when I use R. The relevant snippet of R is:

[sourcecode language="r"]
library(RSQLite)
db <- dbConnect(dbDriver("SQLite"), dbname = "scraperwiki.sqlite", loadable.extensions = TRUE)
bre <- dbGetQuery(db, "SELECT * FROM obs WHERE jaslid = 'h822a' ORDER BY t")
[/sourcecode]

I’m sure you’re all aware that the sea level goes up and down to make tides and some tides are bigger than others. Here’s a typical month at Brest (1999-01):

[Image: bre-ts, a typical month of hourly sea levels at Brest (January 1999)]

There are well over 1500 months of data for Brest. Can we summarise the data? A histogram works well:

[Image: bre-hist, histogram of hourly sea level observations at Brest]
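A histogram like this takes only a line or two of R (a sketch; it assumes the column of heights in the obs table is called z and that heights are in millimetres):

[sourcecode language="r"]
# Distribution of hourly sea level observations at Brest
# (the column name z and the millimetre units are assumptions)
hist(bre$z, breaks = 100, main = "Brest: hourly sea level observations",
     xlab = "Height (mm)")
[/sourcecode]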

Remember that this is a histogram of hourly sea level observations, so the two humps show the most frequently observed sea level heights in the hourly series: they cluster around the mean low tide and the mean high tide. The range, the distance between mean low tide and mean high tide, is about 2.5 metres (big tides, big data!).

This is a comparatively large range, certainly compared to a site like St Helena (where the British imprisoned Napoleon after his defeat at Waterloo). Let’s plot St Helena’s tides on the same histogram as Brest, for comparison:

[Image: sth2-hist, St Helena’s sea level histogram overlaid on Brest’s]
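Overlaying the two stations is straightforward (again a sketch, assuming a second dataframe sth loaded in the same way as bre, with the same assumed column name):

[sourcecode language="r"]
# Overlay St Helena's distribution on Brest's; freq = FALSE puts both on a
# density scale so the different record lengths are comparable.
# (The dataframe and column names are assumptions.)
hist(bre$z, freq = FALSE, breaks = 100, col = rgb(0, 0, 1, 0.4),
     main = "Hourly sea level heights", xlab = "Height (mm)")
hist(sth$z, freq = FALSE, breaks = 100, col = rgb(1, 0, 0, 0.4), add = TRUE)
[/sourcecode]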

Again we have a mean low tide and a mean high tide, but this time the range is about 0.4 metres, and the entire span of observed heights, including extremes, fits into 1.5 metres. St Helena is a rock in the middle of a large ocean, and this small range is typical of oceanic tides. It’s the shallow waters of a continental shelf, and complex basin dynamics in northwest Europe (and Kelvin waves, see Lucy’s IgniteLiverpool talk for more details), that give ports like Brest a high tidal range.

Notice that St Helena has some negative sea levels. Sea level is measured to a 0-point that is fixed for each station but varies from station to station. It is common to pick that point as being the lowest sea level (either observed or predicted) over some period, so that almost all actual observations are positive. Brest follows the usual convention: almost all the observations are positive (you can’t tell from the histogram, but there are a few negative ones). It is not clear what the 0-point on the St Helena chart is (it’s clearly not a low low water, and doesn’t look like a mean water level either), and I have exhausted the budget for researching the matter.

Tides are a new subject for me, and when I was reading Pugh’s book, one of the first surprises was the existence of places that do not get two tides a day. An example is Fremantle, Australia, which instead of getting two tides a day (semi-diurnal) gets just one tide a day (diurnal):

[Image: fre-ts, a time series of sea levels at Fremantle showing one tide a day]

The diurnal tides are produced predominantly by the effect of lunar declination. When the moon crosses the equator (twice a nodical month), its declination is zero, the effect is reduced to zero, and so are the diurnal tides. This is in contrast to the twice-daily tides: while they exhibit large (spring) and small (neap) tides, we still get tides whatever the time of the month. Because of this modulation of the diurnal tide there is no “mean low tide” or “mean high tide”; tides of all heights are produced, and we get a single hump in the distribution (adding the Fremantle data in red):

[Image: fre3-hist, Fremantle’s single-humped sea level histogram added in red]

So we’ve found something interesting about the Fremantle tides from the kind of histogram which we probably learnt to do in primary school.

Napoleon died on St Helena, but my investigations into St Helena’s tides will continue on the ScraperWiki data hub, using a mixture of standard platform tools, like the summarise tool, and custom tools, like a tidal analysis tool.

Image “Napoleon at Saint-Helene, by Francois-Joseph Sandmann,” in Public Domain from Wikipedia

Book Review: R in Action by Robert I. Kabacoff
https://blog.scraperwiki.com/2013/03/book-review-r-in-action-by-robert-i-kabacoff/
Wed, 27 Mar 2013 14:43:56 +0000

[Image: R in Action book cover]

This is a review of Robert I. Kabacoff’s book R in Action, which is a guided tour around the statistical computing package R.

My reasons for reading this book were two-fold: firstly, I’m interested in using R for statistical analysis and visualisation. Previously I’ve used Matlab for this type of work, but R is growing in importance in the data science and statistics communities, and it is a better fit for the ScraperWiki platform. Secondly, I feel the need to learn more statistics. As a physicist, my exposure to statistics is relatively slight; I’ve wondered why this is the case and I’m eager to learn more.

In both cases I see this book as an atlas for the area rather than an A-Z streetmap. I’m looking to find out what is possible and where to learn more rather than necessarily finding the detail in this book.

R in Action follows a logical sequence of steps for importing, managing, analysing, and visualising data for some example cases. It introduces the fundamental mindset of R, in terms of syntax and concepts. Central among these is the data frame, a concept carried over from other statistical analysis packages. A data frame is a collection of variables which may have different types (continuous, categorical, character). The variables form the columns in a structure which looks like a matrix; the rows are known as observations. A simple data frame would contain the height, weight, name and gender of a set of people. R has extensive facilities for manipulating and reorganising data frames (I particularly like the sound of melt in the reshape library).

R also has some syntactic quirks. For example, the dot (.) character, often used as a structure accessor in other languages, is just another character as far as R is concerned; the $ character fulfills a structure-accessor-like role. Kabacoff sticks with the R user’s affection for using <- as the assignment operator instead of =, which is what everyone else uses and appears to work perfectly well in R.
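A small sketch (not taken from the book) pulls these pieces together: building a data frame like the one described above, using <- and $, and reshaping it with melt:

[sourcecode language="r"]
# A tiny data frame of people, illustrating <- assignment, the $ accessor,
# and melt from the reshape library (illustrative, not from the book)
library(reshape)

people <- data.frame(name   = c("Alice", "Bob", "Carol"),
                     gender = c("F", "M", "F"),
                     height = c(165, 180, 170),   # cm
                     weight = c(60, 82, 66))      # kg

mean(people$height)   # $ picks out a single column

long <- melt(people, id.vars = c("name", "gender"))
head(long)            # one row per (person, measurement) observation
[/sourcecode]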

R offers a huge range of plot types out-of-the-box, with many more a package-install away (and installing packages is a trivial affair). Plots in the base package are workmanlike but not the most beautiful. I liked the kernel density plots, which give smoothed approximations to histograms, and the rug plots, which put little ticks on the axes to show where the data in the body of the plot fall. These are all shown in the plot below, plotted from example data included in R.

[Image: histogram with rug and kernel density plot]
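A plot along these lines can be reproduced in a few lines of base R (a sketch using the built-in mtcars data set, not the book’s exact example):

[sourcecode language="r"]
# Histogram with a kernel density estimate and a rug plot,
# using the mtcars data set that ships with R
hist(mtcars$mpg, freq = FALSE, breaks = 10,
     main = "Miles per gallon", xlab = "mpg")
lines(density(mtcars$mpg), lwd = 2)
rug(mtcars$mpg)
[/sourcecode]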

The ggplot2 package provides rather more beautiful plots and seems to be the choice for more serious users of R.

The statistical parts of the book cover regression, power analysis, methods for handling missing data, group comparison methods (t-tests and ANOVA), principal component and factor analysis, and permutation and bootstrap methods. I found it a really useful survey: enough to get the gist and understand the principles, with pointers to more in-depth information.
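Many of these methods are one-liners in R; for instance, a two-group comparison (again using built-in data rather than an example from the book):

[sourcecode language="r"]
# Welch two-sample t-test: compare fuel economy of automatic and manual cars
t.test(mpg ~ am, data = mtcars)
[/sourcecode]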

One theme running through the book is that there are multiple ways of doing almost anything in R, as a result of its rich package ecosystem. This comes to something of a head with graphics in the final section: there are four different graphics systems with overlapping functionality but different syntax. This collides a little with the Matlab way of doing things, where there is the one true path provided by Matlab alongside a fairly good, but less integrated, ecosystem of user-provided functionality.

R is really nice for this example-based approach because the base distribution includes many sample data sets with which to play. In addition, add-on packages often include sample data sets on which to experiment with the tools they provide. The code used in the book is all relatively short; the emphasis is on the data and analysis of the data rather than trying to build larger software objects. You can do an awful lot in a few lines of R.

As an answer to my statistical questions: it turns out that physics tends to focus on Gaussian-distributed, continuous variables, while statistics does not share this focus. Statistics is more generally interested in both categorical and continuous variables, and distributions cannot be assumed. For a physicist, experiments are designed where most variables are fixed, and the response of the system is measured as just one or two variables. Furthermore, there is typically a physical theory with which the data are fitted, rather than a need to derive an empirical model. These features mean that a physicist’s exposure to statistical methods is quite narrow.

Ultimately I don’t learn how to code by reading a book; I learn by solving a problem using the new tool. This is work in progress for me and R, so watch this space! As a taster, just half a dozen lines of code produced the appealing visualisation of Twitter profiles shown below:

[Image: SmoothScatter plot of Twitter profiles]

(Here’s the code: https://gist.github.com/IanHopkinson/5318354)
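The actual code is in that gist; the general idea is a smoothed scatter plot of profile data, roughly along these lines (the input file and column names here are invented for illustration):

[sourcecode language="r"]
# Smoothed scatter plot of Twitter profile data; the file and column names
# are hypothetical; see the gist above for the real code
profiles <- read.csv("twitter_profiles.csv")
smoothScatter(log10(profiles$followers_count + 1),
              log10(profiles$friends_count + 1),
              xlab = "log10(followers)", ylab = "log10(friends)")
[/sourcecode]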

Data Business Models
https://blog.scraperwiki.com/2013/02/data-business-models/
Wed, 27 Feb 2013 09:46:39 +0000

If it sometimes feels like the data business is full of buzzwords and hipster technical jargon, then that’s probably because it is. But don’t panic! I’ve been at loads of hip and non-hip data talks here and there and, buzzwords aside, I’ve come across four actual categories of data business model in this hip data ecosystem. Here they are:

  1. Big storage for big people
  2. Money in, insight out: Vertically integrated data analysis
  3. Internal data analysis on an organization’s own data
  4. Quantitative finance

1) Big storage for big people

This is mostly Hadoop. For example,

  • Teradata
  • Hortonworks
  • MapR
  • Cloudera

Some people are using NoHadoop. (I just invented this word.)

  • Datastax (Cassandra)
  • Couchbase (Couch but not the original Couch)
  • 10gen (Mongo)

Either way, these companies sell consulting, training, hosting, proprietary special features &c. to big businesses with shit tons of data.

2) Money in, insight out: Vertically integrated data analysis

Several companies package data collection, analysis and presentation into one integrated service. I think this is pretty close to “research”. One example is AIMIA, which manages the Nectar card scheme; as a small part of this, they analyze the data that they collect and present ideas to clients. Many producers of hip data tools also provide hip data consulting, so they too fall into this category.

Data hubs

Some companies produce suites of tools that approach this vertical integration; when you use these tools, you still have to look at the data yourself, but it is made much easier. This approaches the ‘data hubs’ that Francis likes talking about.

Lots of advertising, web and social media analytics tools fall into this category. You just configure your accounts, let data accumulate, and look at the flashy dashboard. You still have to put some thought into it, but the collection, analysis and presentation are all streamlined and integrated and thus easier for people who wouldn’t otherwise do this themselves.

Tools like Tableau, ScraperWiki, and RStudio (combined with its tangential R services) also fall into this category. You still have to do your analysis, but they let you do it all in one place, and connections between that place, your data sources and your presentation media are easy. Well, that’s the idea at least.

3) Internal data analysis

Places with lots of data have internal people do something with them. Any company that’s making money must have something like this. The mainstream companies might call these people “business analysts”, and they might do all their work in Excel. The hip companies are doing “data science” with open source software before it gets cool. And the New York City government has a team that just analyzes New York data to make the various government services more efficient. For the current discussion, I see these as similar sorts of people.

I was pondering distinguishing analysis that affects businessy decisions from models that get written into software. Since I’m just categorising business models, and these things could both be produced by the same guy working inside a company with lots of data, I chose not to distinguish between them.

4) Quantitative finance

Quantitative finance is special in that the data analysis is very close to being a product in itself. The conclusion of the analysis or algorithm is “make these trades when that happens”, rather than “if you market to these people, you might sell more products”.

This has some interesting implications. For one thing, you could have a whole company doing quantitative finance. On a similar note, I suspect that the analyses can be more complicated, because they might only need to be conveyed to people with quantitative literacy; in the other categories, it might be more important to convey insights to non-technical managers.

The end

Pretend that I made some insightful, conclusionary conclusion in this sentence. And then get back to your number crunching.

Happy New Year and Happy New York!
https://blog.scraperwiki.com/2012/01/happy-new-year-and-happy-new-york/
Tue, 03 Jan 2012 20:32:42 +0000

We are really pleased to announce that we will be hosting our very first US two-day Journalism Data Camp event, in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation, on February 3rd and 4th 2012.

We have been working with Emily Bell @emilybell, Director of the Tow Center and Susan McGregor @SusanEMcG, Assistant Professor at the Columbia J School to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects, liberating New York data, and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or information technology, coders, statisticians and public data boffs to dig up the data!

Please pick a stream and sign-up to help us to make New York a data driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.

Zarino has created a map of our US events, which we will update as we add locations: https://scraperwiki.com/events/
