Book review: Python for Data Analysis by Wes McKinney Thu, 30 Jan 2014 08:40:57 +0000

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Some of this is just because we're interested; other times it's because clients don't have the tools or the time to do the analysis they want themselves. Often the problem is with the size of the data. Excel is the universal solvent for data analysis problems – go look at any survey of data scientists. But Excel has its limitations. There is the technical limitation of a maximum size of about a million rows, but well before this size Excel becomes a pain to use.

There is another path – the programming route. As a physical scientist of moderate age, I've followed these two data analysis paths in parallel: Excel for the quick look-see and some presentation; programming for bigger tasks, tasks I want to do repeatedly, and types of data Excel simply can't handle – like image data. For me the programming path started with FORTRAN and the NAG libraries, from which I moved into Matlab. FORTRAN is pure, traditional programming, born in the days when you had to light your own computing fire. Matlab and competitors like Mathematica, R and IDL follow a slightly different path. At their core they are specialist programming languages, but they come embedded in graphical environments which can be used interactively. You type code at a prompt and stuff happens: plots pop up and so forth. You can capture this interaction and put it into scripts/programs, or simply write programs from scratch.

Outside the physical sciences, data analysis often means databases. Physical scientists are largely interested in numbers, other sciences and business analysts are often interested in a mixture of numbers and categorical things. For example, in analysing the performance of a drug you may be interested in the dose (i.e. a number) but also in categorical features of the patient such as gender and their symptoms. Databases, and analysis packages such as R and SAS are better suited to this type of data. Business analysts appear to move from Excel to Tableau as their data get bigger and more complex. Tableau gives easy visualisation of database shaped data. It provides connectors to many different databases. My workflow at ScraperWiki is often Python to SQL database to Tableau.
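The Python-to-SQL leg of that workflow is short enough to sketch. Here's a minimal, made-up example (the table and column names are purely illustrative) that builds a small mixed numeric/categorical dataset with pandas and pushes it into a SQLite database that Tableau could then connect to:

```python
import sqlite3

import pandas as pd

# Made-up drug trial data: a mix of numbers and categorical values
df = pd.DataFrame({
    "dose_mg": [10, 20, 10, 40],
    "gender": ["F", "M", "F", "M"],
    "improved": [1, 1, 0, 1],
})

# Push it into a SQL database (in-memory here; a file on disk for Tableau)
conn = sqlite3.connect(":memory:")
df.to_sql("trial_results", conn, if_exists="replace", index=False)

# Sanity check: read it straight back out
back = pd.read_sql("SELECT * FROM trial_results", conn)
print(len(back))  # 4
```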

Python for Data Analysis by Wes McKinney draws these threads together. The book is partly about the range of tools which make Python an alternative to systems like R, Matlab and their ilk, and partly a guide to McKinney's own contribution to this area: the pandas library. Pandas brings R-like dataframes and database-like operations to Python. It helps keep all your data analysis needs in one big Python-y tent. Dataframes are 2-dimensional tables of data whose rows and columns have indexes, which can be numeric but are typically text. The pandas library provides a great deal of functionality for processing dataframes, in particular enabling filtering and grouping calculations which are reminiscent of the SQL database workflow. The indexes can be hierarchical. As well as the 2-dimensional DataFrame, pandas also provides a 1-dimensional Series and a 3-dimensional Panel data structure.
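A tiny sketch of what that SQL-like filtering and grouping looks like in pandas (the data is invented for illustration):

```python
import pandas as pd

# A small DataFrame whose row index is text rather than numbers
df = pd.DataFrame(
    {"gender": ["F", "M", "F", "M"],
     "dose": [10, 20, 10, 40]},
    index=["alice", "bob", "carol", "dan"],
)

# Filtering, like SQL's WHERE…
high_dose = df[df["dose"] >= 20]

# …and grouping, like SQL's GROUP BY
mean_dose = df.groupby("gender")["dose"].mean()

print(list(high_dose.index))  # ['bob', 'dan']
print(mean_dose["F"])         # 10.0
```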

I’ve already been using pandas in the Python part of my workflow. It’s excellent for importing data, and simplifies the process of reshaping data for upload to a SQL database and onwards to visualisation in Tableau. I’m also finding it can be used to help replace some of the more exploratory analysis I do in Tableau and SQL.

Outside of pandas, the key technologies McKinney introduces are the IPython interactive console and the NumPy library. I mentioned the IPython notebook in my previous book review. IPython gives Python the interactive analysis capabilities of systems like Matlab. NumPy is a high-performance library providing simple multi-dimensional arrays, comforting those who grew up with a FORTRAN background.
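For the uninitiated, NumPy's appeal is that whole-array operations replace the explicit loops you'd write in plain Python. A trivial sketch:

```python
import numpy as np

# A 2-dimensional array: 3 rows, 4 columns
a = np.arange(12).reshape(3, 4)

# Operations work on whole arrays at once, no explicit loops
col_sums = a.sum(axis=0)
print(col_sums)    # [12 15 18 21]
print(a.mean())    # 5.5
```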

Why switch from commercial offerings like Matlab to the Python ecosystem? Partly it's cost: the pricing model for Matlab has a moderately expensive core (i.e. $1000) with further functionality in moderately expensive toolboxes (more $1000s). Furthermore, the most painful and complex thing I did at my previous (very large) employer was represent users in the contractual interactions between my company and Mathworks to license Matlab and its associated toolboxes for hundreds of employees spread across the globe. These days Python offers me a wider range of high quality toolboxes, and at its core it's a respectable programming language with all the features and tooling that brings. If my code doesn't run it's because I wrote it wrong, not because my colleague in Shanghai has grabbed the last remaining network licence for a key toolbox. R still offers statistical analysis with greater gravitas, and some really nice, publication-quality plotting, but it does not have the air of a general purpose programming language.

The parts of Python for Data Analysis which I found most interesting and engaging were the examples of pandas code in “live” usage. Early in the book this includes analysis of first names for babies in the US over time, with later examples from the financial sector – in which the author worked. Much of the rest is very heavy on code snippets, which is distracting from a straightforward reading of the book. In some senses Mining the Social Web has really spoiled me – I now expect a book like this to come with an IPython Notebook!

Mastering space and time with jQuery deferreds Wed, 28 Aug 2013 16:00:32 +0000 A screenshot of Rod Taylor enthusiastically grabbing a lever on his Time Machine in the 1960 film of the same name

Recently Zarino and I were pairing on making improvements to a new scraping tool on ScraperWiki. We were working on some code that allows the person using the tool to pick out parts of some scraped data in order to extract a date into a new database column. For processing the data on the server side we were using a little helper library called scrumble, which does some cleaning in Python to produce dates in a standard format. Which is great for the server side, but we also needed to display a preview of the cleaned dates to the user before the data is finally sent to the server for processing.

Rather than rewrite this Python code in JavaScript we thought we’d make a little script which could be called using the ScraperWiki exec endpoint to do the conversion for us on the server side.

Our code looked something like this:

var $tr = $('<tr>');

// for each cell in each row…
$.each(row, function (index, value) {
  var $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // execute this command on the server…
  scraperwiki.exec('tools/ ' + date, function (response) {
    // and put the result into this table cell…
    $td.text(response);
  });
  $tr.append($td);
});
Each time we needed to process a date with scrumble we made a call to our server side Python script via the exec endpoint. When the value comes back from the server, the callback function sets the content of the table cell to the value.

However, when we started testing our code, we hit a limit placed on the exec endpoint to prevent overloading the server (currently no more than 5 exec calls can be executing at once).

Our first thought was to just limit the rate at which we made requests so that we didn’t trip the rate limit, but our colleague Pete suggested we should think about batching up the requests to make them faster. Sending each one individually might work well with just a few requests, but what about when we needed to make hundreds or thousands of requests at a time?

How could we change it so that the conversion requests were batched, and the results were inserted into the right table cells once they’d been computed?

jQuery.Deferred() to the rescue

We realised that we could use jQuery deferreds to allow us to do the batching. A deferred is like an I.O.U. that says that at some point in the future a result will become available. Anybody who's used jQuery to make an AJAX request will have used a deferred – you send off a request, and specify some callbacks to be executed when the request eventually succeeds or fails.

By returning a deferred we could delay the call to the server until all of the values to be converted have been collected and then make a single call to the server to convert them all.

Below is the code which does the batching:

scrumble = {
  deferreds: {},

  as_date: function (raw_date) {
    if (!this.deferreds[raw_date]) {
      var d = $.Deferred();
      this.deferreds[raw_date] = d;
    }
    return this.deferreds[raw_date].promise();
  },

  process_dates: function () {
    var self = this;
    var raw_dates = _.keys(self.deferreds);
    var date_list = scraperwiki.shellEscape(JSON.stringify(raw_dates));
    var command = 'tool/ ' + date_list;
    scraperwiki.exec(command, function (response) {
      var response_object = JSON.parse(response);
      $.each(response_object, function (key, value) {
        // resolve each deferred with its processed value
        self.deferreds[key].resolve(value);
      });
    });
  }
};
Each time as_date is called it creates or reuses a deferred which is stored in an object keyed on the raw_date string and then returns a promise (a deferred with a restricted interface) to the caller. The caller attaches a callback to the promise that will use the value once it is available.

To actually send the batch of dates off to be converted, we call the process_dates method. It makes a call to the server with all of the strings to be processed. When the result comes back from the server it “resolves” each of the deferreds with the processed value, which causes all of the callbacks to fire updating the user interface.

With this design the changes we had to make to our code were minimal. It was already using a callback to set the value of the table cell. It was just a case of attaching it to the jQuery promise returned by the scrumble.as_date method and calling scrumble.process_dates, after all of the items had been added, to make the server side call to convert all of the dates.

var $tr = $('<tr>');

$.each(row, function (index, value) {
  var $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // attach the old callback to the promise instead…
  scrumble.as_date(date).done(function (converted) {
    $td.text(converted);
  });
  $tr.append($td);
});
// …then one call converts the whole batch
scrumble.process_dates();
Now instead of one call being made for every value that needs converting (whether or not that string has already been processed) a single call is made to convert all of the values at once. When the response comes back from the server, the promises are resolved and the user interface updates showing the user the preview as required. jQuery deferreds allowed us to make this change with minimal disruption to our existing code.

And it gets better…

Further optimisation (not shown here) is possible if process_dates is called multiple times. A little-known feature of jQuery deferreds is that they can only be resolved once. If you make an AJAX call like $.get('http://foo').done(myCallback) and then, some time later, call .done(myCallback) on that ajax response again, the callback myCallback is immediately called with the exact same arguments as before. It’s like magic.

We realised we could turn this quirky feature to our advantage. Rather than checking whether we'd already converted a date, and returning the pre-converted value ourselves on subsequent calls, we just attach the .done() callback regardless, as if this was the first time. Callbacks attached to already-resolved deferreds fire immediately, meaning we only send requests to the server if there are new dates that haven't been processed yet.

jQuery deferreds helped us keep our user interface responsive, our network traffic low, and our code refreshingly simple. Not bad for a mysterious set of functions hidden halfway down the docs.

Programmers past, present and future Tue, 04 Jun 2013 09:20:38 +0000 As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only came to code and data tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

Zappia's hierarchy of coder self-actualisation

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through Javascript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs, and regular expressions soon followed. Then a second language, Python, and a third, Shell. I don't know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that sometime soon I will.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back in the beginning. Progressing from a point where websites are just like posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling, is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what has handily been labelled on the above diagram, “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. Javascript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.
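To make that concrete, here's a toy sketch (mine, made up for illustration): HTML can only declare that something *is* a heading, whereas a programming language can compute headings with a function, a condition and a loop.

```python
def heading(text, level=1):
    """Build an HTML heading tag - a function..."""
    if level > 6:  # ...a condition...
        level = 6  # HTML only goes up to <h6>
    return "<h{0}>{1}</h{0}>".format(level, text)

# ...and a loop
for i in range(1, 4):
    print(heading("Section {0}".format(i), level=i))
```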

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t) but truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, Javascript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.

The state of Twitter: Mitt Romney and Indonesian Politics Mon, 23 Jul 2012 09:16:53 +0000 It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can then map, graph or analyse further.

So, anyway, Tom and I decided it was about time to take a closer look at how you guys are using ScraperWiki to draw data from Twitter, and whether there’s anything we could do to make your lives easier in the process!

Getting under the hood

As anybody who’s checked out our source code will know, we store a truck-load of information about each scraper, and each run it’s ever made, in a MySQL database. Of the 9727 scrapers that had run since the beginning of June, 601 accessed a Twitter URL. (Our database only stores the first URL that each scraper accesses on any particular run, so it’s possible that there are scripts that accessed Twitter but not as the first URL.)

Twitter API endpoints

Getting more specific, these 601 scrapers accessed one of a number of Twitter’s endpoints, normally through the official API. We removed the querystring from each of the URLs and then looked for commonly accessed endpoints.
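The querystring-stripping step is roughly this idea (a simplified sketch with made-up example URLs, not our actual analysis script):

```python
from collections import Counter
from urllib.parse import urlsplit

urls = [
    "http://search.twitter.com/search.json?q=%23ddj",
    "http://search.twitter.com/search.json?q=occupy",
    "https://api.twitter.com/1/followers/ids.json?screen_name=foo",
]

# Keep scheme://host/path, drop the ?querystring, then count endpoints
endpoints = Counter(
    "{0.scheme}://{0.netloc}{0.path}".format(urlsplit(u)) for u in urls
)

print(endpoints.most_common(1))
# [('http://search.twitter.com/search.json', 2)]
```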

It turns out that search.json is by far the most popular entry point for ScraperWiki users to get Twitter data – probably because it’s the method used by the basic_twitter_scraper that has proved so popular on ScraperWiki. It takes a search term (like a username or a hashtag) and returns a list of tweets containing that term. Simple!

The next most popular endpoint – followers/ids.json – is a common way to find interesting user accounts to then scrape more details about. And, much to Tom’s amusement, the third endpoint, with 8 occurrences, was a search for Mitt Romney. We can’t quite tell whether that’s a good or bad sign for his 2012 candidacy, but if it makes any difference, only one solitary scraper searched for Barack Obama.


We also looked at what people were searching for. We found 398 search terms in the scrapers that accessed the twitter search endpoint, but only 45 of these terms were called in more than one scraper. Some of the more popular ones were “#ddj” (7 scrapers), “occupy” (3 scrapers), “eurovision” (3 scrapers) and, weirdly, an empty string (5 scrapers).

Even though each particular search term was only accessed a few times, we were able to classify the search terms into broad groups. We sampled from the scrapers who accessed the twitter search endpoint and manually categorized them into categories that seemed reasonable. We took one sample to come up with mutually exclusive categories and another to estimate the number of scrapers in each category.

A bunch of scripts made searches for people or for occupy shenanigans. We estimate that these people- and occupy-focussed queries together account for between two- and four-fifths of the searches in total.

We also invented some smaller categories that seemed to account for a few scrapers each – like global warming, developer and journalism events, towns and cities, and Indonesian politics (!?) – but really it doesn’t seem like there’s any major pattern beyond the people and occupy scripts.

Family Tree

Speaking of the basic_twitter_scraper, we thought it would also be cool to dig into the family history of a few of these scrapers. When you see a scraper you like on ScraperWiki, you can copy it, and that relationship is recorded in our database.

Lots of people copy the basic_twitter_scraper in this way, and then just change one line to make it search for a different term. With that in mind, we’ve been thinking that we could probably make some better tweet-downloading tool to replace this script, but we don’t really know what it would look like. Maybe the users who’ve already copied basic_twitter_scraper_2 would have some ideas…

After getting the scraper details and relationship data into the right format, we imported the whole lot into the open source network visualisation tool Gephi, to see how each scraper was connected to its peers.

By the way, we don’t really know what we did to make this network diagram because we did it a couple weeks ago, forgot what we did, didn’t write a script for it (Gephi is all point-and-click..) and haven’t managed to replicate our results. (Oops.) We noticed this because we repeated all of the analyses for this post with new data right before posting it and didn’t manage to come up with the sort of network diagram we had made a couple weeks ago. But the old one was prettier so we used that :-)

It doesn’t take long to notice basic_twitter_scraper_2’s cult following in the graph. In total, 264 scrapers are part of its extended family, with 190 of those being descendents, and 74 being various sorts of cousins – such as scrape10_twitter_scraper, which was a copy of basic_twitter_scraper_2’s grandparent, twitter_earthquake_history_scraper (the whole family tree, in case you’re wondering, started with twitterhistory-scraper, written by Pedro Markun in March 2011).

With the owners of all these basic_twitter_scraper(_2)’s identified, we dropped a few of them an email to find out what they’re using the data for and how we could make it easier for them to gather in the future.

It turns out that Anna Powell-Smith wrote the basic_twitter_scraper at a journalism conference and Nicola Hughes reused it for loads of ScraperWiki workshops and demonstrations as basic_twitter_scraper_2. But even that doesn’t fully explain the cult following because people still keep copying it. If you’re one of those very users, make sure to send us a reply – we’d love to hear from you!!


We’ve posted our code for this analysis on Github, along with a table of information about the 594 Twitter scrapers that aren’t in vaults (out of 601 total Twitter scrapers), in case you’re as puzzled as we are by our users’ interest in Twitter data.

Now here’s video of a cat playing a keyboard.

Software Archaeology and the ScraperWiki Data Challenge at #europython Fri, 29 Jun 2012 09:24:27 +0000 There’s a term in technical circles called “software archaeology” – it’s when you spend time studying and reverse-engineering badly documented code, to make it work, or make it better. Scraper writing involves a lot of this stuff. ScraperWiki’s data scientists are well accustomed to a bit of archaeology here and there.

But now, we want to do the same thing for the process of writing code-that-does-stuff-with-data. Data Science Archaeology, if you like. Most scrapers or visualisations are pretty self-explanatory (and an open platform like ScraperWiki makes interrogating and understanding other people’s code easier than ever). But working out why the code was written, why the visualisations were made, and who went to all that bother, is a little more difficult.

ScraperWiki Europython poster explaining the Data Challenge about European fishing boats

That’s why, this Summer, ScraperWiki’s on a quest to meet and collaborate with data science communities around the world. We’ve held journalism hack days in the US, and interviewed R statisticians from all over the place. And now, next week, Julian and Francis are heading out to Florence to meet the European Python community.

We want to know how Python programmers deal with data. What software environments do they use? Which functions? Which libraries? How much code is written to ‘get data’ and if it runs repeatedly? These people are geniuses, but for some reason nobody shouts about how they do what they do… Until now!

And, to coax the data science rock stars out of the woodwork, we’re setting a Data Challenge for you all…

In 2010 the BBC published an article about the ‘profound’ decline in fish stocks shown in UK records. “Over-fishing,” they argued, “means UK trawlers have to work 17 times as hard for the same fish catch as 120 years ago.” The same thing is happening all across Europe, and it got us ScraperWikians wondering: how do the combined forces of legislation and overfishing affect trawler fleet numbers?

We want you to trawl (ba-dum-tsch) through this EU data set and work out which EU country is losing the most boats as fishermen strive to meet the EU policies and quotas. The data shows you stuff like each vessel’s license number, home port, maintenance history and transfer status, and a big “DES” if it’s been destroyed. We’ll be giving away a tasty prize to the most interesting exploration of the data – but most of all, we want to know how you found your answer, what tools you used, and what problems you overcame. So check it out!


PS: #Europython’s going to be awesome, and if you’re not already signed up, you’re missing out. ScraperWiki is a startup sponsor for the event and we would like to thank the Europython organisers and specifically Lorenzo Mancini for his help in printing out a giant version of the picture above, ready for display at the Poster Session.

Local ScraperWiki Library Thu, 07 Jun 2012 15:24:28 +0000 It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.

How to use

pip install scraperwiki_local 

A dump truck dumping its payload
You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.

pip install dumptruck


DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:

Complex cell values

What happens if you do this?

import scraperwiki

shopping_list = ['carrots', 'orange juice', 'chainsaw']
scraperwiki.sqlite.save([], {'shopping_list': shopping_list})

On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:

[u'carrots', u'orange juice', u'chainsaw'] 

In the local version, it is encoded to JSON, so it looks like this:

["carrots","orange juice","chainsaw"] 

And if it can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.

Case-insensitive column names

SQL is less sensitive to case than Python. The following code works fine in both versions of the library.

In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]: scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
In [3]: scraperwiki.sqlite.save([], {'sHOpPiNg_liST': shopping_list})
In [4]: scraperwiki.sqlite.select('* from swdata')
Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]

Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.

In [5]: data = scraperwiki.sqlite.select('* from swdata')
In [6]: print([row['shopping_list'] for row in data])
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]

The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.

In [7]: print(data[0]['Shopping_List'])
Out[7]: [u'carrots', u'orange juice', u'chainsaw']

Here’s why. In the hosted version, scraperwiki.sqlite.select returns a list of ordinary dictionaries. In the local version, it returns a list of special dictionaries that have case-insensitive keys.
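You can get a feel for those special dictionaries with a toy sketch like this (an illustration of the behaviour, not DumpTruck’s actual implementation):

```python
class CaseInsensitiveDict(dict):
    """A dictionary whose string keys ignore case - a toy version of
    the special dictionaries the local library returns."""
    def __setitem__(self, key, value):
        dict.__setitem__(self, key.lower(), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())

row = CaseInsensitiveDict()
row["shopping_list"] = ["carrots", "orange juice", "chainsaw"]

# Any capitalisation retrieves the same value
print(row["Shopping_List"])
```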

Develop locally

Here’s a start at developing ScraperWiki scripts locally, with whatever coding environment you are used to. For a lot of things, the local library will do the same thing as the hosted. For another lot of things, there will be differences and the differences won’t matter.

If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!

How to stop missing the good weekends Fri, 20 Jan 2012 09:27:12 +0000 The BBC's Michael Fish presenting the weather in the 80s, with a ScraperWiki tractor superimposed over Liverpool

Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend, when I should be going off to play on my dive kayaks — an activity which is somewhat weather dependent.

Luckily, help is at hand in the form of the ScraperWiki email alert system.

As you may have noticed, when you do any work on ScraperWiki, you start to receive daily emails that go:

Dear Julian_Todd,

Welcome to your personal ScraperWiki email update.

Of the 320 scrapers you own, and 157 scrapers you have edited, we
have the following news since 2011-12-01T14:51:34:

Histparl MP list - :
  * ran 1 times producing 0 records from 2 pages
  * with 1 exceptions, (XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<!DOCTYP')

...Lots more of the same

This concludes your ScraperWiki email update till next time.

Please follow this link to change how often you get these emails,
or to unsubscribe:

The idea behind this is to attract your attention to matters you may be interested in — such as fixing those poor dear scrapers you have worked on in the past and are now neglecting.

As with all good features, this was implemented as a quick hack.

I thought: why design a whole email alert system, with special options for daily and weekly emails, when we already have a scraper scheduling system which can do just that?

With the addition of a single flag to designate a scraper as an emailer (plus a further 20 lines of code), a new fully fledged extensible feature was born.

Of course, this is not counting the code that is in the Wiki part of ScraperWiki.

The default code in your emailer looks roughly like so:

import scraperwiki
emaillibrary = scraperwiki.utils.swimport("general-emails-on-scrapers")
subjectline, headerlines, bodylines, footerlines = emaillibrary.EmailMessageParts("onlyexceptions")
if bodylines:
    print "\n".join([subjectline] + headerlines + bodylines + footerlines)

As you can see, it imports the 138 lines of Python from general-emails-on-scrapers, which I am not here to talk about right now.

Using ScraperWiki emails to watch the weather

Instead, what I want to explain is how I inserted my Good Weather Weekend Watcher by polling the weather forecast for Holyhead.

My extra code goes like this:

import datetime
import urllib
import lxml.html

weatherlines = [ ]
if datetime.date.today().weekday() == 2:  # Wednesday
    url = ""  # the Met Office forecast page for my chosen location
    html = urllib.urlopen(url).read()
    root = lxml.html.fromstring(html)
    rows = root.cssselect("div.tableWrapper table tr")
    for row in rows:
        #print lxml.html.tostring(row)
        metweatherline = row.text_content().strip()
        if metweatherline[:3] == "Sat":
            subjectline += " With added weather"
            weatherlines.append("*** Weather warning for the weekend:")
            weatherlines.append("   " + metweatherline)

What this does is check if today is Wednesday (day of the week #2 in Python land), then it parses through the Met Office Weather Report table for my chosen location, and pulls out the row for Saturday.

Finally we have to handle producing the combined email message, the one which can contain either a set of broken scraper alerts, or the weather forecast, or both.

if bodylines or weatherlines:
    if not bodylines:
        headerlines, footerlines = [ ], [ ]   # kill off cruft surrounding no message
    print "\n".join([subjectline] + weatherlines + headerlines + bodylines + footerlines)

The current state of the result is:

*** Weather warning for the weekend:
  Mon 5Dec

  7 °C
  33 mph
  47 mph
  Very Good

This was a very quick low-level implementation of the idea with no formatting and no filtering yet.

Email alerts can quickly become sophisticated and complex. Maybe I should only send a message out if the wind is below a certain speed. Should I monitor previous days’ weather to predict whether the sea will be calm? Or I could check the wave heights on the off-shore buoys? Perhaps my calendar should be consulted for prior engagements so I don’t get frustrated by being told I am missing out on a good weekend when I had promised to go to a wedding.
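A wind-speed filter could be a few lines on top of the row text already scraped. This is a hypothetical sketch: the `wind_ok` helper and the 25 mph threshold are my own invention, not part of the original emailer.

```python
import re

def wind_ok(row_text, max_mph=25):
    # Pull every "NN mph" figure out of the forecast row and
    # pass only if all of them are below the threshold
    speeds = [int(m) for m in re.findall(r"(\d+)\s*mph", row_text)]
    return bool(speeds) and all(s < max_mph for s in speeds)
```

Rows with no wind figure at all are treated as a fail, so a garbled scrape doesn't trigger a false "good weekend" alert.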

The possibilities are endless and so much more interesting than if we’d implemented this email alert feature in the traditional way, rather than taking advantage of the utterly unique platform that we happened to already have in ScraperWiki.

Job advert: Lead programmer Thu, 27 Oct 2011 11:04:47 +0000

Oil wells, marathon results, planning applications…

ScraperWiki is a Silicon Valley-style startup in the North West of England, in Liverpool. We’re changing the world of open data, and how data science is done together on the Internet.

We’re looking for a programmer who’d like to:

  • Revolutionise the tools for sharing data, and code that works with data, on the Internet.
  • Take a lead in a lean startup, having good hunches on how to improve things, but not minding when A/B testing means axing weeks of code.

In terms of skills:

  • Be polyglot enough to be able to learn Python, and do other languages (Ruby, Javascript…) where necessary.
  • Know one end of a clean web API and a test suite from the other.
  • We’re a small team, so you’ll need to be able to do some DevOps on Linux servers.
  • Desirable – able to make igloos.

About ScraperWiki:

  • We’ve got funding (Knight News Challenge winners) and are in the brand new field of “data hubs”.
  • We’re before product/market fit, so it’ll be exciting, and you can end up a key, senior person in a growing company.

Some practical things:

We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.

Must be willing to either relocate to Liverpool, or able to work from home and travel to our office here regularly (once a week). So somewhere nearby preferred.

To apply – send the following:

  • A link to a previous project that you’ve worked on that you’re proud of, or a description of it if it isn’t publicly visible.
  • A link to a scraper or view you’ve made on ScraperWiki, involving a dataset that you find interesting for some reason.
  • Any questions you have about the job.

Send them along with the word swjob2 in the subject (and yes, that means no agencies, unless the candidates apply themselves).

Pool temperatures, company registrations, dairy prices…

Lots of new libraries Wed, 26 Oct 2011 11:53:50 +0000

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install.

Also, let us know why you want them as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer if ever you see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

  • csvkit, a bunch of tools for handling CSV files, made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner building a newsroom data hub.
  • requests, a lovely new layer over Python’s HTTP libraries made by Kenneth Reitz. Makes it easier to GET and POST.
  • Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
  • pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
  • PyTidyLib, for accessing HTML Tidy, the old classic C library that cleans up HTML files.
  • SciPy is at the analysis end, and builds on NumPy giving code for statistics, Fourier transforms, image processing and lots more.
  • matplotlib, which can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
  • Google Data (gdata) for calling various Google APIs to get data.
  • Twill is its own browser automation scripting language and a layer on top of Mechanize.

In Ruby, we’ve added:

  • tmail for parsing emails.
  • typhoeus, a wrapper around curl with an easier syntax that lets you do parallel HTTP requests.
  • Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

  • GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!
Scraped Data Something to Tweet About Wed, 01 Jun 2011 16:11:18 +0000

I’m a coding pleb, or The Scraper’s Apprentice as I like to call myself. But I realise that’s no excuse, as many of the ScraperWiki users I talk to have not had formal coding lessons themselves. Indeed, some of our founders aren’t formally trained (we have a doctorate in Chemistry here!).

I’ve been attempting to scrape using the ScraperWiki editor (Lord knows I wouldn’t know where to start otherwise) and the most obvious place to start is the CSV plains of government data. So I parked my ScraperWiki digger outside the Cabinet Office. I began by scraping Special Advisers’ gifts and hospitality and then moved onto Ministers’ meetings as well as gifts and hospitality and also Permanent Secretaries (under the strict guidance of my scraper guru, The Julian). The first thing I noticed was the blatant irregularity of standards and formats. Some fields were left blank, the previous entry presumably being implied. And dates! Not even Google Refine could work out what was going on there.

As much as this was an exercise in scraping, I wanted to do something with it. I wanted to repurpose the data. I wrote some blog posts after looking through some of the datasets but the whole point of liberating data, especially government data, is to make it democratic. I figured that each line of data had the potential to be a story, I just didn’t know it. Chris Taggart said that data helps you find the part of the haystack the needle is in and not just the needle itself. So if I scatter that part of the haystack someone else might find the needle.

So with the help of Ross and Tom, I set up a Scrape_No10 Twitter account to tweet out the meetings, gifts and hospitalities at No.10 in chronological order (hint: for completely erratic datetime formats, code it all as lexicographically sortable text, e.g. 2010-09-14, and you can order it fine). The idea being that not only is each row of data in the public domain, but I can fix the hashtag (#Scrape10). I’ve set the account to tweet out 3 tweets every 3 hours. That way a single tweet has the potential (depending on how many followers it can attract) to cause the hashtag to trend. And that is the sign that someone somewhere has added an interesting piece of information. In fact, Paul Lewis, investigative journalist at the Guardian, noted “Tweets have an uncanny ability to find their destination”. So rather than filtering social media to find newsworthy information, I’m using social media as a filter for public information and as a crowdsourcing alert system.
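The lexicographic trick works because yyyy-mm-dd strings sort the same way the dates themselves do:

```python
dates = ["2010-09-14", "2010-10-01", "2009-12-31"]

# A plain string sort is also a chronological sort for yyyy-mm-dd text
assert sorted(dates) == ["2009-12-31", "2010-09-14", "2010-10-01"]
```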

I ran this as an experiment, explaining all on my own blog. I was surprised when a certain ScraperWiki star wanted the Twitter bot code, so I’ve copied it and taken out all the authorisation keys. You can view it here and use it as a template. This blog post explains very clearly how to set things up at the Twitter end. So let’s get your data out into the open (but no spamming). If you make a Twitter bot, please let me know the username. You can email me at nicola(at) I’d like to keep tabs on them all in the blog roll.

Oh and the script for getting dates into lexicographical format (SQL can then order it) is:

import datetime
import dateutil.parser

def parsemonths(d):
    # Parse an erratic date string into a datetime.date; SQLite stores
    # it as ISO yyyy-mm-dd text, which sorts chronologically
    d = d.strip()
    return dateutil.parser.parse(d, yearfirst=True, dayfirst=True).date()

data['Date'] = parsemonths(data['Date'])
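If python-dateutil isn’t available, a stdlib-only fallback can try a handful of formats in turn. This is a sketch: the format list is my guess at what the Cabinet Office spreadsheets might contain, not taken from the original scraper.

```python
import datetime

def parse_date(d, formats=("%d/%m/%Y", "%d %B %Y", "%d-%m-%Y", "%Y-%m-%d")):
    # Return an ISO yyyy-mm-dd string so SQL's text ordering is chronological
    d = d.strip()
    for fmt in formats:
        try:
            return datetime.datetime.strptime(d, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError("unrecognised date: %r" % d)
```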