Tom Mortimer-Jones – ScraperWiki Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000 en-US hourly 1 58264007 Mastering space and time with jQuery deferreds Wed, 28 Aug 2013 16:00:32 +0000 A screenshot of Rod Taylor enthusiastically grabbing a lever on his Time Machine in the 1960 film of the same nameRecently Zarino and I were pairing on making improvements to a new scraping tool on ScraperWiki. We were working on some code that allows the person using the tool to pick out parts of some scraped data in order to extract a date into a new database column. For processing the data on the server side we were using a little helper library called scrumble which does some cleaning in Python to produce dates in a standard format. Which is great for server-side, but we also needed to display a preview of the cleaned dates to the user, before it’s finally sent to the server for processing.

Rather than rewrite this Python code in JavaScript we thought we’d make a little script which could be called using the ScraperWiki exec endpoint to do the conversion for us on the server side.

Our code looked something like this:

var $tr = $('<tr>');

// for each cell in each row…
$.each(row, function (index, value) {
  $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // execute this command on the server…
  scraperwiki.exec('tools/ ' + date, function(response){
    // and put the result into this table cell…

Each time we needed to process a date with scrumble we made a call to our server side Python script via the exec endpoint. When the value comes back from the server, the callback function sets the content of the table cell to the value.

However when we started testing our code we hit a limit placed on the exec endpoint to prevent overloading the server (currently no more that 5 exec calls can be executing at once).

Our first thought was to just limit the rate at which we made requests so that we didn’t trip the rate limit, but our colleague Pete suggested we should think about batching up the requests to make them faster. Sending each one individually might work well with just a few requests, but what about when we needed to make hundreds or thousands of requests at a time?

How could we change it so that the conversion requests were batched, and the results were inserted into the right table cells once they’d been computed?

jQuery.deferred() to the rescue

We realised that we could use jQuery deferreds to allow us to do the batching. A deferred is like an I.O.U that says that at some point in the future a result will become available. Anybody who’s used jQuery to make an AJAX request will have used a deferred – you send off a request, and specify some callbacks to be executed when the request eventually succeeds or fails.

By returning a deferred we could delay the call to the server until all of the values to be converted have been collected and then make a single call to the server to convert them all.

Below is the code which does the batching:

scrumble = {
  deferreds: {},

  as_date: function (raw_date) {
    if (!this.deferreds[raw_date]) {
      d = $.Deferred()
      this.deferreds[raw_date] = d;
    return this.deferreds[raw_date].promise()

  process_dates: function () {
    var self = this;
    var raw_dates = _.keys(self.deferreds);
    var date_list = scraperwiki.shellEscape(JSON.stringify(raw_dates));
    var command = 'tool/ ' + date_list;
    scraperwiki.exec(command, function (response) {
      response_object = JSON.parse(response);
      $.each(response_object, function(key, value){

Each time as_date is called it creates or reuses a deferred which is stored in an object keyed on the raw_date string and then returns a promise (a deferred with a restricted interface) to the caller. The caller attaches a callback to the promise that will use the value once it is available.

To actually send the batch of dates off to be converted, we call the process_dates method. It makes a call to the server with all of the strings to be processed. When the result comes back from the server it “resolves” each of the deferreds with the processed value, which causes all of the callbacks to fire updating the user interface.

With this design the changes we had to make to our code were minimal. It was already using a callback to set the value of the table cell. It was just a case of attaching it to the jQuery promise returned by the scrumble.as_date method and calling scrumble.process_dates, after all of the items had been added, to make the server side call to convert all of the dates.

var $tr = $('<tr>');

$.each(row, function (index, value) {
  $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));


Now instead of one call being made for every value that needs converting (whether or not that string has already been processed) a single call is made to convert all of the values at once. When the response comes back from the server, the promises are resolved and the user interface updates showing the user the preview as required. jQuery deferreds allowed us to make this change with minimal disruption to our existing code.

And it gets better…

Further optimisation (not shown here) is possible if process_dates is called multiple times. A little-known feature of jQuery deferreds is that they can only be resolved once. If you make an AJAX call like $.get('http://foo').done(myCallback) and then, some time later, call .done(myCallback) on that ajax response again, the callback myCallback is immediately called with the exact same arguments as before. It’s like magic.

We realised we could turn this quirky feature to our advantage. Rather than checking whether we’d already converted a date, and returning the pre-converted date on subsequent calls, rather than adding them to the queue to be processed, we just call the deferred .done() callback regardless, as if this was the first time. Deferreds that have already been handled are returned immediately, meaning we only send requests to the server if there are new dates that haven’t been processed yet.

jQuery deferreds helped us keep our user interface responsive, our network traffic low, and our code refreshingly simple. Not bad for a mysterious set of functions hidden halfway down the docs.

ScraperWiki Downtime Mon, 27 Jun 2011 16:24:56 +0000 In order to allow for essential maintenance (our diggers been digging the web for so long it’s in dire need of an MOT!) we’re going to have to take the ScraperWiki website down for a short amount of time on Wednesday (29th June 2011). The site will be down from 13:00 UTC (14:00 BST).

We hope to keep the amount of downtime to under an hour. There will be no cupcakes if it goes over an hour!

A holding page will be displayed during the outage and when it is removed the site will be available to use again.

ScraperWiki adds Ruby as its third language Tue, 07 Sep 2010 17:13:30 +0000 We’re very pleased to announce that the third official ScraperWiki language is Ruby!

Ruby was a much requested enhancement to ScraperWiki and complements the two existing languages Python and PHP. We hope that its inclusion will encourage a new community of developers to start writing screen scrapers.

For more information about ScraperWiki see our FAQ.

The Ruby logo is copyright © 2006, Yukihiro Matsumoto. It is released under the terms of the Creative Commons Attribution-ShareAlike 2.5 License.

Video: Liverpool Hacks and Hackers Hack Day Wed, 11 Aug 2010 13:07:27 +0000 Liverpool John Moores University Open Labs has just released a video of our Hacks Meet Hackers event that took place in Liverpool last month.

The video gives a flavour of what happened when journalist, bloggers, software developers and artists came together to work on interesting and novel ways of exploring and using public data. You can read a roundup of the day at this link.

Video produced by The Hatch on behalf of Open Labs:

]]> 7 758213772