Mastering space and time with jQuery deferreds

A screenshot of Rod Taylor enthusiastically grabbing a lever on his Time Machine in the 1960 film of the same name.

Recently Zarino and I were pairing on making improvements to a new scraping tool on ScraperWiki. We were working on some code that allows the person using the tool to pick out parts of some scraped data in order to extract a date into a new database column. For processing the data on the server side we were using a little helper library called scrumble, which does some cleaning in Python to produce dates in a standard format. That's great for the server side, but we also needed to display a preview of the cleaned dates to the user before the data is finally sent to the server for processing.

Rather than rewrite this Python code in JavaScript we thought we’d make a little script which could be called using the ScraperWiki exec endpoint to do the conversion for us on the server side.

Our code looked something like this:

var $tr = $('<tr>');

// for each cell in each row…
$.each(row, function (index, value) {
  var $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  // execute this command on the server…
  scraperwiki.exec('tools/do-scrumble.py ' + date, function(response){
    // and put the result into this table cell…
    $td.html(JSON.parse(response));
  });
  $td.appendTo($tr);
});

Each time we needed to process a date with scrumble, we made a call to our server-side Python script via the exec endpoint. When the converted value came back from the server, the callback function set the content of the corresponding table cell.

However, when we started testing our code we hit a limit placed on the exec endpoint to prevent overloading the server (currently no more than 5 exec calls can be executing at once).

Our first thought was to just limit the rate at which we made requests so that we didn’t trip the rate limit, but our colleague Pete suggested we should think about batching up the requests to make them faster. Sending each one individually might work well with just a few requests, but what about when we needed to make hundreds or thousands of requests at a time?

How could we change it so that the conversion requests were batched, and the results were inserted into the right table cells once they’d been computed?

jQuery.Deferred() to the rescue

We realised that we could use jQuery deferreds to allow us to do the batching. A deferred is like an I.O.U. that says that at some point in the future a result will become available. Anybody who’s used jQuery to make an AJAX request will have used a deferred – you send off a request, and specify some callbacks to be executed when the request eventually succeeds or fails.
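
As a minimal standalone sketch (this isn't part of the tool's code), creating a deferred by hand, handing out its promise, and later resolving it looks like this:

// create the I.O.U…
var deferred = $.Deferred();

// hand out a promise and say what to do when the value arrives…
deferred.promise().done(function (value) {
  console.log('got the value:', value);
});

// …and, some time later, honour the I.O.U.
deferred.resolve('2013-08-28');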

By returning a deferred we could delay the call to the server until all of the values to be converted have been collected and then make a single call to the server to convert them all.

Below is the code which does the batching:

scrumble = {
  deferreds: {},

  as_date: function (raw_date) {
    if (!this.deferreds[raw_date]) {
      var d = $.Deferred();
      this.deferreds[raw_date] = d;
    }
    return this.deferreds[raw_date].promise();
  },

  process_dates: function () {
    var self = this;
    var raw_dates = _.keys(self.deferreds);
    var date_list = scraperwiki.shellEscape(JSON.stringify(raw_dates));
    var command = 'tools/do-scrumble-batch.py ' + date_list;
    scraperwiki.exec(command, function (response) {
      var response_object = JSON.parse(response);
      $.each(response_object, function(key, value){
        self.deferreds[key].resolve(value);
      });
    });
  }
}

Each time as_date is called it creates or reuses a deferred which is stored in an object keyed on the raw_date string and then returns a promise (a deferred with a restricted interface) to the caller. The caller attaches a callback to the promise that will use the value once it is available.

To actually send the batch of dates off to be converted, we call the process_dates method. It makes a single call to the server with all of the strings to be processed. When the result comes back from the server, it “resolves” each of the deferreds with the processed value, which causes all of the callbacks to fire, updating the user interface.

With this design the changes we had to make to our code were minimal. It was already using a callback to set the value of the table cell. It was just a case of attaching that callback to the jQuery promise returned by scrumble.as_date and then, once all of the items had been added, calling scrumble.process_dates to make the single server-side call that converts all of the dates.

var $tr = $('<tr>');

$.each(row, function (index, value) {
  var $td = $('<td>');
  var date = scraperwiki.shellEscape(JSON.stringify(value));
  scrumble.as_date(date).done(function(response){
    $td.html(JSON.parse(response));
  });
  $td.appendTo($tr);
});

scrumble.process_dates();

Now, instead of one call being made for every value that needs converting (whether or not that string has already been processed), a single call is made to convert all of the values at once. When the response comes back from the server, the promises are resolved and the user interface updates, showing the user the preview as required. jQuery deferreds allowed us to make this change with minimal disruption to our existing code.

And it gets better…

Further optimisation (not shown here) is possible if process_dates is called multiple times. A little-known feature of jQuery deferreds is that they can only be resolved once – and once resolved, they remember their result. If you make an AJAX call like $.get('http://foo').done(myCallback) and then, some time later, call .done(myCallback) on that AJAX response again, the callback myCallback is immediately called with the exact same arguments as before. It’s like magic.
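
Here's a minimal sketch of that behaviour (standalone, not taken from the tool):

var d = $.Deferred();

d.done(function (value) { console.log('first handler:', value); });

d.resolve('2013-08-28');  // fires the first handler

// attaching a handler *after* resolution fires it immediately,
// with the same argument the deferred was resolved with
d.done(function (value) { console.log('late handler:', value); });

// a second d.resolve() would be ignored – a deferred only resolves once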

We realised we could turn this quirky feature to our advantage. Rather than checking whether we’d already converted a date, and returning the pre-converted value ourselves on subsequent calls, we just attach the .done() callback to that date’s deferred regardless, as if it were the first time. Deferreds that have already been resolved fire their callbacks immediately, meaning we only need to send requests to the server for dates that haven’t been processed yet.
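
That version of process_dates might look something like this – a sketch rather than the tool's actual code, and it assumes jQuery 1.7+ for deferred.state():

// inside the scrumble object, replacing process_dates…
process_dates: function () {
  var self = this;
  // only batch up dates whose deferreds haven't been resolved yet
  var pending_dates = _.filter(_.keys(self.deferreds), function (raw_date) {
    return self.deferreds[raw_date].state() === 'pending';
  });
  if (pending_dates.length === 0) { return; }  // nothing new to convert

  var date_list = scraperwiki.shellEscape(JSON.stringify(pending_dates));
  scraperwiki.exec('tools/do-scrumble-batch.py ' + date_list, function (response) {
    $.each(JSON.parse(response), function (key, value) {
      self.deferreds[key].resolve(value);
    });
  });
}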

jQuery deferreds helped us keep our user interface responsive, our network traffic low, and our code refreshingly simple. Not bad for a mysterious set of functions hidden halfway down the docs.

Scraping PDFs: now 26% less unpleasant with ScraperWiki

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.

So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.

Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.

The PDF to HTML Preview.

It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.

Scraping PDFs: the problem…

Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.

It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.

Adobe HQ in California. Locals say that only one person works inside – a reference to PDFs' bloated file sizes.

ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml library in our (incredibly basic) tutorial.
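
The XML it produces looks roughly like this – a made-up, illustrative fragment, with each piece of text carrying its position on the page (the exact attributes vary):

<page number="1" top="0" left="0" height="1263" width="892">
  <text top="103" left="108" width="291" height="22" font="0">Annual report 2009–10</text>
  <text top="142" left="108" width="64" height="16" font="1">Department</text>
  <text top="142" left="420" width="58" height="16" font="1">Spending</text>
  ...
</page>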

But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.

…and the solution

So, the PDF previewer works as follows:

  • Grabs the data. Gets the XML using pdftoxml.
  • Outputs as HTML. Outputs each PDF page as an absolute-positioned <div>.
  • Adds JavaScript onclick events. Attaches simple events so that when you click on a word or phrase, you see the underlying XML (a rough sketch of this is below).
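
A sketch of that last step might look something like this – purely illustrative, with made-up class names and element IDs rather than the Preview's actual markup:

// assume each word from the PDF is rendered as a <span class="pdf-word">
// carrying its underlying XML in a data-xml attribute
$('.pdf-word').click(function () {
  var xml = $(this).attr('data-xml');   // the original <text> element from pdftoxml
  $('#xml-inspector').text(xml);        // show it alongside the page
});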

Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!

We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.
