javascript – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Programmers past, present and future
Tue, 04 Jun 2013 09:20:38 +0000

As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only came to code and data tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

Zappia's hierarchy of coder self-actualisation

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through Javascript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs, and regular expressions soon followed. Then a second language, Python, and a third, Shell. I don’t know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that I will sometime soon.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back in the beginning. Progressing from a point where websites are just like posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling, is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what has handily been labelled on the above diagram, “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. Javascript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t) but truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, Javascript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.

Book review: JavaScript: The Good Parts by Douglas Crockford
Tue, 23 Apr 2013 08:47:06 +0000

JavaScript: The Good Parts

This week I’ve been programming in JavaScript, something of a novelty for me. Jealous of the Dear Leader’s “automatically summarize” tool, I wanted to make something myself; hopefully a future post will describe my timeline visualising tool. Further motivations are that web scraping requires some knowledge of JavaScript, since it is a key browser technology, and that, in its prototypical state, the ScraperWiki platform sometimes requires you to launch a console and type in JavaScript to do stuff.

I have two books on JavaScript. The one I review here is JavaScript: The Good Parts by Douglas Crockford – a slim volume which tersely describes what the author feels are the best bits of JavaScript, incidentally highlighting the bad bits. The second book is the JavaScript Bible by Danny Goodman, Michael Morrison, Paul Novitski and Tia Gustaff Rayl, which I bought some time ago, impressed by its sheer bulk, but which I am unlikely ever to read, let alone review!

Learning new programming languages is easy in some senses: it’s generally straightforward to get something to happen simply because core syntax is common across many languages. The only seriously different language I’ve used is Haskell. The difficulty with programming languages is idiom. The parallel is with human languages: the barrier to making yourself understood in a language is low, but to speak fluently and elegantly needs a higher level of understanding which isn’t simply captured in grammar. Programming languages are by their nature flexible, so it’s quite possible to write one in the style of another; whether you should do this is another question.

My first programming language was BASIC; I suspect I speak all other computer languages with a distinct BASIC accent. As an aside, Edsger Dijkstra has said:

[…] the teaching of BASIC should be rated as a criminal offence: it mutilates the mind beyond recovery.

So perhaps there is no hope for me.

JavaScript has always felt to me a toy language: it originates in a web browser and relies on HTML to import libraries but nowadays it is available on servers in the form of node.js, has a wide range of mature libraries and is very widely used. So perhaps my prejudices are wrong.

The central idea of JavaScript: The Good Parts is to present an ideal subset of the language, the Good Parts, and ignore the less good parts. The particular bad parts of which I was glad to be warned:

  • JavaScript arrays aren’t proper arrays with array-like performance, they are weird dictionaries;
  • variables have function not block scope;
  • unless declared inside a function variables have global scope;
  • there is a difference between the equality == and === (and similarly the inequality operators). The short one coerces and then compares, the longer one does not, and is thus preferred. 

I liked the railroad presentation of syntax and the section on regular expressions is good too.

Railroad syntax diagram - for statement

Elsewhere Crockford has spoken approvingly of CoffeeScript, which compiles to JavaScript but is arguably syntactically nicer; it appears to hide some of the bad parts of JavaScript which Crockford identifies.

If you are new to JavaScript but not to programming then this is a good book which will give you a fine start and warn you of some pitfalls. You should be aware that you are reading about Crockford’s ideal not the code you will find in the wild.

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)
Fri, 25 May 2012 07:13:33 +0000

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time and rich data would revolutionize their governance and administration, would give real evidence of whether there are too many or too few police, fire or ambulance personnel/vehicles/stations in any locale, and would enable the implementation of imaginative and realistic policies resulting from major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add or take away staff depending on their workloads. The problem is, for an emergency service that exists on standby for unforeseen events, there needs to be a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization. Indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance where the Excel spreadsheet of Fire Spending By city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the story journalism on it doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails(), MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static url, eg:

MainIndex() operates their search incidents form, grabbing 10 days at a time and saving the URL of each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parsing the obvious metadata, and saving the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move it into the third function so I could develop it more easily.) A good way to find the list of urls that have not been downloaded and saved into the swdetails is to use the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
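The batch loop described above can be sketched roughly as follows. This is a minimal illustration rather than the scraper's actual code: it uses Python's built-in sqlite3 module with an in-memory database instead of the scraperwiki library, and the fetch_html function and example.com URLs are hypothetical stand-ins for the real download step.

```python
import sqlite3

def fetch_html(url):
    # Hypothetical stand-in for the real HTTP download of an incident page.
    return "<div>details for %s</div>" % url

def download_missing(conn, batch_size=5):
    """Repeat the left-join query until every swdata row has a match in swdetails."""
    query = """
        select swdata.IncidentID, swdata.urlpage
        from swdata
        left join swdetails on swdetails.IncidentID = swdata.IncidentID
        where swdetails.IncidentID is null
        limit ?
    """
    total = 0
    while True:
        rows = conn.execute(query, (batch_size,)).fetchall()
        if not rows:
            break  # no unmatched records left
        for incident_id, urlpage in rows:
            conn.execute(
                "insert into swdetails (IncidentID, divdetails) values (?, ?)",
                (incident_id, fetch_html(urlpage)),
            )
        conn.commit()
        total += len(rows)
    return total

# Set up two toy tables with twelve made-up incidents and run the loop.
conn = sqlite3.connect(":memory:")
conn.execute("create table swdata (IncidentID text, urlpage text)")
conn.execute("create table swdetails (IncidentID text, divdetails text)")
conn.executemany(
    "insert into swdata values (?, ?)",
    [(str(i), "http://example.com/incident/%d" % i) for i in range(12)],
)
downloaded = download_missing(conn)
print(downloaded)
```

Because the query only ever returns rows with no partner in swdetails, the loop can be interrupted and restarted without downloading anything twice.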

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving it into the table swparse. Because I was developing this function experimentally to see how much information I could obtain from the free-form text, I had to frequently drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(?i)(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)", details)

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)
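As a small worked example, here is the same pattern applied to the sample text above. The pattern is compiled with re.IGNORECASE instead of an inline flag, which is an equivalent spelling. The second call illustrates the first-name/last-name flaw; "Jane Doe" is a made-up name, not one taken from the data.

```python
import re

RANK_RE = re.compile(
    r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)",
    re.IGNORECASE,
)

details = (
    "One appliance from Holsworthy was mobilised to reports of a motorbike on fire. "
    "Crew Commander Squirrell was in charge."
)
officers = RANK_RE.findall(details)
print(officers)   # [('Crew', 'Commander', 'Squirrell')]

# With a first name and a last name, only the first word after the rank is captured.
two_names = RANK_RE.findall("Watch Manager Jane Doe was in charge.")
print(two_names)  # [('Watch', 'Manager', 'Jane')]
```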

For detecting how many vehicles were in attendance, I used this algorithm:

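import re

# details (the incident text) and data (the output record) come from the surrounding scraper.
# Each match is a tuple: (count word, "fire" or "rescue", vehicle noun, town)
appliances = re.findall(r"(?i)(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details)
nvehicles = 0
for scount, fire, engine, town in appliances:
    # Record the first town mentioned as the station the appliance came from
    if town and "town" not in data:
        data["town"] = town.lower()
    # Convert the count word to a number; unrecognised words count as zero
    if re.match("(?i)one|1|an?|another", scount):  count = 1
    elif re.match("(?i)two|2", scount):            count = 2
    elif re.match("(?i)three", scount):            count = 3
    elif re.match("(?i)four", scount):             count = 4
    else:                                          count = 0
    nvehicles += count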
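For anyone wanting to try this outside the scraper, here is a self-contained variation on the same idea. It is a sketch rather than the scraper's code: the function name count_vehicles and the COUNT_WORDS table are my own inventions, and it looks up the count word exactly instead of matching a prefix, so a word like "10" counts as zero instead of one. The second test sentence is made up.

```python
import re

APPLIANCE_RE = re.compile(
    r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)"
    r"(?: from ([A-Za-z]+))?",
    re.IGNORECASE,
)

# Count words mapped to numbers; anything unrecognised counts as zero.
COUNT_WORDS = {"one": 1, "1": 1, "a": 1, "an": 1, "another": 1,
               "two": 2, "2": 2, "three": 3, "four": 4}

def count_vehicles(details):
    """Return (number of vehicles, first town mentioned or None)."""
    nvehicles = 0
    town = None
    for scount, _kind, _noun, from_town in APPLIANCE_RE.findall(details):
        if from_town and town is None:
            town = from_town.lower()
        nvehicles += COUNT_WORDS.get(scount.lower(), 0)
    return nvehicles, town

real = count_vehicles(
    "One appliance from Holsworthy was mobilised to reports of a motorbike on fire."
)
invented = count_vehicles(
    "Two fire engines from Torquay and another appliance from Paignton attended."
)
print(real)      # (1, 'holsworthy')
print(invented)  # (3, 'torquay')
```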

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day each incident took place, and produced this time rose, which is a polar bar graph with one sector showing the number of incidents occurring each hour.

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.
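The numbers behind a time rose are simply a count of incidents per hour of the day. A minimal sketch of that bucketing step is below; the function name, the timestamp format and the sample timestamps are all assumptions for illustration, not what the visualization actually uses.

```python
from collections import Counter
from datetime import datetime

def incidents_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a 24-element list: the number of incidents in each hour of the day."""
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

# Made-up timestamps for illustration only.
sample = ["2012-05-23 09:54", "2012-05-23 09:10", "2012-05-24 17:30"]
sectors = incidents_by_hour(sample)
print(sectors[9], sectors[17])  # 2 1
```

Each element of the list then becomes the length of one sector of the polar bar graph. Filtering by weekday, category or town is a matter of narrowing the list of timestamps before counting.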

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.
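Before reaching for a directions service, a crude first estimate of how far an appliance travelled is the straight-line (great-circle) distance between the station and the incident. The sketch below uses the standard haversine formula; it is my own illustration, it ignores roads entirely, and the coordinates in the usage lines are invented rather than taken from the data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Invented coordinates for a station and an incident, for illustration only.
station = (50.46, -3.53)
incident = (50.44, -3.56)
distance = haversine_km(station[0], station[1], incident[0], incident[1])
print(round(distance, 1))
```

A response that came from a station much further away than the nearest one, by this measure, would be a candidate for the "over capacity" question.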

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three different historically independent services?

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

Job advert: Lead programmer
Thu, 27 Oct 2011 11:04:47 +0000

Oil wells, marathon results, planning applications…

ScraperWiki is a Silicon Valley style startup, in the North West of England, in Liverpool. We’re changing the world of open data, and how data science is done together on the Internet.

We’re looking for a programmer who’d like to:

  • Revolutionise the tools for sharing data, and code that works with data, on the Internet.
  • Take a lead in a lean startup, having good hunches on how to improve things, but not minding when A/B testing means axing weeks of code.

In terms of skills:

  • Be polyglot enough to be able to learn Python, and do other languages (Ruby, Javascript…) where necessary.
  • Be able to find one end of a clean web API and a test suite from another.
  • We’re a small team, so need to be able to do some DevOps on Linux servers.
  • Desirable – able to make igloos.

About ScraperWiki:

  • We’ve got funding (Knight News Challenge winners) and are in the brand new field of “data hubs”.
  • We’re before product/market fit, so it’ll be exciting, and you can end up a key, senior person in a growing company.

Some practical things:

We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.

Must be willing to either relocate to Liverpool, or able to work from home and travel to our office here regularly (once a week). So somewhere nearby preferred.

To apply – send the following:

  • A link to a previous project that you’ve worked on that you’re proud of, or a description of it if it isn’t publicly visible.
  • A link to a scraper or view you’ve made on ScraperWiki, involving a dataset that you find interesting for some reason.
  • Any questions you have about the job.

Along to with the word swjob2 in the subject (and yes, that means no agencies, unless the candidates do that themselves)

Pool temperatures, company registrations, dairy prices…

Job advert: Product / UX lover
Mon, 14 Feb 2011 15:35:38 +0000

ScraperWiki is a Silicon Valley style startup, but based in the UK. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web product designer who is…

  • Able to make design decisions to launch features by themselves.
  • Capable of writing CSS and HTML, and some Javascript.

Other bits…

  • Loves to balance colour, size, order and prominence on websites.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Either good at working remotely, or willing to relocate to the North West.
  • Desirable – able to make igloos.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation you’ve made of some data (any data!)
  • Oh, and I guess we’d better see your CV

Along to with the word swjob2 in the subject.

Job advert: Web designer/programmer
Wed, 05 Jan 2011 11:29:30 +0000

Care about oil spills, newspapers or lost cats?

ScraperWiki is a Silicon Valley style startup, but in the North West of England, in Liverpool. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web designer/programmer who is…

  • Capable of writing standards compliant CSS and HTML, and some Javascript.
  • Loves to balance colour, size, order and prominence on websites.
  • Good enough at Photoshop to make any mockups and icons required.
  • Likes to talk to and track users, and then do what’s needed to make their experience better.
  • Server-side coding (Python) a plus but not essential.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Desirable – able to make igloos.

Some practical things…

  • We’re early stage, spending our seed funding. So be aware things will go either way – we’ll crash and burn, or you’ll be a key, senior person in a growing company.
  • We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.
  • Must be willing to either relocate to Liverpool, or able to work from home and travel here regularly (once a week). So somewhere nearby preferred.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation of some data (any data!)

Along to with the word swjob1 in the subject.

Scraping PDFs: now 26% less unpleasant with ScraperWiki
Fri, 17 Dec 2010 10:20:53 +0000

Got a PDF you want to get data from?
Try our easy web interface over at!

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.

So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.

Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.

The PDF to HTML Preview.

It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.

Scraping PDFs: the problem…

Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.

It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.

Adobe HQ in California. Locals say that only one person works inside - a reference to PDFs' bloated filesize.

ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml library in our (incredibly basic) tutorial.

But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.

…and the solution

So, the PDF previewer works as follows:

  • Grabs the data. Gets the XML using pdftoxml.
  • Outputs as HTML. Outputs each PDF page as an absolute-positioned <div>.
  • Adds Javascript onclick events. Attaches simple events so that when you click on a word or phrase, you see the underlying XML.
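The first two steps above can be sketched in a few lines of Python. This is not the previewer's actual code. The sample XML is my assumption about the general shape of pdftoxml output (page elements containing text elements with top and left attributes), and page_to_html is a hypothetical helper. As a simplification, the sketch puts the underlying XML in a title attribute instead of attaching a Javascript click event, and makes the page container relatively positioned so that several pages would stack.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed shape of the XML; real output will contain more elements and attributes.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="200" height="14" font="0">Hello PDF</text>
</page>
</pdf2xml>"""

def page_to_html(page):
    """Render one page element as a container div holding positioned text divs."""
    parts = ['<div class="page" style="position:relative; width:%spx; height:%spx">'
             % (page.get("width"), page.get("height"))]
    for text in page.iter("text"):
        # Serialise the element back to XML so it can be shown to the user.
        raw = ET.tostring(text, encoding="unicode").strip()
        parts.append(
            '<div style="position:absolute; top:%spx; left:%spx" title="%s">%s</div>'
            % (text.get("top"), text.get("left"),
               escape(raw, quote=True), escape("".join(text.itertext())))
        )
    parts.append("</div>")
    return "\n".join(parts)

root = ET.fromstring(SAMPLE_XML)
html_pages = [page_to_html(page) for page in root.iter("page")]
print(html_pages[0])
```

Hovering over a piece of text in the resulting HTML shows the XML element it came from, which is the same matching-up job the Preview does on click.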

Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!

We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.
