Developer – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Saving time with GOV.UK design standards
Thu, 04 Feb 2016 08:46:36 +0000

While building the Civil Service People Survey (CSPS) site, ScraperWiki had to deal with the complexities of suppressing data to avoid privacy leaks and making technology to process tens of millions of rows in a fraction of a second.

We also didn’t have time to spend on basic web design. Luckily the Government’s Resources for designers, part of the Government Service Design Manual, saved us from having to.

In this blog post I talk through specific things where the standards saved us time and increased quality.

This is useful for managers – it’s important to know some details to avoid getting too distant from how projects really work. If you’re a developer or designer who’s about to make a site for the UK Government, there are lots of practical links!

Header and footer

To style the header and the footer we used the govuk_template, via a mustache version which automatically updates itself from the original. This immediately looks good.

CSPS about page

The look takes advantage of years of GDS work to be responsive, accessible and have considered typography. Without the template we’d have made something useful, but ugly and inaccessible unless we had extra budget for design work.

It also reduces maintenance. The templates are constantly updated, and every now and again we quickly update the copy of them that we include. This keeps our design up to date with the standard, fixes bugs, and ensures compatibility with new devices.


The template doesn’t have styling for your content. That’s over in a separate module called the govuk_frontend_template. It has a bunch of useful CSS and JavaScript pieces. GOV.UK elements is a good guide to how to use them, with its live examples.

For example, there aren’t many forms and buttons in CSPS, but the ones there are look good.

CSPS login form

The frontend template is full of useful tiny things, such as easily styling external links.

CSPS external link

And making just the right alpha or beta banner.

CSPS alpha banner

The tools aren’t perfect. We would have liked slightly more styling to use inside the pages. Some inside Government are arguing for a comprehensive framework similar to Bootstrap.

All I really want is not to have to define my own <h1>!


Privacy is important to users. Despite the flaws in the EU regulations on telling consumers about browser cookies, the goal of informing people how they are being tracked is really important.

Typically in a web project this would involve a bit of discussion over how and when to do so, and what it should look like. For us, it just magically came with the Government’s template.

CSPS cookies

I say magically, but we also had to carefully note down all the cookies we use. A useful thing to do anyway!

Error messages

There are now lots of recently made Government digital services to poach bits of web design from.

The probate service has some interesting tabs and other navigation. The blood donation service dashboard has grey triangles to show data increased/decreased. The new digital marketplace has a complex search form with various styles.

I can’t even remember where we took this error message format from – lots of places use it.

CSPS login error

If you’d like to find more, doing a web search for is a great way to start exploring.

Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live.

The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich. This means that it is designed to give you facts and figures and provide an interface for you to interactively select and view data of your interest, in this case to query the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is the option of “filter by year” which selects vehicles by the first year on the road in order to narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults. If you want to break each down into more detailed descriptions, click the ‘detail’ button:


What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead. The ScraperWiki MOT website, by contrast, uses a dictionary stored in memory as a data source. A dictionary is a data structure in Python similar to a HashMap in Java. It consists of key-value pairs and is known to be efficient for fast lookups. But where do we get a dictionary from and what should be its structure? Let’s back up a little bit, maybe to the very beginning. The following general procedure was followed to get us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis was carried out. This took the form of joining tables, grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built using this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised this text file is converted to a nested dictionary. That is to say, it is a dictionary of dictionaries: the value associated with a key is another dictionary, which can in turn be accessed using a different key.

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
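The build-and-query steps described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the post does not give the real file layout, so the tab-separated format, the sample rows, the result codes `P` and `F`, and the function names `build_index` and `pass_rate` are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of the aggregated export:
# make, model, year, testresult, count (tab-separated).
SAMPLE = """\
FORD\tFIESTA\t2008\tP\t700
FORD\tFIESTA\t2008\tF\t300
FORD\tFIESTA\t2009\tP\t450
FORD\tFIESTA\t2009\tF\t50
FORD\tFOCUS\t2008\tP\t80
FORD\tFOCUS\t2008\tF\t20
"""


def build_index(lines):
    """Build a nested dictionary: index[make][model][year][testresult] = count."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for make, model, year, result, count in csv.reader(lines, delimiter="\t"):
        index[make][model][int(year)][result] = int(count)
    return index


def pass_rate(index, make, model, year=None):
    """Pass rate for a make and model, aggregated across all years if year is None."""
    by_year = index[make][model]
    years = [year] if year is not None else list(by_year)
    passed = sum(by_year[y].get("P", 0) for y in years)
    total = sum(sum(by_year[y].values()) for y in years)
    return passed / total


index = build_index(io.StringIO(SAMPLE))
print(sorted(index["FORD"]))             # models available for the selected make
print(sorted(index["FORD"]["FIESTA"]))   # years available for the selected model
print(pass_rate(index, "FORD", "FIESTA", 2008))
print(pass_rate(index, "FORD", "FIESTA"))
```

Every lookup here is a plain in-memory dictionary access, which is why no database connection is needed once the file has been loaded.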

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Book review: Docker Up & Running by Karl Matthias and Sean P. Kane
Fri, 17 Jul 2015 11:00:56 +0000

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology which is designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut-down virtual machine. From this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server which hosts the running Docker containers. The client part of this system will run on Windows, Mac and Linux systems. The server will only run on Linux due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server. boot2docker uses a minimal Linux virtual machine to run the server, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker Image, which can be handed over to an operations team for deployment without to and froing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.

Docker is not a “total” solution, it has nothing to say about triggering builds, or bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today gone tomorrow or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.
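As a rough illustration of keeping configuration and state outside the container, the sketch below reads its settings from environment variables rather than from files baked into the image. The variable names `DATABASE_URL` and `S3_BUCKET`, the default bucket name and the `load_settings` helper are all made up for this example.

```python
import os


def load_settings(environ=os.environ):
    """Collect settings from the environment so the container itself holds no state."""
    return {
        # Hypothetical variable names, supplied when the container is started.
        "database_url": environ["DATABASE_URL"],                   # required: fail fast if missing
        "s3_bucket": environ.get("S3_BUCKET", "example-bucket"),   # optional, with a default
    }


# A plain dictionary stands in for the real environment here.
settings = load_settings({"DATABASE_URL": "postgres://db.example.com/app"})
print(settings)
```

Because nothing is stored inside the container, it can be thrown away and restarted at any time; the values would typically be supplied at start-up, for instance with the `-e` option of `docker run`.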

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious, for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Dr Ian Hopkinson is Senior Data Scientist at ScraperWiki, where we often use Docker to help customers manage their data. You can read more about our professional services here.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
Mon, 06 Jul 2015 10:00:46 +0000

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce, a sort of IPython of Big Data. I used my traditional approach to learning more, which is buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor, the file may be located on some Hadoop style system, or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD, for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output, for example count() returns the number of elements in an RDD.
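The distinction between transformations and actions can be illustrated without a cluster. The toy class below is not Spark and does not use the Spark API; `ToyRDD` is a hypothetical, single-machine stand-in that only mimics the idea that transformations describe a new dataset lazily while actions force a result.

```python
class ToyRDD:
    """A single-machine stand-in for an RDD (illustration only, not the Spark API)."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument function returning a fresh iterator,
        # so the data can be recomputed each time an action runs.
        self._make_iter = make_iter

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._make_iter() if predicate(x)))

    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._make_iter()))

    # Actions: run the pipeline and return an ordinary Python value.
    def count(self):
        return sum(1 for _ in self._make_iter())

    def collect(self):
        return list(self._make_iter())


lines = ToyRDD.parallelize(
    ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
)
errors = lines.filter(lambda line: line.startswith("ERROR"))  # transformation
print(errors.count())                                         # action
print(errors.map(str.lower).collect())                        # action
```

In real Spark the same shape applies, with the important difference that the data and the work are spread over many nodes.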

Spark provides functionality to control how parts of an RDD are distributed over the available nodes i.e. by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data such as log files which are continually growing over time using a DStream structure which is comprised of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayesian and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access, and example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second class citizen: some functionality, particularly in I/O, is missing Python support. This raises the question of whether one should start analysis in Python and make the switch as and when required, or start in Scala or Java where you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book is pitched at two audiences, data scientists and software engineers as is Spark. This would explain support for Python and (more recently) R, to keep the data scientists happy and Java/Scala for the software engineers. I must admit looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing. Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark. I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Scientists and Engineers… of What?
Fri, 26 Jun 2015 11:25:00 +0000

“All scientists are the same, no matter their field.” OK that sounds like a good ‘quotable’ quote, and since I didn’t see it said by anyone else, I can claim it as my own saying. The closest quote to this I saw was “No matter what engineering field you’re in, you learn the same basic science and mathematics. And then maybe you learn a little bit about how to apply it.” by Noam Chomsky. These statements are similar but not quite the same.


While the former focuses on what scientists actually DO, the latter has more to do with what people LEARN in the process of becoming engineers. The aim of this post is to try to prove that scientists and engineers are essentially the same in terms of the methods, processes and procedures they use to get their job done, no matter their field of endeavour.

To make clearer the argument I’m trying to put up, I am narrowing down my comparison to two ‘types’ of scientists. The first I’d call ‘mainstream scientists’ and the second are data scientists. Who are mainstream scientists? Think of them as the sort of physicists, mathematicians, scientists and engineers that worked on the Orion project. (If you haven’t heard of this project, please watch this video of what the Orion project was about).

So Project Orion was about man trying to reach Mars in a spacecraft propelled by atomic bombs. Just watching the video and thinking about the scientific process followed, the ‘trial-and-error’ methodology (sic) and the overall project got me thinking that data scientists are just like that! So let’s get down to the actual similarities.

To start with, every introduction to science usually begins with a description of the scientific ‘method’ which (with a little variation here and there) includes: formulation of a question, hypothesis, prediction, testing, analysis, replication, external review, data recording and sharing. (this version of the scientific process was borrowed here). Compare this with the software development life cycle that a data scientist would normally follow: Requirement gathering and analysis, Design, Implementation or coding, Testing, Deployment, Maintenance (source). It’s not difficult now to see that one process was derived from the other, is it?

Moving on, the short name I’d give to much of the scientific and software development process is ‘trial-and-error’ methodology (name has actually been upgraded to ‘Agile’ methodology). Project Orion’s ‘mainstream’ scientists tried (and failed at) several options for getting the rocket to escape Earth’s gravity. Data scientists try several ways to get their analytics done. In both scenarios, sometimes, an incremental step damages the entire progress made so far, and the question of ‘how do we get back to the last good configuration?’ arises. Data scientists have been having good success in recent times in this regard by using some form of version control system (like Git). How do the mainstream scientists manage theirs? I don’t know about now, but Project Orion didn’t have a provision for that.

So are mainstream scientists and data scientists the same? I’ll say a definite yes since they follow similar methods to get their work or research done. If you’re a data scientist, feel free now to identify with every other scientist in the world. Don’t feel any less a scientist because your work does not overtly affect people’s lives (like displacing people for fear of nuclear contamination, or damaging earth’s landscape as an unexpected by-product of your experiments) as mainstream scientists do. In reality, with the tools you have at your disposal as a data scientist, you have the potential to do more damage than that!

And one other quote of Noam Chomsky’s would be a good way to end this post: “If you’re teaching today what you were teaching five years ago, either the field is dead or you are.” So scientists are forward-thinking people, ever innovative, no matter their field, and that’s what makes them scientists.

Technology Radar Report
Fri, 26 Jun 2015 07:43:33 +0000

Creating a sustainable technology company involves keeping up with technology. The thing about technology is that it changes, and we have to look to the future, and invest our time now in things that will be valuable in the future. Or, we could switch to doing SharePoint consultancy for the rest of our lives, but I think most of us here would regard that as “checking out”.

This is a partly personal perspective of the future as I see it from our little hill in the Northwest of England (Brownlow Hill). Iʼm really just sketching out a few things that I see as being important for ScraperWiki. And since theyʼre important for ScraperWiki, theyʼre important for you! Or at least, you might be interested too.


The future is already here – itʼs just not evenly distributed. — William Gibson

Gibsonʼs quote certainly applies to the software industry. All of the things I highlight already exist and are in use (some for quite a long time now), they just haven’t reached saturation yet. So looking to the near future is a matter of looking to the now, and making an educated guess as to what technologies will become increasingly abundant.

Python 3

(I have been saying this for 6 years now, but) Python 3 is a real thing and in five yearsʼ time we will have stopped using Python 2 and all switched to Python 3. If you think this all seems obvious, I don’t think we can say the same about the transition from Perl 5 to Perl 6 (which lives in a perpetual state of being “out by Christmas”) or from LaTeX 2e to LaTeX 3.

Encouragingly, in 2015 real people are using it for real projects (including ScraperWiki!). I would now consider it foolish to start a greenfield Python project in Python 2. If you maintain a Python library, it is starting to look negligent if it doesn’t work with Python 3.

Your Python 2 programming skills will mostly transfer to Python 3. There will be some teething trouble: print() and urllib still fox me sometimes, and I find myself using list() a lot more when debugging (because more things are generators). Niggly details aside, basically everything works and most things are a bit better.
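A few lines show the differences mentioned above. This is a small sketch for Python 3 only; strictly speaking `map` and `dict.keys()` return lazy iterators and views rather than generators, but the practical effect when debugging is the same. The URL and the sample values are placeholders.

```python
# In Python 2 these lived in the urlparse module and urllib.urlencode.
from urllib.parse import urlencode, urlparse

# print is a function in Python 3, so the parentheses are required.
print("pass rate:", 0.7)

squares = map(lambda n: n * n, range(4))
print(squares)             # a lazy map object, not a list
square_list = list(squares)
print(square_list)         # wrap in list() to see the values

ages = {"fiesta": 7, "focus": 5}
keys = ages.keys()         # a view object rather than a list
print(list(keys))

query = urlencode({"make": "FORD", "model": "FIESTA"})
print(urlparse("https://example.com/search?" + query).query)
```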

The Go Programming Language

Globally I think the success of Go (the programming language) still remains uncertain, but its ecosystem is now large enough to sustain it in its own right. The risks here are not particularly technical but in the community. I think we would have difficulty hiring a Go programmer (we would have to find a programmer and train them).

The challenge for the next year or so is to work out what existing skills people have that transfer to Go, and, related to that, what a good framework of precursor skills for learning Go looks like. Personally speaking, when learning Go my C skills help me a lot, as does the fact that I already know what a coroutine is. I would say that knowledge of Java interfaces will help.

I don’t think there’s a good path to learning Go yet; it will be interesting to see what develops. For the “Go curious” the Tour of the Go Programming Language is worth a look.

Docker / containers

Docker is healthy, and while it might not win the “container wars” clearly containers are a thing that are going to be technically useful for the next few years (flashback to OS VM). Effort in learning Docker is likely to also be useful in other “API over container” solutions.


Increasingly software is accessed not via a library but via a service available on the web (Software as a Service, SaaS). For example, ScraperWiki has a service to convert PDFs to tables.

ScraperWiki already use a few of these (for email delivery, database storage, accounting, payments, uptime alerts, notifications), and we’ll almost certainly be using more in the future. The obvious difference compared to using a library or building it yourself is that Software as a Service has a direct monetary cost. But that doesn’t necessarily make it more expensive. Consider e-mail delivery. ScraperWiki definitely has the technical expertise to manage our own mail delivery. But as a startup, we don’t have the time to maintain mail servers or the desire to keep our mail server skills up to date. We’d rather buy that expertise in the form of the service that Sendgrid offers.

The future is much like the present. We will continue to make buy/build decisions, and increasingly the “buy” side will be a SaaS. The challenges will be in evaluating the offerings. Do they have a nice icon?

Amazon Web Services (AWS)

The mother of all SaaS.

It’s not going away and it’s getting increasingly complex. Amazon release new products every few weeks or so, and the web console becomes increasingly bewildering. I think @frabcus’s observation that “operating the AWS console” is a skill is spot on. I think there is an analogy (suggested by @IanHopkinson_) with the typing pool to desktop word processor transition: a low-paid workforce skilled in typing got replaced by giving PCs with word processors to high-paid executives with no typing skills. We no longer need IT technicians to build racks and wire them together, but instead relatively well paid devops staff do it virtually.

Cloud Formation. It’s a giant “JSON language” that describes how to create and wire together any piece of AWS infrastructure.


Probably the thing to look at though. Even if we don’t use it directly (for example, we might use some replacement for Elastic Beanstalk or generate Cloud Formation files with scripts), knowing how to read it will be useful.
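To give a flavour of what generating a Cloud Formation file with a script might look like, here is a minimal sketch that builds a template as a Python dictionary and serialises it to JSON. The logical resource name `ExampleBucket`, the description text and the `make_template` helper are invented for illustration; the template declares nothing more than a single S3 bucket.

```python
import json


def make_template(bucket_logical_name="ExampleBucket"):
    """Return a minimal Cloud Formation template declaring one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template generated by a script",
        "Resources": {
            # Hypothetical logical name; the Type is the standard S3 bucket resource.
            bucket_logical_name: {"Type": "AWS::S3::Bucket"},
        },
    }


template_json = json.dumps(make_template(), indent=2)
print(template_json)
```

Even a template this small shows the general shape: a `Resources` section in which each named entry has a `Type`, and real templates add properties and references between entries.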

Big instances versus MapReduce

Whilst I think MapReduce will remain an important technology for the sector as a whole, this will be in opposition to the “single big instance”. Don’t get too hung up on terminology: I’m really using MapReduce as a placeholder for all MapReduce and Hadoop-like “big data query” technologies.

Amazon Web Services makes it possible to rent “High Performance Computing” class nodes, for reasonable amounts of money. In 2015, you can get a 16 core (32 hyperthreads) instance with 60 or 244 Gigabytes of RAM for a couple of bucks per hour. I think the gap between laptops and big instances is widening, meaning that more ad hoc analysis will be done on a transient instance. You can process some pretty big datasets with 244 GB of RAM without needing to go all Hadoopy.

That is not to say that we should ignore MapReduce, but the challenge may be to find datasets of interest that actually require it.


Snowden’s revelations tell us that the NSA, and other state-level actors, are basically everywhere. In particular, there are hostile actors in the data centre. We should consider node to node communications as going across the public internet, even if they are in the same data centre. Practically speaking, this means HTTPS / TLS everywhere.
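In Python, a client that insists on verified TLS for node-to-node traffic can be sketched with the standard library `ssl` module. The helper name `strict_client_context` is hypothetical, and the minimum protocol version is an assumption made for this example rather than anything the post specifies.

```python
import ssl


def strict_client_context():
    """Client-side TLS context that verifies certificates and host names."""
    context = ssl.create_default_context()             # uses the system CA certificates
    context.minimum_version = ssl.TLSVersion.TLSv1_2   # assumption: refuse older protocols
    return context


context = strict_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A context like this can then be handed to whatever makes the connection, for example `urllib.request.urlopen(url, context=context)`, so that traffic inside the data centre is treated no differently from traffic over the public internet.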

If we provide a data service to our clients using AWS then ideally only the client, us, and AWS should have access to that data. It is unfortunate that AWS have to have access to the data, but it is practical necessity. Having trusted AWS, we can’t stop them (or even know) shipping all of our data to the NSA, so it is a matter of their reputation that they not do that. At least if we encrypt our network traffic, AWS have to take fairly aggressive steps to send our data to anyone else (they have to fish our session keys out of their RAM, or mass transfer the contents of their RAM somewhere).

There is lots more to do and discuss here. Fortunately ScraperWiki is pretty healthy in this regard, we are sensitive to it and we’re always discussing security.

Browser IDE

Here I’m talking about the “behind the scenes” world that is accessed from the Developer Tools. There is an awesome box of tools there. Programmers are probably all aware of the JavaScript Console and the Web Inspector, but these are the tip of a very large and featureful iceberg. Almost everything is dynamic: adding and disabling CSS rules updates the page live, as does editing the HTML. There is a fully featured single-step debugger that includes a code editor. Only the other day I learnt of the “emulate mobile device” mode for screen size and network.

Spend time poking about with the Developer Tools.

Machine Learning

Although it’s not an area that I know much about, I suspect that it’s not just a buzzword and it may turn out to be useful.

git / Version Control

git is great and there is a lot to learn, but don’t forget its broader historical context. Believe it or not git is not the first version control tool to come along, and github is not the first Software Configuration Management company. Just because git does it one particular way doesn’t mean that that way is best. It means that it is merely good enough for one person to manage the flow of patches that go to make up the Linux kernel. I would also remind everyone that git != github. Practically, be aware of which bits of your workflow are git, and which are github.

(I’m bound to say something like that, Software Configuration Management used to be part of my consultancy expertise)

Google have declared this race won. They’ve shut down their own online code management product and have started hosting projects on github.

A plausible future is where everyone uses git and most people are blind to there being anything better and most people think that git == github. Whingeing aside, that future is a much better place to work in than if sourceforge had won.

The Future Technology Radar Report

Who knows what will be on the radar in the future?

Elasticsearch and elasticity: building a search for government documents
Mon, 22 Jun 2015 08:39:44 +0000

A photograph of clouds under a magnifying glass.

“Examining Clouds” by Kate Ter Harr, licensed under CC BY 2.0.

Based in Paris, the OECD is the Organisation for Economic Co-operation and Development. As the name suggests, the OECD’s job is to develop and promote new social and economic policies.

One part of their work is researching how open countries are to trade. Their view is that fewer trade barriers benefit consumers, through lower prices, and companies, through cost-cutting. By tracking how countries vary, they hope to give legislators the means to see how they can develop policies or negotiate with other countries to open trade further.

This is a huge undertaking.

Trade policies not only operate across countries, but also by industry.  This process requires a team of experts to carry out the painstaking research and detective work to investigate current legislation.

Recently, they asked us for advice on how to make better use of the information available on government websites. A major problem they have is searching through large collections of documents to find relevant legislation. Even very short sections may be crucial in establishing a country’s policy on a particular aspect of trade.

Searching for documents

One question we considered is: what options do they have to search within documents?

  1. Use a web search engine. If you want to find documents available on the web, search engines are the first tool of choice. Unfortunately, search engines are black boxes: you input a term and get results back without any knowledge of how those results were produced. For instance, there’s no way of knowing what documents might have been considered in any particular search. Personalised search also governs the results you actually see. One normal-looking search of a government site gave us a suspiciously low number of results on both Google and Bing. Though later searches found far more documents, this is illustrative of the problems of search engines for exhaustive searching.
  2. Use a site’s own search feature. This is more likely to give us access to all the documents available. But, every site has a different layout and there’s a lack of a unified user interface for searching across multiple sites at once. For a one-off search of documents, having to manually visit and search across several sites isn’t onerous. Repeating this for a large number of searches soon becomes very tedious.
  3. Build our own custom search tool. To do this, we need to collect all the documents from sites and store those in a database that we run. This way we know what we’ve collected, and we can design and implement searches according to what the OECD need.


Enter Elasticsearch: a database designed for full text search and one which seemed to fit our requirements.

Getting the data

To see how Elasticsearch might help the OECD, we collected several thousand government documents from one website.

We needed to do very little in the way of processing. First, we extracted text from each web page using Python’s lxml. Along with the URL and the page title, we then created structured documents (JSON) suitable for storing in Elasticsearch.
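As a rough illustration of that step, here is a minimal sketch of turning one page into a JSON-ready document. The original work used lxml; this sketch swaps in the standard library's html.parser so that it runs without extra packages. The field names url, title and text, and the helper page_to_document, are assumptions made for the example rather than the schema we actually used.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the page title and visible text, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.text = []
        self._current = None  # set while inside <title>, <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "script", "style"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self._current == "title":
            self.title.append(chunk)
        elif self._current is None:
            # Ordinary page text; script and style contents are ignored.
            self.text.append(chunk)


def page_to_document(url, html):
    """Build a dict (hypothetical schema) ready to serialise as JSON."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": " ".join(parser.title),
        "text": " ".join(parser.text),
    }


if __name__ == "__main__":
    sample = "<html><head><title>Trade Act</title></head><body><p>Import quotas apply.</p></body></html>"
    print(json.dumps(page_to_document("http://example.gov/act", sample)))
```

Each dict can then be written to a file such as my_document.json and sent to Elasticsearch.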

Running Elasticsearch and uploading documents

Running Elasticsearch is simple. Visit the release page, download the latest release and just start it running. One sensible thing to do out of the box is change the default cluster name — the default is just elasticsearch. Making sure Elasticsearch is firewalled off from the internet is another sensible precaution.

When you have it running, you can simply send documents to it for storage using an HTTP client like curl:

curl "http://localhost:localport/documents/document" -X POST -d @my_document.json

For the few thousand documents we had, this wasn’t sluggish at all, though it’s also possible to upload documents in bulk should this prove too slow.
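For reference, a bulk upload sends one request body made of newline-delimited JSON: an action line followed by the document itself, repeated for each document. The sketch below only builds that body. The helper name build_bulk_body is hypothetical, and the index and type names are taken from the URL path in the curl example above. The _type field applies to Elasticsearch versions of that era; later releases dropped mapping types.

```python
import json


def build_bulk_body(documents, index="documents", doc_type="document"):
    """Return a newline-delimited JSON body for Elasticsearch's _bulk endpoint."""
    lines = []
    for doc in documents:
        # Action line: index this document, letting Elasticsearch assign the ID.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    docs = [{"title": "Trade Act"}, {"title": "Customs Code"}]
    print(build_bulk_body(docs), end="")
```

The resulting string would be POSTed to the _bulk endpoint in a single request instead of one request per document.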


Once we have documents stored, the next thing to do is query them!

Other than very basic queries, Elasticsearch queries are written in JSON, like the documents it stores, and there’s a wide variety of query types bundled into Elasticsearch.

Query JSON is not difficult to understand, but it can become tricky to read and write due to the Russian doll-like structure it quickly adopts. In Python, the addict library makes it easier to write queries out directly without getting lost inside an avalanche of {curly brackets}.

As a demo, we implemented a simple phrase matching search using the should keyword.

This allows combination of multiple phrases, favouring documents containing more matches. If we use this to search for, e.g. "immigration quota"+"work permit", the results will contain one or both of these phrases. However, results with both phrases are deemed more relevant.
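A query of this kind might look roughly like the sketch below, written with plain Python dicts. It assumes the document text lives in a field called text, which is an illustrative name, and the helper phrase_query is hypothetical. It uses Elasticsearch's bool query with match_phrase clauses under should.

```python
import json


def phrase_query(phrases, field="text"):
    """Build a bool query that matches any phrase and scores more matches higher."""
    return {
        "query": {
            "bool": {
                # Each phrase is optional on its own...
                "should": [{"match_phrase": {field: phrase}} for phrase in phrases],
                # ...but at least one of them has to match.
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    query = phrase_query(["immigration quota", "work permit"])
    print(json.dumps(query, indent=2))
```

Building the query in a function keeps the nesting in one place, so adding a third phrase is a change to the list rather than to the JSON.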

The Elasticsearch Tool


With our tool, researchers can enter a search, and very quickly get back a list of URLs, document titles and a snippet of a matching part of the text.


What we haven’t implemented is the possibility of automating queries, which could also save the OECD a lot of time. Just as document upload is automated, we could run periodic keyword searches on our data. This way, Elasticsearch could be scheduled to look out for phrases that we wish to track. From these results, we could generate a summary or report of the top matches which may prompt an interested researcher to investigate.

Future directions

For (admittedly small scale) searching, we had no problems with a single instance of Elasticsearch. To improve performance on bigger data sets, Elasticsearch also has built-in support for clustering, which looks straightforward to get running.

Clustering also ensures there is no single point of failure. However, there are known issues in that current versions of Elasticsearch can suffer document loss if nodes fail.

Provided Elasticsearch isn’t used as the only data store for documents, this is a less serious problem. It is possible to keep checking that all documents that should be in Elasticsearch are indeed there, and re-add them if not.
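The check described above amounts to comparing the IDs held in the primary store with the IDs Elasticsearch reports, then re-adding the difference. A minimal sketch, assuming each document has a stable ID such as its URL; the function names and the reindex callable are hypothetical stand-ins for whatever client code talks to Elasticsearch.

```python
def find_missing(expected_ids, indexed_ids):
    """Return the IDs that should be indexed but are not, in sorted order."""
    return sorted(set(expected_ids) - set(indexed_ids))


def reconcile(source_documents, indexed_ids, reindex):
    """Re-add every source document whose ID is absent from the index.

    source_documents maps ID to document and comes from the primary store.
    indexed_ids is whatever Elasticsearch currently reports.
    reindex is a callable (doc_id, document) that uploads one document.
    """
    missing = find_missing(source_documents.keys(), indexed_ids)
    for doc_id in missing:
        reindex(doc_id, source_documents[doc_id])
    return missing


if __name__ == "__main__":
    source = {"a": {"title": "A"}, "b": {"title": "B"}, "c": {"title": "C"}}
    print(reconcile(source, ["a", "c"], lambda doc_id, doc: None))
```

Run on a schedule, a check like this limits the damage from lost documents to the interval between runs.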

Elasticsearch is powerful, yet easy to get started with. For instance, its text analysis features support a large number of languages out of the box. This is important for the OECD, who are looking at documents of international origin.

It’s definitely worth investigating if you’re working on a project that requires search. You may find that, having found Elasticsearch, you’re no longer searching for a solution.

Four specific things “agile” saved us from doing at ONS Mon, 01 Jun 2015 14:07:56 +0000 There’s lots of both hype and cynicism around “agile”. Instead, look at this part of the original agile declaration.

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

Responding to change over Following a plan

That is, while there is value in the items on the right, we value the items on the left more. (Source: The Agile Manifesto)

In this article, I’m going to give some concrete examples where this “flexible” aspect of agile saved us from doing unnecessary work.

The project is the Databaker project we have been working on with the Office for National Statistics. You can read a general introduction to the project on our blog.

Discovery phase

ScraperWiki does a Discovery phase as the first part of a project. This is usually time boxed, and small enough (5 to 20 days work) to easily and quickly get budget for. It gives an assessment of what technical solution will meet user needs, and what it will cost.

We met the ONS at csv,conf, where Dragon’s presentation on our open source spreadsheet-scraping library XYPath attracted their attention. So it was no surprise that our discovery report said:

In an ideal world this conversion process would be fully automated, however this is not possible with current technology.

This was the key simplification of the project. We made no attempt to use machine learning, or to build a fancy graphical interface – AI just isn’t good enough yet.

Instead, our solution was for the end user to learn a little bit of Python programming. To some, that appears to make their job harder. It doesn’t – complex technical user interfaces are very hard to learn. They’re also either expensive or impossible to make.

Here are the four things which at discovery we thought we’d have to do, but that we didn’t need to do in practice.

1. Running recipes

Running from command line

By the end of the second sprint, the customer was using the software to write and run recipes. They were using the initial command-line version of the tool.

We’d thought that for ease of use under Windows we’d need to do slightly more than that – a lightweight GUI, or perhaps drag-and-drop integration into Explorer. This is what our discovery report said:

The graphical interface will also provide the ability to select recipe files and spreadsheet files, as well as reporting and logging conversion issues.

It turned out that ONS staff were more than happy with the command line. It worked, they understood it, and it let them script with batch files. Without this immediate feedback, we would have built something more, unnecessarily.

2. Editing recipes

Notepad++ editing databaker recipe

I don’t know where I got the idea that we’d need some kind of fancy editor integration. I think that much earlier I’d got overly excited about integrating the project into the ScraperWiki Data Science Platform.

We spent a little bit of time during the project researching text editors. In the end though, the answer was super simple – just use Notepad++. Both our team and the ONS already used it. It’s free.

It wasn’t until I watched the customer install it on their system, and happily start using it for real work, that I properly understood that we just didn’t need anything else.

3. Debugging recipes

Databaker highlighter

We spent some time looking into different Python UI toolkits, and integrations between Excel and Python (xlwings, which was good but too slow), looking for a way to provide more debugging support. Our discovery report said:

An elaboration of this simplest tool is to provide a graphical interface which provides support to the user to write recipes by highlighting in a representation of the spreadsheet the cells selected by each line in a recipe.

Once again, the best answer was really simple – from Python generate spreadsheets with highlighting. No need for any new interface – Excel was the interface. No need for complex integration, Python can generate basic Excel files reliably. Simple to use, attractive, and very helpful for debugging.

We’d thought we might make something more interactive. Once again, actual delivery showed that this wasn’t a priority.

4. Installation

I was worried about installation at the start of the project. It was to be delivered on Windows, and as a company, ScraperWiki are used to delivering on the web. As individuals, we’d made installers for Windows before, but weren’t particularly looking forward to it.

Once again, the answer was to use something that already existed. The project is open source anyway, so we simply used the standard Python package index, PyPI.

Anaconda Python installer

Combined with the fantastic Anaconda installer (I love Continuum Analytics!), we ended up with these installation and upgrade instructions.

Not only are the instructions super short for the end user, they are also all the code for this part of the project (well, that and some PyPI configuration files we either had or would have needed anyway).

We couldn’t have planned this in advance, as we knew more about both the deployment environment and what we needed to deploy when we made the decision.


What did all this time saving get spent on?

These more important things:

1) Dozens of tiny bugs and fixes to the recipe language to make it possible to convert more spreadsheets found in the wild.

2) Spending more time with the customer, both in formal training sessions, and informally, so they were using the software to deliver at maximum rate.

As a result the project was very successful – it finished on time and on budget. It delivered more value than we had hoped to deliver.

Book review: How Linux works by Brian Ward Tue, 14 Apr 2015 07:43:18 +0000 There has been a break since my last book review because I’ve been coding, rather than reading, on the commute into the ScraperWiki offices in Liverpool. Next up is How Linux Works by Brian Ward. In some senses this book follows on from Data Science at the Command Line by Jeroen Janssens. Data Science was about doing analysis with command line incantations; How Linux Works tells us about the system in which that command line exists and makes the incantations less mysterious.

I’ve had long experience with doing analysis on Windows machines, typically using Matlab, but over many years I have also dabbled with Unix systems including Silicon Graphics workstations, DEC Alphas and, more recently, Linux. These days I use Ubuntu to ensure compatibility with my colleagues and the systems we deploy to the internet. Increasingly I need to know more about the underlying operating system.

I’m looking to monitor system resources, manage devices and configure my environment. I’m not looking for a list of recipes, I’m looking for a mindset. How Linux Works is pretty good in this respect. I had a fair understanding of pipes in *nix operating systems before reading the book. Another fundamental I learnt from How Linux Works was that files are used to represent processes and memory. The book is also good on where these files live, although this varies a bit with distribution and time. Files are used liberally to provide configuration.

The book has 17 chapters covering the basics of Linux and the directory hierarchy, devices and disks, booting the kernel and user space, logging and user management, monitoring resource usage, networking and aspects of shell scripting and developing on Linux systems. They vary considerably in length with those on developing relatively short. There is an odd chapter on rsync.

I got a bit bogged down in the chapters on disks, how the kernel boots, how user space boots and networking. These chapters covered their topics in excruciating detail, much more than required for day to day operations. The user startup chapter tells us about systemd, Upstart and System V init – three alternative mechanisms for booting user space. Systemd is the way of the future, in case you were worried. Similarly, the chapters on booting the kernel and managing disks at a very low level provide more detail than you are ever likely to need. The author does suggest the more casual reader skip through the more advanced areas but frankly this is not a directive I can follow. I start at the beginning of a book and read through to the end, none of this “skipping bits” for me!

The user environments chapter has a nice section explaining clearly the sequence of files accessed for profile information when a terminal window is opened, or other login-like activity. Similarly the chapters on monitoring resources seem to be pitched at just the right level.

Ward’s task is made difficult by the complexity of the underlying system. Linux has an air of “If it’s broke, fix it and if it ain’t broke, fix it anyway”. Ward mentions at one point that a service in Linux had not changed for a while, therefore it was ripe for replacement! Each new distribution appears to have heard about standardisation (i.e. where to put config files) but has chosen to ignore it. And if there is consistency in the options to Linux commands it is purely coincidental. I think this is my biggest bugbear in Linux: I know which command to use, but the right option flags I have to remember blindly.

The more Linux-oriented faction of ScraperWiki seemed impressed by the coverage of the book. The chapter on shell scripting is enlightening, providing the mindset rather than the detail, so that you can solve your own problems. It’s also pragmatic in highlighting where to stop shell scripting and move to another language. I was disturbed to discover that the open-square bracket character in shell script is actually a command. This approach, “explain the big picture rather than trying to answer a load of little questions”, is a mark of a good technical book. The detail you can find on Stack Overflow or by Googling.

How Linux Works has a good bibliography, though it could do with a glossary of commands and an appendix for the more in-depth material. That said, it’s exactly the book I was looking for, and the writing style is just right. For my next task I will be filleting it for useful commands, and if someone could see their way to giving me a Dell XPS Developer Edition for “review”, I’ll be made up.

NewsReader – Hack 100,000 World Cup Articles Wed, 16 Apr 2014 13:52:23 +0000 June 10, The Hub Westminster (@NewsReader)

Ian Hopkinson has been telling you about our role in the NewsReader project.  We’re making a thing that crunches large volumes of news articles.  We’re combining natural language processing and semantic web technology.  It’s an FP7 project so we’re working with a bunch of partners across Europe.

We’re 18 months into the project and we have something to show off. Please think about joining us for a fun ‘hack’ event on June 10th in London at ‘The Hub’, Westminster. There are 100,000 World Cup news articles we need to crunch and we hope to dig out some new insights from a cacophony of digital noise. There will be light refreshments throughout the day. Like all good hack events there will be an end of day reception, and we would like you to present your findings and give us some feedback on the experience (the requisite beer and pizza will be provided).

All of our partners will be there: LexisNexis, SynerScope, VU University (Amsterdam), University of the Basque Country (San Sebastian) and Fondazione Bruno Kessler (Trento). They’re a great team, very knowledgeable in this field, and they love what they are doing.

Ian recently made a short video about the project which is a useful introduction.

If you are a journalist, an editor, a linked data enthusiast or data professional we hope you will care about this kind of innovation.

Please sign up here  ‘NewsReader eventbrite invitation’  and tell your friends.
