data science – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website.

The website, like most other web applications from ScraperWiki, is data-rich. This means that it is designed to give you facts and figures and provide an interface for you to interactively select and view data of your interest, in this case to query the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is the option of “filter by year” which selects vehicles by the first year on the road in order to narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults. To break each one down into more detailed descriptions, click the ‘detail’ button:


What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead. The ScraperWiki MOT website, by contrast, uses a dictionary stored in memory as a data source. A dictionary is a data structure in Python similar to a HashMap in Java. It consists of key-value pairs and is known to be efficient for fast lookups.

But where do we get a dictionary from and what should be its structure? Let’s back up a little bit, maybe to the very beginning. The following general procedure was followed to get us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis was carried out. This took the form of joining tables, grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built using this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted to a nested dictionary, that is to say a dictionary of dictionaries: the value associated with a key is another dictionary, which can be accessed using a different key.
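As a rough illustration of such a nested dictionary, here is a minimal sketch. The tab-separated layout, the sample rows, the "P"/"F" result codes and the function name build_nested are all assumptions made for illustration; the post does not show the real file format or the website's code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of the aggregated text file. The real separator,
# column order and result codes are assumptions for this sketch.
SAMPLE = (
    "make\tmodel\tyear\ttestresult\tcount\n"
    "FORD\tFIESTA\t2008\tP\t900\n"
    "FORD\tFIESTA\t2008\tF\t100\n"
    "FORD\tFIESTA\t2009\tP\t400\n"
    "FORD\tFOCUS\t2008\tP\t700\n"
)


def build_nested(lines):
    """Turn aggregated rows into data[make][model][year][testresult] = count."""
    data = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for row in csv.DictReader(lines, delimiter="\t"):
        by_result = data[row["make"]][row["model"]][row["year"]]
        by_result[row["testresult"]] = int(row["count"])
    return data


mot = build_nested(io.StringIO(SAMPLE))
print(sorted(mot["FORD"]))            # models recorded for a make
print(sorted(mot["FORD"]["FIESTA"]))  # years recorded for a model
print(mot["FORD"]["FIESTA"]["2008"])  # counts per test result
```

Each level of the structure is an ordinary dictionary, so every step down (make, then model, then year) is a single key lookup.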

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
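The lookups and the aggregation across years might look something like the sketch below. The dictionary layout (make, then model, then year, then test result), the sample counts, the "P" code for a pass and the function name pass_rate are assumptions for illustration, not the website's actual code.

```python
def pass_rate(data, make, model, year=None):
    """Return the pass rate for a make and model, optionally for one year.

    When no year is given, counts are summed across all years first.
    """
    by_year = data[make][model]
    years = [year] if year is not None else list(by_year)
    passed = sum(by_year[y].get("P", 0) for y in years)
    total = sum(sum(by_year[y].values()) for y in years)
    return passed / total if total else None


# Hypothetical nested dictionary with made-up counts.
mot = {
    "FORD": {
        "FIESTA": {
            "2008": {"P": 900, "F": 100},
            "2009": {"P": 400, "F": 100},
        },
        "FOCUS": {"2008": {"P": 700, "F": 300}},
    }
}

models = sorted(mot["FORD"])           # offered once a make is chosen
years = sorted(mot["FORD"]["FIESTA"])  # offered once a model is chosen
print(models, years)
print(pass_rate(mot, "FORD", "FIESTA", "2008"))  # one model year
print(pass_rate(mot, "FORD", "FIESTA"))          # all years combined
```

Summing the raw counts before dividing matters: averaging the per-year pass rates instead would give a year with few tests the same weight as a year with many.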

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Book review: Data Science at the Command Line by Jeroen Janssens
Tue, 10 Feb 2015 11:29:25 +0000

In the mixed environment of ScraperWiki we make use of a broad variety of tools for data analysis. Data Science at the Command Line by Jeroen Janssens covers tools available at the Linux command line for doing data analysis tasks. The book is divided thematically into chapters on Obtaining, Scrubbing, Modeling and Interpreting Data, with “intermezzo” chapters on parameterising shell scripts, using the Drake workflow tool and parallelisation using GNU Parallel.

The original motivation for the book was a desire to move away from purely GUI based approaches to data analysis (I think he means Excel and the Windows ecosystem). This is a common desire for data analysts: GUIs are very good for a quick look-see, but once you start wanting to repeat analysis or even repeat visualisation they become more troublesome. And launching Excel just to remove a column of data seems a bit laborious. Windows does have its own command line, PowerShell, but it’s little used by data scientists. This book is about the Linux command line, and the examples are all available on a virtual machine populated with all of the tools discussed in the book.

The command line is at its strongest with the early steps of the data analysis process: getting data from places, carrying out relatively minor acts of tidying and answering the question “does my data look remotely how I expect it to look?”. Janssens introduces the battle-tested tools sed, awk, and cut, which we use around the office at ScraperWiki. He also introduces jq (the JSON parser). This is a more recent introduction but it’s great for poking around in JSON files as commonly delivered by web APIs. An addition I hadn’t seen before was csvkit, which provides a suite of tools for processing CSV at the command line; I particularly like the look of csvstat. csvkit is a Python tool and I can imagine using it directly in Python as a library.
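For readers more at home in Python, this kind of early-stage tidying can be approximated with the standard library alone. The sketch below is only an analogy to what cut and csvstat do. It does not use csvkit, and the sample data and function names are made up for illustration.

```python
import csv
import io

# Made-up sample data.
SAMPLE = "name,city,age\nAnn,Leeds,34\nBob,York,29\nCat,Leeds,41\n"


def select_columns(text, wanted):
    """Keep only the named columns, roughly what `cut` does for delimited text."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in wanted} for row in reader]


def summarise(text):
    """Report the row count and number of distinct values for each column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0]) if rows else []
    return {
        column: {"rows": len(rows), "unique": len({row[column] for row in rows})}
        for column in columns
    }


print(select_columns(SAMPLE, ["name", "age"]))
print(summarise(SAMPLE))
```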

The style of the book is to provide a stream of practical examples for different command line tools, and illustrate their application when strung together. I must admit to finding shell commands deeply cryptic in their presentation, with chunks of options effectively looking like someone typing a strong password. Data Science at the Command Line is not an attempt to clear up the mystery of these options, more an indication that you can work great wonders on finding the right incantation.

Next up is the Rio tool for using R at the command line, principally to generate plots. I suspect this is about where I part company with Janssens on his quest to use the command line for all the things. Systems like R, IPython and the IPython notebook all offer a decent REPL (read-eval-print loop) which will convert seamlessly into an actual program. I find I use these REPLs for experimentation whilst I build a library of analysis functions for the job at hand. You can write an entire analysis program using the shell but it doesn’t mean you should!

Weka provides a nice example of smoothing the command line interface to an established package. Weka is a machine learning library written in Java; it is the code behind Data Mining: Practical Machine Learning Tools and Techniques. The edges to be smoothed are that the bare command line for Weka is somewhat involved, since it requires a whole pile of boilerplate. Janssens demonstrates nicely how to do this by automatically generating autocompletion hints for the parts of Weka which are accessible from the command line.

The book starts by pitching the command line as a substitute for GUI driven applications which is something I can agree with to at least some degree. It finishes by proposing the command line as a replacement for a conventional programming language with which I can’t agree. My tendency would be to move from the command line to Python fairly rapidly perhaps using ipython or ipython notebook as a stepping stone.

Data Science at the Command Line is definitely worth reading if not following religiously. It’s a showcase for what is possible rather than a reference book as to how exactly to do it.

Book review: Data Science for Business by Provost and Fawcett
Fri, 02 May 2014 10:14:11 +0000

Marginalia are an insight into the mind of another reader. This struck me as I read Data Science for Business by Foster Provost and Tom Fawcett. The copy of the book had previously been read by two of my colleagues, one of whom had clearly read the introductory and concluding chapters but not the bit in between. Also they would probably not be described as a capitalist, “red in tooth and claw”! My marginalia have generally been hidden since I have an almost religious aversion to defacing a book in any way. I do use Evernote to take notes as I go though, so for this review I’ll reveal them here.

Data Science for Business is the book I wasn’t going to read since I’ve already read Machine Learning in Action, Data Mining: Practical Machine Learning Tools and Techniques, and Mining the Social Web. However, I gave in to peer pressure. The pitch for the book is that it is for people who will manage data scientists rather than necessarily be data scientists themselves. The implication here is that you’re paying these data scientists to increase your profits, so you better make sure that’s what they’ll do. You need to be able to understand what data science can and cannot do, ask reasonable questions of data scientists of their models and understand the environment the data scientist needs to thrive.

The book covers several key algorithms: decision trees, support vector machines, logistic regression, k-Nearest Neighbours and term frequency-inverse document frequency (TF-IDF) but not in any great depth of implementation. To my mind it is surprisingly mathematical in places, given the intended audience of managers rather than scientists.

The strengths of the book are in the explanations of the algorithms in visual terms, and in its focus on the expected value framework for evaluating data mining models. Diversity of explanation is always a good thing; read enough different explanations and one will speak directly to you. It also spends more of its time discussing practical applications than other books on data mining. An example on “churn” runs through the book. “Churn” is the loss of customers at the end of a contract, in this case the telecom industry is used as an illustration.

A couple of nuggets I picked up:

  • You can think of different machine learning algorithms in terms of the decision boundary they produce and how that looks. Overfitting becomes a decision boundary which is disturbingly intricate. Support vector machines put the decision boundary as far away from the classes they separate as possible;
  • You need to make sure that the attributes that you use to build your model will be available at the point of use. That’s to say there is no point in building a model for churn which needs an attribute from a customer which is only available just after they’ve left you. Sounds a bit obvious but I can easily see myself making this mistake;
  • The expected value framework for evaluating models. This combines the probability of an event, e.g. the result of a promotion campaign, with the value of the outcome. Again churn makes a useful demonstration. If you have the choice between a promotion which is successful with 10 users with an average spend of £10 per year (£100 in total) or 1 user with an average spend of £200, then you should obviously go with the latter rather than the former. This reminds me of expectation values in quantum mechanics and in statistical physics.
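The arithmetic behind the promotion comparison can be written out explicitly. This is a minimal sketch of the idea as described in this review, not the book's own formulation. The function name and the probabilities in the second half are made up for illustration.

```python
def expected_value(outcomes):
    """Sum probability * value over (probability, value) pairs."""
    return sum(probability * value for probability, value in outcomes)


# The review's example, treating each successful user as a certainty:
# 10 users spending £10 each versus 1 user spending £200.
promotion_a = expected_value([(1.0, 10)] * 10)
promotion_b = expected_value([(1.0, 200)])
print(promotion_a, promotion_b)  # 100.0 200.0

# With uncertainty (hypothetical probabilities): a 25% chance of keeping
# a £200 customer is still worth more than a 50% chance of keeping a
# £10 customer.
print(expected_value([(0.25, 200)]), expected_value([(0.5, 10)]))  # 50.0 5.0
```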

The title of the book and the related reading demonstrate that data science, machine learning and data mining are used synonymously. I had a quick look at the popularity of these terms over the last few years. You can see the results in the Google Ngram viewer here. Somewhat to my surprise, data science still lags far behind the other terms despite the recent buzz; this is perhaps because Google only expose data up to 2008.

Which book should you read?

All of them!

If you must buy only one then make it Data Mining: it is encyclopaedic and covers high level business overview, toy implementation and detailed implementation in some depth. If you want to see the code, then get Machine Learning in Action – but be aware that ultimately you are most likely going to be using someone else’s implementation of the core machine learning algorithms. Mining the Social Web is excellent if you want to see the code and are particularly interested in social media. And read Data Science for Business if you are the intended managerial audience or one who will be doing data mining in a commercial environment.

Scraping Spreadsheets with XYPath
Wed, 12 Mar 2014 17:11:12 +0000

Spreadsheets are great. They’re ubiquitously available, beaten only by web pages and word processor documents.

Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated databases, they don’t impose a strong sense of structure, they’re easy to print out and they can be imported into different pieces of software.

And so people pass their data around in Excel files and CSV files, or they replicate the look-and-feel of a spreadsheet in the browser with a table, often creating a large number of tables.

But the very traits that made spreadsheets so simple for the user creating the data hamper us when we want to reuse the data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.
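The contrast between the two ways of addressing a cell can be sketched in plain Python. This is not XYPath code; the function names and the figures in the table are made up for illustration.

```python
import re


def cell_to_indices(ref):
    """Convert a reference such as 'F12' to 1-based (column, row) numbers."""
    letters, digits = re.fullmatch(r"([A-Z]+)(\d+)", ref).groups()
    column = 0
    for letter in letters:
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column, int(digits)


def lookup_by_labels(table, row_label, column_label):
    """Find the value where a labelled row meets a labelled column."""
    header = table[0]
    column = header.index(column_label)
    for row in table[1:]:
        if row[0] == row_label:
            return row[column]
    raise KeyError(row_label)


# Made-up figures, for illustration only.
table = [
    ["Country", "2013", "2014"],
    ["France", 100, 101],
    ["United Kingdom", 200, 202],
]

print(cell_to_indices("F12"))                             # (6, 12)
print(lookup_by_labels(table, "United Kingdom", "2014"))  # 202
```

The absolute reference breaks as soon as someone inserts a row or column, whereas the label lookup keeps working as long as the labels are still present. It still assumes that the headers sit in the first row and first column, which is exactly the kind of assumption that real spreadsheets tend to violate.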



So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python: it doesn’t really care whether the file it’s loading is an XLS, CSV, a HTML page or a ZIP containing CSVs, it gives us a uniform interface to all these table-containing filetypes.

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the words “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries. Beneath the years, and to the right of the countries, are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

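# Assumes pop_table is an XYPath table already loaded (e.g. via Messytables)
# and that RIGHT and DOWN are XYPath's direction constants.
region_cell = pop_table.filter("Country Code").assert_one()  # exactly one match expected
years = region_cell.fill(RIGHT)         # year headers to the right
countries = region_cell.fill(DOWN)      # country names below
print(list(years.junction(countries)))  # values with their year and country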

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to talk about any groups of cells, even those that aren’t a rectangular block.
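To make the “bag of cells” idea more concrete, here is a toy model in plain Python. It is emphatically not XYPath's implementation or API. The grid, the fill and junction functions and the data are simplified stand-ins invented for this sketch, with a bag represented as a set of (x, y) coordinates.

```python
# Toy grid keyed by (x, y). Made-up data.
GRID = {
    (0, 0): "Country Code", (1, 0): "2013", (2, 0): "2014",
    (0, 1): "FRA", (1, 1): 100, (2, 1): 101,
    (0, 2): "GBR", (1, 2): 200, (2, 2): 202,
}

RIGHT, DOWN = (1, 0), (0, 1)


def fill(grid, start, direction):
    """Return the bag (set) of cells lying beyond `start` in `direction`."""
    dx, dy = direction
    x, y = start[0] + dx, start[1] + dy
    bag = set()
    while (x, y) in grid:
        bag.add((x, y))
        x, y = x + dx, y + dy
    return bag


def junction(grid, column_headers, row_headers):
    """Yield (column label, row label, value) where the two bags intersect."""
    for col in sorted(column_headers):
        for row in sorted(row_headers):
            meeting = (col[0], row[1])
            if meeting in grid:
                yield grid[col], grid[row], grid[meeting]


matches = [cell for cell, value in GRID.items() if value == "Country Code"]
assert len(matches) == 1  # there should be exactly one such cell
corner = matches[0]

years = fill(GRID, corner, RIGHT)
countries = fill(GRID, corner, DOWN)
print(list(junction(GRID, years, countries)))
```

Because a bag here is just a set, it can be grown, shrunk or combined with ordinary set operations, and nothing requires its cells to form a rectangle.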

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in harnessing our technical expertise at understanding spreadsheets, or if you’ve any questions about the code!


ScraperWiki – Professional Services
Mon, 15 Jul 2013 16:38:52 +0000

How would you go about collecting, structuring and analysing 100,000 reports on Romanian companies?

You could use ScraperWiki to write and host your own computer code that carries out the scraping you need, and then use our other self-service tools to clean and analyse the data.

But sometimes writing your own code is not a feasible solution. Perhaps your organisation does not have the coding skills in the required area. Or maybe an internal team needs support to deploy their own solutions, or lacks the time and resources to get the job done quickly.

That’s why, alongside our new platform, ScraperWiki also offers a professional service specifically tailored to corporate customers’ needs.

Recently, for example, we acquired data from the Romanian Ministry of Finance for a client. Our expert data scientists wrote a computer program to ingest and structure the data, which was fed into the client’s private datahub on ScraperWiki. The client was then able to make use of ScraperWiki’s ecosystem of built-in tools, to explore their data and carry out their analysis…

Like the “View in a table” tool, which lets them page through the data, sort it and filter it:


The “Summarise this data” tool which gives them a quick overview of their data by looking at each column of the data and making an intelligent decision as to how best to portray it:


And the “Query with SQL” tool, which allows them to ask sophisticated questions of their data using the industry-standard database query language, and then export the live data to their own systems:


Not only this, ScraperWiki’s data scientists also handcrafted custom visualisation and analysis for the client. In this case we made a web page which pulled in data from the dataset directly, and presented results as a living report. For other projects we have written and presented analysis using R, a very widely used open source statistical analysis package, and Tableau, the industry-leading business intelligence application.

The key advantage of using ScraperWiki for this sort of project is that there is no software to install locally on your computer. Inside corporate environments the desktop PC is often locked down, meaning custom analysis software cannot be easily deployed. Ongoing maintenance presents further problems. Even in more open environments, installing custom software for just one piece of analysis is not something users find convenient. Hosting data analysis functionality on the web has become an entirely practical proposition; storing data on the web has long been commonplace and it is a facility which we use every day. More recently, with developments in browser technology it has become possible to build a rich user experience which facilitates online data analysis delivery.

Combine these technologies with our expert data scientists and you get a winning solution to your data questions – all in the form of ScraperWiki Professional Services.

Hi, I’m Paul
Thu, 18 Apr 2013 11:05:37 +0000

Hi!

[Photo: Paul Furley]

I’m the latest member of ScraperWiki, joining the Data Science team this week.

Data Science is a fascinating new direction for me, being “officially” an Electronic Engineer. I’ve spent the last couple of years in a large company hammering out fast C++ and trying (unsuccessfully) to convert everyone to Python. But what really excites me about Data Science is the application of software to discover meaning in data. With the amount of data we’re generating every minute, I feel there must be countless opportunities to understand and exploit the information contained within.

I’ve written some scrapers in the past to try to discover investment opportunities. The first compared sales and rental prices from RightMove to identify good buy-to-let areas, and more recently I’ve been analysing dividend payments of companies listed on the London Stock Exchange. Once these are a bit more polished and migrated to the new ScraperWiki platform, I’ll post an update and hopefully others will find the data useful.

First impressions of ScraperWiki are great. I’m surrounded by talented and enthusiastic people – it’s hard to ask for more than that.


Here’s my Twitter and blog.

From future import
Tue, 19 Mar 2013 10:40:26 +0000

Time flies when you’re building a platform.

At the start of the year, we announced the beginnings of a new, more powerful, more flexible ScraperWiki. More powerful because it exposes industry standards like SQL, SSH, and a persistent filesystem to developers, so they can scrape and crunch and export data pretty much however they like. More flexible because, at its heart, the new ScraperWiki is an ecosystem of end-user tools, enabling domain experts, managers, journalists, kittens to work with data without writing a line of code.

At the time, we were happy to announce all of our corporate Data Services customers were happily using the new platform (admittedly, with a few rough edges!). Lots has changed since then (seriously – take a look at the code!) and we’ve learnt a lot about how users from all sorts of different backgrounds expect to see, collect and interact with their data. As a guy with UX roots, this stuff is fascinating – perhaps something for future blog posts!

Anyway, back to the future…

Last week, we invited our ‘Premium Account’ holders from ScraperWiki Classic, to come try the new ScraperWiki out. Each of them had their own private data hub, pre-installed with all of their Classic scrapers and views. And they all have access to a basic suite of tools for importing, visualising and exporting data (but there are far more to come).

Zarino's data hub

The feedback we’ve had so far has been really positive, so I wanted to say a big public thank you to everyone in this first tranche of users – you awesome, data-wrangling trail-blazers, you.

But we’re not standing still. Since our December announcement, we’ve collated a shortlist of early adopters: people who are pushing the boundary of what Classic can offer, or who have expressed interest in the new platform on our blog, mailing list, or Twitter. And once we’ve made some improvements, and put the finishing touches on our first set of end-user tools, we’ll be inviting them to put the new ScraperWiki to the test.

If you’d like to be part of that early adopter shortlist, leave a comment below, or email us. We’d love to have you on board.

The next evolution of ScraperWiki
Fri, 21 Dec 2012 15:49:12 +0000

Quietly, over the last few months, we’ve been rebuilding both the backend and the frontend of ScraperWiki.

The new ScraperWiki has been built from the ground up to be more powerful for data scientists, and easier to use for everyone else. At its core, it’s about empowering people to take a hold of their data, to analyse it, combine it, and make value from it.



We can’t wait to let you try it in January. In the meantime, however, we’re pleased to announce that all of our corporate customers are already migrating to the new ScraperWiki for scraping, storing and visualising their private datasets.

If you want data scraping, cleaning or analysing, then you can join them. Please get in touch. We’ve got a data hub and a team of data scientists itching to help.
