Matthew Hughes – ScraperWiki: Extract tables from PDFs and scrape the web

Data Science (and ScraperWiki) comes to the Cabinet Office Thu, 05 Dec 2013 09:40:25 +0000 The Cabinet Office is one of the most vital institutions in British government, acting as the backbone of all decision making and supporting the Prime Minister and Deputy Prime Minister in their running of the United Kingdom. On the 19th of November, I was given an opportunity to attend an event run by this important institution, where I would be mentoring young people from across London in Data Science and Open Data.


The event was held at the headquarters of the UK Treasury, which occupies a palatial corner of Westminster overlooking St James’s Park, just a stone’s throw from Buckingham Palace. Also in attendance were programmers, data scientists, project managers and statisticians from the likes of BT, Experian, the Department for Education and the Foreign and Commonwealth Office, as well as my colleague Aine McGuire from ScraperWiki.

After a spot of chatting and ‘getting to know you’, the mentors and mentees split off into small groups where they’d start working on interesting ways they could use open government data; in particular data from the Department for Education.

Despite only having a day to work on their projects, each of the teams produced something incredible. Here’s what they made:


Students from a sixth-form college in Hammersmith and from the University of Greenwich chose to put together mapping technologies and open data to make it easy for parents to find good schools in their area.


They even managed to create a tablet-ready demonstration product built using Unity 3D, which displayed a number of schools in England and Wales alongside data about each school’s academic performance. Despite the day’s tight time constraints, they created something that worked quite well and ended up winning the award for ‘best use of Open Data’.


In British parlance, a NEET is someone who is Not in Education, Employment or Training. It’s a huge problem in the UK, wasting vast amounts of human potential and money.


But what if you could use Open Data to inspire young people to challenge themselves and take advantage of opportunities related to their interests? And what if that came packaged in a nice, accessible phone app? That’s what one of the teams in attendance built, resulting in Neetx.

Cherry Picker

The explosion in speciality colleges (confusingly, these are almost all high schools) under the Labour government has made it easy for pupils with very specific interests to choose a school that works for them.

But what if you wanted a bit more detail? What if you wanted to send your child to a school that was really, really good at sciences? What if you wanted to cherry-pick (see what I did there?) schools based upon their performance in certain key areas? Cherry Picker makes it easy to do just that.

University Aggregator

Finding the right university can be hard. There’s so much to be taken into consideration, and there’s so much information out there. What if someone gathered it all, and merged it into a single source where parents and prospective students could make an informed decision?


That’s what one of the teams attending proposed. They suggested that in addition to information from the National Student Survey and government data, they could also use information from Which?, The Telegraph and The Guardian’s university league tables. This idea also got a great reception from the mentors and judges in attendance, and is one idea I would love to see become a reality.


I left the Cabinet Office impressed with the quality of the mentorship offered, the quality of the ideas presented and the calibre of the students attending. The Cabinet Office really ought to be commended for putting on such an amazing event.

Were you in attendance? Let me know what you thought about it in the comments box below.

Exploring Stack Exchange Open Data Wed, 14 Aug 2013 16:01:48 +0000 Inspired by my long commute and the pretty dreadful EDM music blasting out in my gym, I’ve found myself on a bit of a podcast kick lately. Besides my usual NPR fare (if you’ve not yet listened to an episode of This American Life with Ira Glass, you’ve missed out), I’ve been checking out the Stack Exchange podcast, a fairly irreverent take on the popular Q&A network hosted by its founders. On the 51st episode, they announced the opening of their latest site, which focuses on the exciting world of open data.

Perhaps the most common complaint I’ve heard since I started surrounding myself with data scientists is that getting specific sets of data can be frustratingly hard. Often, what you can get by scraping a website is more than sufficient. That said, if you’re looking for something oddly specific, like the nutritional information of all food products on the shelves of UK supermarkets, you can quickly find yourself hitting some serious brick walls.

That’s where Stack Exchange Open Data comes in. It follows the typical formula that Stack Overflow has adhered to since its inception. Good questions rise to the top whilst bad ones fade into irrelevance.


The aim of this site is to provide a handy venue for finding useful datasets to analyze or use in projects. Despite opening only quite recently, it has garnered a large userbase, and people are asking interesting questions and getting helpful answers on everything from German public transportation to global terrain data.

Will you be using Stack Exchange Open Data in one of your future projects? Has it helped you find a particularly elusive dataset? Let me know in the comments below.

My First Month As an Intern At ScraperWiki Fri, 09 Aug 2013 16:37:43 +0000 The role of an intern is often a lowly one. Intern duties usually consist of the provision of caffeinated beverages, screeching ‘can I take a message?’ into phones and the occasional promenade to the photocopier and back again.

ScraperWiki is nothing like that. Since starting in late May, I’ve taken on a number of roles within the organization and learned how a modern-day, Silicon Valley style startup works.

How ScraperWiki Works

It’s not uncommon for computer science students to be taught some project management methodologies at university. For the most part though, they’re horribly antiquated.

ScraperWiki is an XP/Scrum/Agile shop. Without a doubt, this is not something taught at university!

Each day starts off with a ‘stand up’, where each member of the ScraperWiki team says what they intend to accomplish that day. It’s also a great opportunity to see if one of your colleagues is working on something on which you’d like to collaborate.

Collaboration is key at ScraperWiki. From the start of my internship, I was pair programming with the other programmers on staff. For those of you who haven’t heard of it before, pair programming is where two people use one computer to work on a project.

This is awesome, because it’s a totally non-passive way of learning. If you’re driving, you’re getting first-hand experience of writing code. If you’re navigating, then you get the chance to mentally structure the code that you’re working on.

In addition to this, every two weeks we have a retrospective, where we look at how the previous fortnight went and decide the next steps we intend to take as an organization. We write a bunch of sticky notes listing what was good and what was bad about the previous fortnight, put them into logical groups, and then vote for the group of stickies that best represents where we feel we should focus our efforts.

What We Work On

Perhaps the most compelling argument for someone to do an internship at ScraperWiki is that you can never really predict what you’re going to do from one day to the next. You might be working on an interesting data science project with Dragon or Paul, doing front end development with Zarino or making the platform even more robust with Chris. As a fledgling programmer, you really get an opportunity to discover what you enjoy.

During my time working at ScraperWiki, I’ve had the opportunity to learn some new, up-and-coming web technologies, including CoffeeScript, Express and Backbone.js. These are all pretty fun to work with.

It’s not all work and no play, though. Most days we go out to a local restaurant and eat lunch together. Usually it’s some variety of Middle Eastern, American or Chinese, and it’s usually pretty delicious!


All in all, ScraperWiki is a pretty awesome place to intern. I’ve learned so much in just a few weeks, and I’ll be sad to leave everyone when I go back to my second year of university in October.

Have you interned anywhere before? What was it like? Let me know in the comments below!

We’ve migrated to EC2 Wed, 17 Jul 2013 15:36:35 +0000 When we started work on the ScraperWiki beta, we decided to host it ‘in the cloud’ using Linode, a VPS (Virtual Private Server) provider. For the uninitiated, Linode allows people to host their own virtual Linux servers without having to worry about things like maintaining their own hardware.

On April 15th 2013, Linode were hacked via a ColdFusion zero-day exploit. The hackers were able to access some of Linode’s source code, one of their web servers, and notably, their customer database. In a blog post released the next day, they assured us that all the credit card details they store are encrypted.

Soon after, however, we noticed fraudulent purchases on the company credit card we had associated with our Linode account. It seems that we were not alone in this. We immediately cancelled the card and started to make plans to switch to another VPS provider.

These days, one of the biggest names in cloud hosting is Amazon Web Services (AWS). They’re the market leader, and their ecosystem and SLA are more in line with the expectations of our corporate customers. Their API is also incredibly powerful. It’s no wonder that, even prior to the Linode hack, we had investigated migrating the ScraperWiki beta platform to Amazon EC2.

Since mid-June, all code and data has been stored on Amazon’s EC2 platform. Amongst other improvements, you should all have noticed a significant increase in the speed of ScraperWiki tools.

We have a lot of confidence in the EC2 platform. Amazon have an excellent track record in cloud hosting, where they have earned a reputation for reliability and security. It is for these reasons that we feel confident putting our users’ data on their servers.

The integrity of any data stored on our service is paramount. We are therefore greatly encouraged by the backup facilities of EBS (Elastic Block Store), which we are currently using; they allow us to store our backups in two different geographical regions. Should a region ever go down, we can easily and quickly restore ScraperWiki, ensuring a minimum of disruption for our customers.

Finally, we’re excited to announce that we’re using Canonical’s Juju to manage how we deploy our servers. We’re impressed with what we’ve seen of it so far. It seems to be a really powerful, feature rich product and it has saved us a lot of time. We’re looking forward to it allowing us to better scale our product and spend less time on migrations and deployments. It will also allow us to easily migrate our servers to any OpenStack provider, should we wish to.

The changes we’re making to our platform will result in ScraperWiki being faster and more resistant to disruption. As developers and data scientists ourselves, we understand the necessity of reliable tools, and we’re really looking forward to you – the user – having an even better ScraperWiki experience.

Scraping for kittens Wed, 10 Jul 2013 16:27:29 +0000 Like most people who possess a pulse and an internet connection, I think kittens are absurdly cute and quite possibly the vehicle in which humanity will usher in an era of world peace. I mean, who doesn’t? They’re adorable.

I was genuinely curious as to which country has the cutest kittens, so I decided to write a tool to find out! I used our new ScraperWiki platform, Python and the Flickr API, searching for all the photos that contain geotags and a reference to the word ‘kitten’. All the data I retrieved would then be plotted on a map.

My Development Environment And The Flickr API

Flickr has a really powerful, mature, well-documented API. It comes with a reasonably generous rate limit, and if you ever get stuck using it, you’ll find there’s a great deal of support out there.

Gaining access to the API is pretty trivial too. Just sign into Flickr, request an API key from the App Garden and follow the instructions. Once you’ve received your API key and secret, put them in a safe place. We’re going to need them later.

First, however, you will need to log into ScraperWiki and create yourself a new dataset. Once you’ve done that, you will be able to SSH in and set up your development environment. Whilst you may have your own preferences, I recommend using our Data Services Scraper Template.

Here’s what Paul Furley – its author – has to say about it.

“The Scraper Template allows you to set up a box in a minimum amount of time and allows you to be consistent across all of your boxes. It provides all the common stuff that people use in scrapers, such as unicode support, scheduling and autodocs, as well as helping you manage your virtual environment with virtualenv. It handles the boring stuff for you.”

How I Scraped Flickr

There are a couple of prerequisites that you’ll need to satisfy for your tool to work. Firstly, if you’re using a virtualenv, you should ensure that you have the ‘scraperwiki’ library installed.

It’s also essential that you install ‘flickrapi’. This is the awesome little library that handles all the heavy lifting when it comes to scraping Flickr. Both of these libraries are on PyPI and can be installed by running the following command:

$ pip install flickrapi scraperwiki

If you’ve used the Scraper Template, you’ll find a file called ‘’ in ‘~/tool/’; if it doesn’t already exist, create it, and otherwise delete its existing contents. Open it up in your favorite text editor and add the following lines:

View the code on Gist.

Here, we’re importing the modules and classes we need, assigning our API key to a variable and instantiating the FlickrAPI class.
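As the embedded gist doesn’t render here, a minimal sketch of that setup step might look like this. The variable names and placeholder credentials are my own assumptions, not the original gist:

```python
# Hypothetical reconstruction of the setup described above.
API_KEY = 'YOUR_FLICKR_API_KEY'        # placeholder: use your own key
API_SECRET = 'YOUR_FLICKR_API_SECRET'  # placeholder: use your own secret

def make_flickr_client():
    # Imported lazily so the rest of this sketch can be read without
    # the third-party 'flickrapi' package installed.
    import flickrapi
    return flickrapi.FlickrAPI(API_KEY, API_SECRET)
```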

View the code on Gist.

Later on, we’re going to write a function that contains the bulk of our scraper. We want this function to be executed whenever our program is run. The above two lines do exactly that.
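The gist isn’t rendered here, but the two lines described are presumably Python’s standard entry-point idiom, shown below with a stub in place of the real scraper body:

```python
def main():
    # The bulk of the scraper goes here.
    pass

# Run main() only when this file is executed directly, not when imported.
if __name__ == '__main__':
    main()
```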

View the code on Gist.

In our ‘main’ function, we call ‘flickr.walk()’. This handy little method gives us access to Flickr’s search engine. We pass in two parameters: the first searches for photos that contain the keyword ‘kittens’; the second gives us access to the geotags associated with each photo.

We then iterate through our results. Because we’re looking for photos that have geotags, if a result has a latitude value of ‘0’ we move on to the next item. Otherwise, we assign the title, the unique Flickr ID number, the URL and the coordinates of the photo to variables, and then call ‘submit_to_scraperwiki’. This is a function we’ll define that inserts our results into a SQLite file, which will be presented in a ScraperWiki view.
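Since the gists don’t render here, a sketch of that search-and-filter loop is below. The ‘text’ and ‘extras’ parameter names follow flickrapi’s ‘walk’ interface, but treat the field names and URL format as assumptions rather than the original code:

```python
def photo_to_row(photo):
    """Turn one search result into a row dict, or None if it has no geotag."""
    if photo.get('latitude') in (None, '0', 0):
        return None  # no real geotag: skip this photo
    return {
        'id': photo.get('id'),
        'title': photo.get('title'),
        # Assumed photo-page URL layout: /photos/<owner>/<id>
        'url': 'https://www.flickr.com/photos/%s/%s' % (
            photo.get('owner'), photo.get('id')),
        'latitude': float(photo.get('latitude')),
        'longitude': float(photo.get('longitude')),
    }

def run_scraper(flickr, save):
    # flickr.walk lazily pages through search results; 'extras=geo'
    # asks Flickr to include latitude/longitude on each photo.
    for photo in flickr.walk(text='kittens', extras='geo'):
        row = photo_to_row(photo)
        if row is not None:
            save(row)
```

Splitting the filtering into ‘photo_to_row’ keeps the geotag check testable without touching the network.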

View the code on Gist.

Submit_to_scraperwiki is a handy little function that takes the dictionary of values we pulled from Flickr and shoves them into a database table called ‘kittens’.


So, we’ve located all our kitties. What now? Let’s plot them on a map!

In your browser, navigate to your dataset. Inside it, click on ‘More tools’. You’ll see a list of all the possible tools you can use in order to better visualize your data.


As you can see, there are a lot of tools you can use. We just want to select ‘View on a map’, which automatically looks at our data and places it on a map. This works because the tool recognises and extracts the latitude and longitude columns stored in the database.

Once you’ve added this tool to your datahub, you can then see each result that you have stored on a map. Here’s what it looks like!


When you click on a ‘pin’, you’ll see all the information relating to the photo of a kitty it represents.


My First Tool

This was the first tool that I’ve made on the new platform. I was a bit apprehensive, as I had previously only used ScraperWiki Classic and I was very much used to writing my scrapers in the browser.

However, I soon discovered that the latest iteration of the ScraperWiki platform is nothing short of a joy to use. It’s something that was obviously designed from the ground up with the user’s experience in mind.

Things just worked. Vim has a bunch of useful plugins installed, including syntax highlighting. There’s a huge choice of programming languages. The ‘View on a map’ tool just worked. It was snappy and responsive. It’s also really, really fun.

You can try out my tool too! We decided to adapt it into a general-purpose Flickr search tool, and it is available right now! Next time you create a new dataset, have a look at ‘Flickr Geo Search’ and tell me what you think!


So, what’s your next tool going to be?

Hi, I’m Matthew Hughes Fri, 07 Jun 2013 16:40:04 +0000 Hello! My name is Matthew Hughes, and I am ScraperWiki’s newest intern. I will be working predominantly on product and tools, alongside the likes of Chris Blower and David Jones.

Currently, I’m reading Computing at Liverpool Hope University, where I am about to enter my second year of study. When I’m not hammering out code or squinting at an error log, you’ll likely find me with a cup of coffee in my hand, curled up with a John Green novel or watching a Woody Allen film.

In this brief introductory blog post, I’ve been tasked with telling you about why I wanted to work at ScraperWiki. Truth be told, there are a great many reasons. It’s an awesome company to work for and is staffed with some of the most amazingly smart people I’ve ever had the fortune to come across. The product itself matters to a great many people, and has been lovingly crafted by people who are amongst the best in their field. There is also a culture within the company that fosters a great deal of creativity and respects the creative process. The coffee is pretty great too.

From the perspective of an internship, I’ve learned a great deal. In just five days, I’ve gotten a better understanding of how Express and Backbone work. They’ve also achieved the impossible and pried me away from my text-editor of choice and turned me into a proud Vim user. This is a job where I’m constantly challenged and learning, and I’m incredibly grateful for the opportunity I’ve been provided.

I don’t use Twitter, but you can read my blog or contact me on Facebook.
