Scraping for kittens Wed, 10 Jul 2013 16:27:29 +0000 Like most people who possess a pulse and an internet connection, I think kittens are absurdly cute and quite possibly the vehicle in which humanity will usher in an era of world peace. I mean, who doesn’t? They’re adorable.

I was genuinely curious as to what country has the cutest kittens. I therefore decided to write a tool to find out! I used our new ScraperWiki platform, Python and the Flickr API, and I decided to search for all the photos that contain geotags and a reference to the word ‘kitten’. All the data that I’d retrieve would then be plotted on a map.

My Development Environment And The Flickr API

Flickr has a really powerful, mature, well documented API. It comes with a reasonably generous limit and if you ever get stuck using it, you’ll discover that there’s a great deal of support out there.

Gaining access to the API is pretty trivial too. Just sign into Flickr, click here and follow the instructions. Once you’ve received your API key and secret, put them in a safe place. We’re going to need them later.

First, however, you will need to log into ScraperWiki and create yourself a new dataset. Once you’ve done that, you will be able to SSH in and set up your development environment. Whilst you may have your own preferences, I recommend using our Data Services Scraper Template.

Here’s what Paul Furley – its author – has to say about it.

“The Scraper Template allows you to set up a box in a minimum  amount of time and allows you to be consistent across all of your boxes. It provides all the common stuff that people use in scrapers, such as unicode support, scheduling and autodocs, as well as helping you manage your virtual environment with virtualenv. It handles the boring stuff for you.”

How I Scraped Flickr

There are a couple of prerequisites that you’ll need to satisfy for your tool to work. Firstly, if you’re using a virtualenv, you should ensure that you have the ‘scraperwiki’ library installed.

It’s also essential that you install ‘flickrapi’. This is the awesome little library that will handle all the heavy lifting, when it comes to scraping Flickr. Both of these libraries are easily found in ‘pip’ and can be installed by running the following command:

$ pip install flickrapi scraperwiki

If you’ve used the Scraper Template, you’ll find a file called ‘’ in ‘~/tool/’. If it doesn’t already exist, create it; otherwise, delete its contents. Then open it up in your favorite text editor and add the following lines:

View the code on Gist.

Here, we’re importing the modules and classes we need, assigning our API key to a variable and instantiating the FlickrAPI class.

View the code on Gist.

Later on, we’re going to write a function that contains the bulk of our scraper. We want this function to be executed whenever our program is run. The above two lines do exactly that.
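The gist isn’t embedded in this copy of the post, but the two lines in question are Python’s standard entry-point guard – sketched here with a stub main() (the real one holds the scraper logic described below):

```python
def main():
    # Stub: the real main() walks the Flickr search results.
    return 'scraper ran'

if __name__ == '__main__':
    main()
```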

View the code on Gist.

In our ‘main’ function, we call ‘flickr.walk()’. This handy little method gives you access to Flickr’s search engine. We’re passing in two parameters. The first searches for photos that contain the keyword ‘kittens’. The second gives us access to the geotags associated with each photo.

We then iterate through our results. Because we’re looking for photos that have geotags, if any of our results have a latitude value of ‘0’, we want to move on to the next item in our results. Otherwise we assign the title, the unique Flickr ID number associated with the photo, its URL and its coordinates to variables, and we then call ‘submit_to_scraperwiki’. This is a function we’ll define that allows us to insert our results into a SQLite file that will be presented in a ScraperWiki view.

View the code on Gist.

‘submit_to_scraperwiki’ is a handy little function that takes a dictionary of the values we pulled from Flickr and shoves them into a table called ‘kittens’.
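The post stores results with the scraperwiki library (something along the lines of['id'], data, table_name='kittens') – the exact call is in the gist). Since all ScraperWiki data ends up in SQLite anyway, here is a rough stdlib stand-in for what such a function does; the column names are illustrative:

```python
import sqlite3

def submit_to_scraperwiki(conn, data):
    """Insert (or replace) one photo record into the 'kittens' table.
    Rough sqlite3 stand-in for scraperwiki's"""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kittens "
        "(id TEXT PRIMARY KEY, title TEXT, url TEXT, "
        "latitude REAL, longitude REAL)")
    cols = ', '.join(data)
    marks = ', '.join('?' for _ in data)
    conn.execute(
        "INSERT OR REPLACE INTO kittens (%s) VALUES (%s)" % (cols, marks),
        list(data.values()))
    conn.commit()
```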


So, we’ve located all our kitties. What now? Let’s plot them on a map!

In your browser, navigate to your dataset. Inside it, click on ‘More tools’. You’ll see a list of all the possible tools you can use in order to better visualize your data.


As you can see, there are a lot of tools that you can use. We just want to select ‘View on a map’. This automatically looks at our data and places it on a map. It works because the tool recognises and extracts the latitude and longitude columns that are stored in the database.

Once you’ve added this tool to your datahub, you can then see each result that you have stored on a map. Here’s what it looks like!


When you click on a ‘pin’, you’ll see all the information relating to the photo of a kitty it represents.


My First Tool

This was the first tool that I’ve made on the new platform. I was a bit apprehensive, as I had previously only used ScraperWiki Classic and I was very much used to writing my scrapers in the browser.

However, I soon discovered that the latest iteration of the ScraperWiki platform is nothing short of a joy to use. It’s something that was obviously designed from the ground up with the user’s experience in mind.

Things just worked. Vim has a bunch of useful plugins installed, including syntax highlighting. There’s a huge choice of programming languages. The ‘View on a map’ tool just worked. It was snappy and responsive. It’s also really, really fun.

You can try out my tool too! We have decided to adapt it to a general purpose Flickr search tool, and it is available to download right now! Next time you create a new dataset, have a look at ‘Flickr Geo Search’ and tell me what you think!


So, what’s your next tool going to be?

The state of Twitter: Mitt Romney and Indonesian Politics Mon, 23 Jul 2012 09:16:53 +0000 It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can then map, graph or analyse further.

So, anyway, Tom and I decided it was about time to take a closer look at how you guys are using ScraperWiki to draw data from Twitter, and whether there’s anything we could do to make your lives easier in the process!

Getting under the hood

As anybody who’s checked out our source code will know, we store a truck-load of information about each scraper, and each run it’s ever made, in a MySQL database. Of the 9,727 scrapers that had run since the beginning of June, 601 accessed a URL. (Our database only stores the first URL that each scraper accesses on any particular run, so it’s possible that there are scripts that accessed Twitter, but not as the first URL.)

Twitter API endpoints

Getting more specific, these 601 scrapers accessed one of a number of Twitter’s endpoints, normally through the official API. We removed the querystring from each of the URLs and then looked for commonly accessed endpoints.

It turns out that search.json is by far the most popular entry point for ScraperWiki users to get Twitter data – probably because it’s the method used by the basic_twitter_scraper that has proved so popular. It takes a search term (like a username or a hashtag) and returns a list of tweets containing that term. Simple!

The next most popular endpoint – followers/ids.json – is a common way to find interesting user accounts to then scrape more details about. And, much to Tom’s amusement, the third, with 8 occurrences, was a search for Mitt Romney. We can’t quite tell whether that’s a good or bad sign for his 2012 candidacy, but if it makes any difference, only one solitary scraper searched for Barack Obama.


We also looked at what people were searching for. We found 398 search terms in the scrapers that accessed the twitter search endpoint, but only 45 of these terms were called in more than one scraper. Some of the more popular ones were “#ddj” (7 scrapers), “occupy” (3 scrapers), “eurovision” (3 scrapers) and, weirdly, an empty string (5 scrapers).

Even though each particular search term was only accessed a few times, we were able to classify the search terms into broad groups. We sampled from the scrapers who accessed the twitter search endpoint and manually categorized them into categories that seemed reasonable. We took one sample to come up with mutually exclusive categories and another to estimate the number of scrapers in each category.

A bunch of scripts made searches for people or for occupy shenanigans. We estimate that these people- and occupy-focussed queries together account for between two- and four-fifths of the searches in total.

We also invented some smaller categories that seemed to account for a few scrapers each – like global warming, developer and journalism events, towns and cities, and Indonesian politics (!?) – but really it doesn’t seem like there’s any major pattern beyond the people and occupy scripts.

Family Tree

Speaking of the basic_twitter_scraper, we thought it would also be cool to dig into the family history of a few of these scrapers. When you see a scraper you like on ScraperWiki, you can copy it, and that relationship is recorded in our database.

Lots of people copy the basic_twitter_scraper in this way, and then just change one line to make it search for a different term. With that in mind, we’ve been thinking that we could probably make some better tweet-downloading tool to replace this script, but we don’t really know what it would look like. Maybe the users who’ve already copied basic_twitter_scraper_2 would have some ideas…

After getting the scraper details and relationship data into the right format, we imported the whole lot into the open source network visualisation tool Gephi, to see how each scraper was connected to its peers.

By the way, we don’t really know what we did to make this network diagram, because we did it a couple of weeks ago, forgot what we did, didn’t write a script for it (Gephi is all point-and-click…) and haven’t managed to replicate our results. (Oops.) We noticed this because we repeated all of the analyses for this post with new data right before posting, and didn’t manage to come up with the sort of network diagram we had made a couple of weeks ago. But the old one was prettier, so we used that :-)

It doesn’t take long to notice basic_twitter_scraper_2’s cult following in the graph. In total, 264 scrapers are part of its extended family, with 190 of those being descendants, and 74 being various sorts of cousins – such as scrape10_twitter_scraper, which was a copy of basic_twitter_scraper_2’s grandparent, twitter_earthquake_history_scraper (the whole family tree, in case you’re wondering, started with twitterhistory-scraper, written by Pedro Markun in March 2011).

With the owners of all these basic_twitter_scraper(_2)s identified, we dropped a few of them an email to find out what they’re using the data for, and how we could make it easier for them to gather it in the future.

It turns out that Anna Powell-Smith wrote the basic_twitter_scraper at a journalism conference, and Nicola Hughes reused it for loads of ScraperWiki workshops and demonstrations as basic_twitter_scraper_2. But even that doesn’t fully explain the cult following, because people still keep copying it. If you’re one of those users, make sure to send us a reply – we’d love to hear from you!


We’ve posted our code for this analysis on Github, along with a table of information about the 594 Twitter scrapers that aren’t in vaults (out of 601 total Twitter scrapers), in case you’re as puzzled as we are by our users’ interest in Twitter data.

Now here’s video of a cat playing a keyboard.

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories) Fri, 25 May 2012 07:13:33 +0000

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time, rich data would revolutionize their governance and administration: it would give real evidence of whether there are too many or too few police, fire or ambulance personnel/vehicles/stations in any locale, and would enable imaginative and realistic policies that deliver major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add more or take away staff depending on their workloads. The problem is, for an emergency service that exists on standby for unforeseen events, there needs to be a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization — Indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance where the Excel spreadsheet of Fire Spending By city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the story journalism on it doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails() and MainParse(). Data for each incident is put into several tables, joined by the IncidentID value derived from each incident’s static URL.

MainIndex() operates their ‘search incidents’ form, grabbing 10 days at a time and saving the URL of each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parsing the obvious metadata and saving the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move that into the third function so I could develop it more easily.) A good way to find the list of URLs that have not yet been downloaded and saved into swdetails is the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
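You can see the pattern in miniature with Python’s built-in sqlite3: seed swdata with three incidents, mark one as already downloaded in swdetails, and the left join returns only the outstanding ones (table and column names as in the scraper; the sample IncidentIDs are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE swdata (IncidentID TEXT, urlpage TEXT)")
conn.execute("CREATE TABLE swdetails (IncidentID TEXT, divdetails TEXT)")
conn.executemany("INSERT INTO swdata VALUES (?, ?)",
                 [('a', 'url-a'), ('b', 'url-b'), ('c', 'url-c')])
conn.execute("INSERT INTO swdetails VALUES ('b', '<html>')")

# The scraper's "what's left to fetch" query.
todo = conn.execute("""
    SELECT swdata.IncidentID, swdata.urlpage
    FROM swdata
    LEFT JOIN swdetails ON swdetails.IncidentID = swdata.IncidentID
    WHERE swdetails.IncidentID IS NULL
    LIMIT 5""").fetchall()
# Only incidents 'a' and 'c' remain to be downloaded.
```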

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving the results into the table swparse. Because I was developing this function experimentally, to see how much information I could obtain from the free-form text, I had to frequently drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)", details, re.I)

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)
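With the backslashes restored (the blog’s HTML appears to have eaten them), the expression pulls rank and surname out of the marked-down text – for example, on the Holsworthy report quoted above:

```python
import re

details = ("One appliance from Holsworthy was mobilised to reports of a "
           "motorbike on fire. Crew Commander Squirrell was in charge.")

# Same expression as in the post, with escapes restored and the
# case-insensitive flag passed as an argument.
ranks = re.findall(
    r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)",
    details, re.I)
# ranks -> [('Crew', 'Commander', 'Squirrell')]
```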

For detecting how many vehicles were in attendance, I used this algorithm:

appliances = re.findall(r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details, re.I)
nvehicles = 0
for scount, fire, engine, town in appliances:
    if town and "town" not in data:
        data["town"] = town.lower()
    if re.match(r"one|1|an?|another", scount, re.I):  count = 1
    elif re.match(r"two|2", scount, re.I):            count = 2
    elif re.match(r"three", scount, re.I):            count = 3
    elif re.match(r"four", scount, re.I):             count = 4
    else:                                             count = 0
    nvehicles += count

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day at which each incident took place, and produced this time rose: a polar bar graph in which each sector shows the number of incidents occurring in that hour.

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three different historically independent services.

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

ScraperWiki scrapers: now 53% more useful! Wed, 16 Nov 2011 12:01:07 +0000

It’s Christmas come early at ScraperWiki HQ as we deliver—like elves popping boxes under the data digging Christmas tree—a bunch of great new improvements to the ScraperWiki site. We’ve been working on these for a while, so it’s great to finally let you all use them!

First up: a new look for your scrapers

The most obvious change will hit you as soon as you look at a scraper – the overview page now sports a svelte, functional new layout. The roster of changes is as long as Santa’s list, so I’ll just pick out a few…

The blue header at the top of the page is now way more informative. As well as the scraper’s title and creator, you can also see the language it’s written in, the domain it scrapes, the number of records in its datastore, and its privacy status. No more hunting around the page: everything you need is there in one place. Hurrah!

Further down, you’ll notice the history and discussion pages have now been merged into the main page, meaning you’ll spend less time flicking between tabs and more time editing or investigating the scraper.

Meanwhile, the page as a whole is a lot more organised. Everything to do with runs (the current status, the last run, the pages scraped, the schedule) is up in the top left. Everything to do with the datastore (including the data preview and download options) is just below that, and everything to do with the scraper’s relationship to other scrapers (like tags, forks, copies and views) is just below that. Neat.

Speaking of which, the data preview has had some serious attention. It’s now way more interactive: you can sort on any column, alter the number of rows displayed, and page through all of the data in all of the tables, with just a few clicks. And other features, like syntax-highlighted table schemas and a nifty drop-down when you have too many tabs to fit on the page, should keep ScraperWiki power-users fast and efficient.

And those are just the headline changes. There have also been a load of great tweaks, like a View Source button so you never have to worry about breaking someone’s scraper when you’re just taking a look, and an easy Share button to get your scrapers on Facebook, Twitter and Google+. So go try out the new page, and as ever, we’d really love your feedback.

Never miss a comment again

As well as moving the scraper discussion (or ‘chat’, as it’s now called) onto the main page to make it more obvious, we’ve also enabled email notifications for comments. Now, when someone comments on your scraper, you’ll get a swish new email showing you who they are, what they said, and how to reply (thanks to ScraperWiki’s new engineer, David Jones, for his input on this!).

If, however, all this conversation is a little too much for you, Ebenezer, then you can disable comment notifications by unchecking the box in your Edit Profile page.

And while we were at it: Messages!

For a while, users have grumbled that it’s far too difficult to contact other users. And quite right too – we never anticipated that our developers would be such social creatures! So, we’ve added a “Send a Message” button to everyone’s profile (kudos to Chris Hannam for helping out!). The messages are sent as emails, meaning the other user never sees your email address – just your name, your message and a link to your profile. And, as with comment notifications, if you want to disable sending and receiving of user messages, just uncheck the box in your Edit Profile page.

ScraperWiki Datastore – The SQL. Wed, 06 Apr 2011 16:54:58 +0000 Recently at ScraperWiki we replaced the old datastore, which was creaking under the load, with a new, lighter and faster solution – all your data is now stored in SQLite tables as part of the move towards pluggable datastores. Besides the increase in performance, SQLite brings other benefits, such as allowing us to transparently modify the schema, and letting you access the data with SQL via the ScraperWiki API or via the Sqlite View. If you don’t know SQL, or just need a reminder of the syntax, there are plenty of good SQL tutorials online that might get you started.

For getting your data out of ScraperWiki you can try using the Sqlite View, which makes it easier to add the fields you want to query as well as performing powerful queries on the data.  To explain how you do this we’ll use the Scraper created by Nicola in her recent post Special Treatment for Special Advisers in No. 10 which you can access on ScraperWiki, and from there create a new view. If you choose General Sqlite View, you’ll get a nice easy interface to query and study the data.  This dataset shows data from the Cabinet Office (UK central Government) and logs gifts given to advisers for the top ministers – all retrieved by Nicola after only having known how to program for three weeks.

If you’re more confident with your SQL, you can access a more direct interface after clicking the ‘Explore with ScraperWiki API’ link on the overview page for any scraper. This will also give you a link that you can use elsewhere to get direct access to your data in JSON or CSV format.  For those that are still learning SQL, or not quite as confident as they’d like to be, using the Sqlite View is a good place to start.  When you first get to the Sqlite View you’ll see something similar to the following, but without the data already shown.

As you can see, the view gives you a description of the fields in the SQLite table (highlighted in yellow) and a set of fields where you can enter the information you require. If you are feeling particularly lazy, you can simply click on the highlighted column names and they will be added to the SELECT field for you! Accessing data across scrapers is done slightly differently, and is hopefully the subject of a future post. By default this view will display the output data as a table, but you can change it to do what you wish by editing the HTML and Javascript underneath – it is pretty straightforward. Once you have added the fields you wish to query (making sure to surround any field names containing spaces with backticks), clicking the query button will make a request to the ScraperWiki API and display the results on your page. It also shows you the full query, so that you can copy it and save it away for future use.

Now that you have an interface where you can modify your SQL, you can access your data almost any way you want it! You can do simple queries by leaving the SELECT field set to *, which will return all of the columns, or you can specify individual columns and the order in which they will be retrieved. You can even set their titles using the AS keyword: setting the SELECT field to “`Name of Organisation` AS Organisation” will show that field under the new, shorter column name.

Aside from ordering your results (putting a field name in ORDER BY, followed by desc if you want descending order), limiting your results (adding the number of records to LIMIT) and the aforementioned renaming of columns, one thing the Sqlite View will let you do is group your results, to show information that isn’t immediately visible in the full result set. Using the Special Advisers scraper again, the following view shows how, by grouping the data on `Name of Organisation` and using the count function in the SELECT field, we can show the total number of gifts given by each organisation – surely a lot faster than counting how many times each organisation appears in the full output!
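Here’s that grouping trick in miniature with Python’s sqlite3 module – the gift data is made up, but the backtick-quoted column name, the AS renaming and the count() call are exactly the ones described above:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE swdata (`Name of Organisation` TEXT, Gift TEXT)")
conn.executemany("INSERT INTO swdata VALUES (?, ?)",
                 [('Acme Ltd', 'hamper'), ('Acme Ltd', 'wine'),
                  ('Widget Co', 'tickets')])

# Group on the organisation and count the gifts per group.
totals = conn.execute("""
    SELECT `Name of Organisation` AS Organisation, count(*) AS Gifts
    FROM swdata
    GROUP BY `Name of Organisation`
    ORDER BY Gifts DESC""").fetchall()
# totals -> [('Acme Ltd', 2), ('Widget Co', 1)]
```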

In addition to using the count function in SELECT you could also use sum, or even avg to obtain an average of some numerical values. Not only can you add these individual functions into your SELECT field, you can get a lot more complicated to get a better overall view of the data, as in the Arts Council Cuts scraper. Here you can see the output for the total revenue per year and average percent change by artform and draw your own conclusions on where the cuts are, or are not happening.

SELECT `Artform `,
    sum(`Total Revenue 10-11`) as `Total Revenue for this year`,
    sum(`11-12`) as `Total Revenue for 2011-2012`,
    sum(`12-13`) as `Total Revenue for 2012-2013`,
    sum(`13-14`) as `Total Revenue for 2013-2014`,
    (avg(`Real percent change -Oct inflation estimates-`)*100) 
    as `Average % change over 4 years (Oct inflation estimates)`
FROM swdata
GROUP BY `Artform `
ORDER BY `Total Revenue for this year` desc

If there is anything you’d like to see added to any of these features, let us know either in the comments or via the website.
