video – ScraperWiki: Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000

Have a Happy Open Data Day with ScraperWiki Fri, 02 Dec 2011 14:19:37 +0000 Tomorrow is Open Data Day, and if you’re planning on hacking the web with the Open Knowledge Foundation, Random Hacks of Kindness, or your own data hack of choice, here’s some fuel for your fight (and if you’ve already driven our digger, check out the last video on our API so you can build applications!):

Start Talking to Your Data – Literally! Fri, 23 Sep 2011 15:22:38 +0000 Because ScraperWiki has a SQL database and an API with SQL extraction, I can inject SQL (haha!) straight into the API URL and consume the JSON output.
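To make the idea concrete, here is a minimal sketch of embedding a SQL query in an API URL. The endpoint address, scraper name and query here are illustrative assumptions of my own, not ScraperWiki’s actual API details:

```python
from urllib.parse import urlencode

def build_api_url(base, scraper_name, sql):
    """Return a URL whose query string carries the SQL and the output format."""
    params = {
        "format": "jsondict",  # ask the API for a list of JSON dicts
        "name": scraper_name,
        "query": sql,
    }
    return base + "?" + urlencode(params)

# Hypothetical endpoint, for illustration only
url = build_api_url(
    "",
    "special_advisers_gifts_and_hospitality",
    "SELECT * FROM swdata LIMIT 10",
)
print(url)
```

Whatever SQL you type ends up URL-encoded in the `query` parameter, which is exactly what makes the “inject straight into the URL” trick work.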

So what does all that mean? I scraped the CSV files of Special Advisers’ meetings, gifts and hospitality at Number 10. The data stays up to date because I can schedule the scraper to run as new files are published, and if it fails to run I get notified by email.

Now, I’ve written a script that publishes this information, along with data from 4 other scrapers relating to Number 10 Downing Street, to a Twitter account, Scrape_No10. Because it’s a Twitter bot, I can tweet out a sentence and control the order and timing of tweets. I can even attach a hashtag, which I can then rescrape to find what the social media sphere has attached to each data point. This means the data can go fishing for you, as a journalist, but it is not immediately useful to the newsroom.

So I give you MoJoNewsBot! I have written a script as a module in an IRC chat bot. It takes what I type in the chat room, injects it into a SQL query sent to the ScraperWiki API, and extracts the answer from the resulting JSON, giving me a written reply in the chat room. For example:

Now I can write the commands in a private chat window with MoJoNewsBot or I can do it in the room. This means that rooms can be made for the political team in a newsroom, or the environment team, or the education team, and each can have its own bot with modules specific to its data streams. That way, computer assisted reporting can be collaborative and social. If you’re working on a story that has a political and an educational angle, you pop into both rooms, and both teams can see what you’re asking of the data. In that sense, you’ve got a social, data-driven, virtual newsroom. With that in mind, I’ve added other modules for the modern journalist.
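The core of a bot module like this is just command dispatch: map a typed command to a handler that fetches data. The command names and handlers below are invented for illustration (only `.gn` appears elsewhere in this post), so MoJoNewsBot’s actual layout may differ:

```python
# Minimal sketch of the command-dispatch idea behind a newsroom chat bot.
# Handler names and replies are my own invention for illustration.

def lookup_hospitality(args):
    # In the real module this would query the ScraperWiki API;
    # here we just echo the request so the flow is visible.
    return "Looking up hospitality for: %s" % " ".join(args)

def lookup_headlines(args):
    return "Fetching headlines for: %s" % " ".join(args)

COMMANDS = {
    ".hosp": lookup_hospitality,  # hypothetical command name
    ".gn": lookup_headlines,      # Google News lookup, mentioned below
}

def handle_message(text):
    """Parse a chat line like '.hosp Jane Doe' and dispatch it."""
    parts = text.split()
    if not parts:
        return None
    handler = COMMANDS.get(parts[0])
    if handler is None:
        return None  # not a bot command; stay quiet in the room
    return handler(parts[1:])
```

Anyone in the room can call the same commands, which is what makes the reporting collaborative rather than locked to one person’s terminal.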

With MoJoNewsBot you can look for Twitter trends, search tweets, look up a user’s latest tweets, get the latest headlines from various news sources and check Google News. The bot also has basic functions like Google search, Wolfram Alpha lookup, Wikipedia lookup, reminder setting and even a weather checker.

Here’s an example of the code needed to query the API and return a string from the JSON:

import json
import urllib
import urllib2

fmt = 'jsondict'   # output format for the API
scraper = 'special_advisers_gifts_and_hospitality'
site = ''
query = ('SELECT `Name of Special Adviser`, '
         '`Type of hospitality received`, '
         '`Name of Organisation`, `Date of Hospitality` '
         'FROM swdata '
         'WHERE `Name of Special Adviser` = "%s" '
         'ORDER BY `Date of Hospitality` DESC' % userinput)

params = {'format': fmt, 'name': scraper, 'query': query}

# build the API URL and parse the JSON it returns
url = site + urllib.urlencode(params)
swjson = json.loads(urllib2.urlopen(url).read())

# Number is how many rows the user asked for in the chat command
for entry in swjson[:Number]:
    ans = ('On ' + entry["Date of Hospitality"] + ' %s' % userinput +
           ' got ' + entry["Type of hospitality received"] + ' from ' +
           entry["Name of Organisation"])

This is just a prototype and a proof of concept. I would add to the module so the query could cover a specific date range. After that, I could go back to ScraperWiki and write a scraper that pulls in the other 4 Number 10 scrapers and constructs the larger database. Then all I need to do is change the name of the scraper in my module to this new one and I can now query the much larger dataset that includes ministers and permanent secretaries!
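The date-range extension could be as simple as adding a `BETWEEN` clause to the query builder. This is my own sketch of that proposed extension, reusing the table and column names from the code above; the function name and date format are assumptions:

```python
def date_range_query(adviser, start, end):
    """Build the hospitality query restricted to a date range.

    Table and column names follow the scraper above; the BETWEEN
    filter is a sketch of the proposed extension, with dates passed
    as ISO-style strings so they sort and compare correctly.
    """
    return (
        'SELECT `Name of Special Adviser`, `Type of hospitality received`, '
        '`Name of Organisation`, `Date of Hospitality` '
        'FROM swdata '
        'WHERE `Name of Special Adviser` = "%s" '
        'AND `Date of Hospitality` BETWEEN "%s" AND "%s" '
        'ORDER BY `Date of Hospitality` DESC' % (adviser, start, end)
    )

print(date_range_query("Jane Doe", "2011-01-01", "2011-06-30"))
```

Pointing the same builder at the merged Number 10 dataset would then just mean changing the scraper name, as described above.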

Now that’s computer assisted reporting!

PS: have fixed the bug in .gn so the links match the headlines

Conquering Copyright and Scaling Open Data Projects – How Chris Taggart is Counting Culture Fri, 09 Sep 2011 14:05:18 +0000 Chris Taggart is a founder of OpenlyLocal and OpenCorporates. He says “When people ask what I do I say I open up data, sometimes whether people like it or not.” In the beginning he didn’t really expect much to come of his first scrapers “other than maybe being told off by the councils, because all the councils at that time had got things on their website saying this is copyright”.

He did it anyway, with a profound outcome:

I expected them to send me a take down notice … actually that didn’t happen. What did happen is that a couple of councils contacted us and said we like what you’re doing, will you start scraping us.

His first success spurred him on to an even more ambitious project: corporate data. He knew he’d be looking at a vast array of sources scattered across the web, in different languages and formats, so he made a call-out on ScraperWiki for OpenCorporates. It currently holds information on 22 million companies across 28 jurisdictions. And it’s an alpha! I caught up with him on Skype to find out what he’s learnt about conquering copyright and scaling open data projects.

ScraperWiki Tutorial Screencast for Non-Programmers Mon, 15 Aug 2011 17:24:45 +0000 If you’ve been going through our first ambitious tutorial and taster session for non-coders then good for you! I hope you found it enlightening. For those of you yet to try it, here it is.

It is a step-by-step guide, so please give it a go and don’t just follow the answers, as you’ll learn more from rummaging around our site. Also check out the introductory video at the start of the tutorial if you’re not familiar with ScraperWiki. And don’t look at the answers, which are in screencast form below, unless you have had a go!

Here’s the Twitter scraper and datastore download. This is the first part of the tutorial, where you fork (make a copy of) a basic Twitter scraper, run it for your chosen query, download the data, and schedule it to run at a frequency that lets the data be refreshed and accumulated:

Next is a SQL Query View, which looks at the data in ScraperWiki with a journalistic eye. This is the second part of the tutorial, where you query a datastore using SQL to find the top 10 publications receiving complaints through the Press Complaints Commission, and also the top 10 complainants:
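The tutorial’s own query isn’t reproduced here, but the shape of a “top 10” question in SQL is a `GROUP BY` with a count, sorted and limited. This sketch runs that kind of query against a toy in-memory table; the column names and sample rows are invented, not the tutorial’s actual PCC schema:

```python
import sqlite3

# Toy stand-in for the tutorial's datastore; schema and rows invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (publication TEXT, complainant TEXT)")
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?)",
    [("Daily A", "Smith"), ("Daily A", "Jones"), ("Weekly B", "Smith")],
)

# Which publications receive the most complaints, highest first, top 10 only
rows = conn.execute(
    "SELECT publication, COUNT(*) AS complaints "
    "FROM swdata GROUP BY publication "
    "ORDER BY complaints DESC LIMIT 10"
).fetchall()

for pub, n in rows:
    print(pub, n)
```

Swapping `publication` for `complainant` in the same query answers the tutorial’s other question: who makes the most complaints.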

And lastly, we show you how to get a live league table view that updates with the scraper. This is the final part of the tutorial, where you make a live league table of the above query that refreshes whenever the original scraper updates:

If you have any questions please feel free to contact me nicola[at] For full training sessions or scraping projects like OpenCorporates or AlphaGov contact aine[at]

Hacks & Hackers Glasgow: the BBC College of Journalism video Tue, 12 Apr 2011 08:07:25 +0000 Last month we celebrated the final leg of our UK & Ireland Hacks & Hackers tour in Glasgow, at an event hosted by BBC Scotland and supported by BBC College of Journalism and Guardian Open Platform. You can read more about it here. Other coverage includes:

The BBC College of Journalism kindly filmed the whole thing and the videos are now available to watch. The whole playlist can be viewed here, or watch each segment in the clips below:

Hacks & Hackers RBI: The video Fri, 10 Dec 2010 10:04:45 +0000 Media reporter Rachel McAthy has produced this excellent video from last month’s Hacks & Hackers Hack Day at RBI. View it on, or below. More on the event at this link.

Video: Hacks and Hackers Hack Day Manchester Sun, 17 Oct 2010 22:37:46 +0000

Hacks and Hackers Hack Day Manchester at Vision+Media in Salford, on 15th October 2010. Filmed (on a Flip) and edited by Joseph Stashko, who has kindly allowed us to re-publish the video here. A write-up of the day can be found at this link.
