Nicola Hughes – ScraperWiki Extract tables from PDFs and scrape the web

Have a Happy Open Data Day with ScraperWiki Fri, 02 Dec 2011 14:19:37 +0000

Seize the Data Day with Open Knowledge Foundation Thu, 01 Dec 2011 18:35:48 +0000 On December 3rd at the Barbican Centre in London, the Open Knowledge Foundation will be inviting everyone and anyone who can wrangle up a good bit of data or wrangle the wranglers of data, to a Seize the Data event.

So they could do with some screen scraping, data extracting, coding extraordinaires (i.e. you).

So if you’re free and happen to be in London, please park your diggers outside their door and start turning the scraper cogs towards evil PDFs.

Sign up here.

Mapping @TahrirSupplies Fri, 25 Nov 2011 14:53:31 +0000

Click on image to go to map

One of our users I recently met in New York said ScraperWiki is “a great tool for hacktivism”. Because of this we have a lot of ‘hacktivists’ in our community. One such ‘hacktivist’ is Thomas Levine. He’s recently scraped @TahrirSupplies, a Twitter account set up to crowd-source the need for supplies at Tahrir Square and match it with what is available in the surrounding area.

In under 30 lines, he’s mapped the supplies from the tweets (see above). For that piece of good work, he’s our user of the week and today’s follow Friday. You can find him on ScraperWiki and also buzzing about on Google Groups.

The ScraperWiki digger is very glad to have him on board!!

ScraperWiki in 3 minutes Wed, 23 Nov 2011 16:36:59 +0000

As you’ve probably noticed, Zarino and the team recently upgraded all of your scraper pages to have a lovely new, more useful look. So we’re rolling out a new set of introductory screencasts going through our new look site and giving you a bit of a flavour of the many things you can do on ScraperWiki.

I’ve started with a three-minute whirlwind tour. Just an example of some of the things you can do. We’ll be adding all the screencasts to the documentation. It’s all for you, so if you like what we’re doing spread the word (i.e. this video!).

With Tools, Tables and Tours: We’re Looking to Liberate Data Across the US Mon, 14 Nov 2011 17:16:30 +0000 As part of the Knight News Challenge entry, we at ScraperWiki said we would roll out Journalism Data Camps across the U.S. We had done what we called “Hacks and Hackers Hack Day” events across the U.K. and Ireland, bringing journalists and coders together. This happened at the same time as HacksHackers in the U.S. — great minds and whatnot!

Now we’re scaling up when it comes to exploring the data prospects of the new world. We are heading across the U.S. on a data liberation front. But where do we start, and where do we go? Well, firstly we want to liberate data. And lots and lots of people can use data. More importantly, we want to bring together anyone who wants to work with data to tell a story, provide insight or build an application.

So how do you go about finding where the right mixes are? Well, I scraped the data, mapped it, and visualized it, of course! I scraped media organizations, R, Ruby, Python, and PHP meetup groups, data conferences, some B2B media as well as HacksHackers Chapters, and the top journalism schools. All in all, almost 13,000 data points were collected from different scrapers. So I put them into Google Fusion Tables and voila! (Please click on the image to be taken to the map)

A heat map gives me the hotspots for the concentrations of data points. These are biased towards the media sector, as there are many more outlets than interest groups and journalism schools. But it’s a good gauge of where we can build interest for the events.

Drilling down through the data using filter and aggregate, I got the breakdown of the proportion of the groups we want to reach for each locale. With some rough and ready image manipulation (I use Gimp as it’s open source), I mashed up a visualization scaling the pie charts so that the pixel radius corresponds to the size of the dataset for that location.
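For anyone wanting to skip the manual image manipulation, the scaling step can be sketched in a few lines of Python. This is only an illustration, not the method used for the map: the function name `pie_radii`, the 60-pixel maximum and the counts are all made up. The default matches the description above (radius proportional to the number of data points); the optional `by_area` flag instead scales the area of the pie, which some people find easier to compare by eye.

```python
import math

def pie_radii(counts, max_radius_px=60, by_area=False):
    """Map each location's data-point count to a pie-chart radius in pixels.

    The largest location gets max_radius_px; the others are scaled relative
    to it. Both the function name and the default radius are illustrative.
    """
    largest = max(counts.values())
    radii = {}
    for place, n in counts.items():
        share = n / largest
        if by_area:
            # Area grows with the square of the radius, so take the root.
            share = math.sqrt(share)
        radii[place] = round(max_radius_px * share, 1)
    return radii

# Made-up counts, purely for illustration.
counts = {"New York": 1200, "Chicago": 300, "Austin": 120}
radii = pie_radii(counts)
radii_by_area = pie_radii(counts, by_area=True)
```

With radius scaling, a location with a quarter of the data points gets a pie a quarter as wide; with area scaling it gets one half as wide.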

Now, it’s not an exact science nor is it news site-ready. But the speed at which I can look for a guide from data is now set to the digital time clock. 13,000 data points collected, cleaned and visualized in half a day. This is now a loose guide but also a tool. And this is the sort of quick thinking, quick gathering and quick analyzing we want to see at our events. So think big data. Think multiple sources. Think multiple tools. And then you can extrapolate for multiple uses!

We haven’t settled on our tour locations yet, so watch this space for details. We’re also getting clues for where to go from the data underground, so don’t think the data is giving everything away. We hope to see you there!

Solving Healthcare Problems with Open Source Software Fri, 11 Nov 2011 13:05:59 +0000 This year, EHealth Insider brought a new feature to their annual EHI Live exhibition: a healthcare skunkworks that gave visitors the chance to ask questions about how open source software can be used to solve healthcare problems.

ScraperWiki, of course, had to be one of the invited guests to exhibit at the skunkworks. So as is our way, we drove an agile data mining sprint on the first day of the exhibition. The idea was to convene a small group of developers, give them coffee and an Internet connection, and see if they could create useful healthcare and NHS data sets by the end of the day. Attendees at the ScraperWiki exhibit could watch development progress on the scrapers in real time! It was thrilling!

Four developers participated in the sprint, from ScraperWiki and NHS Connecting for Health. By the end of the day, they had written multiple scrapers delivering data about:

* World Health Organisation outbreak alerts and responses

* Communicable and respiratory disease incidence data from the Royal College of GPs

* Health information standards from the NHS Information Standards Board

* Foodborne outbreaks in the US, from the Centers for Disease Control and Prevention

* Suppliers registered with the UK Government Procurement Service

One very lucky developer, Jacob Martin, from NHS Connecting for Health, won the coveted ScraperWiki mug for writing the most scrapers over the course of the day (*applause*).

But it’s not just about the scraping, it’s the ideals of ‘open’ that can be enlightening in such a short period of time given the will and the right equipment. As Shaun Hills, from NHS Connecting for Health, commented: “Interoperability and data exchange are important parts of healthcare IT. It was interesting and useful to see how technology like ScraperWiki can be used in this area. It was also good to brush up on my Python coding and still deliver something in a few hours.”

So watch out healthcare – you’re being ScraperWikied!

Some ScraperWikiLovin’ at MozFest Mon, 07 Nov 2011 14:50:56 +0000 This weekend saw ideas made reality, collaborations fostered and the future web bloom. The Mozilla Festival was all about making the web and making it happen in two days! Here at ScraperWiki we like doing that with data, so as well as contributing to the Data Driven Journalism Handbook, we held a quick fire ScraperWiki round.

And when I say quick I mean ~1hr! With a couple of geeks in hand, some eager journalist types, laptops and our ever articulate CEO, Francis Irving, we set to work, well, talking about data. The fact is there are many pre-scraping steps to consider:

  1. What is the general area you are interested in?
  2. Can you find other people, especially geeks, with that interest?
  3. When you have done so, you need to find where the data is that relates to your field of interest
  4. Once you’ve got a list of interesting data, you need to look at its structure (non-programmatically) in order to decide on a hypothesis to test
  5. Then you need to recruit your geek (who should be involved in all of the above steps) to start deconstructing the data i.e. seeing what can be scraped
  6. At this point you all need to work together to decide the schema of the scraper datastore i.e. the headings and their attributes
  7. Iterate until your data can answer your hypothesis or alter your hypothesis (it could be that you can mash the scraper with another dataset)
  8. Get working on answering your hypothesis. The outcome could be a query, a visualization or an application
  9. Go back to your data and iterate again so that the structure fits your outcome
  10. Pat yourselves on the back, have a beer and keep in touch for your next project
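Step 6 is the one that most often gets skipped, so here is a rough sketch of what “deciding the schema” can look like in practice. It uses Python’s built-in `sqlite3` module rather than the ScraperWiki datastore itself, and the table name, column names and example rows are all hypothetical. The point is that agreeing on the headings, their types and a unique key up front means re-running the scraper updates rows instead of duplicating them.

```python
import sqlite3

# Hypothetical schema agreed between journalist and geek: one row per event,
# keyed on the URL it was scraped from.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    url       TEXT PRIMARY KEY,   -- unique key: where the record came from
    title     TEXT NOT NULL,
    held_on   TEXT,               -- ISO date string, e.g. 2011-11-05
    attendees INTEGER
)
"""

def save(conn, row):
    """Insert a scraped row, replacing any earlier row with the same url."""
    conn.execute(
        "INSERT OR REPLACE INTO events (url, title, held_on, attendees) "
        "VALUES (:url, :title, :held_on, :attendees)",
        row,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Made-up rows; the second one simulates re-scraping the same page.
save(conn, {"url": "http://example.org/e/1", "title": "Hack day",
            "held_on": "2011-11-05", "attendees": 20})
save(conn, {"url": "http://example.org/e/1", "title": "Hack day",
            "held_on": "2011-11-05", "attendees": 25})

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
latest = conn.execute("SELECT attendees FROM events").fetchone()[0]
```

After both saves there is still a single row, holding the most recent attendee figure.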

This may seem a bit much but this is how you make, iterate, and mediate for the web. The Mozilla Festival proved that this is achievable and enjoyable. In that vein, we got a scraper in 1hr! So a big cheer to Alex Poderoso for winning the coveted ScraperWiki mug.

To catch up on the MozFest fun, here is the first draft of the Data Journalism Handbook. The festival premiered an amazing HTML5 documentary called The One Millionth Tower. You can catch up with all the rest including teaching kids to code with Hackasaurus and hacking video with popcorn.js (and an octocopter!) and loads more at the Mozilla Festival website.

Which witch is which? Wed, 02 Nov 2011 17:35:50 +0000 I hope you all had a spooky Halloween and have gorged yourselves on candy. In keeping with the festive cheer, The Guardian Datastore has hopped aboard and, with a digger in residence, Chris Blower, scraped the court cases of Scottish witches.

Thanks to Lisa Evans we can now see that the most common reason for a court case to be held against a suspected witch is the defendant being implicated by another woman convicted as, or suspected to be, a witch. Shocking! (Note: our digger is gender neutral!).

So hats off (pun intended) to the digger, the digital journalist and the data!

Growing back to the Future: Allotments in the UK, open data stories and interventions Mon, 31 Oct 2011 15:34:16 +0000 This is a guest blog post from Farida Vis. She attended EuroHack at the Open Government Data Camp 2011. It consisted of a series of short talks combined with plenty of opportunities for hacking in groups in the second part of the workshop.

On the day, we were given an introduction to data driven journalism by data journalist Nicolas Kayser-Brill, who has recently launched J++, a new media company that builds data journalism applications. Friedrich Lindenberg (OKF) and Aidan McGuire (ScraperWiki) gave a thorough overview of scraping, mainly focusing on the very popular ScraperWiki with Friedrich highlighting its application to EU spending data. Finally Chris Taggart (Open Corporates) talked about EU spending data as well as Open Corporates, and gave a hands-on workshop on Google Refine.

My personal interest lies in rather everyday data, related to ‘mundane issues’ that people relate to easily, principally because they feature in their everyday lives. This allows for a rethinking of political participation and civic engagement beyond the rather stale ways in which this is measured traditionally. I’m interested in what Liz Azyan has started calling ‘really useful’ data, which has the ordinary end user firmly in mind. Personally I find huge spending data difficult to get my head round (but I guess I’m not alone in that) and so I’m interested in exploring a more manageable example and seeing how far I can take it. So for some time now, I have been looking at the issue of allotments in the UK. At EuroHack I had not really intended to pitch my project, but having briefly talked about what I was doing to Aidan McGuire before the start of the workshop, he highlighted it on my behalf and then there was luckily no turning back. I was delighted with people’s interest in the project. The sections below highlight what we looked at on the day and what happened next.


What issue did we look at?

An allotment is a small plot of publicly owned land you rent from the council for a small annual fee, giving people the possibility to grow their own fruit and vegetables. I have an allotment myself (here’s a picture) and was lucky that when I decided to get one eleven years ago the waiting list was only two months, so my partner and I got one nearly immediately. Since then those numbers have shot up to the extent that on our site in South Manchester the waiting list is now fifteen years, highlighting a nationwide problem. The last few years have seen a staggering increase in demand, no doubt fuelled by growing broader environmental concern and awareness, yet no significant number of extra allotments has been created to meet this demand. The New Local Government Network reports that during the 1940s there were around 1.4 million allotments in the UK, compared with only 200,000 today, which partly reflects that ‘growing your own’ goes through cycles of popularity. During a period of complete lack of interest, it is difficult for councils to hold on to this land as allotments that nobody wants. But what do you do when it seems everybody wants one again?

Earlier this year, the Department for Communities and Local Government issued a public consultation on 1294 Statutory Duties pertaining to local authorities to possibly reduce their number. These duties included Section 23 of the Allotments Act from 1908, which ensures local authorities provide allotments (and should take seriously such a request made by at least six tax paying citizens in a council), causing some newspapers to suggest that ‘The Good Life’ was now under threat. The Act remained unchanged however and this summer the government announced that of the 6,103 responses received, nearly half contained a comment on the Allotments Act, suggesting on a ‘straw poll’ level at least that this is an issue people care about.


What were we interested in?

Although it is tempting to simply highlight this problem in a different way, with additional data and accompanying visualisations, I was keen to highlight that whilst I do think there is an issue with councils not providing more sites, it is also clear to me that they are not exactly in a position to do so given the current economic climate. So whatever we did, it was important to me that we used part of the day to start thinking about alternative solutions to the waiting list crisis, for example by identifying underused plots of land (brown field sites and others), which could serve as temporary growing spaces (pop-up allotments anyone?). In my attempt to ‘do something about this’ I was joined by Daniela Silva and Pedro Markun from the Sao Paulo based think-and-do tank Esfera; data journalist Nicolas Kayser-Brill; python/js developer and self-described open data fan Anna Powell-Smith; and finally Andrew Mackenzie, who was at the OGD camp to film, part of an ongoing project that records the open data movement.

What data did we have?

Although there is very little allotment data available, as councils rarely publish it, Transition Town West Kirby (TTWK), led by Margaret and Ian Campbell, has for the last three years used the Freedom of Information Act to obtain allotment waiting list data through WhatDoTheyKnow. They publish this data, along with a report each year, and these figures are now widely used in the mainstream media. The reports, however, focus on national averages and do not highlight specific differences between councils or identify councils where problems are particularly severe. My co-researcher at Leicester, Yana Manyukhina, and I had recently put in our own FOI request to build on the TTWK data. Our request focused on rental cost, water charges, and whether discounts were available to plot holders. Aside from this we also requested the tenancy agreements councils use to manage their allotment sites. An analysis of these agreements may reveal further differences between councils, which could prove to be significant to citizens living in these locations. Because I am Manchester based, we also had a look at allotment location data manually collected by Feeding Manchester, which is interested in sustainable food for Greater Manchester.


What did we do?

After my introduction, Anna decided to work on the FOI data, using Google Fusion Tables. In a UK context, The Guardian Data Store frequently uses these in order to highlight differences per council related to a specific topic. I had previously standardised the TTWK data so that each council now included a figure for how many people were waiting per 100 allotments (the data set also includes further details about number of sites and allotments per council). Anna and I decided that we would add data from the FOI request Yana and I had made to the TTWK data, namely: the rental cost, water charges, and discounts given. I need to do further work on standardising the rent charge per council, which is still expressed in a range of different old-fashioned measurements. Allotment sizes were traditionally measured in ‘poles’ and ‘rods’ (from 1908 onwards a standard plot was 10 rods), though many now use square yards and metres.
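To make the two standardisation steps concrete, here is a minimal Python sketch. It is not the spreadsheet work described above: the function names, the unit labels and the example figures are hypothetical. The conversion factors are standard ones: a rod (also called a pole or perch) as an area is 5.5 yards squared, i.e. 30.25 square yards, and a square yard is 0.83612736 square metres, so a traditional 10-rod plot is roughly 253 square metres.

```python
# Standard conversion factors to square metres.
SQ_YARD_IN_M2 = 0.83612736            # 0.9144 m squared
ROD_IN_M2 = 30.25 * SQ_YARD_IN_M2     # square rod: 5.5 yd x 5.5 yd

# Hypothetical unit labels, as they might be cleaned up from council replies.
UNIT_TO_M2 = {
    "rod": ROD_IN_M2,
    "pole": ROD_IN_M2,                # rod and pole name the same measure
    "sq_yard": SQ_YARD_IN_M2,
    "sq_metre": 1.0,
}

def waiting_per_100_plots(waiting, plots):
    """People on the waiting list per 100 allotment plots, or None if no plots."""
    if plots == 0:
        return None
    return round(100 * waiting / plots, 1)

def rent_per_m2(annual_rent, plot_size, unit):
    """Annual rent per square metre for a plot given in any supported unit."""
    area_m2 = plot_size * UNIT_TO_M2[unit]
    return annual_rent / area_m2

# Made-up council figures, purely for illustration.
pressure = waiting_per_100_plots(waiting=45, plots=300)
standard_plot_m2 = 10 * ROD_IN_M2
rent = rent_per_m2(annual_rent=50.0, plot_size=10, unit="rod")
```

Once every council’s rent is expressed per square metre, the figures can sit in the same Fusion Tables column and be compared directly.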

Pedro and Nicolas both worked on building a series of scrapers, using ScraperWiki, scraping the Feeding Manchester data, Landshare data (Landshare is an initiative that is already offering alternatives, matching up individuals who have land, with those who wish to cultivate it) as well as a number of council sites. Aside from this Pedro and I also worked with an idea that ScraperWiki’s Julian Todd had given me at an earlier meeting (at OKCON in Berlin), and that is to use OpenStreetMap to get people to mark up allotments. In our extended idea (usefully articulated by Andrew Mackenzie on the day), other possible growing spaces, possibly with a newly agreed land use tag could also be mapped. In the end Pedro built a site that pulled in all the OSM data to show allotment sites in the UK and would update daily every time a new allotment was marked up on OpenStreetMap.
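For readers curious what pulling allotment sites out of OpenStreetMap might involve, here is a rough sketch. It is an assumption on my part, not a description of how Pedro’s site works: it uses the existing OSM tag `landuse=allotments` and a query written for the Overpass API, then turns the JSON response into a simple list of sites. The function name and the sample response are made up, and sending the query over the network is deliberately left out so the example runs offline.

```python
import json

# Overpass QL: allotment ways and relations inside the UK, with a centre point.
OVERPASS_QUERY = """
[out:json][timeout:60];
area["ISO3166-1"="GB"][admin_level=2]->.uk;
(
  way["landuse"="allotments"](area.uk);
  relation["landuse"="allotments"](area.uk);
);
out center;
"""

def allotment_sites(overpass_json):
    """Flatten an Overpass JSON response into dicts with id, name, lat, lon."""
    sites = []
    for element in overpass_json.get("elements", []):
        # Ways and relations carry a "center"; nodes carry lat/lon directly.
        point = element.get("center") or element
        if "lat" not in point or "lon" not in point:
            continue  # no usable coordinates, skip it
        sites.append({
            "osm_id": "{}/{}".format(element["type"], element["id"]),
            "name": element.get("tags", {}).get("name", "(unnamed)"),
            "lat": point["lat"],
            "lon": point["lon"],
        })
    return sites

# Made-up response in the shape Overpass returns for "out center".
SAMPLE = json.loads("""
{"elements": [
  {"type": "way", "id": 1, "center": {"lat": 53.44, "lon": -2.25},
   "tags": {"landuse": "allotments", "name": "Example Road Allotments"}},
  {"type": "way", "id": 2, "tags": {"landuse": "allotments"}}
]}
""")

sites = allotment_sites(SAMPLE)
```

Re-running something like this once a day would pick up any newly mapped allotments, which is the behaviour described above.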

What happened next?

The enthusiasm and the great work we did during the hack day meant that I wanted to reflect this in my presentation at the camp the next day. I wanted to highlight both the issues with current allotment data collection (lack of ontologies) and with access to or knowledge of this data, combined with the huge surge in demand from ordinary people wanting to grow their own produce. Going beyond simply a better visualisation of council data obtained via FOIs, I strongly emphasised the possibility of a technological intervention into this growing (pardon the pun) issue, by building stronger ontologies for allotment data (Pedro and I talked about this a lot afterwards), but also by thinking beyond the unproductive ‘councils just need to provide more allotments’ deadlock. Following my presentation I had various offers from people keen to help out with the mapping, but one person on Twitter confirmed my feeling that mapping directly in OpenStreetMap was still quite a daunting prospect for the ordinary end user, which makes it hard to get a lot of people involved. I toyed with the idea of filming a simple step-by-step tutorial, but in the end Pedro suggested using a new, more user-friendly interface, one he is currently developing for the Sao Paulo Council in Brazil. This is still under development, but we will hopefully have an update soon.

Anna and I made excellent progress and had a great chat with Lisa Evans from the Guardian Data Store, who was at the camp to present and who expressed an interest in putting the allotment data on the data store. I will work with Anna over the next few days to complete the data set and do a short write-up. Hopefully releasing this data through such a well-known and respected site might generate some further interest. Daniela also interviewed me for the Esfera blog and she has written up our EuroHack day in Portuguese here.

All this flurry of activity did not go unnoticed and the project has now received official support from the OKF, with Community Coordinator Kat Braybrooke as the key liaison. Although Kat and I had talked for months about this project already, it seemed that it needed the critical mass, collective brainstorming and hacking at EuroHack and afterwards to push this open data part of the project to the next level. Kat and I will be meeting with a range of NGOs and interested parties soon, who have expressed an interest in pooling resources and making a joint intervention into this problem. It is hard to express how exciting it was to connect with such amazing people at EuroHack, who all did such a tremendous amount of work on this project, and especially to end up with such a great result. An OKF site highlighting the mapping project will launch shortly and we hope to give you further updates in the not-too-distant future. Watch this (growing) space!

If you would like to get involved or receive further info on the project, feel free to get in touch via email or Twitter.

Farida Vis from the University of Leicester in the UK (where she teaches Media and Communication) recently took part in EuroHack, a pre-conference workshop in Warsaw, Poland, on 19 October, at the Open Government Data Camp 2011, organised by the European Journalism Centre and the Open Knowledge Foundation. Farida is very grateful to the EU Commission for supporting her attendance at EuroHack and the OGD Camp with a travel bursary. 

Meet Data Without Borders Fri, 28 Oct 2011 16:03:02 +0000 Whilst navigating the rugged landscape of ‘Big Data’, we’ve crossed many barriers, traversed wide plains and battled through choppy weather. Along the journey we’ve picked up a lot of hitchhikers (almost 5,000) and laid down a lot of data tracks (over 75 million records in our data store).

We’ve also come across some interesting creatures along the way. Dinosaurs of the web age, colonies of data cleaners and civic hackers. We’ve even met some species in their early evolutionary stages.

Whilst travelling across the Atlantic we’ve observed a new species of data digger: the Data Scientist. They also have a vessel of their own, looking to ship data prospectors to the Big Data promised land. So without further ado, I give you Data Without Borders.

They are on a humanitarian mission and looking to spread their cause across the pond. Their mission is to “match non-profits in need of data analysis with freelance and pro bono data scientists who can work to help them with data collection, analysis, visualization, or decision support”. They are organizing data dives in order to bring exciting new problems to the data community and help solve social, environmental, and community problems alongside nonprofits and NGOs.

So if you’re interested in data, have skills or interests you want to share, hop on our digger. With the help of Jake Porway and the gang, we’ll smuggle you across the data border!

PS: There was a reconnaissance mission in New York recently (i.e. a data dive). To see what sort of data micro-financing institutions can get, you can pick up the trail on the mixmarket tag.