Student scraping in Liverpool: football figures and flying police Thu, 23 Dec 2010 12:50:27 +0000 A final Hacks & Hackers report to end 2010! Happy Christmas from everyone at ScraperWiki!

Earlier this month ScraperWiki put on its first ever student event, at Liverpool John Moores University in partnership with Open Labs, for students from both LJMU’s School of Journalism and the School of Computing & Mathematical Sciences, as well as external participants. This fabulous video comes courtesy of the Hatch. Alison Gow, digital executive editor at the Liverpool Daily Post and the Liverpool Echo, has kindly supplied us with the words (below the video).

Report: Hacks and Hackers Hack Day – student edition

By Alison Gow

At the annual conference of the Society of Editors, held in Glasgow in November, there was some debate about journalist training and whether journalism students currently learning their craft on college courses were a) of sufficient quality and b) likely to find work.

Plenty of opinions were presented as facts and there seemed to be no recognition that today’s students might not actually want to work for mainstream media once they graduated – with their varied (and relevant) skill sets they may have very different (and far more entrepreneurial) career plans in mind.

Anyway, that was last month. Scroll forward to December 8 and a rather more optimistic picture of the future emerges. I got to spend the day with a group of Liverpool John Moores University student journalists, programmers and lecturers, local innovators and programming experts, and it seemed to me that the students were going to do just fine in whatever field they eventually chose.

This was Hacks Meet Hackers (Students) – the first event that ScraperWiki (Liverpool’s own scraping and data-mining phenomenon that has done so much to facilitate collaborative learning projects between journalists and coders) had held for students. I was one of four Trinity Mirror journalists lucky enough to be asked along too.

Brought into being through assistance from the excellent LJMU Open Labs team, backed by LJMU journalism lecturer Steve Harrison, #hhhlivS as it was hashtagged was a real eye-opener. It wasn’t the largest group to attend a ScraperWiki hackday I suspect, but I’m willing to bet it was one of the most productive; relevant, viable projects were crafted over the course of the day and I’d be surprised if they didn’t find their way onto the LJMU Journalism news website in the near future.

The projects brought to the presentation room at the end of the day were:

  • The Class Divide: Investigating the educational background of Britain’s MPs
  • Are Police Helicopters Effective in Merseyside?
  • Football League Attendances 1980-2010
  • Sick of School: The link between ill health and unpopular schools

The prize for Idea With The Most Potential went to the Police Helicopters project. This group used a sample page from the Merseyside Police helicopter movements report, which showed the time, geography, outcome and duration of each flight. They also determined that, of the 33% of crimes that were solved, only 0.03% involved the helicopter. Using the scraped flight data, and comparing it with crime and policing-cost data, the group estimated that the helicopter cost £1,675 per hour to fly (amounting to more than £100,000 a month), and, by comparison with average officer salaries, projected that this could fund the recruitment of 30 extra police officers. The team also suggested potential spin-off ideas around the data.
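The group’s extrapolation is simple back-of-envelope arithmetic. A sketch of it follows; the monthly flight-hours figure and the all-in officer cost are assumptions chosen to be consistent with the numbers quoted above, not values from the scraped data.

```python
# Rough reconstruction of the team's cost extrapolation.
# Assumed inputs (not from the scraped data): ~60 flight hours a month
# and an all-in cost of ~£40,000 per officer per year.
HOURLY_COST = 1675              # £ per flight hour, as quoted by the team
FLIGHT_HOURS_PER_MONTH = 60     # assumption
OFFICER_ANNUAL_COST = 40_000    # assumption

monthly_cost = HOURLY_COST * FLIGHT_HOURS_PER_MONTH   # just over £100,000
annual_cost = monthly_cost * 12
equivalent_officers = annual_cost // OFFICER_ANNUAL_COST

print(f"£{monthly_cost:,} a month, or roughly {equivalent_officers} officers a year")
```

With those assumptions the monthly bill comes out at £100,500 and the annual bill at about thirty officers’ worth of salary, matching the figures the team presented.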

The Best Use of Data prize went to the Football League Figures team – an all-male group of journalists and student journalists, aided by hacker Paul Freeman – who scraped attendance data for every Football League club and brought it together into a database that could be used to show attendance trends. These included the dramatic drop in Liverpool FC attendances during the Thatcher years and the rises that coincided with exciting new signings, plunging attendances at Manchester City and subsequent spikes during takeovers, and the effects of promotion to and relegation from the Premier League. The team suggested such data could be used for any number of stories, and would prove compelling information for statistics-hungry fans.

The Most Topical prize went to the Class Divide group – LJMU students who worked with ScraperWiki’s Julian Todd to scrape data from the Telegraph’s politics section and investigate the educational backgrounds of MPs. The group set out to discover whether parliament consisted mainly of privately educated members. They said the data showed that most Lib Dem MPs were state educated, and that there was no overall skew between state- and privately educated MPs, contrary to what might have been expected. They added that the data they had uncovered would prove particularly interesting once MPs voted on university tuition fees.

The Best Presentation and the Overall Winner of the hackday went to Sick of Schools by Scraping The Barrel – a team of TM journos and students, hacker Brett and student nurse Claire Sutton – who used Office for National Statistics, Census, council information, and scraped data from school prospectuses and wards to investigate illness data and low demand for school places in Sefton borough. By overlaying health data with school places demand they were able to highlight various outcomes which they believed would be valuable for a range of readers, from parents seeking school places to potential house buyers.

Paul Freeman, described in one tweet as “the Johan Cruyff of football data scraping”, was presented with a ScraperWiki mug as Hacker of the Day, for his sterling work on the Football League data.

Judges Andy Goodwin, of Open Labs, and Chris Frost, head of the Journalism department, praised everyone for their efforts and Aine McGuire, of ScraperWiki, highlighted the great quality of the ideas, and subsequent projects.  It was a long day but it passed incredibly quickly – I was really impressed not only by the ideas that came out but by the collaborative efforts between the students on their projects.

From my experience of the first Hacks Meet Hackers Day (held, again with support from Open Labs, in Liverpool last summer) there was quite a competitive atmosphere not just between the teams but even within teams as members – usually the journalists – pitched their ideas as the ones to run with. Yesterday was markedly less so, with each group working first to determine whether the data supported their ideas, and adapting those projects depending on what the information produced, rather than having a complete end in sight before they started. Maybe that’s why the projects that emerged were so good.

The Liverpool digital community is full of extraordinary people doing important, innovative work (and who don’t always get the credit they deserve). I first bumped into Julian and Aidan as they prepared to give a talk at a Liver and Mash libraries event earlier this year – I’d never heard of ScraperWiki and I was bowled over by the possibilities they talked about (once I got my brain around how it worked). Since then the team has done so much to promote the cause of open data and data journalism, the opportunities they can create, and the worth and value they can have for audiences; ScraperWiki hackdays are attended by journalists from all media across the UK, eager to learn more about data scraping and collaborative projects with hackers.

With the Hacks Meet Hackers Students day, these ideas are being brought into the classroom, and the outcome can only benefit the colleges, students and journalism in the future. It was a great day, and the prospects for the future are exciting.

Watch this space for more ScraperWiki events in 2011!

Hacks & Hackers RBI: The video Fri, 10 Dec 2010 10:04:45 +0000 Media reporter Rachel McAthy has produced this excellent video from last month’s Hacks & Hackers Hack Day at RBI. View it below. More on the event at this link.

Hacks & Hackers RBI: Snow mashes, truckstops and moving home Fri, 10 Dec 2010 09:51:56 +0000

Sarah Booker (@Sarah_Booker on Twitter), digital content and social media editor for the Worthing Herald series, has kindly provided us with this guest blog from the recent ScraperWiki B2B Hacks and Hackers Hack Day at RBI. Pictures courtesy of RBI’s Adam Tinworth.

Dealing with data is not new to me. Throughout my career I have dealt with plenty of stats, tables and survey results.

I have always asked myself, what’s the real story? Is this statistically significant? What are the numbers rather than the percentages? Paying attention in maths O level classes paid off because I know the difference between mean and mode, but there had to be more.

My goal was greater understanding so I decided to go along to the Scraperwiki day at Reed Business Information. I wanted to find out ways to get at information, learn how to scrape and create beautiful things from the data discovered.

It didn’t take long to realise I wanted to run before I could walk. Ideas are great, but when you’re starting out it’s difficult to deal with something when it turns out the information is full of holes.

My data sets were unstructured, my comma-separated values (CSV) had gaps, and it was almost impossible to parse them within the timeframe. My projects were abandoned after a couple of hours’ work, but as well as learning new terms I was able to see how ScraperWiki worked, even though I can’t work it myself, yet.

What helped me understand the structure, if not the language, was spending time with Scraperwiki co-founder Julian Todd. Using existing scraped data, he showed me how to make minor adjustments and transform maps.

Being shown the code structure by someone who understands it helped to build up my confidence to learn more in the future.

Our group eventually came up with an interesting idea to mash up the #uksnow Twitter feed with pre-scraped restaurant data, calling it a snow hole. It has the potential to be something but didn’t end up being an award-winning product by the day’s end.

Other groups produced extremely polished work. Where the Truck Stops was particularly impressive for combining information about crimes at truckstops with locations to find the most secure.

They won Best Scrape for achieving things my group had dreamed of. The top project, Is It Worth It?, had astonishingly brilliant interactive graphics that polished an interesting idea.

Demand for workers and the cost of living in an area were matched with job aspirations to establish if it was worth moving. There has to be a future in projects like this.

It was a great experience and I went away with a greater understanding of structuring data gathering before it can be processed into something visual and a yearning to learn more.

Read more here:

Hacks & Hackers Belfast: ‘You don’t realize how similar coding and reporting are until you watch a hack and a technologist work together to create something’ Tue, 07 Dec 2010 10:06:41 +0000 In November, Scraperwiki went to Belfast and participant Lyra McKee, CEO, NewsRupt (creators of the news app Qluso) has kindly supplied us with this account!

The concept behind Hacks and Hackers, a global phenomenon, is simple: bring a bunch of hacks (journalists) and hackers (coders) together to build something really cool that other journalists and industry people can use. We were in nerd heaven.

The day kicked off with a talk from the lovely Francis Irving (@frabcus), ScraperWiki’s CEO. Francis talked about ScraperWiki’s main use – scraping data, stats and facts from large datasets – and the company’s background, from being built by coder Julian Todd to getting funded by 4IP.

After that, the gathered geeks split off into groups, all with the same goal: scrape data and find an explosive, exclusive story. First, second and third prizes would be awarded at the end of the day.

You don’t realize how similar coding and reporting are until you watch a hack and a technologist work together to create something. Both vocations have the same core purpose: creating something useful that others can use (or, in the hack’s case, unearthing information that is useful to the public).

The headlines that emerged out of the day were amazing. ‘Mr No Vote’ won first prize. When citizen hacks Ivor Whitten and Matt Johnston and coder Robert Moore of e-learning company Learning Pool used ScraperWiki to scrape electoral data from local government websites, they found that over 60% of voters in every constituency in Northern Ireland (save one) abstained from voting in the last election, raising questions about just how democratically MPs and MLAs have been elected.

What was really significant about the story was that the guys were able to uncover it within a matter of hours. One member of Team Qluso, an ex-investigative journalist, was astounded, calling ScraperWiki a “gamechanger” for the industry. It was an almost historic event, seeing technology transform a small but significant part of the industry: the process of finding and analyzing data. (A process that, according to said gobsmacked Team Qluso member, used to take days, weeks, even months.)

If you get a chance to chat with the ScraperWiki team, take it with both hands: these guys are building some cracking tools for hacks’n’hackers alike.

Finally, from the ScraperWiki team, a big thank you to the McKeown family for all their hospitality in Belfast!

Lichfield Hacks and Hackers: PFIs, plotting future care needs, what’s on in Lichfield and mapping flood warnings Mon, 15 Nov 2010 13:02:29 +0000 The winners with judges Lizzie and Rita. Pic: Nick Brickett

By Philip John, Journal Local. This has been cross-posted on the Journal Local blog.

It may be a tiny city but Lichfield has shown that it has some great talent at the Hacks and Hackers Hack Day.

Sponsored by Lichfield District Council and Lichfield-based Journal Local, the day was held at the George Hotel and attended by a good selection of local developers and journalists – some coming from much further afield.

Once the introductions were done and we’d all contributed a few ideas the work got started and five teams quickly formed around those initial thoughts.

The first two teams decided to look into Private Finance Initiatives (PFIs) and Information Asset Registers (IARs). The first of these scraped information from 470 councils to show which of them published information about PFIs. The results showed that only 10% of councils actually put out any details of their PFIs, highlighting a lack of openness in that area.
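The tally itself is straightforward once the scrape is done. A minimal sketch of the counting step follows, assuming one record per council with a flag for whether any PFI information was found; the field names are invented for illustration.

```python
# Hypothetical shape of the scraped results: one record per council.
results = [
    {"council": "Lichfield District Council", "publishes_pfi": True},
    {"council": "Example Borough Council", "publishes_pfi": False},
    {"council": "Another District Council", "publishes_pfi": False},
    # ... the real scrape covered 470 councils
]

publishing = sum(1 for r in results if r["publishes_pfi"])
pct = 100 * publishing / len(results)
print(f"{publishing} of {len(results)} councils ({pct:.0f}%) publish PFI details")
```

On the real 470-council scrape the same count produced the 10% figure quoted above.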

Also focused on PFIs was the ‘PFI wiki’ project which scraped the Partnerships UK database of PFIs and re-purposed it to allow deeper interrogation, such as by region and companies. It clearly paves the way for an OpenCharities style site for PFIs.

Future care needs was the focus of the third team who mapped care homes along with information on ownership, public vs private status and location. The next step, they said, is to add the number of beds and match that to the needs of the population based on demographic data, giving a clearer view of whether the facilities exist to cater for the future care needs in the area.

A Lichfield-related project was the focus of the fourth group who aimed to create a comprehensive guide to events going on in Lichfield District. Using about four or five scrapers, they produced a site that collated all the events listing sites serving Lichfield into one central site with a search facility. The group also spawned a new Hacks/Hackers group to continue their work.

Last but not least, the fifth group worked on flood warning information. By scraping the Environment Agency website they were able to display on a map the river-level gauges and the flood-warning level, so that at a glance it’s possible to see the water level in relation to the flood-warning limit.

So, after a long day, Lizzie Thatcher and Rita Wilson from Lichfield District Council joined us to judge the projects. They came up with a clever matrix of key points to rate the projects by, and decided to choose the ‘what’s on’ and ‘flood warning’ projects as joint winners, each receiving £75 in Amazon vouchers.

The coveted ScraperWiki mug also went to the ‘what’s on’ project for their proper use of ScraperWiki to create good quality scrapers.

Pictures from the event by Nick Brickett.


Announcing The Big Clean, Spring 2011 Wed, 10 Nov 2010 16:51:30 +0000 We’re very excited to announce that we’re helping to organise an international series of events to convert not-very-useful, unstructured, non-machine-readable sources of public information into nice clean structured data.

This will make it much easier for people to reuse the data, whether this is mixing it with other data sources (e.g. different sources of information about the area you live in) or creating new useful services based on the data (like TheyWorkForYou or Where Does My Money Go?). The series of events will be called The Big Clean, and will take place next spring, probably in March.

The idea was originally floated by Antti Poikola on the OKF’s international open-government list back in September, and since then we’ve been working closely with Antti and Jonathan Gray at OKFN to start planning the events.

Antti and Francis Irving (mySociety) will be running a session on this at the Open Government Data Camp on the 18-19th November in London. If you’d like to attend this session, please add your name to the following list:

If you can’t attend but you’re interested in helping to organise an event near you, please add your name/location to the following wiki page:

All planning discussions will take place on the open-government list!

‘Where’s safe?’ Creating a tool to guide people in an emergency Thu, 21 Oct 2010 10:32:25 +0000 A guest post by Paul Bradshaw (of Help Me Investigate and publisher of the Online Journalism Blog)

It’s taken 15 hours – including sleep* – for a group of people in Birmingham to build a tool to help guide people in an emergency event. The tool runs on a mobile phone, allowing anyone involved in emergency planning to manage information and populations while they’re on the move.

Here’s how it works in more detail. Once an emergency is declared – for example, a chemical spill, earthquake, etc. – the emergency services activate a series of safe havens – “rest centres” in the UK jargon; “musters” in Canada – where people know they can be safe. Typically these might be locations used as polling stations, leisure centres, and other suitable public buildings.

The tool – provisionally titled ‘Where’s Safe?’ – allows anyone to send a text message containing their location and receive information back telling them where the emergency is – and where to go to stay safe.

That simple text message kicks off a chain of events that allows the emergency services – and anyone else – to keep track of where people are going. Here’s how it runs:

  1. The text message is received by a mobile phone which is running an application in the background (this was built overnight by Lloyd Henning for Android but could be built for other platforms)
  2. The application sends that text message to Scraperwiki, which looks at the information sent and tries to locate it geographically.
  3. Based on that, Scraperwiki returns messages to be sent: firstly, a message about the emergency, and secondly, a message with details of the nearest safe haven. The safe haven is chosen in a way that avoids the person having to travel through the emergency zone.
  4. These messages are received by the mobile and forwarded back to the sender of the original text message.
  5. Meanwhile, Scraperwiki adds the original message to the appropriate point on the map and updates the ‘count’ of people heading to that particular safe haven.
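The routing in step 3 can be sketched in outline. This is a hypothetical reconstruction, not the team’s actual code: the data shapes and function names are invented, and “avoiding the emergency zone” is simplified here to excluding any haven that sits inside it.

```python
import math

def distance(a, b):
    # Planar distance between two (x, y) points; a real build would use
    # a haversine formula on latitude/longitude instead.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_safe_haven(person, havens, emergency_centre, emergency_radius):
    # Keep only havens that are switched on and outside the emergency zone,
    # then pick the one closest to the texter's location.
    candidates = [
        h for h in havens
        if h["active"] and distance(h["pos"], emergency_centre) > emergency_radius
    ]
    return min(candidates, key=lambda h: distance(h["pos"], person), default=None)

havens = [
    {"name": "Leisure centre", "pos": (0.0, 1.0), "active": True},
    {"name": "Church hall", "pos": (0.5, 0.2), "active": True},  # inside the zone
]
choice = nearest_safe_haven((0.0, 0.0), havens, (0.5, 0.3), 0.3)
print(choice["name"])  # Leisure centre: the church hall sits inside the zone
```

The same filter is what lets the emergency services’ ‘off’/‘on’ messages in the next paragraph take effect: toggling a haven’s `active` flag removes it from, or returns it to, the candidate list.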

Meanwhile, any member of the emergency services whose mobile phone is registered with the tool can send a message to the same mobile phone number to update the details of safe havens and other elements of the emergency – for example, turning a safe haven ‘off’ or ‘on’, changing the message being sent out about the emergency (which is then sent out to everyone who has sent a text message), or areas that are being affected, etc.

The impressive thing about the whole service is that it runs entirely on SMS (although it could be scaled up through incorporating a tool like FrontlineSMS) and so could be deployed for something as small as a school outing.

Some final footnotes: Julian Todd tweaked Scraperwiki to allow some additional functionality. Sarah-Jayne Farmer – who was part of the crowdsourced emergency work around the Haiti earthquake – produced the initial architecture and user requirements for how the service might work and worked through use cases – sketches shown below. These were refined further by Peter Sutton – shown at the bottom – before the build began.

initial sketches for emergency tool

refinement sketch by Peter Sutton

*They didn’t get any sleep

Event: Hacks and Hackers Hack Day – Birmingham Thu, 01 Jul 2010 17:11:00 +0000 We’re happy to announce we’re running a Hacks and Hackers Hack Day in Birmingham, sponsored by Birmingham Science Park Aston, Digital Birmingham, the National Union of Journalists and NHS Local. It will take place on Friday July 23, 2010 from 9.30am to 8pm at Birmingham Science Park Aston, Faraday Wharf, Holt Street.

The *free* hack day is for both developers and journalists. For further sponsorship opportunities please contact aine [at]

Armed with their laptops and WIFI, journalists and developers will be put into teams of around four to develop their ideas, with the aim of finishing final projects that can be published and shared publicly. Each team will then present their project to the whole group.

As previously announced, we will be running an event in Liverpool on July 16; more on that here.

Event: ScraperWiki/LJMU Open Labs Liverpool Hack Day – Hacks Meet Hackers! Tue, 22 Jun 2010 22:10:00 +0000

We’re happy to announce our next Hacks Meet Hackers event, to take place in Liverpool on Friday July 16, 2010 from 9.30am to 8pm at the Arts and Design Academy.

The *free* hack day, sponsored by LJMU Open Labs and Liverpool Daily Post & Liverpool Echo, is for both developers and journalists. For additional sponsorship opportunities please contact aine [at]

Can’t get to Liverpool? Don’t worry – we’ve got more UK hack days in the pipeline: get in touch to find out more about attending or sponsoring one.

So what’s this hack day all about?  It’s a practical event at which web developers and designers will pair up with journalists and bloggers to produce a number of projects and stories based on public data.

Who’s it for? We hope to attract hacks and hackers from all different types of backgrounds: people from big media organisations, as well as individual online publishers and freelancers.

What will you get out of it? The aim is to show journalists how to use programming and design techniques to create online news stories and features; and vice versa, to show programmers how to find, develop, and polish stories and features.

How much? NOTHING! It’s free, thanks to our sponsors.

What should participants bring? We would encourage people to come along with ideas for local ‘datasets’ that are of interest. In addition, we will present a list of suggested datasets at the introduction on the morning of the event, but flexibility is key.

But what exactly will happen on the day itself? Armed with their laptops and WIFI, journalists and developers will be put into teams of around four to develop ideas, with the aim of finishing final projects that can be published and shared publicly. Each team will then present their project to the whole group. Overall winners will receive a prize at the end of the day. Food and drink will be provided during the day!

Any more questions? Please get in touch via aine[at]

Government data release: what’s still out there Mon, 07 Jun 2010 11:54:00 +0000 James Ball

Last week saw big steps forward in public data: on Monday, Prime Minister David Cameron wrote to all government departments, setting out a timetable for the release of a swathe of official datasets.

On Wednesday, the first two (senior civil service pay and MRSA infection rates) appeared – but the real meat came on Friday with the release of millions of rows of data from the official Treasury database, COINS – which has already been packaged into a usable format by the Open Knowledge Foundation.

A big step forward – but a new dataset over at ScraperWiki reveals there’s still a very long way to go. Developer Anna Powell-Smith has built a scraper for the Information Asset Register (IAR).

The IAR is a register of unpublished datasets held by government departments – and it has more than 2,100 entries. The database shows which department holds the information, and should include a short description of what’s in there.

The data shows how far there is still to go for open information: for one, David Cameron’s release last week covers fewer than ten datasets – important ones, beyond a doubt, but barely a scratch on the surface.

But this is just a small part of the problem, as anyone looking at the full data in Powell-Smith’s scrape can see: even in this register of government data, quality is low.

More than half of the records in the IAR are missing details – often details as basic as a description of the record’s contents. Some departments have submitted hundreds of datasets, while others appear to have merely carried out a cursory search and listed a handful. Some didn’t even bother to do that.
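Checking the completeness of a register like this is mechanical once it has been scraped. A small sketch follows; the record shapes and field names are invented for illustration and are not the scraper’s actual schema.

```python
# Hypothetical records in the style of the IAR scrape: each row names the
# holding department and should carry a description of the dataset.
records = [
    {"department": "HM Treasury", "title": "COINS", "description": "Spending database"},
    {"department": "Department A", "title": "Holdings list", "description": ""},
    {"department": "Department B", "title": "Unnamed dataset", "description": None},
]

# A record counts as incomplete if its description is empty or absent.
missing = [r for r in records if not r.get("description")]
share = len(missing) / len(records)
print(f"{len(missing)} of {len(records)} records lack a description ({share:.0%})")
```

Run over the full 2,100-entry scrape, a check like this is what shows more than half the records to be incomplete.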

A first step for the government’s new Transparency Board should doubtless be to update the register and bring it up to scratch.

Cameron warned that the data would initially be patchy. Given the poor state of even this simple document, it seems he wasn’t kidding. The culture of government might be changing, but developers and journalists alike will need to keep up the pressure if data good enough to be of use to anyone is going to come out.

Get the data here.

Done something with this data? Let us know – @scraperwiki on Twitter or