Aine McGuire – ScraperWiki: Extract tables from PDFs and scrape the web

Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone Wed, 30 Sep 2015 14:11:16 +0000

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach. They’re amazing businesses that focus on social mobility.

We interviewed him to find out how he’s been using our PDF-to-Excel tool to convert delegate lists that come as PDFs into Excel, and from there into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system, it’s a simple and effective solution for keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface!


Civil Service People Survey – Faster, Better, Cheaper Tue, 08 Sep 2015 13:46:21 +0000

Civil Service Reporting Platform

The Civil Service is one of the UK’s largest employers.  Every year it asks every civil servant what they think of their employer: UK plc.

For Sir Jeremy Heywood the survey matters. In his blog post “Why is the People Survey Important?” he says

“The survey is one of the few ways we can objectively compare, on the basis of concrete data, how things are going across departments and agencies.  …. there are common challenges such as leadership, improving skills, pay and reward, work-life balance, performance management, bullying and so on where we can all share learning.”

The data is collected by a professional survey company called ORC International.  The results of the survey have always been available to survey managers and senior civil servants as PDF reports. There is access to advanced functionality within ORC’s system to allow survey managers more granular analysis.

So here’s the issue.  The Cabinet Office wants to give access to all civil servants in a fast and reliable way.  It wants to give more choice and speed in how the data is sliced and diced – in real time.  Like all government departments, it is also under pressure to cut costs.

ScraperWiki built a new Civil Service People Survey Reporting Platform and it’s been challenging.  It’s a moderately large data set.  There are close to half a million civil servants – over 250,000 answered the last survey, which contains 100 questions.  There are 9,000 units across government.  This means 30,000,000 rows of data per annum, and we’ve ingested five years of data.

The real challenges were around:

  • Data Privacy
  • Real Time Querying
  • Design

Data privacy

The civil servants are answering questions on their attitudes to their work, their managers, and the organisations they work in, along with questions on who they are: gender, ethnicity, sexual orientation – demographic information. Their responses are strictly confidential, and one of the core challenges of the work is maintaining this confidentiality in a tool available over the internet, with a wide range of data filtering and slicing functionality.

A naïve implementation would reveal an individual’s responses either directly (i.e. if they are the only person in a particular demographic group in a particular unit), or indirectly, by taking the data from two different views and taking a difference to reveal the individual. ScraperWiki researched and implemented a complex set of suppression algorithms to allow publishing of the data without breaking confidentiality.
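
As an illustration of the idea (not the production algorithm), a minimal primary-plus-complementary suppression pass might look like this, assuming a hypothetical reporting threshold of 10 respondents:

```python
def suppress(cells, threshold=10):
    """Publish a breakdown of respondent counts without revealing small
    groups. `threshold` (10 here) is an illustrative cut-off, not the
    platform's real one. `cells` maps demographic group -> count."""
    total = sum(cells.values())
    small = [g for g, n in cells.items() if n < threshold]
    if len(small) == 1:
        # Complementary suppression: also hide the smallest remaining
        # cell, otherwise total - visible cells reveals the hidden value.
        visible = {g: n for g, n in cells.items() if g not in small}
        if visible:
            small.append(min(visible, key=visible.get))
    out = {g: (None if g in small else n) for g, n in cells.items()}
    out["total"] = total
    return out
```

Suppressing only the small cell would not be enough on its own: publishing the total alongside the remaining cells would let a reader recover it by subtraction, which is why a second cell is hidden as well.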

Real-time queries

Each year the survey generates 30,000,000 data points, one for each answer given by each person. This is multiplied by five years of historical data. To enable a wide range of queries our system processes this data for every user request, rather than rely on pre-computed tables which would limit the range of available queries.

Aside from the moderate size, the People Survey data is rich because of the complexity of the Civil Service organisational structure. There are over 9,000 units in the hierarchy which is in some places up to 9 links deep. The hierarchy is used to determine how the data are aggregated for display.
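
A sketch of how responses might be aggregated up such a hierarchy (unit names and counts are illustrative; the real platform computes this per request across thousands of units):

```python
from collections import defaultdict

def rollup(parents, responses):
    """Aggregate unit-level response counts up an organisational
    hierarchy. `parents` maps each unit to its parent (None at the
    root); `responses` maps units to counts recorded directly there.
    Unit names here are illustrative, not real Civil Service units."""
    totals = defaultdict(int)
    for unit, n in responses.items():
        node = unit
        while node is not None:   # walk up to the root
            totals[node] += n
            node = parents[node]
    return dict(totals)
```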

Standard design

An early design decision was to use the design guidelines and libraries developed by the Government Digital Service for the GOV.UK website. This means the Reporting Platform has the look and feel of GOV.UK and, we hope, follows their excellent usability guidelines.

Going forward

The People Survey digital reporting platform alpha was put into the hands of survey managers at the end of last year. We hope to launch the tool to the whole Civil Service after the 2015 survey, which will be held in October. If you aren’t a survey manager, you can get a flavour of the People Survey Digital Reporting Platform in the screenshots in this post.

Do you have statistical data you’d like to publish more widely, and query in lightning fast time? If so, get in touch.

…and suddenly I could convert my bank statement from PDF to Excel… Wed, 05 Aug 2015 13:03:59 +0000

Do you ever:

  • Need an old bank statement, only to find that the bank has archived it and wants to charge you to get it back?
  • Spot check to make sure there are no fraudulent transactions on your account?
  • Like to summarise all your big ticket items for a period?
  • Need to summarise business expenses?

It’s been difficult for me to do any of these as bank transaction systems are Luddite.

Fifteen years after signing up for my smile internet bank account, I received a ground-breaking message.

“Your paperless statement is now available to view when you login to online banking”.

I logged in excited, expecting an incredible new interface.

Eureka – PDF Statement

No… it meant I can now download a PDF!

Don’t get me wrong – PDF is the “Portable Document Format” – so at least I can keep my own records which is a step forward. But it’s just as clumsy to analyse a PDF as it is to trawl through the bank’s online system (see The Tyranny of the PDF to understand why).

We know a lot about the problems with PDFs at ScraperWiki, and we built a tool for exactly this job. I’m able to convert my PDF to Excel and get a list of transactions which I can analyse and store in some order.  Yes – I have to do some post-processing, but I can automate this with a spreadsheet macro.

You can see on the example I have included that the alignment of the transactions is spot on and I could even use our DataBaker product to take out the transaction descriptions and the values and put them into another system.
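
The post-processing step can also be scripted rather than done with a macro. A hedged sketch, assuming the converted statement exports as rows of date, description and amount (a three-column layout we've invented for illustration):

```python
import csv
import io

def parse_statement(text):
    """Turn converted statement rows into typed tuples. The
    (date, description, amount) column layout is an assumption
    about the export, not a fixed format."""
    rows = []
    for date, desc, amount in csv.reader(io.StringIO(text)):
        rows.append((date, desc.strip(), float(amount)))
    return rows

def big_ticket(rows, floor=100.0):
    """Summarise big-ticket items: transactions at or above `floor`."""
    return [r for r in rows if abs(r[2]) >= floor]
```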

Although we’d love everything to be structured data all the way through, the number of PDFs on the web is still increasing exponentially.  Hooray!

Statement #173

Got a PDF you want to get data from?
Try our easy web interface!
Summary – Big Data Value Association June Summit (Madrid) Tue, 21 Jul 2015 10:13:39 +0000

In late June, 375 Europeans + 1 attended the Big Data Value Association (BDVA) Summit in Madrid. The BDVA is the private part of the Big Data Public Private Partnership (PPP); the public part is the European Commission.  The delivery mechanism is Horizon 2020, with €500m of funding. The PPP commenced in 2015 and runs to 2020.

Whilst the conference title included the word ‘BIG’, the content did not discriminate.  The programme was designed to focus on concrete outcomes. A key instrument of the PPP is the concept of a ‘lighthouse’ project, and the summit arranged tracks focused on identifying such projects: large scale, and within candidate areas like manufacturing, personalised medicine and energy.

What proved most valuable was meeting the European corporate representatives who ran the vertical market streams.  Telecom Italia, Orange and Nokia shared a platform to discuss their sector. Philips drove a discussion around health and wellbeing.  Jesus Ruiz, Director of Open Innovation in Santander Bank Corporate Technology, led the finance industry track. He tried to get people to think about ‘innovation’ in the layer above traditional banking services. I suspect he meant in the space where companies like Transferwise (cheaper foreign currency conversion) play. These services improve the speed and reduce the cost of transactions.  However, the innovating company never ‘owns’ an individual or corporate bank account, and as a consequence is not subject to tight financial regulation. It’s probably obvious to most, but I was unaware of the distinction.

I had an opportunity to talk to many people from the influential Fraunhofer Institute!  It’s an ‘applied research’ organisation and a significant contributor to Deutschland’s manufacturing success.  Last year it had a revenue stream of €2b.  It was seriously engaged at the event and is actively seeking leading-edge ‘lighthouse’ projects.  We’re in the transport #TIMON consortium with it – Happy Days 🙂

BDVA – You can join!

Networking is the big bonus at events like these, and with representatives from 28 countries and delegates from Palestine and Israel, there were many people to meet.  The UK was poorly represented and ScraperWiki was the only UK technology company showing its wares.  It was a shame given the UK’s torch-carrying when it comes to data.  Maurizio Pilu (@Maurizio_Pilu), Executive Director, Collaborative R&D at Digital Catapult, gave a keynote.  The ODI is mentioned in the PPP Factsheet, which is good.

There was a strong sense that the PPP initiative is looking to the long term, and that some of the harder problems of extracting ‘value’ have not yet been addressed.  There was also an acknowledgement of the importance of standards, and a track was run by Phil Archer, Data Activity Lead at the W3C.

Stuart Campbell, Director, CEO at Information Catalyst and a professional pan-European team managed the proceedings and it all worked beautifully.  We’re in FP7 and Horizon 2020 consortia so we decided to sponsor and actively support #BDVASummit.  I’m glad we did!

The next big event is the European Data Forum in Luxembourg (16-17 Nov 2015).  We’re sponsoring it and we’ll talk about our data science work, and DataBaker.  The event will be opened by Jean-Claude Juncker, President of the European Commission, and Günther Oettinger, European Commissioner for Digital Economy and Society.

It seems a shame that the mainstream media in the UK focuses so heavily on subjects like #Grexit and #Brexit.  Maybe they could devote some of their column inches to the companies and academics that are making a very significant commitment to finding products and services that make the EU more competitive and also a better place to work and live.

NewsReader – Hack 100,000 World Cup Articles Wed, 16 Apr 2014 13:52:23 +0000

June 10, The Hub Westminster (@NewsReader)

Ian Hopkinson has been telling you about our role in the NewsReader project.  We’re making a thing that crunches large volumes of news articles.  We’re combining natural language processing and semantic web technology.  It’s an FP7 project so we’re working with a bunch of partners across Europe.

We’re 18 months into the project and we have something to show off.  Please think about joining us for a fun ‘hack’ event on June 10th in London at ‘The Hub’, Westminster.  There are 100,000 World Cup news articles we need to crunch, and we hope to dig out some new insights from a cacophony of digital noise.  There will be light refreshments throughout the day.  Like all good hack events there will be an end-of-day reception, where we would like you to present your findings and give us some feedback on the experience (the requisite beer and pizza will be provided).

All of our partners will be there: LexisNexis, SynerScope, VU University (Amsterdam), University of the Basque Country (San Sebastian) and Fondazione Bruno Kessler (Trento).  They’re a great team, very knowledgeable in this field, and they love what they are doing.

Ian recently made a short video about the project which is a useful introduction.

If you are a journalist, an editor, a linked data enthusiast or data professional we hope you will care about this kind of innovation.

Please sign up via the ‘NewsReader Eventbrite invitation’ and tell your friends.


Open data – the zeitgeist Fri, 15 Nov 2013 13:04:37 +0000

Open data [1] is becoming a brand – 61 countries are using the brand and many others are expressing interest. The week before last, thousands of delegates from around the world descended on London for a host of open data events that ran throughout the week. There is something of the zeitgeist about open data at the moment, and this is important as it is –

  • becoming a magnet for digital talent
  • a driver for a host of new start-ups
  • pressurising existing businesses to up their game
  • and engendering a positive feeling about what can be done with technology to make a difference to how citizens are served by government

Monday – Github in Government

@FutureGov is a business that provides ‘digital public service design’. Dominic Campbell (@dominiccampbell) hosted an excellent ‘Github in Government’ event at the trendy Shoreditch Works, which included a number of short presentations by Paul Downey (@psd), Chris Thorpe (@jaggeree), Sarah Kendrew (@sarahkendrew) and James Smith (@floppy), amongst others.  It was a community event that managed to attract a much wider audience from the local digital crowd.

ODI Summit 2013 – thanks to Brendan Lea, used under CC SA

Tuesday – ODI Annual Summit
On Tuesday @UKODI held its annual summit at the Museum of London to celebrate its 1st birthday. Sir Tim Berners-Lee, Sir Nigel Shadbolt and Gavin Starks were the main hosts and kept up the pace throughout the day with numerous talks and panels. The main theme was the use of open data to build and support business, and a range of exemplars were used to reinforce the theme. Placr’s Transport API, Mastodon C and OpenCorporates were just a few of those mentioned. ScraperWiki is a proud ODI supporter.

Business is important in this context as open data needs to be used and reused to be sustainable in the longer term. It also needs to be embraced more broadly across corporate business. I recently heard a senior executive from a large US social media company propose that government organisations close what is ‘essential to close’ and publish everything else. His suggestion is daring and logical but likely politically unacceptable.

Kenneth Cukier (@kncukier), Data Editor at The Economist,[2] cited the UK as the world leader in open data and emphasised the big opportunity it presents for the country and its influence internationally. Whilst we dined and networked informally with other delegates, a colourful, curved digital display hung over the main dining area, showing live data feeds and ODI acknowledgements.

Wednesday – OGP Civil Society Day
Organised by the OKFN, the Civil Society Day gave over 400 civil society actors involved in the OGP an informal opportunity to connect, interact, learn and strategise.  It focused on the conversations needed within civil society to prepare for Thursday and Friday’s summit, and to strengthen national OGP processes for the future.

Thursday and Friday – OGP Summit 2013
The Open Government Partnership Summit was the UK’s opportunity to showcase the work it has been doing around open data and its impact on transparency, business and civil society. A good friend suggested that the ‘metapoint’ was the e-Diplomacy initiative run largely by the foreign offices of participating countries. Francis Maude was the main UK government host and he had a very professional team around him to ensure that the event was a success. We were in the ‘Festival’ area, a showcase of companies providing technology that supports open government. The event also marked the UK’s handover of the biennial responsibility to Indonesia. We used the opportunity to show off our prototype of Table Xtract, demonstrating it by converting energy price tariffs from the energy companies EDF, British Gas and Scottish Power.

…and meanwhile, back up on the 4th floor of the QEII Centre
The US State Department ran a ‘Tech Camp’. The purpose of a Tech Camp is to provide NGOs with training in low-cost or no-cost new and online technologies. ‘Speed geeking’ was used to introduce Ian Hopkinson from ScraperWiki to OGP delegates from participating countries every five minutes; the objective was for him to give examples of successful open data projects, and he showed the UN OCHA project. The US Ambassador to the UK, Mr Barzun, was also treated to the five-minute experience. We gave him a ScraperWiki sticker and found out he is a Liverpool fan.

I am frequently reminded that my start-up colleagues have been pushing the open data envelope for about 10 years – who says things happen quickly in our sector!

[1] Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
[2] Co-author, with Viktor Mayer-Schönberger, of Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013).


Publish from ScraperWiki to CKAN Fri, 05 Jul 2013 09:11:38 +0000 ScraperWiki is looking for open data activists to try out our new “Open your data” tool.

Since its first launch ScraperWiki has worked closely with the Open Data community. Today we’re building on this commitment by pre-announcing the release of the first in a series of tools that will enable open data activists to publish data directly to open data catalogues.

To make this even easier, ScraperWiki will also be providing free datahub accounts for open data projects.

This first tool will allow users of CKAN catalogues (there are 50, from Africa to Washington) to publish a dataset that has been ingested and cleaned on the new ScraperWiki platform. It’ll be released on the 11th July.
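
For the curious, CKAN catalogues expose an action API for exactly this kind of publishing. A minimal sketch (the dataset fields and URLs here are illustrative, and `publish` needs a live catalogue and a real API key):

```python
import json
import urllib.request

def build_package(name, title, resource_url):
    """Minimal payload for CKAN's package_create action; these are the
    bare-minimum fields, and real catalogues may require more."""
    return {
        "name": name,  # dataset slug: lowercase letters, digits, hyphens
        "title": title,
        "resources": [{"url": resource_url, "format": "CSV"}],
    }

def publish(catalogue_url, api_key, package):
    """POST the dataset to a catalogue's action API. Needs a live
    catalogue and a real key, so it is not exercised here."""
    req = urllib.request.Request(
        catalogue_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(package).encode("utf-8"),
        headers={"Authorization": api_key,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]
```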

screenshot showing new tool (alpha)

If you run an open data project which scrapes, curates and republishes open data, we’d love your help testing it. To register, please email us with “open data” in the subject, telling us about your project.

Why are we doing this? Since its launch ScraperWiki has provided a place where an open data activist could get, clean, analyse and publish data. With the retirement of “ScraperWiki Classic” we decided to focus on the getting, cleaning and analysing, and leave the publishing to the specialists – places like CKAN.

This new “Open your data” tool is just the start. Over the next few months we also hope that open data activists will help us work on the release of tools that:

  • Generate RDF (linked data)
  • Update data in real time
  • Publish to other data catalogues
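
On the RDF point, even without a dedicated library, tabular rows can be rendered as simple N-Triples. A toy sketch, using a placeholder base URI rather than a real vocabulary:

```python
def to_ntriples(rows, base="http://example.org/dataset/"):
    """Render scraped rows (a list of dicts) as RDF N-Triples. Each row
    becomes a subject and each column a predicate; the base URI is a
    placeholder, not a real vocabulary."""
    lines = []
    for i, row in enumerate(rows):
        subj = "<{}row/{}>".format(base, i)
        for col, value in sorted(row.items()):
            lines.append('{} <{}{}> "{}" .'.format(subj, base, col, value))
    return "\n".join(lines)
```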

Here’s to liberating the world’s messy open data!

European Data Forum Dublin 2013…what was it all about? Fri, 14 Jun 2013 08:00:40 +0000 It was not an accident that the 2013 European Data Forum was held in Dublin, given that Ireland’s presidency of the Council of the European Union runs until June 30th. The venue was the Croke Park Conference Centre, which overlooks Ireland’s premier sporting stadium, an historic landmark. It was organised by the Digital Enterprise Research Institute (DERI), an internationally recognised body focused on semantic web research. Its purpose is to directly contribute to the Irish government’s plan of transforming Ireland into a competitive knowledge economy.

Legislation Data

The annual conference brought together data practitioners from industry, research, the public sector and the community to discuss the opportunities and challenges of the emerging Big Data Economy in Europe. The delegate list was heavily influenced by FP7 participants – ScraperWiki is an FP7 participant in the NewsReader Project!

Senior executives from SAP, Statoil, Ericsson, RTE and many others talked about how big data problems manifest and also about some of the opportunities that big data presents. Statoil gave some innovative examples including using vast amounts of ‘biometric data’ to better identify the existence of oil deposits.

The person who cares about this programme at an EU level is Marta Nagy-Rothengass, Head of Unit “Data Value Chain” in DG CONNECT at the European Commission.  I asked Marta about the purpose of the European Data Forum. She explained that “the primary objective is to get stakeholders who would never normally meet together to think and to act in making better reuse of public data for commercial purpose” and that it serves “as a place to network and exchange ideas”.  When asked about the expected outcomes, Marta was enthusiastic: “we hope that within three years a data value chain and a platform will be established where public and private organisations can create real financial value from data; we hope to have an infrastructure for services and applications for public sector, private sector and citizens that are multilingual and open; and there will be actions at EU level to develop a data skills network, research and innovation activities and kinetic innovation (e.g. geodata that is cross sector, cross border, has monetary value and that offers better decision making and intelligence)”.

The ‘big’ value came with the networking opportunity: 20 data solutions were presented at an exposition that ran alongside the main event. Deirdre Lee, Research Associate in the eGovernment Domain at DERI and the lead organiser, told me that the institute is involved in many EU projects covering open, linked and big data, where it can help improve data quality and availability. About the conference she said: “We did not want the conference to be academic, we want to get industry involved to ensure that we reflect real problems with data, and we also wanted to showcase some of the solutions that are available”.

The EU economy is still the largest and wealthiest bloc in the world, and at least 26 and 3/4 of its countries see this as an opportunity despite its recent economic woes!  According to Wikipedia, “The economy of the European Union generates a GDP of over €12.894 trillion (US$16.566 trillion in 2012) according to Eurostat, making it the largest economy in the world.”

Earlier this week there was a discussion on the radio about how Silicon Valley’s success can be partially linked to the US defence industry’s thirst for technological competitive advantage – I hope that I am not naive in hoping that the EU’s approach to research and innovation in our sector is less ‘defence’ driven.

NewsReader FP7 Project Team Photo in Liverpool, June 2013

Newspapers, advertising, revenue, innovation Fri, 03 May 2013 12:37:19 +0000 A couple of weeks ago, I joined the 110-year-old WAN-IFRA at their annual Digital Media Conference at the swish ETCVenues’ 200 Aldersgate, London pad. The organisation has become the voice of the worldwide community of newspaper publishers, and the DMC was a truly international affair, with 37 countries from all five continents represented. Senior executives see it as a place to have a ‘pow-wow’ with their peers on what’s happening in the industry, and to listen and respond to some of the wider issues affecting the sector. Having cut a swathe through the industry’s revenue model, Google had an understandably palpable presence! They had a good showing because many publishers are now value-added resellers to the giant tech company.

It quickly became clear that the big issues facing the industry are:

  • how to grow revenue from advertising
  • how to cut the cost of serving multiple platforms like tablets, mobile devices and PCs
  • how to innovate.

Unsurprisingly HTML5 was also a popular topic and a number of ready-made products featured in the presentations.

Day 2 was focused on innovation and I had an opportunity to talk about what ScraperWiki has been doing to help in the sector.  I tried to feature stories that data scientists from our community created specifically for the media.

They Write for You

I also wanted to talk about some of the women doing great work, so I rolled back to the story that Anna Powell-Smith (@darkgreener) helped craft at our very first Hacks/Hackers day in January 2010. The story was about the number of articles written by MPs for British newspapers – it is a simple and effective visualisation: ‘They Write for You’.

A load of bubbles!

I also talked about the data-driven stories that Nicola Hughes (@datamineruk), Francis Irving (@frabcus) and Julian Todd (@goatchurch) created for the award-winning Dispatches programme. These focused on the National Asset Register and English brownfield sites.

I finished on the rose visualisation that Julian and Zarino Zappia (@zarino) made to enliven Somerset and Devon Fire Incidents. It seemed like a good candidate to show how local government data can be used to make an interesting, evergreen story:

Rose visualisation from Devon

Dr Johnny Ryan (@johnnyryan), author of ‘A History of the Internet and the Digital Future’ and Chief Innovation Officer at The Irish Times followed up by introducing three new media startups.  This was interesting because the paper is a reasonably conservative publication that has taken the unusual step of acting as a technology accelerator in Dublin. So, hats off to its editor Kevin O’Sullivan!  It provides space, desks, access to management and a platform for the startups to introduce their offerings into the media market.

Here are the three Dr Ryan mentioned:

Oliver Mooney (@olivermooney)  told us how GetBulb allows you to make compelling infographics simply by copying and pasting your data into a template.  They also have a wacky introduction video which made me smile!

Paul Quigley (@paulyq) introduced NewsWhip, a technology that tracks all the news shared on Facebook and Twitter each day to find the fastest-spreading, most-shared, high-quality stuff.

Neil O’Connor from Blockmetrics showed his technology, which detects ads being blocked by website visitors, analyses how much ad revenue is being lost as a consequence, and shows how companies can do something about it.

The industry is very well aware of the challenges it faces, although there was a level of surprise among some delegates that mobile advertising would not be a panacea for falling revenues. This industry faces tough times ahead, but refreshingly it is proactively looking at innovation as both a defence mechanism and a route to growth and profitability.

200 Aldersgate, London

What has Europe ever done for us? Fri, 30 Nov 2012 09:21:57 +0000 Hmm… 2.8 million euros, a ‘history recorder’, and the opportunity to have a full-on working relationship with VU Amsterdam Uni, Lexis Nexis, plus some brilliant bods in Trento and San Sebastien (Happy Christmas!)

It’s official!  We have become a European FP7 partner with VU Amsterdam University’s Faculty of Arts (Prof Piek Vossen), Lexis Nexis, and research centres in Trento (Italy) and the Basque city of San Sebastien (Spain).

The ‘History Recorder’ project is ambitious.
It aims to create some wizardry software that “reads” daily streams of news and stores exactly what happened, where and when in the world, and who was involved.   The software will use the same strategy as humans by building up a story and merging it with previously stored information.
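
To make the merging idea concrete, here is a toy sketch in which event mentions seen again in new articles accumulate their sources instead of being duplicated (the schema and matching rule are ours for illustration, not the project’s actual design):

```python
def merge_events(store, mentions):
    """Merge newly extracted event mentions into a store. An event is a
    (who, what, where, when) tuple; mentions of an event already seen
    accumulate their sources rather than creating duplicate records."""
    for event, source in mentions:
        store.setdefault(event, set()).add(source)
    return store
```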

We have very good reasons for wanting to be involved in this project, not least our responsibility for ensuring that the entire project is open source and for creating a blueprint for all subsequent FP7 projects to be made public and transparent!  We will also help to identify and gather data from interesting places and munge them through the ScraperWiki digger!  And of course we have to engage with customers to get the project embedded commercially and to help with this we will host events with our partners in the participating cities.

Whilst all of the team at ScraperWiki will be involved in ensuring that we are successful, we are hiring someone who will take a lead role in managing the project and will have a job posting on our site very soon – watch this space!


Roll on 2013 as the project kicks off in January!
