open data – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone
https://blog.scraperwiki.com/2015/09/henry-morris-entrepreneur-for-social-mobility-on-getting-contacts-from-pdf-into-his-iphone/
Wed, 30 Sep 2015 14:11:16 +0000

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach, two amazing businesses that focus on social mobility.

We interviewed him for PDFTables.com

He’s been using it to convert delegate lists that come as PDFs into Excel, and from there into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system: a simple and effective solution for keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

 

Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich: it is designed to give you facts and figures, and to provide an interface for you to interactively select and view the data you are interested in, in this case the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to view either the top faults (failure modes) or the pass rate for the selected make and model. There is also a “filter by year” option, which selects vehicles by their first year on the road, narrowing your search down to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults; if you want to break each one down into more detailed descriptions, click the ‘detail’ button:


What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as its data source. A dictionary is a data structure in Python similar to a hash map (a HashMap in Java). It consists of key-value pairs and is efficient for fast lookups. But where do we get a dictionary from, and what should its structure be? Let’s back up a little, maybe to the very beginning. The following general procedure got us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix command-line data manipulation tools were used to extract the columns of interest.
  • The data was then loaded into a PostgreSQL database, where data integration and analysis were carried out: joining tables, then grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built from this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted into a nested dictionary: that is, a dictionary of dictionaries, where the value associated with a key is itself another dictionary that can be accessed using a further key.
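
As a rough sketch of what that looks like (not the production code: the column names come from the paragraph above, but the exact file layout and nesting order are assumptions), the dictionary could be built along these lines:

import csv
from collections import defaultdict

def build_lookup(path):
    # Build a make -> model -> year -> list of {testresult, count} lookup
    # from the aggregated text file (assumed here to be CSV with a header row).
    lookup = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            lookup[row['make']][row['model']][row['year']].append(
                {'testresult': row['testresult'], 'count': int(row['count'])})
    return lookup

# e.g. lookup['FORD']['FIESTA']['2008'] holds the aggregated test results
# for Ford Fiestas first on the road in 2008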

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
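
To make that concrete, here is a minimal sketch of the kind of pass rate computation described above (illustrative only; the 'P' coding for a pass and the field names are assumptions, not the site’s actual code):

def pass_rate(lookup, make, model, year=None):
    # Fraction of tests passed for a make/model, optionally restricted to a
    # single first-year-on-the-road; otherwise aggregated across all years.
    years = [year] if year else list(lookup[make][model])
    passed = total = 0
    for y in years:
        for rec in lookup[make][model][y]:
            total += rec['count']
            if rec['testresult'] == 'P':   # assumed coding for a pass
                passed += rec['count']
    return passed / total if total else None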

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Summary – Big Data Value Association June Summit (Madrid)
https://blog.scraperwiki.com/2015/07/summary-of-big-data-value-association-june-summit-madrid/
Tue, 21 Jul 2015 10:13:39 +0000

Summit Programme

In late June, 375 Europeans + 1 attended the Big Data Value Association (BDVA) Summit in Madrid. The BDVA is the private part of the Big Data Public Private Partnership.  The public part is the European Commission.  The delivery mechanism is Horizon 2020, with €500m of funding. The PPP commenced in 2015 and runs to 2020.

Whilst the conference title included the word ‘BIG’, the content did not discriminate.  The programme was designed to focus on concrete outcomes. A key instrument of the PPP is the concept of a ‘lighthouse’ project.  The summit had arranged tracks focused on identifying such projects: large scale, and within candidate areas like manufacturing, personalised medicine and energy.

What proved most valuable was meeting the European corporate representatives who ran the vertical market streams.  Telecom Italia, Orange and Nokia shared a platform to discuss their sector. Philips drove a discussion around health and wellbeing.  Jesus Ruiz, Director of Open Innovation in Santander Bank Corporate Technology, led the Finance industry track. He tried to get people to think about ‘innovation’ in the layer above traditional banking services. I suspect he meant in the space where companies like Transferwise (cheaper foreign currency conversion) play. These services improve the speed and reduce the cost of transactions.  However, the innovating company never ‘owns’ an individual or corporate bank account, and as a consequence it is not subject to tight financial regulation. It’s probably obvious to most, but I was unaware of the distinction.

I had an opportunity to talk to many people from the influential Fraunhofer Institute!  It’s an ‘applied research’ organisation and a significant contributor to Germany’s manufacturing success.  Last year it had revenue of €2bn.  It was seriously engaged at the event and is active in finding leading-edge ‘lighthouse’ projects.  We’re in the transport #TIMON consortium with it – Happy Days 🙂

BDVA – You can join!

Networking is the big bonus at events like these, and with representatives from 28 countries and delegates from Palestine and Israel, there were many people to meet.  The UK was poorly represented and ScraperWiki was the only UK technology company showing its wares.  It was a shame given the UK’s torch-carrying when it comes to data.  Maurizio Pilu (@Maurizio_Pilu), Executive Director, Collaborative R&D at Digital Catapult, gave a keynote.  The ODI is mentioned in the PPP Factsheet, which is good.

There was a strong sense that the PPP initiative is looking to the long term, and that some of the harder problems have not yet been addressed to extract ‘value’.  There was also an acknowledgement of the importance of standards, and a track was run by Phil Archer, Data Activity Lead at the W3C.

Stuart Campbell, Director and CEO at Information Catalyst, and a professional pan-European team managed the proceedings, and it all worked beautifully.  We’re in FP7 and Horizon 2020 consortia, so we decided to sponsor and actively support #BDVASummit.  I’m glad we did!

The next big event is the European Data Forum in Luxembourg (16–17 Nov 2015).  We’re sponsoring it and we’ll talk about our data science work, PDFTables.com and DataBaker.   The event will be opened by Jean-Claude Juncker, President of the European Commission, and Günther Oettinger, European Commissioner for Digital Economy and Society.

It seems a shame that the mainstream media in the UK focuses so heavily on subjects like #Grexit and #Brexit.  Maybe they could devote some of their column inches to the companies and academics that are making a very significant commitment to finding products and services that make the EU more competitive and also a better place to work and to live.

Publish your data to Tableau with OData
https://blog.scraperwiki.com/2014/03/publish-your-data-to-tableau-with-odata/
Fri, 07 Mar 2014 16:48:38 +0000

We know that lots of you use data from our astonishingly simple Twitter tools in visualisation tools like Tableau. While you can download your data as a spreadsheet, getting it into Tableau is a fiddly business (especially where date formatting is concerned). And when the data updates, you’d have to do the whole thing over again.

There must be a simpler way!

And so there is. Today we’re excited to announce our new “Connect with OData” tool: the hassle-free way to get ScraperWiki data into analysis tools like Tableau, QlikView and Excel Power Query.

To get a dataset into Tableau, click the “More tools…” button and select the “Connect with OData” tool. You’ll be presented with a list of URLs (one for each table in your dataset).

Copy the URL for the table of interest. Then nip over to Tableau, select “Data” > “Connect to Data” > “OData”, and paste in the URL. Simple as that.
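
Because it is just OData over HTTP, the same URL can in principle be read by anything else that speaks the protocol. Here is a quick illustrative check from Python (a sketch only: the URL is a placeholder, and the $format=json parameter and the 'd'/'results' envelope assume an OData v2-style JSON response, which not every service offers):

import requests

url = 'https://example.scraperwiki.com/odata/your-table'   # placeholder URL
resp = requests.get(url, params={'$format': 'json'})       # ask for JSON instead of the default Atom XML
resp.raise_for_status()
rows = resp.json()['d']['results']                          # OData v2 JSON envelope (assumption)
print(len(rows), 'rows downloaded')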

The OData connection is fast and robust – so far we’ve tried it on datasets with up to a million rows, and after a few minutes, the whole lot was downloaded and ready to visualise in Tableau. The best bit is that dates and Null values come through just fine, with zero configuration.

The “Connect with OData” tool is available to all paying ScraperWiki users, as well as journalists on our free 20-dataset journalist plan.


If you’re a Tableau user, try it out, and let us know what you think. It’ll work with all versions of Tableau, including Tableau Public.

Exploring Stack Exchange Open Data
https://blog.scraperwiki.com/2013/08/exploring-stack-exchange-open-data/
Wed, 14 Aug 2013 16:01:48 +0000

Inspired by my long commute and the pretty dreadful EDM music blasting out in my gym, I’ve found myself on a bit of a podcast kick lately. Besides my usual NPR fare (if you’ve not yet listened to an episode of This American Life with Ira Glass, you’ve missed out), I’ve been checking out the Stack Exchange podcast: a fairly irreverent take on the popular Q&A website, hosted by the founders. On the 51st episode, they announced the opening of their latest site, which focuses on the exciting world of open data.

Perhaps the most common complaint I’ve heard since I started surrounding myself with data scientists is that getting specific sets of data can be frustratingly hard. Often, you will find that what you can get by scraping a website is more than sufficient. That said, if you’re looking for something oddly specific, like the nutritional information of all food products on the shelves of UK supermarkets, you can quickly find yourself hitting some serious brick walls.

That’s where Stack Exchange Open Data comes in. It follows the typical formula that Stack Overflow has adhered to since its inception. Good questions rise to the top whilst bad ones fade into irrelevance.


The aim of this site is to provide a handy venue for finding useful datasets to analyse or use in projects. Despite only opening quite recently, it has garnered a large userbase, and people are asking interesting questions and getting helpful answers. These range from information about German public transportation to global terrain data.

Will you be using Stack Exchange Open Data in one of your future projects? Has Stack Exchange Open Data helped you track down a particularly elusive dataset? Let me know in the comments below.

Sharing in 6 dimensions
https://blog.scraperwiki.com/2013/07/sharing/
Fri, 19 Jul 2013 16:36:10 +0000

Hands up everyone who’s ever used Google Docs. Okay, hands down. Have you ever noticed how many different ways there are to ‘share’ a document with someone else?

We have. We use Google Docs a lot internally to store and edit company documents. And we’ve always been baffled by how many steps there are to sharing a document with another person.

There’s a big blue “Share” button which leads to the modal windows above. You can pick individual collaborators (inside your organisation, or out, with Google accounts or without), send email invites, or share a link directly to the document. Each user can be given a different level of editing capability (read-only, read-and-comment, read-and-edit, read-edit-delete). But wait, there’s more!! Clicking the effectively invisible “Change…” link takes you to a second screen, where you can choose between five different privacy/visibility levels, and three different levels of editing capability, for users other than the ones you just picked.

Google Docs publish to web

Or you can side-step the blue button entirely and use File > Publish to the web… which creates a sort of non-editable, non-standard UI for the document, which can either be live or frozen in the current document state, and can either require visitors to have a Google account or not.

Google Docs sidebar

Or, you can save the document into a shared folder, whereby it’ll seemingly randomly appear in the specified collaborators’ Google Drive interfaces, usually hidden under some unexpected sub-directory in the main sidebar. And, of course, shared folders themselves have five different privacy levels, although it’s not entirely clear whether these privacy settings override the settings of any documents within, or whether Google does some magic additive/subtractive/multiplicative divination in cases of conflict.

Finally, if everything’s just getting too much—and frankly I can’t blame you—you can copy and paste the horrifically long URL from your browser’s URL bar into an email and just hope for the best.

AAAAAAAARGHHHHHHH!!!

What’s this got to do with ScraperWiki?

ScraperWiki is a data hub. It’s a place to write code that does stuff with data: to store that data, keep it up to date, document its provenance, and share it with the world. ScraperWiki Classic took that last part very literally. It was a ScraperWiki – everything was out in the open, and most of it was editable by everyone.

ScraperWiki Classic privacy settings

Over time, a few more sharing options were added to ScraperWiki Classic – first protected scrapers in 2011 and then private vaults in 2012. We’d started on the slippery road to Google-level complexity, but it was thankfully still only one set of checkboxes.

When we started rethinking ScraperWiki from the ground up in the second half of 2012, we approached it from the opposite angle. What if everything was private by default, and sharing was opt-in?

Truth is, beyond the world of civic hacktivism and open government, a lot of data is sensitive. People expect it to be private, until they actively share it with others. Even journalists and open data freaks usually want to keep their stories and datasets under wraps until it’s time for the big reveal.

So sharing on the new ScraperWiki was managed not at the dataset level, but the data hub level. Each of our beta testers got one personal data hub. All their data was saved into it, and it was all entirely private. But, if they liked, they could invite other users into their data hub – just like you might give your neighbour a key to your house when you go on holiday. These trusted people could then switch into your data hub much like Github users can switch into an “organisation” context, or Facebook users can use the site under the guise of a “page” they administer.

But once they were in your data hub, they could see and edit everything. It was a brutal concept of “sharing”. Literally sharing everything or nothing. For months we tested ScraperWiki with just those two options. It worked remarkably well. But we’ve noticed it doesn’t cover how data’s used in the real world. People expect to share data in a number of ways. And that number, in fact, is 6.

Sharing in 6 dimensions

Thanks to detailed feedback from our alpha and beta users, about what data they currently share, and how they expect to do it on ScraperWiki, it turns out your average data hub user expects to share…

  1. Their datasets and/or related visualisations…
  2. Individually, or as one single package…
  3. Live, or frozen at a specific point in time…
  4. Read-only, or read-write…
  5. With specific other users, or with anybody who knows the URL, or with all registered users, or with the general public…
  6. By sharing a URL, or by sharing via a user interface, or by publishing it in an iframe on their blog, or by publicly listing it on the data hub website.

Just take a minute and read those again. Which ones do you recognise? And do you even notice you’re making the decision?

Either way, it suddenly becomes clear how Google Docs got so damn complicated. Everyone expects a different sharing workflow, and Google Docs has to cater to them all.

Google isn’t alone. If you ask me, the sorry state of data sharing is up there with identity management and trust as one of the biggest things holding back the development of the web. I’ve seen Facebook users paralysed with confusion over whether their most recent status update was “public” or not (whatever “public” means these days). And I’ve written bank statement scrapers that I simply don’t trust to run anywhere other than my own computer, because my bank, in its wisdom, gives me one username and password for everything, because, hey, that’s sooo much more secure than letting me share or even just get hold of my data via an API.

Sharing in 2 dimensions

As far as ScraperWiki’s concerned, we’ve struck what we think is a sensible balance: You can either share everything in a single datahub, read-write, with another registered ScraperWiki user; or you can use individual tools (like the “Open your data” tool and, soon, the “Plot a graph” tool) to share live or static visualisations of your data, read-only, with the whole world. They’re the two most common use-cases for sharing data: share everything with someone I trust so we can work on it together, or share this one thing, read-only, on my blog so anonymous visitors can view it.

The next few months will tell us how sensible a division that is – especially as more and more tools are developed, each with a specific concept of what it means to “share” their output. In the meantime, if you’ve got any thoughts about how you’d like to share your content on ScraperWiki, comment below, or email zarino@scraperwiki.com. The future of sharing is in your hands.

Open your data with ScraperWiki
https://blog.scraperwiki.com/2013/07/open-your-data-with-scraperwiki/
Thu, 11 Jul 2013 16:48:15 +0000

Open data activists, start your engines. Following on from last week’s announcement about publishing open data from ScraperWiki, we’re now excited to unveil the first iteration of the “Open your data” tool, for publishing ScraperWiki datasets to any open data catalogue powered by the OKFN’s CKAN technology.


Try it out on your own datasets. You’ll find it under “More tools…” in the ScraperWiki toolbar:


And remember, if you’re running a serious open data project and you hit any of the limits on our free plan, just let us know, and we’ll upgrade you to a data scientist account, for free.

If you would like to contribute to the underlying code that drives this tool, you can find its repository on GitHub here – http://bit.ly/1898NTI.

Publish from ScraperWiki to CKAN
https://blog.scraperwiki.com/2013/07/publish-from-scraperwiki-to-ckan/
Fri, 05 Jul 2013 09:11:38 +0000

ScraperWiki is looking for open data activists to try out our new “Open your data” tool.

Since its first launch ScraperWiki has worked closely with the Open Data community. Today we’re building on this commitment by pre-announcing the release of the first in a series of tools that will enable open data activists to publish data directly to open data catalogues.

To make this even easier, ScraperWiki will also be providing free datahub accounts for open data projects.

This first tool will allow users of CKAN catalogues (there are 50, from Africa to Washington) to publish a dataset that has been ingested and cleaned on the new ScraperWiki platform. It’ll be released on the 11th July.
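
For the curious, this is roughly the shape of what publishing to a CKAN catalogue involves under the hood, via CKAN’s action API. It’s a hand-rolled sketch rather than the tool itself, with the catalogue URL, API key and dataset fields as placeholders:

import requests

CKAN = 'https://demo.ckan.org'      # placeholder catalogue URL
APIKEY = 'your-api-key'             # placeholder credential

# Register the dataset (package) on the catalogue
pkg = requests.post(CKAN + '/api/3/action/package_create',
                    headers={'Authorization': APIKEY},
                    json={'name': 'my-scraped-dataset',
                          'title': 'My scraped dataset'}).json()['result']

# Attach the cleaned data file to it as a resource
requests.post(CKAN + '/api/3/action/resource_create',
              headers={'Authorization': APIKEY},
              data={'package_id': pkg['id'], 'name': 'data.csv', 'format': 'CSV'},
              files={'upload': open('data.csv', 'rb')})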

screenshot showing new tool (alpha)

If you run an open data project which scrapes, curates and republishes open data, we’d love your help testing it. To register, please email hello@scraperwiki.com with “open data” in the subject, telling us about your project.

Why are we doing this? Since its launch ScraperWiki has provided a place where an open data activist could get, clean, analyse and publish data. With the retirement of “ScraperWiki Classic” we decided to focus on the getting, cleaning and analysing, and leave the publishing to the specialists – places like CKAN.

This new “Open your data” tool is just the start. Over the next few months we also hope that open data activists will help us work on the release of tools that:

  • Generate RDF (linked data)
  • Update data real time
  • Publish to other data catalogues

Here’s to liberating the world’s messy open data!

A visit from a minister
https://blog.scraperwiki.com/2013/03/a-visit-from-a-minister/
Fri, 08 Mar 2013 16:47:58 +0000

The team and Nick Hurd

You may have heard on Twitter that last Wednesday, Nick Hurd, the Cabinet Office Minister for Civil Society, paid a visit to ScraperWiki HQ. Nick has been looking into government data and transparency as part of his remit, and asked if he could come and have a chat with us in Liverpool.

Joined by Sophie and Laura from the Cabinet Office, the minister spoke with the team about government transparency, and was—Francis tells me—amused to meet the makers of TheyWorkForYou! Nick asked about scraping, and about startup-life in the Northwest.

He also asked how the Government is doing with open data. Our answer was basically that the very first stage of relatively easy wins is done; the hard work now is releasing tougher datasets (e.g. the text of Government contracts), and changing Civil Service culture to structure data better and publish by default.

To illustrate open data in use, Zarino gave a demonstration of our project with Liverpool John Moores University scraping demographic data and using it to analyse ambulance accidents.

Lindsay Sharples from Open Labs at LJMU also joined us, and commented:

“We were delighted to be part of the Minister’s visit to Liverpool. Open Labs has been working with ScraperWiki to support its customer development programme and it was great to see such high level recognition of its ground breaking work.”

Nick Hurd summarised his visit:

“It was fascinating to see how the data-cleaning services of companies like ScraperWiki are supporting local and central government, business and the wider community – making government data more accessible to ordinary citizens and allowing developers and entrepreneurs to identify public service solutions and create new data-driven businesses.

“Initiatives of this sort underline the vital contribution the private sector can make to realising the potential of open data – which this government is a world leader in releasing – for fuelling social and economic growth.”

Three hundred thousand tonnes of gold
https://blog.scraperwiki.com/2012/07/tonnes-of-gold/
Wed, 04 Jul 2012 20:17:18 +0000

On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number runs up against the limits of the double precision floating point representation used in the database, and this same number gets displayed as 1.58887418588e+13.

Doesn’t matter for now. Let’s look at the graph over time:

It’s not very exciting, unless you happen to be interested in phrases such as “debasing our fiat currency” and “return to the gold standard”. In truth, one should really divide the values by the GDP, or the national population, or the cumulative inflation over the time period to scale it properly.

Nevertheless, I decided also to look at the gold price, which can be seen as a graph (click the [Graph] button, then [Group Column (x-axis)]: “date” and [Group Column (y-axis)]: “price”) on the Data Hub. They give this dataset the title: Gold Prices in London 1950-2008 (Monthly).

Why does the data stop in 2008 just when things start to get interesting?

I discovered a download URL in the metadata for this dataset:

https://raw.github.com/datasets/gold-prices/master/data/data.csv

which lives on GitHub™ as part of the repository https://github.com/datasets/gold-prices, in which there resides a 60-line Python scraper known as process.py.

Aha, something I can work with! I cut-and-pasted the code into ScraperWiki as scrapers/gold_prices and tried to run it. Obviously it didn’t work as-is — code always requires some fiddling about when it is transplanted into an alien environment. The module contained three functions: download(), extract() and upload().

The download() function didn’t work because it tries to pull from the broken link:

http://www.bundesbank.de/statistik/statistik_zeitreihen_download.en.php?func=directcsv&from=&until=&filename=bbk_WU5500&csvformat=en&euro=mixed&tr=WU5500

This is one of the unavoidable failures that can befall a webscraper, and was one of the motivations for hosting code in a wiki, so that such problems can be trivially corrected without an hour of labour checking out the code in someone else’s favourite version control system, setting up the environment, trying to install all the dependent modules, and usually failing to get it to work if you happen to use Windows like me.

After some looking around on the Bundesbank website, I found the Time_series_databases (Click on [Open all] and search for “gold”.) There’s Yearly average, Monthly average and Daily rates. Clearly the latter is the one to go for as the other rates are averages and likely to be derivations of the primary day rate value.

I wonder what a “Data basket” is.

Anyways, moving on. Taking the first CSV link and inserting it into that process.py code hits a snag in the extract() function:

import csv
import os

downloaded = 'cache/bbk_WU5500.csv'
outpath = 'data/data.csv'

def extract():
    reader = csv.reader(open(downloaded))
    # trim junk from files
    newrows = [ [row[0], row[1]] for row in list(reader)[5:-1] ]

    # read any rows extracted on previous runs
    existing = []
    if os.path.exists(outpath):
        existing = [ row for row in csv.reader(open(outpath)) ]

    # drop existing rows from the first new date onwards, so they get replaced
    starter = newrows[0]
    for idx,row in enumerate(existing):
        if row[0] == starter[0]:
            del existing[idx:]
            break

    # and now add in new data
    outrows = existing + newrows
    csv.writer(open(outpath, 'w')).writerows(outrows)

ScraperWiki doesn’t have persistent files, and in this case they’re not helpful because all these lines of code are basically replicating the scraperwiki.sqlite.save() features through use of the following two lines:

    ldata = [ { "date":row[0], "value":float(row[1]) }  for row in newrows  if row[1] != '.' ]
    scraperwiki.sqlite.save(["date"], ldata)

And now your always-up-to-date gold price graph is yours to have at the cost of select date, value from swdata order by date –> google annotatedtimeline.

But back to the naked code disclosed on GitHub. Without ScraperWiki’s convenient scraperwiki.sqlite.save() feature, this script must use its own upload() function.

def upload():
    import datastore.client as c
    dsurl = 'http://datahub.io/dataset/gold-prices/resource/b9aae52b-b082-4159-b46f-7bb9c158d013'
    client = c.DataStoreClient(dsurl)
    client.delete()            # clear the existing resource contents
    client.upload(outpath)     # then upload the freshly extracted CSV

Ah, we have another problem: a dependency on the undeclared datastore.client library, which was probably so seamlessly available to the author on his own computer that he didn’t notice its requirement when he committed the code to GitHub, where it cannot be reused without this library. The library datastore.client is not available in the github/datasets account; but you can find it in the completely different github/okfn account.

I tried calling this client.py code by cut-and-pasting it into the ScraperWiki scraper, and it did something strange that looked like it was uploading the data to somewhere, but I can’t work out what’s happened. Not to worry. I’m sure someone will let me know what happened when they find a dataset somewhere that is inexplicably more up to date than it used to be.

But back to the point. Using the awesome power of our genuine data-hub system we can take the us_debt_to_the_penny, and attach the gold_prices database to perform a combined query that scales ounces of gold into tonnes:

SELECT 
  debt.date, 
  debt.totaldebt/gold.value*2.83495231e-5 
    AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata as gold
  ON gold.date = debt.date
WHERE gold.date is not null
ORDER BY debt.date

and get the graph of US government debt expressed in terms of tonnes of gold.

So that looks like good news for all the gold-bugs: the US government debt, in the hard currency of gold, has gone down steadily by a factor of two since 2001, to around 280 thousand tonnes. The only problem is that there’s only around 164 thousand tonnes of gold in the world, according to the latest estimates.

Other fun charts that people find interesting, such as the gold-to-oil ratio, can be produced once the relevant data series is loaded and made available for joining.
