Zarino Zappia – ScraperWiki: Extract tables from PDFs and scrape the web

Publish your data to Tableau with OData
Fri, 07 Mar 2014 16:48:38 +0000

We know that lots of you use data from our astonishingly simple Twitter tools in visualisation tools like Tableau. While you can download your data as a spreadsheet, getting it into Tableau is a fiddly business (especially where date formatting is concerned). And when the data updates, you’d have to do the whole thing over again.

There must be a simpler way!

And so there is. Today we’re excited to announce our new “Connect with OData” tool: the hassle-free way to get ScraperWiki data into analysis tools like Tableau, QlikView and Excel Power Query.

To get a dataset into Tableau, click the “More tools…” button and select the “Connect with OData” tool. You’ll be presented with a list of URLs (one for each table in your dataset).

Copy the URL for the table of interest. Then nip over to Tableau, select “Data” > “Connect to Data” > “OData”, and paste in the URL. Simple as that.

The OData connection is fast and robust – so far we’ve tried it on datasets with up to a million rows, and after a few minutes, the whole lot was downloaded and ready to visualise in Tableau. The best bit is that dates and Null values come through just fine, with zero configuration.

The “Connect with OData” tool is available to all paying ScraperWiki users, as well as journalists on our free 20-dataset journalist plan.


If you’re a Tableau user, try it out, and let us know what you think. It’ll work with all versions of Tableau, including Tableau Public.

ScraperWiki Classic retirement guide Fri, 07 Mar 2014 14:06:04 +0000

In July last year, we announced some exciting changes to the ScraperWiki platform, and our plans to retire ScraperWiki Classic later in the year.

That time has now come. If you’re a ScraperWiki Classic user, here’s what will be changing, and what it means for you:

Today, we’re adding a button to all ScraperWiki Classic pages, giving you single-click migration to a free cloud scraping site run by our awesome friends at OpenAustralia. It’s very similar to the ScraperWiki Classic platform, allowing you to share the data you have scraped. If you’re an open data activist, or you work on public data projects, you should check them out!

From 12th March onwards, all scrapers on ScraperWiki Classic will be read-only: you will no longer be able to edit their code. You’ll still be able to migrate them, or copy the code and paste it into the “Code in your browser” tool on the new ScraperWiki. Scheduled scrapers will continue running until 17th March.

On 17th March, scheduled scrapers will stop running. We’re going to take a final copy of all public scrapers on ScraperWiki Classic, and upload them as a single repository to GitHub, in addition to the read-only archive.

Retiring ScraperWiki Classic helps us focus on our new platform and tools. Our “Code in your browser” and “Open your data” tools on the new platform are perfect for journalists and researchers starting to code, and our free 20-dataset Journalist accounts are still available. So you have no excuse not to create an account and go liberate some data! 🙂

If you have any other questions, make sure to visit our ScraperWiki Classic retirement guide for more info and FAQs.

In summary…

ScraperWiki Classic is retiring on 17th March 2014.

You can migrate your scrapers, or move them to our new “Code in your browser” tool, at any point.

We’re going to keep your public code and data available in a read-only form for as long as we’re able.

New ScraperWiki tool lets you extract data from reports with complete accuracy Thu, 30 Jan 2014 15:41:20 +0000 It’s not always possible to automate data gathering, even with scrapers.

Often we find customers want to regularly update data in ScraperWiki via spreadsheets.

Either they’ve made the spreadsheets via a report from another system (typically one that isn’t on the web), or they gather the data by hand (for example, by phoning someone up every day) and already type it into a spreadsheet.

Today we’re pleased to launch a new Extract Reports tool.


We developed it for the UK Government, and like the ScraperWiki platform it’s open source.


It’s important the person who is uploading the spreadsheet (who often isn’t technical) gets clear and simple error messages if it is misformatted.


Each type of spreadsheet is configured by a custom filter. These are short pieces of Python code which pull out the data and save it. This makes sure they find the data accurately, and can validate it as thoroughly as possible.
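To give a feel for what such a filter involves, here’s a minimal Python sketch. The column names, report format and function name are invented for illustration – this isn’t the tool’s actual API:

```python
import csv
import io

def filter_monthly_report(raw_bytes):
    """Hypothetical filter: pull rows out of an uploaded CSV report,
    validate them, and either return clean records or raise a clear,
    non-technical error message for the person uploading the file."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    required = {"date", "site", "visitors"}
    if not required.issubset(reader.fieldnames or []):
        missing = ", ".join(sorted(required - set(reader.fieldnames or [])))
        raise ValueError("Spreadsheet is missing columns: " + missing)
    records = []
    # Row 1 is the header, so data rows start at spreadsheet row 2.
    for lineno, row in enumerate(reader, start=2):
        try:
            row["visitors"] = int(row["visitors"])
        except ValueError:
            raise ValueError("Row %d: 'visitors' should be a number, got %r"
                             % (lineno, row["visitors"]))
        records.append(row)
    return records
```

The point of writing the filter as code, rather than configuration, is that it can validate as aggressively as the data demands – and turn any failure into a message a non-technical uploader can act on.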

Once the data is loaded, it can be used with other tools just like any other dataset – for example, to visualise it in Tableau or upload it to CKAN.


The tool is available to our corporate customers, so do get in touch.

P.S. Of course, since we’re wicked good at PDFs as well, the tool can also be used to upload reports in PDFs unchanged.

Sharing in 6 dimensions Fri, 19 Jul 2013 16:36:10 +0000 Hands up everyone who’s ever used Google Docs. Okay, hands down. Have you ever noticed how many different ways there are to ‘share’ a document with someone else?

We have. We use Google Docs a lot internally to store and edit company documents. And we’ve always been baffled by how many steps there are to sharing a document with another person.

There’s a big blue “Share” button which leads to the modal windows above. You can pick individual collaborators (inside your organisation, or out, with Google accounts or without), send email invites, or share a link directly to the document. Each user can be given a different level of editing capability (read-only, read-and-comment, read-and-edit, read-edit-delete). But wait, there’s more!! Clicking the effectively invisible “Change…” link takes you to a second screen, where you can choose between five different privacy/visibility levels, and three different levels of editing capability, for users other than the ones you just picked.

Google Docs publish to web

Or you can side-step the blue button entirely and use File > Publish to the web… which creates a sort of non-editable, non-standard UI for the document, which can either be live or frozen in the current document state, and can either require visitors to have a Google account or not.

Google Docs sidebar

Or, you can save the document into a shared folder, whereby it’ll seemingly randomly appear in the specified collaborators’ Google Drive interfaces, usually hidden under some unexpected sub-directory in the main sidebar. And, of course, shared folders themselves have five different privacy levels, although it’s not entirely clear whether these privacy settings override the settings of any documents within, or whether Google does some magic additive/subtractive/multiplicative divination in cases of conflict.

Finally, if everything’s just getting too much—and frankly I can’t blame you—you can copy and paste the horrifically long URL from your browser’s URL bar into an email and just hope for the best.


What’s this got to do with ScraperWiki?

ScraperWiki is a data hub. It’s a place to write code that does stuff with data: to store that data, keep it up to date, document its provenance, and share it with the world. ScraperWiki Classic took that last part very literally. It was a ScraperWiki – everything was out in the open, and most of it was editable by everyone.

ScraperWiki Classic privacy settings

Over time, a few more sharing options were added to ScraperWiki Classic – first protected scrapers in 2011 and then private vaults in 2012. We’d started on the slippery road to Google-level complexity, but it was thankfully still only one set of checkboxes.

When we started rethinking ScraperWiki from the ground up in the second half of 2012, we approached it from the opposite angle. What if everything was private by default, and sharing was opt-in?

Truth is, beyond the world of civic hacktivism and open government, a lot of data is sensitive. People expect it to be private, until they actively share it with others. Even journalists and open data freaks usually want to keep their stories and datasets under wraps until it’s time for the big reveal.

So sharing on the new ScraperWiki was managed not at the dataset level, but the data hub level. Each of our beta testers got one personal data hub. All their data was saved into it, and it was all entirely private. But, if they liked, they could invite other users into their data hub – just like you might give your neighbour a key to your house when you go on holiday. These trusted people could then switch into your data hub much like Github users can switch into an “organisation” context, or Facebook users can use the site under the guise of a “page” they administer.

But once they were in your data hub, they could see and edit everything. It was a brutal concept of “sharing”. Literally sharing everything or nothing. For months we tested ScraperWiki with just those two options. It worked remarkably well. But we’ve noticed it doesn’t cover how data’s used in the real world. People expect to share data in a number of ways. And that number, in fact, is 6.

Sharing in 6 dimensions

Thanks to detailed feedback from our alpha and beta users, about what data they currently share, and how they expect to do it on ScraperWiki, it turns out your average data hub user expects to share…

  1. Their datasets and/or related visualisations…
  2. Individually, or as one single package…
  3. Live, or frozen at a specific point in time…
  4. Read-only, or read-write…
  5. With specific other users, or with anybody who knows the URL, or with all registered users, or with the general public…
  6. By sharing a URL, or by sharing via a user interface, or by publishing it in an iframe on their blog, or by publicly listing it on the data hub website.

Just take a minute and read those again. Which ones do you recognise? And do you even notice you’re making the decision?

Either way, it suddenly becomes clear how Google Docs got so damn complicated. Everyone expects a different sharing workflow, and Google Docs has to cater to them all.

Google isn’t alone. If you ask me, the sorry state of data sharing is up there with identity management and trust as one of the biggest things holding back the development of the web. I’ve seen Facebook users paralysed with confusion over whether their most recent status update was “public” or not (whatever “public” means these days). And I’ve written bank statement scrapers that I simply don’t trust to run anywhere other than my own computer, because my bank, in its wisdom, gives me one username and password for everything, because, hey, that’s sooo much more secure than letting me share or even just get hold of my data via an API.

Sharing in 2 dimensions

As far as ScraperWiki’s concerned, we’ve struck what we think is a sensible balance: You can either share everything in a single datahub, read-write, with another registered ScraperWiki user; or you can use individual tools (like the “Open your data” tool and, soon, the “Plot a graph” tool) to share live or static visualisations of your data, read-only, with the whole world. They’re the two most common use-cases for sharing data: share everything with someone I trust so we can work on it together, or share this one thing, read-only, on my blog so anonymous visitors can view it.

The next few months will tell us how sensible a division that is – especially as more and more tools are developed, each with a specific concept of what it means to “share” their output. In the meantime, if you’ve got any thoughts about how you’d like to share your content on ScraperWiki, comment below, or email us. The future of sharing is in your hands.

Open your data with ScraperWiki Thu, 11 Jul 2013 16:48:15 +0000 Open data activists, start your engines. Following on from last week’s announcement about publishing open data from ScraperWiki, we’re now excited to unveil the first iteration of the “Open your data” tool, for publishing ScraperWiki datasets to any open data catalogue powered by the OKFN’s CKAN technology.


Try it out on your own datasets. You’ll find it under “More tools…” in the ScraperWiki toolbar:


And remember, if you’re running a serious open data project and you hit any of the limits on our free plan, just let us know, and we’ll upgrade you to a data scientist account, for free.

If you would like to contribute to the underlying code that drives this tool, you can find its repository on GitHub.

Your questions about the new ScraperWiki answered Mon, 08 Jul 2013 16:49:27 +0000 You may have noticed we launched a completely new version of ScraperWiki last week. Here’s a suitably meta screengrab of last week’s #scraperwiki twitter activity, collected by the new “Search for tweets” tool and visualised by the “Summarise this data” tool, both running on our new platform.

Twitter summary

These changes have been a long time coming, and it’s really exciting to finally see the new tool-centric ScraperWiki out in the wild. We know you’ve got a load of questions about the new ScraperWiki, and how it affects our old platform, now lovingly renamed “ScraperWiki Classic”. So we’ve created an FAQ that hopefully answers all of your questions about what’s going on.

Take a look.

If there’s anything missing, or any questions left unanswered, let us know. We want to keep that FAQ up to date as the Classic migration goes on, and we’d love your help improving it.

Data analysis using the Query with SQL tool Fri, 05 Jul 2013 08:34:59 +0000
Ferdinand Magellan, the Renaissance’s most prodigious explorer. He almost certainly knew lingua franca – but did he know SQL?
It’s Summer 1513. Rome is the centre of the Renaissance world, and Spanish, Italian, and Portuguese merchant ships criss-cross the oceans, ferrying textiles from the North, spices from the East, and precious metals from the newly-discovered Americas.

If you were to step into these Merchants’ houses, spy on their meetings, or buy them a drink at the local tavern, you wouldn’t hear Italian, or Spanish, or French. You’d hear something known locally as “Frankish” – a mix of Italian, Old French, Greek, Arabic and Portuguese. You couldn’t get anywhere in this world of trade, commerce and exploration without knowing this lingua franca (literally “Frankish language”), and once you’d mastered it, the world was quite literally your oyster.

Fast forward 500 years, to 2013, and there’s a new lingua franca in the worlds of trade, science and commerce: SQL.

Structured Query Language is a custom-built language for working with structured data. With a handful of verbs, and relatively simple grammar, you can perform almost any slice, manipulation or extraction of data you can imagine. Your bank uses it. Your blog uses it. Your phone uses it. (PRISM probably uses it.) And ScraperWiki uses it too.


All the data you collect in the new ScraperWiki—whether it’s from a PDF you’ve scraped, or a twitter stream you’ve searched, or a spreadsheet you’ve uploaded—all the data is structured with SQL. It’s this common underpinning that means all the tools on ScraperWiki are guaranteed to work with each other.

There’s one tool on ScraperWiki that lets you get your hands on this SQL directly, and see the results of your queries, in real-time, on any dataset in ScraperWiki. It’s called the Query with SQL tool, and here it is running on a dataset of my iTunes library:

Query with SQL

Down the left-hand side, you’ll see a list of all your dataset’s tables and columns. You can hover over a column name to see some example values.

In the middle is where you’ll type your SQL queries. And then, on the right, is a table showing you the result. The tool’s read-only, so you don’t need to worry about destroying your data. Just have fun and explore.

The Query with SQL tool starts with a basic SQL query to get you going. But we can do so much better than that. The first thing any data scientist does when faced with a brand new dataset is get a quick idea of its structure. Looking at example values for all the columns is usually a good start.

select * from "tracks" limit 10

When you’re reading data from a SQL database, you start your query with select. Then you list the columns you want to get (or use the special * to select them all) and name the table they’re from. The double quotes around table and column names are optional, but advised, especially when working with names that contain punctuation or spaces.

The above command produces a table that looks a lot like the default View in a table tool. That’s because View in a table is running this exact query when it’s showing you your data. So now you know 🙂

You can go a step further by getting some basic summary statistics for the more interesting columns:
See it in action

select count("play_count") as "total tracks",
min("play_count") as "min plays",
avg("play_count") as "avg plays",
max("play_count") as "max plays",
sum("play_count") as "total plays"
from "tracks"

It turns out I’ve got 4699 tracks in my library, which I’ve played a total of 89670 times. The average number of plays per track is 19, and the most listened-to track has been played 282 times.

The count() function is particularly powerful when combined with group by, to show frequency distributions across your data in a flash:

select "genre", count("genre") as "total tracks"
from "tracks"
group by "genre"

You can even order the output, using order by, to see the most common genres at the top:

select "genre", count("genre") as "total tracks"
from "tracks"
group by "genre"
order by "total tracks" desc

Running that query on my iTunes data reveals my addiction to Soundtracks (1236 tracks), Electronic music (305 tracks) and Prog Rock (218 tracks), amongst others. Changing the count("genre") function to a sum("play_count") shows me the total plays for each genre, rather than the total tracks:

select "genre", sum("play_count") as "total plays"
from "tracks"
group by "genre"
order by "total plays" desc

By now, you’ll probably want to investigate only the rows in your database where a particular thing is true, or a certain condition is met. In this case, let’s investigate that “Electronic” category. Which are the most popular artists?

select "artist", sum("play_count") as "total plays"
from "tracks"
where "genre" = 'Electronic'
group by "artist"
order by "total plays" desc

Zero 7 (3257 plays – nearly 40% of all plays in that genre, and 4% of all the music I’ve ever listened to – scary), Lemon Jelly (1187 plays) and Télépopmusik (745 plays). Cool. How about my favourite artists over the last two months?

select "artist", sum("play_count") as "total plays"
from "tracks"
where "date_last_played" > date('now', '-2 months')
group by "artist"
order by "total plays" desc

SQLite has a really awesome date() function which takes a date/time (or the word “now”) as its first argument, and one or more modifiers – like a time offset – after it, meaning you can easily work with date ranges and totals over time.
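You can try the same date arithmetic in any SQLite session; here’s a quick Python sketch using an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# date() takes a starting point ('now', or any date/time string) plus
# modifiers, and returns a shifted ISO-8601 date string like '2013-05-05'.
two_months_ago = conn.execute("select date('now', '-2 months')").fetchone()[0]

# The same expression filters rows by a date range: because ISO dates
# sort alphabetically, plain string comparison does the right thing.
conn.execute("create table tracks (artist text, date_last_played text)")
conn.execute("insert into tracks values ('Daft Punk', date('now', '-1 day'))")
conn.execute("insert into tracks values ('Lemon Jelly', date('now', '-1 year'))")
recent = conn.execute(
    "select artist from tracks "
    "where date_last_played > date('now', '-2 months')").fetchall()
```

(The artists and dates here are made up to mirror the queries above.)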

My recent Daft Punk obsession is quickly laid bare by SQL’s awesome date comparison power: 2406 plays in the last 2 months – that’s an average of 40 a day. Sheesh, I think I actually have a problem.

Finally, let’s say I wanted to publicise my French Electro-pop addiction – perhaps to a Daft Punkers Anonymous support forum. Just like Classic before it, the new ScraperWiki exposes all of its data via a SQL web endpoint, which returns JSON just begging to be fed into your own custom apps and online visualisations. Complete destruction of my musical reputation is only a short, sharp press of the orange “JSON API” button away.
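Calling such an endpoint from your own code looks roughly like this Python sketch – the base URL below is invented for illustration, not the real endpoint for any dataset:

```python
from urllib.parse import urlencode

# Illustrative base URL -- substitute your dataset's actual endpoint.
BASE = "https://example.com/dataset/123/sql/"

def api_url(query):
    """Build a JSON API URL, passing the SQL query as a URL-encoded
    query-string parameter."""
    return BASE + "?" + urlencode({"q": query})

url = api_url('select "genre", sum("play_count") as "total plays" '
              'from "tracks" group by "genre"')
# Fetching this URL (e.g. with urllib.request.urlopen) returns JSON rows
# ready to feed into custom apps and online visualisations.
```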

Why not give the Query with SQL tool a go on your datasets, and let me know what other hidden gems you discover?

Get a free account »

Announcing the new ScraperWiki Wed, 03 Jul 2013 19:37:29 +0000 Today is a big day for ScraperWiki – our new platform is coming out of beta.

Sign up and give it a go! We think you’ll like it.


ScraperWiki is about liberating data from silos and empowering you to do what you want with it. You can either write your own code, or use our powerful built-in tools…

…And there’ll be more coming. Like one for opening up your data on CKAN, and one for magically scraping tabular data.

If you’re a developer, you can write your own tools using git, SSH and any language you like.

We’ve safely moved all of the existing scrapers and views from ScraperWiki Classic. We suggest you migrate them to the new platform before we shut down ScraperWiki Classic in September – perhaps using something like Ross Jones’ awesome migration script.

If you have any questions about the switch-over, or need help migrating, email me.

It’s all about tools Thu, 06 Jun 2013 08:30:55 +0000 The new ScraperWiki is all about tools. People talk a lot about data, big data, data mining, data science. But the action happens in tools. Tools like Excel, R, SPSS, Python. Or, on the new ScraperWiki, tools like View in a table, Summarise this data and Query with SQL.

We’ve just pushed an improvement to the ScraperWiki beta which puts all of your tools within closer reach. Each tool currently working on a dataset is shown in the black bar at the top, and you can scroll left to see or add more.

New ScraperWiki toolbar

Give it a go, explore your data, and let me know how it feels.


Programmers past, present and future Tue, 04 Jun 2013 09:20:38 +0000 As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only came to code and data tangentially, and are slowly, almost subconsciously, working their way up what I jokingly refer to as Zappia’s Hierarchy of (Programmer) Self-Actualisation™.

The Hierarchy started life as a visual aid. The ScraperWiki office, just over a year ago, was deep in conversation about the people who’d attended our recent events in the US. How did they come to code and data? And when did they start calling themselves “programmers” (if at all)?

Being the resident whiteboard addict, I grabbed a marker pen and sketched out something like this:

Zappia's hierarchy of coder self-actualisation

This is how I came to programming. I took what I imagine is a relatively typical route, starting in web design, progressing up through Javascript and jQuery (or “DHTML” as it was back then) to my first programming language (PHP) and my first experience of databases (MySQL). AJAX, APIs, and regular expressions soon followed. Then a second language, Python, and a third, Shell. I don’t know Objective-C, Haskell or Clojure yet, but looking at my past trajectory, it seems pretty inevitable that I soon will.

To a non-programmer, it might seem like a barrage of crazy acronyms and impossible syntaxes. But, trust me, the hardest part was way back in the beginning. Progressing from a point where websites are just like posters you look at (with no idea how they work underneath) to a point where you understand the concepts of structured HTML markup and CSS styling, is the first gigantic jump.

You can’t “View source” on a TV programme

Or a video game, or a newspaper. We’re not used to interrogating the very fabric of the media we consume, let alone hacking and tweaking it. Which is a shame, because once you know even just a little bit about how a web page is structured, or how those cat videos actually get to your computer screen, you start noticing solutions to problems you never knew existed.

The next big jump is across what has handily been labelled on the above diagram, “The chasm of Turing completeness”. Turing completeness, here, is a nod to the hallmark of a true programming language. HTML is simply markup. It says: “this is a heading; this is a paragraph”. Javascript, PHP, Python and Ruby, on the other hand, are programming languages. They all have functions, loops and conditions. They are *active* rather than *declarative*, and that makes them infinitely powerful.
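The contrast fits in a few lines of toy Python (the numbers are made up): markup only declares what content is, while a program decides what happens.

```python
# Declarative: this just *says* what the content is.
markup = "<h1>Plays</h1><p>282</p>"

# Active: a function, a condition and a loop decide what to do.
def describe(plays):
    if plays > 100:
        return "on heavy rotation"
    return "rarely played"

labels = [describe(n) for n in [282, 3]]
```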

Making that jump – for example, realising that the dollar symbol in $('span').fadeIn() is just a function – took me a while, but once I’d done it, I was a programmer. I didn’t call myself a programmer (in fact, I still don’t) but truth is, by that point, I was. Every problem in my life became a thing to be solved using code. And every new problem gave me an excuse to learn a new function, a new module, a new language.

Your mileage may vary

David, ScraperWiki’s First Engineer, sitting next to me, took a completely different route to programming – maths, physics, computer science. So did Francis, Chris and Dragon. Zach, our community manager, came at it from an angle I’d never even considered before – linguistics, linked data, Natural Language Processing. Lots of our users discover code via journalism, or science, or politics.

I’d love to see their versions of the hierarchy. Would David’s have lambda calculus somewhere near the bottom, instead of HTML and CSS? Would Zach’s have Discourse Analysis? The mind boggles, but the end result is the same. The further you get up the Hierarchy, the more your brain gets rewired. You start to think like a programmer. You think in terms of small, repeatable functions, extensible modules and structured data storage.

And what about people outside the ScraperWiki office? Data superhero and wearer of pink hats, Tom Levine, once wrote about how data scientists are basically a cross between statisticians and programmers. Would they have two interleaving pyramids, then? One full of Excel, SPSS and LaTeX; the other Python, Javascript and R? How long can you be a statistician before you become a data scientist? How long can you be a data scientist before you inevitably become a programmer?

How about you? What was your path to where you are now? What does your Hierarchy look like? Let me know in the comments, or on Twitter @zarino and @ScraperWiki.
