Zach Beauvais – ScraperWiki: Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Why the World Got Stuck on Spreadsheets and the Future of Data Manipulation
Mon, 20 May 2013 12:50:00 +0000

Guest post by Dan Thompson


In 1979, a Harvard MBA student and former DEC programmer invented something that fundamentally changed the world of IT, and which still affects everyone with a desk job today. What Dan Bricklin had created was the spreadsheet – in its modern form, at least. There had been number-crunching programming languages before, and there had been systems for working on rows and columns of numbers before too. But what Bricklin did was make something interactive: the numbers updated as you were using it. It changed everything.

Thus the first modern spreadsheet application, VisiCalc, was born, and it was massively popular. This was the original “Killer App;” there are those who put the success of the Apple II down to VisiCalc alone. In the years that followed, Lotus 1-2-3 and then Microsoft Excel would take over, but they never fundamentally deviated from the way that VisiCalc worked. The grid concept, the cell references, the simple formula language: they’ve all been there since the beginning. Which is shocking when you think about it, because since that time we’ve seen the introduction of the mouse, the graphical user interface, the Mac, the Windows PC, the web, smartphones, tablet computers and touch interfaces.

So why did spreadsheets not move on? The answer is that they have, sort of. Excel now has named ranges, pivot tables, PowerPivot, change tracking, multi-user collaboration and tools to help with formula debugging, but you’re probably not using them. Either because you didn’t know they were there, you find them too complicated, or because you just haven’t been forced to learn them yet. The reason for this comes down to Microsoft’s dominance of the business IT market, which is built largely on backward compatibility. Businesses will only buy the new version of Excel if they’re confident all their stuff will still work. So any new features that Excel gets must be optional and out of the way. Oh, and with zero chance of losing market share, making it slick and highly usable is not a major concern. Compatibility: crucial. Ease of use: optional.

For a long time, businesses have tried to replace spreadsheets with easier-to-use but less flexible centralised databases. But no sooner would they replace one spreadsheet the sales team were using with a “proper” system, than the sales team would go and invent three new spreadsheets to help with something else. Which points to spreadsheets’ biggest strength: the flexibility they give to those on the front line. People don’t want “one big database to rule them all”, they want flexible tools for working with data.

Change is coming, though: the Windows mono-culture is now near its end. The prevalence of Macs, iPads and Chromebooks is forcing software onto the web as the one technology which works everywhere. Fortunately, the web is now maturing and is powerful enough to support rich user interfaces. Meanwhile, the ecosystem of single-purpose apps that Apple pioneered with its App Store is being emulated on the Mac, Chrome and Windows 8. There is now a way for people like me, who’ve had an idea for an application or tool, to get that onto people’s computers. Innovation and proper competition can once again resume.

I have no doubt that spreadsheets will always be with us, but they will be joined by many more streamlined tools, each serving different and more specific use cases. OpenRefine is a great tool for cleaning up data sets and fixing mismatching values. Tableau is a great way to create interactive visualisations that go beyond the graphs and charts that people are used to. Meanwhile QueryTree, the product I founded, makes the process of sorting, joining, grouping and generally exploring data easier than with a formula driven spreadsheet.

But all will not be plain sailing. Ecosystems depend on a shared set of formats or standards in order to work. Train companies need to agree on the width of the track, television makers and broadcasters need to agree on what “HD” actually means, and for people who make apps that work with data, we need that data to be in a format we can understand. For a while, XML was the answer to everything. These days, it’s JSON. Yet each new Web API that launches structures its data in a slightly different way. I predict a long, drawn-out and gruesome battle to own the platform that apps will share structured data in. Maybe SPARQL will win, maybe everything will end up built on top of Google Spreadsheets – I have no idea. But, if I were going to place a bet on any of these technologies, I’d pick the one that is simple, that doesn’t rely on any one vendor and which already works in every tool out there. Yes, you’ve guessed it, my prediction for the data format of the future is: the CSV file.

Dan is the Founder and Managing Director of D4 Software, the company behind the data analysis tool QueryTree. Dan started his career as a C, C# and then Python programmer, turned Development Manager. He lives in Worcestershire with his wife and two children.

Summarising Serendipity
Tue, 07 May 2013 09:50:01 +0000

Five years ago, a friend and I sat down in a pub in Shrewsbury, drank some beer, and chatted about the web. Every month since, people have been doing that in Shrewsbury (and a few times in Ludlow). It’s called ShropGeek (we’re very savvy in our naming conventions, you see). It was started and organised almost exclusively via twitter, and it has evolved from a monthly banter-session into an annual conference.

I say this here not as an advertisement (cough), but because I have almost accidentally ended up as a co-organiser of quite a big event. (It’s an accident, on my part, because Kirsty has put a hell of a lot of time and effort into it!)

Because of its twitter-heavy organisation, I’ve always had a pretty good idea of what was going on, since @shropgeek would normally be included in conversations. With the nature of the beast evolving, I’ve had some pretty important questions about how people share, and what they’re saying about the conference, and I accidentally (I seem accident-prone when it comes to administrative tasks) discovered a very useful resource which sits under my nose at my day-job: the Summarise Automatically tool on the New ScraperWiki.

One of the first things I did, to test out the summariser, was to search twitter for mentions of the “#revolutionconf” hashtag, then click on “summarise this data.” My expectations were to see some cool graphics, and mainly to test it out as a ScraperWiki tool. What I found, though, were some really valuable views on how people are tweeting.

Basically, the summariser tool tries to tell you some instant things about your data by, well, summarising the columns in your data. This can be a bit of a mixed bag, with some summaries making little sense (but we’ll get better at that). However, the really cool thing is the very high-level, dashboard-like information I could get on *this* data, which I know comprises tweets, all of which are related to my hashtag.

1. The first win was a simple count of how many mentions there are. I saw that the hashtag hasn’t been used as much as it could be (with only 66 instances), and realised that I’ve tweeted several times without it. /me slaps own hand!

2. Next, for me, was the screen_name summary. I saw several people on that list who I didn’t realise were in-the-know, and was able to remind myself to thank them soon!

3. The pie-chart showing other hashtags was also interesting, because it included the word “#excited”. Although this doesn’t seem to capture *every* other hashtag, it was good to see.


4. Finally, the “url” column was summarised as a pie-chart, showing me which urls were included within tweets containing the conference hashtag. This is very interesting, because I can see whether people are linking to the index page, or to the ticket page for the site. Also, I can see what *isn’t* being linked (e.g. the Lanyrd page for the event, or direct to Eventbrite, which I expected to happen).

These were all interesting, and helped me instantly better understand how people are talking about the conference. I should also point out that the tool ran automatically: all I did was install the tool on my search data, and it presented me with this information without any setup. Best of all, it also showed me some things I wasn’t planning for. The list of re-tweeters, for example, jogged my memory, and made me consider asking some specific people to mention the event, which is something I hadn’t thought of doing.

I’m pretty excited about this tool, not just because it’s geeky and has charts, but because it’s at a very early stage and *already* did something useful with my social data. As it improves, I hope we get some more instant-win effects from it, and I’d be keen to hear what we could do to make it better, too.

data-driven london week
Fri, 26 Apr 2013 17:03:44 +0000

[view of the Shard from Shoreditch]

Most mornings this week, I awoke in the mystical land of Hackney, and battled hordes of hipster-cyclists to make my way to the Google Campus – a refuge of data-folk. At least, that’s how I like to remember it.

As I blogged last week, several ScraperWikians attended and spoke at a range of events, all put on to the tune of “Big Data.” I spent Monday evening with a friendly meetup group talking about the importance of data in marketing. And on Wednesday, I watched a very smart presentation by Thomas Stone (hopefully, soon-to-be Dr. Stone) on what looks to be an interesting, open-source project for developers to call upon machine learning without the need for proprietary lock-in.

Alongside Stone, I also learned about Games Analytics from their COO, Mark Robinson. The gist of the talk was that games – particularly online games – give their producers the chance to deeply understand how players actually use their product. Through continuous contact with the players, they can learn: what stops them from playing, where they find it difficult to continue, how many times they log in before purchasing… What I liked about this was the lack of hand-wavey discussion about “data leading to insights.” Instead, Robinson’s talk focused on how this data can lead to quite practical decisions, such as making levels of a game quicker at the start, reducing the cost in places, and increasing it in others.

Between those two events, I had the tremendous privilege of joining around 120 others for the W3C’s Open Data on the Web workshop. The remarkable brain-power per square inch at the workshop was mentioned quite a few times, and – although I tend to feel disinclined to perpetuate that kind of talk – I must agree. The Campus hosted architects, businesspeople, developers, hackers and scientists from government bodies, universities, NGOs and foundations, mixed with large companies (including IBM, Adobe, Tesco and Google).

I was particularly drawn to discussions about building and growing businesses on data. I’m intrigued by the use of open data to augment private data – for example: taking aggregated customer data, and matching it with government stats, open geographic data, public social media, etc. – and I think ScraperWiki is well-placed to work on it. I’ve got a few ideas for some tooling for the new ScraperWiki platform, which I’d like to explore in a few weeks.

I don’t feel there is enough space here to do proper justice to the topics covered, but suffice it to say I’m glad I had a chance to go, and was able to take part in the afternoon’s Barcamp (our team discussed the application of the recent revolution of distributed coding workflows to data handling – in other words, Github for data).

I would also like to point out a few of the sessions, and recommend reading their papers.

I don’t have a link yet to Tesco’s talk (just the abstract) about their huge sets of data (product, customers, locations, journeys…), but if anyone has one, or as soon as I find it, I’ll put it here!

Two ways you can help guide ScraperWiki’s new platform.
Mon, 22 Apr 2013 14:56:45 +0000

You will have noticed some activity over the past few weeks, as we have begun reaching out about the new ScraperWiki platform. We’ve blogged about some of the new features, and have invited the first ever users outside the office to have a poke around the beta.

That initial feedback has been immeasurably helpful, and has led to bug-fixes, feature requests, and some directional suggestions which we can’t thank you enough for.

But we need more.

  • more feedback
  • more testers
  • more questions answered
  • more coffee…

So, there are now two ways you can join the testing community, and – with absolutely no exaggeration – play a vital part in the future functionality, design, and direction of the new ScraperWiki.

First, you can become a premium user of the New ScraperWiki. We have just switched on the payment plans, and are tweaking settings and tuning things up as it’s rolled out. You can see the two available premium plans here, and sign up.

Second, you can join the queue for private free-tier testers by emailing us. Just ping us your name from the email account you’d like invited, and we will add you to the list, which we’re breaking into groups simply so we can test different things and make the most of your first impressions!

Big Data Week Events
Thu, 18 Apr 2013 16:04:46 +0000

Next week, a plethora of organisations, hackers and data scientists are celebrating “Big Data Week,” and the ScraperWiki team will be taking part in London.

We will be supporting the DoES Liverpool exhibit at the Internet of Things stream of Internet World at Earls Court (#internetworld2013). Francis will also be giving a talk at 1:30 on Wednesday, discussing the past, present and future of government data; you can catch his talk at the Big Data Show “Volume and Variety” Theatre. Registration is free, but the organisers recommend you book in advance.

I will be more or less camped at the Google Campus in Shoreditch, attending Marketing ReMix on Monday and the Big Data Meetup hosted by the Geckoboard team on Wednesday. If anyone’s about on Tuesday, I’ll be working from the Google cafe, so drop me a line if you’d like to meet up!

There will be plenty of opportunity to meet in London next week, so if you have any questions for ScraperWikians, get in touch and come join us!

Asking data questions of words
Tue, 09 Apr 2013 13:22:02 +0000

The vast majority of my contributions to the web have been loosely encoded in the varyingly standards-compliant family of languages called English. It’s a powerful language for expressing meaning, but the inference engines needed to parse it are pretty complex, staggeringly ancient, yet cutting edge (i.e. brains). We tend to think about data a lot at ScraperWiki, so I wanted to explore how I can ask data questions of words.

Different engines render English into browser-viewable markup (HTML): twitter, wordpress, Facebook, tumblr and emails; alongside various iterations of employers’ sites, industry magazines, and short notes on things I’ve bought on Amazon. Much of this is scrapeable or available via APIs, and a lot of ScraperWiki’s data science work has been gathering, cleaning, and analysing data from websites.

For sharing facts as data, people publish CSVs, tables and even occasionally access to databases, but I think there are lessons to learn from the web’s primary, human-encoded content. I’ll share my first tries, and hope to get some feedback (comment below, or drop me a line on the ScraperWiki Google Group, please).


There’s a particularly handy Python package, NLTK, for treating words as data, and the people behind it wrote a book (available online) introducing people not only to the code package, but to ways of thinking programmatically about language.

One of many nice things about NLTK is that it gives you loads of straightforward functions so you can put names to different ways of slicing up and comparing your words. To get started, I needed some data words. I happened to have a collection of CSV files containing my archive of tweets, and the ever-helpful Dragon at ScraperWiki helped me convert all these files into one long text doc, which I’m calling my twitter corpus.
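That conversion step is simple enough to sketch in plain Python. This is only my guess at the sort of script involved – the file pattern and the `"text"` column name are assumptions about how the archive of tweet CSVs is laid out:

```python
import csv
import glob

def csvs_to_corpus(pattern, out_path, column="text"):
    """Concatenate one column from many CSV files into a plain-text corpus."""
    with open(out_path, "w") as out:
        # Sort the filenames so the corpus stays in (rough) chronological order.
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    if row.get(column):
                        out.write(row[column] + "\n")

csvs_to_corpus("tweets/*.csv", "tweets.txt")
```

The point of writing everything to one flat text file is that NLTK can then treat the whole archive as a single body of text.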

Then, I fed NLTK my tweets and gave it a bunch of handles – variables – on which I may want to tug in future (mainly to see what they do).

[sourcecode language="python"]
from nltk import word_tokenize, wordpunct_tokenize, pos_tag, Text

filename = 'tweets.txt'

def txt_to_nltk(filename):
    raw = open(filename).read()
    tokens = word_tokenize(raw)
    words = [w.lower() for w in tokens]
    vocab = sorted(set(words))
    cleaner_tokens = wordpunct_tokenize(raw)
    # "Text" is a datatype in NLTK
    tweets = Text(tokens)
    # For language nerds, you can tag the Parts of Speech!
    tagged = pos_tag(cleaner_tokens)
    return dict(
        raw=raw,
        tokens=tokens,
        words=words,
        vocab=vocab,
        cleaner_tokens=cleaner_tokens,
        tweets=tweets,
        tagged=tagged,
    )

tweet_corpus = txt_to_nltk(filename)
[/sourcecode]

Following some exercises in the book, I jumped straight to the visualisations. I asked for a lexical dispersion plot of some words I assumed I must have tweeted about. The plot illustrates the occurrence of words within the text. Because my corpus is laid out chronologically (the beginning of the text is older than the end), I assumed I would see some differences over time:

[sourcecode language="python"]
# dispersion_plot is a method of the NLTK Text object stored above
tweet_corpus["tweets"].dispersion_plot(["coffee", "Shropshire", "Yorkshire",
                                        "cycling", "Tramadol"])
[/sourcecode]

Can you guess what some of them might be?


This ended up pretty much as I’d expected: illustrating my move from Shropshire to Yorkshire. It shows when I started tweeting about cycling, and the lovely time I ended up needing to talk about powerful painkillers (yep, that’s related to cycling!). The word “coffee”, meanwhile, crops up continuously throughout my tweets. This kind of visualisation could be particularly useful for marketers watching the evolution of keywords, or head-hunters keeping an eye out for emerging skills. Basically, anyone who wants to see when a topic gathers momentum within a set of words (e.g. the back-catalogue of an industry blog).

Alongside the lexical dispersion plot, I also wanted to focus on a few particular words within my tweets. I looked into how I tweet about coffee, using a few of NLTK’s most basic functions. A simple ‘tweet_corpus["tweets"].count("coffee")’, for example, gives me the beginnings of keyword metrics from my social media. (I’ve tweeted “coffee” 809 times, btw.) Using the vocab variable, I can ask Python – ‘len(tweet_corpus["vocab"])’ – how many different words I use (around 35k), though this tends to include some redundancies like plurals and punctuation. Taking an old linguist’s standby, I also created a concordance, getting the occurrences within context. NLTK beautifully lined this all up for me with a single command: ‘tweet_corpus["tweets"].concordance("coffee")’

View the code on Gist.
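If you don’t have NLTK to hand, the same basic metrics are easy to approximate in plain Python – a rough, standard-library-only sketch of what the count, vocabulary and concordance calls are doing (the sample sentence is made up):

```python
# Plain-Python approximations of NLTK's count(), vocabulary and
# concordance() - crude, but enough to see what the calls are doing.
def keyword_metrics(raw, keyword, width=30):
    # Tokenise crudely: lowercase, split on whitespace, strip punctuation.
    tokens = [w.strip(".,!?:;\"'()") for w in raw.lower().split()]
    count = tokens.count(keyword)
    vocab_size = len(set(tokens))
    # Concordance: every occurrence with a window of surrounding text.
    lines, start, lower = [], 0, raw.lower()
    while True:
        i = lower.find(keyword, start)
        if i == -1:
            break
        lines.append(raw[max(0, i - width):i + len(keyword) + width])
        start = i + 1
    return count, vocab_size, lines

count, vocab_size, lines = keyword_metrics(
    "Morning coffee. Then more coffee before cycling.", "coffee")
```

NLTK’s real versions are smarter about tokenisation and layout, but the underlying idea is this simple.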

I could continue to walk through other NLTK exercises, showing you how I built bigrams and compared words, but I’ll leave further exploration for future posts.
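For the curious, bigrams are nothing exotic: just pairs of adjacent tokens. NLTK has a bigrams() function built in, but the idea fits in a couple of lines of plain Python (the sample tokens here are made up):

```python
from collections import Counter

def bigrams(tokens):
    # Pairs of adjacent tokens - what nltk.bigrams() yields.
    return list(zip(tokens, tokens[1:]))

tokens = "good coffee then more good coffee".split()
pairs = bigrams(tokens)
# Which pairing turns up most often?
top = Counter(pairs).most_common(1)
```

Counting the most common bigrams in a corpus is one quick way to spot the phrases, rather than single words, that you lean on.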

What I would like to end with is an observation/question: this noodling in natural language processing on social data makes it clear that a very few commands can be used to provide context and usage metrics for keywords. In other words, it isn’t very hard to see how often you’ve said (in this case, tweeted) a keyword you may be tracking. You could treat just about any collection of words as your own corpus (company blog, user manuals, other social media…), and start asking some very straightforward questions very quickly.

What other data questions would you want to ask of your words?

Data Science Magic
Tue, 26 Mar 2013 13:57:55 +0000

[where data wizards learn magic]

As a business person, if I want insight into my business needs, I can ask a data scientist for answers, so I can make better decisions!

Urgh: the sound I make reading that.

I am starting to wonder if Data Science is seen as magic, and insight as arcane wisdom distilled from eyes of newts – lots of them, cause it’s a Big Data cauldron, of course!

Many online business publications cover Data Science as a buzzword, and a quick search adding “data science” to a business function like, say, “marketing” returns many posts extolling the power of data to inform decisions. But examples are scarce.

This perception seems to be reinforced by well-known examples of analysis actually turning up business ideas. One of the best examples is America’s everything-shop Target inferring that some of its shoppers were pregnant before it became public knowledge (sometimes, even before they had told close family). This came about, it seems, because people asked a simple business question:

‘“Specifically, the marketers said they wanted to send specially designed ads to women in their second trimester … Can you give us a list?” the marketers asked.’ (NYT)

Target distills a lot of information into its shoppers’ profiles: shopping habits (obviously), bank used, car driven, websites visited (oh hai cookies!), and the rest. Using all this data, Target’s stat wonks were able to mashup user profiles with wider habit-patterns, and find people matching a predicted pattern of behaviours.

The NYT article discusses a particularly interesting area of data science, where internal data is augmented by external information. So, by acquiring population stats (the NYT guesses at things like demographics, zip codes and birth records, though Target didn’t say exactly what external data), they were able to make better guesses about what kind of people were in their big stacks of user profiles.

This is interesting (if somewhat creepy), because the people asking the question had a pressing business need: tell pregnant women that Target sells newborn baby stuff, before they give birth. The marketers asked that question of the analysts, instead of asking for just the metrics. Avinash Kaushik talks about that phenomenon particular to web analytics, and I think it’s a similar story when business people ask report questions instead of business questions of their data scientists.

All of that stems from the simple business question: “Can we have a list of people who are probably pregnant, so we can send them a specific message?” And, let’s face it, Target had budget to spend on huge data. So, what are the lessons for smaller businesses beyond seeing data science as magic?

OK, so there must be some wins closer to home: where else are business questions being asked of data? I am very interested in other examples of marketers, flaks, hacks, managers and CIOs asking such questions, and would love to talk to some of you!

I put a similar question up on Quora, and there is a comment box down there. You can also drop me a line if you have good examples. I’d like to write them up, tell the stories to the ScraperWiki community, and get beyond the “magic” and into cases, facts, and workable ideas!

image credit: “Arches upon Arches” by Zach Beauvais, CC BY-SA 2.0 via flickr

Young Rewired State: 2013 Festival of Code
Mon, 25 Mar 2013 16:49:59 +0000

Guest post by Kaitlin Dunning from Young Rewired State


Young Rewired State is a network of software developers and designers aged 18 and under. It is the philanthropic arm of Rewired State, and its primary focus is to find and foster the children and teenagers who are driven to teach themselves how to code – to program the world around them. The aim is to create a worldwide, independent, mentored network of young programmers supported – and supporting each other – through peer-to-peer learning. Ultimately, young developers can be solving real-world challenges.

The Festival of Code is our annual celebration of everything code. It takes place all over the UK every year in the first full week of August, and ends with a long weekend at the Custard Factory in Birmingham, with everybody coming together to showcase the amazing achievements. This year, the dates are 5-11 August, and we’re aiming to have 60 centres around the UK, with 1000 kids participating!


Participating in the Festival of Code is the best way to get to know how we work and to become a part of the community  –  whether as a young person, mentor or host centre.
The mentor community is a huge part of the success of the Festival of Code. Traditionally, it has been drawn from the Rewired State network, but as the popularity of the week has grown, so has the mentor network. Indeed, some of you mentors will be YRS alumni who are aged over 18, and therefore too old to be a YRS participant. We hope that as the years go by, our mentor numbers will grow at the same rate as new attendees.

The role of the mentor is manifold, and includes: providing expertise in programming, design, presentation skills, agile, ideation, robotics, open data, open government data, and graphics. It also involves assisting the centre lead in looking after the room, alongside assessing skills and encouraging collaboration.

To help grow the mentor network, we would like to ask the ScraperWiki community if there are any people here interested in getting involved. If you are interested in learning how you can connect with Young Rewired State (as a centre, mentor, or sponsor), please get in touch by email.

Careers in Computing
Fri, 22 Mar 2013 10:40:28 +0000

I realise the whole world isn’t inspired by the same things I am, and that’s fair enough. However, on Wednesday, I had the privilege of being invited to share some of my inspiration with a bunch of teenagers at the Liverpool John Moores University career day.

I was asked to talk to year-10 students (14-15 years old) for half an hour, covering “careers in computing,” and nearly balked at the enormity of that kind of request. They said it was OK to narrow it down a bit, so I decided to think more about CompSci and the web. I also took a webby approach to writing this talk (I cheated), and posted this question on Quora:

“If I were to speak to 50 teenagers about why a career in computer science is awesome, what would be a travesty to leave out?”

I grant you, Computer Science isn’t necessarily the best match for a career in computing (indeed, I pointed out to the kids that they can learn to code and contribute to the world’s largest public resource right now). However, a particular answer I received from Ian Dickinson inspired me to trash my “history of computing in Britain” approach – Babbage, Lovelace, TimBL etc – and rework the talk completely. I’ll share a few of those slides below, and continue to enjoy the glory of geeks sharing their careers.

Getting SkilledUp
Thu, 21 Mar 2013 16:37:48 +0000

Guest post by SkilledUp’s Nick Gidwani.

ScraperWiki is a revolutionary tool. Not just because it allows you to collect data, but because it allows anyone – including journalists, who now must specialize in data – to organize and draw conclusions from vast data sets. That skill set (organizing what is now called “big data”) was not something that was expected of journalists just a few years ago. Now, in our infographic- and data-hungry new world, being versed in analysis is critical.

Knowledge work – the type of work that requires creativity, problem solving and “thinking” – is the most important and valued work of the future. Increasingly, to become a knowledge worker, one must learn a varied set of skills rather than just being a master of one’s own domain. We are already seeing “Growth Hacker” become an increasingly valued position, where individuals with comfort and expertise in data, analysis, marketing and product development can combine those skills to do what was once the job of several people. It is a rarity these days to find a writer who doesn’t have at least a proficiency in social media concepts, optimizing for SEO or editing an image in Photoshop.

Few university graduates arrive on the job with any of these hard skills – the ability to immediately contribute. Those starting off at a Fortune 100 company stand a good chance of having an internal training system to help them get started and trained to use the newest tools. Everyone else, though, is expected to learn it on their own: not an easy proposition, especially in today’s very competitive labour market.

Well, the good news is that there is an army of experts, startup CEOs, and others who have run companies, mentored and trained employees, and had success in business, who are willing to share their skills – the how, the why, and the what – via online learning to large audiences. Even better, much of this content is free or available at a very low cost, and widely available. Although online training companies have been around for a while, we’ve seen exponential growth in the variety and number of online courses being introduced in the last 18 months. New, easy-to-use tools for creating online courses have also fuelled that growth. Different from the university model, this educational content is highly fragmented, rarely features certification, and is priced all over the map.

SkilledUp‘s goal is to organize, curate and review the world of online education, with a focus on the type of education that imparts marketable job skills. We believe that in this new world, many of the best workers are going to be those who teach themselves these skills by first learning from experts, and then trying their hand by experimenting in their jobs or on their own. While there is a lot of great training from excellent instructors, much of it is prohibitively expensive or too lengthy to apply quickly. We hope to make it easier to separate the best from the rest.

We’ve begun by creating our online course search app, which currently organizes 50,000 courses from over 200 unique libraries. We expect these numbers to grow continuously, especially in areas like MOOCs (Massive Open Online Courses), talks, and even e-books. We understand that everyone learns differently, and while some may prefer a 12-hour course with video, others prefer a high-quality e-book, perhaps with exercise files.

Over time, we expect to add signal data to these courses, so that parsing the ‘best’ from the ‘rest’ is easier than reading a review.

To create something that is truly comprehensive and robust, we are asking the ScraperWiki community for help: more course libraries need gathering, sorting and adding to our database, especially of the open (aka free) variety. We’re looking for people who can scrape sites, make suggestions, or suggest learning libraries they know about so that we can build the most robust and useful index possible.
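By way of illustration, here’s a minimal sketch of the kind of scraper that helps. It uses only Python’s standard library, and the page structure (an `<li class="course">` element per course) is entirely hypothetical – a real course library’s markup will differ:

```python
from html.parser import HTMLParser

class CourseParser(HTMLParser):
    """Collect the text of every <li class="course"> element."""
    def __init__(self):
        super().__init__()
        self.in_course = False
        self.courses = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "course":
            self.in_course = True
            self.courses.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_course = False

    def handle_data(self, data):
        if self.in_course:
            self.courses[-1] += data.strip()

parser = CourseParser()
parser.feed('<ul><li class="course">Intro to Python</li>'
            '<li class="course">Data Journalism 101</li></ul>')
# parser.courses is now ['Intro to Python', 'Data Journalism 101']
```

In practice you would fetch the page over HTTP and feed its body to the parser; even a snippet this small, pointed at a real listing page, is the sort of contribution we’re asking for.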

If you have any ideas, or can contribute a few simple lines of code, please get in touch via the ScraperWiki Google Group, drop a comment below, or email us directly.

Nick Gidwani uses ScraperWiki as part of his startup company: SkilledUp.