The Tyranny of the PDF

Got a PDF you want to get data from? Try our easy web interface over at PDFTables.com! Why is ScraperWiki so interested in PDF files? Because the world is full of PDF files. The treemap above shows the scale of their dominance. In the treemap the area a segment covers is proportional to the number […]

Time to try Table Xtract

Getting data out of websites and PDFs has been a problem for years, with the default solution being the prolific use of copy and paste. ScraperWiki has been working on a way to accurately extract tabular data from these sources and make it easy to export to Excel or CSV format. We have been internally testing and […]

Data Science (and ScraperWiki) comes to the Cabinet Office

The Cabinet Office is one of the most vital institutions in British government, acting as the backbone to all decision making and supporting the Prime Minister and Deputy Prime Minister in their running of the United Kingdom. On the 19th of November, I was given an opportunity to attend an event run by this important […]

Tableau, Twitter and ScraperWiki

I bumped into Andy Cotgreave at Strata London a couple of weeks ago. The Tableau booth had a handy place for me to put my coffee whilst I ate my biscuits, and as a regular user of Tableau and Twitter I recognised him immediately! Andy is Social Content Manager at Tableau. I was pleased to see […]

(Machine) Learning about ScraperWiki’s Twitter followers

Machine learning is commonly used these days. Even if you haven’t directly used it personally, you’ve almost certainly encountered it. From checking your credit card purchases to prevent fraudulent transactions, through to sites like Amazon or IMDB telling you what things you might like, it’s a way of making sense of the large amounts of data that are increasingly accessible. Supervised learning […]

Live-graphing the UK Government’s agile auction

Hold your nerve! Hold your nerve! Stay at 49! I really think it’s over this time. It had been intense all day, an adrenalin rush. Tens of thousands of pounds potentially at stake. Watching carefully in shifts with no more than a minute of distraction. Luckily Aidan held his nerve, the auction did close this […]

Finding contact details from websites

Since August, I’ve been an intern at ScraperWiki. Unfortunately, that time’s shortly coming to an end. Over the past few months, I’ve learnt a huge amount. I’ve been surprised at just how fast-moving things are in a startup and I’ve been involved with several exciting projects. Before the internship ends, I thought it would be a […]

Open data – the zeitgeist

Open data [1] is becoming a brand – 61 countries are using the brand and many others are expressing interest. The week before last, thousands of delegates from around the world descended on London for a host of open data events that ran throughout the week. There is something of the zeitgeist about open data […]

Underneath the hood of Government’s Performance Platform

In the previous post I described what the UK Government’s new Performance Platform (made by GDS) is for. Today’s question is, how does it work? I’ve found out two ways. Firstly, thanks to Alex Muller from GDS, who talked me through the platform. Secondly, all the code is freely available on GitHub, which is pretty. Component parts There […]

Ignition!

Team ScraperWiki has been in London today, at the Strata Conference. A trip organised late in the day after I received word that my Ignite talk had been accepted. For those unfamiliar with the Ignite format, it’s 5 minutes, 20 slides with strictly 15 seconds per slide – a headlong plunge. The event was hosted […]
