Case study: Enrique Cocero getting political data from PDFs

Political strategy is international now. Enrique Cocero works from Madrid for his consultancy 7-50 Electoral Math, using data to understand voters and candidates in election campaigns across the world. He’s struggled with PDFs for a long time, and recently found PDF Tables via a Google search. He says: I used to have nightmares – I’m […]

The four kinds of data PDF

At ScraperWiki, we talk to lots of customers who need to convert PDFs to Excel. Why are they doing it? The industries are diverse – banking, insurance, retail, logistics, political campaigning, energy… What separates them in data terms, though, is that each has one of four different kinds of workflow. A. Large tables These are PDFs […]

Burn the digital paper! A call to arms

This is a blog post version of a lunchtime talk I gave at the Open Data Institute. You may prefer to listen to it or use the slides. Stafford Beer Stafford Beer was a British cybernetician. He described four stages that happen when you get a computer. Each stage ends in disappointment. 1. Amazement It’s […]

PDFTables.com: PHP, C# and VBA API examples

Invoices, bank statements, feeds of public data… Painful though it can be, many business workflows need to be able to take data in from PDFs. PDFTables.com has had a web API for a while. We’ve just added a few more language examples for C#, PHP and Visual Basic for Applications coders. You can find them […]

Which plane had the most accidents?

Searching by facets Last year, ScraperWiki helped migrate lots of specialist datasets to GOV.UK. This afternoon, we happened to notice that the Air Accidents Investigation Branch reports, which we scraped from their old site, are live. The user interface is called Finder Frontend, and is used by GOV.UK wherever the user needs to search for […]

PDFTables: All the tables in one page, CSV

Lots of you have asked for it, and we’ve finally changed the Excel download format at PDFTables.com to put all the pages of your PDF into one worksheet. This is particularly useful if you have big tables that span multiple pages. You can still have the old format, just choose “Excel (multiple sheets)” from the […]

Four specific things “agile” saved us from doing at ONS

There’s lots of both hype and cynicism around “agile”. Instead, look at this part of the original agile declaration. We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value: … Responding to change over Following a plan That is, while there […]

Announcing PDFTables.com

PDFs were invented at the same time as the web. As “digital paper”, they’re trustworthy and don’t change behind your back. This has a downside – often the definitive source of published data is a PDF. It’s hard to get tens of thousands of numbers out and into a spreadsheet or database. Copying and pasting is […]

The story of getting Twitter data and its “missing middle”

We’ve tried hard, but sadly we are not able to bring back our Twitter data tools. Simply put, this is because Twitter have no route to market to sell low volume data for spreadsheet-style individual use. It’s happened to similar services in the past, and even to blog post instructions. There’s lots of confusion in the […]

getting,stuff,done

In Berlin last week, a bunch of interoperability geeks gathered for the first csv,conf. Yes, that’s right, comma-separated value files. The conference was about getting stuff done. Data in, data out… With an ironic self-recognition that CSVs are weak in lots of ways, but still the best we’ve got. To give you a taste, here […]

ScraperWiki

Extract tables from PDFs and scrape the web

Archive by Author