Thinking – ScraperWiki
Extract tables from PDFs and scrape the web

Over a billion public PDFs
Tue, 15 Sep 2015 07:02:33 +0000

You can get a guesstimate for the number of PDFs in the world by searching for filetype:pdf on a web search engine.

These are the results I got in August 2015 – follow the links to see for yourself.

                        Google        Bing
Number of PDFs          1.8 billion   84 million
Number of Excel files   14 million    6 million

The numbers are inexact, but that’s likely over a billion PDF files. Of course, it’s only the visible ones…

But the fact is that the vast majority of PDFs are in corporate or governmental repositories. I’ve heard various government agencies (throughout the world) comment that they have tens of millions (or more) in their own libraries/CMS’s. Various engineering businesses, such as Boeing and Airbus are also known to have tens of millions (or more) in their repositories.

– Leonard Rosenthol, Adobe’s PDF Architect

Digging a bit deeper by also adding site: to the search, you can find out what percentage of the documents a search engine has indexed are PDFs.

        Number of PDFs   Total number of pages   % PDFs
.com    547 million      25 billion              2%
.gov    316 million      839 million             38%

UK – HM Treasury Summer Budget 2015

That’s proportionately a lot more PDFs published by the US Government than by commercial sites!
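
The percentages in the table can be checked with a few lines of arithmetic. This is only a sketch using the rounded counts quoted above; the variable and function names are made up for the example.

```python
# Rounded result counts quoted in the table above (August 2015 searches).
counts = {
    ".com": {"pdfs": 547_000_000, "pages": 25_000_000_000},
    ".gov": {"pdfs": 316_000_000, "pages": 839_000_000},
}


def pdf_share(pdfs, pages):
    """Return PDFs as a whole-number percentage of all indexed pages."""
    return round(100 * pdfs / pages)


# Work out the share for each top-level domain.
shares = {domain: pdf_share(c["pdfs"], c["pages"]) for domain, c in counts.items()}

for domain, share in shares.items():
    print(f"{domain}: {share}% PDFs")
```

Since search engines only report estimated result counts, the output is no more exact than the inputs.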

Got a PDF you want to get data from?
Try our easy web interface!
Burn the digital paper! A call to arms
Fri, 07 Aug 2015 14:58:04 +0000

This is a blog post version of a lunchtime talk I gave at the Open Data Institute. You may prefer to listen to it or use the slides.

Stafford Beer

Stafford Beer was a British cybernetician.


He described four stages that happen when you get a computer.

Each stage ends in disappointment.

1. Amazement

It’s an electronic brain!

This is a homeostat from the article “The Electronic Brain” in the March 1949 issue of Radio Electronics.


At this stage you get no benefit from the computer. It is just amazing.

2. Digital paper

Most of us, in everyday business use of computers, are at this stage.

We write documents, which in form look much like this one from 1946.

Argentine Situation

We send the documents to each other using an electronic metaphor for postal mail.

Letter Carrier Delivering Mail

Sure, copies are cheaper to make than real paper, and they arrive in an instant. However, the underlying process is the same. That’s why it was so easy to learn.

Even our data is digital paper.

This spreadsheet, a financial ledger from the World’s Columbian Exposition in 1893, looks much the same as any business spreadsheet today.

World's Columbian Exposition spreadsheet

The numbers in the columns can add themselves up now, but it isn’t making full use of our computers.

This overuse of paper methods causes problems. Errors can waste billions of dollars.

Is Excel the most dangerous piece of software in the world?

It’s disappointing. Electronic brains must be able to do more!

3. Gather up as data

Let’s structure everything as data and put it all together in a massive store!

At this stage of computer use, the hope is that once everything is uniform, we’ll have amazing power and analysis. Everything will be better!

To do this, the only current method is to hire some nerds and get them to write complex, expensive software.

Computer Engineer

The engineer makes sure inputs are consistent. Consistency is what “data” means.

Apps are an example. They gather the structure as a side effect of user actions. It’s what social networks (and digital spies!) do.

Facebook like on a wall

You can reduce this effort by making better use of existing data. The Humanitarian Data Exchange (which ScraperWiki works on) is an example.


There are products which help convert digital paper into data.

PDF Tables screenshot

But ultimately, current methods rely on people entering data really, really carefully, which we humans are not very good at.
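
To see why careful entry matters, here is a small sketch in which the same organisation has been typed three slightly different ways. The records, field names and the normalise helper are all invented for illustration; real cleaning rules would depend on the data.

```python
from collections import Counter

# Hypothetical hand-typed records: three spellings of one organisation.
records = [
    {"org": "Acme Ltd", "amount": 100},
    {"org": "acme ltd ", "amount": 250},
    {"org": "ACME Ltd.", "amount": 50},
]


def normalise(name):
    """Lower-case the name, drop full stops and collapse whitespace."""
    return " ".join(name.replace(".", "").lower().split())


def totals(rows, key=lambda s: s):
    """Sum amounts per organisation, optionally cleaning the name first."""
    result = Counter()
    for row in rows:
        result[key(row["org"])] += row["amount"]
    return dict(result)


raw_totals = totals(records)                   # names used exactly as typed
clean_totals = totals(records, key=normalise)  # names made consistent first

print(len(raw_totals), "groups before normalising")
print(len(clean_totals), "group after normalising:", clean_totals)
```

Used as typed, the names split one organisation into three groups. A tiny rule set merges them here, but every new kind of typo needs another rule, which is the expense described above.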


I can only see two ways to radically improve this, and get more data.

Firstly, more digital literacy. Most people learn to drive, or else cars would be useless. Likewise, can everyone learn to tag things consistently, so CRMs work better? Should everyone learn to code, so we can make infrequent business processes create structured data?

Digital literacy

Secondly, improving artificial intelligence. That is, to have a computation model more sophisticated than current coding, which doesn’t need pickily consistent “data” any more.


So, you’ve now got lots of data. Alas, it’s still not enough!

Stafford Beer says that even when organisations get everything structured, their data lakes overflowing, their tables linked across the web… Even then they are disappointed.

4. Feedback loops

Next, you realise you need a feedback system to effectively use your data.

Cybernetic factory

Think of your own body, where tangled hierarchical layers of proteins, cells and organs have feedback loops within and between each other.

There’s no point in the Government opening data if it doesn’t alter policy decisions. There’s no point rolling out Business Intelligence software if it doesn’t make your enterprise choose better.

Viable systems model

More deeply, what should organisations look like now we have computers?

The question which asks how to use the computer in the enterprise, is, in short, the wrong question.

A better formulation is to ask how the enterprise should be run given that computers exist.

The best version of all is the question asking what, given computers, the enterprise now is.

– Stafford Beer, “Brain of the Firm”, 1972

Stafford Beer with his assistant Sonia Mordojovich

If you’ve got some digital paper you want to turn into data, ScraperWiki has various products to help. For more on Stafford Beer, including the wild story of Cybersyn and Allende’s Chile, watch my lightning talk at Liverpool Ignite. All the pictures above are links to further information.