The Tyranny of the PDF
Try our easy web interface over at PDFTables.com!
Why is ScraperWiki so interested in PDF files?
Because the world is full of PDF files. The treemap above shows the scale of their dominance. In the treemap the area a segment covers is proportional to the number of examples we found. We found 508,000,000 PDF files, over 70% of the non-HTML web. It is constructed by running Google queries for different filetypes (e.g. filetype:csv). After filtering out the basic underpinnings of the internet (HTML, HTM, aspx, php and so forth) we are left with hosted files: the sorts of files you might also find on your local hard drive or on corporate networks.
These are just the files that Google has indexed. There are likely to be many, many more in private silos such as company reports databases, academic journal archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications…
The number of PDFs on the indexed internet amounts to about one for every five people with internet access, but I know that I am personally responsible for another couple of hundred which are unique to me – mainly bank and credit card statements. And that's before mentioning the internal company reports I've generated, and the numerous receipts and invoices.
So we’ve established there’s an awful lot of PDF files in the world.
Why are there so many of them?
The first version of PDF was released in 1993, around the birth of the web. It is a mechanism for describing the appearance of a page on screen or on paper. It does not care about the semantics of the content: the titles, paragraphs, footnotes and so forth. It just cares about how the page looks, and how a page looks is important. For its original application it is the reader's job to parse the structure of the document; the job of the PDF document is solely to look consistent on a range of devices.
PDF documents are about appearance.
They are also about content control: there are explicit mechanisms in the format to control access with passwords and to limit the ability to copy or even print the content. PDF is designed as a read-only medium – the intention of a PDF document is that you should read it, not edit it. There is even a variant, PDF/A, designed explicitly as an archival format. Therefore, for publishers who wish to limit the uses to which the data they sell can be put, PDF is an ideal format: ubiquitous, high-fidelity, but difficult to re-purpose.
PDF documents are about content control.
These are the reasons that the format is so common: it is optimised for appearance and for read-only use, and there are a lot of people who want to generate exactly that sort of content.
What’s in them?
PDF files contain all manner of content. There are scientific reports, bank statements, the verbatim transcripts of the UN assembly, product information, football scores and statistics, insurance policy documents, SEC filings, planning applications, traffic surveys, government information revealed by Freedom of Information requests, the membership records of learned societies going back for hundreds of years, the financial information of public, private and charitable bodies, climate records…
We can divide the contents of PDFs crudely into free-form text and structured data: data in things that look like tables or database records. Sometimes these contents are mixed together in the same document: a scientific paper or a report, for example, may contain text and tables of data. Sometimes PDF files just contain tables of data. And even apparently free-form text can be quite structured.
The point about structured contents is that they are ripe for re-use, if only they can be prised from the grip of a format that is solely interested in appearance.
Our customers are particularly interested in PDF files which contain tables of data: things like production data for crops, component specifications, or election results. But even free-form text can be useful once liberated. For example, you may wish to mine the text of your company reports in order to draw previously unseen connections using machine learning algorithms. Or, to take the UN transcripts as an example, we can add to the PDF by restructuring the contents to record who said what, when, and how they voted, in a manner that is far more accessible than the original document.
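To make that concrete, here is a minimal sketch of liberating a table from a PDF using the open-source camelot library. It is an illustration only, not Table Xtract itself, and the filename is a made-up example:

```python
# A sketch only: camelot is an open-source table-extraction library,
# not Table Xtract. "crop_production.pdf" is a hypothetical filename.
import camelot

tables = camelot.read_pdf("crop_production.pdf", pages="1")
print(f"Found {tables.n} table(s) on page 1")

# Each extracted table is exposed as a pandas DataFrame, ready for
# re-use: analysis, plotting, or loading into a database.
df = tables[0].df
print(df.head())
```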
What’s the problem?
The problem with PDF files is that they describe the appearance of the page but do not mark up the logical content. Current, mainstream tools allow you to search for the text inside a PDF file, but they provide little context for that text and, if you are interested in numbers, search tools are of little use to you at all.
In general, PDF documents don't even know about words: a program that converts PDF to text has to reconstruct words and sentences from the raw positions of the letters found in the file.
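To see what a converter is up against, here is a minimal sketch using the open-source pdfminer.six library to dump the raw glyphs a PDF actually stores: individual characters with page coordinates, from which words and lines must be inferred. The filename is a placeholder:

```python
# A sketch only, using the open-source pdfminer.six library.
# "statement.pdf" is a placeholder filename.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page_layout in extract_pages("statement.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:
                if isinstance(line, LTTextLine):
                    for obj in line:
                        if isinstance(obj, LTChar):
                            # Each glyph carries only its text and a bounding
                            # box; word boundaries have to be inferred from
                            # the gaps between neighbouring glyphs.
                            print(repr(obj.get_text()), obj.bbox)
```

Tools that turn PDFs into text spend most of their effort grouping those positioned glyphs back into words, lines and tables.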
This is why ScraperWiki is so interested in PDF files, and why we made the extraction of tables from PDF files a core feature of our Table Xtract product.
What data do you have trapped in your PDF files?
Try our easy web interface over at PDFTables.com!