pdf – ScraperWiki
https://blog.scraperwiki.com – Extract tables from PDFs and scrape the web

Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone
https://blog.scraperwiki.com/2015/09/henry-morris-entrepreneur-for-social-mobility-on-getting-contacts-from-pdf-into-his-iphone/
Wed, 30 Sep 2015 14:11:16 +0000

Photo: Henry Morris.

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach – two amazing businesses that focus on social mobility.

We interviewed him for PDFTables.com

He’s been using it to convert delegate lists that come as PDFs into Excel, and then into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system: a simple and effective way of keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Announcing PDFTables.com
https://blog.scraperwiki.com/2015/05/announcing-pdftables-com/
Mon, 18 May 2015 14:46:18 +0000

PDFs were invented at the same time as the web. As “digital paper”, they’re trustworthy and don’t change behind your back.

This has a downside – often the definitive source of published data is a PDF. It’s hard to get tens of thousands of numbers out and into a spreadsheet or database. Copying and pasting is too slow, and popular conversion tools munge columns together.

At ScraperWiki, we’ve been helping people get the data back out of PDFs for nearly 5 years.

In that time we’ve developed an Artificial Intelligence algorithm. Just like your eyes, it can see the spacing between columns, picking out the structure of a table from its shape.

It’s called PDFTables.com.

Screenshot: the PDFTables.com converter.

This is the first self-service, web-based product designed for getting volumes of data from PDFs. It’s super fast to convert individual PDFs, and there’s a web API to automate more.
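
For readers who want to script their conversions, here is a minimal sketch of what automation against the web API might look like in Python, using the requests library. The endpoint and the key/format parameters shown are assumptions for illustration – check the PDFTables.com API documentation for the real details.

import requests

API_URL = "https://pdftables.com/api"   # assumed endpoint
API_KEY = "your-api-key-here"           # an API key from your PDFTables.com account

def convert_pdf(pdf_path, out_path, fmt="csv"):
    # Upload the PDF and write the converted output (CSV here) to disk.
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            API_URL,
            params={"key": API_KEY, "format": fmt},
            files={"f": pdf},
        )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)

convert_pdf("delegates.pdf", "delegates.csv")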

You can use it a couple of times without signing up, and then get 50 pages more for free. We charge per page, so you only pay for what you need.

We’d love feedback – please contact us and let us know what you think.

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

The Tyranny of the PDF
https://blog.scraperwiki.com/2013/12/the-tyranny-of-the-pdf/
Fri, 27 Dec 2013 09:30:25 +0000

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Why is ScraperWiki so interested in PDF files?

Because the world is full of PDF files. The treemap above shows the scale of their dominance; the area a segment covers is proportional to the number of examples we found. We found 508,000,000 PDF files – over 70% of the non-HTML web. The treemap was constructed by running Google queries for different file types (e.g. filetype:csv). After filtering out the basic underpinnings of the internet (the HTML, HTM, aspx, php and so forth) we are left with hosted files – the sorts of files you might also find on your local hard drive or on corporate networks.

These are just the files that Google has indexed. There are likely to be many, many more in private silos such as company reports databases, academic journal archives, bank statements, credit card bills, material safety data sheets, product catalogues, product specifications…

The number of PDFs on the indexed internet amounts to 1 for every 5 or so people with internet access, but I know that personally I am responsible for another couple of hundred which are unique to me – mainly bank and credit card statements. And that’s without mentioning the internal company reports I’ve generated, and numerous receipts and invoices.

So we’ve established there’s an awful lot of PDF files in the world.

Why are there so many of them?

The first version of PDF was released in 1993, around the birth of the web. It is a mechanism for describing the appearance of a page on screen or on paper. It does not care about the semantics of the content – the titles, paragraphs, footnotes and so forth – it just cares about how the page looks, and how a page looks is important. For its original application it is the reader’s job to parse the structure of the document; the job of the PDF document is solely to look consistent on a range of devices.

PDF documents are about appearance.

They are also about content control: there are explicit mechanisms in the format to control access with passwords and to limit the ability to copy or even print the content. PDF is designed as a read-only medium – the intention is that you should read a PDF document, not edit it. There is a variant, PDF/A, designed explicitly as an archival format. Therefore, for publishers who wish to limit the uses to which the data they sell can be put, PDF is an ideal format: ubiquitous and high-fidelity, but difficult to re-purpose.

PDF documents are about content control.

These are the reasons the format is so common: it is optimised for appearance and for read-only use, and there are a lot of people who want to generate such content.

What’s in them?

PDF files contain all manner of content. There are scientific reports, bank statements, the verbatim transcripts of the UN assembly, product information, football scores and statistics, insurance policy documents, SEC filings, planning applications, traffic surveys, government information revealed by Freedom of Information requests, the membership records of learned societies going back for hundreds of years, the financial information of public, private and charitable bodies, climate records…

We can divide the contents of PDFs crudely into free-form text and structured data: data in things that look like tables or database records. Sometimes these are mixed together in the same document – a scientific paper or a report, for example, may contain both text and tables of data. Sometimes PDF files contain nothing but tables of data. And even apparently free-form text can be quite structured.

The point about structured contents is that they are ripe for re-use, if only they can be prised from the grip of a format that is solely interested in appearance.

Our customers are particularly interested in PDF files which contain tables of data – things like production data for crops, component specifications, or election results. But even free-form text can be useful once liberated. For example, you may wish to mine the text of your company reports to draw previously unseen connections using machine learning algorithms. Or, to take the UN transcripts as an example, we can add to the PDF by restructuring the contents to record who said what, when, and how they voted, in a manner that is far more accessible than the original document.

What’s the problem?

The problem with PDF files is that they describe the appearance of the page but do not mark up the logical content. Current mainstream tools allow you to search for the text inside a PDF file, but they provide little context for that text and, if you are interested in numbers, search tools are of little use to you at all.

In general, PDF documents don’t even know about words: a program that converts PDF to text has to reconstruct the words and sentences from the raw positions of the letters found in the PDF file.
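
To make this concrete, here is a toy sketch of the sort of reconstruction such a program has to do: given individual characters and their page coordinates, it sorts them into reading order and splits words wherever the horizontal gap is large enough. The character data and the gap threshold are invented for illustration.

chars = [
    (100, 700, "H"), (106, 700, "e"), (111, 700, "l"), (114, 700, "l"),
    (117, 700, "o"), (130, 700, "w"), (137, 700, "o"), (142, 700, "r"),
    (146, 700, "l"), (149, 700, "d"),
]

GAP = 8  # assumed gap (in points) that separates one word from the next

def reconstruct_words(chars):
    words, current = [], ""
    last_x = last_y = None
    # PDF y coordinates increase up the page, so sort by -y for top-to-bottom order
    for x, y, c in sorted(chars, key=lambda t: (-t[1], t[0])):
        new_line = last_y is not None and y != last_y
        if current and (new_line or x - last_x > GAP):
            words.append(current)
            current = ""
        current += c
        last_x, last_y = x, y
    if current:
        words.append(current)
    return words

print(reconstruct_words(chars))  # ['Hello', 'world']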

This is why ScraperWiki is so interested in PDF files, and why we made the extraction of tables from PDF files a core feature of our Table Xtract product.

What data do you have trapped in your PDF files?

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Table Scraping Is Hard
https://blog.scraperwiki.com/2013/11/table-scraping-is-hard/
Tue, 05 Nov 2013 15:25:43 +0000

The Problem

NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency. A well-known B2B publisher came to us to aggregate that data and provide information spanning the hundreds of different trusts – for example: who are the biggest contractors across the NHS?

It’s a common problem – there’s lots of data out there which isn’t in a nice, neat, obviously usable format, for various reasons. What we’d like is all that data in a single database so we can slice it by time and place, and look for patterns. There’s no magic bullet for this yet; we’re having to solve issues at each step of the way.

Where’s the data?

A sizeable portion of the data is stored on data.gov.uk, but significant chunks live elsewhere, on individual NHS trust websites. Short of spidering every NHS trust website, we had no way of finding those spreadsheets ourselves, so we automated the search through Google.

Google makes itself difficult to scrape – ironic, given that that’s exactly what it does to everyone else: building a search engine means scraping websites and indexing the content. We also found that the data was held in a variety of formats: whilst most of this spending data was in spreadsheets, we also found tables in web pages and in PDFs. Usually each of these would need separate software to understand the tables, but we’ve been building new tools that let us extract tables from all these different formats, so we don’t need to worry about where a table originally came from.

The requirement from central government to provide this spending data has led to some consistency in the types of information provided, but there are still difficulties in matching up the different columns of data: Dept Family, Department family, and Departmental Family are all obviously the same thing to a human, but it’s more difficult to work out how to describe such things to a computer. Where one table has both “Gross” and “Net”, which should be matched up with another table’s “Amount”?
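
A large part of the fix is mundane normalisation plus a hand-maintained mapping from known variants to a canonical heading. The sketch below illustrates the idea; the variants and canonical names are examples, not our production mapping.

import re

CANONICAL = {
    "dept family": "Department Family",
    "department family": "Department Family",
    "departmental family": "Department Family",
    "amount": "Amount",
    "gross": "Gross",
    "net": "Net",
}

def normalise(heading):
    # Lower-case and strip everything except letters and spaces
    return re.sub(r"[^a-z ]", "", heading.lower()).strip()

def canonical_heading(heading):
    return CANONICAL.get(normalise(heading), heading)

for h in ["Dept Family", "Department family", "Departmental Family"]:
    print(h, "->", canonical_heading(h))   # all map to "Department Family"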

Worse still, the data inside the tables may itself need matching – company names, for instance: an entry for “BT” needs to be matched to “British Telecommunications PLC”, rather than to “BT Global Services”. Doing this reliably, even with access to Companies House data, is still not as easy as it should be. Hopefully projects such as OpenCorporates, which also uses ScraperWiki, will make this an easier job in the future.
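
The sketch below shows why naive fuzzy matching is not enough on its own: an exact-alias table is consulted first, and only then does string similarity get a say. The register and the alias entries are invented for illustration; a real system would draw them from Companies House or OpenCorporates data.

import difflib

REGISTER = ["British Telecommunications PLC", "BT Global Services Limited"]
ALIASES = {"bt": "British Telecommunications PLC"}   # curated by hand

def match_company(name):
    # Exact aliases first: "BT" would otherwise fuzzy-match "BT Global Services"
    alias = ALIASES.get(name.strip().lower())
    if alias:
        return alias
    candidates = difflib.get_close_matches(name, REGISTER, n=1, cutoff=0.6)
    return candidates[0] if candidates else None

print(match_company("BT"))                              # British Telecommunications PLC (alias)
print(match_company("British Telecomunications plc"))   # fuzzy match despite the typo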

To provide a uniform interface to tables in PDFs, on web pages and in Excel files, we made a library which we built into our recently launched product, Table Xtract.

Try Table Xtract for free!

pdftables – a Python library for getting tables out of PDF files
https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/
Mon, 29 Jul 2013 16:21:54 +0000

Got PDFs you want to get data from?
Try our web interface and API over at PDFTables.com!

One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer is typically “with difficulty”, but things are getting better all the time.

PDF is a page description format: it has no knowledge of the logical structure of a document, such as where the titles and paragraphs are, or whether it uses a one-column or two-column layout. It just knows where characters sit on the page. The plot below shows how characters are laid out for a large table in a PDF file.

Plot: the positions of individual characters (LTChar elements) for a large table in a PDF file.

This makes extracting structured data from PDF a little challenging.

Don’t get me wrong: PDF is a useful format in the right place. If someone sends me a CV, I expect to get it as a PDF because it’s a read-only format; send it in Microsoft Word format and the implication is that I can edit it, which makes no sense.

I’ve been parsing PDF files for a few years now: to start with using simple online PDF-to-text converters, then with pdftohtml, which gave me better location data for the text, and now using the Python pdfminer library, which extracts non-text elements as well as bonding words into sentences and coherent blocks. This classification is shown in the plot below: the blue boxes show where pdfminer has joined characters together to make text boxes (which may be words or sentences), and the red boxes show lines and rectangles (i.e. non-text elements).

Plot: the same page with pdfminer’s text boxes shown in blue and lines/rectangles in red.
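
A minimal sketch of extracting that classification in Python follows. It uses the present-day pdfminer.six packaging rather than the original pdfminer module this post was written against, and "report.pdf" is a placeholder filename.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTLine, LTRect

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # characters already bonded into words/sentences (the blue boxes)
            print("text", element.bbox, element.get_text().strip())
        elif isinstance(element, (LTLine, LTRect)):
            # non-text elements: ruling lines and rectangles (the red boxes)
            print("line/rect", element.bbox)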

More widely at ScraperWiki, we’ve been processing PDF since our inception with the tools described above and also with the commercial Abbyy software.

As well as processing text documents such as parliamentary proceedings, we’re also interested in tables of numbers. This is where the pdftables library comes in: we’re working towards making scrapers which are indifferent to the format in which a table is stored, receiving tables via the OKFN messytables library, which takes adapters for different file types. We’ve already added HTML support to messytables; now it’s time for PDF support, using our new, version-much-less-than-one pdftables library.

Amongst the alternatives to our own efforts are Mozilla’s Tabula, written in Ruby and requiring the user to draw around the target table, and Abbyy’s software which is commercial rather than open source.

pdftables can take a file handle and tell you which pages have tables on them; it can extract the contents of a specified page as a single table, and by extension it can return all of the tables in a document (at the rate of one per page). For simple tables it’s possible to do this with no parameters, but for more difficult layouts it currently takes hints in the form of words found on the top and bottom rows of the table you are looking for. The tables are returned as a list of lists of lists of strings, along with a diagnostic object which you can use to make plots. If you’re using the messytables library you just get back a tableset object.
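
In code, using the library looks something like the sketch below. The top-level helper name (get_tables) and the return shape follow the description above, but this is a version-much-less-than-one library, so treat the interface as an assumption and check the README on GitHub.

from pdftables import get_tables   # assumed top-level helper; see the README

with open("AlmondBoard.pdf", "rb") as fh:   # placeholder filename
    tables = get_tables(fh)                 # at most one table per page

for i, table in enumerate(tables):
    print("Table", i, "has", len(table), "rows")
    for row in table[:3]:                   # each row is a list of strings
        print(row)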

It turns out the defining characteristic of a data scientist is plotting things at the drop of a hat: I want to see the data I’m handling. And so it is with the development of the pdftables algorithms. The method used is inspired by image analysis algorithms, similar to the Hough transforms used in Tabula. A Hough transform will find arbitrarily oriented lines in an image, but our problem is a little simpler – we’re only interested in vertical and horizontal lines.

To find the rows and columns we project the bounding boxes of the text on a page onto the horizontal axis (to find the columns) and onto the vertical axis (to find the rows). By projection we mean counting up the number of text elements along a given horizontal or vertical line. The row and column boundaries are marked by low values – gullies – in the plot of the projection, while the rows and columns of the table form high mountains; you can see this clearly in the plot below. Here we are looking at the PDF page at the level of individual characters, with the plots at the top and left showing the projections. The black dots show where pdftables has placed the row and column boundaries.

Plot: the page at character level, with the horizontal and vertical projections shown at the top and left and the detected row and column boundaries marked by black dots.
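
A toy version of the projection step, stripped of the plotting, might look like this. The bounding boxes and page width are invented, and real code would ignore the gullies at the page margins.

boxes = [(100, 140), (102, 138), (205, 240), (210, 255), (300, 330)]  # (x0, x1) of each character
page_width = 400

# Project the boxes onto the horizontal axis: how many boxes cover each x position?
projection = [0] * page_width
for x0, x1 in boxes:
    for x in range(x0, x1):
        projection[x] += 1

# Column boundaries sit in the gullies: runs of x positions covered by nothing.
boundaries, gap_start = [], None
for x, count in enumerate(projection):
    if count == 0 and gap_start is None:
        gap_start = x
    elif count > 0 and gap_start is not None:
        boundaries.append((gap_start + x) // 2)  # midpoint of the gully
        gap_start = None

print(boundaries)  # [50, 172, 277] - the first is just the left page margin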

pdftables is currently useful for supervised use, but not so good if you just want to throw PDF files at it. You can find pdftables on GitHub, and you can see the functionality we are still working on in the issue tracker. Top priorities are finding more than one table on a page and identifying multi-column text layouts to help with this process.

You’re invited to have a play and tell us what you think – ian@scraperwiki.com

Got PDFs you want to get data from?
Try our web interface and API over at PDFTables.com!

Scraping the Royal Society membership list
https://blog.scraperwiki.com/2012/12/scraping-the-royal-society-membership-list/
Fri, 28 Dec 2012 13:44:38 +0000

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file I had made from the original PDF using an online converting service. Looking back at that code, it is fiendishly complicated and cluttered with the boilerplate required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see whether this would make parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below: each member (or Fellow) record spans two lines, with the member name in the leftmost column on the first line, and their dates of birth and death, the class of their Fellowship and their election date on the second line.

Image: an excerpt from the Royal Society membership list.

Later in the document we find that information on the Presidents of the Royal Society appears on the same line as the Fellow name, and that Royal Patrons are formatted a little differently. There are also alias records, where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file in which each piece of text is located on the page using spatial coordinates; an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document – which appears to have been constructed by tacking the A–J membership list onto the K–Z list – but this is easily resolved by accepting a small range of positions for each column.
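
A sketch of that column selection using lxml is below. The first <text> line is the real example above; the other two lines, the nominal column positions and the tolerance are invented for illustration.

from lxml import etree

xml_text = """<page>
<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>
<text top="257" left="135" width="80" height="14" font="2">1757-1829 </text>
<text top="257" left="360" width="60" height="14" font="2">1793 </text>
</page>"""

COLUMNS = {"name_or_dates": 135, "election": 360}  # nominal left positions (points)
TOLERANCE = 10  # allow for the A-J and K-Z halves being slightly offset

def column_of(element):
    left = int(element.get("left"))
    for name, position in COLUMNS.items():
        if abs(left - position) <= TOLERANCE:
            return name
    return None

root = etree.fromstring(xml_text)
for text in root.iter("text"):
    print(column_of(text), text.text.strip())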

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another Fellow is classed as a “Felllow”; and whilst most of the dates of birth and death are separated by a hyphen, some are separated by an en dash, which as far as the code is concerned is something completely different; and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document, I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into an SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year, but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as “circa 1782” or “c 1782”; even more vaguely, they may be described as having “flourished 1663-1778” or “fl. 1663-1778”. Python’s default datetime module does not capture this subtlety, and if it did, the database used to store the dates would need to support it too to be useful. I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you that the database contains 8019 Fellows and 56 Presidents; that 387 were born before 1700, 3657 have no birth date, 2360 have no death date, 204 “flourished” and 450 have birth dates “circa” some year.

I can count the number of Fellows in each class:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Or make a table of all of the Presidents of the Royal Society:

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

  • Nice traceable code to provide a provenance for the dataset;
  • Access to the pdftoxml library;
  • Strong encouragement to “do the right thing” and put the data into a database;
  • Publication of the data;
  • A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Scraping PDFs: now 26% less unpleasant with ScraperWiki
https://blog.scraperwiki.com/2010/12/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/
Fri, 17 Dec 2010 10:20:53 +0000

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.

So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.

Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.

The PDF to HTML Preview.

It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.

Scraping PDFs: the problem…

Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.

It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.

Adobe HQ in California. Locals say that only one person works inside - a reference to PDFs' bloated filesize.

ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml library in our (incredibly basic) tutorial.

But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.

…and the solution

So, the PDF previewer works as follows (there’s a rough Python sketch of the approach after the list):

  • Grabs the data. Gets the XML using pdftoxml.
  • Outputs as HTML. Outputs each PDF page as an absolute-positioned <div>.
  • Adds Javascript onclick events. Attaches simple events so that when you click on a word or phrase, you see the underlying XML.
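
Here is that sketch. It takes one <page> element from pdftoxml output and renders each <text> element as an absolutely positioned <div> which pops up its own XML when clicked – an illustration of the approach, not the actual code behind the ScraperWiki view.

import html
from lxml import etree

def page_to_html(page):
    # Render one pdftoxml <page> element as absolutely positioned divs, each
    # carrying its own XML in a data attribute for the onclick handler to show.
    divs = []
    for el in page.iter("text"):
        xml_source = html.escape(etree.tostring(el, encoding="unicode").strip())
        style = "position:absolute; top:{}px; left:{}px;".format(el.get("top"), el.get("left"))
        divs.append(
            '<div style="{}" data-xml="{}" '.format(style, xml_source)
            + "onclick=\"alert(this.getAttribute('data-xml'))\">"
            + html.escape(el.text or "")
            + "</div>"
        )
    return '<div style="position:relative">' + "".join(divs) + "</div>"

# page.xml stands in for one <page> element saved from pdftoxml output.
page = etree.parse("page.xml").getroot()
print(page_to_html(page))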

Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!

We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!