Comments on: pdftables – a Python library for getting tables out of PDF files

By: Scraping large pdf tables which span accross multiple pages | BlogoSfera

Scraping large pdf tables which span accross multiple pages | BlogoSfera — Tue, 06 Aug 2013 14:02:40 +0000

[…] have encountered several python libraries like pdftables but they are not easy to use for non-python developer like me (I was not even able to run these […]

By: Ian Hopkinson

Ian Hopkinson — Thu, 01 Aug 2013 07:38:59 +0000

pdfminer brings additional functionality over pdftohtml, hence the switch – the fact it is Python based is convenient but not essential.

We’ve used ABBYY in the past, and if we go down the commercial application route we’d probably stick with them. I’ve seen various open source alternatives, but none seemed to be considered the obvious go-to solution. Anyway, I fancied having a play with the problem myself 😉

Thanks for pointing out the ground truth data set – hadn’t seen it before and it looks very handy.

By: Tom Morris (@tfmorris)

Tom Morris (@tfmorris) — Wed, 31 Jul 2013 15:29:11 +0000

Does the switch to pdfminer bring additional functionality, or was it just in the name of Python purity? Are Tabula and ABBYY the only two other PDF table extraction packages that you’ve evaluated? Have you run your algorithm against the PDF table extraction ground truth data set from ICDAR 2013? http://www.tamirhassan.com/dataset.html