Comments on: pdftables – a Python library for getting tables out of PDF files https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/ Extract tables from PDFs and scrape the web Thu, 14 Jul 2016 16:12:42 +0000

By: Scraping large pdf tables which span accross multiple pages | BlogoSfera https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/#comment-938 Tue, 06 Aug 2013 14:02:40 +0000 http://blog.scraperwiki.com/?p=758219060#comment-938 […] have encountered several python libraries like pdftables but they are not easy to use for non-python developer like me (I was not even able to run these […]

By: Ian Hopkinson https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/#comment-937 Thu, 01 Aug 2013 07:38:59 +0000 http://blog.scraperwiki.com/?p=758219060#comment-937 pdfminer brings additional functionality over pdftohtml, hence the switch – the fact it is Python based is convenient but not essential.

We’ve used ABBYY in the past, and if we go down the commercial application route we’d probably stick with them. I’ve seen various open source alternatives but none seemed to be considered the obvious go-to solution. Anyway, I fancied having a play with the problem myself 😉

Thanks for pointing out the ground truth data set – hadn’t seen it before and it looks very handy.

By: Tom Morris (@tfmorris) https://blog.scraperwiki.com/2013/07/pdftables-a-python-library-for-getting-tables-out-of-pdf-files/#comment-936 Wed, 31 Jul 2013 15:29:11 +0000 http://blog.scraperwiki.com/?p=758219060#comment-936 Does the switch to pdfminer bring additional functionality or was it just in the name of Python purity? Are Tabula and ABBYY the only two other PDF table extraction packages that you’ve evaluated? Have you run your algorithm against the PDF table extraction ground truth data set from ICDAR 2013? http://www.tamirhassan.com/dataset.html
