Comments on: PDF table extraction of pagenated table https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/ Extract tables from PDFs and scrape the web Thu, 14 Jul 2016 16:12:42 +0000
By: jeffhoek https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/#comment-797 Thu, 15 Aug 2013 19:23:17 +0000 http://blog.scraperwiki.com/?p=758216965#comment-797 this list comprehension just ends up empty for me:
tboxes = [ [ [ ] for xl in xlist ] for yl in ylist ]
are we missing something here?
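
For what it's worth, that comprehension on its own only builds a grid of empty cells, one row per entry in ylist and one column per entry in xlist. Something still has to fill the cells afterwards, and if ylist is empty the whole grid comes out as []. A minimal sketch of the idea, assuming xlist and ylist are sorted cell boundaries and each text box is an (x, y, text) tuple; the helper names and sample values are hypothetical and not taken from the post:

```python
from bisect import bisect_right


def build_grid(xlist, ylist):
    # One empty list per cell: len(ylist) rows by len(xlist) columns.
    return [[[] for _ in xlist] for _ in ylist]


def fill_grid(xlist, ylist, boxes):
    # Assumption: xlist and ylist hold the sorted lower boundaries of the
    # columns and rows, and boxes is a list of (x, y, text) tuples.
    grid = build_grid(xlist, ylist)
    for x, y, text in boxes:
        col = bisect_right(xlist, x) - 1
        row = bisect_right(ylist, y) - 1
        # Skip boxes that fall before the first boundary.
        if 0 <= col < len(xlist) and 0 <= row < len(ylist):
            grid[row][col].append(text)
    return grid


xlist = [0, 100, 200]  # hypothetical column boundaries
ylist = [0, 50]        # hypothetical row boundaries
boxes = [(10, 5, "a"), (150, 60, "b")]
tboxes = fill_grid(xlist, ylist, boxes)
```

If the grid still looks empty after a fill step like this, it may be worth printing len(xlist) and len(ylist) first, since an empty boundary list gives an empty grid no matter what comes later.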

By: Luca https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/#comment-796 Mon, 18 Feb 2013 15:53:46 +0000 http://blog.scraperwiki.com/?p=758216965#comment-796 Did you have a look at the work by McCallum on table extraction? http://www.cs.umass.edu/~mccallum/papers/TableExtraction-irj06.pdf

By: (ó)nytsamlegar vefslóðir fyrir gagnablaðamennsku | Gögn https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/#comment-795 Mon, 06 Aug 2012 02:46:53 +0000 http://blog.scraperwiki.com/?p=758216965#comment-795 […] Here is an example of how to get content out of such PDF documents ↩ […]

By: Francis Davey https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/#comment-794 Fri, 29 Jun 2012 21:51:18 +0000 http://blog.scraperwiki.com/?p=758216965#comment-794 Thanks. This is really interesting. Only yesterday I was trying to extract something from:

http://www.justice.gov.uk/downloads/tribunals/lands/court-appeal-cases.pdf

So I could add value to this: https://scraperwiki.com/scrapers/lands-tribunal-decisions/ scraper. I’ve only just started poking around with pdfminer and I agree, it’s all completely hacky. I wrote something that pulled out lines in much the same way and it worked for *most* but not all the text on the first page I tried.

Probably more interesting to you would be this table:

http://www.justice.gov.uk/downloads/tribunals/information-rights/current-cases/information-rights-current-cases.pdf

which links up ICO case references with Information Tribunal (IT) case numbers. That link is not easily found elsewhere and could allow some sort of mashup between the ICO decisions and their IT appeals. I have managed to get the IT summaries all scraped, but going further would be nice.

If I have time I’ll work some more on this, but pdfminer is *not* well documented. I ended up looking at the source code to try to work out some of the objects.
