Comments on: Scraping PDFs: now 26% less unpleasant with ScraperWiki
https://blog.scraperwiki.com/2010/12/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/
Extract tables from PDFs and scrape the web
Thu, 14 Jul 2016 16:12:42 +0000

By: Stoph
https://blog.scraperwiki.com/2010/12/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/#comment-578
Thu, 23 Dec 2010 17:10:21 +0000

This is great, but the pdftohtml codebase doesn’t appear to handle newer pdfs. It just shows page numbers for the following pdf. In contrast, xpdf can convert it to plain text or to simple html.

http://dynamodata.fdncenter.org/990_pdf_archive/351/351019477/351019477_200806_990.pdf

pdftohtml is based on xpdf 2.02 while xpdf has now moved on to 3.02.
Is it possible to incorporate xpdf and use its layout mode and “grep-like” processing for scraping?

http://www.foolabs.com/xpdf/

Thanks,

Stoph

P.S. By the way, fantastic work on scraperwiki
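
The layout-mode plus "grep-like" idea raised in the comment above could look roughly like the following. This is only a minimal sketch, assuming the xpdf `pdftotext` command-line tool is installed and on the PATH; the function names `pdf_to_layout_text` and `grep_lines` are hypothetical and are not part of ScraperWiki or xpdf.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext in layout mode and return the extracted text.

    Assumes the `pdftotext` binary (shipped with xpdf) is on the PATH.
    """
    # "-layout" asks pdftotext to preserve the physical page layout;
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def grep_lines(text, pattern):
    """Return (line_number, line) pairs for lines matching the regex."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

With a local copy of a PDF, something like `grep_lines(pdf_to_layout_text("form.pdf"), r"\d")` would list the lines that contain digits (the file name is a placeholder). Keeping the regex filter separate from the conversion step means the matching logic can be tried on plain strings without `pdftotext` installed.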

By: links for 2010-12-17 « Sarah Booker
https://blog.scraperwiki.com/2010/12/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/#comment-577
Fri, 17 Dec 2010 18:01:48 +0000

[…] Scraping PDFs: now 26% less unpleasant with ScraperWiki | Scraperwiki Data Blog Scraping PDF is less painful apparently. (tags: data PDF Scraperwiki) […]
