Scraping PDFs: now 26% less unpleasant with ScraperWiki
annapowellsmith, ScraperWiki blog, Fri, 17 Dec 2010
https://blog.scraperwiki.com/2010/12/scraping-pdfs-now-26-less-unpleasant-with-scraperwiki/

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job.

Coders try to avoid scraping PDFs if there’s any other option. But sometimes, there isn’t – the data you need is locked up inside inaccessible PDF files.

So I’m pleased to present the PDF to HTML Preview, a tool written by ScraperWiki’s Julian Todd to ease the pain of scraping PDFs.

Just enter the URL of your PDF to see a preview in the browser. Click on the text you need – and instantly, you see the underlying XML.

The PDF to HTML Preview.

It doesn’t write your scraper for you – but it shows you what you’re scraping, just like “View Source”. And that makes starting out a lot easier.

Scraping PDFs: the problem…

Why is scraping PDFs so hard? Well, the PDF standard was designed to do a particular job: describe how a document looks, anywhere and forever.

It achieves that pretty well. But unlike HTML, the underlying code was never designed to be read. And it contains a lot of bloat.

Adobe HQ in California. Locals say that only one person works inside - a reference to PDFs' bloated filesize.

ScraperWiki already lets you extract XML from a PDF, for simple parsing – you can see the scraperwiki.pdftoxml library in our (incredibly basic) tutorial.

But matching up long-winded XML with what you see on the page isn’t always easy. Julian knows this only too well, having scraped PDFs on a grand scale to create UNDemocracy.
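
To give a feel for the "simple parsing" involved, here is a minimal Python sketch. It assumes the XML follows the layout produced by the pdftohtml tool in XML mode (a `<page>` element containing `<text>` elements with `top` and `left` attributes); the sample data is hand-written for illustration, and the helper names `extract_text_items` and `group_into_rows` are hypothetical, not part of the ScraperWiki library. In a real scraper the XML string would come from scraperwiki.pdftoxml rather than a constant.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of pdftohtml's XML output.
# This is an assumption for illustration, not taken from a real PDF.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="200" height="16" font="0"><b>Country</b></text>
<text top="100" left="400" width="90" height="16" font="0"><b>Votes</b></text>
<text top="130" left="400" width="40" height="16" font="1">12</text>
<text top="130" left="80" width="120" height="16" font="1">Iceland</text>
</page>
</pdf2xml>"""


def extract_text_items(xml_string):
    """Return one dict per <text> element, with its page and position."""
    root = ET.fromstring(xml_string)
    items = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            items.append({
                "page": int(page.get("number")),
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                # itertext() also picks up text nested in <b> or <i> tags
                "text": "".join(text.itertext()).strip(),
            })
    return items


def group_into_rows(items):
    """Group items sharing a page and 'top' value, ordered left to right."""
    rows = {}
    for item in items:
        rows.setdefault((item["page"], item["top"]), []).append(item)
    return [
        [cell["text"] for cell in sorted(cells, key=lambda c: c["left"])]
        for _, cells in sorted(rows.items())
    ]


if __name__ == "__main__":
    for row in group_into_rows(extract_text_items(SAMPLE_XML)):
        print(row)
```

Grouping by an exact `top` value is the simplest possible approach; real PDFs often place text in the same visual row a pixel or two apart, so a tolerance would likely be needed.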

…and the solution

So, the PDF previewer works as follows:

  • Grabs the data. Gets the XML using pdftoxml.
  • Outputs as HTML. Outputs each PDF page as an absolute-positioned <div>.
  • Adds JavaScript onclick events. Attaches simple events so that when you click on a word or phrase, you see the underlying XML.
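
The three steps above can be sketched in a few lines of Python. This is not Julian's actual code, only an illustration under stated assumptions: the XML is a hand-written sample in the pdftohtml XML style, the function name `render_preview` is hypothetical, and a plain `alert()` stands in for whatever the real Preview does to display the XML.

```python
import html
import xml.etree.ElementTree as ET

# Hand-written sample in the style of pdftohtml's XML output (an assumption).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="120" height="16" font="0">Iceland</text>
<text top="130" left="80" width="40" height="16" font="0">12</text>
</page>
</pdf2xml>"""


def render_preview(xml_string):
    """Turn pdftoxml-style XML into clickable, absolutely positioned HTML."""
    root = ET.fromstring(xml_string)
    parts = []
    offset = 0  # running vertical offset so pages stack instead of overlapping
    for page in root.iter("page"):
        width, height = int(page.get("width")), int(page.get("height"))
        parts.append(
            '<div class="page" style="position:absolute; top:%dpx; left:0; '
            'width:%dpx; height:%dpx;">' % (offset, width, height)
        )
        offset += height
        for text in page.iter("text"):
            # Keep the original XML for this element so a click can show it
            raw_xml = ET.tostring(text, encoding="unicode").strip()
            label = "".join(text.itertext())
            parts.append(
                '<div style="position:absolute; top:%dpx; left:%dpx;" '
                'data-xml="%s" onclick="alert(this.dataset.xml)">%s</div>'
                % (
                    int(text.get("top")),
                    int(text.get("left")),
                    html.escape(raw_xml, quote=True),
                    html.escape(label),
                )
            )
        parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_preview(SAMPLE_XML))
```

Storing the raw XML in a `data-xml` attribute keeps each clickable element self-describing, so the click handler needs no lookup table.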

Incidentally, the Preview is also a ScraperWiki view, meaning that you can edit the underlying code if you want it to work differently. In particular, feel free to improve the instructions and the layout!

We’ll be improving our PDF-scraping tutorials and examples in the coming weeks. If you’ve written a clever PDF scraper that would make a good basis for tutorials, please let us know in the comments.
