New Ruby scraping tutorials – PDFs and Mechanize


Mark Chapman has made us two new Ruby tutorials.

Advanced Scraping: Pages Behind Forms shows you how to get data that is buried behind search boxes and drop-down query lists. It uses the Mechanize library, which provides a class that pretends to be a web browser, so it can work with cookies and has a familiar interface.
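
To give a flavour of what that looks like, here is a minimal sketch rather than the tutorial's own code. It assumes the mechanize gem is installed; the URL, the field name 'q' and the 'table tr' selector in the usage note are made-up placeholders, and search_behind_form and build_agent are hypothetical helper names.

```ruby
# Build a Mechanize agent. The gem is loaded lazily here so that the
# helper below can also be tried out with a stand-in agent object.
def build_agent
  require 'mechanize'
  Mechanize.new
end

# Fetch a page, fill in one field of its first form, submit it, and
# return the stripped text of every node matching row_selector.
# field_name and row_selector are assumptions about the target page.
def search_behind_form(agent, url, field_name, query, row_selector = 'table tr')
  page = agent.get(url)
  form = page.forms.first
  raise ArgumentError, "no form found at #{url}" if form.nil?

  form[field_name] = query        # set the field by its name attribute
  results = agent.submit(form)    # cookies are carried along by the agent
  results.search(row_selector).map { |row| row.text.strip }
end
```

Calling it might look like search_behind_form(build_agent, 'http://example.com/search', 'q', 'libraries'), with the URL and field name swapped for the real ones on the page you are scraping.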

Advanced Scraping: PDFs shows you how to extract information from Portable Document Format (PDF) files. It uses the Ruby library PDF::Reader, which handles only the text extraction phase; working out how to parse that text is a later skill.
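
As a rough illustration of the extraction step, and not the tutorial's own code, the sketch below assumes a reasonably recent version of the pdf-reader gem, where each page object has a text method. The helper names pdf_page_texts and split_columns are hypothetical, and splitting on runs of spaces is just one simple guess at how a table row might be parsed afterwards.

```ruby
# Return the extracted text of each page as an array of strings.
# Assumes the pdf-reader gem; it is loaded lazily inside the method.
def pdf_page_texts(path)
  require 'pdf/reader'
  reader = PDF::Reader.new(path)
  reader.pages.map(&:text)
end

# Hypothetical parsing helper: split one line of extracted text into
# cells wherever two or more whitespace characters appear in a row.
# Real PDFs often need something more careful than this.
def split_columns(line)
  line.strip.split(/\s{2,}/)
end
```

You could then loop over pdf_page_texts('report.pdf') (the file name is a placeholder), break each page into lines, and pass each line to split_columns to see how far simple splitting gets you.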

You can find all the Ruby tutorials (and links to Python and PHP ones) on one page.

Thanks, Mark!
