New Ruby scraping tutorials – PDFs and Mechanize
Got a PDF you want to get data from?
Try our easy web interface over at pdftables.com!
Mark Chapman has made us two new Ruby tutorials.
Advanced Scraping: Pages Behind Forms shows you how to get data that is buried behind search boxes and drop-down query lists. It uses the Mechanize library, which provides a class that pretends to be a web browser, so it can work with cookies, and it has a familiar interface.
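To give a flavour of what working with Mechanize looks like, here is a minimal sketch rather than the tutorial's own code. The method names, the "first form on the page" choice, and the `table tr` selector are all assumptions made for illustration; a real site will need its own form and selectors. The gem is required inside the method so the file still loads where Mechanize isn't installed.

```ruby
# Hypothetical helper: tidy up scraped table cells (collapse whitespace,
# drop rows that contain no data cells at all).
def normalize_rows(rows)
  rows
    .map { |cells| cells.map { |cell| cell.gsub(/\s+/, " ").strip } }
    .reject { |cells| cells.all?(&:empty?) }
end

# Hypothetical helper: fill in one field of the first form on a page,
# submit it, and return the result table as arrays of strings.
# url, field_name and query are placeholders supplied by the caller.
def scrape_search_results(url, field_name, query)
  require "mechanize" # gem install mechanize

  agent = Mechanize.new          # keeps cookies between requests, like a browser
  page  = agent.get(url)         # fetch the page that holds the search form
  form  = page.forms.first       # assumption: the search form is the first one
  raise "no form found on #{url}" if form.nil?

  form[field_name] = query       # fill in the search box or drop-down value
  results = form.submit          # submit it as a browser would

  # Assumption: the results come back as an HTML table.
  rows = results.search("table tr").map { |tr| tr.search("td").map(&:text) }
  normalize_rows(rows)
end
```

You would call it with something like `scrape_search_results("http://example.com/search", "q", "swimming pools")`, where the URL and field name are made up here and need replacing with the real ones from the page you are scraping.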
Advanced Scraping: PDFs shows you how to extract information from Adobe Portable Document Format (PDF) files. It uses the Ruby library PDF::Reader, which handles the text extraction phase; working out how to parse that text is a later skill.
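As a rough idea of the text extraction step, here is a minimal sketch that assumes the page-based interface of the pdf-reader gem (`PDF::Reader.new`, `pages`, `text`); the tutorial itself may use a different interface. The method names and the file path in the usage note are invented for illustration, and the gem is required inside the method so the file still loads where it isn't installed.

```ruby
# Hypothetical helper: return the extracted text of each page as a string.
def pdf_page_texts(path)
  require "pdf-reader" # gem install pdf-reader

  reader = PDF::Reader.new(path)
  reader.pages.map(&:text)
end

# Hypothetical helper: split extracted text into trimmed, non-blank lines.
# This is only a first clean-up step; real parsing depends on the document.
def non_blank_lines(text)
  text.each_line.map(&:strip).reject(&:empty?)
end
```

For example, `pdf_page_texts("report.pdf").flat_map { |text| non_blank_lines(text) }` would give you every non-blank line in a file called `report.pdf` (a made-up name), ready for whatever parsing the document needs.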
You can find all the Ruby tutorials (and links to Python and PHP ones) on one page.
Thanks Mark!