Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach. They’re amazing businesses that focus on social mobility. We interviewed him for PDFTables.com. He’s been using it to convert delegate lists that come as PDF into Excel and then into his Apple iPhone. It’s his preferred personal Customer Relationship Management (CRM) system, it’s […]
Announcing PDFTables.com
PDFs were invented at the same time as the web. As “digital paper”, they’re trustworthy and don’t change behind your back. This has a downside – often the definitive source of published data is a PDF. It’s hard to get tens of thousands of numbers out and into a spreadsheet or database. Copying and pasting is […]
The Tyranny of the PDF
Got a PDF you want to get data from? Try our easy web interface over at PDFTables.com! Why is ScraperWiki so interested in PDF files? Because the world is full of PDF files. The treemap above shows the scale of their dominance. In the treemap the area a segment covers is proportional to the number […]
Table Scraping Is Hard
The Problem: NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency. A well-known B2B publisher came to us to aggregate that data and provide them with information spanning the hundreds of different trusts, such as: who are the biggest contractors across the NHS? […]
pdftables – a Python library for getting tables out of PDF files
Got PDFs you want to get data from? Try our web interface and API over at PDFTables.com! One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer typically being “with difficulty”, but things are getting better all the time. PDF is a page description format, it […]
Scraping the Royal Society membership list
To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which […]
Scraping PDFs: now 26% less unpleasant with ScraperWiki
Got a PDF you want to get data from? Try our easy web interface over at PDFTables.com! Scraping PDFs is a bit like cleaning drains with your teeth. It’s slow, unpleasant, and you can’t help but feel you’re using the wrong tools for the job. Coders try to avoid scraping PDFs if there’s any other option. But […]