As part of my London Underground visualisation project I wanted to get data out of a table on Wikipedia; you can see it below. It contains data on every London Underground station, including things like the name of the station, the opening date, which zone it is in, how many passengers travel through it […]
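As a minimal sketch of that first step in Python, pandas.read_html can pull every table on a page into a DataFrame. The URL and the columns shown are assumptions for illustration, not the code used in the post.

```python
# Hedged sketch: fetch the station table from Wikipedia with pandas.
# The URL is an assumed page; the real layout and column names may differ.
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_London_Underground_stations"

# read_html returns every <table> on the page as a DataFrame;
# the station listing is typically the first large one.
tables = pd.read_html(URL)
stations = tables[0]

# Inspect what came back before doing anything with it.
print(stations.columns.tolist())
print(stations.head())
```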
The Tyranny of the PDF
Got a PDF you want to get data from? Try our easy web interface over at PDFTables.com! Why is ScraperWiki so interested in PDF files? Because the world is full of PDF files. The treemap above shows the scale of their dominance. In the treemap the area a segment covers is proportional to the number […]
Time to try Table Xtract
Getting data out of websites and PDFs has been a problem for years, with the default solution being the prolific use of copy and paste. ScraperWiki has been working on a way to accurately extract tabular data from these sources and make it easy to export to Excel or CSV format. We have been internally testing and […]
Table Scraping Is Hard
The Problem: NHS trusts have been required to publish data on their expenditure over £25,000 in a bid for greater transparency. A well-known B2B publisher came to us to aggregate that data and provide them with information spanning the hundreds of different trusts, such as: who are the biggest contractors across the NHS? […]
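As a rough illustration of that kind of question, the sketch below assumes the per-trust spend files have already been gathered as CSVs with illustrative "supplier" and "amount" columns; it is not the publisher's or ScraperWiki's actual pipeline.

```python
# Hedged sketch: combine per-trust spend CSVs and rank suppliers by total spend.
# The directory layout and column names are assumptions for illustration.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("nhs_spend/*.csv")]
spend = pd.concat(frames, ignore_index=True)

# Biggest contractors across all trusts: total spend per supplier.
top_contractors = (
    spend.groupby("supplier")["amount"]
         .sum()
         .sort_values(ascending=False)
         .head(20)
)
print(top_contractors)
```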
pdftables – a Python library for getting tables out of PDF files
Got PDFs you want to get data from? Try our web interface and API over at PDFTables.com! One of the top searches bringing people to the ScraperWiki blog is “how do I scrape PDFs?” The answer has typically been “with difficulty”, but things are getting better all the time. PDF is a page description format; it […]
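For a feel of what using the library looks like, here is a minimal sketch; the get_tables() entry point and the list-of-rows return shape are assumptions about the library's interface rather than code taken from the post.

```python
# Hedged sketch: extract tables from a local PDF with the pdftables library.
# get_tables() and its return structure are assumed here, not confirmed.
from pdftables import get_tables

with open("example.pdf", "rb") as fileobj:  # example.pdf is a placeholder path
    tables = get_tables(fileobj)

# Print each detected table as rows of cell values.
for index, table in enumerate(tables):
    print("Table", index)
    for row in table:
        print(row)
```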