PDF table extraction of a paginated table

Got PDFs you want to get data from? Try our web interface and API over at PDFTables.com! The Isle of Man aircraft registry (in PDF form) has long been a target of mine waiting for the appropriate PDF parsing technology. The scraper is here. Setting aside the GetPDF() function, which deals with copying out each […]

Streets of San Francisco… well, Mission Street to be precise… #newshacksf

This Friday, June 22nd, the ScraperWiki truck will roll across the Golden Gate Bridge as part of its ‘Liberate the Data’ quest, with the fabulous support of @MJ_Coren (www.majorplanetstudios.org). It’s a big deal as it’s our first co-hosted event and the first time we’ve parked in sunny California. It is also the precursor to […]

Middle Names in the United States over Time

I was wondering what proportion of people have middle names, so I asked the Census. Recently you requested personal assistance from our on-line support center. Below is a summary of your request and our response. We will assume your issue has been resolved if we do not hear from you within 48 hours. Thank you […]

Local ScraperWiki Library

It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline. How to use: pip install scraperwiki_local. You can then import scraperwiki in scripts run on your […]

Microfinance Data Scraping

I went to DataKind’s New York DataDive last November and met the Microfinance Information Exchange (MIX), a group that ‘delivers data services, analysis, research and business information on the institutions that provide financial services to the world’s poor’. They wanted to see whether web scraping could save them from manually gathering data. So fellow divers and I showed MIX the utility […]

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)

It’s 9:54am in Torquay on a Wednesday morning: One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head. On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head. Crews used vaseline and the finger kit to remove the […]

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases. Here are some ways that an exception might be raised. [][0] # The list has no zeroth element, so this raises an […]
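To make the idea concrete, here is a minimal sketch of catching the kind of exception mentioned above while looping over scraped rows. It is not taken from the original post: the function name parse_rows and the sample rows are hypothetical, and it assumes each row is a list of strings whose first cell should be a number.

```python
def parse_rows(rows):
    """Parse the first cell of each scraped row as an integer.

    parse_rows is a hypothetical helper for illustration only.
    Rows that cannot be parsed are recorded instead of stopping the run.
    """
    parsed = []
    failures = []
    for index, row in enumerate(rows):
        try:
            # row[0] raises IndexError on an empty row;
            # int() raises ValueError on non-numeric text.
            parsed.append(int(row[0]))
        except (IndexError, ValueError) as error:
            # Remember which row failed and why, then carry on.
            failures.append((index, type(error).__name__))
    return parsed, failures


# Made-up sample rows: one empty row and one non-numeric cell.
parsed, failures = parse_rows([["1"], [], ["x"], ["4"]])
print(parsed)
print(failures)
```

Catching only the exceptions you expect (rather than a bare except) means that a genuinely surprising failure still surfaces, while the known bizarrities are logged and skipped.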

Announcing ScraperWiki Premium Accounts!

The most exciting bit about ScraperWiki is how it forms a link between two very different worlds. On the one hand, we love the public good that data liberation enables, and we’re used by everyone from journalists (did you see us on the Guardian front page last week?) to activists (like the guys behind Australian planning […]

Parsing panic

This is a guest post by Martha Rotter, co-founder of Woop.ie and the recently launched Irish technology magazine Idea. Hey, remember the Wikipedia blackout? I do, because I was highly amused by the number of students panicking due to papers or homework they seemingly could not complete without this one website. One of my favourite things to […]

Is scraping legal?

Lots of people, when they hear about ScraperWiki, ask “is scraping legal? how can you build a business off that?”. They usually follow up by saying “we do it in our company, but we would never tell anyone”. This is strange to us, as we have come from a world of good scraping. Taking Government […]

ScraperWiki

Extract tables from PDFs and scrape the web
