On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number exceeds the precision of the double-precision floating point representation in the database, and this same number gets expressed as […]
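The precision problem is easy to see for yourself: a double has roughly 15–16 significant decimal digits, and the debt figure uses all 16, so the cents get rounded to the nearest representable binary value. A minimal sketch (storing the figure as a string or `Decimal` is one way round it):

```python
from decimal import Decimal

# The debt figure as quoted, to the penny.
quoted = "15888741858820.66"

# As a double-precision float the value is silently rounded to the
# nearest representable binary number -- the cents are no longer exact.
as_float = float(quoted)
print(Decimal(as_float))  # the exact value the database would actually hold

# The stored float no longer matches the quoted figure.
assert Decimal(as_float) != Decimal(quoted)

# Keeping the value as a string (or Decimal) preserves every digit.
assert str(Decimal(quoted)) == quoted
```

`Decimal(as_float)` shows the exact binary value hiding behind the float, which is where the mangled digits in the database come from.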
Software Archaeology and the ScraperWiki Data Challenge at #europython
There’s a term in technical circles called “software archaeology” – it’s when you spend time studying and reverse-engineering badly documented code, to make it work, or make it better. Scraper writing involves a lot of this stuff. ScraperWiki’s data scientists are well accustomed to a bit of archaeology here and there. But now, we want […]
PDF table extraction of a paginated table
Got PDFs you want to get data from? Try our web interface and API over at PDFTables.com! The Isle of Man aircraft registry (in PDF form) has long been a target of mine waiting for the appropriate PDF parsing technology. The scraper is here. Setting aside the GetPDF() function, which deals with copying out each […]
Local ScraperWiki Library
It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline. How to use: pip install scraperwiki_local. You can then import scraperwiki in scripts run on your […]
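Under the hood, the library's `scraperwiki.sqlite.save()` amounts to an upsert into a local SQLite file (table `swdata` is the library's default name). A rough stdlib sketch of the idea — this is an illustration of what the call does, not the library's actual code:

```python
import sqlite3

def save(data, unique_keys, db="scraperwiki.sqlite", table="swdata"):
    """Rough stand-in for scraperwiki.sqlite.save(): upsert a dict of
    column -> value into a local SQLite file, keyed on unique_keys."""
    conn = sqlite3.connect(db)
    cols = list(data)
    # SQLite happily creates columns without declared types.
    conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (table, ", ".join(cols)))
    # A unique index on the key columns makes INSERT OR REPLACE an upsert.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS %s_idx ON %s (%s)"
        % (table, table, ", ".join(unique_keys))
    )
    conn.execute(
        "INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
        % (table, ", ".join(cols), ", ".join("?" * len(cols))),
        [data[c] for c in cols],
    )
    conn.commit()
    conn.close()
```

Calling `save({"id": 1, "name": "first"}, unique_keys=["id"])` twice with the same id leaves a single row: the second call replaces the first, which is what makes re-running a scraper idempotent.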
More Python libraries!
I installed some new Python libraries and restructured the Python libraries documentation page. Some highlights: Gensim is “Topic Modelling for Humans” — read the introduction to the documentation; I’m looking for an excuse to play with it. unidecode transliterates Unicode into ASCII. It’s helpful for things like making column names. Beautiful Soup 4 beta (It’s a […]
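The column-names use case is easy to show. unidecode handles non-Latin scripts far better, but a bare stdlib approximation of the same trick — decompose accented characters, drop the combining marks, squash the rest — looks like this (the helper name is mine, for illustration):

```python
import re
import unicodedata

def ascii_column_name(name):
    """Fold a Unicode header like 'Código único' down to a safe ASCII
    column name. unidecode does this properly; this is the bare
    stdlib version for Latin-script headers."""
    # Decompose accented characters, then drop the combining marks.
    folded = unicodedata.normalize("NFKD", name)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything non-alphanumeric into underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").lower()

print(ascii_column_name("Código único"))  # codigo_unico
```

Note the limitation that motivates unidecode: `encode("ascii", "ignore")` silently throws away anything without an ASCII decomposition (Cyrillic, CJK, €, …), whereas unidecode transliterates it.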
Introducing status.scraperwiki.com
So you can find out if parts of ScraperWiki aren’t working, we’ve added a new status page. It’s called status.scraperwiki.com, and looks like this: The page and the status monitoring are done by the excellent Pingdom. We’ve been using it for a while to alert us to outages, so there’s quite a bit of history […]
The Data Hob
Keeping with the baking metaphor, a hob is a projection or shelf at the back or side of a fireplace used for keeping food warm. The central part of a wheel into which the spokes are inserted looks kind of like a hob, and is called the hub (etymology). Lately there has been a move […]
The UN peacekeeping mission contributions mostly baked
Many of the most promising webscraping projects are abandoned when they are half done. The author often doesn’t know it. “What do you want? I’ve fully scraped the data,” they say. But it’s not good enough. You have to show what you can do with the data. This is always very hard work. There are […]
Big fat aspx pages for thin data
My work is more in the practice of webscraping, and less in the highfalutin business plans and product-market-fit-chasing agility. At the end of the day, someone must have done some actual webscraping — and the harder it is the better. During the final hours of the Columbia University hack day, I got to work […]
Journalism Data Camp NY potential data sets
Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday. This list is only for backup in case not enough ideas show up with people on the day (never happens, but it’s always a fear). 1. Iowa accident reports The site http://accidentreports.iowa.gov contains all […]