The global CADCAM behemoth known as Autodesk hoovers up another small company every two weeks, a process unlikely to diminish following a $750 million bond issue last month. (Well, what else are they going to do with that money?) It was only a matter of time before this happened to me on account of my […]
On-line directory tree webscraping
As you surf around the internet, particularly in the old days, you may have seen web pages like this: or this: The former image is generated by an Apache SVN server, and the latter is the plain directory view generated for UserDir on Apache. In both cases you have a very primitive page that allows […]
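A minimal sketch of how such a directory index page might be scraped, assuming the listing is a plain set of `<a href>` links in which subdirectories end with a slash. The names `ListingParser` and `split_listing`, the sample HTML, and the `example.org` URL are all hypothetical and are not taken from the original scraper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect href values from <a> tags in a directory index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_listing(base_url, html):
    """Return (subdirectories, files) as absolute URLs."""
    parser = ListingParser()
    parser.feed(html)
    dirs, files = [], []
    for href in parser.hrefs:
        # Skip column-sort links ("?C=N;O=D") and links that lead back up the tree.
        if href.startswith("?") or href.startswith("/") or href.startswith(".."):
            continue
        url = urljoin(base_url, href)
        # Assumption: a trailing slash marks a subdirectory.
        (dirs if href.endswith("/") else files).append(url)
    return dirs, files


# Hypothetical sample of an Apache-style index page.
SAMPLE = """
<html><body><h1>Index of /~user/data</h1>
<a href="?C=N;O=D">Name</a>
<a href="/~user/">Parent Directory</a>
<a href="reports/">reports/</a>
<a href="notes.txt">notes.txt</a>
</body></html>
"""

dirs, files = split_listing("http://example.org/~user/data/", SAMPLE)
print(dirs)
print(files)
```

Walking a whole tree would then be a matter of fetching each URL in `dirs` and repeating the same split on the page that comes back.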
Three hundred thousand tonnes of gold
On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number overflows the double precision floating point notation in the database, and this same number gets expressed as […]
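One way to keep every digit of a figure like this, sketched here under the assumption that the value arrives as a formatted string, is to parse it into a `Decimal` and store it either as text or as a whole number of cents rather than as a floating point value. The helper names `parse_dollars` and `to_cents` are hypothetical, and the excerpt does not say how the original scraper handled the problem.

```python
from decimal import Decimal


def parse_dollars(text):
    """Parse a string like "$15,888,741,858,820.66" into an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


def to_cents(amount):
    """Convert a Decimal dollar amount to a whole number of cents."""
    return int(amount * 100)


debt = parse_dollars("$15,888,741,858,820.66")
cents = to_cents(debt)

print(debt)   # exact decimal value, safe to store as text
print(cents)  # integer cents, safe to store in an integer column
```

Either representation can be converted back to a float later for plotting, where the last few digits no longer matter.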
PDF table extraction of paginated table
The Isle of Man aircraft registry (in PDF form) has long been a target of mine waiting for the appropriate PDF parsing technology. The scraper is here. Setting aside the GetPDF() function, which deals with copying out each […]
5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)
It’s 9:54am in Torquay on a Wednesday morning: One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head. On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head. Crews used vaseline and the finger kit to remove the […]
Fine set of graphs at the Office for National Statistics
It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office for National Statistics (UK). If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through? A Population Pyramid is a […]
The Data Hob
Keeping with the baking metaphor, a hob is a projection or shelf at the back or side of a fireplace used for keeping food warm. The central part of a wheel into which the spokes are inserted looks kind of like a hob, and is called the hub (etymology). Lately there has been a move […]
The UN peacekeeping mission contributions mostly baked
Many of the most promising webscraping projects are abandoned when they are half done. The author often doesn’t know it. “What do you want? I’ve fully scraped the data,” they say. But it’s not good enough. You have to show what you can do with the data. This is always very hard work. There are […]
Big fat aspx pages for thin data
My work is more with the practice of webscraping, and less with the highfalutin business plans and product-market-fit leaning agility. At the end of the day, someone must have done some actual webscraping, and the harder it is the better. During the final hours of the Columbia University hack day, I got to work […]
Journalism Data Camp NY potential data sets
Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday. This list is only for backup in case not enough ideas show up with people on the day (never happens, but it’s always a fear). 1. Iowa accident reports The site http://accidentreports.iowa.gov contains all […]