ScraperWiki’s response to the Heartbleed security failure

Et tu, Heartbleed. “Catastrophic” is the right word. On the scale of 1 to 10, this is an 11. (Security expert Bruce Schneier, responding to Heartbleed.) On Monday the 7th of April 2014, a software flaw was identified which exposed approximately two thirds of the web to the risk of catastrophic security failure. The flaw has […]

Scraping Spreadsheets with XYPath

Spreadsheets are great. They’re ubiquitous, beaten only by web pages and word processor documents. Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated […]

Underneath the hood of Government’s Performance Platform

In the previous post I described what the UK Government’s new Performance Platform (made by GDS) is for. Today’s question is, how does it work? I’ve found out in two ways. Firstly, thanks to Alex Muller from GDS, who talked me through the platform. Secondly, all the code is freely available on GitHub, which is pretty. Component parts There […]

Git!

As a software company, use of some sort of software source control system is inevitable; indeed, our CEO wrote TortoiseCVS – a file system overlay for the early CVS source control system. For those uninitiated in the joys of software engineering: source control is a system for recording the history of file revisions allowing programmers to […]

It’s good to share…

As you may have gathered, I’m on a journey: I’ve worked as a physicist, a data scientist for 20 years and now I’ve fallen amongst software engineers. There are obvious similarities in what we do: we write code to do stuff. I write code to analyse things and the software engineers write code to do […]

Mastering space and time with jQuery deferreds

Recently Zarino and I were pairing on making improvements to a new scraping tool on ScraperWiki. We were working on some code that allows the person using the tool to pick out parts of some scraped data in order to extract a date into a new database column. For processing the data on the server […]

Book Review: Clean Code by Robert C. Martin

Following my revelations regarding sharing code with other people, I thought I’d read more about the craft of writing code in the form of Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin. Despite the appearance of the word Agile in the title, this isn’t a book explicitly about a particular methodology […]

npm install urchin

Urchin, the shell testing framework for extreme hipster superheroes (I’m not including myself in that group, I should add), is now available as an npm package. That means you can install it using npm: sudo npm install -g urchin If you’re not hipster enough to use npm then you can still wget it from GitHub: […]

Programmers past, present and future

As a UX designer and part-time anthropologist, working at ScraperWiki is an awesome opportunity to meet the whole gamut of hackers, programmers and data geeks. Inside of ScraperWiki itself, I’m surrounded by guys who started programming almost before they could walk. But right at the other end, there are sales and support staff who only […]

Scheduling! Keep your data fresh

We’ve added scheduling to the “Code in your browser” tool on beta.scraperwiki.com. For now it is daily, as that covers most people’s uses. Please ask if you need something else! Or have a look at the tool’s source code. Want to know how to use the new ScraperWiki? There’s a quick start guide to coding […]

Archive | Developer