Free community accounts on the ScraperWiki Beta

We’ve been teasing and tempting you with blog posts about the first few tools on the new ScraperWiki Beta for a while now. It’s time to let you try them out first-hand. As of right now, the new ScraperWiki Beta is open for you, your aunt, anyone, to sign up for a free community account: […]

Book review: JavaScript: The Good Parts by Douglas Crockford

This week I’ve been programming in JavaScript, something of a novelty for me. Jealous of the Dear Leader’s automatic summarisation tool, I wanted to make something myself; hopefully a future post will describe my timeline visualising tool. Further motivations are that web scraping requires some knowledge of JavaScript since it is a key browser technology […]

Summarise #1: Grouping automatically for you

Late at night, after a long conversation in a bar (after Social Media Cafe), Zach mentioned one feature that everyone loved about Kasabi. It had an overview page, which automatically summarised each dataset. Of course, Kasabi did it using linked data – telling you how many of your triples were geographic locations, and how many […]

From future import x.scraperwiki.com

Time flies when you’re building a platform. At the start of the year, we announced the beginnings of a new, more powerful, more flexible ScraperWiki. More powerful because it exposes industry standards like SQL, SSH, and a persistent filesystem to developers, so they can scrape and crunch and export data pretty much however they like. […]

Tools of the trade

With the experience of a whole week of ScraperWiki, I am starting to appreciate the core tools of the professional Data Scientist. In the past I’ve written scrapers in Matlab, C# and Python. However, the house language for scraping at ScraperWiki is Python. It’s a good choice: a mature but modern language with a wide […]

The next evolution of ScraperWiki

Quietly, over the last few months, we’ve been rebuilding both the backend and the frontend of ScraperWiki. The new ScraperWiki has been built from the ground up to be more powerful for data scientists, and easier to use for everyone else. At its core, it’s about empowering people to take hold of their data, […]

How to test shell scripts

Extreme hipster superheroes like me need tests for their shell. Here’s what’s available. YOLO: No automated testing. Few shell scripts have any automated testing because shell programmers live life on the edge. Inevitably, this results in tedious manual ‘testing’. Loads of projects use this approach: git flow, homeshick, ievms, rbenv, z. Here are some more. […]

We’re hiring: the world’s best data scientists!

If you’re a ScraperWiki coder with great communication skills and a passion for data, then you should probably bookmark our new Jobs page. We’ll be hiring for a few different roles over the coming months, and we’d love to hear from you! Right now, we’re looking for two Data Scientists to help Dragon Dave get, […]

DumpTruck 0.0.3

I’ve added some new features to DumpTruck. Changes. Dictionary case sensitivity: I removed the dictionaries with case-insensitive keys because that just seemed to be delaying the conversion to case sensitivity. Ordered dictionaries: DumpTruck.execute now returns a collections.OrderedDict for each row rather than a dict for each row. Also, order is respected on insert, so you […]

The state of Twitter: Mitt Romney and Indonesian Politics

It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can […]
