Scraping guides: Dates and times

Working with dates and times in scrapers can get really tricky. So we’ve added a brand new scraping guide to the ScraperWiki documentation page, giving you copy-and-paste code to parse dates and times, and save them in the datastore. To get to it, follow the “Dates and times guide” link on the documentation page. The […]

New backend now fully rolled out

The new faster, safer sandbox that powers ScraperWiki is now fully rolled out to all users. You should find running and developing scrapers and views faster than before, and that you’re using much more recent versions of Ruby, Python and associated libraries. Thank you to everyone, and there were lots of you, who helped us beta […]

Scraping guides: Parsing HTML using CSS selectors

We’ve added a new copy-and-paste scraping guide, so you can quickly get the lines of code you need to parse an HTML file using CSS selectors. Get to it from the documentation page. The HTML parsing guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right […]

Four data trends to rule them all, the data scientist king to bind them

My favourite soundbite from O’Reilly’s Strata data conference was a definition of big data. John Rauser, Amazon’s main data scientist, said to me that “data is big data when you can’t process it on one machine”. And naturally, small data is data that you can process on one machine. What’s nice about this definition is it […]

Make RSS with an SQL query

Lots of people have asked for it to be easier to get data out of ScraperWiki as RSS feeds. The Julian has made it so. The Web API now has an option to make RSS feeds as a format (i.e. instead of JSON, CSV or HTML tables). For example, Anna made a scraper that gets alcohol […]

Scraping guides: Excel spreadsheets

Following on from the CSV scraping guide, we’ve now added one about scraping Excel spreadsheets. You can get to them from the documentation page. The Excel scraping guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page. As with CSV files, at first […]

A faster, safer sandbox to play in

When programmers first hear about ScraperWiki, their initial reaction is often “what! you let anyone edit general purpose code and run it on your servers!”. The answer is that, yes, we do, but in an isolated environment. Your own “sandbox” if you like, where you can safely build castles without knocking others over. Or, as […]

Scraping guides: Values, separated by commas

When we revamped our documentation a while ago, we promised guides to specific scraper libraries, such as lxml, Nokogiri and so on. We’re now starting to roll those out. The first one is simple, but a good one. Go to the documentation page and you’ll find a new section called “scraping guides”. The CSV scraping guide is available […]

Scheduling: A scrape a day keeps stale data away

We’ve just rolled out a change to the default frequency of new scrapers. They used to default to running once a day. Now they default to not running at all. We’ve made this change because people often make new scrapers that aren’t ready yet. These run every day and send annoying emails saying that they’re […]

ScraperWiki Digger Gets HTTPS Security System

During the last week of rioting across the UK, 8 riot vans were called out to quell the unrest just around the corner from where I live. With scenes of chaos and destruction filling the airwaves and clogging up Twitter, I began thinking: Are your scrapers safe from looters? We don’t stock trainers or flat-screen TVs, […]
