How to stop missing the good weekends

Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to play on my dive kayaks — an activity which is somewhat weather dependent. Luckily, help is at hand in the form of the ScraperWiki email alert system. […]

Let's try ScraperWiki

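Guest post by Makoto Inoue, a Japanese ScraperWiki user. Makoto works in London as a Web developer, a technical writer, and a translator. He has a Japanese blog and his Twitter account is @makoto_inoue. Introduction: Do you know the word "scrape"? The work of pulling specific data out of web pages is called scraping. Many websites these days offer an API (Application Programming Interface) to make their data easy to get, so plenty of people may be thinking "why would anyone still need that?" However, during the recent Great East Japan Earthquake, the breaking reports on the earthquake and on electricity, and the government statistics needed to understand the damage in each area, were not available as APIs, and I suspect many developers ended up writing their own scraper programs. It is a real pity, though, when programs built out of the goodwill of so many developers end up scattered across various sites or eventually go unmaintained. That is where ScraperWiki comes in. What is ScraperWiki? ScraperWiki is a British startup that provides a site for sharing scraper code. Developers can edit and run code (Ruby, PHP, Python) directly on the site. Scrapes can also be run on a schedule, and the data they collect is stored on ScraperWiki; because ScraperWiki provides an API, that data can be reused on other sites through it. As the "Wiki" in the name suggests, publicly shared code can be edited by other people, or copied and used for other scraping jobs. There is a mechanism for checking whether scheduled scrapers are raising errors, and features for "managing scraping together" appear throughout the site. ScraperWiki has its origins in a project from around 2003, when one of the founders scraped the UK Parliament's website to find out which MPs had voted for or against which bills. In Japan, the equivalent would be a page like this one. Today, major news organisations such as the Guardian use it to investigate the influence of corporate lobbyists in Parliament, and the UK government itself reportedly uses ScraperWiki in its prototype site alpha.gov.uk as a way to provide unified access to data scattered across government departments. As for ScraperWiki's business model, publicly shared code is free, while charges apply for keeping code private and according to the volume of scheduled scraping. That was a long preamble, so let's actually try it out. Looking at existing scrapers: A Google search for "ScraperWiki" turned up a Japanese user who was already using it, in a post titled "If you're going to scrape, you should use ScraperWiki". There it is used to scrape data on members of the House of Representatives. Members of […]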

Scraping the protests with Goldsmiths

Zarino here, writing from carriage A of the 10:07 London-to-Liverpool (the wonders of the Internet!). While our new First Engineer, drj, has been getting to grips with lots of the under-the-hood changes which’ll make ScraperWiki a lot faster and more stable in the very near future, I’ve been deploying ScraperWiki out on the frontline, with […]

How to scrape and parse Wikipedia

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (that panel on the right-hand side of the page). The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if […]

ScraperWiki in 3 minutes

As you’ve probably noticed, Zarino and the team recently upgraded all of your scraper pages to have a lovely new, more useful look. So we’re rolling out a new set of introductory screencasts going through our new look site and giving you a bit of a flavour of the many things you can do on ScraperWiki. […]

ScraperWiki scrapers: now 53% more useful!

It’s Christmas come early at ScraperWiki HQ as we deliver—like elves popping boxes under the data digging Christmas tree—a bunch of great new improvements to the ScraperWiki site. We’ve been working on these for a while, so it’s great to finally let you all use them! First up: a new look for your scrapers The […]

How to get along with an ASP webpage

Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats. Unfortunately, the one set of Open Data that was difficult to obtain was the list of sets of open data. That’s because the list was split across four separate pages. The important thing […]

Job advert: Lead programmer

Oil wells, marathon results, planning applications… ScraperWiki is a Silicon Valley style startup, in the North West of England, in Liverpool. We’re changing the world of open data, and how data science is done together on the Internet. We’re looking for a programmer who’d like to: Revolutionise the tools for sharing data, and code that works with […]

Lots of new libraries

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install. Also, let us know why you want them as it’s […]

Tweeting the drilling

A very long time ago I discovered the easiest webscraping target: the locations of all the North Sea oil wells. Once you had webcrawled through the index pages, the entries were pretty straightforward. There were dates, water depths (in feet or metres), GPS locations and so on. The code, if you want to look at it, […]

ScraperWiki

Extract tables from PDFs and scrape the web

Archive | Developer