
Getting started

Technical overview questions

Technical detail questions

Licensing questions

Everything else

What is ScraperWiki?

ScraperWiki is a platform for writing and scheduling screen scrapers, and for storing the data they generate. There's lots of useful data locked away on the internet and we want to open it up!

If you need a parallel, and every self-respecting website does, it's like GitHub, except with an execute button and a database behind it.

Who is ScraperWiki for?

ScraperWiki is useful both for programmers who want to write screen scrapers with less fuss, and for journalists, activists and the general public who want to discover and re-use interesting, useful data.

What programming languages can I use to write scrapers?

We support Ruby, Python and PHP.

How do I write a screen scraper?

See our documentation for tutorials.

How often does my scraper run?

By default scrapers only run when you do so from the editor. You can set your scraper to run automatically (e.g. once a day) in the "schedule" section of its main page.

Currently you can't schedule runs more frequently than once a day. We can enable that on a case-by-case basis, so please get in touch if you have an application that needs it.

Who can edit a scraper?

By default anyone can edit anyone else's scraper; this means other people can help extend or fix your code. You will be emailed whenever your scraper is edited, so you'll know what changed and by whom.

You can also protect a scraper, so only people you choose can edit it, or alter the data associated with it. Go to the Contributors section on the scraper overview page. Where it says "This scraper is public" choose "edit" and change it to "Protected".

Eventually you will have the option to keep scrapers completely private. If you are interested in testing an earlier version of this, please get in touch.

How can I get data out of ScraperWiki?

The simplest way is to download a CSV file from the link on the scraper page; alternatively you can use the API.
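
For example, here is a rough Python sketch of pulling a scraper's data over the API. The endpoint, the parameters and the scraper name "my-scraper" are assumptions, so check the API explorer for the exact URL to use.

import urllib

# Build an API query against a (made-up) scraper's default "swdata" table.
query = urllib.quote("select * from swdata limit 10")
url = ("https://api.scraperwiki.com/api/1.0/datastore/sqlite"
       "?format=csv&name=my-scraper&query=" + query)
print urllib.urlopen(url).read()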

What happens if my scraper breaks?

We will send you an email alert to warn you if your scraper breaks.

What are all the bits of ScraperWiki?

These are the major parts of ScraperWiki, from a technical point of view.

  • Edit code in a browser-based code editor, CodeMirror.
  • Normal Python, Ruby and PHP scripts run in a sandbox on ScraperWiki's servers.
  • Nokogiri, lxml, Mechanize: all the third-party libraries you're used to.
  • The ScraperWiki library makes scraping and storing data simple (see the sketch after this list).
  • Data is stored directly in a SQLite datastore, schemaless unless you want more control (Datastore copy & paste guide).
  • Access data as JSON, CSV or RSS, using SQL in URLs, via the ScraperWiki external API.
  • Views for exporting the data in any format, or for writing simple web apps. They're CGI scripts.
  • Scrapers can be scheduled to re-run daily so your data is always up to date. See the scraper overview page.
  • Email alerts if your scrapers fail, or someone edits them.
  • Autocommits to built-in source control, based on Mercurial. See the history tab.
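
As a taste of the workflow described above, here is a minimal Python sketch that uses the ScraperWiki library to fetch a page and save one row into the datastore; the URL and field names are made up for illustration.

import scraperwiki
import lxml.html

# Fetch a page and parse it with lxml (one of the bundled libraries).
html = scraperwiki.scrape("http://example.com/")
root = lxml.html.fromstring(html)

# Save a row into the default "swdata" table; the schema is created for you.
scraperwiki.sqlite.save(
    unique_keys=["url"],
    data={"url": "http://example.com/", "title": root.findtext(".//title")},
)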

I made a column or a table I don't need, how do I remove it?

The datastore save function automatically makes a schema for you. This means that while you're developing a scraper you sometimes end up with columns or tables that you don't need later.

The easiest fix during development is to clear the datastore, and let your script make it again with exactly the right columns/tables. There is a button on the scraper overview page called "Clear datastore" that does that.

Alternatively, call the "execute" function and use "ALTER TABLE" SQL commands to modify the schema as you like. See the Datastore copy & paste guide.
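
Here is a rough Python sketch, assuming made-up table and column names; note that SQLite's ALTER TABLE can only rename tables or add columns, not drop them, which is why clearing the datastore is often the simpler option.

import scraperwiki

# Drop a table you no longer need ("old_table" is a made-up name).
scraperwiki.sqlite.execute("DROP TABLE IF EXISTS old_table")

# ALTER TABLE can add a column or rename a table, but not drop a column.
scraperwiki.sqlite.execute("ALTER TABLE swdata ADD COLUMN notes TEXT")
scraperwiki.sqlite.commit()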

How do I log progress of my scraper?

Just by printing! Use print or puts according to the language you're using. You can write to stdout or stderr.

The output is displayed in the console as the scraper runs in the editor, and a selectively cropped version is stored in the history for scheduled runs.

ScraperWiki scrapers are not guaranteed to run at a specific time. We use a queuing system to smooth demand on our servers. You can read more on our Google Group.

How do I get query parameters in my view?

Getting the query string works the same way in each language: it comes in via the QUERY_STRING environment variable, following the CGI standard. This means the typical ways of accessing CGI parameters will work (the details vary according to your chosen language).

In a Python view you can use the cgi module:

import cgi, os
paramdict = dict(cgi.parse_qsl(os.getenv("QUERY_STRING", "")))
print paramdict

See also a slightly longer example that echoes the CGI parameters.

With a Ruby view you can use:

require 'cgi'
paramdict = CGI::parse( ENV['QUERY_STRING'] )
p paramdict

PHP views can access the query string using $_GET:

print_r($_GET);

What is the CPU limit, and what do I do if my scraper reaches it?

Each script run has a limit of roughly 160 seconds of processing time. After that, in Python and Ruby you will get an exception (scraperwiki.CPUTimeExceededError in Python, and ScraperWiki::CPUTimeExceededError in Ruby). In PHP the script is terminated with the fatal error "Maximum execution time […] exceeded". We would love to convert this to an exception, but sadly we can’t due to a limitation in PHP.

This exception can be caught, and it can be useful to do that if you have some state to save or other cleaning up to do before exiting. You will have a small amount (2 seconds) of additional CPU time before the process is killed without warning.
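
A minimal Python sketch of catching the exception to tidy up before exiting; the scraping loop and the variable name are placeholders.

import scraperwiki

def scrape_everything():
    # Placeholder for your real scraping loop.
    for page in range(1, 100000):
        pass  # ... fetch and save each page ...

try:
    scrape_everything()
except scraperwiki.CPUTimeExceededError:
    # Only about 2 seconds of grace remain, so keep the clean-up brief.
    scraperwiki.sqlite.save_var("stopped_early", 1)
    raise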

In many cases this happens when you are first scraping a site, catching up with the backlog of existing data. The best way to handle it is to make your script do a chunk at a time using the save_var and get_var functions in the ScraperWiki library to remember your place. This technique also lets you recover more easily from other parsing errors.
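
Here is a rough sketch of that chunking approach in Python, assuming the pages are numbered and that save_var and get_var live under scraperwiki.sqlite; "last_page" is just an illustrative variable name.

import scraperwiki

# Pick up where the previous run left off (page 1 on the very first run).
start = scraperwiki.sqlite.get_var("last_page", 1)

for page in range(start, start + 200):  # do one chunk per run
    # ... scrape and save page number `page` here ...
    scraperwiki.sqlite.save_var("last_page", page + 1)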

Note that CPU time is not the same as wall-clock time. Most scripts use only a little CPU time when they are scraping, they are mostly waiting for web pages to be downloaded, which takes essentially no CPU time. A typical script can scrape thousands of pages before hitting this CPU limit.

What is the limit on memory use?

The sandbox in which the scrapers run gives them at most 1GB of memory. When this is exceeded, your script will most likely be killed with a SIGKILL.

Can I scrape secure (https://) sites?

Yes. Here's an example, in Python.

url = "https://cia.gov" import urllib print urllib.urlopen(url).read()

Is there an editor other than the CodeMirror one?

At the moment, no. The editor and pair-programming features are integrated into the internal workings of this library (CodeMirror).

You can, however, disable the editor and use a plain old HTML textarea editor by adding either "?textarea=plain" or "#plain" to the end of the URL while editing the scraper.

This should be enough for you to use it with browser plugins that spawn vim or emacs for a textarea. It will also work in browsers for which CodeMirror doesn't work.

Please give us feedback about this feature so we know if we should make it more available (for example by adding a setting to your user account so that the editor is always a textarea for you).

How do I revert to an earlier version of my code?

On the history page for the scraper, view the commit that you want to go back to. There's then a link called "rollback".

When do views show a "powered by ScraperWiki" banner?

On views such as this, you will see a banner at the top right saying "powered by ScraperWiki". This appears on all HTML views, and is a link back to the view. It gives us and you credit, and ensures sources are cited.

If your view generates something other than HTML, such as JSON, CSV or an image file, you should set the httpresponseheader to the appropriate MIME type (see the ScraperWiki library), and the banner will no longer appear. For example, in Python: scraperwiki.utils.httpresponseheader("Content-Type", "text/json").
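
For instance, here is a sketch of a small Python view that emits JSON instead of HTML, so no banner appears; the scraper name "some-scraper" is made up, and any non-HTML content type will do.

import json
import scraperwiki

# Declare the output as JSON rather than HTML.
scraperwiki.utils.httpresponseheader("Content-Type", "application/json")

# Views read data by attaching to a scraper's datastore first.
scraperwiki.sqlite.attach("some-scraper")
rows = scraperwiki.sqlite.select("* from swdata limit 10")
print json.dumps(rows)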

If you are finding the banners annoying in a particular set of cases, please get in touch. At some point we'll add ways to vary the banner for different circumstances.

Can I save files, and if so where?

Yes, but all files are temporary. You can save them either in /tmp, or in the user's home directory (/home/scriptrunner). The current directory starts out as the home directory.

This only works for temporary downloads, as the scripts run in a clean environment each time so data can't leak between scripts. You must save any permanent data in the datastore, or elsewhere on the web.
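
For example, here is a Python sketch that downloads a file into /tmp, reads something from it, and keeps the result in the datastore; the URL and field name are placeholders.

import urllib
import scraperwiki

# Download to a temporary path; it will not survive between runs.
path = "/tmp/report.csv"
urllib.urlretrieve("http://example.com/report.csv", path)

with open(path) as f:
    first_line = f.readline().strip()

# Anything you want to keep must go into the datastore (or elsewhere).
scraperwiki.sqlite.save(unique_keys=["line"], data={"line": first_line})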

Wow, I can run arbitrary commands!

Feel free to spawn external commands and to download extra binaries or code. It'll be slow, so if there is something you use a lot, ask us to install it permanently.

See this example which spawns a bash shell script and sed.
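
As an illustration, here is a Python sketch that shells out to sed using the standard subprocess module; the input text is made up.

import subprocess

# Pipe some text through sed, an external binary available in the sandbox.
proc = subprocess.Popen(
    ["sed", "s/cat/dog/g"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate("the cat sat on the mat\n")
print out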

There are some shared binaries which you can use from any language in that way.

Can I read from the datastore of another scraper?

Each scraper can only write to its own datastore, so you can tell the provenance of any data, including what code wrote it.

You can, however, read from other datastores by attaching to them first. See the First view tutorial for a simple example, full documentation in the Datastore copy & paste guide.

It's possible to attach to lots of datastores, and use SQL to select from all of them as if they were one.
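
A short Python sketch of attaching and querying, assuming a made-up scraper name and the default "swdata" table.

import scraperwiki

# Attach another scraper's datastore under an alias.
scraperwiki.sqlite.attach("some-other-scraper", "other")

# Then query it like any other table.
rows = scraperwiki.sqlite.select("* from other.swdata limit 5")
print rows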

The datastore is slow and/or timing out, what should I do?

Queries to the datastore can take at most 30 seconds. Here are some things you can do if this is a problem:

  1. If this is happening with a large database viewed from the web site, try reloading the page. It should then be cached in the server's memory and respond faster the second time.
  2. From code, it is faster to save lots of rows in groups. Instead of passing a single dictionary/hash to the save command, pass a list of dictionaries/hashes (see the sketch after this list). See "for greater speed" in the Datastore copy & paste guide.
  3. When saving, specify "verbose = 0" as a parameter to the save_sqlite function. This turns off logging to the Data tab in the editor, and in some cases can make it ten times faster while developing. See ScraperWiki library.
  4. Make appropriate indices to speed up your queries. For example, in Python, this created the index for a road accidents scraper:

scraperwiki.sqlite.execute('''
    CREATE INDEX IF NOT EXISTS casualty_type_manual_index
    ON casualties (Casualty_Type)''')

The datastore normally times out after 30 seconds, but CREATE INDEX commands get up to 3 minutes.
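
As mentioned in point 2, passing a whole list of rows to one save call is much faster than saving them one at a time; here is a rough Python sketch with made-up fields.

import scraperwiki

rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
    {"id": 3, "name": "gamma"},
]

# One call, many rows: far fewer round trips to the datastore.
scraperwiki.sqlite.save(unique_keys=["id"], data=rows)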

Can I import code from another scraper?

In Ruby you can run:

require 'scrapers/some-other-scraper'

In PHP you can pass a URL to require:

require("http://some-url-to/some.php");

In Python we have a feature that can, at best, be described as experimental. Instead of import amodule you can use:

amodule = scraperwiki.utils.swimport("some-other-scraper")

If you would like this to be more convenient (or even better, if you have a patch to make import foo work), then please get in touch.

Who owns the code I write on ScraperWiki?

You do. However, for public code, you agree to license it for others to use under the GNU General Public License. That does not apply inside a private vault. For more information please see our terms and conditions.

Who owns the data in ScraperWiki?

It depends where the data originally came from and how it was derived.

What's your policy on what's legal to scrape?

In short, play nice, and don't do to someone else's website what you wouldn't like done to your own.

General

It is our view that, where a web server responds to an unauthenticated HTTP request, there is an implied licence to use the HTML that is returned for reading and automatically extracting that information. This, in our view, is how the web is designed to operate. If the proprietor of a web host wishes (for example) to charge for use of their site, HTTP provides mechanisms to require payment or authentication for use. They may also make use of the robots exclusion protocol to prevent scraping and spidering of any kind.

Of course we may be wrong about this. The question has not been tested in any UK court and, we understand, there is not much more clarity world-wide. If you are in doubt about whether what you are doing is lawful, you should seek your own legal advice, rather than relying on our best guess.

Platform

Users doing scraping themselves are using us as a hosting service (whether public or private, it makes no difference).

We will obey the law, for example legal takedown notices. Other than that, it is none of our business what they do.

We won't add features to the platform whose only purpose is to avoid technical measures that prevent scraping. It is, however, not our business whether users do so themselves using standard tools.

Data services

Where we are ourselves handling data requests or doing other consultancy, our policy is twofold.

1) If it is a Government site (any Government), and we aren't just using technical measures to get at data which the Government otherwise sells, then we consider it in the public interest for it to be scraped, and we will do so.

2) For non-Government sites, we check the robots.txt file. If the site permits robots in general to scrape their site (NOT just GoogleBot!), then we will do so. We will make no effort to look for other terms and conditions as well.
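
If you want to make the same kind of check yourself, here is a Python sketch using the standard-library robotparser module; the URLs are placeholders.

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Check the generic "*" user agent, not just a particular search engine's bot.
print rp.can_fetch("*", "http://example.com/some/page.html")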

Developer introductions

Where we introduce someone to a developer who does scraping for them, the same situation as described in "platform" above applies.

It is up to you to ensure that your scraping activity does not break the law. While some standard libraries may check robots.txt, others may not. Even if you are permitted to use a site, you should ensure that what you do is not disruptive and does not break the law in some other way.

See also our terms and conditions.

Can I get a copy of the ScraperWiki source code?

Yes, the source code for the ScraperWiki site is available under the GNU Affero General Public License. You can download the source here.

How do I get in touch with you?

We'd really like to hear your ideas for improving and adding to ScraperWiki. You can contact us here or ask a question on our email list.