
Write a real scraper by copying and pasting code, for programmers or non-programmers (30 minutes).

1. Make a new scraper

We’re going to scrape the average number of years children spend in school in different countries from this page, which was once on a UN site but has since been replaced with an Excel spreadsheet.

Create a new scraper, and choose Python as the language. (You can also do this tutorial in Ruby or PHP if you’re more comfortable with those.) You’ll get a web-based code editor.

Put in a line of code to check that it runs, then click the “Run” button or press Ctrl+R.

print "Hello, coding in the cloud!"

(As we go through this tutorial, you can copy and paste each block of code onto the end of your growing scraper, and run it each time.)

The code runs on ScraperWiki's servers. You can see any output you printed in the Console tab at the bottom of the editor.

2. Download HTML from the web

You can use any normal Python library to crawl the web, such as urllib2 or Mechanize. There is also a simple built-in ScraperWiki library, which may be easier to use.

import scraperwiki
html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")
print html

When you print something quite large, click "more" in the console to view it all. Alternatively, go to the Sources tab in the editor to see everything that has been downloaded.
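If you want to try the download step outside the ScraperWiki editor, the sketch below does roughly the same job with the standard library. It assumes Python 3, where urllib2's functionality lives in urllib.request (the tutorial's own snippets use Python 2 print syntax). The fetch function and its opener parameter are names made up for this example, not part of the ScraperWiki library, and the UTF-8 default is an assumption about the page encoding.

```python
# A minimal sketch, assuming Python 3 and only the standard library.
from urllib.request import urlopen

URL = ("http://web.archive.org/web/20110514112442/"
       "http://unstats.un.org/unsd/demographic/products/socind/education.htm")


def fetch(url, opener=urlopen, encoding="utf-8"):
    """Download a page and return its body as text.

    `opener` is a hypothetical hook added for this example so the
    function can be tried without network access; by default it is
    the standard urlopen.
    """
    response = opener(url)
    try:
        raw = response.read()
    finally:
        response.close()
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(encoding, errors="replace")


# Usage (needs network access):
# html = fetch(URL)
# print(html[:200])
```

Passing a different opener makes it easy to feed the function a canned page while you develop the parsing step.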

3. Parsing the HTML to get your content

lxml is a good library for extracting content from HTML: it copes with malformed markup and lets you pick out elements with CSS selectors.

import lxml.html
root = lxml.html.fromstring(html)
for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds)==12:
        data = {
            'country' : tds[0].text_content(),
            'years_in_school' : int(tds[4].text_content())
        }
        print data

The bits of code like div and td are CSS selectors, just like those used to style HTML. Here we use them to select all the table rows. Then, for each row, we select the individual cells, and if there are 12 of them (i.e. we are in the main table body rather than in one of the header rows), we extract the country name and schooling statistic.
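To experiment with the row-filtering logic offline, here is a rough sketch that swaps lxml for the standard library's html.parser, so it is not the method used in this tutorial. It assumes Python 3, ignores the div[align='left'] scoping, and does not handle nested tables. RowCollector, extract_rows and the sample row (including the country "Ruritania" and all of its numbers) are made up for illustration.

```python
# A sketch using html.parser instead of lxml (an assumption for
# offline experimentation, not the tutorial's own approach).
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by bad markup


def extract_rows(html, expected_cells=12, country_col=0, years_col=4):
    """Keep only rows with `expected_cells` cells, as the tutorial does."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    records = []
    for cells in parser.rows:
        if len(cells) != expected_cells:
            continue  # header rows and other short rows are skipped
        records.append({
            "country": cells[country_col],
            "years_in_school": int(cells[years_col]),
        })
    return records


# Made-up sample: one header row and one 12-cell data row.
SAMPLE_CELLS = ["Ruritania", "2009", "11", "10", "12",
                "2009", "11", "10", "12", "", "", ""]
SAMPLE_HTML = (
    "<table><tr><th>Country</th><th>Years</th></tr><tr>"
    + "".join("<td>%s</td>" % cell for cell in SAMPLE_CELLS)
    + "</tr></table>"
)

print(extract_rows(SAMPLE_HTML))
```

The header row contains no td cells, so it fails the 12-cell check and is dropped, which mirrors the filter in the lxml loop.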

4. Saving to the ScraperWiki datastore

The datastore is a magic SQL store, one where you don't need to make a schema up front.

Replace print data in the lxml loop with this save command (make sure you keep it indented with spaces at the start like this):

scraperwiki.sqlite.save(unique_keys=['country'], data=data)

The unique keys (just country in this case) identify each piece of data. When the scraper runs again, existing data with the same values for the unique keys is replaced.

Go to the Data tab in the editor to see the data loading in. Wait until it has finished.
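The replace-on-same-key behaviour can be illustrated with a small local sketch built on Python's sqlite3 module. This is not how ScraperWiki implements its datastore; it only shows the idea that a second save with the same unique key overwrites the first. The save function here is a hypothetical stand-in, the sample country is made up, and the sketch assumes Python 3, trusted column names, and records that always have the same keys.

```python
# A local illustration of unique keys, assuming Python 3 and sqlite3.
import sqlite3


def save(conn, unique_keys, data, table="swdata"):
    """Insert `data` (a dict), replacing any row with the same unique keys.

    Table and column names are interpolated directly into the SQL, so
    they must come from trusted code, never from scraped input.
    """
    columns = list(data)
    # Create the table on first use; columns are left untyped for brevity.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (%s, UNIQUE (%s))"
        % (table, ", ".join(columns), ", ".join(unique_keys))
    )
    placeholders = ", ".join("?" for _ in columns)
    # OR REPLACE deletes the conflicting row before inserting the new one.
    conn.execute(
        "INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
        % (table, ", ".join(columns), placeholders),
        [data[column] for column in columns],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save(conn, ["country"], {"country": "Ruritania", "years_in_school": 11})
save(conn, ["country"], {"country": "Ruritania", "years_in_school": 12})
print(conn.execute("SELECT * FROM swdata").fetchall())
```

After both calls the table holds a single row for the made-up country, carrying the value from the second save.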

5. Getting the data out again

If you haven't done so yet, press "save scraper" at the bottom right of the editor. You'll need to give your scraper a title, and to make a ScraperWiki account if you don't have one already.

Now, click on the Scraper tab at the top right to see a preview of your data. The easiest way to get it all out is to "Download spreadsheet (CSV)".

For more complex queries, click "Explore with ScraperWiki API". Try this query in the SQL query box:

select * from swdata order by years_in_school desc limit 10

It gives you the records for the ten countries where children spend the most years at school.

Notice that, as well as JSON, you can get custom CSV files by putting the SQL query in the URL.
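If you have downloaded the data and want to try the same query locally, the following sketch runs it against an in-memory SQLite table and formats the result as CSV. It assumes Python 3 and does not call the ScraperWiki API; make_demo_db and top_ten_csv are hypothetical helpers, and the rows are made-up placeholders rather than real statistics.

```python
# A local sketch of the top-ten query, assuming Python 3 and sqlite3.
import csv
import io
import sqlite3

QUERY = "select * from swdata order by years_in_school desc limit 10"


def make_demo_db():
    """Build an in-memory swdata table holding twelve made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table swdata (country text, years_in_school integer)")
    conn.executemany(
        "insert into swdata values (?, ?)",
        [("Country %02d" % n, n) for n in range(1, 13)],
    )
    return conn


def top_ten_csv(conn, query=QUERY):
    """Run the query and return the result as CSV text with a header row."""
    cursor = conn.execute(query)
    header = [column[0] for column in cursor.description]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return out.getvalue()


print(top_ten_csv(make_demo_db()))
```

Because of the limit 10 clause, only ten of the twelve demo rows appear, ordered from the largest years_in_school value downwards.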

What next?

If you have a scraper you want to write, and feel ready, then get going. Otherwise take a look at the documentation.