Welcome to the ScraperWiki code editor

Here you can write, test and debug your screen scrapers using the Python programming language. If you are not familiar with Python, click here to read a bit about it.

Click the Run button to test your scraper and view the results.

Whilst you are working on your scraper, click Save draft to back it up. When you are finished, click Commit and publish to share it with the world.

Ready to start?

Help and instructions

ScraperWiki is a collaborative wiki for web scrapers. See the FAQ for details.

Keyboard shortcuts

Control-S  Save
Control-R  Run
Control-D  Diff
Control-Z  Undo
Control-Y  Redo

scraperwiki functions

scrape(url, params=[])
Returns the source text of a webpage as a string. This is a wrapper for urllib2 that lists extra information on the sources tab.
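
To make the idea of a fetch wrapper concrete, here is a minimal sketch using only the standard library. It is not ScraperWiki's implementation: it is written for Python 3 (urllib.request rather than urllib2), the name scrape_sketch is hypothetical, and it assumes that params, when supplied, are form-encoded and sent as a POST body. It does not log anything to the sources tab.

```python
import urllib.parse
import urllib.request


def scrape_sketch(url, params=None):
    """Fetch url and return its body as text (hypothetical stand-in for scrape)."""
    # Assumption: params, when given, are form-encoded and sent as a POST body.
    data = urllib.parse.urlencode(params).encode("utf-8") if params else None
    with urllib.request.urlopen(url, data) as response:
        raw = response.read()
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes rather than failing on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

Calling scrape_sketch with a page URL returns the page source as a string, which can then be handed to a parser such as BeautifulSoup.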

scraperwiki.datastore functions

save(unique_keys=[], data={}, date=None, latlng=[None,None])
Saves a record to the datastore, adding it or overwriting any existing record that has the same values for unique_keys. date and latlng are optional special fields for indexing each record.
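
The add-or-overwrite behaviour of unique_keys can be illustrated with a plain dictionary. This is a sketch of the semantics only, assuming records are identified by the tuple of their unique_keys values. The function save_sketch and the in-memory store are hypothetical, not the real datastore, and the date and latlng fields are left out.

```python
def save_sketch(store, unique_keys, data):
    """Insert data into store, replacing any record with the same key values."""
    # The record's identity is the tuple of its values for unique_keys.
    identity = tuple(data[key] for key in unique_keys)
    store[identity] = dict(data)
    return identity


store = {}
save_sketch(store, ["id"], {"id": 1, "name": "Alice"})
save_sketch(store, ["id"], {"id": 1, "name": "Alicia"})  # same id: overwrites
save_sketch(store, ["id"], {"id": 2, "name": "Bob"})     # new id: adds
```

After these three calls the store holds two records, because the second call replaced the first. Choosing unique_keys that really identify a row is what lets a scraper be re-run without creating duplicates.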

scraperwiki.geo functions

gb_postcode_to_latlng(postcode)
Returns a (lat, lng) pair for a UK postcode.
os_easting_northing_to_latlng(easting, northing)
Returns a (lat, lng) pair for OSGB easting and northing coordinates.
extract_gb_postcode(string)
Attempts to extract a UK postcode from a string.
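
As a rough illustration of what postcode extraction involves, the sketch below uses a simplified regular expression. It is not the code behind extract_gb_postcode: the name extract_postcode_sketch is hypothetical, the pattern does not check that the postcode area really exists, and returning None when nothing matches is an assumption.

```python
import re

# Simplified pattern: outward code (e.g. SW1A) followed by inward code (e.g. 2AA).
# It does not validate against the list of real postcode areas.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b")


def extract_postcode_sketch(text):
    """Return the first postcode-shaped substring of text, or None."""
    # Upper-case first so that lower-case postcodes are also found.
    match = POSTCODE_RE.search(text.upper())
    if match is None:
        return None
    # Normalise to the usual "OUTWARD INWARD" form with a single space.
    return match.group(1) + " " + match.group(2)
```

A postcode found this way could then be passed to gb_postcode_to_latlng to obtain coordinates for the latlng field of a saved record.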

scraperwiki.metadata functions

save(key, value)
Saves a persistent variable that is accessible during later runs.
get(key, default=None)
Retrieves a persistent variable saved by this scraper.
save("chart", value)
Sets the Google Chart image for the scraper.
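
One common use of persistent variables is resuming a long scrape where the previous run stopped. The sketch below shows that pattern with an in-memory stand-in that mirrors the save and get signatures listed above. MetadataSketch, run_once and the "last_page" key are all hypothetical names for illustration; the real scraperwiki.metadata persists values between runs, which this stand-in does not.

```python
class MetadataSketch:
    """In-memory stand-in exposing the save/get signatures listed above."""

    def __init__(self):
        self._values = {}

    def save(self, key, value):
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)


def run_once(metadata, pages_per_run=3):
    """Process a few pages, resuming from where the previous run stopped."""
    # Default to 0 so the very first run starts at page 1.
    start = metadata.get("last_page", 0) + 1
    done = list(range(start, start + pages_per_run))
    # Record progress so the next run continues after the last page handled.
    metadata.save("last_page", done[-1])
    return done


metadata = MetadataSketch()
first_run = run_once(metadata)
second_run = run_once(metadata)
```

Here the first run handles pages 1 to 3 and the second picks up at page 4, because the page counter survives between the two calls.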

Recommended libraries

urllib2, urlparse
Standard Python libraries for opening and parsing URLs.
BeautifulSoup
For parsing HTML.
mechanize
Automates submitting forms.
lxml
For parsing HTML and selecting elements using cssselect.
pdftoxml
Converts the binary data of a PDF file into a parsable XML string.
Python Google Chart
For generating charts using the Google Chart API.

Handy tutorials

Click on one to load it up.

How to Write a Screen Scraper: 1
Start here: check the ScraperWiki interface is working, then learn how to download a web page.
How to Write a Screen Scraper: 2
Slightly more advanced: scrape data from raw HTML, and save it to the ScraperWiki datastore.
How to Write a Screen Scraper: 3
Doing it again and again: following 'next' links to scrape multiple pages.
How to Write a Screen Scraper: 4
Using multiple sources of data: pulling in JSON from an external API to combine with your scraper.
Advanced Scraping: .ASPX Pages
Scrape ASP.NET web pages (with an .aspx extension) using the Mechanize library.
Advanced Scraping: Pages Behind Forms
Scrape pages behind forms: using the Mechanize library.
Advanced Scraping: Excel Files
Scrape Excel files using the xlrd library.
Advanced Scraping: CSV Files
Scrape CSV files using the csv library.
Advanced Scraping: PDFs
Scrape PDF files using ScraperWiki's pdftoxml library.
Advanced Scraping: Geodata
Scraping geographic data with ScraperWiki - extracting UK postcodes from strings, and converting to latitude/longitude.
Presenting Your Data: Charts
Create charts from your data: make bar charts and pie charts using the pygooglechart library.
Presenting Your Data: Mapped Charts
Map charts of your data: put the charts you've created onto a Google map.
Alternative Scraping Libraries: lxml
Demonstrates the use of lxml - an alternative to BeautifulSoup that is particularly useful for selecting elements by CSS class.
Alternative Scraping Libraries: Regular Expressions
Demonstrates using regular expressions to clean up data from the National Lottery grants database.