1. Select the “Code in your browser” tool
After registering and logging in, click the “Create a new dataset” button on your homepage.
You’ll be shown all the tools you can use to populate your new dataset.
We’re going to use the “Code in your browser” tool. Click it.
2. Pick a language
QuickCode supports dozens of languages.
We recommend Python, because it has a clean syntax and great data science libraries.
We will use Python for this tutorial.
3. Name your dataset
We’re going to scrape the UPS corporate blog, although with small changes this should work for any WordPress blog.
Use the dropdown menu next to “Untitled dataset” to rename your dataset to something like “UPS blog posts”.
4. Scrape the data
Copy and paste this code into the code editor. It downloads the front page of the blog, and extracts information about each article.
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html

# Download the front page of the blog and parse it into a DOM tree
html = requests.get("http://blog.ups.com").content
dom = lxml.html.fromstring(html)

# Each article sits in an element with the class "theentry"
for entry in dom.cssselect('.theentry'):
    post = {
        'title': entry.cssselect('.entry-title')[0].text_content(),
        'author': entry.cssselect('.the-meta a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href'),
        'comments': int(entry.cssselect('.comment-number')[0].text_content())
    }
    print post
Press the Run button. You’ll see information about each post printed in the console window.
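To see what each selector picks out without touching the network, here is a minimal sketch that runs the same extraction on a made-up HTML fragment. It is not the tutorial's scraper: it swaps lxml and cssselect for the standard library's xml.etree.ElementTree, so it only accepts well-formed markup and matches class attributes exactly. The sample markup, the example.com URLs and the helper names text_of and extract_posts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment imitating the markup the scraper expects.
SAMPLE = """
<div>
  <div class="theentry">
    <h2 class="entry-title"><a href="http://example.com/first-post">First post</a></h2>
    <p class="the-meta">By <a href="http://example.com/author/alice">Alice</a></p>
    <span class="comment-number">3</span>
  </div>
  <div class="theentry">
    <h2 class="entry-title"><a href="http://example.com/second-post">Second post</a></h2>
    <p class="the-meta">By <a href="http://example.com/author/bob">Bob</a></p>
    <span class="comment-number">0</span>
  </div>
</div>
"""


def text_of(element):
    """Rough stand-in for lxml's text_content(): all nested text, stripped."""
    return "".join(element.itertext()).strip()


def extract_posts(markup):
    """Build one dict per entry, mirroring the fields used in the tutorial."""
    root = ET.fromstring(markup)
    posts = []
    # [@class='...'] is an exact attribute match, unlike a CSS class selector.
    for entry in root.findall(".//*[@class='theentry']"):
        meta = entry.find(".//*[@class='the-meta']")
        posts.append({
            "title": text_of(entry.find(".//*[@class='entry-title']")),
            "author": text_of(meta.find(".//a")),
            # The first link in the entry, as in entry.cssselect('a')[0]
            "url": entry.find(".//a").get("href"),
            "comments": int(text_of(entry.find(".//*[@class='comment-number']"))),
        })
    return posts
```

Calling extract_posts(SAMPLE) returns a list with one dictionary per entry, the same shape as the post dictionaries the scraper prints.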
5. Save to the datastore
To save to the datastore, add this line to your code, just after the print post line. Make sure it is indented to the same level as print post, so that it runs once for every post.
scraperwiki.sql.save(['url'], post)
You don't have to use this special function. Any library, in any language, that creates a SQLite database file called scraperwiki.sqlite will do.
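As a rough illustration of that alternative, the sketch below writes posts with Python's built-in sqlite3 module instead of the scraperwiki library. The url column is the primary key, so saving a post whose URL is already stored replaces the old row, which is the role the ['url'] argument plays in the call above. The table name swdata, the column types and the helper name save_post are assumptions for this example, not something the platform is confirmed to require.

```python
import sqlite3

# Assumed table name; adjust it if your tools expect a different one.
TABLE = "swdata"


def save_post(conn, post):
    """Insert a post, replacing any existing row that has the same url."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS " + TABLE + " ("
        "url TEXT PRIMARY KEY, title TEXT, author TEXT, comments INTEGER)"
    )
    # Named placeholders are filled from the keys of the post dictionary.
    conn.execute(
        "INSERT OR REPLACE INTO " + TABLE + " (url, title, author, comments) "
        "VALUES (:url, :title, :author, :comments)",
        post,
    )
    conn.commit()
```

To use it in the scraper, open the file once before the loop with conn = sqlite3.connect('scraperwiki.sqlite') and call save_post(conn, post) where the save line would go.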
6. Use your data
QuickCode is built out of lots of tools that let you do stuff with your data. The tools always appear in the grey toolbar next to your dataset’s name.
Click the orange “View in a table” icon to see your data in a flexible table view.
Or click More tools… to do other things like automatically summarising your data or publishing it to a CKAN datahub.