Local ScraperWiki Library
It quite annoyed me that you can only use the scraperwiki library on a ScraperWiki instance; most of it could work fine elsewhere. So I’ve pulled it out (well, for Python at least) so you can use it offline.
How to use
pip install scraperwiki_local
You can then import scraperwiki in scripts run on your local computer. The scraperwiki.sqlite component is powered by DumpTruck, which you can optionally install independently of scraperwiki_local.
pip install dumptruck
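Once installed, your scripts use exactly the same calls they would on a ScraperWiki server, with the data going into a local SQLite database file. A quick sketch with made-up data (save takes the list of unique-key columns as its first argument; swdata is the default table):

import scraperwiki

# Save one row; the first argument names the unique-key columns.
scraperwiki.sqlite.save(['id'], {'id': 1, 'item': 'carrots'})

# Read it back from the default table, swdata.
print(scraperwiki.sqlite.select('* from swdata where id = 1'))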
Differences
DumpTruck works a bit differently from (and better than) the hosted ScraperWiki library, but the change shouldn’t break much existing code. To give you an idea of the ways they differ, here are two examples:
Complex cell values
What happens if you do this?
import scraperwiki
shopping_list = ['carrots', 'orange juice', 'chainsaw']
scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
On a ScraperWiki server, shopping_list is converted to its unicode representation, which looks like this:
[u'carrots', u'orange juice', u'chainsaw']
In the local version, it is encoded to JSON, so it looks like this:
["carrots","orange juice","chainsaw"]
If the value can’t be encoded to JSON, you get an error. And when you retrieve it, it comes back as a list rather than as a string.
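Here’s a quick way to watch that round trip, reusing the shopping list from above:

import scraperwiki

scraperwiki.sqlite.save([], {'shopping_list': ['carrots', 'orange juice', 'chainsaw']})
row = scraperwiki.sqlite.select('* from swdata')[0]

# A list in the local version; a unicode string on a ScraperWiki server.
print(type(row['shopping_list']))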
Case-insensitive column names
SQL is less case-sensitive than Python. The following code works fine in both versions of the library.
In [1]: shopping_list = ['carrots', 'orange juice', 'chainsaw']
In [2]: scraperwiki.sqlite.save([], {'shopping_list': shopping_list})
In [3]: scraperwiki.sqlite.save([], {'sHOpPiNg_liST': shopping_list})
In [4]: scraperwiki.sqlite.select('* from swdata')
Out[4]: [{u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}, {u'shopping_list': [u'carrots', u'orange juice', u'chainsaw']}]
Note that the key in the returned data is ‘shopping_list’ and not ‘sHOpPiNg_liST’; the database uses the first one that was sent. Now let’s retrieve the individual cell values.
In [5]: data = scraperwiki.sqlite.select('* from swdata')
In [6]: [row['shopping_list'] for row in data]
Out[6]: [[u'carrots', u'orange juice', u'chainsaw'], [u'carrots', u'orange juice', u'chainsaw']]
The code above works in both versions of the library, but the code below only works in the local version; it raises a KeyError on the hosted version.
In [7]: data[0]['Shopping_List']
Out[7]: [u'carrots', u'orange juice', u'chainsaw']
Here’s why. In the hosted version, scraperwiki.sqlite.select returns a list of ordinary dictionaries. In the local version, scraperwiki.sqlite.select returns a list of special dictionaries that have case-insensitive keys.
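If you’re curious what such a special dictionary might look like, here is a rough sketch of the idea in plain Python; this is not DumpTruck’s actual implementation, just an illustration:

# A minimal dict subclass whose string keys match regardless of case.
class CaseInsensitiveDict(dict):
    @staticmethod
    def _fold(key):
        return key.lower() if isinstance(key, str) else key

    def __init__(self, *args, **kwargs):
        super(CaseInsensitiveDict, self).__init__()
        for key, value in dict(*args, **kwargs).items():
            self[key] = value

    def __setitem__(self, key, value):
        super(CaseInsensitiveDict, self).__setitem__(self._fold(key), value)

    def __getitem__(self, key):
        return super(CaseInsensitiveDict, self).__getitem__(self._fold(key))

row = CaseInsensitiveDict({'shopping_list': ['carrots', 'orange juice', 'chainsaw']})
print(row['Shopping_List'])  # lookup succeeds despite the different case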
Develop locally
Here’s a start at developing ScraperWiki scripts locally, in whatever coding environment you are used to. For a lot of things, the local library will behave the same as the hosted one; for a lot of other things there will be differences, but the differences won’t matter.
If you want to develop locally (just Python for now), you can use the local library and then move your script to a ScraperWiki script when you’ve finished developing it (perhaps using Thom Neale’s ScraperWiki scraper). Or you could just run it somewhere else, like your own computer or web server. Enjoy!
Comments
Is there anything like this for Ruby?
Alas no! We hope somebody will make one as we roll out x.scraperwiki.com to help migrate scripts to it 🙂
Nowadays you do “pip install scraperwiki” (instead of scraperwiki_local).
Hi, I’ve tried to install scraperwiki, but it needs the pdftoxml module, and I can’t for the life of me find the module to install! Please could you advise where to find it?
Same problem here! Would love to know where to find the pdftohtml module.