Lots of new libraries

We’ve had lots of requests recently for new 3rd party libraries to be accessible from within ScraperWiki. For those of you who don’t know, yes, we take requests for installing libraries! Just send us word on the feedback form and we’ll be happy to install them.

Also, let us know why you want them, as it’s great to know what you guys are up to. Ross Jones has been busily adding them (he is powered by beer, if you ever see him and want to return the favour).

Find them listed in the “3rd party libraries” section of the documentation.

In Python, we’ve added:

csvkit, a bunch of tools for handling CSV files made by Christopher Groskopf at the Chicago Tribune. Christopher is now lead developer on PANDA, a Knight News Challenge winner that is building a newsroom data hub.
requests, a lovely new layer over Python’s HTTP libraries made by Kenneth Reitz. Makes it easier to get and post.
Scrapemark is a way of extracting data from HTML using reverse templates. You give it the HTML and the template, and it pulls out the values.
pipe2py was requested by Tony Hirst, and can be used to migrate from Yahoo Pipes to ScraperWiki.
PyTidyLib, to access the old classic C library that cleans up HTML files.
SciPy is at the analysis end, and builds on NumPy giving code for statistics, Fourier transforms, image processing and lots more.
matplotlib can almost magically make PNG charts. See this example that The Julian knocked up, with the boilerplate to run it from a ScraperWiki view.
Google Data (gdata) for calling various Google APIs to get data.
Twill is its own browser automation scripting language and a layer on top of Mechanize.
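
To make the "reverse template" idea behind Scrapemark a bit more concrete, here is a minimal sketch using only Python's standard `re` module. It is not Scrapemark's API, and Scrapemark's real template syntax may differ. The `{{ name }}` placeholder style, the helper names `compile_template` and `extract`, and the sample HTML are all assumptions made up for illustration.

```python
import re

# Hypothetical placeholder syntax for this sketch: {{ name }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(template):
    """Turn a template with {{ name }} placeholders into a compiled regex."""
    # Splitting on a pattern with one capture group alternates between
    # literal text (even indexes) and placeholder names (odd indexes).
    parts = PLACEHOLDER.split(template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal markup must match exactly.
            pattern += re.escape(part)
        else:
            # A placeholder captures as little text as possible.
            pattern += "(?P<%s>.*?)" % part
    return re.compile(pattern, re.DOTALL)


def extract(template, html):
    """Return a dict of captured values, or None if the template does not match."""
    match = compile_template(template).search(html)
    return match.groupdict() if match else None


# Made-up sample data for the sketch.
html = '<ul><li><a href="/scrapers/42">Council budgets</a></li></ul>'
template = '<a href="{{ link }}">{{ title }}</a>'

print(extract(template, html))
```

You hand over the HTML and a template that looks like the HTML, and get back the values that sat where the placeholders are. A real reverse-template library also has to cope with whitespace differences, repeated blocks and messy markup, which this sketch ignores. Note too that a placeholder at the very end of a template would capture nothing here, because the match is non-greedy.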

In Ruby, we’ve added:

tmail for parsing emails.
typhoeus, a wrapper round curl with an easier syntax that also lets you make parallel HTTP requests.
Google Data (gdata) for calling various Google APIs.

In PHP, we’ve added:

GeoIP for turning IP addresses into countries and cities.

Let us know if there are any libraries you need for scraping or data analysis!

Tags: coding, libraries, PHP, programming, python, ruby

2 Responses to “Lots of new libraries”

mazadillon October 26, 2011 at 2:41 pm #

I don’t think it’s documented anywhere but I found that I could use pdf2txt.py which is part of PDF Miner by using the ‘bash trick’ someone discussed on the mailing list – basically running a local command in the sandbox.

Perhaps worth mentioning somewhere?

Francis Irving October 26, 2011 at 5:04 pm #

There’s a small mention in this FAQ: https://scraperwiki.com/docs/python/faq/#files

But no, it isn’t very clear! Will update the FAQ a bit to have a separate question saying you can spawn external commands, with a link to some examples.

Just to get a handle on your use case, are you calling PDF Miner from Ruby or PHP? Assuming Python people would call the API?
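
For anyone wondering what spawning an external command looks like from Python, here is a rough sketch with the standard `subprocess` module. It assumes a modern Python 3 interpreter, which may not match what the sandbox runs. The helper name `run_command` is made up for this example, and the command it runs is a harmless stand-in rather than pdf2txt.py itself.

```python
import subprocess
import sys


def run_command(args):
    """Run an external command and return its standard output as text."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


# Stand-in command: ask the current Python interpreter to print a line.
output = run_command([sys.executable, "-c", "print('hello from a child process')"])
print(output.strip())
```

To call a command-line tool such as pdf2txt.py this way, you would pass its name and arguments as the list instead, for example `["pdf2txt.py", "document.pdf"]`, where `document.pdf` is a hypothetical file name. Whether that tool is on the path inside the sandbox is something to check rather than assume.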
