
ScraperWiki supports a number of third-party Python libraries that we recommend for screen scraping, data analysis and data visualisation.

If you would like us to add a library that isn't listed here, please get in touch.

Downloading

requests
Humane and Pythonic way of opening URLs. docs
urllib2, urlparse
Standard Python libraries for opening URLs. docs
gevent
Networking library with a synchronous event loop. docs
mechanize
Navigate and complete HTML forms. docs
twill
A thin shell around mechanize, which you might find easier. docs
selenium
Automate operating real browsers like Firefox. NB: Only useful in ScraperWiki if you have a Selenium server to point it to. docs
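
The names urllib2 and urlparse above are Python 2 modules. As a minimal sketch, assuming a Python 3 interpreter (where the URL-handling functions live in urllib.parse), splitting a URL and resolving a relative link might look like the following. The URL is a made-up example and nothing is fetched.

```python
from urllib.parse import urlparse, parse_qs, urljoin

# Hypothetical URL, used only for illustration; no network request is made.
url = "http://example.com/listings/page.html?region=north&page=2"

# Break the URL into scheme, host, path, query and so on.
parts = urlparse(url)

# Decode the query string; values are lists because keys can repeat.
query = parse_qs(parts.query)

# Resolve a relative link (as found in an href) against the page URL.
next_page = urljoin(url, "page.html?region=north&page=3")

print(parts.netloc, parts.path, query["page"], next_page)
```

Resolving links with urljoin rather than string concatenation tends to be the safer choice when a scraper follows pagination links.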

Parsing

XML (HTML, RSS, Atom...)

lxml
Highly effective HTML parser with a specialist screen scraping library. docs (also stackoverflow)
html5lib
An HTML 5 parser, which also copes with invalid documents the same way that major desktop web browsers do. docs
Beautiful Soup
An alternative popular HTML parser. docs
Beautiful Soup 4
Idiomatic ways of navigating, searching, and modifying the parse tree. Home page, quick start and tips on porting from Beautiful Soup 3
pyquery
Make jQuery-like queries from Python. docs
PyTidyLib
Calls out to the Tidy library, which cleans up bad HTML. docs
python-stdnum
Handle standardized numbers and codes, from VAT to books. docs
Universal Feed Parser
Parse RSS and Atom feeds in Python. docs

HTML-parser template languages

Scrapemark
Scraping library that uses templates to extract content. docs
scrapely
Given example web pages and data, constructs a parser for similar pages. docs
scrapy
An application framework for crawling web sites and extracting structured data. docs

Other formats

demjson
Fancier JSON library than the built in one. docs
xlrd, xlwt, xlutils
Read, write and process old Excel .xls files. docs
openpyxl
Read and write Excel .xlsx/.xlsm files. docs
csvkit
A library of utilities to manipulate CSV files. docs
PDFMiner
Python PDF parser and analyzer. docs
pyPdf
Manipulate PDF files. docs
iCalendar
Parse and generate calendar files. docs
RDFLib
Input and output linked data triples as RDF. docs
OpenStreetMap XML/PBF
Fast and easy reading of OpenStreetMap files with imposm.parser. docs

Geocoding

geopy
Converts addresses into latitude/longitude, and measures distances on the earth. docs
GeoIP
Convert IP addresses into countries and similar. docs
pyephem
Scientific-grade astronomy routines. docs
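
To give a feel for what "measures distances on the earth" involves, here is a rough standard-library sketch of the haversine great-circle formula. It is not geopy's API: the function name, the spherical-earth assumption and the sample coordinates are ours, and a dedicated library may use a more accurate ellipsoidal model.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the earth is treated as a sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates, for illustration only.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
distance = haversine_km(*london, *paris)
print(round(distance), "km")
```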

Data pipes

Python YQL
Yahoo Query Language, an expressive SQL-like language that lets you query, filter, and join data across Web services. docs
Google Data (GData)
Access any service using the Google Data protocol. docs
pipe2py
Converts a Yahoo Pipe into Python so you can run it on ScraperWiki. docs
Fluidinfo, FOM (Fluid Object Mapper)
Read or write to the Fluidinfo shared database. docs: fluidinfo, fom
Tweepy
Twitter API, to read or send tweets. docs
suds
Lightweight SOAP client. docs

Visualising and analysing

General analysis

NumPy
Large, multi-dimensional arrays, matrices and functions to operate on them. docs
SciPy
Uses NumPy to do advanced math, signal processing, optimization, statistics and much more. docs
RPy
Call out to GNU R, a popular statistical computing and graphics package. docs
pandas
An expressive framework for data analysis - the same purpose as R, but within Python. docs
Data Science Toolkit (DSTK)
A collection of the best open data sets and open-source tools for data science. docs
Jellyfish
Approximate and phonetic matching of strings. docs

Plotting

matplotlib
Easily generate lots of charts. docs, example (with ScraperWiki boilerplate)
pygooglechart
Generate charts using the Google Chart API. docs
gviz_api
Helper for making data sources for use with the Google Visualization API. docs

Natural language processing

Natural Language Toolkit (NLTK)
Natural language processing and text analytics. docs
Gensim
"Topic Modelling for Humans". docs

Network analysis

pydot
Lay out networks of nodes (graphs) using Graphviz. docs
NetworkX
Analyse and draw complex networks (maths graphs). docs
igraph
Create and manipulate undirected and directed graphs. docs
Python Levenshtein
Compute string distances and similarities. docs
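
For readers unfamiliar with string distances, the following pure-Python sketch computes the Levenshtein (edit) distance with the usual dynamic-programming approach. It only illustrates the idea; it is not the interface of the Python Levenshtein package, and the function name is hypothetical. A compiled library will be much faster on large inputs.

```python
def edit_distance(a, b):
    """Minimum number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Keeping only two rows of the table at a time means memory use grows with the length of one string rather than the product of both.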

Other

chardet
Universal character encoding detector. docs
Colorific
Automatic color palette detection. more info
unidecode
US-ASCII transliterations of Unicode text. docs
pbkdf2
Password-based encryption key derivation. docs
Dexy
Literate documentation tool. docs
Python Imaging Library (PIL)
Image processing, lots of file formats. docs
bit.ly API
Shorten URLs. docs
Dropbox API
Sync files. docs
Google AdWords API
Control advertising campaigns on Google. docs
Google API client
Call lots of Google APIs. docs
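
As a rough illustration of the ASCII transliteration that unidecode is listed for, the sketch below strips accents using only the standard library. It is a much cruder stand-in, not unidecode's API: the function name is hypothetical, and characters with no Unicode decomposition (such as "ß") are simply dropped rather than transliterated.

```python
import unicodedata


def rough_ascii(text):
    """Strip accents by decomposing characters, then dropping non-ASCII code points."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" discards the combining marks and anything else non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(rough_ascii("Café Zürich"))
```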