ScraperWiki supports a number of third-party Python libraries that we recommend for screen scraping, data analysis and data visualisation.
If you would like us to add a library that isn't listed here, please get in touch.
Downloading
- requests
- Humane and Pythonic way of opening URLs. docs
- urllib2, urlparse
- Standard Python libraries for opening URLs. docs
- gevent
- Networking library with synchronous event loop. docs
- mechanize
- Navigate and complete HTML forms. docs
- twill
- A thin shell around mechanize, which you might find easier. docs
- selenium
- Automate operating real browsers like Firefox. NB: Only useful in ScraperWiki if you have a Selenium server to point it to. docs
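Links found on a downloaded page are often relative and need resolving against the page URL before they can be fetched in turn. The following is a minimal sketch of that step using only the standard library. It assumes Python 3, where the functionality of urlparse lives in urllib.parse, and the URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page URL and links, used purely for illustration.
page_url = "http://example.com/reports/2012/index.html"
links = [
    "summary.html",
    "../2011/index.html",
    "/about",
    "http://other.example.org/data.csv",
]

# Resolve every link against the page it was found on.
absolute = [urljoin(page_url, link) for link in links]

# Keep only the links that stay on the same host as the page.
page_host = urlparse(page_url).netloc
same_host = [url for url in absolute if urlparse(url).netloc == page_host]

for url in absolute:
    print(url)
```

Filtering on the host, as same_host does here, is one simple way to stop a scraper wandering off the site it was pointed at.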
Parsing
XML (HTML, RSS, Atom...)
- lxml
- Highly effective HTML parser with a specialist screen scraping library. docs (also stackoverflow)
- html5lib
- An HTML 5 parser, which also copes with invalid documents the same way that major desktop web browsers do. docs
- Beautiful Soup
- An alternative popular HTML parser. docs
- Beautiful Soup 4
- Idiomatic ways of navigating, searching, and modifying the parse tree. Home page, quick start and tips on porting from Beautiful Soup 3
- pyquery
- Make jQuery-like queries from Python. docs
- PyTidyLib
- Calls out to the Tidy library, which cleans up bad HTML. docs
- python-stdnum
- Handle standardized numbers and codes, from VAT to books. docs
- Universal Feed Parser
- Parse RSS and Atom feeds in Python. docs
- Scrapemark
- Scraping library that uses templates to extract content. docs
- scrapely
- Given example web pages and data, constructs a parser for similar pages. docs
- scrapy
- An application framework for crawling web sites and extracting structured data. docs
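To give a feel for what the HTML parsers above do, here is a rough sketch that pulls link targets out of a slightly broken HTML fragment. It deliberately uses html.parser from the Python 3 standard library as a stand-in, not any of the libraries listed, so that it runs without extra installs. The LinkCollector class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample markup with unclosed <li> and <p> tags.
html = '<ul><li><a href="/a.html">A</a><li><a href="/b.html">B</a></ul><p>unclosed'

collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

The libraries listed above offer far more convenient navigation (XPath, CSS selectors, tree searching) than this event-style interface, which is the main reason to prefer them for real scrapers.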
HTML-parser template languages
Other formats
- demjson
- Fancier JSON library than the built in one. docs
- xlrd, xlwt, xlutils
- Read, write and process old Excel .xls files. docs
- openpyxl
- Read and write Excel .xlsx/.xlsm files. docs
- csvkit
- A library of utilities to manipulate CSV files. docs
- PDFMiner
- Python PDF parser and analyzer. docs
- pyPdf
- Manipulate PDF files. docs
- iCalendar
- Parse and generate calendar files. docs
- RDFLib
- Input and output linked data triples as RDF. docs
- OpenStreetMap XML/PBF
- Fast and easy reading of OpenStreetMap files with imposm.parser. docs
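Much scraped data ends up moving between CSV and JSON. As a minimal sketch of that conversion, the example below uses only the standard csv and json modules (not csvkit or demjson) and assumes Python 3. The sample data is made up.

```python
import csv
import io
import json

# Made-up sample data standing in for a downloaded CSV file.
raw = "name,population\nLeeds,751485\nYork,198051\n"

# DictReader uses the first row as field names for every later row.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV values are always strings, so convert numeric columns explicitly.
for row in rows:
    row["population"] = int(row["population"])

print(json.dumps(rows, indent=2))
```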
Geocoding
- geopy
- Converts addresses into latitude/longitude, and measures distances on the earth. docs
- GeoIP
- Convert IP addresses into countries and similar. docs
- pyephem
- Scientific-grade astronomy routines. docs
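Measuring a distance on the earth from two latitude/longitude pairs can be sketched with the haversine formula. This is an illustration of the idea in plain Python, not geopy's API or its method. It assumes a spherical earth with a mean radius of 6371 km, so results are approximate, and haversine_km is a name invented for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for London and Paris, for illustration only.
print(round(haversine_km(51.5074, -0.1278, 48.8566, 2.3522), 1), "km")
```

For serious work an ellipsoidal model is more accurate than the spherical approximation used here.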
Data pipes
- Python YQL
- Yahoo Query Language, an expressive SQL-like language that lets you query, filter, and join data across Web services. docs
- Google Data (GData)
- Access any service using the Google Data protocol. docs
- pipe2py
- Converts a Yahoo Pipe into Python so you can run it on ScraperWiki. docs
- Fluidinfo, FOM (Fluid Object Mapper)
- Read or write to the Fluidinfo shared database. docs: fluidinfo, fom
- Tweepy
- Twitter API, to read or send tweets. docs
- suds
- Lightweight SOAP client. docs
Visualising and analysing
General analysis
- NumPy
- Large, multi-dimensional arrays, matrices and functions to operate on them. docs
- SciPy
- Uses NumPy to do advanced math, signal processing, optimization, statistics and much more. docs
- RPy
- Call out to GNU R, a popular statistical computing and graphics package. docs
- pandas
- An expressive framework for data analysis, serving the same purpose as R but within Python. docs
- Data Science Toolkit (DSTK)
- A collection of the best open data sets and open-source tools for data science. docs
- Jellyfish
- Approximate and phonetic matching of strings. docs
Plotting
- matplotlib
- Easily generate lots of charts. docs, example (with ScraperWiki boilerplate)
- pygooglechart
- Generate charts using the Google Chart API. docs
- gviz_api
- Helper for making data sources for use with the Google Visualization API. docs
Natural language processing
- Natural Language Toolkit (NLTK)
- Natural language processing and text analytics. docs
- Gensim
- "Topic Modelling for Humans". docs
Network analysis
- pydot
- Lay out networks of nodes (graphs) using Graphviz. docs
- NetworkX
- Analyse and draw complex networks (maths graphs). docs
- igraph
- Create and manipulate undirected and directed graphs. docs
- Python Levenshtein
- Compute string distances and similarities. docs
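The string distance that Python Levenshtein is named after counts the single-character insertions, deletions and substitutions needed to turn one string into another. The pure-Python sketch below shows the classic dynamic-programming approach. It is an illustration only, not the library's implementation or API, and the function name is made up for this example.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A small distance relative to string length is a useful signal when matching misspelt names across scraped data sets.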
Other
- chardet
- Universal character encoding detector. docs
- Colorific
- Automatic color palette detection. more info
- unidecode
- US-ASCII transliterations of Unicode text. docs
- pbkdf2
- Password-based encryption key derivation. docs
- Dexy
- Literate documentation tool. docs
- Python Imaging Library (PIL)
- Image processing, lots of file formats. docs
- bit.ly API
- Shorten URLs. docs
- Dropbox API
- Sync files. docs
- Google AdWords API
- Control advertising campaigns on Google. docs
- Google API client
- Call lots of Google APIs. docs
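Password-based key derivation, as offered by the pbkdf2 entry above, stretches a password and a salt into a fixed-length key by iterating a hash many times. The sketch below shows the idea with hashlib.pbkdf2_hmac from the Python 3 standard library, not the pbkdf2 package's own API. The derive_key helper, the iteration count and the sample password are assumptions chosen for illustration.

```python
import hashlib
import os


def derive_key(password, salt, iterations=100000, length=32):
    """Derive a fixed-length key from a password using PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=length)


password = b"correct horse battery staple"  # example password only
salt = os.urandom(16)                       # random salt, stored alongside the key

key = derive_key(password, salt)
print(len(key), "byte key:", key.hex())
```

The same password and salt always give the same key, so the salt must be kept in order to check a password later.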