The twitter search and follower tools are amongst the most popular on the ScraperWiki platform so we are looking to provide more value in this area. To this end I’ve been reading “Mining the Social Web” by Matthew A. Russell.
In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters on Twitter and Facebook it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web pages scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduces MongoDB as a document storage system. The final chapter is a twitter cookbook which includes lots of short recipes for simple twitter related activities but no further analysis. The coverage of each topic isn’t deep but it is practical – introducing the key libraries to do tasks. And it’s alive with suggests for further work, and references to help with that.
The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can either be code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself, almost too easy in fact, I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython notebook, this is likely not economically practical but it would be nice to collect the links to further reading there where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!
Mining the social web covers some material I had not come across in my earlier machine learning/ data mining reading. There are a couple of chapters containing material on graph theory using data from Facebook and GitHub data. In the way of benefitting from reading about the same material in different places, Russell highlights that cluster and de-duplication are of course facets of the same subject.
I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information. I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database. Particularly having watched two of my colleagues spend a morning working out how to do a what would be a simple SQL join in MongoDB. Mining the social web has made me wonder about giving MongoDB another chance.
The penultimate chapter is a discussion of the semantic web, introducing both microformats as well as RDF technology, although the discussion is much less concrete than earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema, to give an example: the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page by page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: a shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project on which we are collaborators, and which seeks to generate this type of semantic mark up for news articles using natural language processing.
Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.