graph theory – ScraperWiki Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000 en-US hourly 1 58264007 Book review: Mastering Gephi Network Visualisation by Ken Cherven Mon, 15 Jun 2015 07:49:27 +0000 1994_7344OS_Mastering Gephi Network VisualizationA little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi, it’s fair to say I was not very complementary about it. It was rather short, and had quite a lot of screenshots. It’s strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven is a different, and better, book.

Networks in this context are collections of nodes connected by edges, networks are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.

Mastering Gephi is a practical guide to using the Gephi Network visualisation software, it covers the more theoretical material regarding networks in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware, it is well-featured and under active development. Many of the network visualisations you see of, for example, twitter social networks, will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web, or taught courses then Cherven’s books are pretty much your only alternative.

The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.

I found the layouts chapter particularly useful, it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas, layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent.  From this chapter I discovered the directed acyclic graph (DAG) layout which can be downloaded as a Gephi plugin. Tip: I had to go search this plugin out manually in the Gephi Marketplace, it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.

I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.

I was aware of the ability of Gephi to show dynamic networks, ones that evolved over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing, they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.

The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are go to tell them, tell them and then tell them what you told them. This works well for the limited time available in a spoken presentation, repetition helps the audience remember, but it feels a bit like overkill in a book. In a book we can flick back to remind us what was written earlier.

It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic with a list of books appearing in the bibliography but references to some of the detail of algorithms only found in the text.

I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.

The publisher sent me this book, free of charge, for review.

Book review: Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem Sat, 03 Jan 2015 07:36:07 +0000 graphdatabasesRegular readers will know I am on a bit of a graph binge at the moment. In computer science and mathematics graphs are collections of nodes joined by edges, they have all sorts of applications including the study of social networks and route finding. Having covered graph theory and visualisation, I now move on to graph databases. I started on this path with Seven Databases in Seven Weeks which introduces the Neo4j graph database.

And so to Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem which, despite its general title, is really a book about Neo4j. This is no big deal since Neo4j is the leading open source graph database.

This is not just random reading, we’re working on an EU project, NewsReader, which makes significant use of RDF – a type of graph-shaped data. We’re also working on a project for a customer which involves traversing a hierarchy of several thousand nodes. This leads to some rather convoluted joining operations when done on a SQL database, a graph database might be better suited to the problem.

The book starts with some definitions, identifying the types of graph database (property graph, hypergraph, RDF). Neo4j uses property graphs where nodes and edges are distinct items and each can hold properties. In contrast RDF graphs are expressed as triples which encompass both edges and nodes. In hypergraphs multiple edges can be expressed as a single item. A second set of definitions are regarding the types of graph processing system: graph databases and graph analytical engines. Neo4j is designed to provide good performance for database-like queries, acting as a backing store for a web application rather than an analytical engine to carry out offline calculations. There’s also an Appendix comparing NoSQL databases which feels like it should be part of the introduction.

A key feature of native graph databases, such as Neo4j, is “index-free adjacency”. The authors don’t seem to define this well early in the book but later on whilst discussing the internals of Neo4j it is all made clear: nodes and edges are stored as fixed length records with references to a list of nodes to which they are connected. This means its very fast to visit a node, and then iterate over all of its attached neighbours. The alternative index-based lookups may involve scanning a whole table to find all links to a particular node. It is in the area of traversing networks that Neo4j shines in performance terms compared to SQL.

As Robinson et al emphasise in motivating the use of graph databases: Other types of NoSQL database and SQL databases are not built fundamentally around the idea of relationships between data except in quite a constrained sense. For SQL databases there is an overhead to carrying out join queries which are SQLs way of introducing relationships. As I hinted earlier storing hierarchies in SQL databases leads to some nasty looking, slow queries. In practice SQL databases are denormalised for performance reasons to address these cases. Graph databases, on the other hand, are all about relationships.

Schema are an important concept in SQL databases, they are used to enforce constraints on a database i.e. “this thing must be a string” or “this thing must be in this set”. Neo4j describes itself as “schema optional”, the schema functionality seems relatively recently introduced and is not discussed in this book although it is alluded to. As someone with a small background in SQL the absence of schema in NoSQL databases is always the cause of some anxiety and distress.

A chapter on data modelling and the Cypher query language feels like the heart of the book. People say that Neo4j is “whiteboard friendly” in that if you can draw a relationship structure on a whiteboard then you can implement it in Neo4j without going through the rigmarole of making some normalised schema that doesn’t look like what you’ve drawn. This seems fair up to a point, your whiteboard scribbles do tend to be guided to a degree by what your target system is, and you can go wrong with your data model going from whiteboard to data model, even in Neo4j.

I imagine it is no accident that more recent query languages like Cypher and SPARQL look a bit like SQL. Although that said, Cypher relies on ASCII art to MATCH nodes wrapped in round brackets and edges (relationships) wrapped in square brackets with arrows –>  indicating the direction of relationships:

MATCH (node1)-[rel:TYPE]->(node2)

which is pretty un-SQL-like!

Graph databases goes on to describe implementing an application using Neo4j. The example code in the book is in Java but there appears, in py2neo, to be a relatively mature Python client. The situation here seems to be in flux since searching the web brings up references to an older python-embedded library which is now deprecated. The book pre-dates Neo4j 2.0 which introduced some significant changes.

The book finishes with some examples from the real world and some demonstrations of popular graph theory analysis. I liked the real world examples of a social recommendation system, access control and parcel routing. The coverage of graph theory analysis was rather brief, and didn’t explicit use Cypher which would have made the presentation different from what you find in the usual graph theory textbooks.

Overall I have mixed feelings about this book: the introduction and overview sections are good, as is the part on Neo4j internals. It’s a rather slim volume, feels a bit disjointed and is not up to date with Ne04j 2.0 which has significant new functionality.  Perhaps this is not the arena for a dead-tree publication – the Neo4j website has a comprehensive set of reference and tutorial material, and if you are happy with a purely electronic version than you can get Graph Databases for free (here).

]]> 2 758222475
Book review: Linked by Albert-László Barabási Thu, 27 Nov 2014 11:51:11 +0000 linkedI am on a bit of a graph theory binge, it started with an attempt to learn about Gephi, the graph visualisation software, which developed into reading a proper grown up book on graph theory. I then learnt a little more about practicalities on reading Seven Databases in Seven Weeks, which included a section on Neo4J – a graph database. Now I move on to Linked by Albert-László Barabási, this is a popular account of the rise of the analysis of complex networks in the late nineties. A short subtitle used on earlier editions was “The New Science of Networks”. The rather lengthy subtitle on this edition is “How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life”.

In mathematical terms a graph is an abstract collection of nodes linked by edges. My social network is a graph comprised of people, the nodes, and their interactions such as friendships, which are the edges. The internet is a graph, comprising routers at the nodes and the links between them are edges. “Network” is a less formal term often used synonymously with graph, “complex” is more a matter of taste but it implies large and with a structure which cannot be trivially described i.e. each node has four edges is not a complex network.
The models used for the complex networks discussed in this book are the descendants of the random networks first constructed by Erdős and Rényi. They imagined a simple scheme whereby nodes in a network were randomly connected with some fixed probability. This generates a particular type of random network which do not replicate real-world networks such as social networks or the internet. The innovations introduced by Barabási and others are in the measurement of real world networks and new methods of construction which produce small-world and scale-free network models. Small-world networks are characterised by clusters of tightly interconnected nodes with a few links between those clusters, they describe social networks. Scale-free networks contain nodes with any number of connections but where nodes with larger numbers of connections are less common than those with a small number. For example on the web there are many web pages (nodes) with a few links (edges) but there exist some web pages with thousands and thousands of links, and all values in between.
I’ve long been aware of Barabási’s work, dating back to my time as an academic where I worked in the area of soft condensed matter. The study of complex networks was becoming a thing at the time, and of all the areas of physics soft condensed matter was closest to it. Barabási’s work was one of the sparks that set the area going. The connection with physics is around so-called power laws which are found in a wide range of physical systems. The networks that Barabási is so interested in show power law behaviour in the number of connections a node has. This has implications for a wide range of properties of the system such as robustness to the removal of nodes, transport properties and so forth. The book starts with some historical vignettes on the origins of graph theory, with Euler and the bridges of Königsberg problem. It then goes on to discuss various complex networks with some coverage of the origins of their study and the work that Barabási has done in the area. As such it is a pretty personal review. Barabási also recounts some of the history of “six degrees of separation”, the idea that everyone is linked to everyone else by only six links. This idea had its traceable origins back in the early years of the 20th century in Budapest.
Graph theory has been around for a long while, and the study of random networks for 50 years or so. Why the sudden surge in interest? It boils down to a couple of factors, the first is the internet which provides a complex network of physical connections on which a further complex network of connections sit in the form of the web. The graph structure of this infrastructure is relatively easy to explore using automatic tools, you can build a map of millions of nodes with relative ease compared to networks in the “real” world. Furthermore, this complex network infrastructure and the rise of automated experiments has improved our ability to explore and disseminate information on physical networks. For example, the network of chemical interactions in a cell, the network of actors in movies, our social interactions, the spread of disease and so forth. In the past getting such detailed information on large networks was tiresome and the distribution mechanisms for such data slow and inconvenient.
For a book written a few short years ago, Linked can feel strangely dated. It discusses Apple’s failure in the handheld computing market with the Newton palm top device, and the success of Palm with their subsequent range. Names of long forgotten internet companies float by, although even at the time of writing Google was beginning its dominance.
If you are new to graph theory and want an unchallenging introduction then Linked is a good place to start. It’s readable and has a whole load of interesting examples of scale free networks in the wild. Whilst not the whole of graph theory, this is where interesting new things are happening.
]]> 1 758222459
Book review: Mining the Social Web by Matthew A. Russell Mon, 20 Jan 2014 08:41:35 +0000 mining_the_social_web_coverThe twitter search and follower tools are amongst the most popular on the ScraperWiki platform so we are looking to provide more value in this area. To this end I’ve been reading “Mining the Social Web” by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters on Twitter and Facebook it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web pages scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduces MongoDB as a document storage system. The final chapter is a twitter cookbook which includes lots of short recipes for simple twitter related activities but no further analysis. The coverage of each topic isn’t deep but it is practical – introducing the key libraries to do tasks. And it’s alive with suggests for further work, and references to help with that.

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can either be code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself, almost too easy in fact, I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython notebook, this is likely not economically practical but it would be nice to collect the links to further reading there where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unerringly quickly – as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the social web covers some material I had not come across in my earlier machine learning/ data mining reading. There are a couple of chapters containing material on graph theory using data from Facebook and GitHub data. In the way of benefitting from reading about the same material in different places, Russell highlights that cluster and de-duplication are of course facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information. I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database. Particularly having watched two of my colleagues spend a morning working out how to do a what would be a simple SQL join in MongoDB. Mining the social web has made me wonder about giving MongoDB another chance.

The penultimate chapter is a discussion of the semantic web, introducing both microformats as well as RDF technology, although the discussion is much less concrete than earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema, to give an example: the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page by page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web: a shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU Newsreader project on which we are collaborators, and which seeks to generate this type of semantic mark up for news articles using natural language processing.

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.

]]> 2 758220897