Book review: Learning SPARQL by Bob DuCharme

The NewsReader project on which we are working at ScraperWiki uses semantic web technology and natural language processing to derive meaning from the news. We are building a simple API to give access to the NewsReader datastore, whose native interface is SPARQL. SPARQL is a SQL-like query language used to access data stored in the Resource Description Framework format (RDF).

I reached Bob DuCharme’s book, Learning SPARQL, through an idle tweet mentioning SPARQL, to which his book account replied. The book covers the fundamentals of the semantic web and linked data, the RDF standard, the SPARQL query language, performance, and building applications on SPARQL. It also talks about ontologies and inferencing, which are built on top of RDF.

As someone with a slight background in SQL and table-based databases, I have found my previous forays into the semantic web fraught, since I typically start by asking what the schema for an RDF store is. The answer to this question is “That’s the wrong question”. The triplestore is the basis of all RDF applications. As the name implies, each row contains a triple (i.e. three columns), and the columns are traditionally labelled subject, predicate and object. I found it easier to think in terms of resource, property name and property value. To give a concrete example: “David Beckham” is a resource, his height is the name of a property of David Beckham and, according to dbpedia, the value of this property is 1.8288 (metres, we must assume). The resource and property names must be provided in the form of URIs (uniform resource identifiers); the property value can be a URI or some normally typed entity such as a string or an integer.
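To make the triple idea concrete, here is a minimal Python sketch that holds a couple of triples as plain tuples and looks them up with a wildcard match. The property URIs under example.org are illustrative assumptions, not the ones dbpedia actually uses, and a real triplestore would index its data rather than scan a list.

```python
# A toy in-memory "triplestore": each row is (subject, predicate, object),
# or in the terms used above (resource, property name, property value).
# The example.org property URIs are made up for illustration.
BECKHAM = "http://dbpedia.org/resource/David_Beckham"
HEIGHT = "http://example.org/property/height"
NAME = "http://example.org/property/name"

TRIPLES = [
    (BECKHAM, HEIGHT, 1.8288),         # value is a typed number
    (BECKHAM, NAME, "David Beckham"),  # value is a string
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Everything the store holds about one resource:
about_beckham = match(TRIPLES, subject=BECKHAM)

# Only the height property, whichever resource it belongs to:
heights = match(TRIPLES, predicate=HEIGHT)
```

Note that there is no schema anywhere in this sketch: adding a new kind of property is just adding another row.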

The triples describe a network of nodes (the resources and property values), with property names being the links between them. With this infrastructure any network can be described by a set of triples. SPARQL is a query language that superficially looks much like SQL. It can extract arbitrary sets of properties from the network using the SELECT command, get a valid sub-network described by a set of triples using the CONSTRUCT command, and answer a question with a Yes/No answer using the ASK command. It can also tell you “everything” it knows about a particular URI using the DESCRIBE command, where “everything” is subject to the whim of the implementor. It supports a bunch of other keywords which feel familiar to SQListas, such as LIMIT, OFFSET, FROM, WHERE, UNION, ORDER BY, GROUP BY, and AS. In addition there are BIND, which allows the transformation of variables by functions, and VALUES, which allows you to make little data structures for use within queries. PREFIX provides shortcuts to domains of URIs: for example, http://dbpedia.org/resource/David_Beckham can be written dbpedia:David_Beckham, where dbpedia: is the prefix. SERVICE allows you to make queries across the internet to other SPARQL providers. OPTIONAL allows the addition of a variable which is not always present.
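As a rough illustration of how PREFIX shortcuts work, the Python sketch below expands a prefixed name back to its full URI and assembles the text of a small query using a few of the keywords mentioned above. The dbpedia: prefix comes from the example in the text; the ex: prefix and its height and name properties are hypothetical, and the query is only built as a string, not sent to any endpoint.

```python
# Prefix table. "dbpedia" is the prefix from the text above; "ex" is a
# hypothetical prefix invented for this example.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "ex": "http://example.org/property/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dbpedia:David_Beckham' to a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name  # no known prefix, so leave the name untouched


def prefix_header(prefixes=PREFIXES):
    """Render the PREFIX lines that open a SPARQL query."""
    return "\n".join(f"PREFIX {p}: <{uri}>" for p, uri in prefixes.items())


# Query text using PREFIX, SELECT, WHERE, OPTIONAL and LIMIT.
# The ex:height and ex:name properties are assumptions, not real dbpedia terms.
QUERY = prefix_header() + """
SELECT ?height ?name
WHERE {
  dbpedia:David_Beckham ex:height ?height .
  OPTIONAL { dbpedia:David_Beckham ex:name ?name }
}
LIMIT 10
"""

full_uri = expand("dbpedia:David_Beckham")
```

Here the OPTIONAL block means a result row is still returned when no name triple exists; the ?name variable is simply left unbound.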

The core of a SPARQL query is a list of triples which act as selectors for the triples required, plus FILTERs which further narrow the results by carrying out calculations on the individual members of the triple. Each selector triple is terminated with either “ .” or “ ;”. The “ ;” indicates that the next triple is written as a pair (property name and value only), with its first element taken to be the same as that of the current triple. I mention this because Googling for the meaning of punctuation is rarely successful.
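The selector-and-FILTER idea can be sketched in a few lines of Python. This is a toy matcher written for this review, not how any real SPARQL engine is implemented, and apart from the height value quoted earlier all the data and the ex: names are made up. The comment above the query shows the equivalent SPARQL, including the “ ;” shorthand.

```python
# Toy data; the ex: names are hypothetical.
TRIPLES = [
    ("ex:beckham", "ex:height", 1.8288),
    ("ex:beckham", "ex:team", "ex:teamA"),
    ("ex:alice", "ex:height", 1.70),
    ("ex:alice", "ex:team", "ex:teamA"),
]


def is_variable(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def extend(binding, pattern, triple):
    """Return `binding` extended so `pattern` matches `triple`, else None."""
    result = dict(binding)
    for want, have in zip(pattern, triple):
        if is_variable(want):
            if want in result and result[want] != have:
                return None  # clashes with a value bound by an earlier selector
            result[want] = have
        elif want != have:
            return None  # a constant term does not match
    return result


def select(triples, patterns, keep=lambda binding: True):
    """Match every selector in turn, then apply the filter function."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = extend(binding, pattern, triple)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return [b for b in solutions if keep(b)]


# Roughly equivalent SPARQL, using ";" to reuse ?player as the subject:
#   ?player ex:height ?h ;
#           ex:team   ?team .
#   FILTER (?h > 1.8)
tall = select(
    TRIPLES,
    [("?player", "ex:height", "?h"), ("?player", "ex:team", "?team")],
    keep=lambda b: b["?h"] > 1.8,
)
```

Because ?player appears in both selectors, the second selector can only match triples about the resource already bound by the first, which is what makes the two lines behave like a join.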

Whilst reading this book I’ve moved from SPARQL querying by search, to queries written by slight modification of existing queries, to celebrating writing my own queries, to successful queries no longer being a cause for celebration!

There are some features in SPARQL that I haven’t yet used in anger: “paths”, which expand a query so that it selects not just a single triple (a node with one link) but longer chains of links, and inferencing. Inferencing allows the creation of virtual triples. For example, if we know that Brian is the patient of a doctor called Jane, and we have an inferencing engine which also contains the information that a patient is the inverse of a doctor, then we don’t need to specify that Jane has a patient called Brian.
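The doctor and patient example can be sketched as follows. This is a minimal Python illustration of inverse-property inference only, with hypothetical property names (ex:hasDoctor, ex:hasPatient); real inferencing engines work from ontology declarations and support many more kinds of rule.

```python
# One stated fact: Brian is the patient of a doctor called Jane.
# The ex: names are made up for this sketch.
FACTS = {("ex:Brian", "ex:hasDoctor", "ex:Jane")}

# The extra knowledge held by the inferencing engine: these two
# properties are inverses of each other.
INVERSES = {"ex:hasDoctor": "ex:hasPatient"}


def infer_inverses(facts, inverses):
    """Return the facts plus the virtual triples implied by inverse properties."""
    # Make the mapping work in both directions.
    both_ways = dict(inverses)
    both_ways.update({v: k for k, v in inverses.items()})

    result = set(facts)
    for subject, predicate, obj in facts:
        if predicate in both_ways:
            # Swap subject and object and use the inverse property name.
            result.add((obj, both_ways[predicate], subject))
    return result


ALL_TRIPLES = infer_inverses(FACTS, INVERSES)
```

The triple saying that Jane has a patient called Brian never has to be stored; it appears in the result only because of the inverse declaration.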

The book ends with a cookbook of queries for exploring a new data source, which is useful but needs to be used with a little caution when querying large databases. Most of the book is oriented around running a SPARQL client against files stored locally. I skipped this step, mainly using YASGUI to query the NewsReader data and the SNORQL interface to dbpedia.

Overall summary: a readable introduction to the semantic web and the SPARQL query language.

If you want to see the fruits of my reading then there are still places available on the NewsReader Hack Day in London on 10th June.

Tags: book review, newsreader, SPARQL
