World Cup Hack Day, London 10th June – a teaser!
https://blog.scraperwiki.com/2014/06/world-cup-hack-day-london-10th-june-a-teaser/
Wed, 04 Jun 2014

With the England team just arrived in Miami for their final preparations for the World Cup, Mohammed Bin Hammam is back in the news for further accusations of corruption.

This is interesting because we saw Hammam’s name on Friday as we were testing out the NewsReader technology in preparation for our Hack Day in London on Tuesday 10th June. NewsReader is an EU project which aims to improve our tools to analyse the news.

And that’s just what it does.

Somewhat shame-faced, we must admit that we are rather ignorant of the comings and goings of football. However, this ignorance illustrates the power of NewsReader nicely. We used our simplified API to the NewsReader technology to search thousands of documents relating to the World Cup. In particular, we looked for Sepp Blatter and David Beckham in the news, and for who else was likely to appear in events with them. The result of this search can be seen in the chart below, which shows that Mohammed Bin Hammam appears very frequently in events with Sepp Blatter.

[Chart: Actors]

For us soccer ignoramuses, the simple API also provides a brief biography of bin Hammam from Wikipedia. Part of the NewsReader technology is to link mentions of names to known individuals, and thus to wider data about them. We can make a timeline of events involving bin Hammam, which we show below.

[Timeline of events involving bin Hammam]

It’s easy to look at the underlying articles behind these events, and discover that bin Hammam’s previous appearances in the news have related to bribery.

Finally, we used Gephi to generate a pretty, but somewhat cryptic, visualisation.

[Network visualisation: beckham_and_blatter]

The circles represent people we found in the news articles who appeared in events with either Sepp Blatter or David Beckham; Blatter and Beckham themselves are the dots from which many lines emanate. The purple circles represent people who have had interactions with one or other of Blatter or Beckham, and the green circles those who have had interactions with both. The size of a circle represents the number of interactions. Bin Hammam appears as the biggest green circle.

You can see an interactive version of the first two visualisations here, and the third one is here.

That’s a little demonstration of what can be done with the Newsreader technology, just imagine what someone with a bit more footballing knowledge could do!

If you want to join the fun, then we are running a Hack Day at the Westminster Hub in central London on Tuesday 10th June, where you will be able to try out the NewsReader technology for yourself.

It’s free and you can sign up here, on EventBrite: EventBrite Sign Up

Book review: Learning SPARQL by Bob DuCharme
https://blog.scraperwiki.com/2014/05/book-review-learning-sparql-by-bob-ducharme/
Thu, 29 May 2014

The NewsReader project on which we are working at ScraperWiki uses semantic web technology and natural language processing to derive meaning from the news. We are building a simple API to give access to the NewsReader datastore, whose native interface is SPARQL. SPARQL is a SQL-like query language used to access data stored in the Resource Description Framework (RDF) format.

I reached Bob DuCharme’s book, Learning SPARQL, through an idle tweet mentioning SPARQL, to which his book’s account replied. The book covers the fundamentals of the semantic web and linked data, the RDF standard, the SPARQL query language, performance, and building applications on SPARQL. It also covers ontologies and inferencing, which are built on top of RDF.

As someone with a slight background in SQL and table-based databases, my previous forays into the semantic web have been fraught, since I typically start by asking what the schema for an RDF store is. The answer to this question is “That’s the wrong question”. The triplestore is the basis of all RDF applications; as the name implies, each row contains a triple (i.e. three columns), which are traditionally labelled subject, predicate and object. I found it easier to think in terms of resource, property name and property value. To give a concrete example, “David Beckham” is an example of a resource, his height is the name of a property of David Beckham and, according to dbpedia, the value of this property is 1.8288 (metres, we must assume). The resource and property names must be provided in the form of URIs (Uniform Resource Identifiers); the property value can be a URI or a conventionally typed value such as a string or an integer.
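
To make that concrete as a query, here is a minimal sketch that asks dbpedia for the value of that property. The dbo:height property name and dbr: prefix are how dbpedia models things at the time of writing, so treat the exact names as assumptions to verify rather than gospel.

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    # One triple pattern: resource (subject), property name (predicate), property value (object).
    # dbr:David_Beckham is the resource, dbo:height the property name, ?height binds its value.
    SELECT ?height
    WHERE {
      dbr:David_Beckham dbo:height ?height .
    }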

The triples describe a network of nodes (the resources and property values), with the property names being the links between them; with this infrastructure any network can be described by a set of triples. SPARQL is a query language that superficially looks much like SQL. It can extract arbitrary sets of properties from the network using the SELECT command, get a valid sub-network described by a set of triples using the CONSTRUCT command, and answer a question with a Yes/No answer using the ASK command. And it can tell you “everything” it knows about a particular URI using the DESCRIBE command, where “everything” is subject to the whim of the implementor. It also supports a bunch of other commands which will feel familiar to SQListas, such as LIMIT, OFFSET, FROM, WHERE, UNION, ORDER BY, GROUP BY, and AS. In addition there are the commands BIND, which allows the transformation of variables by functions, and VALUES, which allows you to make little data structures for use within queries. PREFIX provides shortcuts to domains of URIs; for example http://dbpedia.org/resource/David_Beckham can be written dbpedia:David_Beckham, where dbpedia: is the prefix. SERVICE allows you to make queries across the internet to other SPARQL providers. OPTIONAL allows the addition of a variable which is not always present.
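
To give a flavour of how several of those keywords combine, here is a sketch of a query against dbpedia; the dbo:height property and the particular resource names are my assumptions about how dbpedia models things, not something the book prescribes. You can paste it into a public dbpedia SPARQL endpoint to try it.

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # VALUES supplies a small inline data structure, OPTIONAL keeps a player in the
    # results even when no height is recorded, and BIND derives a new variable.
    SELECT ?player ?height ?heightInFeet
    WHERE {
      VALUES ?player { dbr:David_Beckham dbr:Zinedine_Zidane }
      OPTIONAL {
        ?player dbo:height ?height .
        BIND (?height / 0.3048 AS ?heightInFeet)
      }
    }
    ORDER BY ?player
    LIMIT 10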

The core of a SPARQL query is a list of triple patterns, which act as selectors for the triples required, and FILTERs, which further narrow the results by carrying out calculations on the individual members of a triple. Each selector triple is terminated with a “ .” or a “ ;”; the semicolon indicates that the next pattern is written as a pair, with its first element (the subject) the same as that of the current triple. I mention this because Googling for the meaning of punctuation is rarely successful.
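
As an illustration of that punctuation (another sketch against dbpedia, with dbo:birthPlace and dbo:height as assumed property names):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # The ';' after the first pattern means ?person is also the subject of the
    # second pattern; a '.' would have started an unrelated pattern instead.
    SELECT ?person ?height
    WHERE {
      ?person dbo:birthPlace dbr:Leeds ;
              dbo:height ?height .
      FILTER (?height > 1.8)    # keep only people recorded as taller than 1.8 m
    }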

Whilst reading this book I’ve moved from finding SPARQL queries by search, to writing queries by slight modification of existing ones, to celebrating writing my own queries, to successful queries no longer being a cause for celebration!

There are some features in SPARQL that I haven’t yet used in anger: “paths”, which expand a query from selecting a single triple (a node connected by one link) to selecting longer chains of links, and inferencing. Inferencing allows the creation of virtual triples. For example, if we know that Brian is the patient of a doctor called Jane, and we have an inferencing engine which also contains the information that “patient” is the inverse of “doctor”, then we don’t need to state explicitly that Jane has a patient called Brian.
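
A property path looks like this; the ex: namespace and the Jane resource are purely hypothetical, and foaf:knows is just a commonly used predicate picked for illustration.

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/people/>   # hypothetical data namespace

    # 'foaf:knows+' is a property path: it matches chains of one or more
    # foaf:knows links, not just a single triple.
    SELECT ?person
    WHERE {
      ex:Jane foaf:knows+ ?person .
    }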

The book ends with a cookbook of queries for exploring a new data source, which is useful but needs to be used with a little caution when querying against large databases. Most of the book is oriented around running a SPARQL client against files stored locally. I skipped this step, mainly using YASGUI to query the NewsReader data and the SNORQL interface to dbpedia.
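
In the spirit of those cookbook queries (my own sketch rather than one lifted from the book), this is the sort of exploratory query I mean: it lists the predicates a store actually uses, with a LIMIT to keep the cost down on a big endpoint.

    # Which predicates does this store use, and how often?  On a large
    # triplestore the aggregation can be expensive, hence the LIMIT.
    SELECT ?predicate (COUNT(*) AS ?uses)
    WHERE {
      ?subject ?predicate ?object .
    }
    GROUP BY ?predicate
    ORDER BY DESC(?uses)
    LIMIT 50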

In summary: a readable introduction to the semantic web and the SPARQL query language.

If you want to see the fruits of my reading, there are still places available at the NewsReader Hack Day in London on 10th June.

Sign up here!
