NewsReader: the developers’ story Tue, 09 Dec 2014 08:55:45 +0000 ScraperWiki has been a partner in NewsReader, an EU Framework 7 research project, for the last couple of years. The aim of NewsReader is to give computers the power to “understand” the news: to extract from a myriad of news articles the underlying events which gave rise to those articles; the who, the where, the why and the what of those events. The project comprises academic researchers specialising in computational linguistics (VUA in Amsterdam, EHU in the Basque Country and FBK in Trento), LexisNexis – a major news aggregator – and two small technology companies: ourselves at ScraperWiki and SynerScope, a Dutch startup specialising in the visualisation of complex networks.

Our role at ScraperWiki is to provide mechanisms which enable developers to exploit the NewsReader technology, and to feed news into the system. As part of this work we have developed a simple REST API which gives access to the KnowledgeStore, the system which underpins NewsReader. The native query language of the KnowledgeStore is SPARQL – the query language of the semantic web. The Simple API provides a set of predefined queries which are easier for end users to work with than raw SPARQL, and which help us as service managers by providing a predictable set of optimised queries. If you want to know more technical detail then we’ve written a paper about it (here).
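As a sketch of what this looks like from a developer’s side, a predefined query is just an HTTP endpoint with parameters. The base URL, query name and parameter names below are illustrative assumptions, not the real Simple API’s; the API’s own root URL documents the actual queries.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the real Simple API endpoint and its
# query/parameter names may differ (check the API's root URL).
BASE = "https://newsreader.example.com"

def simple_api_url(query_name, **params):
    """Compose a request URL for one of the predefined queries."""
    qs = urlencode(sorted(params.items()))
    return f"{BASE}/{query_name}?{qs}" if qs else f"{BASE}/{query_name}"

simple_api_url("actors_of_a_type", filter="player", output="json")
# 'https://newsreader.example.com/actors_of_a_type?filter=player&output=json'
```

The point of the predefined-query design is visible even in this sketch: the client never sends raw SPARQL, only a query name and parameters the service already knows how to optimise.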

The Simple API has seen live action at a Hack Day on World Cup news which we held in London in the summer. Attendees were able to develop a range of applications which probed violence, money and corruption in the realm of the World Cup. I blogged about our previous Hack Day here and here. The Simple API and the Hack Day helped us shake out some bugs and add features which will make it even better next time.

“Next time” is another Hack Day to be held in Amsterdam on 21st January 2015, and in London on 30th January 2015. This time we have processed 6,000,000 articles relating to the car industry over the period 2005–2014. The motor industry is a trillion-dollar-a-year business, so we can anticipate finding lots of valuable information in this hoard.

From our previous experience the three things that NewsReader excels at are:

  1. Finding networks of interactions and identifying important players. For the World Cup Hack Day we at ScraperWiki were handicapped slightly by having no interest in football! But the NewsReader technology enabled us to quickly identify that “Sepp Blatter”, “Jack Warner” and “Mohammed bin Hammam” were important in world football. This is illustrated in this slightly cryptic visualisation made using Gephi:
  2. Finding events of a particular type. The NewsReader technology carries out semantic role labelling: taking sentences and identifying what type of event is described in each sentence and what roles the participants took. This information is then aggregated and exposed using semantic web technology. At the World Cup Hack Day participants used this functionality to identify events involving violence, bribery, gambling, and other financial transactions;
  3. Establishing timelines. In the World Cup data we could track the events involving “Mohammed bin Hammam” through time, and the types of events he was involved in. This enabled us to quickly navigate to pertinent news articles.
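To make the second point concrete, the output of semantic role labelling for one sentence can be pictured as an event type plus its role fillers. The sentence, frame and role names below are invented for illustration and are not actual NewsReader or FrameNet output:

```python
# One labelled event extracted from one sentence.  The frame and role
# names here are illustrative only, not real NewsReader/FrameNet output.
event = {
    "sentence": "FIFA fined Mohammed bin Hammam.",
    "frame": "Fining",                      # the type of event
    "roles": {
        "Payer": "Mohammed bin Hammam",     # who the event happened to
        "Imposer": "FIFA",                  # who carried it out
    },
}

# Aggregating many such events lets you ask questions such as
# "all Fining events involving bin Hammam":
is_match = (event["frame"] == "Fining"
            and "Mohammed bin Hammam" in event["roles"].values())
```

It is the aggregation of millions of these little structures, linked back to the articles that mention them, that makes the event-type and timeline queries above possible.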

You can see fragments of code used to extract these data using the Simple API in these GitHub Gists (here and here), and dynamic visualisations illustrating these three features here and here.

The Simple API is up and running already; you can find it (here). It is self-documenting: simply visit the root URL and you’ll see query examples with optional and compulsory parameters. Be aware, though, that the Simple API is under active development, and the underlying data in the KnowledgeStore is being optimised for the Hack Days, so it may not be available when you visit.

If you want to join our automotive Hack Day then you can sign up for the Amsterdam event (here) and the London event (here).

NewsReader World Cup Hack Day Tue, 29 Jul 2014 14:07:08 +0000

Piek Vossen describing the NewsReader project

A long time ago*, in a galaxy far, far away** we ran the NewsReader World Cup Hack Day.

*Actually it was on the 10th June.

**It was in the Westminster Hub, London.

NewsReader is an EU FP7 project aimed at developing natural language processing and Semantic Web technology to make sense of large streams of news articles. In NewsReader the underlying events and actors are called “instances”, whilst the mentions of those events in news articles are called… “mentions”. Many mentions can refer to one instance. This is key to the new technology that NewsReader is developing: condensing down millions of news articles into a network representing the original events.

The project has been running for about 18 months, so we are half way through. We’d always planned to use Hack Days to showcase the technology we have been building and to guide our future work. The World Cup Hack Day was the first of these events. We wanted to base the Hack Day around some timely news of a manageable size, and the football World Cup fitted the bill. Currently the NewsReader technology works as a batch process, so in the couple of months before the Hack Day we processed approximately 300,000 news articles relating to the World Cup. At ScraperWiki we made a Simple API to provide access to the data that the NewsReader technology outputs. We thought this necessary for two reasons. First, the raw output is stored in an RDF triplestore and is accessed using SPARQL queries; SPARQL is a query language similar to SQL but not as widely known, and we didn’t want people to spend all day trying to get their first SPARQL query to work. Secondly, the dataset the NewsReader technology generates is several hundred million triples, so even a “correct” query could easily cause the SPARQL endpoint to appear unresponsive. By making the Simple API we could limit the queries made, and make sure that they were queries which ran in a reasonable time.

About 40 people turned up on the day: a big contingent from the BBC, a smattering of people from various organisations based in London, and the core NewsReader team, enhanced with various colleagues we’d dragged along. After a brief introduction from Piek Vossen, who put some context around the Hack Day, and from me, providing some technical details, the participants went into action.

They made a range of applications, most of them focused on extracting particular types of events, as defined by FrameNet. Gambling, fighting and commerce were common themes. A rogue group from ScraperWiki ignored the Simple API, bought a 32-CPU spot instance with 60GB of memory, and wrote their own high-performance search tool.

At the end of the day the participants presented their work, and prizes were awarded. Jim Johnson-Rollings from the BBC came first with a live demo of a tool which worked out which teams a named football player had played for over the course of his career. Second was Team Fighty, who built a tool to discover which football teams were most commonly associated with violence. Team Fail Fast, Fail Often tried out a wide range of things, culminating in a swish Gephi network visualisation. The prizes were regional foods from around the NewsReader consortium.

This was my first Hack Day, although ScraperWiki has run a number of such events in the past. We’d spent considerable effort in making the Simple API and it was great to see the hackers make such varied use of it. I was impressed how much they managed to do in such a short time.

The Simple API we developed, the presentation slides and some demo visualisations are available in this bundle of links.

The NewsReader team are grateful to all the participants for taking part and helping us with our project.

A big thanks also to the team at The Hub Westminster who gave us so much support on the day.



Happy Hackers working away on the World Cup data at the Westminster Hub

World Cup Hack Day, London 10th June – a teaser! Wed, 04 Jun 2014 07:30:24 +0000 With the England team just arrived in Miami for their final preparations for the World Cup, Mohammed bin Hammam is back in the news facing further accusations of corruption.

This is interesting because we saw Hammam’s name on Friday as we were testing out the NewsReader technology in preparation for our Hack Day in London on Tuesday 10th June. NewsReader is an EU project which aims to improve our tools to analyse the news.

And that’s just what it does.

Shame-faced, we must admit that we are somewhat ignorant of the comings and goings of football. However, this ignorance nicely illustrates the power of NewsReader. We used our simplified API to the NewsReader technology to search thousands of documents relating to the World Cup. In particular we looked for Sepp Blatter and David Beckham in the news, and for who else was likely to appear in events with them. The result of this search can be seen in the chart below, which shows that Mohammed bin Hammam appears very frequently in events with Sepp Blatter.

For us soccer ignoramuses, the Simple API also provides a brief biography of bin Hammam from Wikipedia. Part of the NewsReader technology is to link mentions of names to known individuals, and thus to wider data about them. We can make a timeline of events involving bin Hammam, which we show below.


It’s easy to look at the underlying articles behind these events, and discover that bin Hammam’s previous appearances in the news have related to bribery.

Finally, we used Gephi to generate a pretty, but somewhat cryptic, visualisation. The circles represent people we found in the news articles who appeared in events with either Sepp Blatter or David Beckham; Blatter and Beckham themselves are the dots from which many lines emanate. The purple circles represent people who have had interactions with one or other of Blatter and Beckham, the green circles people who have had interactions with both. The size of a circle represents the number of interactions. Bin Hammam appears as the biggest green circle.

You can see an interactive version of the first two visualisations here, and the third one is here.

That’s a little demonstration of what can be done with the NewsReader technology; just imagine what someone with a bit more footballing knowledge could do!

If you want to join the fun, then we are running a Hack Day at the Westminster Hub in central London on Tuesday 10th June, where you will be able to try out the NewsReader technology for yourself.

It’s free and you can sign up here, on EventBrite: EventBrite Sign Up

Book review: Learning SPARQL by Bob DuCharme Thu, 29 May 2014 07:42:40 +0000 The NewsReader project on which we are working at ScraperWiki uses semantic web technology and natural language processing to derive meaning from the news. We are building a simple API to give access to the NewsReader datastore, whose native interface is SPARQL. SPARQL is a SQL-like query language used to access data stored in the Resource Description Framework (RDF) format.

I reached Bob DuCharme’s book, Learning SPARQL, through an idle tweet mentioning SPARQL, to which his book’s account replied. The book covers the fundamentals of the semantic web and linked data, the RDF standard, the SPARQL query language, performance, and building applications on SPARQL. It also talks about ontologies and inferencing, which are built on top of RDF.

As someone with a slight background in SQL and table-based databases, my previous forays into the semantic web have been fraught, since I typically start by asking what the schema for an RDF store is. The answer to this question is “that’s the wrong question”. The triplestore is the basis of all RDF applications; as the name implies, each row contains a triple (i.e. three columns), traditionally labelled subject, predicate and object. I found it easier to think in terms of resource, property name and property value. To give a concrete example: “David Beckham” is a resource, his height is the name of a property of David Beckham and, according to dbpedia, the value of this property is 1.8288 (metres, we must assume). The resource and property names must be provided in the form of URIs (uniform resource identifiers); the property value can be a URI or a normally typed entity such as a string or an integer.
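The resource / property-name / property-value reading can be sketched in a few lines of Python, with a triplestore as a set of three-element tuples. The prefixed names stand in for full URIs, and the extra rdf:type triple is invented here for illustration:

```python
# A triplestore is just a collection of (subject, predicate, object)
# rows.  Prefixed names stand in for full URIs; the rdf:type triple
# is illustrative.
triples = {
    ("dbpedia:David_Beckham", "dbo:height", 1.8288),
    ("dbpedia:David_Beckham", "rdf:type", "dbo:SoccerPlayer"),
}

def values_of(resource, property_name):
    """All property values of a resource for a given property name."""
    return {o for (s, p, o) in triples if s == resource and p == property_name}

values_of("dbpedia:David_Beckham", "dbo:height")  # {1.8288}
```

There is no schema to consult: any fact about any resource is just another row in the same three-column shape, which is why “what's the schema?” is the wrong question.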

The triples describe a network of nodes (the resources and property values), with the property names being the links between them; with this infrastructure any network can be described by a set of triples. SPARQL is a query language that superficially looks much like SQL. It can extract arbitrary sets of properties from the network using the SELECT command, get a valid sub-network described by a set of triples using the CONSTRUCT command, and answer a yes/no question using the ASK command. It can also tell you “everything” it knows about a particular URI using the DESCRIBE command, where “everything” is subject to the whim of the implementor. It supports a bunch of other commands which will feel familiar to SQListas, such as LIMIT, OFFSET, FROM, WHERE, UNION, ORDER BY, GROUP BY, and AS. In addition there are the commands BIND, which allows the transformation of variables by functions, and VALUES, which allows you to make little data structures for use within queries. PREFIX provides shortcuts for domains of URIs: for example, the full dbpedia URI for David Beckham can be abbreviated to dbpedia:David_Beckham, where dbpedia: is the prefix. SERVICE allows you to make queries across the internet to other SPARQL providers, and OPTIONAL allows the addition of a variable which is not always present.

The core of a SPARQL query is a list of triple patterns which act as selectors for the triples required, plus FILTERs which further filter the results by carrying out calculations on the individual members of the triple. Each selector triple is terminated with “ .” or with “ ;”, the latter indicating that the next triple pattern shares the same first element (the subject) as the current one. I mention this because Googling for the meaning of punctuation is rarely successful.
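Putting those pieces together, a query against a dbpedia-style endpoint might look like the sketch below; it shows the triple-pattern punctuation just described, along with PREFIX, FILTER, ORDER BY and LIMIT (the exact prefixes and property names depend on the endpoint you query):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?player ?height
WHERE {
  ?player a dbo:SoccerPlayer ;   # ";" keeps the same subject, ?player
          dbo:height ?height .   # "." ends this group of patterns
  FILTER (?height > 1.8)
}
ORDER BY DESC(?height)
LIMIT 10
```

Here `a` is SPARQL shorthand for rdf:type, and the “ ;” saves repeating ?player as the subject of the second pattern.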

Whilst reading this book I’ve moved from SPARQL querying by search, through writing queries by slight modification of existing ones, to celebrating writing my own queries, to writing successful queries no longer being a cause for celebration!

There are some features in SPARQL that I haven’t yet used in anger: “paths”, which expand queries from selecting a triple linking two nodes to selecting longer chains of links; and inferencing. Inferencing allows the creation of virtual triples. For example, if we know that Brian is the patient of a doctor called Jane, and our inferencing engine also knows that patient is the inverse of doctor, then we don’t need to state explicitly that Jane is the doctor of Brian.

The book ends with a cookbook of queries for exploring a new data source, which is useful but needs to be used with a little caution when querying large databases. Most of the book is oriented around running a SPARQL client against files stored locally. I skipped this step, mainly using YASGUI to query the NewsReader data and the SNORQL interface to dbpedia.

Overall summary: a readable introduction to the semantic web and the SPARQL query language.

If you want to see the fruits of my reading then there are still places available on the NewsReader Hack Day in London on 10th June.

Sign up here!

NewsReader – Hack 100,000 World Cup Articles Wed, 16 Apr 2014 13:52:23 +0000 June 10, The Hub Westminster (@NewsReader)

Ian Hopkinson has been telling you about our role in the NewsReader project. We’re making a thing that crunches large volumes of news articles, combining natural language processing and semantic web technology. It’s an FP7 project, so we’re working with a bunch of partners across Europe.

We’re 18 months into the project and we have something to show off. Please think about joining us for a fun ‘hack’ event on June 10th in London at ‘The Hub’, Westminster. There are 100,000 World Cup news articles we need to crunch, and we hope to dig out some new insights from a cacophony of digital noise. There will be light refreshments throughout the day. Like all good hack events there will be an end-of-day reception, where we would like you to present your findings and give us some feedback on the experience (the requisite beer and pizza will be provided).

All of our partners will be there: LexisNexis, SynerScope, VU University Amsterdam, the University of the Basque Country (San Sebastián) and Fondazione Bruno Kessler (Trento). They’re a great team, very knowledgeable in this field, and they love what they are doing.

Ian recently made a short video about the project which is a useful introduction.

If you are a journalist, an editor, a linked data enthusiast or a data professional, we hope you will care about this kind of innovation.

Please sign up here (‘NewsReader eventbrite invitation’) and tell your friends.


NewsReader – one year on Wed, 26 Feb 2014 09:11:43 +0000 ScraperWiki has been contributing to NewsReader, an EU FP7 project, for over a year now. In that time, we’ve discovered that all the TechCrunch articles would make a pile 4 metres high, and that’s just one relatively small site. The total volume of news published every day is enormous, but the tools we use to process it are still surprisingly crude.

NewsReader aims to develop natural language processing technology to make sense of large streams of news data; to convert a myriad of mentions of events across many news articles into a network of actual events, thus making the news analysable. In NewsReader the underlying events and actors are called “instances”, whilst the mentions of those events in news articles are called… “mentions”. Many mentions can refer to one instance. This is key to the new technology that NewsReader is developing: condensing down millions of news articles into a network representing the original events.

Our partners

The project has six member organisations: VU University Amsterdam, the University of the Basque Country (EHU) and the Fondazione Bruno Kessler (FBK) in Trento are academic groups working in computational linguistics. LexisNexis is a major portal for electronically distributed legal and news information, SynerScope is a Dutch startup specialising in visualising large volumes of information, and then there’s ScraperWiki. Our relevance to the project is our background in data journalism and in scraping to obtain news information.

ScraperWiki’s role in the project is to provide a mechanism for users of the system to acquire news to process with the technology, to help with the system architecture and the open-sourcing of the software developed, and to foster the exploitation of the work beyond the project. As part of this we will be running a Hack Day in London in the early summer which will allow participants to access the NewsReader technology.

Extract News

Over the last year we’ve played a full part in the project. We were involved mainly in Work Package 01 – User Requirements, which meant scoping out large news-related datasets on the web, surveying the types of applications that decision makers use in analysing the news, and proposing the end-user and developer requirements for the NewsReader system. Alongside this we’ve been participating in meetings across Europe and learning about natural language processing and the semantic web.

Our practical contribution has been to build an “Extract News” tool. We wanted to provide a means for users to feed articles of interest into the NewsReader system. Extract News is a web crawler which scans a website and converts any news-like content into News Annotation Format (NAF) files, which are then processed using the NewsReader natural language processing pipeline. NAF is an XML format which the modules of the pipeline use both to receive input and to modify as output. We are running our own instance of the processing virtual machine on Amazon AWS. NAF files contain the content and metadata of individual news articles. Once they have passed through the natural language processing pipeline they are collectively processed to extract semantic information, which is then fed into a specialised semantic KnowledgeStore, developed at FBK.
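A minimal sketch of what building such a file might look like, using Python’s standard library. The element and attribute names below are a simplified guess at NAF’s layout for illustration, not the actual schema:

```python
import xml.etree.ElementTree as ET

def make_naf(uri, title, author, text):
    """Build a minimal NAF-style document: article metadata in a header
    element, the article text in a <raw> element.  Element and attribute
    names are a simplified sketch, not the real NAF schema."""
    naf = ET.Element("NAF", {"version": "v3"})
    header = ET.SubElement(naf, "nafHeader")
    ET.SubElement(header, "fileDesc", {"title": title, "author": author})
    ET.SubElement(header, "public", {"uri": uri})
    ET.SubElement(naf, "raw").text = text
    return ET.tostring(naf, encoding="unicode")

doc = make_naf("http://example.com/story", "A headline", "A. Reporter",
               "Body of the news article.")
```

The real pipeline modules then enrich a file like this in place, each stage adding its own annotation layers alongside the raw text.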


The more complex parts of our Extract News tool are in selecting the “news” part of a potentially messy web page, and in providing a user interface suited to monitoring long-running processes. At the moment we need to hand-code the extraction of each article’s publication date and author on a once-per-website basis; we hope to automate this process in future.

We’ve been trying out Extract News on our local newspapers, the Chester Chronicle and the Liverpool Echo. It’s perhaps surprising how big these relatively small news sites are – approximately 100,000 pages, which take several days to scrape. If you wanted to collect news on an ongoing basis then you’d probably consume RSS feeds.
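Pulling article titles and links out of an RSS 2.0 feed needs nothing beyond the standard library; the feed content below is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny invented RSS 2.0 feed, standing in for a real newspaper's feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example Echo</title>
  <item><title>Local story</title><link>http://example.com/a</link></item>
  <item><title>Another story</title><link>http://example.com/b</link></item>
</channel></rss>"""

def rss_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"))
            for i in root.iter("item")]

rss_items(SAMPLE_RSS)
# [('Local story', 'http://example.com/a'), ('Another story', 'http://example.com/b')]
```

Polling a feed like this gives you only new articles as they appear, rather than re-crawling a 100,000-page site.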

What are other team members doing?

LexisNexis have been sourcing licensed news articles for use in our test scenarios and preparing their own systems to make use of the NewsReader data. VU University Amsterdam have been developing the English-language processing pipeline and the technology for reducing mentions to instances. EHU are working on scaling the NewsReader system architecture to process millions of documents. FBK have been building the KnowledgeStore – a data store based on Big Data technologies for storing, and providing an interface to, the semantic data that the natural language processing adds to the news. Finally, SynerScope have been using their hierarchical edge-bundling technology and expertise in the semantic web to build high-performance visualisations of the data generated by NewsReader.

The team has been busy since the New Year preparing for our first year review by the funding body in Luxembourg: an exercise so serious that we had a full dress rehearsal prior to the actual review!

As our project leader said this morning:

We had the best project review I ever had in my career!!!!!!!

Learn More

We have announced details of our event happening on June 10th – please click here.

If you’re interested in learning more about NewsReader, our News Extract tool, or the London Hack Day, then please get in touch!


European Data Forum Dublin 2013… what was it all about? Fri, 14 Jun 2013 08:00:40 +0000 It was no accident that the 2013 European Data Forum was held in Dublin, given that Ireland’s presidency of the Council of the European Union runs until June 30th. The venue was the Croke Park Conference Centre, which overlooks Ireland’s premier sporting stadium, an historic landmark. The forum was organised by the Digital Enterprise Research Institute (DERI), an internationally recognised body focused on semantic web research, whose purpose is to contribute directly to the Irish government’s plan of transforming Ireland into a competitive knowledge economy.


The annual conference brought together data practitioners from industry, research, the public sector and the community to discuss the opportunities and challenges of the emerging big data economy in Europe. The delegate list was heavily influenced by FP7 participants – ScraperWiki is an FP7 participant through the NewsReader project!

Senior executives from SAP, Statoil, Ericsson, RTE and many others talked about how big data problems manifest and also about some of the opportunities that big data presents. Statoil gave some innovative examples including using vast amounts of ‘biometric data’ to better identify the existence of oil deposits.

The person who cares about this programme at EU level is Marta Nagy-Rothengass, Head of Unit “Data Value Chain” in DG CONNECT at the European Commission. I asked Marta about the purpose of the European Data Forum. She explained that “the primary objective is to get stakeholders who would never normally meet together to think and to act in making better reuse of public data for commercial purpose” and that it serves “as a place to network and exchange ideas”. When asked about the expected outcomes, Marta was enthusiastic: “we hope that within three years a data value chain and a platform will be established where public and private organisations can create real financial value from data; we hope to have an infrastructure for services and applications for the public sector, private sector and citizens that are multilingual and open; and there will be actions at EU level to develop a data skills network, research and innovation activities and kinetic innovation (e.g. geodata that is cross-sector, cross-border, has monetary value and offers better decision making and intelligence)”.

The ‘big’ value came with the networking opportunity: 20 data solutions were presented at an exposition that ran alongside the main event. Deirdre Lee, Research Associate in the eGovernment domain at DERI and the lead organiser, told me that the institute is involved in many EU projects covering open, linked and big data, where it can help improve data quality and availability. About the conference she said: “We did not want the conference to be academic; we want to get industry involved to ensure that we reflect real problems with data, and we also wanted to showcase some of the solutions that are available”.

The EU economy is still the largest and wealthiest bloc in the world, and at least 26 and 3/4 of its countries see this as an opportunity despite its recent economic woes! According to Wikipedia, “The economy of the European Union generates a GDP of over €12.894 trillion (US$16.566 trillion in 2012) according to Eurostat, making it the largest economy in the world.”

Earlier this week there was a discussion on the radio about how Silicon Valley’s success can be partially linked to the US defence industry’s thirst for technological competitive advantage – I hope that I am not naive in hoping that the EU’s approach to research and innovation in our sector is less ‘defence’ driven.

NewsReader FP7 Project Team Photo in Liverpool June 2013
