natural language processing – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

NewsReader – one year on
Wed, 26 Feb 2014 09:11:43 +0000

ScraperWiki has been contributing to NewsReader, an EU FP7 project, for over a year now. In that time, we’ve discovered that all the TechCrunch articles would make a pile 4 metres high, and that’s just one relatively small site. The total volume of news published every day is enormous, but the tools we use to process it are still surprisingly crude.

NewsReader aims to develop natural language processing technology to make sense of large streams of news data; to convert a myriad of mentions of events across many news articles into a network of actual events, thus making the news analysable. In NewsReader the underlying events and actors are called “instances”, whilst the mentions of those events in news articles are called… “mentions”. Many mentions can refer to one instance. This is key to the new technology that NewsReader is developing: condensing down millions of news articles into a network representing the original events.

Our partners

The project has 6 member organisations. The VU University Amsterdam, the University of the Basque Country (EHU) and the Fondazione Bruno Kessler (FBK) in Trento are academic groups working in computational linguistics. Lexis Nexis is a major portal for electronically distributed legal and news information, SynerScope is a Dutch startup specialising in visualising large volumes of information, and then there’s ScraperWiki. Our relevance to the project is our background in data journalism and scraping to obtain news information.

ScraperWiki’s role in the project is to provide a mechanism for users of the system to acquire news to process with the technology, to help with the system architecture, open sourcing of the software developed and to foster the exploitation of the work beyond the project. As part of this we will be running a Hack Day in London in the early summer which will allow participants to access the NewsReader technology.

Extract News

Over the last year we’ve played a full part in the project. We were involved mainly in Work Package 01 – User Requirements which meant scoping out large news-related datasets on the web, surveying the types of applications that decision makers used in analysing the news, and proposing the end user and developer requirements for the NewsReader system. Alongside this we’ve been participating in meetings across Europe and learning about natural language processing and the semantic web.

Our practical contribution has been to build an “Extract News” tool. We wanted to provide a means for users to feed articles of interest into the NewsReader system. Extract News is a web crawler which scans a website and converts any news-like content into NLP Annotation Format (NAF) files, which are then processed using the NewsReader natural language processing pipeline. NAF is an XML format which each module of the pipeline reads as input and annotates further as output. We are running our own instance of the processing virtual machine on Amazon AWS. NAF files contain the content and metadata of individual news articles. Once they have passed through the natural language processing pipeline they are collectively processed to extract semantic information, which is then fed into a specialised semantic KnowledgeStore, developed at FBK.
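To make the idea of a NAF file a little more concrete, here is a minimal sketch of wrapping one article’s text and metadata in an XML document using only the Python standard library. It is not the Extract News code. The element and attribute names (`NAF`, `nafHeader`, `fileDesc`, `public`, `raw`) and the version string reflect our understanding of the NAF layout and should be checked against the NAF specification; the function name `article_to_naf` and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def article_to_naf(text, title, url, created, lang="en"):
    """Wrap one article in a minimal NAF-style XML document (illustrative only)."""
    # Root element; the version value is an assumption, not taken from the project.
    root = ET.Element("NAF", {"xml:lang": lang, "version": "v3"})

    # Metadata about the article goes in the header.
    header = ET.SubElement(root, "nafHeader")
    ET.SubElement(header, "fileDesc", {"title": title, "creationtime": created})
    ET.SubElement(header, "public", {"uri": url})

    # The untouched article text; later pipeline modules would add annotation layers.
    raw = ET.SubElement(root, "raw")
    raw.text = text

    return ET.tostring(root, encoding="unicode")


naf_xml = article_to_naf(
    text="ScraperWiki has been contributing to NewsReader for over a year.",
    title="NewsReader - one year on",
    url="http://example.com/newsreader-one-year-on",  # hypothetical URL
    created="2014-02-26T09:11:43Z",
)
print(naf_xml)
```

A real pipeline module would parse a document like this, add its own layer of annotations, and write the enlarged document back out for the next module.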


The more complex parts of our News Extract tool are in selecting the “news” part of a potentially messy web page, and providing a user interface suited to monitoring long running processes. At the moment we need to hand code the extraction of each article’s publication date and author on a “once per website” basis – we hope to automate this process in future.

We’ve been trying out News Extract on our local newspapers – the Chester Chronicle and the Liverpool Echo. It’s perhaps surprising how big these relatively small news sites are – approximately 100,000 pages – which takes several days to scrape. If you wanted to collect news on an ongoing basis then you’d probably consume RSS feeds.
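As a rough illustration of the RSS approach, the sketch below reads an RSS 2.0 feed and returns only the items whose links have not been seen before. It assumes a standard `channel`/`item` layout; the sample feed, its URLs and the function name `new_articles` are made up for the example, and a real collector would fetch the feed over HTTP on a schedule.

```python
import xml.etree.ElementTree as ET

# A made-up feed standing in for a newspaper's real RSS output.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example local news</title>
    <item>
      <title>First story</title>
      <link>http://example.com/news/first-story</link>
      <pubDate>Wed, 26 Feb 2014 09:11:43 +0000</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/news/second-story</link>
      <pubDate>Thu, 27 Feb 2014 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def new_articles(feed_xml, seen_links):
    """Return feed items whose link is not already in seen_links."""
    channel = ET.fromstring(feed_xml).find("channel")
    fresh = []
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link not in seen_links:
            fresh.append({
                "title": item.findtext("title"),
                "link": link,
                "published": item.findtext("pubDate"),
            })
    return fresh


already_scraped = {"http://example.com/news/first-story"}
fresh = new_articles(SAMPLE_FEED, already_scraped)
for article in fresh:
    print(article["published"], article["title"], article["link"])
```

Polling a feed like this means only a handful of new pages need fetching each day, rather than re-crawling the whole site.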

What are other team members doing?

Lexis Nexis have been sourcing licensed news articles for use in our test scenarios and preparing their own systems to make use of the NewsReader data. VU University Amsterdam have been developing the English language processing pipeline and the technology for reducing mentions to instances. EHU are working on scaling the NewsReader system architecture to process millions of documents. FBK have been building the KnowledgeStore – a data store based on Big Data technologies for storing, and providing an interface to, the semantic data that the natural language processing adds to the news. Finally, SynerScope have been using their hierarchical edge-bundling technology and expertise in the semantic web to build high performance visualisations of the data generated by NewsReader.

The team has been busy since the New Year preparing for our first year review by the funding body in Luxembourg: an exercise so serious that we had a full dress rehearsal prior to the actual review!

As our project leader said this morning:

We had the best project review I ever had in my career!!!!!!!

Learn More

We have announced details of our event, which takes place on June 10th.

If you’re interested in learning more about NewsReader, our News Extract tool, or the London Hack Day, then please get in touch!


Book review: Mining the Social Web by Matthew A. Russell
Mon, 20 Jan 2014 08:41:35 +0000

The Twitter search and follower tools are amongst the most popular on the ScraperWiki platform so we are looking to provide more value in this area. To this end I’ve been reading “Mining the Social Web” by Matthew A. Russell.

In the first instance the book looks like a run through the APIs for various social media services (Twitter, Facebook, LinkedIn, Google+, GitHub etc) but after the first couple of chapters on Twitter and Facebook it becomes obvious that it is more subtle than that. Each chapter also includes material on a data mining technique; for Twitter it is simply counting things. The Facebook chapter introduces graph analysis, a theme extended in the chapter on GitHub. Google+ is used as a framework to introduce term frequency-inverse document frequency (TF-IDF), an information retrieval technique and a basic, but effective, way to process natural language. Web page scraping is used as a means to introduce some more ideas about natural language processing and summarisation. Mining mailboxes uses a subset of the Enron mail corpus to introduce MongoDB as a document storage system. The final chapter is a Twitter cookbook which includes lots of short recipes for simple Twitter-related activities but no further analysis. The coverage of each topic isn’t deep but it is practical – introducing the key libraries to do tasks. And it’s alive with suggestions for further work, and references to help with that.
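For readers who haven’t met TF-IDF before, here is a small sketch of the idea in plain Python; it is not taken from the book. It uses one common weighting (term count divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the term); other variants exist. The function name `tf_idf` and the three toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, using a simple TF-IDF weighting."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            # Term frequency within this document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "the news is enormous",
    "the tools are crude",
    "the news tools",
]
scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda pair: -pair[1]):
    print(f"{term:10s} {score:.3f}")
```

A word such as “the”, which appears in every document, scores zero, while a word unique to one document scores highest there. That is what makes the measure useful for picking out the terms that characterise a document.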

The examples in the book are provided as IPython Notebooks which are supplied, along with a Notebook server on a virtual machine, from a GitHub repository. IPython notebooks are interactive Python sessions run through a browser interface. Content is divided into cells which can either be code or simple descriptive text. A code cell can be executed and the output from the code appears in an output cell. These notebooks are a really nice way to present example code since the code has some context. The virtual machine approach is also a great innovation since configuring Python libraries and the IPython server itself, in a platform agnostic manner, is really difficult and this solution bypasses most of those problems. The system makes it incredibly easy to run the example code for yourself; almost too easy, in fact, as I found myself clicking blindly through some of the example code. Potentially the book could have been presented simply as an IPython notebook. This is likely not economically practical, but it would be nice to collect the links to further reading there, where they would be more usable. The GitHub repository also provides a great place for interaction with the author: I filed a couple of issues regarding setting the system up and he responded unfailingly quickly, as he did for many other readers. Also I discovered incidentally, through being subscribed to the repository, that one of the people I follow on Twitter (and a guest blogger here) was also reading the book. An interesting example of the social web in action!

Mining the Social Web covers some material I had not come across in my earlier machine learning/data mining reading. There are a couple of chapters containing material on graph theory, using data from Facebook and GitHub. And, as often happens when reading about the same material in different places, a connection became clearer: Russell highlights that clustering and de-duplication are facets of the same subject.

I read with interest the section on using a MongoDB database as a store for tweets and other data in the form of JSON objects. Currently I am bemused by MongoDB. The ScraperWiki platform uses it to store user profile information, and I have occasional recourse to try to look things up there. I’ve struggled to see the benefit of MongoDB over a SQL database, particularly having watched two of my colleagues spend a morning working out how to do what would be a simple SQL join in MongoDB. Mining the Social Web has made me wonder about giving MongoDB another chance.

The penultimate chapter is a discussion of the semantic web, introducing both microformats and RDF technology, although the discussion is much less concrete than earlier chapters. Microformats are HTML elements which hold semantic information about a page using an agreed schema. To give an example: the geo microformat encodes geographic information. In the absence of such a microformat, geographic information such as latitude and longitude could be encoded in pretty much any way, making it necessary to either use custom scrapers on a page by page basis or complex heuristics to infer the presence of such information. RDF is one of the underpinning technologies for the semantic web, which is shorthand for a worldwide web marked up such that machines can understand the meaning of webpages. This touches on the EU NewsReader project on which we are collaborators, and which seeks to generate this type of semantic mark up for news articles using natural language processing.
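To show why an agreed schema helps a scraper, here is a short sketch that pulls coordinates out of geo-style markup using Python’s standard `html.parser`. It assumes the common form of the microformat, in which elements carry the class names `latitude` and `longitude` and the value is the element’s text. The class name `GeoParser` and the sample HTML are made up for the example, and the sketch handles only one location per page.

```python
from html.parser import HTMLParser


class GeoParser(HTMLParser):
    """Collect the text of elements whose class is 'latitude' or 'longitude'."""

    def __init__(self):
        super().__init__()
        self.field = None   # the coordinate we are currently inside, if any
        self.coords = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            if name in classes:
                self.field = name

    def handle_data(self, data):
        # The first non-empty text after a matching start tag is the value.
        if self.field and data.strip():
            self.coords[self.field] = float(data)
            self.field = None


# Invented sample markup in the style of the geo microformat.
html = ('<div class="geo">Chester: <span class="latitude">53.19</span>, '
        '<span class="longitude">-2.89</span></div>')

parser = GeoParser()
parser.feed(html)
print(parser.coords)
```

Because the class names are fixed by the schema, the same few lines work on any page that uses the microformat, with no per-site scraper required.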

Overall, definitely worth reading. We’re interested in extending our tools for social media and with this book in hand I’m confident we can do it and be aware of more possibilities.

Asking data questions of words
Tue, 09 Apr 2013 13:22:02 +0000

The vast majority of my contributions to the web have been loosely encoded in the varyingly standard-compliant family of languages called English. It’s a powerful language for expressing meaning, but the inference engines needed to parse it are pretty complex, staggeringly ancient, yet cutting edge (i.e. brains). We tend to think about data a lot at ScraperWiki, so I wanted to explore how I can ask data questions of words.

Different engines render English into browser-viewable markup (HTML): twitter, wordpress, Facebook, tumblr and emails; alongside various iterations of employers’ sites, industry magazines, and short notes on things I’ve bought on Amazon. Much of this is scrapeable or available via APIs, and a lot of ScraperWiki’s data science work has been gathering, cleaning, and analysing data from websites.

For sharing facts as data, people publish CSVs, tables and even occasionally access to databases, but I think there are lessons to learn from the web’s primary, human-encoded content. I’ll share my first tries, and hope to get some feedback (comment below, or drop me a line on the ScraperWiki Google Group, please).


There’s a particularly handy Python package for treating words as data: NLTK, the Natural Language Toolkit. The people behind it wrote a book (available online) introducing people not only to the code package, but to ways of thinking programmatically about language.

One of many nice things about NLTK is that it gives you loads of straightforward functions so you can put names to different ways of slicing up and comparing your words. To get started, I needed some data words. I happened to have a collection of CSV files containing my archive of tweets, and the ever-helpful Dragon at ScraperWiki helped me convert all these files into one long text doc, which I’m calling my twitter corpus.

Then, I fed NLTK my tweets and gave it a bunch of handles – variables – on which I may want to tug in future (mainly to see what they do).

[sourcecode language="python"]
from nltk import Text, pos_tag, word_tokenize, wordpunct_tokenize

filename = 'tweets.txt'

def txt_to_nltk(filename):
    # Read the whole corpus into one string
    # (universal newlines are the default in Python 3, so no 'rU' mode is needed)
    with open(filename) as f:
        raw = f.read()
    tokens = word_tokenize(raw)
    words = [w.lower() for w in tokens]
    vocab = sorted(set(words))
    cleaner_tokens = wordpunct_tokenize(raw)
    # "Text" is a datatype in NLTK
    tweets = Text(tokens)
    # For language nerds, you can tag the Parts of Speech!
    tagged = pos_tag(cleaner_tokens)
    # Hand everything back under a name, so it can be looked up later
    return dict(
        raw=raw,
        tokens=tokens,
        words=words,
        vocab=vocab,
        cleaner_tokens=cleaner_tokens,
        tweets=tweets,
        tagged=tagged,
    )

tweet_corpus = txt_to_nltk(filename)
[/sourcecode]

Following some exercises in the book, I jumped straight to the visualisations. I asked for a lexical dispersion plot of some words I assumed I must have tweeted about. The plot illustrates the occurrence of words within the text. Because my corpus is laid-out chronologically (the beginning of the text is older than the end), I assumed I would see some differences over time:

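[sourcecode language="python"]
# dispersion_plot is a method of the NLTK Text object stored under "tweets"
tweet_corpus["tweets"].dispersion_plot(["coffee", "Shropshire", "Yorkshire",
                                        "cycling", "Tramadol"])
[/sourcecode]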

Can you guess what some of them might be?


This ended up pretty much as I’d expected: illustrating my move from Shropshire to Yorkshire. It shows when I started tweeting about cycling, and the lovely time I ended up needing to talk about powerful painkillers (yep, that’s related to cycling!). I continuously cover the word “coffee” in my tweets. This kind of visualisation could be particularly useful for marketers watching the evolution of keywords, or head-hunters keeping an eye out for emerging skills. Basically, anyone who wants to see when a topic gathers reference within a set of words (e.g. the back-catalog of an industry blog).
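If you’re curious what sits underneath a lexical dispersion plot, it is essentially a list of positions: for each word of interest, where in the token sequence does it occur? Here is a minimal sketch in plain Python, not NLTK’s own implementation; the function name `dispersion_offsets` and the toy sentence are invented for the example.

```python
def dispersion_offsets(tokens, targets):
    """Map each target word to the token positions at which it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# A toy "corpus" in chronological order, standing in for the tweet archive.
tokens = "coffee in Shropshire then coffee and cycling in Yorkshire".split()
offsets = dispersion_offsets(tokens, ["coffee", "Shropshire", "Yorkshire", "cycling"])

for word, positions in offsets.items():
    print(word, positions)
```

Plotting each position as a tick along a horizontal axis, one row per word, gives the dispersion plot. Since the corpus is chronological, position stands in for time.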

Alongside the lexical dispersion plot, I also wanted to focus on a few particular words within my tweets. I looked into how I tweet about coffee, and used a few of NLTK’s most basic functions. A simple: ‘tweet_corpus["tweets"].count("coffee")’, for example, gives me the beginnings of keyword metrics from my social media. (I’ve tweeted “coffee” 809 times, btw.) Using the vocab variable, I can ask Python – ‘len(tweet_corpus["vocab"])’ – how many different words I use (around 35k), though this tends to include some redundancies like plurals and punctuation. Taking an old linguist’s standby, I also created a concordance, getting the occurrences within context. NLTK beautifully lined this all up for me with a single command: ‘tweet_corpus["tweets"].concordance("coffee")’
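For anyone without NLTK to hand, the same three questions (how often, how many distinct words, and in what context) can be approximated with plain Python. This is a rough stand-in for what NLTK does rather than its actual implementation; the helper name `concordance`, its `width` parameter and the sample tokens are invented for the example.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of word with up to `width` tokens of context either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# A tiny invented token list standing in for the tweet corpus.
tokens = "Need coffee now . More coffee after cycling".split()

keyword_count = tokens.count("coffee")          # how often have I said it?
vocab = sorted(set(t.lower() for t in tokens))  # how many distinct tokens?
contexts = concordance(tokens, "coffee")        # where did I say it?

print(keyword_count, len(vocab))
for line in contexts:
    print(line)
```

Note that the vocabulary here counts the full stop as a “word”, which is the same kind of redundancy mentioned above.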

View the code on Gist.

I could continue to walk through other NLTK exercises, showing you how I built bigrams and compared words, but I’ll leave further exploration for future posts.
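As a small taster of the bigram idea, here is a sketch in plain Python that pairs each token with its neighbour and counts the pairs. NLTK has its own helpers for this; the function name `bigram_counts` and the sample tokens below are invented for the example.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


tokens = "more coffee please more coffee now".split()
counts = bigram_counts(tokens)
print(counts.most_common(1))
```

Run over a real corpus, the most common pairs start to show which words habitually travel together.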

What I would like to end with is an observation/question: this noodling in the natural language processing on social data makes it clear that a very few commands can be used to provide context and usage metrics for keywords. In other words, it isn’t very hard to see how often you’ve said (in this case tweeted) a keyword you may be tracking. You could treat just about any collection of words as your own corpus (company blog, user manuals, other social media…), and start asking some very straightforward questions very quickly.

What other data questions would you want to ask of your words?