data – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Henry Morris (CEO and social mobility start-up whizz) on getting contacts from PDF into his iPhone Wed, 30 Sep 2015 14:11:16 +0000

Henry Morris

Meet @henry__morris! He’s the inspirational serial entrepreneur who set up PiC and upReach. They’re amazing businesses that focus on social mobility.

We interviewed him for

He’s been using it to convert delegate lists that come as PDF into Excel and then into his Apple iPhone.

It’s his preferred personal Customer Relationship Management (CRM) system: a simple and effective solution for keeping his contacts up to date and in context.

Read the full interview

Got a PDF you want to get data from?
Try our easy web interface over at!


Scraping Spreadsheets with XYPath Wed, 12 Mar 2014 17:11:12 +0000

Spreadsheets are great. They’re ubiquitously available, beaten only by web pages and word processor documents.

Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated databases, they don’t impose a strong sense of structure, they’re easy to print out and they can be imported into different pieces of software.

And so people pass their data around in Excel files and CSV files, or they replicate the look-and-feel of a spreadsheet in the browser with a table, often creating a large number of tables.

But the very traits that make spreadsheets so simple for the user creating the data hamper us when we want to reuse the data.

There’s no guarantee that the headers that we’re looking for are in the top row of the table, or even the same row every time, or that exactly the same text appears in the header each time.

The problem is that it’s very easy to think about a spreadsheet’s table in terms of absolute positions: cell F12, for example, is the sixth column and the twelfth row. Yet when we’re thinking about the data, we’re generally interested in the labels of cells, and the intersections of rows and columns: in a spreadsheet about population demographics we might expect to find the UK’s current data at the intersection of the column labelled “2014” and the row named “United Kingdom”.



So we wrote XYPath!

XYPath is a Python library that helps you navigate spreadsheets. It’s inspired by the XML query language XPath, which lets us describe and navigate parts of a webpage.

We use Messytables to get the spreadsheet into Python. It doesn’t really care whether the file it’s loading is an XLS, a CSV, an HTML page or a ZIP containing CSVs; it gives us a uniform interface to all these table-containing filetypes.

So, looking at our World Bank spreadsheet above, we could say:

Look for cells containing the words “Country Code”: there should be only one. To the right of it are year headers; below it are the names of countries. Beneath the years, and to the right of the countries, are the population values we’re interested in. Give me those values, and the year and country they’re for.

In XYPath, that looks something like:

region_cell = pop_table.filter("Country Code").assert_one()
years = region_cell.fill(RIGHT)
countries = region_cell.fill(DOWN)
print(list(years.junction(countries)))

That’s just scratching the surface of what XYPath lets us do, because each of those named items is the same sort of construct: a “bag” of cells, which we can grow and shrink to talk about any groups of cells, even those that aren’t a rectangular block.
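To make the idea of navigating by labels rather than by absolute positions concrete, here is a minimal plain-Python sketch. It does not use XYPath or Messytables; the helper names find_one and junction are hypothetical, and the table contents are placeholders rather than real World Bank figures.

```python
# Toy table laid out like the spreadsheet described above.
# All names and numbers are placeholders, not real World Bank data.
table = [
    ["Population data", "", "", ""],
    ["Country Name", "Country Code", "2013", "2014"],
    ["Freedonia", "FRE", 100, 110],
    ["Sylvania", "SYL", 200, 205],
]


def find_one(table, text):
    """Return the (row, col) of the single cell equal to `text`."""
    hits = [(r, c)
            for r, row in enumerate(table)
            for c, value in enumerate(row)
            if value == text]
    assert len(hits) == 1, "expected exactly one %r cell" % text
    return hits[0]


def junction(table, anchor_text):
    """Yield (row label, column header, value) for each cell that sits
    below a header to the right of the anchor and to the right of a
    label below the anchor."""
    r0, c0 = find_one(table, anchor_text)
    header_cols = range(c0 + 1, len(table[r0]))  # cells to the right
    label_rows = range(r0 + 1, len(table))       # cells below
    for r in label_rows:
        for c in header_cols:
            yield table[r][c0], table[r0][c], table[r][c]


results = list(junction(table, "Country Code"))
for label, year, value in results:
    print(label, year, value)
```

Because the anchor cell is found by its text, the lookup keeps working if the header row moves down a line or an extra column is added on the left.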

We’re also looking into ways of navigating higher-dimensional data efficiently (what if the table records the average wealth per person and other statistics, too? And also provides a breakdown by gender?) and have plans for improving how tables are imported through Messytables.

Get in touch if you’re interested in harnessing our technical expertise at understanding spreadsheets, or if you’ve any questions about the code!

Try Table Xtract or Call ScraperWiki Professional Services

Digging Olympic Data at Londinium MMXII Tue, 24 Jul 2012 09:50:23 +0000

This is a guest post by Makoto Inoue, one of the organisers of this weekend’s Londinium MMXII hackathon.

The Olympics! Only a few days to go until seemingly every news camera on the planet is pointed at the East End of London, for a month of sporting coverage. But for data diggers everywhere, this is also a gigantic opportunity to analyse and visualise whole swathes of sporting data, as well as create new devices and apps to amplify, manage and make sense of the data in interesting ways.

Remapping past Olympic results into London 2012 schedule to predict the medal ranking leader board

I’m organising the Londinium MMXII Hackathon which happens the day after the opening of the Olympics so that the participants can do cool hacks using real time data. But while you can use Twitter and Facebook to gather social buzz, or TfL, Google Maps and Foursquare to do geo mashups, it turns out the one dataset we’re missing is real time game results. I spent a long time trying to find out if there are publicly available data APIs but in the end it looked like we were out of luck!

Out of luck, that was, until we found out about ScraperWiki. Rather than waiting for the data to come to us, ScraperWiki lets us go grab the freshest data ourselves – after all, there will be tons of news sites publishing the Olympic schedule, and many (like the BBC) are well structured enough to reliably scrape. Since the BBC publishes the schedule (and, from the look of it, the result) of each event, including, most importantly, the exact time of each sport, we can easily set periodic scheduler jobs to scrape the latest data as it is announced. Perfect!

I’ve already written one scraper while writing an Olympic medal rivalry article, so feel free to copy the scraper as your own starting point. Setting an hourly cronjob on ScraperWiki is normally a premium service, but the guys at ScraperWiki are so keen to see what data the Londinium MMXII hackers can come up with, they’re allowing all participants free access to set an hourly cron, for the duration of the hackathon (thanks ScraperWiki!). So let’s join the hackathon and hack together!!

Three hundred thousand tonnes of gold Wed, 04 Jul 2012 20:17:18 +0000

On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number is truncated by the double precision floating point notation in the database, and this same number gets expressed as 1.58887418588e+13.
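The shortened figure is consistent with the float being printed to twelve significant figures, rather than with the value itself being out of range. That is an assumption about how the datastore displays numbers, not something confirmed here; the sketch below only shows that the same float gives both forms.

```python
from decimal import Decimal

debt = 15888741858820.66  # figure quoted in the post, held as a float

# Twelve significant figures gives the shortened scientific notation.
short_form = "%.12g" % debt

# Two decimal places recovers the pennies from the very same float.
full_form = "%.2f" % debt

# For exact money arithmetic, a Decimal built from the string avoids
# binary floating point altogether.
exact = Decimal("15888741858820.66")

print(short_form)
print(full_form)
print(exact)
```

The pennies are lost only when the shortened text is parsed back into a number, so storing the value as text or as a Decimal sidesteps the issue.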

Doesn’t matter for now. Let’s look at the graph over time:

It’s not very exciting, unless you happen to be interested in phrases such as “debasing our fiat currency” and “return to the gold standard”. In truth, one should really divide the values by the GDP, or the national population, or the cumulative inflation over the time period to scale it properly.

Nevertheless, I decided also to look at the gold price, which can be seen as a graph (click the [Graph] button, then [Group Column (x-axis)]: “date” and [Group Column (y-axis)]: “price”) on the Data Hub. They give this dataset the title: Gold Prices in London 1950-2008 (Monthly).

Why does the data stop in 2008 just when things start to get interesting?

I discovered a download url in the metadata for this dataset:

which is somewhere within the github™ as part of the repository in which there resides a 60 line Python scraper known as

Aha, something I can work with! I cut-and-pasted the code into ScraperWiki as scrapers/gold_prices and tried to run it. Obviously it didn’t work as-is — code always requires some fiddling about when it is transplanted into an alien environment. The module contained three functions: download(), extract() and upload().

The download() function didn’t work because it tries to pull from the broken link:

This is one of the unavoidable failures that can befall a webscraper, and was one of the motivations for hosting code in a wiki so that such problems can be trivially corrected without an hour of labour checking out the code in someone else’s favourite version control system, setting up the environment, trying to install all the dependent modules, and usually failing to get it to work if you happen to use Windows like me.

After some looking around on the Bundesbank website, I found the Time_series_databases (Click on [Open all] and search for “gold”.) There’s Yearly average, Monthly average and Daily rates. Clearly the last is the one to go for as the other rates are averages and likely to be derivations of the primary day rate value.

I wonder what a “Data basket” is.

Anyways, moving on. Taking the first CSV link and inserting it into that code hits a snag in the extract() function:

import csv
import os

downloaded = 'cache/bbk_WU5500.csv'
outpath = 'data/data.csv'
def extract():
    reader = csv.reader(open(downloaded))
    # trim junk from files: skip the first five rows and the last one
    newrows = [ [row[0], row[1]] for row in list(reader)[5:-1] ]

    existing = []
    if os.path.exists(outpath):
        existing = [ row for row in csv.reader(open(outpath)) ]

    # drop previously saved rows from the first new date onwards
    starter = newrows[0]
    for idx,row in enumerate(existing):
        if row[0] == starter[0]:
            del existing[idx:]
            break

    # and now add in new data
    outrows = existing + newrows
    csv.writer(open(outpath, 'w')).writerows(outrows)

ScraperWiki doesn’t have persistent files, and in this case they’re not helpful because all these lines of code are basically replicating the features of the ScraperWiki datastore, which are available through use of the following two lines:

    ldata = [ { "date":row[0], "value":float(row[1]) }  for row in newrows  if row[1] != '.' ]
    scraperwiki.sqlite.save(["date"], ldata)

And now your always-up-to-date gold price graph is yours to have at the cost of select date, value from swdata order by date –> google annotatedtimeline.
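What makes the file juggling in extract() unnecessary is a store that replaces rows sharing the same key. The sketch below imitates that with Python's built-in sqlite3 module rather than the ScraperWiki library; the save() helper is hypothetical, the prices are placeholders, and keying the table on date is an assumption.

```python
import sqlite3

# Rows as they might come out of the CSV: [date, value]. A '.' marks a
# day with no price. The prices are placeholders, not real quotes.
newrows = [
    ["2012-06-29", "1600.0"],
    ["2012-06-30", "."],
    ["2012-07-02", "1610.0"],
]

ldata = [{"date": row[0], "value": float(row[1])}
         for row in newrows if row[1] != "."]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (date TEXT PRIMARY KEY, value REAL)")


def save(rows):
    """Insert rows, overwriting any existing row with the same date."""
    conn.executemany(
        "INSERT OR REPLACE INTO swdata (date, value) VALUES (:date, :value)",
        rows,
    )
    conn.commit()


save(ldata)
save(ldata)  # running the scraper again adds no duplicate days

stored = conn.execute(
    "SELECT date, value FROM swdata ORDER BY date").fetchall()
print(stored)
```

With the key on date, overlapping downloads can be saved repeatedly without first working out where the old data ends.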

But back to the naked github disclosed code. Without its own convenient feature, this script must use its own upload() function.

def upload():
    import datastore.client as c
    dsurl = ''
    client = c.DataStoreClient(dsurl)

Ah, we have another problem: a dependency on the undeclared datastore.client library, which was probably so seamlessly available to the author on his own computer that he didn’t notice its requirement when he committed the code to the github where it could not be reused without this library. The library datastore.client is not available in the github/datasets account; but you can find it in the completely different github/okfn account.

I tried calling this code by cut-and-pasting it into the ScraperWiki scraper, and it did something strange that looked like it was uploading the data to somewhere, but I can’t work out what’s happened. Not to worry. I’m sure someone will let me know what happened when they find a dataset somewhere that is inexplicably more up to date than it used to be.

But back to the point. Using the awesome power of our genuine data-hub system we can take the us_debt_to_the_penny, and attach the gold_prices database to perform the combined query to scale ounces of gold into tonnes:

    AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata as gold
  ON =
WHERE is not null

and get the graph of US government debt expressed in terms of tonnes of gold.
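A join of this shape can be tried locally with Python's sqlite3 module. This is only a sketch: the column names totaldebt and value are assumptions, the gold price and the second debt row are placeholders, and the price is assumed to be in US dollars per troy ounce.

```python
import sqlite3

# One tonne is 1,000,000 g and one troy ounce is 31.1034768 g.
TROY_OUNCES_PER_TONNE = 1000000 / 31.1034768

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database to play the part of gold_prices.
conn.execute("ATTACH DATABASE ':memory:' AS gold_prices")
conn.execute("CREATE TABLE swdata (date TEXT, totaldebt REAL)")
conn.execute("CREATE TABLE gold_prices.swdata (date TEXT, value REAL)")

# The first debt figure is the one quoted in the post; the second row is
# made up to show that days without a gold price are filtered out.
conn.execute("INSERT INTO swdata VALUES ('2012-07-02', 15888741858820.66)")
conn.execute("INSERT INTO swdata VALUES ('2012-07-03', 15900000000000.0)")
# Placeholder price, not a real quote.
conn.execute("INSERT INTO gold_prices.swdata VALUES ('2012-07-02', 1600.0)")

query = """
SELECT debt.date,
       debt.totaldebt / gold.value / ? AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata AS gold
  ON gold.date = debt.date
WHERE gold.date IS NOT NULL
"""
rows = conn.execute(query, (TROY_OUNCES_PER_TONNE,)).fetchall()
for date, tonnes in rows:
    print(date, round(tonnes))
```

Dividing the debt by the price gives ounces of gold, and dividing again by the number of troy ounces in a tonne gives tonnes.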

So that looks like good news for all the gold-bugs: the US government debt in the hard currency of gold has been going steadily down by a factor of two since 2001 to around 280 thousand tonnes. The only problem with that is that there are only 164 thousand tonnes of gold in the world according to the latest estimates.

Other fun charts people find interesting such as gold to oil ratio can be done once the relevant data series is loaded and made available for joining.

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories) Fri, 25 May 2012 07:13:33 +0000

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time and rich data would revolutionize their governance and administration, would give real evidence of whether there are too many or too few police, fire or ambulance personnel/vehicles/stations in any locale, or would enable the implementation of imaginative and realistic policies resulting from major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add more or take away staff depending on their workloads. The problem is, for an emergency service that exists on standby for unforeseen events, there needs to be a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization; indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance where the Excel spreadsheet of Fire Spending By city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the story journalism on it doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails(), MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static url, eg:

MainIndex() operates their search incidents form, grabbing 10 days at a time and saving URLs for each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parsing the obvious metadata, and saving the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move it into the third function so I could develop it more easily.) A good way to find the list of urls that have not been downloaded and saved into the swdetails is to use the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
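The fetch-five-and-repeat loop can be sketched with Python's built-in sqlite3 module. The fetch_html() function is a hypothetical stand-in for the real download, and the twelve example URLs are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (IncidentID TEXT, urlpage TEXT)")
conn.execute("CREATE TABLE swdetails (IncidentID TEXT, divdetails TEXT)")
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?)",
    [(str(i), "http://example.invalid/incident/%d" % i) for i in range(12)],
)

PENDING_SQL = """
select swdata.IncidentID, swdata.urlpage
from swdata
left join swdetails on swdetails.IncidentID = swdata.IncidentID
where swdetails.IncidentID is null
limit 5
"""


def fetch_html(url):
    # Hypothetical stand-in for the real HTTP download.
    return "<div>details for %s</div>" % url


batches = 0
while True:
    pending = conn.execute(PENDING_SQL).fetchall()
    if not pending:
        break  # every incident now has a row in swdetails
    for incident_id, urlpage in pending:
        conn.execute(
            "INSERT INTO swdetails VALUES (?, ?)",
            (incident_id, fetch_html(urlpage)),
        )
    conn.commit()
    batches += 1

print(batches)
```

Since each pass asks the database which rows are still missing, the scraper can be interrupted at any point and simply picks up where it left off on the next run.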

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving it into the table swparse. Because I was developing this function experimentally to see how much information I could obtain from the free-form text, I had to frequently drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(?i)(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)", details)

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)
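Here is a small runnable check of the rank-and-name pattern against the sample report quoted above, assuming the pattern uses the usual \s and \w character classes. The second sentence uses a made-up name to illustrate the first-name and last-name flaw.

```python
import re

RANK_RE = re.compile(
    r"(crew|watch|station|group|incident|area)\s+"
    r"(commander|manager)\s*([\w-]+)",
    re.IGNORECASE,
)

details = ("One appliance from Holsworthy was mobilised to reports of a "
           "motorbike on fire. Crew Commander Squirrell was in charge.")
officers = RANK_RE.findall(details)
print(officers)

# Made-up name: only the first word after the rank is captured.
two_names = RANK_RE.findall("Watch Manager John Smith was in charge.")
print(two_names)
```

Passing re.IGNORECASE as a flag avoids putting an inline (?i) in the middle or at the end of the pattern, which recent Python versions reject.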

For detecting how many vehicles were in attendance, I used this algorithm:

appliances = re.findall(r"(?i)(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details)
nvehicles = 0
for scount, fire, engine, town in appliances:
    # remember the first town mentioned as the station town
    if town and "town" not in data:
        data["town"] = town.lower()
    # turn the word before the vehicle into a number
    if re.match("(?i)one|1|an?|another", scount):  count = 1
    elif re.match("(?i)two|2", scount):            count = 2
    elif re.match("(?i)three", scount):            count = 3
    elif re.match("(?i)four", scount):             count = 4
    else:                                          count = 0
    nvehicles += count
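The same counting idea can be wrapped in a self-contained function and tried on the two reports quoted earlier. This sketch swaps the chain of re.match calls for a dictionary lookup, so it is a variation rather than the scraper's own code; the names count_vehicles and WORD_COUNTS are hypothetical.

```python
import re

APPLIANCE_RE = re.compile(
    r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)"
    r"(?: from ([A-Za-z]+))?",
    re.IGNORECASE,
)

# Words that can stand in front of a vehicle, mapped to a count.
WORD_COUNTS = {"one": 1, "1": 1, "a": 1, "an": 1, "another": 1,
               "two": 2, "2": 2, "three": 3, "four": 4}


def count_vehicles(details):
    """Return (number of vehicles, first town mentioned or None)."""
    nvehicles = 0
    town = None
    for scount, _kind, _noun, from_town in APPLIANCE_RE.findall(details):
        if from_town and town is None:
            town = from_town.lower()
        # Unrecognised words count as zero vehicles.
        nvehicles += WORD_COUNTS.get(scount.lower(), 0)
    return nvehicles, town


holsworthy = count_vehicles(
    "One appliance from Holsworthy was mobilised to reports of a "
    "motorbike on fire.")
torquay = count_vehicles(
    "One appliance from Torquays fire station was mobilised to reports "
    "of a child with a potty seat stuck on its head.")
print(holsworthy)
print(torquay)
```

The second report shows the kind of noise the text produces: the town is captured exactly as written, missing apostrophe and all.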

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day each incident took place, and produced this time rose, which is a polar bar graph with one sector showing the number of incidents occurring each hour.

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.
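The data behind a time rose is just a count of incidents per hour, optionally filtered. A minimal sketch of that binning step follows; the timestamps are hypothetical, and incidents_per_hour is an illustrative helper, not part of the visualization's code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident timestamps; real ones come from the scraped table.
incident_times = [
    "2012-05-23 09:54",
    "2012-05-23 17:10",
    "2012-05-24 09:05",
    "2012-05-26 23:59",
]


def incidents_per_hour(timestamps, weekday=None):
    """Count incidents in each hour of the day (0 to 23).

    `weekday` optionally keeps one day only: Monday=0 ... Sunday=6.
    """
    counts = Counter()
    for stamp in timestamps:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        if weekday is None or when.weekday() == weekday:
            counts[when.hour] += 1
    # One entry per sector of the rose, including empty hours.
    return [counts.get(hour, 0) for hour in range(24)]


sectors = incidents_per_hour(incident_times)
wednesdays = incidents_per_hour(incident_times, weekday=2)
print(sectors)
```

Each of the 24 numbers becomes the length of one sector in the polar bar graph.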

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three different historically independent services.

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

International Data Journalism Awards….deadline fast approaching..(10th April 2012) Mon, 26 Mar 2012 17:00:29 +0000

Everybody is talking about and trying to do ‘data journalism’, and the first ever International Data Journalism Awards have been established to recognise the huge effort that people are making in this field. It’s a great opportunity to showcase your work. Backed by Google, the prizes are generous at €45,000 (over $55,000) to six winners and the process is being managed by Global Editors

The main objectives are to a) Contribute to setting high standards and highlighting the best practices in data journalism and b) Demonstrate the value of data journalism among editors and media executives.

There are three categories:

  1. Data-driven investigative journalism
  2. Data visualisation & storytelling
  3. Data-driven applications

The competition is open to media companies, non-profit organisations, freelancers and individuals. Applicants are welcome to submit their best data journalism projects before 10 April 2012 at submit-your-work/.

To find out more about the competition and how to apply check out  If you have any questions about the competition get in touch with the lovely Liliana Bounegru, DJA Coordinator (bounegru [at] ejc [dot] net). Liliana works at the European Journalism Centre

Fine set of graphs at the Office of National Statistics Thu, 22 Mar 2012 11:47:01 +0000

It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office for National Statistics (UK).

If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through?

A Population Pyramid is a straightforward way to visualize the data, like so:

This image is sufficient for determining what needs to be supplied (eg more children means more schools and toy-shops), but it doesn’t explain why.

The “why?” and “what’s going on?” questions are much more interesting, but are pretty much guesswork because they refer to layers in the data that you cannot see. For example, the number of people in East Devon of a particular age is the sum of those who have moved into the area at various times, minus those who have moved away (temporarily or permanently), plus those who were already there and have grown older but not yet died. For any bulge, you don’t know which layer it belongs to.

In this 2015 population pyramid there are bulges at 28, 50 and a pronounced spike at 68, as well as dips at 14 and 38. In terms of birth years, these correspond to 1987, 1965 and 1947 (spike), and dips at 2001 and 1977.

You can pretend they correspond to recessions, economic boom times and second wave feminism, but the 1947 post-war spike when a mass of men-folk were demobilized from the military is a pretty clean signal.

What makes this data presentation especially lovely is that it is localized, so you can see the population pyramid per city:

Cambridge, as everyone knows, is a university town, which explains the persistent spike at age 20.

And, while it looks like there is gender equality for 20 year old university students, there is a pretty hefty male lump up to the age of 30, possibly corresponding to folks doing higher degrees. Is this because fewer men are leaving town at the appropriate age to become productive members of society, or is there an influx of foreign grad students from places where there is less gender equality? The data set of student origins and enrollments would give you the story.

As to the pyramid on the right hand side, I have no idea what is going on in Camden to account for that bulge in 30 year olds. What is obvious, though, is that the bulge in infants must be related. In fact, almost all the children between the ages of 0 and 16 years will have corresponding parents higher up the same pyramid. Also, there is likely to be a pairwise cross-gender correspondence between individuals of the same generation living together.

These internal links, external data connections, sub-cohorts and new questions raised the more you look at it mean that it is impossible to create a single all-purpose visualization application that could serve all of these. We can wonder as to whether an interface which worked via javascript-generated SQL calls (rather than flash and server-side queries) would have enabled someone with the right skills to roll their own queries and, for example, immediately find out which city and age group has the greatest gender disparity, and whether all spikes at the 20-year-old age bracket can be accounted for by universities.

For more, see An overview of ONS’s population statistics.

As it is, someone is going to have to download/scrape, parse and load at least one year of source data into a data hub of their choice in order to query this (we’ve started on 2010’s figures here on ScraperWiki – take a look). Once that’s done, you’d be able to sort the cities by the greatest ratio between number of 20 year olds and number of 16 year olds, because that’s a good signal of student influx.
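Once the counts are loaded, the student-influx ranking is a one-line ratio and a sort. In this sketch the area names and counts are placeholders, not ONS figures, and student_influx_ranking is a hypothetical helper.

```python
# Placeholder counts per area: {age: number of residents}. Not ONS data.
population = {
    "Area A": {16: 1000, 20: 3000},
    "Area B": {16: 1200, 20: 1100},
    "Area C": {16: 800, 20: 1600},
}


def student_influx_ranking(population):
    """Sort areas by the ratio of 20 year olds to 16 year olds."""
    ratios = {
        area: ages[20] / ages[16]
        for area, ages in population.items()
        if ages.get(16)  # skip areas with no 16 year olds recorded
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranking = student_influx_ranking(population)
for area, ratio in ranking:
    print(area, round(ratio, 2))
```

A ratio well above one suggests that many more 20 year olds live in the area than grew up there.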

I don’t have time to get onto the Population projection models, where it really gets interesting. There you have all the clever calculations based on guestimates of migration, mortality and fertility.

What I would really like to see are these calculations done live and interactively, as well as combined with economic data. Is the state pension system going to go bankrupt because of the “baby boomers”? Who knows? I know someone who doesn’t know: someone whose opinion does not rely (even indirectly) on something approaching a dynamic data calculation. I mean, if the difference between solvency and bankruptcy is within the margin of error in the estimate of fertility rate, or 0.2% in the tax base, then that’s not what I’d call bankrupt. You can only find this out by tinkering with the inputs with an element of curiosity.

Privatized pensions ought to be put into the model as well, to give them the macro-economic context that no pension adviser I’ve ever known seems capable of understanding. I mean, it’s evident that the stock market (in which private pensions invest) does happen to yield a finite quantity of profit each year. Ergo it can support a finite number of pension plans. So a national policy which demands more such pension plans than this finite number is inevitably going to leave people hungry.

Always keep in mind the long term vision of data and governance. In the future it will all come together like transport planning, or the procurement of adequate rocket fuel to launch a satellite into orbit; a matter of measurements and predictable consequences. Then governance will be a science, like chemistry, or the prediction of earthquakes.

But don’t forget: we can’t do anything without first getting the raw data into a usable format. Dave McKee’s started on 2010’s data here … fancy helping out?

Happy New Year and Happy New York! Tue, 03 Jan 2012 20:32:42 +0000

We are really pleased to announce that we will be hosting our very first US two day Journalism Data Camp event in conjunction with the Tow Center for Digital Journalism at Columbia University and supported by the Knight Foundation on February 3rd and 4th 2012.

We have been working with Emily Bell @emilybell, Director of the Tow Center and Susan McGregor @SusanEMcG, Assistant Professor at the Columbia J School to plan the event. The main objective is to liberate and use New York data for the purposes of keeping business and power accountable.

After a short introduction on the first day, we will split the event into three parallel streams: journalism data projects, liberating New York data, and ‘learn to scrape’. We plan to inject some fun by running a derby for the project stream and also by awarding prizes in all of the streams. We hope to make the event engaging and enjoyable.

We need journalists, media professionals, students of journalism, political science or  information technology, coders, statisticians and public data boffs to dig up the data!

Please pick a stream and sign-up to help us to make New York a data driven city!

Our thanks to Columbia University, Civic Commons, The New York Times, and CUNY for allowing us to use their premises as we sojourned in the Big Apple.

Zarino has created a map with our US events which we will update with additional events as we add locations.

‘Big Data’ in the Big Apple Thu, 29 Sep 2011 15:05:25 +0000

My colleague @frabcus captured the main theme of Strata New York #strataconf in his most recent blog post. This was our first official speaking engagement in the USA as a Knight News Challenge 2011 winner. Here is my twopence worth!

At first we were a little confused at the way in which the week long conference was split into three consecutive mini conferences with what looked like repetitive content.  The reality was that the one day Strata Jump Start was like an MBA for people trying to understand the meaning of ‘Big Data’.  It gave a 50,000 foot view of what is going on and made us think about the legal stuff, how it will impact the demand for skills and how the pace with which data is exploding will dramatically change the way in which businesses operate – every CEO should attend or watch the videos and learn!

The following two days, called the Strata Summit, were focused on what people need to think about strategically to get business ready for the onslaught. In his welcome address Edd Dumbill, program chair for O’Reilly, said “Computers should serve humans….we have been turned into filing clerks by computers….we spend our day sifting, sorting and filing information…something has gone upside down, fortunately the systems that we have created are also part of the solution…big data can help us…it may be the case that big data has to help us!”

Big Apple…and small products

To use the local lingo we took a ‘deep dive’ into various aspects of the challenges.  The sessions were well choreographed and curated.  We particularly liked the session ‘Transparency and Strategic Leaking’ by Dr Michael Nelson (Leading Edge Forum- CSC) where he talked about how companies need to be pragmatic in an age when it is impossible to stop data leaking out of the door.  Companies he said ‘are going to have to be transparent’ and ‘are going to have to have a transparency policy’.   He referred to a recent article in the Economist ‘The Leaking Corporation’ and its assertion that corporations that leak their own data ‘control the story’.

Courtesy of O’Reilly Media

Simon Wardley’s (Leading Edge Forum – CSC) ‘Situation Normal Everything Must Change’ segment made us laugh, especially the philosophical little quips that came from his encounter with a London taxi driver.  He conducted it at lightning speed, and his explanation of ‘ecosystems’ and how big data offers a potential solution to the ‘Innovation Paradox’ was insightful.  It was a heavy-duty session but worth it!

There were tons of excellent sessions to peruse.  We really enjoyed Cathy O’Neill’s ‘What kinds of people are needed for data management’, which talked about data scientists and how they can help corporations to discern ‘noise’ from signal.

Our very own Francis Irving was interviewed about how ScraperWiki relates to Big Data and Investigative Journalism.

Unfortunately we did not manage to see many of the technology exhibitors #fail. However, we did see some very sexy ideas, including a wonderful software start-up called – a platform for data prediction competitions. Its Chief Data Scientist, Jeremy Howard, gave us some great ideas on how to manage ‘labour markets’.

..Oh yes and we checked out why it is called Strata….

We had to leave early to attend the Online News Association (#ONA) event in Boston, so we missed part III, which was the two-day Strata Conference itself.  It is designed for people at the cutting edge of data: the data scientists and data activists!  I just hope that we manage to get to Strata 2012 in Santa Clara next February.

In his closing address, ‘Towards a global brain’, Tim O’Reilly gave a list of 10 scary things that are leading into the perfect humanitarian storm, including…Climate Change, Financial Meltdown, Disease Control, Government inertia……so we came away thinking of a T-Shirt theme…Hmm we’re f**ked so let’s scrape!!!

Courtesy Hugh MacLeod

Start Talking to Your Data – Literally! Fri, 23 Sep 2011 15:22:38 +0000 Because ScraperWiki has a SQL database and an API with SQL extraction, I can SQL inject (haha!) straight into the API URL and use the JSON output.

So what does all that mean? I scraped the CSV files of Special Advisers’ meetings, gifts and hospitality at Number 10. The data is updated as it is published because I can schedule the scraper to run. If it fails to run, I get notified via email.

Now, I’ve written a script that publishes this information along with data from 4 other scrapers relating to Number 10 Downing Street, to a twitter account, Scrape_No10. Because I’ve made a twitter bot, I can tweet out a sentence and control the order and timing of tweets. I can even attach a hashtag which I can then rescrape to find what the social media sphere has attached to each data point. This has the potential to have the data go fish for you, as a journalist, but it is not immediately useful to the newsroom.
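As a rough sketch of the tweeting step (not the actual Scrape_No10 code), a row from the scraper can be turned into a sentence, tagged with a hashtag and trimmed to fit a 140-character tweet. The function name, the hashtag and the sample row are all made up for illustration, and the field names are borrowed from the gifts and hospitality data. Posting and scheduling are left out.

```python
# Hypothetical helper: turns one scraped row into the text of a tweet.
def compose_tweet(row, hashtag, limit=140):
    sentence = '%s got %s from %s on %s' % (
        row['Name of Special Adviser'],
        row['Type of hospitality received'],
        row['Name of Organisation'],
        row['Date of Hospitality'],
    )
    tag = ' #' + hashtag
    room = limit - len(tag)  # characters left for the sentence itself
    if len(sentence) > room:
        # Cut the sentence and mark the cut so the hashtag always survives
        sentence = sentence[:room - 3].rstrip() + '...'
    return sentence + tag


# Placeholder row, not real data
sample = {
    'Name of Special Adviser': 'A. Adviser',
    'Type of hospitality received': 'Lunch',
    'Name of Organisation': 'Example Ltd',
    'Date of Hospitality': '2011-06-01',
}
print(compose_tweet(sample, 'no10gifts'))
```

Keeping the hashtag outside the truncated part matters here, because the hashtag is what gets rescraped later to find the reaction to each data point.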

So I give you MoJoNewsBot! I have written a script as a module in an IRC chat bot. It queries my data via the ScraperWiki API, injects what I write into the SQL and extracts the answer from the resulting JSON, giving me a written answer in the chat room.

Now I can write the commands in a private chat window with MoJoNewsBot or I can do it in the room. This means that rooms can be made for the political team in a newsroom or the environment team or the education team, and they can have their own bots with modules specific to their data streams. That way, computer-assisted reporting can be collaborative and social. If you’re working on a story that has a political and an educational angle, you pop into both rooms, so both teams can see what you’re asking of the data. In that sense, you’ve got a social, data-driven, virtual newsroom. As such, I’ve added other modules for the modern journalist.
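One way to picture room-specific modules is a small table that maps commands to functions, with each room's bot loading only the modules it needs. The sketch below is an assumption about how such a bot could be wired, not MoJoNewsBot's actual code. The command names `.gifts` and `.echo`, the room tables and the handler functions are invented for illustration.

```python
# Hypothetical modules: each takes the text after the command and returns a reply.
def gifts(arg):
    # Stand-in for a module that would query the ScraperWiki API
    return 'Looking up gifts and hospitality for %s' % arg


def echo(arg):
    return arg


# Each room's bot only knows the commands for its own data streams
POLITICS_ROOM = {'.gifts': gifts, '.echo': echo}
EDUCATION_ROOM = {'.echo': echo}


def handle(message, modules):
    """Return the bot's reply to a chat message, or None if it is not a known command."""
    command, _, arg = message.strip().partition(' ')
    handler = modules.get(command)
    if handler is None:
        return None
    return handler(arg.strip())
```

With this layout, `handle('.gifts Jane Doe', POLITICS_ROOM)` produces a reply, while the same message in the education room is ignored because that room has no `.gifts` module.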

With MoJoNewsBot you can look for twitter trends, search tweets, lookup last tweets, get the latest headlines from various news sources and check Google News. The bot has basic functions like Google search, Wolfram Alpha lookup, Wikipedia lookup, reminder setting and even a weather checker.

Here’s an example of the code needed to query the API and return a string from the JSON:

import json
import urllib
import urllib2

# userinput (the adviser's name typed into the chat) and Number (how many
# rows to report) are assumed to be set by the surrounding bot module.
type = 'jsondict'
scraper = 'special_advisers_gifts_and_hospitality'
site = ''  # base URL of the API
query = ('SELECT `Name of Special Adviser`, '
         '`Type of hospitality received`, `Name of Organisation`, '
         '`Date of Hospitality` FROM swdata '
         'WHERE `Name of Special Adviser` = "%s" '
         'ORDER BY `Date of Hospitality` desc' % userinput)

# Build the query string and fetch the JSON reply
params = {'format': type, 'name': scraper, 'query': query}
url = site + urllib.urlencode(params)
jsonurl = urllib2.urlopen(url).read()
swjson = json.loads(jsonurl)

# Turn each of the first Number rows into a sentence
for entry in swjson[:Number]:
    ans = ('On ' + entry["Date of Hospitality"] + ' %s'
          % userinput + ' got ' +
          entry["Type of hospitality received"] + ' from '
          + entry["Name of Organisation"])

This is just a prototype and a proof of concept. I would add to the module so the query could cover a specific date range. After that, I could go back to ScraperWiki and write a scraper that pulls in the other 4 Number 10 scrapers and constructs the larger database. Then all I need to do is change the name of the scraper in my module to this new one and I can now query the much larger dataset that includes ministers and permanent secretaries!
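The date-range idea could look something like the sketch below. It is written for Python 3, and it assumes the `Date of Hospitality` column holds ISO-style dates (YYYY-MM-DD) so that comparing them as strings puts them in the right order; the real data may well need cleaning first. The function names and parameters are illustrative only, and the scraper name is passed in so that a combined Number 10 scraper could be swapped in later.

```python
from urllib.parse import urlencode


def quote(value):
    # Standard SQL string literal: wrap in single quotes, double any inside
    return "'%s'" % value.replace("'", "''")


def build_query(name, start, end):
    """SQL for one adviser's hospitality between two dates (inclusive)."""
    template = ('SELECT `Name of Special Adviser`, '
                '`Type of hospitality received`, `Name of Organisation`, '
                '`Date of Hospitality` FROM swdata '
                'WHERE `Name of Special Adviser` = {name} '
                'AND `Date of Hospitality` BETWEEN {start} AND {end} '
                'ORDER BY `Date of Hospitality` desc')
    return template.format(name=quote(name), start=quote(start), end=quote(end))


def build_params(query, scraper='special_advisers_gifts_and_hospitality'):
    """Encode the API parameters; pass a different scraper name to switch datasets."""
    return urlencode({'format': 'jsondict', 'name': scraper, 'query': query})
```

Quoting the typed-in values is a design choice that goes beyond the prototype: it keeps a stray apostrophe in a name from breaking the query.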

Now that’s computer assisted reporting!

PS: have fixed the bug in .gn so the links match the headlines