UN-OCHA – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web

Building the Humanitarian Data Exchange team
https://blog.scraperwiki.com/2015/05/building-the-humanitarian-data-exchange-team/
Wed, 20 May 2015 13:01:05 +0000

We’ve been working with the United Nations Office for the Coordination of Humanitarian Affairs (UN-OCHA) to build the Humanitarian Data Exchange (HDX, @humdata).

The aim of the HDX is to bring together data from multiple open sources and provide it in a unified way to humanitarian workers to help them with their work in the field. As such, it’s a good illustration of the type of multi-disciplinary team required to deliver an Open Data project successfully.

[Image: HDX team at ScraperWiki offices]

This project brings together a wide range of skills which we think are essential to the success of such an enterprise.

  • Project sponsor – a critical role in a large organisation; the sponsor promotes the project internally and helps bring in the support of the senior management team.
  • Project manager – holds the project together, including building the team, communications, budget and outreach to external stakeholders.
  • Project technical manager – marshalling a group of software developers is a very particular task.
  • Designers – it’s no good getting together a big pile of lovely Open Data if you present it in such a way that users don’t want to access it, which is why we have designers.
  • Data architects – work out how to structure data in the Exchange and how to describe that data in machine-readable forms.
  • Statisticians – these people are really serious about the provenance of data: where does it come from, and how exactly was it collected?
  • Data managers – manage the process of bringing new data into the system.
  • Data scientists – blend two skills: software development and statistical analysis.
  • Lead developer – manages the rest of the team and how they deliver software.
  • Front-end developers – handle the user interface, appearance and so forth.
  • Back-end developers – make the data work behind the scenes.
  • System administrators – keep the website available and reliable on the web.

This isn’t to say all big Open Data projects need a whole person for each role, but this is the range of roles which needs to be covered.

Hiring this range of skills is not easy: there is a general skills shortage and stiff competition from the private sector. As a humanitarian organisation, though, UN-OCHA benefits from volunteer help, most recently crowd-sourced assistance to get structured data from country situation reports. Read about it here.

The organisations involved in making the Humanitarian Data Exchange happen are:

  • UN-OCHA owns the HDX project.
  • ScraperWiki is providing technical implementation management and data science.
  • Frog Design is providing the user experience, research and design.
  • The OKFN is providing its CKAN software and high-level strategic advice.

We are a distributed and cohesive team working all over the world, doing everything possible to ensure that this amazing project is successful.

Take a little look at the video about the project

Find and share data on the Humanitarian Data Exchange. Follow @humdata on Twitter to keep up to date.

Us and the UN
https://blog.scraperwiki.com/2013/10/us-and-the-un/
Tue, 15 Oct 2013 14:44:04 +0000

When you think of humanitarian crises like war, famine, earthquakes, storms and the like, you probably don’t think “data”. But data is critical to humanitarian relief efforts. It starts with background questions like:

  • What is the country we’re deploying to like in terms of population, geography, infrastructure, economic development, social structures?
  • What resources, financial and human, are available to deploy?

Once a relief effort is underway more questions arise:

  • How many people have been displaced this week and where are they now?
  • How many tents have we deployed and how many more are needed?
  • How do we safely get trucks from point A to point B?

ScraperWiki has been helping the United Nations Office for the Coordination of Humanitarian Affairs (UN-OCHA) with some of these questions.

The work is taking place in the Data and Analysis Project (DAP), in which we are playing a part by addressing the background data problem. There is lots of data on individual countries which has been collated in various places such as the World Bank, the World Health Organisation, the CIA World Factbook, Wikipedia and so forth. Although these are all excellent resources, an analyst at the UN doesn’t want to be visiting all of these websites and wrangling their individual interfaces to get the data they need.

This is where ScraperWiki comes in – we’ve been scraping data from over 20 different websites to help build a Common Humanitarian Dataset. This will be supplemented by internal data from the UN and ultimately, the hardest data of all – operational data from the field, which comes in many and various forms.

From a technical point of view, the websites we have scraped do not present great challenges: they belong to international organisations with mandates to make data available, and they often supply an API to make data more readily accessible by machine. It’s worth noting, though, that sometimes scraping a webpage is easier than using an API, because a webpage is self-documenting for humans while an API may not be well documented, or may not even reveal all of the underlying data.

Our technical challenge has been in ensuring the consistency and comparability of data. The UN is particularly sensitive about the correct and diplomatic naming of geographic entities. Now that the initial data is in, the UN can start to use it, and obviously we don’t pass on a chance to have a look at some new data either! The UN are building systems to make the data available in a variety of ways, but in the first instance both ScraperWiki and the UN have been using Tableau, both for visualising the data and for transforming it into formats more suitable for analysis in spreadsheets, which are still the lingua franca for numerical data.
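On the naming point above, one common way to make data from different sources comparable is to map each source's spelling of a country name onto a single canonical code. The sketch below is only an illustration of that idea, not the approach actually used in the project: the alias table, the choice of ISO 3166-1 alpha-3 codes as the canonical key, and the function names `normalise` and `to_code` are all assumptions.

```python
import unicodedata
from typing import Optional

# Hypothetical alias table: normalised spelling -> ISO 3166-1 alpha-3 code.
# A real table would be far larger and would need review by the data owners.
ALIASES = {
    "cote d'ivoire": "CIV",
    "ivory coast": "CIV",
    "south sudan": "SSD",
    "republic of south sudan": "SSD",
    "korea, rep.": "KOR",
    "republic of korea": "KOR",
}


def normalise(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace so spellings can be compared."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def to_code(name: str) -> Optional[str]:
    """Return the canonical code for a source-specific name, or None if it is unknown."""
    return ALIASES.get(normalise(name))
```

Returning `None` for unknown names, rather than guessing, leaves the decision about how to label a contested or unfamiliar entity with a person.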

In total we’ve collected over 300,000 data points for 240 different regions (mostly countries – plus dependent territories such as Martinique, Aruba, Hong Kong, French Guiana and so forth), covering a period of 60 years with 233 different indicators. You can get an overview of all this data from the chart below which shows the number of values we have for each year, coloured by the data source:


The amount of data available for each country varies. Recently created countries such as South Sudan don’t have a lot of data associated with them; similarly, Caribbean and Pacific islands, which are often dependent territories, are also lacking.
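Counts like the ones described above (values per year by source, or values per region) are straightforward to tally once every data point is stored as one row. This is a minimal sketch under assumptions: the `Observation` layout and the function names are hypothetical, and the rows are placeholders rather than real figures.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Observation(NamedTuple):
    """One data point in a long-format table (an assumed layout, for illustration)."""
    region: str
    indicator: str
    year: int
    source: str
    value: float


def values_per_year_by_source(rows: Iterable[Observation]) -> Counter:
    """Count how many values exist for each (year, source) pair."""
    return Counter((row.year, row.source) for row in rows)


def values_per_region(rows: Iterable[Observation]) -> Counter:
    """Count how many values exist for each region, to spot sparsely covered ones."""
    return Counter(row.region for row in rows)


# Placeholder rows for illustration only; these are not real figures.
rows = [
    Observation("AAA", "population", 2009, "source_a", 1.0),
    Observation("AAA", "population", 2010, "source_a", 2.0),
    Observation("BBB", "population", 2010, "source_a", 3.0),
    Observation("BBB", "under_five_mortality", 2010, "source_b", 4.0),
]
year_source_counts = values_per_year_by_source(rows)
region_counts = values_per_region(rows)
```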

We’ve been able to quickly slice out data for the 24 countries in which the UN have offices and they’ve been able to join this with some of their internal data to get live position reports.
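The slice-and-join step can be pictured as a filter followed by a lookup on a shared region code. The sketch below is an assumption about the shape of the data, not a description of the UN's systems: the dictionary keys, the `slice_and_join` name and the placeholder values are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def slice_and_join(
    indicators: Iterable[dict],
    internal: Dict[str, dict],
    office_codes: Set[str],
) -> List[dict]:
    """Keep indicator rows for the given regions and attach internal fields by region code."""
    joined = []
    for row in indicators:
        code = row["region"]
        if code not in office_codes:
            continue  # region is outside the slice
        merged = dict(row)                     # copy, so the input rows are not modified
        merged.update(internal.get(code, {}))  # add internal fields when they exist
        joined.append(merged)
    return joined


# Placeholder data for illustration only.
indicators = [
    {"region": "AAA", "indicator": "population", "year": 2010, "value": 1.0},
    {"region": "ZZZ", "indicator": "population", "year": 2010, "value": 2.0},
]
internal = {"AAA": {"office": "placeholder office"}}
result = slice_and_join(indicators, internal, {"AAA"})
```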

To give a few examples, here is under-five mortality data for 2010 from two different sources. We can see that the two sources are highly correlated but not necessarily identical.

[Chart: under-five mortality]
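A comparison like this one can be quantified with a Pearson correlation coefficient plus the per-region differences between the two sources. The post does not say how the comparison was made (the charts came from Tableau), so the following is only a sketch: the function names and the dictionary-per-source layout are assumptions.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


def compare_sources(
    a: Dict[str, float], b: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Correlate two sources over the regions they share and report absolute differences."""
    shared = sorted(set(a) & set(b))
    r = pearson([a[k] for k in shared], [b[k] for k in shared])
    diffs = {k: abs(a[k] - b[k]) for k in shared}
    return r, diffs
```

A correlation close to 1 alongside non-zero differences is exactly the "highly correlated but not identical" situation: the sources agree on the ranking of regions while disagreeing on individual values.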

We can see that increasing GDP is correlated with a reduction in under five mortality but only up to a point:

[Chart: under-five mortality vs GDP]

This is only a start; there are endless things to find in this data!

UN-OCHA will soon launch an alpha instance for this data through their ReliefWeb Labs site (labs.reliefweb.int).