Hi! We've renamed ScraperWiki.
The product is now QuickCode and the company is The Sensible Code Company.

Blog

The Data Hob

Keeping with the baking metaphor, a hob is a projection or shelf at the back or side of a fireplace used for keeping food warm. The central part of a wheel into which the spokes are inserted looks kind of like a hob, and is called the hub (etymology).

Lately there has been a move to refer to certain websites as data hubs.

But what does this mean?

In transport terminology, the cities of Chicago or Paris are known as the hub train stations of the networks because all the lines run out from them to all over their respective countries, like the spokes of a wheel.

Back in the virtual world, the Open Knowledge Foundation has decided to rebrand their CKAN system as the Data Hub, describing it as a “community-run catalogue of useful sets of data on the Internet”. It also contains 3290 datasets that you can “browse, learn about and download”.

At the same time, and apparently entirely independently, Microsoft have a prototype service called Data Hub. It’s billed as an “online service for data discovery, distribution and curation”. I can’t tell how many datasets it has, as nothing is visible unless you ask them if you can register.

What about the other word? What does data mean?

I do have a working definition of “data” — it is a representation which allows for aggregations. That is, a set of information on which you can perform the SQL GROUP BY operation.

There are other crucial SQL operations, such as the WHERE and ORDER BY clauses, but these don’t creatively transform things in the way that the GROUP BY operation does.

If we go back to the UN peacekeepers database (which I talked at length about here), each record in the swdata table is:

 (country text, mission text, people integer, month text)

We can find the number of people from each country sent to all the missions in the month of June 2010 using the following query:

SELECT swdata.country,
       sum(people) as tpeople
FROM swdata
WHERE month='2010-06'
GROUP BY country
ORDER BY tpeople desc

The ORDER BY clause gives the highest numbers first:

country tpeople
Pakistan 10692
Bangladesh 10641
India 8920
Nigeria 5732
Egypt 5461
Nepal 5148
Ghana 3748
Rwanda 3654
Jordan 3599
Uruguay 2566

Of course it’s no surprise that Pakistan is on this table above Belgium, it’s got 17 times the population. What we need to see here is the per capita table.

This obviously requires the table of populations per country. Unfortunately there wasn’t one on Scraperwiki yet, so I had to create one from the Wikipedia article.

Now I would have loved to have derived it from the editable source of the wikipedia article, as I described elsewhere, but is impossible to do because it is insanely programatic:

|2||align=left|{{IND}}||1,210,193,422||March 1, 2011
||{{#expr: 1210193000/{{worldpop}}*100 round 2}}%
|-
|3||align=left|{{USA}}||{{data United States of America|poptoday 1}}
|-
...
|16||align=left|{{EGY}}||{{formatnum: {{#expr: (79602+4.738*{{Age in days|2011|1|1}}) round 0}}000}}
||{{CURRENTMONTHNAME}} {{CURRENTDAY}}, {{CURRENTYEAR}}
||{{#expr: (79602000+4738*{{Age in days|2011|1|1}})/{{worldpop}}*100 round 2}}%
|-

As you can see, some rows contain the numbers properly, some have the numbers transcluded from a different place, and some are expressed in terms of a mathematical formula.

The Wikipedia template programming language is so dauntingly sophisticated, except for not being able to produce the ranking numbers automatically, over 200 of which would have needed to be retyped following the creation of South Sudan. Oh, and the table has to be updated into over 30 different languages.

Country identifications are tricky because they change (like Sudan) and do have a lot of different spellings.

(One thing I have never understood is why countries have different names in different languages. I mean, something like the train system around San Francisco is called BART, short for Bay Area Rapid Transit, and everyone calls it that, because that is the name it has given itself. Even the spanish call it Bart. So why do they insist on calling the US, the United States, Estados Unidos? What does that shorten to? The EU?)

My population scraper, which should really have been derived in some way from the UNstats demographics data. To make it match up with the country spellings in the peacekeeper data I had to add some extra alternative spellings:

countryalts = { u"Cxf4te d'Ivoire":["Cote d Ivoire"],
    "United States":["United States of America"],
    "Democratic Republic of the Congo":["DR Congo"],
    "South Korea":["Republic of Korea"] }

This allows us to perform a join between the two data sets together to create the per-capita peace keepers table per country:

attach: country_populations

SELECT swdata.country,
    round(sum(people)*1000000/poptable.Population,2)
      as permillionpop,
    sum(people) as peacekeepers,
    poptable.Population
FROM swdata
LEFT JOIN poptable on swdata.country=poptable.Country
WHERE month='2010-06'
GROUP BY swdata.country
ORDER BY permillionpop desc

…which gives a much more representative country ranking that puts Uruguay, Jordan and Rwanda in the top three places.

It would have been even better to use a more fine grained demographics database so as to select only for the population of a country between the ages of 20 and 40, say.

Now, suppose I had a third data set, which was the financial contribution by each country to the UN peacekeeping efforts, and I was able to join that into the query in some meaningful way that tested the relationship. For example, do countries that contribute fewer peacekeepers contribute more money to make up for deficit, or are some nations just naturally more generous than others — in proportion of their GDP? (This looks like a fourth data set to me.)

This kind of operation only works if all the datasets have been imported into the same database or complex.

What is this starting to feel like we are doing?

Does it feel like we are fitting wooden spokes into some kind of a central object that joins them into a rigid entity which acts as a single object pivoting on an axis?

Maybe we should call it a data windmill.

One Response to “The Data Hob”

  1. michalskop March 9, 2012 at 11:41 pm #

    A bit off-topic:
    I’ve never understood why the English call Praha by the name Prague(-:
    New York is written Nueva York in Spanish and even Nova Iorque in Portuguese.
    The reason for changing names is very probably pronunciation and writting, the English do not have sound for beggining of Austria in German, do not have many letters from European alphabets, but still want for some reason to refer to such cities or countries.
    And Estados Unidos are shortened as EE.UU.,because it is a plural, unlike Union Europea -> U.E.
    The European Union itself is another interesting example: it has over 20 original names.

We're hiring!