OKFN – ScraperWiki (https://blog.scraperwiki.com) – Extract tables from PDFs and scrape the web

Open your data with ScraperWiki (Thu, 11 Jul 2013)

Open data activists, start your engines. Following on from last week’s announcement about publishing open data from ScraperWiki, we’re now excited to unveil the first iteration of the “Open your data” tool, for publishing ScraperWiki datasets to any open data catalogue powered by the OKFN’s CKAN technology.


Try it out on your own datasets. You’ll find it under “More tools…” in the ScraperWiki toolbar:


And remember, if you’re running a serious open data project and you hit any of the limits on our free plan, just let us know, and we’ll upgrade you to a data scientist account, for free.

If you would like to contribute to the underlying code that drives this tool, you can find its repository on GitHub at http://bit.ly/1898NTI.
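Under the hood, publishing to a CKAN catalogue comes down to a single call to CKAN’s action API (package_create). Here is a minimal sketch in Python; the catalogue URL, API key and dataset fields below are placeholders for illustration, not what our tool actually sends:

```python
import json
import urllib.request

# hypothetical values -- substitute your own catalogue, key and dataset
CKAN_URL = "https://demo.ckan.org/api/3/action/package_create"
API_KEY = "your-api-key"

payload = {
    "name": "my-scraperwiki-dataset",   # must be unique within the catalogue
    "title": "My ScraperWiki dataset",
    "notes": "Scraped data published from ScraperWiki.",
}

request = urllib.request.Request(
    CKAN_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": API_KEY, "Content-Type": "application/json"},
)

# urllib.request.urlopen(request) would perform the actual upload
```

CKAN authenticates write operations with the Authorization header, which is why our tool asks for your API key.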

Three hundred thousand tonnes of gold (Wed, 4 Jul 2012)

On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number exceeds the precision preserved by the database’s display of double precision floating point values, and this same number gets expressed as 1.58887418588e+13.
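You can reproduce the effect in a couple of lines of Python (a sketch of the symptom, not the database’s actual code). The double itself still holds the pennies; it’s the 12-significant-digit display that drops them:

```python
debt = float("15888741858820.66")

# a 64-bit double carries ~15-16 significant digits, so nothing has
# actually overflowed -- repr() round-trips the full value:
print(repr(debt))        # 15888741858820.66

# but format the same number to 12 significant digits, as the old
# str()-style display did, and the pennies vanish:
print("%.12g" % debt)    # 1.58887418588e+13
```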

Doesn’t matter for now. Let’s look at the graph over time:

It’s not very exciting, unless you happen to be interested in phrases such as “debasing our fiat currency” and “return to the gold standard”. In truth, one should really divide the values by the GDP, or the national population, or the cumulative inflation over the time period to scale it properly.

Nevertheless, I decided also to look at the gold price, which can be seen as a graph (click the [Graph] button, then [Group Column (x-axis)]: “date” and [Group Column (y-axis)]: “price”) on the Data Hub. They give this dataset the title: Gold Prices in London 1950-2008 (Monthly).

Why does the data stop in 2008 just when things start to get interesting?

I discovered a download url in the metadata for this dataset:


which is somewhere within GitHub™ as part of the repository https://github.com/datasets/gold-prices, in which there resides a 60-line Python scraper known as process.py.

Aha, something I can work with! I cut-and-pasted the code into ScraperWiki as scrapers/gold_prices and tried to run it. Obviously it didn’t work as-is — code always requires some fiddling about when it is transplanted into an alien environment. The module contained three functions: download(), extract() and upload().

The download() function didn’t work because it tries to pull from the broken link:


This is one of the unavoidable failures that can befall a webscraper, and it was one of the motivations for hosting code in a wiki, so that such problems can be trivially corrected without an hour of labour: checking out the code from someone else’s favourite version control system, setting up the environment, trying to install all the dependent modules, and usually failing to get it to work if, like me, you happen to use Windows.

After some looking around on the Bundesbank website, I found the Time_series_databases (click on [Open all] and search for “gold”). There are Yearly average, Monthly average and Daily rates; clearly the daily rates are the ones to go for, as the other two are averages and likely to be derived from the primary daily value.

I wonder what a “Data basket” is.

Anyways, moving on. Taking the first CSV link and inserting it into that process.py code hits a snag in the extract() function:

import csv
import os

downloaded = 'cache/bbk_WU5500.csv'
outpath = 'data/data.csv'

def extract():
    reader = csv.reader(open(downloaded))
    # trim junk from the file: five header lines and one trailer line
    newrows = [ [row[0], row[1]] for row in list(reader)[5:-1] ]

    existing = []
    if os.path.exists(outpath):
        existing = [ row for row in csv.reader(open(outpath)) ]

    # discard previously saved rows from the first new date onwards,
    # so re-downloads overwrite rather than duplicate
    starter = newrows[0]
    for idx, row in enumerate(existing):
        if row[0] == starter[0]:
            del existing[idx:]
            break

    # and now add in new data
    outrows = existing + newrows
    csv.writer(open(outpath, 'w')).writerows(outrows)

ScraperWiki doesn’t have persistent files, and in this case they aren’t helpful anyway, because all these lines of code are basically replicating the scraperwiki.sqlite.save() feature, and can be replaced by the following two lines:

    ldata = [ { "date":row[0], "value":float(row[1]) }  for row in newrows  if row[1] != '.' ]
    scraperwiki.sqlite.save(["date"], ldata)
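If you don’t have the scraperwiki library to hand, the same keep-the-latest-row-per-date behaviour is easy to mimic in plain sqlite3. This is a sketch of roughly what save(["date"], ...) achieves, not its actual implementation, and the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS swdata (date TEXT PRIMARY KEY, value REAL)")

def save(rows):
    # INSERT OR REPLACE keyed on the primary key does the same job as the
    # delete-the-overlap-then-append dance in extract() above
    con.executemany(
        "INSERT OR REPLACE INTO swdata (date, value) VALUES (:date, :value)", rows
    )
    con.commit()

save([{"date": "2012-07-01", "value": 1597.0}])
save([{"date": "2012-07-01", "value": 1598.5},   # a re-scrape overwrites...
      {"date": "2012-07-02", "value": 1604.0}])  # ...while new dates append

print(con.execute("SELECT date, value FROM swdata ORDER BY date").fetchall())
# [('2012-07-01', 1598.5), ('2012-07-02', 1604.0)]
```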

And now an always-up-to-date gold price graph is yours, at the cost of a select date, value from swdata order by date query fed into a Google annotated timeline.

But back to the raw code disclosed on GitHub. Without ScraperWiki’s convenient database save feature, this script has to do the work in its own upload() function.

def upload():
    import datastore.client as c
    dsurl = 'http://datahub.io/dataset/gold-prices/resource/b9aae52b-b082-4159-b46f-7bb9c158d013'
    client = c.DataStoreClient(dsurl)

Ah, we have another problem: a dependency on the undeclared datastore.client library, which was presumably so seamlessly available on the author’s own computer that he never noticed it was a requirement when he committed the code to GitHub, where it cannot be reused without it. The datastore.client library is not available in the github/datasets account; you can, however, find it in the completely different github/okfn account.

I tried calling this client.py code by cut-and-pasting it into the ScraperWiki scraper, and it did something strange that looked like it was uploading the data to somewhere, but I can’t work out what’s happened. Not to worry. I’m sure someone will let me know what happened when they find a dataset somewhere that is inexplicably more up to date than it used to be.

But back to the point. Using the awesome power of our genuine data-hub system, we can take the us_debt_to_the_penny data, attach the gold_prices database, and perform a combined query that scales ounces of gold into tonnes:

SELECT debt.date,
       debt.value / gold.price / 32150.7
    AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata as gold
  ON gold.date = debt.date
WHERE gold.date is not null
ORDER BY debt.date

and get the graph of US government debt expressed in terms of tonnes of gold.
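The same cross-database trick works in stock sqlite3, where attaching is a single ATTACH DATABASE statement. A self-contained sketch with made-up prices (32,150.7 is the number of troy ounces in a metric tonne; the table names mirror our scrapers but the data here is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")            # stands in for us_debt_to_the_penny
con.execute("ATTACH DATABASE ':memory:' AS gold_prices")

con.execute("CREATE TABLE swdata (date TEXT, value REAL)")               # debt in $
con.execute("CREATE TABLE gold_prices.swdata (date TEXT, price REAL)")   # $ per troy oz
con.execute("INSERT INTO swdata VALUES ('2012-07-02', 15888741858820.66)")
con.execute("INSERT INTO gold_prices.swdata VALUES ('2012-07-02', 1597.0)")

rows = con.execute("""
    SELECT debt.date,
           debt.value / gold.price / 32150.7 AS debt_gold_tonnes
    FROM swdata AS debt
    LEFT JOIN gold_prices.swdata AS gold
      ON gold.date = debt.date
    WHERE gold.date IS NOT NULL
    ORDER BY debt.date
""").fetchall()

print(rows)   # one row: the debt comes out at roughly 309,000 tonnes of gold
```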

So that looks like good news for all the gold-bugs: the US government debt, in the hard currency of gold, has gone steadily down by a factor of two since 2001, to around 280 thousand tonnes. The only problem with that is that there are only 164 thousand tonnes of gold in the world, according to the latest estimates.

Other fun charts that people find interesting, such as the gold-to-oil ratio, can be produced once the relevant data series are loaded and made available for joining.

Growing back to the Future: Allotments in the UK, open data stories and interventions (Mon, 31 Oct 2011)

This is a guest blog post from Farida Vis. She attended EuroHack at the Open Government Data Camp 2011, which consisted of a series of short talks combined with plenty of opportunities for hacking in groups in the second part of the workshop.

On the day, we were given an introduction to data driven journalism by data journalist Nicolas Kayser-Brill, who has recently launched J++, a new media company that builds data journalism applications. Friedrich Lindenberg (OKF) and Aidan McGuire (ScraperWiki) gave a thorough overview of scraping, mainly focusing on the very popular ScraperWiki with Friedrich highlighting its application to EU spending data. Finally Chris Taggart (Open Corporates) talked about EU spending data as well as Open Corporates, and gave a hands-on workshop on Google Refine.

My personal interest lies in rather everyday data, related to ‘mundane issues’ that people relate to easily, principally because they feature in their everyday lives. This allows for a rethinking of political participation and civic engagement beyond the rather stale ways in which these are traditionally measured. I’m interested in what Liz Azyan has started calling ‘really useful’ data, which has the ordinary end user firmly in mind. Personally I find huge spending data difficult to get my head round (but I guess I’m not alone in that), so I’m interested in exploring a more manageable example and seeing how far I can take it. For some time now, I have been looking at the issue of allotments in the UK. At EuroHack I had not really intended to pitch my project, but having briefly talked about what I was doing to Aidan McGuire before the start of the workshop, he highlighted it on my behalf and then there was luckily no turning back. I was delighted with people’s interest in the project. Below, I describe what we looked at on the day and what happened next.


What issue did we look at?

An allotment is a small plot of publicly owned land you rent from the council for a small annual fee, giving people the chance to grow their own fruit and vegetables. I have an allotment myself (here’s a picture) and was lucky that when I decided to get one eleven years ago the waiting list was only two months, so my partner and I got one almost immediately. Since then those numbers have shot up, to the extent that on our site in South Manchester the waiting list is now fifteen years, highlighting a nationwide problem. The last few years have seen a staggering increase in demand, no doubt fuelled by growing environmental concern and awareness, yet no significant number of extra allotments has been created to meet this demand. The New Local Government Network reports that during the 1940s there were around 1.4 million allotments in the UK, with only 200,000 today, which partly reflects that ‘growing your own’ goes through cycles of popularity. During a period of complete lack of interest, it is difficult for councils to hold on to this land as allotments that nobody wants. But what do you do when it seems everybody wants one again?

Earlier this year, the Department for Communities and Local Government issued a public consultation on 1294 Statutory Duties pertaining to local authorities, with a view to possibly reducing their number. These duties included Section 23 of the Allotments Act 1908, which ensures local authorities provide allotments (and take seriously such a request made by at least six tax-paying citizens in a council), causing some newspapers to suggest that ‘The Good Life’ was now under threat. The Act remained unchanged, however, and this summer the government announced that of the 6,103 responses received, nearly half contained a comment on the Allotments Act, suggesting on a ‘straw poll’ level at least that this is an issue people care about.


What were we interested in?

Although it is tempting to simply highlight this problem in a different way, with additional data and accompanying visualisations, I was keen to stress that whilst I do think there is an issue with councils not providing more sites, it is also clear to me that they are not exactly in a position to do so given the current economic climate. So whatever we did, it was important to me that we used part of the day to start thinking about alternative solutions to the waiting list crisis: for example, by identifying underused plots of land (brownfield sites and others), which could serve as temporary growing spaces (pop-up allotments, anyone?). In my attempt to ‘do something about this’ I was joined by Daniela Silva and Pedro Markun from the Sao Paulo-based think-and-do tank Esfera; data journalist Nicolas Kayser-Brill; python/js developer and self-described open data fan Anna Powell-Smith; and finally Andrew Mackenzie, who was at the OGD camp to film, part of an ongoing project that records the open data movement.

What data did we have?

Although there is very little allotment data available, as councils rarely publish it, Transition Town West Kirby (TTWK), led by Margaret and Ian Campbell, has for the last three years used the Freedom of Information Act to obtain allotment waiting list data through WhatDoTheyKnow. They publish this data, along with a report, each year, and these figures are now widely used in the mainstream media. The reports, however, focus on national averages and do not highlight specific differences between councils or identify councils where problems are particularly severe. My co-researcher at Leicester, Yana Manyukhina, and I had recently put in our own FOI request to build on the TTWK data. Our request focused on rental cost, water charges, and whether discounts were available to plot holders. Aside from this we also requested the tenancy agreements councils use to manage their allotment sites. An analysis of these agreements may reveal further differences between councils, which could prove significant to citizens living in these locations. Because I am Manchester-based, we also had a look at allotment location data manually collected by Feeding Manchester, a group interested in sustainable food for Greater Manchester.


What did we do?

After my introduction, Anna decided to work on the FOI data, using Google Fusion Tables. In a UK context, The Guardian Data Store frequently uses these to highlight differences between councils on a specific topic. I had previously standardised the TTWK data so that each council included a figure for how many people were waiting per 100 allotments (the data set also includes further details about the number of sites and allotments per council). Anna and I decided that we would add data from the FOI request Yana and I had made to the TTWK data, namely the rental cost, water charges, and discounts given. I still need to do further work on standardising the rent charge per council, which is currently expressed in a range of old-fashioned measurements. Allotment sizes were traditionally measured in ‘poles’ and ‘rods’ (from 1908 onwards a standard plot was 10 rods), though many councils now use square yards and metres.
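That standardisation step is mostly a pair of unit conversions. A hedged sketch: the conversion factors are the standard imperial ones, but the function and unit labels are made up for illustration, not part of our actual data set:

```python
# standard imperial conversion factors
YARD = 0.9144                 # metres per yard
ROD = 5.5 * YARD              # a rod (also called a pole or perch) is 5.5 yards

SQ_YARD = YARD ** 2           # square metres per square yard
SQ_ROD = ROD ** 2             # ~25.29 square metres per square rod

def to_square_metres(size, unit):
    """Normalise a plot size quoted in traditional units."""
    factors = {"rods": SQ_ROD, "poles": SQ_ROD, "sq_yards": SQ_YARD, "sq_metres": 1.0}
    return size * factors[unit]

# the standard 10-rod plot of the 1908 Act comes out at roughly 253 square metres
print(round(to_square_metres(10, "rods"), 1))   # 252.9
```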

Pedro and Nicolas both worked on building a series of scrapers, using ScraperWiki, to scrape the Feeding Manchester data, Landshare data (Landshare is an initiative that is already offering alternatives, matching up individuals who have land with those who wish to cultivate it) as well as a number of council sites. Aside from this, Pedro and I also worked on an idea that ScraperWiki’s Julian Todd had given me at an earlier meeting (at OKCON in Berlin): to use OpenStreetMap to get people to mark up allotments. In our extended idea (usefully articulated by Andrew Mackenzie on the day), other possible growing spaces, possibly with a newly agreed land use tag, could also be mapped. In the end Pedro built a site that pulled in all the OSM data to show allotment sites in the UK and would update daily as new allotments were marked up on OpenStreetMap.
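OpenStreetMap tags allotments with landuse=allotments, so pulling the mapped sites is one Overpass API query away. A sketch that only builds the query string; the area selector ("GB") and output options are assumptions you may need to adjust, and the actual HTTP request is left commented out:

```python
# Overpass QL query for everything tagged landuse=allotments within Great Britain
query = """
[out:json][timeout:60];
area["ISO3166-1"="GB"][admin_level=2]->.uk;
(
  way["landuse"="allotments"](area.uk);
  relation["landuse"="allotments"](area.uk);
);
out center;
"""

# posting the query to a public endpoint would look something like:
# import urllib.parse, urllib.request
# data = urllib.parse.urlencode({"data": query}).encode()
# result = urllib.request.urlopen("https://overpass-api.de/api/interpreter", data)
```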

What happened next?

The enthusiasm and the great work we did during the hackday meant that I wanted to reflect it in my presentation at the camp the next day. I wanted both to highlight the problems with current allotment data collection (the lack of ontologies) and with access to, or knowledge of, this data, and to set these against the huge surge in demand from ordinary people wanting to grow their own produce. Going beyond simply a better visualisation of council data obtained via FOIs, I strongly emphasised the possibility of a technological intervention into this growing (pardon the pun) issue, by building stronger ontologies for allotment data (Pedro and I talked about this a lot afterwards), but also by thinking beyond the unproductive ‘councils just need to provide more allotments’ deadlock. Following my presentation I had various offers from people keen to help out with the mapping, but one person on Twitter confirmed my feeling that, in order to get a lot of people to map, doing this directly in OpenStreetMap was still quite a daunting prospect for the ordinary end user. I toyed with the idea of filming a simple step-by-step tutorial, but in the end Pedro suggested using a new, more user-friendly interface, one he is currently developing for the Sao Paulo Council in Brazil. This is still under development, but we will hopefully have an update soon.

Anna and I made excellent progress and had a great chat with Lisa Evans from the Guardian Data Store, at the camp to present, who expressed an interest in putting the allotment data on the data store. I will work with Anna over the next few days to complete the data set and do a short write-up. Hopefully releasing this data through such a well known and respected site might generate some further interest. Daniela also interviewed me for the Esfera blog and she has written up our EuroHack day in Portuguese here.

All this flurry of activity did not go unnoticed, and the project has now received official support from the OKF, with Community Coordinator Kat Braybrooke as the key liaison. Although Kat and I had talked about this project for months already, it seemed that it needed the critical mass, collective brainstorming and hacking at EuroHack and afterwards to push this open data part of the project to the next level. Kat and I will soon be meeting with a range of NGOs and interested parties who have expressed an interest in pooling resources and making a joint intervention into this problem. It is hard to express how exciting it was to connect with such amazing people at EuroHack, who all did such a tremendous amount of work on this project, and especially to end up with such a great result. An OKF site highlighting the mapping project will launch shortly and we hope to give you further updates in the not-too-distant future. Watch this (growing) space!

If you would like to get involved or receive further info on the project, feel free to get in touch via email or twitter.

Farida Vis from the University of Leicester in the UK (where she teaches Media and Communication) recently took part in EuroHack, a pre-conference workshop in Warsaw, Poland, on 19 October, at the Open Government Data Camp 2011, organised by the European Journalism Centre and the Open Knowledge Foundation. Farida is very grateful to the EU Commission for supporting her attendance at EuroHack and the OGD Camp with a travel bursary. 

Diggers and Dinosaurs – Scraping at the Mozilla Festival (Mon, 17 Oct 2011)

In a complete paradigm shift of the epic battle between Godzilla and Mothra, we are turning our backs on the old claymation medium and embracing the digital age, where dinosaurs and diggers (yes, I am aware we are a machine and not a moth) can roam free across the lawless plains of web 2.0.

Both can be found at the Mozilla Festival park in London on 4-6 November. If you’re lucky you might even spot a wily firefox. There will be an enclosure on the Friday from 18:00, where our tamed digger driver, Francis Irving, can give you some driving lessons.

As part of the Data Journalism Workshop on the Saturday, 10:00-17:00, we’ll be hosting a ‘Scraping 101’ session. There will be a host of data trackers to guide you through the web wilderness including Open Knowledge Foundation‘s Jonathan Gray and the European Journalism Centre‘s Liliana Bounegru. There will be herds of other data/web beasts roaming the plains so we suggest you stay inside or close to your digger.

If you’re interested in a close encounter of the data kind, sign up for the event here.

So watch out Mozilla Festival – you’re being ScraperWikied!

Announcing The Big Clean, Spring 2011 (Wed, 10 Nov 2010)

We’re very excited to announce that we’re helping to organise an international series of events to convert not-very-useful, unstructured, non-machine-readable sources of public information into nice clean structured data.

This will make it much easier for people to reuse the data, whether this is mixing it with other data sources (e.g. different sources of information about the area you live in) or creating new useful services based on the data (like TheyWorkForYou or Where Does My Money Go?). The series of events will be called The Big Clean, and will take place next spring, probably in March.

The idea was originally floated by Antti Poikola on the OKF’s international open-government list back in September, and since then we’ve been working closely with Antti and Jonathan Gray at OKFN to start planning the events.

Antti and Francis Irving (mySociety) will be running a session on this at the Open Government Data Camp on the 18-19th November in London. If you’d like to attend this session, please add your name to the following list:

If you can’t attend but you’re interested in helping to organise an event near you, please add your name/location to the following wiki page:

All planning discussions will take place on the open-government list!