Ian Hopkinson – ScraperWiki (https://blog.scraperwiki.com) – Extract tables from PDFs and scrape the web

GOV.UK – Contracts Finder… £1 billion worth of carrots!
https://blog.scraperwiki.com/2015/10/gov-uk-contracts-finder-1billion-worth-of-carrots/
Wed, 07 Oct 2015

Photo by Fovea Centralis / CC BY-ND-2.0


This post is about the government’s Contracts Finder website, a site created to help SMEs win government business by providing a “one-stop-shop” for public sector contracts.

The government has been doing some great work transitioning its departments to GOV.UK and giving a range of online services a makeover. We’ve been involved in this work: in the first instance scraping the departmental content for GOV.UK, then making performance dashboards for content managers on the Performance Platform.

More recently we’ve scraped the content for databases such as the Air Accident Investigation Board, and made the new Civil Service People Survey website.

As well as this we have an interest in other re-worked government services such as the Charity Commission website, data.gov.uk and the new Companies House website.

Getting back to Contracts Finder: there is an archive site, which lists opportunities posted before 26th February 2015, and a live site – the new Contracts Finder – which lists opportunities from that date onwards. Central government departments and their agencies were required to advertise contracts over £10k on the old Contracts Finder. The wider public sector could advertise contracts there too but was not required to (although on the new Contracts Finder it is required to for contracts over £25k).

The confusingly named Official Journal of the European Union (OJEU) also publishes calls to tender. These are required by EU law above a threshold value which depends on the area of business in which they are placed. Details of these thresholds can be found here. Contracts Finder also lists opportunities over these thresholds, but it is not clear that this must be the case.

The interface of the new Contracts Finder website is OK, but there is far more flexibility to probe the data if you scrape it from the website. For the archive data this is more a case of downloading the CSV files provided although it is worth scraping the detail pages indicated from the downloads in order to get additional information such as the supplier to which work was awarded.
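For the detail pages, the scrape amounts to fetching each page linked from the CSV rows and pulling out the extra fields. A rough sketch of the parsing step using only the Python standard library – the element class here is hypothetical, and the real Contracts Finder markup differs:

```python
from html.parser import HTMLParser

class SupplierParser(HTMLParser):
    """Pull the text of a hypothetical <span class="supplier"> element
    out of a detail page. The real Contracts Finder markup differs -
    this only illustrates the shape of the scrape."""
    def __init__(self):
        super().__init__()
        self.in_supplier = False
        self.supplier = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "span" and ("class", "supplier") in attrs:
            self.in_supplier = True

    def handle_data(self, data):
        if self.in_supplier and self.supplier is None:
            self.supplier = data.strip()
            self.in_supplier = False

# In the real scraper each detail-page URL from the CSV would be fetched
# with urllib; here we parse a canned snippet.
detail_html = '<div><span class="supplier">Fujitsu Services Ltd</span></div>'
parser = SupplierParser()
parser.feed(detail_html)
print(parser.supplier)
```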

The headline data published in an opportunity is the title and description, the name of the customer with contact details, the industry – a categorisation of the requirements, a contract value, and a closing date for applications.

We run the scrapers on our Platform which makes it easy to download the data as an Excel spreadsheet or CSV, which we can then load into Tableau for analysis. Tableau allows us to make nice visualisations of the data, and to carry out our own ad hoc queries of the data free from the constraints of the source website. There are about 15,000 entries on the new site, and about 40,000 in the archive.

The initial interest for us was just getting an overview of the data: how many contracts were available in what price range? As an example we looked at proposals in the range £10k-£250k in the Computer and Related Services sector. The chart below shows the number of opportunities in this range grouped by customer.


These opportunities are actually all closed. How long were opportunities open for? We can see in the histogram below. Most adverts are open for 2-4 weeks; however, a significant number have closing dates before their publication dates – it’s not clear why.
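Deriving those durations from the scraped data is a one-liner per row: subtract the published date from the closing date. A minimal sketch with canned rows – the column names are hypothetical, the real CSV differs:

```python
import csv
import io
from datetime import date

# A couple of canned rows standing in for the scraped CSV.
rows = io.StringIO(
    "title,published,closing\n"
    "Carrots,2015-03-01,2015-03-22\n"
    "Fax service,2015-04-10,2015-04-03\n"
)

durations = []
for row in csv.DictReader(rows):
    published = date.fromisoformat(row["published"])
    closing = date.fromisoformat(row["closing"])
    durations.append((closing - published).days)

# Negative values are the oddity in the data: closing before publication.
print(durations)  # [21, -7]
```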


There is always fun to be found in a dataset of this size. For example, we learn that Shrewsbury Council would appear to have tendered for up to £1bn worth of fruit and vegetables (see here). With a kilogram of carrots costing less than £1 this is a lot of veg – or maybe a mis-entry in the data!

Closer to home we discover that Liverpool Council spent £12,000 on a fax service for two years! There is also a collection of huge contracts for the MOD, which appears to do its contracting from Bristol.

Getting down to more practical business, we can use the data to see what opportunities we might be able to apply for. We found the best way to address this was to build a search tool in Tableau which allowed us to search and filter on multiple criteria (words in title, description, customer name, contract size) and view the results grouped together. So it is easy, for example, to see that Leeds City Council has tendered for £13 million in Computer and Related Services, the majority of which went on a framework contract with Fujitsu Services Ltd, or that Oracle won a contract for £6.5 million from the MOD for their services. You can see the austere interface we have made to this data below.


Do you have some data you would like explored? Why not get in touch with us!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
The Royal Statistical Society Conference – Exeter 2015
https://blog.scraperwiki.com/2015/09/the-royal-statistical-society-conferenceexeter-2015/
Fri, 11 Sep 2015

ScraperWiki have been off to the Royal Statistical Society Conference in Exeter to discuss our wares with the delegates. The conference was very friendly, with senior RSS staff coming to see how we were doing through the week.

We shared the exhibitor space in the fine atrium of The Forum at Exeter University with Wiley, the Oxford University Press, the Cambridge University Press, the ONS, ATASS sports, Phastar, Taylor and Francis and DataKind, alongside the Royal Statistical Society’s own stand.

We talked to a wide range of people, some with whom we have done business already, such as the Q-Step programme and the people from the Lancaster University Data Science MSc – we had interns from these two programmes over the summer. We’ve also done business with the ONS, who were there both as delegates and to try out the new ONS website on an expert audience. Other people we had met before on Twitter, such as Scott Keir, the Head of Education and Statistical Literacy at the Royal Statistical Society, and Kathryn Torney, who won a data journalism award for her work on suicide in Northern Ireland.

Other people just dropped by for a chat, our ScraperWiki stickers are very popular with the parents of young children!

Our story is that we productionise the analysis and presentation of data. PDFTables.com is where the story starts, with a tool which accurately extracts the tabular content of PDFs to Excel spreadsheets as an online service (and an API). DataBaker can then be used to convert an Excel spreadsheet in human-readable form – with boilerplate, pretty headings and so forth – into a format more amenable to onward machine processing. DataBaker is a tool we developed for the ONS to help it transform the output of some of its dataflows where re-engineering the original software required more money or will than was available. The process is driven by short recipes which we trained the staff at the ONS to write; we can, of course, write them for clients if preferred. The final stage is data presentation, and here we use an example from our contract software development: the Civil Service People Survey website. ORC International are contracted to run the actual survey, asking the questions and collating the results. For the 2014 survey the Civil Service took us on to provide the data exploration tool to be used by survey managers and, ultimately, all Civil Servants. The website uses the new GOV.UK styling and, through in-memory processing, is able to provide very flexible querying, very fast.
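To give a flavour of the kind of reshaping DataBaker does, here is a plain-Python sketch with made-up data – this is not DataBaker’s actual recipe syntax, just the transformation it performs:

```python
# A toy "human-readable" sheet: a title row, a blank row, pretty headers,
# then data - the sort of layout DataBaker recipes untangle.
sheet = [
    ["Civil Service People Survey 2014", "", ""],
    ["", "", ""],
    ["Department", "2013 score", "2014 score"],
    ["Cabinet Office", "58", "61"],
    ["HM Treasury", "62", "60"],
]

header = sheet[2]
records = []
for row in sheet[3:]:
    department = row[0]
    # Unpivot the year columns into one tidy record per (department, year)
    for year_label, value in zip(header[1:], row[1:]):
        records.append({
            "department": department,
            "year": int(year_label.split()[0]),
            "score": int(value),
        })

print(records[0])  # {'department': 'Cabinet Office', 'year': 2013, 'score': 58}
```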

I’ve frequently been an academic delegate to conferences, but this was my first time as an exhibitor. I have to say I commend the experience to my former academic colleagues! As an exhibitor I had a chair, a table for my lunch, and people came and talked to us about what we did with little prompting. Furthermore, I did not experience the dread pressure of trying to work out which of multiple parallel sessions I should attend!

As it was, Aine and I went to a number of presentations, including Dame Julia Slingo’s on uncertainty in climate and weather prediction, Andrew Hudson-Smith’s talk on urban informatics and Scott Zeger’s talk on Statistical Problems in Individualized Health.

We liked Exeter, and the conference venue at the University. It was a short walk, up an admittedly steep hill, from the railway station and another short walk into town. The opening reception was held in the Royal Albert Memorial Museum, which is a very fine venue.

I joined the Royal Statistical Society having decided that they were the closest thing data scientists had to a professional body in the UK, and in general they seemed like My Sort of People!

All in all a very interesting and worthwhile trip, we hope to continue and strengthen our relationship with the Royal Statistical Society and its members.

Horizon 2020 – Project TIMON
https://blog.scraperwiki.com/2015/09/horizon-2020project-timon/
Wed, 02 Sep 2015

ScraperWiki are members of a new EU Horizon 2020 project: TIMON – “Enhanced real time services for optimized multimodal mobility relying on cooperative networks and open data”. This is a 3.5 year project, which commenced in June 2015, whose objectives are:

  • to improve road safety;
  • to provide greater transport flexibility in terms of journey planning across multiple modes of transport;
  • to reduce emissions.

From a technology point of view these objectives are to be achieved by deploying sensing equipment to cars, and bringing the data from these sensors, roadside sensors and public data together. This data will then be used to develop a set of services to users, as listed below:

  • Driver assistance services to provide real-time alerting for hazards, higher than usual traffic density, pedestrians and vulnerable road users;
  • Services for vulnerable road users to provide real-time alerting and information to vulnerable road users, initially defined as users of two-wheeled vehicles (powered and unpowered);
  • Multimodal dynamic commuter service to provide adaptive routing which is sensitive to road, weather and public transport systems;
  • Enhanced real time traffic API to provide information to other suppliers to build their own services;
  • TIMON collaborative ecosystem will give users the ability to share transport information on social media to enhance the data provided by sensors.

In common with all projects of this type, the consortium is comprised of a wide variety of different organisations from across Europe, these include:

  • University of Deusto in Bilbao (Spain) who are coordinating the work. Their expertise is in artificial intelligence and communications technologies for transport applications;
  • Fraunhofer Institute for Embedded Systems and Communication Technologies ESK in Munich (Germany). Their expertise is in communications technologies;
  • Centre Tecnològic de Telecommunicacions de Catalunya (CTTC) in Castelldefels (Spain). Their expertise is in positioning using global navigation satellite systems (GNSS), of which GPS is an example;
  • INTECS headquartered in Rome (Italy). INTECS are a large company which specialise in software and hardware systems particularly in the communications area, covering defence, space as well as automotive systems;
  • ScraperWiki in Liverpool (United Kingdom). We specialise in Open Data, producing data-rich web applications and data ingestion and transformations;
  • GeoX KFT in Budapest (Hungary). Their expertise is in Geographical Information Systems (GIS) delivered over the web and to mobile devices;
  • XLAB in Ljubljana (Slovenia). Their expertise is in cloud computing, cloud security and integrated data systems;
  • ISKRA Sistemi in Ljubljana (Slovenia) are the technical coordinator of the project. ISKRA are developers and providers of process automation, communications and security systems for power distribution, telecommunications, and railway and road traffic;
  • JP LPT in Ljubljana (Slovenia). JP LPT are fully owned by the municipality of Ljubljana, and are responsible for controlling and managing urban mobility in that city;
  • Confederation of Organisations in Road Transport Enforcement (CORTE) in Brussels (Belgium). They are an international non-profit association in the transport area, specialising in technology dissemination and user needs research;
  • TASS International Mobility Center in Helmond (Netherlands). For this project TASS are providing access to their traffic test-bed in Helmond for the technologies developed.

Our role in the work is to bring data from a variety of sources into the project and make it available, via an API, to other project partners to provide the services. We anticipate bringing in data such as public transport schedules, live running information, bicycle scheme points and occupancy and so forth. We will also be doing work, alongside all the other partners, on user needs and requirements in the early part of the project.

ScraperWiki has been a partner in the very successful NewsReader project of FP7. We enjoy the stimulating environment of working with capable partners across Europe – if you are building a consortium to make a proposal then please get in touch!

You can find out more about the project on the TIMON website, and follow our accounts on Twitter, Facebook and LinkedIn.

Or if you’d like to get in touch with ScraperWiki directly to talk about the project, contact us at hello@scraperwiki.com.

Book review: Docker Up & Running by Karl Matthias and Sean P. Kane
https://blog.scraperwiki.com/2015/07/book-review-docker-up-running-by-karl-matthias-and-sean-p-kane/
Fri, 17 Jul 2015

This last week I have been reading Docker Up & Running by Karl Matthias and Sean P. Kane, a newly published book on Docker – a container technology designed to simplify the process of application testing and deployment.

Docker is a very new product, first announced in March 2013, although it is based on older technologies. It has seen rapid uptake by a number of major web-based companies who have open-sourced their tooling for using Docker. We have been using Docker at ScraperWiki for some time, and our most recent projects use it in production. It addresses a common problem for which we have tried a number of technologies in search of a solution.

For a long time I have thought of Docker as providing some sort of cut down virtual machine, from this book I realise this is the wrong mindset – it is better to think of it as a “process wrapper”. The “Advanced Topics” chapter of this book explains how this is achieved technically. This makes Docker a much lighter weight, faster proposition than a virtual machine.

Docker is delivered as a single binary containing both client and server components. The client gives you the power to build Docker images and query the server, which hosts the running Docker images. The client will run on Windows, Mac and Linux systems; the server will only run on Linux, due to the specific Linux features that Docker utilises in doing its stuff. Mac and Windows users can use boot2docker to run a Docker server: boot2docker uses a minimal Linux virtual machine to run the server, which removes some of the performance advantages of Docker but allows you to develop anywhere.

The problem Docker and containerisation are attempting to address is that of capturing the dependencies of an application and delivering them in a convenient package. It allows developers to produce an artefact, the Docker image, which can be handed over to an operations team for deployment without to-ing and fro-ing to get all the dependencies and system requirements fixed.

Docker can also address the problem of a development team onboarding a new member who needs to get the application up and running on their own system in order to develop it. Previously such problems were addressed with a flotilla of technologies with varying strengths and weaknesses, things like Chef, Puppet, Salt, Juju, virtual machines. Working at ScraperWiki I saw each of these technologies causing some sort of pain. Docker may or may not take all this pain away but it certainly looks promising.

The Docker image is compiled from instructions in a Dockerfile which has directives to pull down a base operating system image from a registry, add files, run commands and set configuration. The “image” language is probably where my false impression of Docker as virtualisation comes from. Once we have made the Docker image there are commands to deploy and run it on a server, inspect any logging and do debugging of a running container.
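A minimal Dockerfile sketch of those directives – the base image, file names and commands here are made up for illustration:

```dockerfile
# Pull down a base operating system image from a registry
FROM ubuntu:14.04

# Add files from the build context into the image
COPY requirements.txt /app/requirements.txt
COPY app.py /app/app.py

# Run commands while building the image
RUN apt-get update && apt-get install -y python-pip \
    && pip install -r /app/requirements.txt

# Set configuration: working directory, environment, start command
WORKDIR /app
ENV PORT=8000
CMD ["python", "app.py"]
```

Building with `docker build` turns this recipe into an image, which can then be run, inspected and shipped with the other Docker commands.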

Docker is not a “total” solution, it has nothing to say about triggering builds, or bringing up hardware or managing clusters of servers. At ScraperWiki we’ve been developing our own systems to do this which is clearly the approach that many others are taking.

Docker Up & Running is pretty good at laying out what it is you should do with Docker, rather than what you can do with Docker. For example, the book makes clear that Docker is best suited to hosting applications which have no state. You can copy files into a Docker container to store data, but then you’d need to work out how to preserve those files between instances. Docker containers are expected to be volatile – here today gone tomorrow, or even here now, gone in a minute. The expectation is that you should preserve state outside of a container using environment variables, Amazon’s S3 service or an externally hosted database etc. – depending on the size of the data. The material in the “Advanced Topics” chapter highlights the possible Docker runtime options (and then advises you not to use them unless you have very specific use cases). There are a couple of whole chapters on Docker in production systems.

If my intention was to use Docker “live and in anger” then I probably wouldn’t learn how to do so from this book, since the landscape is changing so fast. I might use it to identify what it is that I should do with Docker, rather than what I can do with Docker. For the application side of ScraperWiki’s business the use of Docker is obvious; for the data science side it is not so clear. For our data science work we make heavy use of Python’s virtualenv system, which captures most of our dependencies without being opinionated about data (state).

The book has information in it up until at least the beginning of 2015. It is well worth reading as an introduction and overview of Docker.

Dr Ian Hopkinson is Senior Data Scientist at ScraperWiki, where we often use Docker to help customers manage their data. You can read more about our professional services here.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
https://blog.scraperwiki.com/2015/07/book-review-learning-spark-by-holden-karau-andy-konwinski-patrick-wendell-and-matei-zaharia/
Mon, 06 Jul 2015

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce – a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are applied transformations and actions. Transformations produce another RDD from an input RDD; for example filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example count() returns the number of elements in an RDD.
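The key point is that transformations are lazy – nothing is computed until an action forces evaluation. A toy sketch of that model in plain Python (this mimics the transformation/action split only; it is not real Spark, which you would drive through pyspark’s SparkContext):

```python
class ToyRDD:
    """A toy stand-in for Spark's RDD: transformations are lazy,
    actions force evaluation. Not real Spark - just the model."""
    def __init__(self, data):
        self._compute = lambda: list(data)

    def filter(self, predicate):
        # Transformation: returns a new ToyRDD; nothing is computed yet.
        parent = self._compute
        rdd = ToyRDD([])
        rdd._compute = lambda: [x for x in parent() if predicate(x)]
        return rdd

    def map(self, fn):
        # Transformation: also lazy.
        parent = self._compute
        rdd = ToyRDD([])
        rdd._compute = lambda: [fn(x) for x in parent()]
        return rdd

    def count(self):
        # Action: triggers evaluation of the whole chain.
        return len(self._compute())

    def collect(self):
        # Action: returns the materialised data.
        return self._compute()

rdd = ToyRDD(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)  # no work done yet
print(evens.count())                      # evaluation happens here -> 5
```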

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, e.g. by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated, since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules. Learning Spark covers the first three of these. The Streaming module handles data, such as log files, which is continually growing over time, using a DStream structure comprised of a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access; example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second-class citizen: functionality, particularly in I/O, is missing Python support. This does beg the question as to whether one should start analysis in Python and make the switch as and when required, or start in Scala or Java, to which you may well be forced eventually anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book, like Spark, is pitched at two audiences: data scientists and software engineers. This would explain support for Python and (more recently) R, to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class-declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might do using pandas or scikit-learn locally, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark; I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Book review: Mastering Gephi Network Visualisation by Ken Cherven
https://blog.scraperwiki.com/2015/06/book-review-mastering-gephi-network-visualisation-by-ken-cherven/
Mon, 15 Jun 2015

A little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi; it’s fair to say I was not very complimentary about it. It was rather short, and had quite a lot of screenshots. Its strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven, is a different, and better, book.

Networks in this context are collections of nodes connected by edges; networks are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products, and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area, including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.

Mastering Gephi is a practical guide to using the Gephi network visualisation software; it covers the more theoretical material regarding networks in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware; it is well-featured and under active development. Many of the network visualisations you see of, for example, Twitter social networks will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web, or taught courses, then Cherven’s books are pretty much your only alternative.

The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.

I found the layouts chapter particularly useful: it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas; layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent. From this chapter I discovered the directed acyclic graph (DAG) layout, which can be downloaded as a Gephi plugin. Tip: I had to seek this plugin out manually in the Gephi Marketplace; it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.

I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.
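The Chinese Whispers idea is simple enough to sketch in a few lines: each node repeatedly adopts the most common label among its neighbours, and clusters emerge. This toy version is deterministic (nodes and neighbours visited in sorted order) where the real algorithm uses random order, and the graph is made up for the example:

```python
from collections import Counter

def chinese_whispers(edges, iterations=10):
    """Toy, deterministic sketch of Chinese Whispers clustering:
    each node repeatedly adopts the most common label among its
    neighbours. (The real algorithm visits nodes in random order.)"""
    # Build an adjacency list from the edge list
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Start with every node in its own cluster
    labels = {node: node for node in neighbours}
    for _ in range(iterations):
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in sorted(neighbours[node]))
            # Adopt the most common neighbouring label
            labels[node] = counts.most_common(1)[0][0]
    return labels

# Two 4-cliques joined by a single bridge edge: expect two clusters
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("w", "x"), ("w", "y"), ("w", "z"), ("x", "y"), ("x", "z"), ("y", "z"),
         ("d", "z")]
labels = chinese_whispers(edges)
```

On this graph the labels converge so that each clique shares one label, with the bridge edge unable to merge them.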

I was aware of the ability of Gephi to show dynamic networks, ones that evolved over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing, they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.

The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are going to tell them, tell them, and then tell them what you told them. This works well in the limited time available in a spoken presentation – repetition helps the audience remember – but it feels a bit like overkill in a book. In a book we can flick back to remind ourselves what was written earlier.

It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic with a list of books appearing in the bibliography but references to some of the detail of algorithms only found in the text.

I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.

The publisher sent me this book, free of charge, for review.

Book review: Cryptocurrency by Paul Vigna and Michael J. Casey
https://blog.scraperwiki.com/2015/05/cryptocurrency-by-paul-vigna-and-michael-j-casey/
Fri, 22 May 2015

Amongst hipster start-ups in the tech industry Bitcoin has been a thing for a while. As one of the more elderly members of this community I wanted to understand a bit more about it. Cryptocurrency: How Bitcoin and Digital Money are Challenging the Global Economic Order by Paul Vigna and Michael Casey fits this bill.

Bitcoin is a digital currency: a Bitcoin has a value which can be exchanged against other currencies, but it has no physical manifestation. The really interesting thing is how Bitcoins move around without any central authority: there is no Bitcoin equivalent of the Visa or BACS payment systems with their attendant organisations, or central bank as in the case of a normal currency. This division between Bitcoin as currency and Bitcoin as decentralised exchange mechanism is really important.

Conventional payment systems like Visa have a central organisation which charges retailers a percentage on every payment made using their system. This is exceedingly lucrative. Bitcoin replaces this with the blockchain – a distributed ledger of cryptographically verified transactions. The validation is carried out by so-called ‘miners’ who are paid in Bitcoin for carrying out a computationally intensive task which ensures the scarcity of Bitcoin and helps maintain its value. In principle anybody can be a Bitcoin miner: all they need is the required free software and the ability to run it. The generation of new Bitcoin is strictly controlled by the fundamental underpinnings of the blockchain software. Bitcoin miners are engaged in a hardware arms race with each other as they compete to complete blocks on the blockchain; more processing power equals more chances to complete blocks ahead of the competition and hence win more Bitcoin. In practice, mining meaningful quantities these days requires significant, highly specialised hardware.
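The computationally intensive task is a proof-of-work search: hunt for a nonce that makes the block’s hash fall below a difficulty target. A toy sketch of the principle – real Bitcoin double-hashes a binary block header with SHA-256 against a vastly harder target, and the block data here is made up:

```python
import hashlib

def mine(block_data, difficulty=4):
    """Toy proof-of-work: find a nonce such that the SHA-256 hash of
    (block_data + nonce) starts with `difficulty` zero hex digits.
    Real Bitcoin mining works on binary block headers against a much
    harder target, but the principle is the same."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

# Finding the nonce takes thousands of hash attempts; checking it takes one.
nonce, digest = mine("some transactions", difficulty=4)
print(nonce, digest[:12])
```

The asymmetry – expensive to find, cheap for everyone else to verify – is what lets the network agree on the ledger without a central authority.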

Vigna and Casey provide a history of Bitcoin, starting with a bit of background as to how economists see currency; this amounts to the familiar division between the Austrian school and the Keynesians. The Austrians are interested in currency as gold, whilst the Keynesians are interested in currency as a medium for exchange. As a currency Bitcoin doesn’t appeal to Keynesians, since there is no “quantitative easing” in Bitcoin – the government can’t print money.

Bitcoin did not appear from nowhere: during the late 90s and the early years of the 21st century there were corporate attempts at building digital currencies. These died away; they had the air of lone-wolf operations hidden within corporate structures, which met their end perhaps when they filtered up to a certain level and their threat to the current business model was revealed. Or perhaps in the chaos of the financial collapse.

More pertinently there were the cypherpunks, a group interested in cryptography operating on the non-governmental, non-corporate side of the community. This group was also experimenting with ideas around digital currencies. This culminated in 2008 with the launch of Bitcoin, by the elusive Satoshi Nakamoto, to a cryptography mailing list. Nakamoto has since disappeared, no one has ever met him, no one knows whether he is the pseudonym of one of the cypherpunks, and if so, which one.

Following its release Bitcoin experienced a period of organic growth with cryptography enthusiasts and the technically curious. With the Bitcoin currency growing an ecosystem started to grow around it beginning with more user-friendly routes to accessing the blockchain – wallets to hold your Bitcoins, digital currency exchanges and tools to inspect the transactions on the blockchain.

Bitcoin has suffered reverses, most notoriously the collapse of the Mt Gox currency exchange and its use in the Silk Road market, which specialised in illegal merchandise. The Mt Gox collapse demonstrated both flaws in the underlying protocol and its vulnerability to poorly managed components in the ecosystem. Alongside this has been the wildly fluctuating value of the Bitcoin against other conventional currencies.

One of the early case studies in Cryptocurrency is of women in Afghanistan, forbidden by social pressure if not actual law from owning private bank accounts. Bitcoin provides them with a means for gaining independence and control over at least some financial resources. There is the prospect of it becoming the basis of a currency exchange system for the developing world where transferring money within a country or sending money home from the developed world are as yet unsolved problems, beset both with uncertainty and high costs.

To my mind Bitcoin is an interesting idea, as a traditional currency it feels like a non-starter but as a decentralized transaction mechanism it looks very promising. The problem with decentralisation is: who do you hold accountable? In two senses, firstly the technical sense – what if the software is flawed? Secondly, conventional currencies are backed by countries not software, a country has a stake in the success of a currency and the means to execute strategies to protect it. Bitcoin has the original vision of a vanished creator, and a very small team of core developers. As an aside Vigna and Casey point out there is a limit within Bitcoin of 7 transactions per second which compares with 10,000 transactions per second handled by the Visa network.

It’s difficult to see what the future holds for Bitcoin, Vigna and Casey run through some plausible scenarios. Cryptocurrency is well-written, comprehensive and pitched at the right technical level.

A tool to help with your next job move https://blog.scraperwiki.com/2015/05/a-tool-to-help-with-you-next-job-move/ Tue, 05 May 2015 07:32:49 +0000 https://blog.scraperwiki.com/?p=758222772 A guest post from Jyl Djumalieva.

Jyl Djumalieva

During February and March this year I had a wonderful opportunity to share a workspace with the ScraperWiki team. As an aspiring data analyst, I found it very educational to learn how real-life data science happens. After observing a ScraperWiki data scientist do some analytical heavy lifting, I was inspired to embark on an exploratory analytics project of my own. This is what I’d like to talk about in this blog post.

The world of work has fundamentally changed: employees no longer grow via vertical promotions. Instead they develop transferable skills and learn new skills to reinvent their careers from time to time.

As people go through career transitions, they increasingly need tools that would help them make an informed choice on what opportunities to pursue. Imagine, if you were considering your career options, whether you would benefit from a high level overview of what jobs are available and where. To prioritize your choices and guide personal development, you might also want to know the average salary for a particular job, as well as skills and technology commonly used. In addition, perhaps you would appreciate insights on what jobs are related to your current position and could be your next move.

So, did tools that meet all of these requirements previously exist? As far as I am aware: no. The good news is that there is a lot of job market information available. The challenge, however, is that the actual job postings are not linked directly to job metadata (e.g. average salary, tools and technology used, key skills, and related occupations). This is why, as an aspiring data analyst, I decided to bring the two sources of information together; you can find a merging of job postings with metadata demonstrated in the Tableau workbook here. For this purpose I primarily used Python, publicly available data and Tableau Public, not to mention the excellent guidance of the ScraperWiki data science team.

The source of the actual job postings was the Indeed.com job board. I chose this site because it is an aggregator of job postings and includes vacancies from companies’ internal applicant tracking systems in addition to vacancies published on open job boards. The site also has an API, which allows you to search job postings using multiple criteria and retrieve the results in XML format. For this example, I used the Indeed.com API to collect data for all job postings in the state of New York that were available on March 10, 2015. The script used to accomplish this task is posted to this Bitbucket repository.

Teaching jobs from Indeed in New York State, February 2015

To put in place the second element of the project – job metadata – I gathered information from O*NET (the Occupational Information Network), a valuable online source of occupational information. O*NET is implemented under the sponsorship of the US Department of Labor/Employment and Training Administration. O*NET provides an API for accessing most aspects of each occupation, such as skills, knowledge, abilities, tasks, related jobs, etc. It’s also possible to scrape data on average reported wages for each occupation directly from the O*NET website.

So, at this point, we have two types of information: actual job postings and job metadata. However, we need to find a way to link the two to turn them into a useful analytical tool. This proved to be the most challenging part of the project. Real-world job titles are not always easily assigned to a particular occupation. For instance, what do “Closing Crew” employees do, or what is the actual occupation of an “IT Invoice Analyst?”

The process for dealing with this challenge included exact and fuzzy matching of actual job titles with previously reported job titles, and assigning an occupation based on a keyword in a job title (e.g., “Registered nurse”). The fuzzywuzzy Python library was a great resource for this task. I intend to continue improving the accuracy and efficiency of the title matching algorithm.
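The underlying idea of fuzzy title matching – score the similarity of a posted title against each known title and keep the best – can be sketched with the standard library’s difflib alone (fuzzywuzzy offers the same pattern via `process.extractOne`; the titles below are illustrative, not from the real data):

```python
from difflib import SequenceMatcher

def best_match(title: str, known_titles: list[str]) -> tuple[str, float]:
    """Return the known title most similar to `title`, with a 0-1 score --
    a stdlib stand-in for fuzzywuzzy's process.extractOne."""
    def score(candidate: str) -> float:
        # Case-insensitive similarity based on longest matching blocks.
        return SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
    best = max(known_titles, key=score)
    return best, score(best)

titles = ["Registered Nurse", "Software Developer", "Financial Analyst"]
match, score = best_match("Registered nurse - ICU", titles)
print(match, round(score, 2))
```

In practice you would only accept a match above some score threshold and fall back to keyword rules for the rest, which is roughly the two-stage process described above.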

After collecting, extracting and processing the data, the next step was to find a way to visualize it. Tableau Public turned out to be a suitable tool for this task and made it possible to “slice and dice” the data from a variety of perspectives. In the final workbook, the user can search the actual job postings for NY state as well as see the information on geographic location of the jobs, average annual and hourly wages, skills, tools and technology frequently used and related occupations.

Next move, tools and technologies from the Tableau tool

I encourage everyone who wants to understand what’s going on in their job sector to check my Tableau workbook and bitbucket repo out. Happy job transitioning!

Adventures in Kaggle: Forest Cover Type Prediction https://blog.scraperwiki.com/2015/04/adventures-in-kaggle-forest-cover-type-prediction/ Sun, 26 Apr 2015 06:36:25 +0000 https://blog.scraperwiki.com/?p=758222766 Regular readers of this blog will know I’ve read quite a few machine learning books; now it’s time to put this learning into action. We’ve done some machine learning for clients, but I thought it would be good to do something I could share. The Forest Cover Type Prediction challenge on Kaggle seemed to fit the bill. Kaggle is the self-styled home of data science; it hosts a variety of machine-learning-oriented competitions, ranging from introductory, knowledge-building ones (such as this one) to commercial ones with cash prizes for the winners.

In the Forest Cover Type Prediction challenge we are asked to predict the type of tree found on 30x30m squares of the Roosevelt National Forest in northern Colorado. The features we are given include the altitude at which the land is found, its aspect (direction it faces), various distances to features like roads, rivers and fire ignition points, soil types and so forth. We are provided with a training set of around 15,000 entries where the tree types are given (Aspen, Cottonwood, Douglas Fir and so forth) for each 30x30m square, and a test set for which we are to predict the tree type given the “features”. This test set runs to around 500,000 entries. This is a straightforward supervised machine learning “classification” problem.

The first step must be to poke about at the data, which I did largely in Tableau. The feature most obviously providing predictive power is the elevation, or altitude, of the area of interest. This is shown in the figure below for the training set: we see Ponderosa Pine and Cottonwood predominating at lower altitudes, transitioning to Aspen, Spruce/Fir and finally Krummholz at the highest altitudes. Reading Wikipedia, we discover that Krummholz is not actually a species of tree, rather something that happens to trees of several species in the cold, windswept conditions found at high altitude.


Data inspection over, I used the scikit-learn library in Python to predict tree type from the features. scikit-learn makes it ridiculously easy to jump between classifier types: the interface for each classifier is the same, so once you have one running, swapping in another classifier is a matter of a couple of lines of code. I tried out a couple of variants of Support Vector Machines, decision trees, k-nearest neighbours, AdaBoost and the extremely randomised trees ensemble classifier (ExtraTrees). This last was best at classifying the training set.
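That uniform fit/predict interface is easy to demonstrate. A minimal sketch, using synthetic data as a stand-in for the competition’s features (elevation, aspect, distances and so on):

```python
from sklearn.ensemble import ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data: 3 classes, 10 numeric features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Every scikit-learn classifier shares fit/predict/score, so swapping one
# for another is just a different constructor in this list.
for clf in [ExtraTreesClassifier(random_state=42),
            KNeighborsClassifier(),
            AdaBoostClassifier(random_state=42)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```

The same loop works for Support Vector Machines or decision trees too; only the constructor line changes.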

The challenge is in mangling the data into the right shape and selecting the features to use, this is the sort of pragmatic knowledge learnt by experience rather than book-learning. As a long time data analyst I took the opportunity to try something: essentially my analysis programs would only run when the code had been committed to git source control and the SHA of the commit, its unique identifier, was stored with the analysis. This means that I can return to any analysis output and recreate it from scratch. Perhaps unexceptional for those with a strong software development background but a small novelty for a scientist.
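The idea of tying each analysis run to a commit can be sketched in a few lines; the helper name here is mine, not from the original scripts, and the function simply refuses gracefully outside a git repository:

```python
import subprocess

def current_commit_sha():
    """Return the SHA of the current git commit, or None if git is
    unavailable or we are not inside a repository. Storing this
    alongside each analysis output makes the run reproducible."""
    try:
        result = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None

sha = current_commit_sha()
print(sha)
```

A stricter version would also run `git status --porcelain` and abort if there are uncommitted changes, which is closer to the “only run when committed” discipline described above.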

Using a portion of the training set to do an evaluation, it looked like I was going to do really well on the Kaggle leaderboard, but on first uploading my competition solution things looked terrible! It turns out this is a common experience and is a result of the relative composition of the training and test sets. Put crudely, the test set is biased to higher altitudes than the training set, so a classifier trained on the unmodified training set gives poorer results than expected from measurements on a held-back part of the training set. You can see the distribution of elevation in the test set below, and compare it with the training set above.


We can fix this problem by biasing the training set to more closely resemble the test set; I did this on the basis of elevation. This eventually got me to rank 430 on the leaderboard, shown in the figure below. We can see that I’m somewhere up the long, shallow plateau of performance. There is a breakaway group of about 30 participants doing much better, and at the bottom there are people who perhaps made large errors in analysis but were rescued by the robustness of machine learning algorithms (I speak from experience here!).
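The biasing step can be sketched as importance resampling on elevation: weight each training row by the ratio of test-set to training-set density in its elevation bin, then resample with those weights. Synthetic elevations stand in for the real columns here, and the exact method the post used may have differed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: training elevations skewed low, test skewed high.
train_elev = rng.normal(2800, 300, 15000)
test_elev = rng.normal(3100, 300, 5000)

# Estimate the density of each set on a common set of elevation bins.
bins = np.linspace(2000, 4000, 41)
train_hist, _ = np.histogram(train_elev, bins=bins, density=True)
test_hist, _ = np.histogram(test_elev, bins=bins, density=True)

# Weight each training row by test density / train density in its bin,
# then resample the training set with those weights.
idx = np.clip(np.digitize(train_elev, bins) - 1, 0, len(train_hist) - 1)
weights = test_hist[idx] / np.maximum(train_hist[idx], 1e-12)
weights /= weights.sum()
resampled = rng.choice(len(train_elev), size=len(train_elev),
                       replace=True, p=weights)

print(train_elev.mean(), train_elev[resampled].mean())
```

The resampled training set has an elevation distribution much closer to the test set’s, so a classifier trained on it no longer systematically under-predicts the high-altitude tree types.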


There is no doubt some mileage in tuning the parameters of the different classifiers and no doubt winning entries use more sophisticated approaches. scikit-learn does pretty well out of the box, and tuning it provides marginal improvement. We observed this in our earlier machine learning work too.

I have mixed feelings about the Kaggle competitions. The data is nicely laid out, the problems are interesting and it’s always fun to compete. They are a great way to dip your toes into semi-practical machine learning applications. The size of the awards means it doesn’t make much sense to take part on a commercial basis.

However, the data are presented so as to exclude the use of domain knowledge; they are set up very much as machine learning challenges. Look down the list of competitions and see how many feature obfuscated data, likely for reasons of commercial confidence or to make a problem more “machine learning” and less amenable to domain knowledge. To a physicist this is just a bit offensive.

If you are interested in a slightly untidy blow by blow account of my coding then it is available here in a Bitbucket Repo.

Book review: How Linux works by Brian Ward https://blog.scraperwiki.com/2015/04/book-review-how-linux-works-by-brian-ward/ Tue, 14 Apr 2015 07:43:18 +0000 https://blog.scraperwiki.com/?p=758222748 A break since my last book review, since I’ve been coding, rather than reading, on the commute into the ScraperWiki offices in Liverpool. Next up is How Linux Works by Brian Ward. In some senses this book follows on from Data Science at the Command Line by Jeroen Janssens. Data Science was about doing analysis with command line incantations; How Linux Works tells us about the system in which that command line exists and makes the incantations less mysterious.

I’ve had long experience with doing analysis on Windows machines, typically using Matlab, but over many years I have also dabbled with Unix systems including Silicon Graphics workstations, DEC Alphas and, more recently, Linux. These days I use Ubuntu to ensure compatibility with my colleagues and the systems we deploy to the internet. Increasingly I need to know more about the underlying operating system.

I’m looking to monitor system resources, manage devices and configure my environment. I’m not looking for a list of recipes; I’m looking for a mindset. How Linux Works is pretty good in this respect. I had a fair understanding of pipes in *nix operating systems before reading the book; another fundamental I learnt from How Linux Works was that files are used to represent processes and memory. The book is also good on where these files live – although this varies a bit with distribution and time. Files are used liberally to provide configuration.
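That “processes are files” point is easy to demonstrate: on Linux every running process appears as a directory of files under /proc. A small sketch (which returns None on systems without /proc):

```python
import os
from pathlib import Path

def process_name(pid):
    """Return a process's name by reading /proc/<pid>/comm, or None
    when /proc is unavailable (e.g. on macOS or Windows)."""
    try:
        return Path(f"/proc/{pid}/comm").read_text().strip()
    except OSError:
        return None

# The current process reading its own name out of the filesystem.
print(process_name(os.getpid()))
```

Neighbouring files like /proc/<pid>/status and /proc/meminfo expose memory and resource usage the same way, which is what tools like top and ps read under the hood.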

The book has 17 chapters covering the basics of Linux and the directory hierarchy, devices and disks, booting the kernel and user space, logging and user management, monitoring resource usage, networking and aspects of shell scripting and developing on Linux systems. They vary considerably in length with those on developing relatively short. There is an odd chapter on rsync.

I got a bit bogged down in the chapters on disks, how the kernel boots, how user space boots and networking. These chapters covered their topics in excruciating detail, much more than required for day to day operations. The user startup chapter tells us about systemd, Upstart and System V init – three alternative mechanisms for booting user space. Systemd is the way of the future, in case you were worried. Similarly, the chapters on booting the kernel and managing disks at a very low level provide more detail than you are ever likely to need. The author does suggest the more casual reader skip through the more advanced areas but frankly this is not a directive I can follow. I start at the beginning of a book and read through to the end, none of this “skipping bits” for me!

The user environments chapter has a nice section explaining clearly the sequence of files accessed for profile information when a terminal window is opened, or other login-like activity. Similarly the chapters on monitoring resources seem to be pitched at just the right level.

Ward’s task is made difficult by the complexity of the underlying system. Linux has an air of “If it’s broke, fix it; and if it ain’t broke, fix it anyway”. Ward mentions at one point that a service in Linux had not changed for a while, therefore it was ripe for replacement! Each new distribution appears to have heard about standardisation (i.e. where to put config files) but has chosen to ignore it. And if there is consistency in the options to Linux commands it is purely coincidental. I think this is my biggest bugbear in Linux: I know which command to use, but the right option flags are blindly remembered rather than understood.

The more Linux-oriented faction of ScraperWiki seemed impressed by the coverage of the book. The chapter on shell scripting is enlightening, providing the mindset rather than the detail, so that you can solve your own problems. It’s also pragmatic in highlighting where to stop with shell scripting and move to another language. I was disturbed to discover that the open-square-bracket character in shell scripts is actually a command. This “explain the big picture rather than trying to answer a load of little questions” approach is a mark of a good technical book. The detail you can find on Stack Overflow or other Googling.

How Linux Works has a good bibliography, it could do with a glossary of commands and an appendix of the more in depth material. That said it’s exactly the book I was looking for, and the writing style is just right. For my next task I will be filleting it for useful commands, and if someone could see their way to giving me a Dell XPS Developer Edition for “review”, I’ll be made up.