Data Science – ScraperWiki

Which car should I (not) buy? Find out, with the ScraperWiki MOT website… Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich. This means that it is designed to give you facts and figures and provide an interface for you to interactively select and view data of your interest, in this case to query the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is the option of “filter by year” which selects vehicles by the first year on the road in order to narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults; if you want to break each down into more detailed descriptions, you can click the ‘detail’ button:


What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead. The ScraperWiki MOT website, by contrast, uses a dictionary stored in memory as its data source. A dictionary is a data structure in Python similar to a HashMap in Java: it consists of key-value pairs and is known to be efficient for fast lookups. But where do we get a dictionary from, and what should its structure be? Let’s back up a little bit, maybe to the very beginning. The following general procedure got us to where we are with the data:

  • Nine years of MOT data were downloaded and concatenated.
  • Unix command-line data manipulation tools were used to extract the columns of interest.
  • The data was then loaded into a PostgreSQL database, where data integration and analysis were carried out: joining tables, then grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built from this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted to a nested dictionary; that is to say, a dictionary of dictionaries, where the value associated with a key is itself another dictionary that can be accessed using a different key.

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
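As a sketch of how this works, the following minimal Python example builds a nested dictionary from a text file and computes a pass rate from it. The column layout, key structure and tiny sample data are illustrative assumptions, not the real MOT file:

```python
import csv, io

# Stand-in for the aggregated text file exported from PostgreSQL.
sample = """make,model,year,testresult,count
FORD,FIESTA,2008,PASS,900
FORD,FIESTA,2008,FAIL,100
FORD,FIESTA,2009,PASS,450
FORD,FIESTA,2009,FAIL,50
"""

# Build a nested dictionary: make -> model -> year -> testresult -> count
data = {}
for row in csv.DictReader(io.StringIO(sample)):
    node = data.setdefault(row["make"], {}).setdefault(row["model"], {})
    node = node.setdefault(int(row["year"]), {})
    node[row["testresult"]] = node.get(row["testresult"], 0) + int(row["count"])

def pass_rate(make, model, year=None):
    """Pass rate for a make/model; aggregates across all years when year is None."""
    years = data[make][model]
    selected = [years[year]] if year is not None else list(years.values())
    passed = sum(d.get("PASS", 0) for d in selected)
    total = sum(sum(d.values()) for d in selected)
    return passed / total

print(pass_rate("FORD", "FIESTA", 2008))  # 0.9
print(pass_rate("FORD", "FIESTA"))        # 0.9, aggregated across all years
```

Once the dictionary is in memory, every query is a few key lookups and a sum, with no database round-trip at all.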

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me or ScraperWiki about this work and… enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at!
The most prescribed medication for each BNF Chapter Wed, 19 Aug 2015 10:53:06 +0000

In previous blog posts I introduced the definitions of the elements used in my research to find the most prescribed items for each BNF Chapter. With those definitions established, I will now reveal my findings. Using Tableau Public I found the top 10 most prescribed items for several months of 2014, and the results were mostly the same across these months. I was particularly interested in what the most prescribed medication would be under each BNF Chapter for those months. In this blog I will talk about my results for April 2014.

BNF Chapter 1: Gastro-intestinal System

The most prescribed item was Omeprazole Tablets 20mg, with 2,002,522 items of this medication dispensed in April 2014. This is shown in Figure 1. Omeprazole tablets are used to treat symptoms caused by excess amounts of acid in the stomach, such as stomach ulcers, heartburn and indigestion, by reducing acid in the stomach. They can also be used to promote healing of erosion of the oesophagus (esophagitis) caused by stomach acid. This medication may have been dispensed in large amounts because much of the food many of us eat today, such as white bread, pasta, cookies, cakes and crackers, lacks enzymes, so the stomach produces an excessive amount of acid to compensate, leading to acid indigestion.

Top 10 prescribed items for BNF Chapter 1


BNF Chapter 2: Cardiovascular System

The most prescribed item was Simvastatin tablets 40mg, at 1,829,256 items dispensed in April 2014. This medication is used to reduce levels of “bad” low-density lipoprotein (LDL) cholesterol while increasing levels of “good” high-density lipoprotein (HDL) cholesterol. The second most prescribed medication under BNF Chapter 2 was Aspirin Dispersible tablets, at 1,781,963 items. Aspirin Dispersible tablets are used for pain relief and reducing temperature, act as an anti-inflammatory to reduce swelling, and can also be used to prevent the formation of blood clots. It is argued that Simvastatin tablets were dispensed in such numbers because many doctors prescribe them as a preventive measure to patients who have high blood pressure, as well as to patients with high cholesterol. Aspirin Dispersible tablets were also dispensed in large amounts because they treat a number of different symptoms that are common, with most people suffering from them at some point in their life.

BNF Chapter 3: Respiratory System

The most prescribed item in April 2014 was the Salbutamol Inhaler 100mcg, with 890,667 items dispensed. It is used to treat breathing disorders: it relaxes muscles in the lungs and helps to keep the airways open, making it easier to breathe. It is used by adults, adolescents and children aged 4 to 11 years, as it is a good, reliable way to help those with mild, moderate or severe asthma. The high number of items is likely to be because asthma is a serious long-term condition that affected 1 in 12 adults and 1 in 11 children in the UK in 2014, according to Asthma UK. There were 1,167 deaths from asthma in the UK in 2011, and the use of the fast-working Salbutamol Inhaler can reduce and prevent symptoms of asthma and so decrease the number of deaths caused by this condition. This again shows why the Salbutamol Inhaler was the most prescribed item in BNF Chapter 3.

BNF Chapter 4: Central Nervous System

The most prescribed medication was Paracetamol tablets 500mg, which are used to treat pain including headaches, toothache, cold or flu symptoms and back or period pain. This item was prescribed far more than other items in this BNF Chapter, with 1,551,769 items dispensed in April 2014. This is evident when compared to the 676,756 items of the second most prescribed medication, Citalopram Hydrobromide tablets 20mg. Citalopram Hydrobromide is a medicine used for depression and panic disorders. A reason why paracetamol may be dispensed so much is that the symptoms it treats are common to many people.

BNF Chapter 5 : Infections

The most prescribed medication in April 2014 was Amoxicillin Capsules 500mg at 548,282 items. It is a penicillin antibiotic that can be used to treat many different types of infection caused by bacteria, such as tonsillitis, bronchitis, pneumonia, gonorrhea, and infections of the ear, nose, throat and skin.

BNF Chapter 6: Endocrine System

The most prescribed medication was Metformin Hydrochloride tablets 500mg, which are used in diabetes mellitus. There were 996,821 items of this dispensed in April 2014. It is used to control high blood sugar levels and helps the body respond correctly to the insulin it produces, and is therefore typically used by people who have type 2 diabetes.

BNF Chapter 7: Obstetrics, Gynaecology and Urinary-tract Disorders

The most prescribed item was Tamsulosin Hydrochloride tablets 400mcg, at 497,341 items. It is used to relax the muscles in the prostate and bladder neck, making it easier for men with an enlarged prostate to urinate. There is a large difference between the most prescribed medication and the second most prescribed: Solifenacin tablets 5mg were dispensed 116,124 times in April 2014. This medication is used to reduce the frequency of passing urine. The large difference is shown in Figure 2.

Top 10 prescribed items in BNF chapter 7


BNF Chapter 8: Malignant Diseases and Immunosuppression

The most prescribed item was Azathioprine tablets 500mg, which help to suppress overactivity in the immune system and to limit inflammation, reducing pain and swelling. They are also used by people who have undergone organ transplants, as they help to reduce the chance of the body rejecting the new organ. There were 52,476 items of this medication dispensed in April 2014. This was closely followed by the second most prescribed item, Tamoxifen Citrate tablets 20mg, at 49,661 items dispensed. Tamoxifen Citrate tablets are effective in the treatment of breast cancer, which occurs in both women and men and can spread to vital organs. As this condition is life-threatening, Tamoxifen Citrate tablets, which can be effective in treating it, were dispensed in large amounts.

BNF Chapter 9: Nutrition and Blood

The most prescribed item in April 2014 was Folic Acid 5mg, at 442,617 items. This is used to help produce and maintain new cells and prevent changes to DNA that could cause cancer. According to Cancer Research UK there were 331,487 new cases of cancer in the UK in 2011. This is a large number, and it can be argued that in order to reduce the number of people who get cancer, drugs that are useful in preventing it, such as Folic Acid 5mg, are dispensed in large quantities. This may explain why it was the most prescribed item in BNF Chapter 9 in all the months I looked at.

BNF Chapter 10: Musculoskeletal and Joint Diseases

The most prescribed medication was Naproxen tablets 500mg at 307,772 items in April 2014. Naproxen is used to reduce hormones that cause inflammation and pain in the body.

BNF Chapter 11: Eye

The most prescribed item in April 2014 was Latanoprost Eye Drops 500mcg at 182,354 items. This is used for conditions such as Ocular Hypertension and Glaucoma as it helps to lower pressure in the eye by increasing fluid drainage from the eye.

BNF Chapter 12: Ear, Nose and Oropharynx

The most prescribed item was Beclometasone Dipropionate, which is used to relieve symptoms of asthma and chronic obstructive pulmonary disease. There were 140,352 items of this medication dispensed in April 2014. It was closely followed by Mometasone Furoate Nasal Spray 5mg, at 134,921 items. This medication is a steroid used to treat nasal symptoms such as sneezing and a runny nose caused by allergies. It is likely that these two medications were dispensed in large amounts at this time because there is a lot of pollen in the air in April, leading to the symptoms these drugs treat.

BNF Chapter 13: Skin

The most prescribed medication in April 2014 was Doublebase Gel, at 143,002 items dispensed. It is used to moisturise skin by replacing lost water within the skin. Figure 3 shows that BNF Chapter 13 has the closest numbers for its top two most prescribed items: the second most prescribed item, Diprobase Cream, was dispensed 142,806 times. This medication is used to treat eczema and other dry skin conditions.

Top 10 prescriptions for BNF chapter 13


BNF Chapter 14: Immunological Products and Vaccines

The most prescribed medication was the Revaxis vaccine 0.5ml, at 38,473 items in April 2014. Revaxis is used to protect against diphtheria, tetanus and polio. It is usually given to children from the age of six, teenagers and adults, and the booster vaccine is typically given out at secondary school.

BNF Chapter 15: Anaesthesia

The most prescribed item was Lido Hydrochloride Injection 1% 2ml, which is used as a local anaesthetic when performing operations. There were 19,954 items of this dispensed in April 2014.


The final figure shows the most prescribed items for each BNF Chapter in April 2014. Out of all of these, Omeprazole Tablets 20mg under BNF Chapter 1 (Gastro-intestinal System) was dispensed in the largest amount, with 2,002,522 items of this medication dispensed. It also shows that one of the least dispensed medications among the most prescribed items was Azathioprine tablets 500mg, at 52,476 items. The BNF Chapter with the smallest number of items dispensed was Chapter 15 (Anaesthesia).

Overall I have not been surprised by the drugs that were the most prescribed for each BNF Chapter, as they are drugs that treat long-term illnesses or symptoms that most people will experience in their life. This is apparent when looking at the most prescribed items for BNF Chapters 2, 3, 6, 9, 12 and 13, which treat or prevent long-term illnesses such as high cholesterol, asthma, diabetes and cancer. Meanwhile the most prescribed items for most of the other BNF Chapters help with common symptoms that nearly everyone will experience, such as sneezing, inflammation and pain in the body, headaches, toothache and different types of infection caused by bacteria.

The most prescribed medication for each BNF Chapter in April 2014


I now look forward to digging deeper into the GP Prescribing data to find out more new things such as whether there are seasonal variations in the medication that is prescribed and whether there are variations by location.

GP Prescribing Datasets Fri, 14 Aug 2015 07:39:54 +0000

In a previous blog post I described the terms used in the GP Prescribing data. Here I will introduce you to the various datasets which are published in this series. They can all be found on the Health and Social Care Information Centre data catalogue page.

Prescriptions Dispensed in the Community, Statistics for England

Every year this bulletin gives a summary of prescriptions dispensed in the community by community pharmacists, appliance contractors and dispensing doctors in England over the previous 10 years. The first to be made publicly accessible was Prescriptions Dispensed in the Community, Statistics for England 1994–2004, and one has been published every year since, each giving an overview of the previous 10 years. The latest bulletin is Prescriptions Dispensed in the Community, Statistics for England 2004–2014.

Each bulletin shows the changes within the most recent 10 years. It also includes the overall net ingredient cost of prescriptions, the leading BNF sections in terms of chemical name, number of items, item difference and NIC difference.

The specific source for these statistics is the Prescription Cost Analysis (PCA) data. The Health and Social Care Information Centre (HSCIC) publishes the Prescription Cost Analysis National Statistic, based on PCA figures for the most recent year, in April each year.

Clinical Commissioning Group Prescribing Data

Clinical Commissioning Groups are part of the NHS and are responsible for the services in their local area, finding out what services are needed and what is provided. The Clinical Commissioning Group prescribing data is released every quarter, so there is data for January–March, April–June, July–September and October–December. It is published as a CSV file to make it more accessible to the public.

The Clinical Commissioning Group data covers prescription data for the North of England, Midlands and East of England, London, and South of England. Variations in the number of items and NIC may be affected by the populations of the different areas.

GP Practice Prescribing Presentation-level Data

The practice-level data is the finest-grained presentation of the prescribing data. It is collected and presented every month and has a large file size (over 1GB). It includes a list of all medications, dressings and appliances that are prescribed and dispensed each month, under their BNF code and BNF name. For each medicine, dressing and appliance the data shows the total number of items prescribed and dispensed, the total net ingredient cost, the total actual cost and the total quantity. It also gives the code of each practice, period, strategic health authority and primary care trust. This data covers prescriptions written and dispensed in England, and those dispensed outside of England.
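Because the file is over 1GB, it is worth processing it a row at a time rather than loading it whole. The sketch below streams rows with Python’s csv module and aggregates items per BNF code; the column names and sample rows are illustrative assumptions, not the exact HSCIC header:

```python
import csv, io
from collections import defaultdict

# Stand-in for the presentation-level CSV; with a real file you would use
# open(path, newline="") instead of io.StringIO and stream it the same way.
sample = """practice,bnf_code,bnf_name,items,nic,act_cost,quantity,period
A81001,0212000U0AAADAD,Simvastatin_Tab 40mg,120,250.0,232.5,3360,201404
A81002,0212000U0AAADAD,Simvastatin_Tab 40mg,80,170.0,158.1,2240,201404
"""

# Total items per BNF code, one row at a time, so memory use stays constant
# regardless of file size.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    totals[row["bnf_code"]] += int(row["items"])

print(totals["0212000U0AAADAD"])  # 200
```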

NICE Technology Appraisals in the NHS in England, Innovation Scorecard – Experimental statistics

The National Institute for Health and Care Excellence (NICE) works to reduce variation in the availability and quality of NHS treatments and care, trying to provide equally high-quality care across the different areas of England by giving advice on the use of medication and treatment by the NHS. This dataset is experimental and uses available data to show variations and trends across time, type of medication and location. It includes interactive maps and spreadsheets that allow users to pick the organisations and time periods they want to find trends for; it therefore includes national-level data on medication use, area-team-level data on medicine use, CCG-level medicine use, and trust-level data on medical technology sales and purchases.

Hospital Prescribing

This data has been presented every year since 2004. It compares the cost of NICE-appraised medicines for primary and secondary care trusts, and every year the focus is on a different kind of medication: in 2010 it was ADHD, psychosis and anti-tumour necrosis factor (TNF) medicines, in 2011 it was medicines for HIV and AIDS, and in 2012 it was antibacterial drugs.

The report includes the information sources, NIC, coverage, overall cost (nationally and at strategic health authority level), medications appraised by NICE, specific therapeutic areas, sources and definitions, and the drugs included in the analyses. The last hospital prescribing dataset available on the Health and Social Care Information Centre data catalogue page was from 2012.

What’s next?…

After reading about what each of the datasets covers, I decided that the GP Prescribing Presentation-level Dataset would be the best for me to use when trying to discover trends and patterns in GP Prescribing data. I now look forward to finding out what the most prescribed medication is for each BNF Chapter, whether there are seasonal variations for some items, and perhaps the differences in the amount of money the NHS is spending on proprietary drugs versus generic drugs.

GP Prescribing data for the UK Wed, 12 Aug 2015 10:36:10 +0000

Over the past few weeks I have been looking at GP Prescribing data from the Health & Social Care Information Centre, which presents the number of items and the cost of all the different medications prescribed and dispensed by GP practices across the UK. The dataset amounts to millions of rows of data each month. I am trying to find trends and patterns in the number of items within this data.

As part of my internship, provided by the Q-step programme, I am trying to think more quantitatively.

One of the things I have learnt is that when given a dataset the first thing to do with it is to break it down and make sure the meaning of everything is understood. Therefore with the data I am looking at I researched the meaning of each heading for the columns on the dataset. In this blog I will explain what each of these terms mean.

The British National Formulary (BNF)

Central to the GP prescribing data is the BNF. This is the British National Formulary which is produced by The British Medical Association and the Royal Pharmaceutical Society. It is used to give doctors and nurses advice on the selection, prescribing, dispensing and administration of medication in the UK. The BNF classifies medicines into therapeutic groups which are known as BNF Chapters. There are 15 BNF chapters and some ‘pseudo BNF chapters’ (numbered 18 to 23) that include items such as dressings and appliances. The 15 BNF Chapters are:

  • Chapter 1: Gastro-intestinal System
  • Chapter 2: Cardiovascular System
  • Chapter 3: Respiratory System
  • Chapter 4: Central Nervous System
  • Chapter 5: Infection
  • Chapter 6: Endocrine System
  • Chapter 7: Obstetrics, Gynaecology and Urinary-tract Disorders
  • Chapter 8: Malignant Diseases and Immunosuppression
  • Chapter 9: Nutrition and Blood
  • Chapter 10: Musculoskeletal and Joint Diseases
  • Chapter 11: Eye
  • Chapter 12: Ear, Nose and Oropharynx
  • Chapter 13: Skin
  • Chapter 14: Immunological Products and Vaccines
  • Chapter 15: Anaesthesia

Under each BNF Chapter there are subsections for example, under Chapter 2 (Cardiovascular System) one of the subsections is 2.12 Lipid-regulating drugs.

The BNF Code is the unique code that each medication has. An example of a BNF Code is 0212000U0AAADAD which is for the drug Simvastatin Tablet 40mg. The BNF Code for each drug is formed as follows:

  • Characters 1 & 2 show the BNF Chapter (02)
  • 3 & 4 show the BNF Section (12)
  • 5 & 6 show the BNF paragraph (00)
  • 7 shows the BNF sub-paragraph (0)
  • 8 & 9 show the Chemical Substance (U0)
  • 10 & 11 show the Product (AA)
  • 12 & 13 indicate the Strength and Formulation (AD)
  • 14 & 15 show the equivalent (AD). The ‘equivalent’ is defined as follows:
    • If the presentation is a generic, the 14th and 15th character will be the same as the 12th and 13th character.
    • Where the product is a brand, the 14th and 15th characters will match those of the generic equivalent, unless the brand does not have a generic equivalent, in which case A0 will be used.
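The position rules above can be expressed as simple string slicing. This is an illustrative sketch (the function name is my own), using the Simvastatin Tablet 40mg code from the text as the worked example:

```python
def parse_bnf_code(code):
    """Split a 15-character BNF code into its named parts."""
    assert len(code) == 15
    return {
        "chapter":       code[0:2],    # characters 1-2
        "section":       code[2:4],    # characters 3-4
        "paragraph":     code[4:6],    # characters 5-6
        "sub_paragraph": code[6],      # character 7
        "chemical":      code[7:9],    # characters 8-9
        "product":       code[9:11],   # characters 10-11
        "strength_form": code[11:13],  # characters 12-13
        "equivalent":    code[13:15],  # characters 14-15
        # A generic presentation repeats characters 12-13 in 14-15.
        "is_generic":    code[11:13] == code[13:15],
    }

parsed = parse_bnf_code("0212000U0AAADAD")
print(parsed["chapter"], parsed["section"], parsed["is_generic"])  # 02 12 True
```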

The BNF Name is the individual preparation name for each drug. It includes the name of the drug, which could be branded or generic, followed by the form it comes in and the strength of the medication. On the GP Prescribing Data – Presentation Level dataset I used, the BNF names were often presented in an abbreviated form due to the limited number of characters available in the dataset.

Other terms

  • A Strategic Health Authority (SHA) is an NHS organisation established to lead the strategic development of the local health service and to manage Primary Care Trusts and NHS Trusts. SHAs are responsible for organising working relationships through service level agreements.
  • A Primary Care Trust (PCT) sits under an SHA and is a local organisation responsible for managing health services in the community, such as GP surgeries, NHS walk-in centres, dentists and opticians. In the last 2–3 years PCTs have been replaced by Clinical Commissioning Groups (CCGs), but much of the data I was looking at refers to PCTs.
  • Items are defined as the number of items that were dispensed in the specified month. A prescription item is a single supply of a medicine, dressing or appliance written on a prescription form. If one prescription form includes four medicines, it is counted as four prescription items.
  • Quantity is the amount of the drug dispensed, measured in units that depend on the makeup of the medication: for a tablet or capsule the quantity will be the number of tablets or capsules, whereas for a cream or gel the quantity will be in grammes.
  • Net Ingredient Cost (NIC) is the price of the drug written on the price list or drug tariff.
  • Actual Cost is the estimated cost to the NHS. It is calculated by subtracting the average percentage discount per item (based on the previous month) from the Net Ingredient Cost, and adding in the cost of a container for each prescription item. It is usually lower than NIC.
  • Period is the year and month that the dataset covers.
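The Actual Cost definition above can be written out as a small formula. The discount rate and container cost below are made-up illustrative numbers, not real NHS figures:

```python
def actual_cost(nic, items, avg_discount_pct, container_cost_per_item=0.10):
    """Estimated cost to the NHS: NIC less the average percentage discount
    (based on the previous month), plus a container cost per prescription item."""
    return nic * (1 - avg_discount_pct / 100) + container_cost_per_item * items

# e.g. a 100.00 NIC across 4 items with a 7% average discount:
print(round(actual_cost(100.0, 4, 7.0), 2))  # 93.4
```

Note how the result comes out below the NIC, matching the statement that Actual Cost is usually lower than NIC.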

Now that I understand the meanings of each column of the dataset I am looking at, I am trying to find new things with it. Feel free to refer back to this blog when reading my future blogs on my findings, especially if you stumble upon something you have forgotten the meaning of.

Book Review: Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia Mon, 06 Jul 2015 10:00:46 +0000

Apache Spark is a system for doing data analysis which can be run on a single machine or across a cluster. It is pretty new technology – initial work was in 2009 and Apache adopted it in 2013. There’s a lot of buzz around it, and I have a problem for which it might be appropriate. The goal of Spark is to be faster and more amenable to iterative and interactive development than Hadoop MapReduce – a sort of IPython of Big Data. I used my traditional approach to learning more: buying a dead-tree publication, Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, and then reading it on my commute.

The core of Spark is the resilient distributed dataset (RDD), a data structure which can be distributed over multiple computational nodes. Creating an RDD is as simple as passing a file URL to a constructor (the file may be located on some Hadoop-style system) or parallelizing an in-memory data structure. To this data structure are added transformations and actions. Transformations produce another RDD from an input RDD; for example, filter() returns an RDD which is the result of applying a filter to each row in the input RDD. Actions produce a non-RDD output; for example, count() returns the number of elements in an RDD.

Spark provides functionality to control how parts of an RDD are distributed over the available nodes, e.g. by key. In addition there is functionality to share data across multiple nodes using “Broadcast Variables”, and to aggregate results in “Accumulators”. The behaviour of Accumulators in distributed systems can be complicated, since Spark might preemptively execute the same piece of processing twice because of problems on a node.

In addition to Spark Core there are Spark Streaming, Spark SQL, MLlib machine learning, GraphX and SparkR modules; Learning Spark covers the first three of these. The Streaming module handles data which grows continually over time, such as log files, using a DStream structure which comprises a sequence of RDDs with some additional time-related functions. Spark SQL introduces the DataFrame data structure (previously called SchemaRDD) which enables SQL-like queries using HiveQL. The MLlib library introduces a whole bunch of machine learning algorithms such as decision trees, random forests, support vector machines, naive Bayes and logistic regression. It also has support routines to normalise and analyse data, as well as clustering and dimension reduction algorithms.

All of this functionality looks pretty straightforward to access, and example code is provided for Scala, Java and Python. Scala is a functional language which runs on the Java virtual machine, so it appears to get equivalent functionality to Java. Python, on the other hand, appears to be a second-class citizen: some functionality, particularly in I/O, is missing Python support. This begs the question of whether one should start analysis in Python and make the switch as and when required, or start in Scala or Java, to which you may well be forced anyway. Perhaps the intended usage is Python for prototyping and Java/Scala for production.

The book, like Spark itself, is pitched at two audiences: data scientists and software engineers. This would explain the support for Python and (more recently) R to keep the data scientists happy, and Java/Scala for the software engineers. I must admit, looking at examples in Python and Java together, I remember why I love Python! Java requires quite a lot of class declaration boilerplate to get it into the air, and brackets.

Spark will run on a standalone machine; I got it running on Windows 8.1 in short order. Analysis programs appear to be deployable to a cluster unaltered, with the changes handled in configuration files and command-line options. The feeling I get from Spark is that it would be entirely appropriate to undertake analysis with Spark which you might otherwise do locally using pandas or scikit-learn, and if necessary you could scale up onto a cluster with relatively little additional effort, rather than having to learn some fraction of the Hadoop ecosystem.

The book suffers a little from covering a subject area which is rapidly developing: Spark is at version 1.4 as of early June 2015, the book covers version 1.1, and things are happening fast. For example, GraphX and SparkR, more recent additions to Spark, are not covered. That said, this is a great little introduction to Spark, and I’m now minded to go off and apply my new-found knowledge to the Kaggle – Avito Context Ad Clicks challenge!

Which plane had the most accidents? Thu, 02 Jul 2015 15:53:18 +0000

Searching by facets

Last year, ScraperWiki helped migrate lots of specialist datasets to GOV.UK.

This afternoon, we happened to notice that the Air Accidents Investigation Branch reports, which we scraped from their old site, are live.


The user interface is called Finder Frontend, and is used by GOV.UK wherever the user needs to search for items by varying criteria. In the jargon, it’s called “faceted search”.

We enabled this type of searching by scraping the “Aircraft category”, “Report type” and “Date” fields. Users can then filter the accident reports by one or more of those criteria at once.

Most accident prone

Since we scraped it, we also happen to have the data in an SQL database in our Data Science Platform. A quick query reveals which aircraft has the most accident reports about it.

AAIB query

The answer is G-AWNB, a Boeing 747-136. It was made in 1970, and has 10 accident reports (some of those are errata, so it doesn’t mean ten accidents).
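The query behind this is a straightforward GROUP BY. As a sketch of the idea (the table name, columns and sample rows here are hypothetical; the real schema on the Data Science Platform differs), using SQLite via Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (registration TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?)",
    [("G-AWNB", "flap detached"), ("G-AWNB", "skin panel ruptured"),
     ("G-AWNB", "jetty damage"), ("G-XXXX", "heavy landing")],
)

# Which registration has the most accident reports?
row = conn.execute(
    "SELECT registration, COUNT(*) AS n FROM report "
    "GROUP BY registration ORDER BY n DESC LIMIT 1"
).fetchone()
print(row)  # ('G-AWNB', 3)
```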

Here are three of its accidents, chosen to span time:

  1. In 1975 in Scotland, part of a flap detached during a training flight and struck the cabin door.

  2. In 1987, shortly after takeoff, a steward noticed a skin panel had ruptured on the left wing, and the hapless plane had to jettison its fuel and return to Heathrow.

  3. Lest you think it was just a badly made or maintained plane, in 1995, also at Heathrow, it suffered bad luck. A faulty passenger jetty rose up damaging the cabin door – repairs took several days.


ScraperWiki often helps with migration projects like the AAIB data. As another example, we’re working on migrating insurance data between two ERP systems at the moment.

The skillset of understanding a (poorly) documented dataset, and producing the best quality output for re-use, is an important part of data science. We use the same skill as part of lots of other projects.

Understanding data fully is the first stage of doing useful analysis with data.

Scientists and Engineers… of What? Fri, 26 Jun 2015 11:25:00 +0000 “All scientists are the same, no matter their field.” OK that sounds like a good ‘quotable’ quote, and since I didn’t see it said by anyone else, I can claim it as my own saying. The closest quote to this I saw was “No matter what engineering field you’re in, you learn the same basic science and mathematics. And then maybe you learn a little bit about how to apply it.” by Noam Chomsky. These statements are similar but not quite the same.


While the former focuses on what scientists actually DO, the latter has more to do with what people LEARN in the process of becoming engineers. The aim of this post is to argue that scientists and engineers are essentially the same in terms of the methods, processes and procedures they use to get their job done, no matter their field of endeavour.

To sharpen the argument I’m trying to make, I am narrowing down my comparison to two ‘types’ of scientists. The first I’d call ‘mainstream scientists’ and the second are data scientists. Who are mainstream scientists? Think of them as the sort of physicists, mathematicians, scientists and engineers that worked on the Orion project. (If you haven’t heard of this project, please watch this video of what the Orion project was about.)

So Project Orion was an attempt to send people to Mars in a spacecraft propelled by nuclear explosions. Just watching the video and thinking about the scientific process followed, the ‘trial-and-error’ methodology and the overall project got me thinking that data scientists are just like that! So let’s get down to the actual similarities.

To start with, every introduction to science usually begins with a description of the scientific ‘method’ which (with a little variation here and there) includes: formulation of a question, hypothesis, prediction, testing, analysis, replication, external review, data recording and sharing (this version of the scientific process was borrowed from here). Compare this with the software development life cycle that a data scientist would normally follow: requirement gathering and analysis, design, implementation or coding, testing, deployment, maintenance (source). It’s not difficult now to see that one process was derived from the other, is it?

Moving on, the short name I’d give to much of the scientific and software development process is ‘trial-and-error’ methodology (name has actually been upgraded to ‘Agile’ methodology). Project Orion’s ‘mainstream’ scientists tried (and failed at) several options for getting the rocket to escape Earth’s gravity. Data scientists try several ways to get their analytics done. In both scenarios, sometimes, an incremental step damages the entire progress made so far, and the question of ‘how do we get back to the last good configuration?’ arises. Data scientists have been having good success in recent times in this regard by using some form of version control system (like Git). How do the mainstream scientists manage theirs? I don’t know about now, but Project Orion didn’t have a provision for that.

So are mainstream scientists and data scientists the same? I’ll say a definite yes since they follow similar methods to get their work or research done. If you’re a data scientist, feel free now to identify with every other scientist in the world. Don’t feel any less a scientist because your work does not overtly affect people’s lives (like displacing people for fear of nuclear contamination, or damaging earth’s landscape as an unexpected by-product of your experiments) as mainstream scientists do. In reality, with the tools you have at your disposal as a data scientist, you have the potential to do more damage than that!

And one other quote of Noam Chomsky’s would be a good way to end this post: “If you’re teaching today what you were teaching five years ago, either the field is dead or you are.” So scientists are forward-thinking people, ever innovative, no matter their field, and that’s what makes them scientists.

Technology Radar Report Fri, 26 Jun 2015 07:43:33 +0000 Creating a sustainable technology company involves keeping up with technology. The thing about technology is that it changes, and we have to look to the future, and invest our time now in things that will be valuable in the future. Or, we could switch to doing SharePoint consultancy for the rest of our lives, but I think most of us here would regard that as “checking out”.

This is a partly personal perspective of the future as I see it from our little hill in the Northwest of England (Brownlow Hill). I’m really just sketching out a few things that I see as being important for ScraperWiki. And since they’re important for ScraperWiki, they’re important for you! Or at least, you might be interested too.


The future is already here – it’s just not evenly distributed. — William Gibson

Gibson’s quote certainly applies to the software industry. All of the things I highlight already exist and are in use (some for quite a long time now), they just haven’t reached saturation yet. So looking to the near future is a matter of looking to the now, and making an educated guess as to what technologies will become increasingly abundant.

Python 3

(I have been saying this for 6 years now, but) Python 3 is a real thing, and in five years’ time we will have stopped using Python 2 and all switched to Python 3. If you think this all seems obvious, I don’t think we can say the same about the transition from Perl 5 to Perl 6 (which lives in a perpetual state of being “out by Christmas”) or from LaTeX2e to LaTeX3.

Encouragingly, in 2015 real people are using it for real projects (including ScraperWiki!). I would now consider it foolish to start a greenfield Python project in Python 2. If you maintain a Python library, it is starting to look negligent if it doesn’t work with Python 3.

Your Python 2 programming skills will mostly transfer to Python 3. There will be some teething trouble: print() and urllib still fox me sometimes, and I find myself using list() a lot more when debugging (because more things are generators). Niggly details aside, basically everything works and most things are a bit better.
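Two of those niggles in one short example: print is now a function, and map (like many other things) returns a lazy object rather than a list, so list() is needed to see the values:

```python
# Python 3: map() is lazy; wrap it in list() to materialise the values.
squares = map(lambda n: n * n, range(5))
print(squares)        # a map object, not a list
print(list(squares))  # [0, 1, 4, 9, 16]
```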

The Go Programming Language

Globally I think the success of Go (the programming language) remains uncertain, but its ecosystem is now large enough to sustain it in its own right. The risks here are not particularly technical but in the community. I think we would have difficulty hiring a Go programmer (we would have to find a programmer and train them).

The challenge for the next year or so is to work out what existing skills people have that transfer to Go, and, related to that, what a good framework of pre-cursor skills for learning Go looks like. Personally speaking, when learning Go my C skills help me a lot, as does the fact that I already know what a coroutine is. I would say that knowledge of Java interfaces will help.

I don’t think there’s a good path to learning Go yet, it will be interesting to see what develops. For the “Go curious” the Tour of the Go Programming Language is worth a look.

Docker / containers

Docker is healthy, and while it might not win the “container wars”, containers are clearly going to be technically useful for the next few years (flashback to OS VM). Effort in learning Docker is likely to also be useful in other “API over container” solutions.


Software as a Service (SaaS)

Increasingly software is accessed not via a library but via a service available on the web (Software as a Service, SaaS). For example, ScraperWiki has a service to convert PDFs to tables.

ScraperWiki already use a few of these (for email delivery, database storage, accounting, payments, uptime alerts, notifications), and we’ll almost certainly be using more in the future. The obvious difference compared to using a library or building it yourself is that Software as a Service has a direct monetary cost. But that doesn’t necessarily make it more expensive. Consider e-mail delivery. ScraperWiki definitely has the technical expertise to manage our own mail delivery. But as a startup, we don’t have the time to maintain mail servers or the desire to keep our mail server skills up to date. We’d rather buy that expertise in the form of the service that Sendgrid offers.

The future is much like the present. We will continue to make buy/build decisions, and increasingly the “buy” side will be a SaaS. The challenges will be in evaluating the offerings. Do they have a nice icon?

Amazon Web Services (AWS)

The mother of all SaaS.

It’s not going away and it’s getting increasingly complex. Amazon release new products every few weeks or so, and the web console becomes increasingly bewildering. I think @frabcus’s observation that “operating the AWS console” is a skill is spot on. I think there is an analogy (suggested by @IanHopkinson_) with the typing pool to desktop word processor transition: a low-paid workforce skilled in typing got replaced by giving PCs with word processors to high-paid executives with no typing skills. We no longer need IT technicians to build racks and wire them together, but instead relatively well paid devops staff do it virtually.

Cloud Formation

Cloud Formation is probably the thing to look at, though. It’s a giant “JSON language” that describes how to create and wire together any piece of AWS infrastructure. Even if we don’t use it directly (for example, we might use some replacement for Elastic Beanstalk or generate Cloud Formation files with scripts), knowing how to read it will be useful.
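A minimal sketch of what a Cloud Formation template looks like, to give a flavour of that “JSON language” (the resource name and the AMI id are placeholders, not values we actually use):

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal sketch: a single EC2 instance",
  "Resources": {
    "WebServer": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "InstanceType": "t2.micro",
        "ImageId": "ami-12345678"
      }
    }
  }
}
```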

Big instances versus MapReduce

Whilst I think MapReduce will remain an important technology for the sector as a whole, this will be in opposition to the “single big instance”. Don’t get too hung up on terminology: I’m really using MapReduce as a placeholder for all MapReduce- and Hadoop-like “big data query” technologies.

Amazon Web Services makes it possible to rent “High Performance Computing” class nodes, for reasonable amounts of money. In 2015, you can get a 16 core (32 hyperthreads) instance with 60 or 244 Gigabytes of RAM for a couple of bucks per hour. I think the gap between laptops and big instances is widening, meaning that more ad hoc analysis will be done on a transient instance. You can process some pretty big datasets with 244 GB of RAM without needing to go all Hadoopy.

That is not to say that we should ignore MapReduce, but the challenge may be to find datasets of interest that actually require it.


Security

Snowden’s revelations tell us that the NSA, and other state-level actors, are basically everywhere. In particular, there are hostile actors in the data centre. We should consider node to node communications as going across the public internet, even if they are in the same data centre. Practically speaking, this means HTTPS / TLS everywhere.

If we provide a data service to our clients using AWS then ideally only the client, us, and AWS should have access to that data. It is unfortunate that AWS have to have access to the data, but it is a practical necessity. Having trusted AWS, we can’t stop them (or even know about it if they were) shipping all of our data to the NSA, so it is a matter of their reputation that they not do that. At least if we encrypt our network traffic, AWS have to take fairly aggressive steps to send our data to anyone else (they have to fish our session keys out of their RAM, or mass transfer the contents of their RAM somewhere).

There is lots more to do and discuss here. Fortunately ScraperWiki is pretty healthy in this regard, we are sensitive to it and we’re always discussing security.

Browser IDE

Here I’m talking about the “behind the scenes” world that is accessed from the Developer Tools. There is an awesome box of tools there. Programmers are probably all aware of the JavaScript Console and the Web Inspector, but these are the tip of a very large and featureful iceberg. Almost everything is dynamic: adding and disabling CSS rules updates the page live, as does editing the HTML. There is a fully featured single-step debugger that includes a code editor. Only the other day I learnt of the “emulate mobile device” mode for screen size and network.

Spend time poking about with the Developer Tools.

Machine Learning

Although it’s not an area that I know much about, I suspect that it’s not just a buzzword and it may turn out to be useful.

git / Version Control

git is great and there is a lot to learn, but don’t forget its broader historical context. Believe it or not, git is not the first version control tool to come along, and GitHub is not the first Software Configuration Management company. Just because git does it one particular way doesn’t mean that that way is best. It means that it is merely good enough for one person to manage the flow of patches that go to make up the Linux kernel. I would also remind everyone that git != github. Practically, be aware of which bits of your workflow are git, and which are github.

(I’m bound to say something like that, Software Configuration Management used to be part of my consultancy expertise)

Google have declared this race won. They’ve shut down their own online code management product (Google Code) and have started hosting projects on github.

A plausible future is where everyone uses git and most people are blind to there being anything better and most people think that git == github. Whingeing aside, that future is a much better place to work in than if sourceforge had won.

The Future Technology Radar Report

Who knows what will be on the radar in the future.

Elasticsearch and elasticity: building a search for government documents Mon, 22 Jun 2015 08:39:44 +0000 A photograph of clouds under a magnifying glass.

Examining Clouds” by Kate Ter Harr, licensed under CC BY 2.0.

Based in Paris, the OECD is the Organisation for Economic Co-operation and Development. As the name suggests, the OECD’s job is to develop and promote new social and economic policies.

One part of their work is researching how open countries trade. Their view is that fewer trade barriers benefit consumers, through lower prices, and companies, through cost-cutting. By tracking how countries vary, they hope to give legislators the means to see how they can develop policies or negotiate with other countries to open trade further.

This is a huge undertaking.

Trade policies operate not only across countries, but also by industry. Tracking them requires a team of experts to carry out the painstaking research and detective work of investigating current legislation.

Recently, they asked us for advice on how to make better use of the information available on government websites. A major problem they have is searching through large collections of documents to find relevant legislation. Even very short sections may be crucial in establishing a country’s policy on a particular aspect of trade.

Searching for documents

One question we considered is: what options do they have to search within documents?

  1. Use a web search engine. If you want to find documents available on the web, search engines are the first tool of choice. Unfortunately, search engines are black boxes: you input a term and get results back without any knowledge of how those results were produced. For instance, there’s no way of knowing what documents might have been considered in any particular search. Personalised search also governs the results you actually see. One normal-looking search of a government site gave us a suspiciously low number of results on both Google and Bing. Though later searches found far more documents, this is illustrative of the problems of search engines for exhaustive searching.
  2. Use a site’s own search feature. This is more likely to give us access to all the documents available. But, every site has a different layout and there’s a lack of a unified user interface for searching across multiple sites at once. For a one-off search of documents, having to manually visit and search across several sites isn’t onerous. Repeating this for a large number of searches soon becomes very tedious.
  3. Build our own custom search tool. To do this, we need to collect all the documents from sites and store those in a database that we run. This way we know what we’ve collected, and we can design and implement searches according to what the OECD need.


Enter Elasticsearch: a database designed for full text search and one which seemed to fit our requirements.

Getting the data

To see how Elasticsearch might help the OECD, we collected several thousand government documents from one website.

We needed to do very little in the way of processing. First, we extracted text from each web page using Python’s lxml. Along with the URL and the page title, we then created structured documents (JSON) suitable for storing in Elasticsearch.
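That processing step can be sketched roughly like this; the field names are our own choice rather than a required schema, and the example page is invented:

```python
import json

import lxml.html


def page_to_document(url, html):
    """Turn a scraped page into a JSON document for Elasticsearch.

    The field names (url, title, text) are illustrative, not a fixed schema.
    """
    tree = lxml.html.fromstring(html)
    title = tree.findtext(".//title") or ""
    # text_content() strips all markup, leaving only the visible text.
    text = tree.text_content()
    return json.dumps({"url": url, "title": title.strip(), "text": text})


doc = page_to_document(
    "http://example.gov/doc1",
    "<html><head><title>Trade Act</title></head>"
    "<body><p>Section 1: quotas...</p></body></html>",
)
```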

Running Elasticsearch and uploading documents

Running Elasticsearch is simple. Visit the release page, download the latest release and just start it running. One sensible thing to do out of the box is change the default cluster name — the default is just elasticsearch. Making sure Elasticsearch is firewalled off from the internet is another sensible precaution.

When you have it running, you can simply send documents to it for storage using an HTTP client like curl (Elasticsearch listens on port 9200 by default):

curl "http://localhost:9200/documents/document" -X POST -d @my_document.json

For the few thousand documents we had, this wasn’t sluggish at all, though it’s also possible to upload documents in bulk should this prove too slow.
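Bulk upload uses Elasticsearch’s _bulk endpoint, which takes a newline-delimited body of alternating action and document lines. A sketch of building such a body, reusing the index and type names from the curl example above:

```python
import json


def bulk_body(documents):
    """Build the newline-delimited body for Elasticsearch's _bulk endpoint.

    Each document is preceded by an action line naming the index and type.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": "documents",
                                           "_type": "document"}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([{"title": "Doc 1"}, {"title": "Doc 2"}])
# POST this body to http://localhost:9200/_bulk with any HTTP client.
```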


Querying the documents

Once we have documents stored, the next thing to do is query them!

Other than very basic queries, Elasticsearch queries are written in JSON, like the documents it stores, and there’s a wide variety of query types bundled into Elasticsearch.

Query JSON is not difficult to understand, but it can become tricky to read and write due to the Russian doll-like structure it quickly adopts. In Python, the addict library is a useful one for making it easier to more directly write queries out without getting lost inside an avalanche of {curly brackets}.

As a demo, we implemented a simple phrase matching search using the should keyword.

This allows combination of multiple phrases, favouring documents containing more matches. If we use this to search for, e.g. "immigration quota"+"work permit", the results will contain one or both of these phrases. However, results with both phrases are deemed more relevant.
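Built as a Python dict, such a query looks something like the following. The `text` field name is whatever we chose when indexing, so treat it as an assumption, and the query shape reflects the Elasticsearch 1.x DSL current at the time:

```python
def phrase_query(*phrases):
    """Build a bool/should query: documents matching more of the
    given phrases are scored as more relevant."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"text": phrase}} for phrase in phrases
                ]
            }
        }
    }


query = phrase_query("immigration quota", "work permit")
# POST this JSON to the index's _search endpoint.
```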

The Elasticsearch Tool


With our tool, researchers can enter a search, and very quickly get back a list of URLs, document titles and a snippet of a matching part of the text.


What we haven’t implemented is the possibility of automating queries, which could also save the OECD a lot of time. Just as document upload is automated, we could run periodic keyword searches on our data. This way, Elasticsearch could be scheduled to look out for phrases that we wish to track. From these results, we could generate a summary or report of the top matches which may prompt an interested researcher to investigate.

Future directions

For (admittedly small scale) searching, we had no problems with a single instance of Elasticsearch. To improve performance on bigger data sets, Elasticsearch also has built-in support for clustering, which looks straightforward to get running.

Clustering also ensures there is no single point of failure. However, there are known issues in that current versions of Elasticsearch can suffer document loss if nodes fail.

Provided Elasticsearch isn’t used as the only data store for documents, this is a less serious problem. It is possible to keep checking that all documents that should be in Elasticsearch are indeed there, and re-add them if not.
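That checking step reduces to a set difference: keep the canonical list of document URLs in the primary store, gather the URLs currently indexed (say, by paging through a match_all query), and re-add whatever is missing. A minimal sketch, with invented example URLs:

```python
def missing_documents(canonical_urls, indexed_urls):
    """Return the documents that should be re-added to Elasticsearch.

    canonical_urls: every URL we have scraped (from the primary data store).
    indexed_urls: URLs currently present in the Elasticsearch index.
    """
    return sorted(set(canonical_urls) - set(indexed_urls))


to_readd = missing_documents(
    ["http://example.gov/a", "http://example.gov/b", "http://example.gov/c"],
    ["http://example.gov/a", "http://example.gov/c"],
)
# Each URL in to_readd gets its document POSTed again, as before.
```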

Elasticsearch is powerful, yet easy to get started with. For instance, its text analysis features support a large number of languages out of the box. This is important for the OECD who are looking at documents of international origin.

It’s definitely worth investigating if you’re working on a project that requires search. You may find that, having found Elasticsearch, you’re no longer searching for a solution.

Book review: Mastering Gephi Network Visualisation by Ken Cherven Mon, 15 Jun 2015 07:49:27 +0000 A little while ago I reviewed Ken Cherven’s book Network Graph Analysis and Visualisation with Gephi; it’s fair to say I was not very complimentary about it. It was rather short, and had quite a lot of screenshots. Its strength was in introducing every single element of the Gephi interface. This book, Mastering Gephi Network Visualisation by Ken Cherven, is a different, and better, book.

Networks in this context are collections of nodes connected by edges, and networks are ubiquitous. The nodes may be people in a social network, and the edges their friendships. Or the nodes might be proteins and metabolic products, and the edges the reaction pathways between them. Or any other of a multitude of systems. I’ve reviewed a couple of other books in this area, including Barabási’s popular account of the pervasiveness of networks, Linked, and van Steen’s undergraduate textbook, Graph Theory and Complex Networks, which cover the maths of network (or graph) theory in some detail.

Mastering Gephi is a practical guide to using the Gephi network visualisation software; it covers the more theoretical material regarding networks in a peripheral fashion. Gephi is the most popular open source network visualisation system of which I’m aware; it is well-featured and under active development. Many of the network visualisations you see of, for example, Twitter social networks will have been generated using Gephi. It is a pretty complex piece of software, and if you don’t want to rely on information on the web, or taught courses, then Cherven’s books are pretty much your only alternative.

The core chapters are on layouts, filters, statistics, segmenting and partitioning, and dynamic networks. Outside this there are some more general chapters, including one on exporting visualisations and an odd one on “network patterns” which introduced diffusion and contagion in networks but then didn’t go much further.

I found the layouts chapter particularly useful, it’s a review of the various layout algorithms available. In most cases there is no “correct” way of drawing a network on a 2D canvas, layout algorithms are designed to distribute nodes and edges on a canvas to enable the viewer to gain understanding of the network they represent.  From this chapter I discovered the directed acyclic graph (DAG) layout which can be downloaded as a Gephi plugin. Tip: I had to go search this plugin out manually in the Gephi Marketplace, it didn’t get installed when I indiscriminately tried to install all plugins. The DAG layout is good for showing tree structures such as organisational diagrams.

I learnt of the “Chinese Whispers” and “Markov clustering” algorithms for identifying clusters within a network in the chapter on segmenting and partitioning. These algorithms are not covered in detail but sufficient information is provided that you can try them out on a network of your choice, and go look up more information on their implementation if desired. The filtering chapter is very much about the mechanics of how to do a thing in Gephi (filter a network to show a subset of nodes), whilst the statistics chapter is more about the range of network statistical measures known in the literature.

I was aware of the ability of Gephi to show dynamic networks, ones that evolved over time, but had never experimented with this functionality. Cherven’s book provides an overview of this functionality using data from baseball as an example. The example datasets are quite appealing, they include social networks in schools, baseball, and jazz musicians. I suspect they are standard examples in the network literature, but this is no bad thing.

The book follows the advice that my old PhD supervisor gave me on giving presentations: tell the audience what you are going to tell them, tell them, and then tell them what you told them. This works well for the limited time available in a spoken presentation, where repetition helps the audience remember, but it feels a bit like overkill in a book. In a book we can flick back to remind ourselves what was written earlier.

It’s a bit frustrating that the book is printed in black and white, particularly at the point where we are asked to admire the blue and yellow parts of a network visualisation! The referencing is a little erratic with a list of books appearing in the bibliography but references to some of the detail of algorithms only found in the text.

I’m happy to recommend this book as a solid overview of Gephi for those that prefer to learn from dead tree, such as myself. It has good coverage of Gephi features, and some interesting examples. In places it is a little shallow and repetitive.

The publisher sent me this book, free of charge, for review.