MOT – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website to enable users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich. This means that it is designed to give you facts and figures and to provide an interface through which you can interactively select and view the data that interests you, in this case by querying the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to either view the top faults (failure modes) or the pass rate for the selected make and model. There is the option of “filter by year” which selects vehicles by the first year on the road in order to narrow down your search to particular model years (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults. If you want to break each one down into more detailed descriptions, click the ‘detail’ button:

What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as a data source. A dictionary is a data structure in Python similar to a HashMap in Java. It consists of key-value pairs and supports fast lookups by key. But where do we get a dictionary from and what should its structure be? Let’s back up a little bit, maybe to the very beginning. The following general procedure was followed to get us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database where data integration and analysis were carried out. This took the form of joining tables, then grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.
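The grouping and export steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the PostgreSQL queries actually used: the sample rows, the result codes "P" and "F", the tab-separated layout and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical raw test records: (make, model, first-use year, test result).
# The real MOT data has many more columns; these rows are made up.
raw_tests = [
    ("FORD", "FIESTA", 2008, "P"),
    ("FORD", "FIESTA", 2008, "F"),
    ("FORD", "FIESTA", 2008, "P"),
    ("FORD", "FOCUS", 2009, "P"),
]


def aggregate(records):
    """Count tests per (make, model, year, testresult), like GROUP BY with COUNT(*)."""
    return Counter(records)


def to_text(counts):
    """Serialise the counts as tab-separated lines, sorted so the output is stable."""
    lines = []
    for (make, model, year, result), count in sorted(counts.items()):
        lines.append(f"{make}\t{model}\t{year}\t{result}\t{count}")
    return "\n".join(lines)


counts = aggregate(raw_tests)
print(to_text(counts))
```

Each output line carries one combination of make, model, year and test result together with its count, which is the shape of file the rest of this post relies on.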

The dictionary is built using this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website is initialised, this text file is converted to a nested dictionary. That is to say, it is a dictionary of dictionaries: the value associated with a key is another dictionary, which is in turn accessed using a further key.
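As a rough sketch of that conversion, the snippet below reads a tab-separated file with the five columns named above and nests the counts by make, model, year and test result. The file layout, the nesting order, the sample rows and the name `build_nested_dict` are assumptions made for illustration; the website's actual code may differ.

```python
import csv
import io

# Made-up sample in an assumed tab-separated layout:
# make, model, year, testresult, count
SAMPLE = (
    "FORD\tFIESTA\t2008\tP\t120\n"
    "FORD\tFIESTA\t2008\tF\t80\n"
    "FORD\tFIESTA\t2009\tP\t150\n"
    "FORD\tFOCUS\t2008\tP\t90\n"
)


def build_nested_dict(text_file):
    """Return data[make][model][year][testresult] = count."""
    data = {}
    for make, model, year, result, count in csv.reader(text_file, delimiter="\t"):
        # setdefault creates each level of nesting the first time it is seen.
        by_result = (
            data.setdefault(make, {})
            .setdefault(model, {})
            .setdefault(int(year), {})
        )
        by_result[result] = by_result.get(result, 0) + int(count)
    return data


# io.StringIO stands in for the file that would be fetched at start-up.
data = build_nested_dict(io.StringIO(SAMPLE))
print(data["FORD"]["FIESTA"][2008])
```

Every lookup is then a chain of key accesses on in-memory dictionaries, with no database round trip.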

When you select a car make, this dictionary is queried to retrieve the models for you, and in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the computations of the top faults or pass rate are made on the dictionary. When you don’t select a specific year the data in the dictionary is aggregated across all years.
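The lookups described here might look something like the following. The nested layout, the result codes "P" and "F", the definition of pass rate as passes divided by all tests, and the function names are assumptions for the sake of the example.

```python
# Hypothetical nested dictionary: data[make][model][year][testresult] = count
data = {
    "FORD": {
        "FIESTA": {
            2008: {"P": 120, "F": 80},
            2009: {"P": 150, "F": 50},
        },
        "FOCUS": {2008: {"P": 90, "F": 10}},
    },
}


def models_for(data, make):
    """Models to offer once a make has been selected."""
    return sorted(data.get(make, {}))


def years_for(data, make, model):
    """Years to offer once a model has been selected."""
    return sorted(data.get(make, {}).get(model, {}))


def pass_rate(data, make, model, year=None):
    """Fraction of tests passed; aggregates across all years when year is None."""
    by_year = data[make][model]
    years = list(by_year) if year is None else [year]
    passed = sum(by_year[y].get("P", 0) for y in years)
    total = sum(sum(by_year[y].values()) for y in years)
    return passed / total if total else None


print(models_for(data, "FORD"))
print(years_for(data, "FORD", "FIESTA"))
print(pass_rate(data, "FORD", "FIESTA", 2008))
print(pass_rate(data, "FORD", "FIESTA"))
```

Leaving `year` unset sums the counts over every year before dividing, which mirrors the aggregation across all years described above.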

So this is how we end up not needing a database connection to run a data-rich website! The flip-side to this, of course, is that we must ensure that the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me, or ScraperWiki about this work and …enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!

MOT Data Analysis: Progress Along the Fault-Pattern Finding Path
https://blog.scraperwiki.com/2015/06/mot-data-analysis-progress-along-the-fault-pattern-finding-path/
Fri, 12 Jun 2015 15:10:50 +0000

How do data science and data engineering differ? And where do they overlap? I agree to a large extent with the answer given here.

A data scientist must be able to ask the right questions – ‘right’ in this context meaning interesting, providing intelligence that can lead to process improvement or greater profitability (you don’t want to invest time and skills finding out why the sky is blue because it wouldn’t make a difference if it were grey!). Having asked the right questions, a data scientist must know how to answer them, which takes technical expertise: skills in suitable programming and database technology that support the statistics, machine learning, and data mining needed to uncover the answers. The last stage is data presentation – using charts, graphs or other presentation tools.

A data engineer is more interested in the infrastructure and architecture that aid fast and efficient processing of big data. The data engineer is closer to a software and hardware engineer than to a data scientist, but both the data scientist and the data engineer are good programmers, database enthusiasts and fast learners of new skills. So it stands to reason that, in keeping with the current shift from the computer age to the information age (as I like to think of it), the path for a typical professional would be Software Engineer –> Data Engineer –> Data Scientist (if desired). In the ideal scenario, following strict (theoretical) definitions, in time we should not need software engineers anymore, but rather data engineers and scientists, as they’ll have the software engineering skills and more.

But back to the real world, where the line is blurred, or sometimes doesn’t exist, and job titles do not matter but job descriptions and requirements do. You are given a problem, and you need to figure out how to solve it, mastering new skills if you have to.

Now, in my first blog post I introduced the MOT data set as the data set on which my MSc research project is based. The MOT data set is part of the UK’s open data and is available in the public domain. The aim of the project is to explore the data as fully as possible to find out the top faults for vehicles that failed MOT testing and to be able to display these faults interactively. So let me talk about the progress of work on the MOT data set (that is the reason for this post!). One interesting question was: what faults are typically detected for different makes of car at MOT testing? The MOT data set contains information that could help to create hierarchies or levels of faults for vehicles that failed their tests. To illustrate the meaning of the hierarchy, if a Ford failed its test, for example, one may be interested in finding out:

  • Was it the brakes, lighting, tyres, parts of the engine, etc. that were faulty?
  • If it was the engine, what part of it?
  • In turn, what was the exact problem with that part?
  • Finally what is the simple statement of the problem discovered?

Peter at ScraperWiki (closest to a Data Engineer, in my opinion, by the above definition – he might disagree) explored this using the test item detail and item group data sets and produced, using Python, the following tree showing hierarchies of MOT test failures.

What other progress has there been? Well, the test results data has also been factored into the equation. So we now have a reasonable description of the fault discovered for each vehicle that was tested and failed at first MOT testing, and we have these presented in hierarchical format, e.g. for a VOLVO XC70 D OCEAN RACE 7827 that failed during testing, we can see the levels or hierarchies of the problem detected as presented in the MOT data set:

Level 3: Suspension
Level 2: Anti-roll bars
Level 1: Linkage
Description of fault given in MOT data set: has excessive play in a ball joint

Hopefully this will make sense to a mechanical engineer or car designer working for Volvo!
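One way to picture how such levels can be recovered is to store each item group with a pointer to its parent and walk upwards from the most specific group. The sketch below does that for the example above; the group ids, the table layout and the name `fault_path` are hypothetical and are not taken from the MOT data set or from Peter's code.

```python
# Hypothetical item-group table: group id -> (name, parent id or None).
item_groups = {
    1: ("Suspension", None),
    2: ("Anti-roll bars", 1),
    3: ("Linkage", 2),
}


def fault_path(groups, group_id):
    """Return group names from the broadest category down to the given group."""
    path = []
    seen = set()
    while group_id is not None:
        # Guard against malformed data in which a group is its own ancestor.
        if group_id in seen:
            raise ValueError("cycle in item-group hierarchy")
        seen.add(group_id)
        name, parent_id = groups[group_id]
        path.append(name)
        group_id = parent_id
    return list(reversed(path))


print(" > ".join(fault_path(item_groups, 3)))
```

Starting from the "Linkage" group, the walk yields Suspension, then Anti-roll bars, then Linkage, matching the three levels listed above.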

The final aim of the project is a website that lets users find the information that interests them. This dictates the next steps, so watch this space for updates.
