Which car should I (not) buy? Find out, with the ScraperWiki MOT website…
https://blog.scraperwiki.com/2015/09/which-car-should-i-not-buy-find-out-with-the-scraperwiki-mot-website/
Wed, 23 Sep 2015 15:14:58 +0000

I am finishing up my MSc Data Science placement at ScraperWiki and, by extension, my MSc Data Science (Computing Specialism) programme at Lancaster University. My project was to build a website that enables users to investigate the MOT data. This week the result of that work, the ScraperWiki MOT website, went live. The aim of this post is to help you understand what goes on ‘behind the scenes’ as you use the website. The website, like most other web applications from ScraperWiki, is data-rich: it is designed to give you facts and figures, and to provide an interface for you to interactively select and view the data you are interested in – in this case, the UK MOT vehicle testing data.

The homepage provides the main query interface that allows you to select the car make (e.g. Ford) and model (e.g. Fiesta) you want to know about.

You have the option to view either the top faults (failure modes) or the pass rate for the selected make and model. There is also a “filter by year” option, which selects vehicles by their first year on the road, letting you narrow the search down to a particular model year (e.g. FORD FIESTA 2008 MODEL).

When you opt to view the pass rate, you get information about the pass rate of your selected make and model as shown:

When you opt to view top faults you see the view below, which tells you the top 10 faults discovered for the selected car make and model with a visual representation.

These are broad categorisations of the faults; if you want to break each one down into more detailed descriptions, you click the ‘detail’ button:

What is different about the ScraperWiki MOT website?

Many traditional websites use a database as a data source. While this is generally an effective strategy, it has certain disadvantages. The most prominent is that a database connection effectively has to be maintained at all times. In addition, retrieving data from the database may take a prohibitive amount of time if it has not been optimised for the required queries. Furthermore, storing and indexing the data in a database may incur a significant storage overhead.

mot.scraperwiki.com, by contrast, uses a dictionary stored in memory as its data source. A dictionary is a Python data structure similar to a HashMap in Java: it consists of key-value pairs and supports very fast lookups. But where do we get a dictionary from, and what should its structure be? Let’s back up a little bit, maybe to the very beginning. The following general procedure got us to where we are with the data:

  • 9 years of MOT data was downloaded and concatenated.
  • Unix data manipulation functions (mostly command-line) were used to extract the columns of interest.
  • Data was then loaded into a PostgreSQL database, where data integration and analysis were carried out. This took the form of joining tables, then grouping and aggregating the resulting data.
  • The resulting aggregated data was exported to a text file.

The dictionary is built from this text file, which is permanently stored in an Amazon S3 bucket. The file contains columns including make, model, year, testresult and count. When the server running the website starts up, the text file is converted into a nested dictionary – that is to say, a dictionary of dictionaries, where the value associated with a key is itself another dictionary that can be queried with a further key.
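As a rough illustration, here is a minimal Python sketch of how such a nested dictionary might be built from the exported text file. The CSV format, the column names and the `build_lookup` helper are assumptions for illustration, not the site’s actual code:

```python
import csv
from collections import defaultdict

def build_lookup(path):
    """Load the aggregated MOT text file into a nested dictionary:
    make -> model -> year -> test result -> count."""
    lookup = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Each aggregated row becomes one leaf entry in the nest.
            make, model = row["make"], row["model"]
            year, result = row["year"], row["testresult"]
            lookup[make][model][year][result] = int(row["count"])
    return lookup
```

Because each level is a dictionary, drilling down from make to model to year is just a chain of key lookups, each one fast regardless of how big the overall structure is.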

When you select a car make, this dictionary is queried to retrieve the models for you; in turn, when you select the model, the dictionary gives you the available years. When you submit your selection, the top faults or the pass rate are computed from the dictionary. When you don’t select a specific year, the data in the dictionary is aggregated across all years.
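A hedged sketch of how the pass-rate computation over such a nested dictionary could look. The structure (make → model → year → result → count), the "P"/"F" result codes and the `pass_rate` helper are illustrative assumptions, not the site’s actual code:

```python
def pass_rate(lookup, make, model, year=None):
    """Pass rate for a make/model, for one first-use year or across all years.
    Assumes a nested dict keyed make -> model -> year -> result -> count,
    with results coded "P" (pass) and "F" (fail)."""
    by_year = lookup[make][model]
    # No year given: aggregate across every year present for this model.
    years = [year] if year is not None else list(by_year)
    passed = sum(by_year[y].get("P", 0) for y in years)
    failed = sum(by_year[y].get("F", 0) for y in years)
    total = passed + failed
    return passed / total if total else None

# Toy data, purely illustrative:
lookup = {"FORD": {"FIESTA": {"2008": {"P": 120, "F": 30},
                              "2009": {"P": 80, "F": 20}}}}
print(pass_rate(lookup, "FORD", "FIESTA", "2008"))  # 0.8
print(pass_rate(lookup, "FORD", "FIESTA"))          # 0.8, aggregated over both years
```

Listing the models for a make is just `list(lookup["FORD"])`, and the years for a model `list(lookup["FORD"]["FIESTA"])` – the cascading menus fall straight out of the nesting.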

So this is how we end up not needing a database connection to run a data-rich website! The flip side, of course, is that we must ensure the machine hosting the website has enough memory to hold such a big data structure. Is it possible to fulfil this requirement at a sustainable, cost-effective rate? Yes, thanks to Amazon Web Services’ offerings.

So, as you enjoy using the website to become more informed about your current/future car, please keep in mind the description of what’s happening in the background.

Feel free to contact me or ScraperWiki about this work, and… enjoy!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
Scientists and Engineers… of What?
https://blog.scraperwiki.com/2015/06/scientists-and-engineers-of-what/
Fri, 26 Jun 2015 11:25:00 +0000

“All scientists are the same, no matter their field.” OK, that sounds like a good ‘quotable’ quote, and since I haven’t seen it said by anyone else, I can claim it as my own saying. The closest quote I have seen is Noam Chomsky’s: “No matter what engineering field you’re in, you learn the same basic science and mathematics. And then maybe you learn a little bit about how to apply it.” These statements are similar, but not quite the same.

While the former focuses on what scientists actually DO, the latter has more to do with what people LEARN in the process of becoming engineers. The aim of this post is to argue that scientists and engineers are essentially the same in terms of the methods, processes and procedures they use to get their jobs done, no matter their field of endeavour.

To make my argument clearer, I am narrowing the comparison down to two ‘types’ of scientists. The first I’d call ‘mainstream scientists’; the second are data scientists. Who are mainstream scientists? Think of the sort of physicists, mathematicians and engineers who worked on Project Orion. (If you haven’t heard of this project, please watch this video about what Project Orion was.)

So Project Orion was an attempt to send people to Mars in a craft propelled by nuclear explosions. Just watching the video and thinking about the scientific process followed – the ‘trial-and-error’ methodology – and the overall project got me thinking that data scientists are just like that! So let’s get down to the actual similarities.

To start with, every introduction to science usually begins with a description of the scientific ‘method’, which (with a little variation here and there) includes: formulation of a question, hypothesis, prediction, testing, analysis, replication, external review, and data recording and sharing (this version of the scientific process is borrowed from here). Compare this with the software development life cycle that a data scientist would normally follow: requirements gathering and analysis, design, implementation or coding, testing, deployment, maintenance (source). It’s not difficult now to see that one process was derived from the other, is it?

Moving on, the short name I’d give to much of the scientific and software development process is ‘trial-and-error’ methodology (the name has since been upgraded to ‘Agile’ methodology). Project Orion’s ‘mainstream’ scientists tried (and failed at) several options for getting the rocket to escape Earth’s gravity. Data scientists try several ways to get their analytics done. In both scenarios an incremental step sometimes damages the progress made so far, and the question of ‘how do we get back to the last good configuration?’ arises. Data scientists have had good success with this in recent times by using some form of version control system (like Git). How do mainstream scientists manage theirs? I don’t know about now, but Project Orion didn’t have a provision for that.

So are mainstream scientists and data scientists the same? I’ll say a definite yes, since they follow similar methods to get their work or research done. If you’re a data scientist, feel free now to identify with every other scientist in the world. Don’t feel any less a scientist because your work does not overtly affect people’s lives (like displacing people for fear of nuclear contamination, or damaging Earth’s landscape as an unexpected by-product of your experiments) as mainstream scientists’ work does. In reality, with the tools you have at your disposal as a data scientist, you have the potential to do more damage than that!

And one other quote of Noam Chomsky’s would be a good way to end this post: “If you’re teaching today what you were teaching five years ago, either the field is dead or you are.” So scientists are forward-thinking people, ever innovative, no matter their field, and that’s what makes them scientists.

MOT Data Analysis: Progress Along the Fault-Pattern Finding Path
https://blog.scraperwiki.com/2015/06/mot-data-analysis-progress-along-the-fault-pattern-finding-path/
Fri, 12 Jun 2015 15:10:50 +0000

How do data science and data engineering differ? And where do they overlap? I agree to a large extent with the answer given here.

A data scientist must be able to ask the right questions – ‘right’ in this context meaning interesting, yielding intelligence that can lead to process improvement or greater profitability (you don’t want to invest time and skill finding out why the sky is blue, because it wouldn’t make a difference if it were grey!). After settling on the right questions, a data scientist must know how to answer them, so he needs technical expertise – skills in suitable programming and database technologies for the statistics, machine learning and data mining that uncover the answers. The last stage is data presentation – using charts, graphs or other presentation tools.

A data engineer is more interested in the infrastructure and architecture that enable fast, efficient processing of big data. He is closer to a software or hardware engineer than a data scientist is, but both the data scientist and the data engineer are good programmers, database enthusiasts and fast learners of new skills. So it stands to reason that the career transition for a typical professional, in line with the current shift from the computer age to the information age (as I like to think of it), would be Software Engineer –> Data Engineer –> Data Scientist (if desired). In the ideal scenario, following strict (theoretical) definitions, in time we would not need software engineers any more, but rather data engineers and data scientists, as they would have the software engineering skills and more.

But back to the real world, where the line is blurred, or sometimes doesn’t exist, and job titles do not matter but job descriptions and requirements do. You are given a problem, and you need to figure out how to solve it, mastering new skills if you have to.

Now, in my first blog post I introduced the MOT data set as the basis of my MSc research project. The MOT data set is part of the UK’s open data and is available in the public domain. The aim of the project is to explore the data as fully as possible, find the top faults for vehicles that failed MOT testing, and display those faults interactively. So let me talk about progress on the MOT data set (that was the reason for this post!). One interesting question was: what faults are typically detected for different makes of car at MOT testing? The MOT data set contains information that can be used to create hierarchies, or levels, of faults for vehicles that failed their tests. To illustrate what the hierarchy means: if a Ford failed during testing, for example, one may be interested in finding out:

  • Was it the brakes, lighting, tyres, parts of the engine, etc. that were faulty?
  • If it was the engine, what part of it?
  • In turn, what was the exact problem with that part?
  • Finally what is the simple statement of the problem discovered?

Peter at ScraperWiki (the closest to a data engineer, in my opinion, by the above definition – he might disagree) explored this using the test item detail and item group data sets and produced, using Python, a tree showing the hierarchies of MOT test failures.

What other progress has there been? Well, the test results data has also been factored into the equation. So we now have a reasonable description of the fault discovered for each vehicle that was tested and failed at first MOT testing, and we have these presented in hierarchical format, e.g. for a VOLVO XC70 D OCEAN RACE 7827 that failed during testing, we can see the levels or hierarchies of the problem detected as presented in the MOT data set:

Level 3: Suspension
Level 2: Anti-roll bars
Level 1: Linkage
Description of fault given in the MOT data set: “has excessive play in a ball joint”

Hopefully this will make sense to a mechanical engineer or car designer working for Volvo!
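To make the hierarchy idea concrete, here is a minimal Python sketch of one way such levels could be represented and flattened. The nested-dictionary shape, the sample data and the `walk` helper are illustrative assumptions, not the project’s actual code:

```python
# Hypothetical nested structure for fault hierarchies: each level maps a
# category to its sub-categories, with leaf descriptions mapped to counts.
faults = {
    "Suspension": {
        "Anti-roll bars": {
            "Linkage": {
                "has excessive play in a ball joint": 1,
            }
        }
    }
}

def walk(tree, path=()):
    """Yield (path, count) pairs for every leaf fault description."""
    for key, value in tree.items():
        if isinstance(value, dict):
            # Descend one level, remembering the category we came through.
            yield from walk(value, path + (key,))
        else:
            yield path + (key,), value
```

Flattening the tree like this gives one row per detailed fault, while the path preserves the Level 3 → Level 1 categorisation above it.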

The final aim of the project is a website that lets users get at the information they are interested in. This dictates the next steps, so watch this space for updates.

Hi, I’m Pius….
https://blog.scraperwiki.com/2015/06/hi-im-pius/
Thu, 04 Jun 2015 09:48:10 +0000

…and I’m the new thing at ScraperWiki. Yes, you heard right: thing, not person or guy or anything human. Since I learnt that real-world entities could be modelled using programming-language objects in order to answer questions or make inferences, one weird thing in my brain just interpreted it the other way around – that real-world entities are the abstractions and programming-language objects are the real thing. So I am an object* rather than a person – enough said.

I have always been intrigued by mathematics and computer programs and this informed my choice of Computer Science with Mathematics as my first degree. It’s amazing how far technology has come in simplifying processes and making life more interesting in general (now is not the time to talk about technology’s negative effects, if any!).

While my first real programming exposure was at a normal and acceptable pace, the same cannot be said about my introduction to databases and data management. My first database-related experience was when I joined a team consulting for a multi-national communications service provider. The application we managed rode on an Oracle database holding data for more than 63 million subscribers, with many tables having about half a billion rows of data! (ok, take away the exclamation mark, that is no longer ‘big’ these days).

It was from this exposure that I developed a real interest in data (data management, data analytics, databases, data manipulation), and I became eager for opportunities to hone my skills in this area. So when the chance to do a Master’s came along, I chose to do it in Data Science. And as ‘serendipity’ would have it, I ended up doing an internship at ScraperWiki as part of the course. Needless to say, the data set I’m working on, namely the UK MOT data set, is ‘big’, and I hope to make the best of it.

With the size of the data come concerns about the speed of processing it to derive insights, as well as memory and disk-space concerns. ScraperWiki’s team of experts, especially Peter, has been really helpful in providing tips and tricks in this direction – and of course we’re just starting. Watch this space for developments and updates as the project progresses.

But don’t think I’m all about work and more work – that’s why I like ScraperWiki’s ‘work hard, play hard’ approach. If you’d like to see more of my other side, do feel free to take me out for food or drink (not tea, as I have enough in the office!). I also enjoy swimming and cycling, although I usually only get the chance to walk or run.

Let me end this by saying the ScraperWiki environment is exactly the kind of work environment I wished for. You are given the independence to use whatever technology you deem fit to accomplish your tasks, and you are surrounded by experts and solution-oriented individuals who are ever willing to help, so you just have this confidence that you can get anything and everything done!

*footnote: we at ScraperWiki do not consider Pius to be an object 🙂
