How do data science and data engineering differ? And where do they overlap? I agree to a large extent with the answer given here.
A data scientist must be able to ask the right questions. 'Right' in this context means interesting: questions whose answers provide intelligence that can lead to process improvement or greater profitability (you don't want to invest time and skills finding out why the sky is blue if it would make no difference were it grey!). Having settled on the right questions, a data scientist must know how to answer them, and that takes technical expertise: skills in suitable programming and database technologies that support the statistics, machine learning and data mining needed to uncover the answers. The last stage is data presentation, using charts, graphs or other presentation tools.
A data engineer is more interested in the infrastructure and architecture that aid fast and efficient processing of big data. A data engineer is closer to a software or hardware engineer than to a data scientist, but both the data scientist and the data engineer are good programmers, database enthusiasts and fast learners of new skills. So it stands to reason that, in keeping with the current shift from the computer age to the information age (as I like to think of it), the transition for a typical professional would be Software Engineer -> Data Engineer -> Data Scientist (if desired). In the ideal scenario, following strict (theoretical) definitions, in time we should not need software engineers any more, but rather data engineers and data scientists, as they will have the software engineering skills and more.
But back to the real world, where the line is blurred, or sometimes doesn’t exist, and job titles do not matter but job descriptions and requirements do. You are given a problem, and you need to figure out how to solve it, mastering new skills if you have to.
Now, in my first blog post I introduced the MOT data set as the data set on which my MSc research project is based. The MOT data set is part of the UK's open data and is available in the public domain. The aim of the project is to explore the data as fully as possible, to find the top faults for vehicles that failed MOT testing, and to display these faults interactively.
So let me talk about the progress of work on the MOT data set (that was the reason for this post!). One interesting question was: what faults are typically detected for different makes of car at MOT testing? The MOT data set contains information that can be used to build hierarchies, or levels, of faults for vehicles that failed their tests. To illustrate what the hierarchy means: if a Ford failed its test, for example, one might be interested in finding out:
- Was it the brakes, lighting, tyres, parts of the engine, etc. that were faulty?
- If it was the engine, what part of it?
- In turn, what was the exact problem with that part?
- Finally, what is the simple statement of the problem discovered?
Peter at ScraperWiki (the closest to a data engineer, in my opinion, by the above definition, though he might disagree) explored this using the test item detail and item group data sets and produced, using Python, the following tree showing the hierarchies of MOT test failures.
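To make the idea of a fault hierarchy concrete, here is a minimal sketch of how a tree can be built from parent/child rows. This is not Peter's code: the row layout `(item_id, parent_id, name)` and the sample values are my own assumptions for illustration and are not taken from the MOT data set.

```python
from collections import defaultdict

# Hypothetical rows: (item_id, parent_id, name).
# Invented for illustration; not the real MOT item group data.
ITEM_GROUPS = [
    (1, None, "Vehicle"),
    (2, 1, "Brakes"),
    (3, 1, "Lighting"),
    (4, 1, "Suspension"),
    (5, 4, "Ball joint"),
]


def build_children(rows):
    """Map each parent id to the list of its child ids, in input order."""
    children = defaultdict(list)
    for item_id, parent_id, _name in rows:
        children[parent_id].append(item_id)
    return children


def render_tree(rows):
    """Return the hierarchy as indented lines, one line per node."""
    names = {item_id: name for item_id, _parent, name in rows}
    children = build_children(rows)
    lines = []

    def visit(item_id, depth):
        lines.append("  " * depth + names[item_id])
        for child in children.get(item_id, []):
            visit(child, depth + 1)

    # Rows whose parent is None are treated as roots.
    for root in children.get(None, []):
        visit(root, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(ITEM_GROUPS)))
```

Each level of indentation corresponds to one level of the hierarchy, so a deeper line is a more specific description of where the fault lies.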
What other progress has there been? Well, the test results data has also been factored into the equation. So we now have a reasonable description of the fault discovered for each vehicle that failed its first MOT test, and we have these presented in hierarchical format. For example, for a VOLVO XC70 D OCEAN RACE 7827 that failed during testing, we can see the levels, or hierarchies, of the problem detected as presented in the MOT data set:
| Description of fault given in MOT data set |
|---|
| has excessive play in a ball joint |
Hopefully this will make sense to a mechanical engineer or car designer working for Volvo!
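For the curious, one way to get from a single recorded failure to its chain of levels is to walk from the most specific group up through its parents. The sketch below assumes a lookup of `item_id -> (parent_id, name)`; the ids and group names are made up for illustration, and only the fault description is taken from the example above.

```python
# Hypothetical lookup: item_id -> (parent_id, name).
# The ids and names are invented; they are not the real MOT hierarchy.
GROUPS = {
    1: (None, "Vehicle"),
    2: (1, "Suspension"),
    3: (2, "Ball joint"),
}


def fault_path(item_id, groups):
    """Return group names from the root down to the given item."""
    path = []
    seen = set()
    while item_id is not None:
        if item_id in seen:
            # Guard against malformed data where a group is its own ancestor.
            raise ValueError(f"cycle detected at item {item_id}")
        seen.add(item_id)
        parent_id, name = groups[item_id]
        path.append(name)
        item_id = parent_id
    path.reverse()
    return path


def describe_failure(item_id, description, groups):
    """Join the hierarchy levels and the fault description into one line."""
    return " > ".join(fault_path(item_id, groups) + [description])


if __name__ == "__main__":
    print(describe_failure(3, "has excessive play in a ball joint", GROUPS))
```

Walking upwards from the leaf means only the groups relevant to one failure are visited, which is convenient when each failed test has to be labelled individually.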
The final aim of the project is a web site that lets users find the information that interests them. This dictates the next steps, so watch this space for updates.