We hear a lot about “Big Data” at ScraperWiki. We’ve always been a bit bemused by the tag since it seems to be used indescriminately. Just what is big data and is there something special I should do with it? Is it even a uniform thing?
I’m giving a workshop on data science next week and one of the topics of interest for the attendees is “Big Data”, so I thought I should investigate in a little more depth what people mean by “Big Data”. Hence I have read Big Data by Viktor Mayer-Schönberger and Kenneth Cukier, subtitled “A Revolution That Will Transform How We Live, Work and Think” – chosen for the large number of reviews it has attracted on Amazon. The subtitle is a guide to their style and exuberance.
Their thesis is that we can define big data, in contrast to earlier “little data”, by three things:
- It’s big but not necessarily that big, their definition for big is that n = all. That is to say that in some domain you take all of the data you can get hold of. They use as one example a study on bout fixing in sumo wrestling, based on data on 64,000 bouts – which would fit comfortably into a spreadsheet. Other data sets discussed are larger such as credit card transaction data, mobile telephony data, Google’s search query data…;
- Big data is messy, it is perhaps incomplete or poorly encoded. We may not have all the data we want from every event, it may be encoded using free text rather than strict categories and so forth;
- Working with big data we must discard an enthusiasm for causality and replace it with correlation. Working with big data we shouldn’t mind too much if our results are just correlations rather than explanations (causation);
- An implicit fourth element is that the analysis you are going to apply to your big data is some form of machine learning.
I have issues with each of these “novel” features:
Scientists have long collected datasets and done calculations that are at the limit (or beyond) their ability to process the data produced. Think protein x-ray crystallography, astronomical data for navigation, the CERN detectors etc etc. You can think of the decadal censuses run by countries such as the US and UK as n = all. Or the data fed to the early LEO computer to calculate the deliveries required for each of their hundreds of teashops. The difference today is that people and companies are able to effortlessly collect a larger quantity of data than ever before. They’re able to collect data without thinking about it first. The idea of n = all is not really a help. The straw man against which it is placed is the selection of a subset of data by sampling.
They say that big data is messy implying that what went before was not. One of the failings of the book is their disregard for those researchers that have gone before. According to them the new big data analysts are comfortable with messiness and uncertainty, unlike those fuddy-duddy statisticians! Small data is messy, scientists and statisticians have long dealt with messy and incomplete data.
The third of their features: we must be comfortable with correlation rather than demand causation. There are many circumstances where correlation is OK, such as when Amazon uses my previous browsing and purchase history to suggest new purchases but the area of machine learning / data mining has long struggled with messiness and causality.
This is not to say nothing has happened in the last 20 or so years regarding data. The ubiquity of computing devices, cheap storage and processing power and the introduction of frameworks like Hadoop are all significant innovations in the last 20 years. But they grow on things that went before, they are not a paradigm shift. Labelling something as ‘big data’, so ill-defined, provides no helpful insight as to how to deal with it.
The book could be described as the “What Google Did Next…” playbook. It opens with Google’s work on flu trends, passes through Google’s translation work and Google Books project. It includes examples from many other players but one gets the impression that it is Google they really like. They are patronising of Amazon for not making full use of the data they glean from their Kindle ebook ecosystem. They pay somewhat cursory attention to issues of data privacy and consent, and have the unusual idea of creating a cadre of algorithmists who would vet the probity of algorithms and applications in the manner of accountants doing audit or data protection officers.
So what is this book good for? It provides a nice range of examples of data analysis and some interesting stories regarding the use to which it has been put. It gives a fair overview of the value of data analysis and some of the risks it presents. It highlights that the term “big data” is used so broadly that it conveys little meaning. This confusion over what is meant by “Big Data” is reflected on the datascience@Berkeley blog which lists definitions of big data from 30 people in the field (here). Finally, it provides me with sufficient cover to make a supportable claim that I am a “Big Data scientist”!
To my mind, the best definition of big data that I’ve seen is that it is like teenage sex…
- Everyone talks about it,
- nobody really knows how to do it,
- everyone thinks everyone else is doing it,
- so everyone claims they are doing it too!