Four data trends to rule them all, the data scientist king to bind them

My favourite soundbite from O’Reilly’s Strata data conference was a definition of big data. John Rauser, Amazon’s main data scientist, said to me that “data is big data when you can’t process it on one machine”. And naturally, small data is data that you can process on one machine.

What’s nice about this definition is that it makes it immediately clear that as time passes, big data is getting smaller. And of course, small data is getting bigger. This is linked to four interrelated data technology and business trends.

1. Super Moore’s law for data. Even without any specific new technology, what we can process on “one computer” would be getting larger anyway with Moore’s law applied to processors, RAM and disks.

But that’s not all that’s happening. We’re also in the middle of the commoditisation of what was once part of Google’s competitive advantage: the distribution of work over clusters of bog-standard servers, using things like Hadoop.

Right now you still need to hire special engineers to do that, but it is only a matter of time before it is just a service you buy with your credit card: process any amount of data at any speed, with just a slider that any wealthy businessman’s data scientist can drag.

The feeling is, rollercoaster!, we’re going faster than Moore’s law with data right now.

2. Business of big data. The result of the first trend is that every company can now store and process as much data as once only the tech giants did. This is very significant strategically and tactically, in ways very specific to each industry. Given the right use of the data (see the next two trends), it changes everything.

We’ve seen this in book selling for years now because Amazon was ahead of the curve. But imagine both basic algorithms, and fancy ones like those cunningly used this week to reproduce images from our visual cortex, applied in as yet untouched areas.

3. Collaboration of data. The above two trends do not need the Internet. Even if we were still all locked in isolated corporate data centres, through a freak historic accident preventing the invention of global packet switching, we would still be getting the transistor cheapness of Moore’s law, and we’d still be running clusters (“clouds”) of map/reduce servers with just local networking.

The Internet isn’t about raw CPU power. It’s about collaboration. Collaboration is changing how we work with documents, how we share news, how we keep in touch with our friends, how we build software. Why wouldn’t it change how we work with data? Of course, it already is.

There are several quite different ways it can go further. It can create marketplaces for the commercial exchange of data, more transparent than the existing siloed data resellers. It can create tools for socialising the analysis, visualisation, quality checking and gathering of data. It can allow governments, and corporations, to be radically transparent, opening their data in all cases except where it is their unique competitive advantage.

It’s remarkable how old the Internet is, and how badly we collaborate on working with data.

4. The data scientist is king. A term cooked up (so I’m told) a few years ago by the chief data scientists at LinkedIn and Facebook while in a bar deciding what to name their new teams, this is the latest and newly trending job title for what used to be called a statistician or data analyst.

But it isn’t the same. They’re not just geeks buried away. Yes, a data scientist is a data geek. They love data, and interesting data, more than anything. They know how to program, but in scripting languages and SQL, not hard core software engineering. They understand statistics as if they were brought up by a prior probability.

But they also care about the business, and they know how to communicate. They can give presentations to senior management, making hard stats clear. Volunteer, non-profit data scientists have an unassailable passion for their mission.

Data scientists are the glue, linking data to decisions. Cathy O’Neil gave a fantastic talk at Strata describing these beasts, useful if you are one but didn’t realise it had a buzzy new marketing term, or if you are getting into the business of data and need to hire some.

You can’t make full and accurate use of any of the other trends above without data scientists.

Tags: big data, data scientist, open data
