There is a lot of potential in Big Data, but only if we dial down the hype and focus on building solid analytic foundations.
Big data is a hot topic. Our daily interactions with the Internet -- through laptops, smartphones, tablets, and IoT devices -- generate vast amounts of data that are thought to encode our buying preferences, our political and economic moods, and more. This attracts entrepreneurs who seek new ways to make money with new methods to manage and analyze the sheer volume and distributed nature of the data. Along the way, discussion around the potential benefits of big data analysis seems to have morphed from enthusiasm to almost religious fervor, asserting that big data analytics will become the dominant method of analysis for almost everything. The apex (or more exactly the nadir) of this adulation may be an article in Wired magazine -- from the editor in chief, no less -- titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." Let that one sink in -- we didn't need Galileo or Newton or Einstein, we just needed big data analytics.
In what follows, it may seem that I'm just another engineer who doesn't get it, but I'm actually a fan. However, I'm wary of claims that seem over the top, especially when they start leaking into popular media. I'm also concerned that the expectations being heaped on a rather slender framework of substance will inevitably lead to collapse and disappointment, which would be a shame because it would overshadow the real value that might come from this direction. In the spirit of encouraging healthy debate, here are a few pins in the hyperbole balloon. Several of these ideas are from, or inspired by, Nate Silver's The Signal and the Noise.
Large amounts of data do not guarantee large amounts of accessible information
If you have any physics background, you know that information is related to entropy, not the size of the dataset. As entropy goes up, accessible information goes down. At one extreme, a volume of gas in equilibrium contains, in principle, a huge amount of information -- the positions and momenta of each molecule in the gas. This number easily dwarfs any measure of information on the Internet. And yet we can only extract two numbers from all that data -- the pressure and temperature of the gas. Entropy continues to limit available information as you move away from this extreme, because it is difficult to extract a signal from noisy data. Differential Power Analysis (DPA) is an example where an encryption key can be extracted from power data through clever statistical sampling over many large datasets, but it is not a technique for the faint of heart. DPA requires strong mathematical underpinnings, and from all that analysis and math you are only able to extract one 256-bit encryption key (valuable to be sure, but not "a lot" of information).
As you drop down to lower entropies, it becomes progressively easier to extract information. At some point, it becomes sufficiently easy that whatever information can be found becomes rather obvious, without the need for big data techniques. So the usefulness of big data methods must sit in the Goldilocks zone of entropy, neither "too much" nor "too little." This may still be an interesting area, but it hardly supports the "universal solution" claims made for the method.
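To make the noise argument concrete, here is a toy Python sketch. It is not DPA; it is a much simpler stand-in in which a single hidden bit shifts every measurement by a small amount and the shift is buried under Gaussian noise. The function names and the values of `signal` and `noise_sd` are illustrative assumptions, not figures from any real system.

```python
import random
import statistics


def recovered_correctly(n_traces, signal=0.1, noise_sd=1.0, seed=0):
    """Return True if a hidden bit is guessed correctly from noisy data."""
    rng = random.Random(seed)
    hidden_bit = rng.randint(0, 1)
    # Each measurement is a tiny bias from the hidden bit plus large noise.
    traces = [signal * hidden_bit + rng.gauss(0.0, noise_sd)
              for _ in range(n_traces)]
    # Averaging shrinks the noise by about 1/sqrt(n_traces); guess the bit
    # by checking which side of the midpoint the average falls on.
    guess = 1 if statistics.fmean(traces) > signal / 2 else 0
    return guess == hidden_bit


def success_rate(n_traces, trials=100, **kwargs):
    """Fraction of independent trials in which the bit is recovered."""
    hits = sum(recovered_correctly(n_traces, seed=s, **kwargs)
               for s in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, success_rate(n))
```

Under these assumed values, a handful of measurements should do little better than a coin flip, while thousands recover the bit almost every time. That is the trade in miniature: a great deal of data, and the payoff is one bit.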
A related consideration, well known to statisticians, is the problem of uncontrolled variables. Whenever you perform an analysis on a large dataset, you look for correlation between a small number of variables. But the data may be sensitive to many variables. What you are hoping is that some kind of relationship between the selected variables will dominate all other effects. You can potentially improve the correlation by filtering, especially by requiring certain other variables to be more or less fixed in your sample. But in order to do this, you first have to know what variables you want to control. And you have to remember that the more variables you use to filter, the smaller your sample will become. If you control too much, the sample may be too small to support any meaningful conclusions.
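The shrinking-sample effect is easy to see in a small simulation. The sketch below assumes a synthetic dataset in which every control variable is an independent category with `levels` equally likely values, and "controlling" a variable means keeping only the rows where it takes one fixed value. Both the function name and that setup are hypothetical; real variables are rarely this tidy.

```python
import random


def filtered_sample_size(n_rows, n_controls, levels=4, seed=0):
    """Count rows left after pinning n_controls variables to one value."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(n_rows):
        # One synthetic row: a random category for each control variable.
        row = [rng.randrange(levels) for _ in range(n_controls)]
        # Keep the row only if every controlled variable equals category 0.
        if all(value == 0 for value in row):
            kept += 1
    return kept


if __name__ == "__main__":
    for k in range(0, 9, 2):
        print(k, filtered_sample_size(100_000, k))
```

Under these assumptions each added control keeps roughly one row in `levels`, so the sample shrinks geometrically: a hundred thousand rows are down to a few hundred after four controls and to almost nothing after eight.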