Friday, May 13, 2016

Big Data and Magical Thinking

Every business is drowning in data. Every business believes it should make better use of that data. Seemingly a long time ago, Big Data came onto the scene as The Answer. It became a buzzword and a cottage industry. In some places, it simply became a synonym for Hadoop.

The challenge is that simply having more data, or combining all of the business's data into a common pool or ‘lake’, isn't by itself going to unlock insights, as if by magic.

Rigor is required in managing the data sources and pinning down what the various data elements mean. Equal rigor is required in applying proper mathematical techniques when analyzing the data, and in avoiding misleading conclusions.

The curse of dimensionality (applicable to sampling and anomaly detection, among other things), misuse of p-values, and implicit assumptions about the shape of the underlying probability distribution come to mind as some of the most common oversights.
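To make the first of these concrete, here is a minimal sketch of one face of the curse of dimensionality: as the number of dimensions grows, the nearest and farthest points from any given query end up at almost the same distance, which undermines distance-based anomaly detection. The function name relative_contrast, the uniform random data, and the sample size of 200 are all assumptions chosen for illustration, not something taken from a real dataset.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for random points in a unit hypercube.

    Hypothetical helper for illustration: points are drawn uniformly at random,
    which is an assumption, not a model of any real business data.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The contrast between "near" and "far" shrinks as dimensions are added.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

With only a couple of dimensions, the farthest point is many times farther away than the nearest one. With hundreds of dimensions, the two are nearly indistinguishable, so "unusually far from everything else" stops being a meaningful signal unless the dimensionality is reduced first.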