Big data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.”
Ed Dumbill, Program Chair, O’Reilly Strata Conference
The allure of big data is the expectation that new types of data, in huge volumes, will give the product developer valuable new insights to drive product development. But there are downsides and cautions to consider.
As a baseline, consider traditional data sources that came of age in a time when data storage, data collection, and data processing/analysis were expensive. Great thought went into defining the learning objective, choosing unbiased data sources, defining data structures, ensuring data integrity, and choosing the data analysis techniques.
Now those constraints are mostly gone, and technology advances have created new types of data. One example that has received a lot of attention is social media monitoring. The premise is that by “listening in”[i] on social media activity, the product developer will identify new insights, unmet needs and valuable product ideas. This identification is based on finding patterns in the data.
As described in another post (Can an Algorithm Replace You), inductive reasoning finds patterns in the data, forms tentative hypotheses, and finally presents a theory or rationale that explains the patterns. Theories generated through inductive reasoning are just that: theories. A theory needs to be tested before you rely on it. Inductive reasoning is valuable for “exploratory” data analysis, and potentially valuable theories can then be tested using confirmatory data analysis.
The temptation, and risk, is running with a new product idea before confirming the pattern with your target customer base.
Continuing with the social media monitoring example, bias is another concern with big data. The gold standard for dealing with potential bias is a simple random sample from the target population. In social media monitoring, keep in mind that the participants are likely a sub-segment of your target customer base and not representative of the whole: they may be more fanatical about your product, earlier adopters, more technology savvy, and more eager to share their opinions. Any pattern you identify needs to be tested for validity on a sample that is representative of your target customer.
Some people argue that having more data with bias is more valuable than having a small sample with no bias. But Xiao-Li Meng, Dean of the Graduate School of Arts and Sciences at Harvard University, gave a clever demonstration, based on estimating a mean, that refutes that argument[ii].
By the magic of math, the expected total (squared) error of the mean estimate is the sum of its variance and its squared bias.
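In symbols, this is the standard bias-variance decomposition of mean squared error for an estimator \hat{\mu} of the true mean \mu:

\mathrm{MSE}(\hat{\mu}) = E\big[(\hat{\mu} - \mu)^2\big] = \mathrm{Var}(\hat{\mu}) + \big[\mathrm{Bias}(\hat{\mu})\big]^2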
In a simple random sample (SRS), the bias is zero. In a very large sample of biased data (LSBD), the variance is negligible because the data set is so large. This means we can compare the total error of each sample by looking at the relative sizes of (1) the variance of the SRS and (2) the squared bias of the LSBD. After some math magic, we find the “Big Data Paradox”: the larger the data set, the more pronounced the bias[iii].
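To make the comparison concrete, here is a minimal simulation sketch. None of the numbers come from Dr. Meng’s talk; the population size, the “satisfaction score” scale, and the self-selection mechanism are made-up assumptions for illustration. A small SRS is compared against a sample of several hundred thousand people who are slightly more likely to show up in the data when their score is high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: one million "satisfaction scores" (illustrative numbers only).
N = 1_000_000
population = rng.normal(loc=50, scale=10, size=N)
true_mean = population.mean()

def srs_estimate(n=100):
    """Mean of a simple random sample of size n: unbiased, but noisy."""
    return rng.choice(population, size=n, replace=False).mean()

def biased_estimate(base_rate=0.4):
    """Mean of a huge self-selected sample: people with higher scores are
    slightly more likely to participate (the assumed bias mechanism)."""
    z = (population - population.mean()) / population.std()
    prob = np.clip(base_rate + 0.05 * z, 0.0, 1.0)  # inclusion correlated with the score
    included = rng.random(N) < prob
    return population[included].mean()

reps = 200
srs_err = [srs_estimate() - true_mean for _ in range(reps)]
big_err = [biased_estimate() - true_mean for _ in range(reps)]

print("RMSE, SRS of 100:             ", np.sqrt(np.mean(np.square(srs_err))))
print("RMSE, ~400,000 biased records:", np.sqrt(np.mean(np.square(big_err))))
```

With these assumed numbers, the SRS of 100 is off by about 1 point on average, while the roughly 400,000-person self-selected sample is off by about 1.25 points nearly every time: the huge sample’s variance has vanished, but its bias has not.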
To illustrate, suppose we have a simple random sample of 100 people in the US. How large does the biased data set have to be to produce a more accurate sample mean than the simple random sample?
The answer: more than 50% of the population, more than 160 million people[iv].
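Here is a sketch of the arithmetic behind that answer, using the error identity from Dr. Meng’s talk and reading r = 0.1 as \rho, the correlation between a person’s value and whether that person ends up in the data. For a biased sample of n people out of a population of N, the expected squared error of the sample mean is roughly \rho^2 \cdot \frac{N-n}{n} \cdot \sigma^2, while an SRS of 100 has expected squared error of about \sigma^2 / 100. The biased sample wins only when

\rho^2 \cdot \frac{N-n}{n} \cdot \sigma^2 < \frac{\sigma^2}{100}
\;\Longrightarrow\;
\frac{N-n}{n} < \frac{1}{100\,\rho^2} = 1 \ (\text{for } \rho = 0.1)
\;\Longrightarrow\;
n > \frac{N}{2}

With the US population around 325 million at the time, n > N/2 works out to more than roughly 160 million people.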
So, as Dr. Meng concludes, size does not overcome bias; it just makes the bias more pronounced. Unless you can correct for that bias, your big data estimates are going to be off, perhaps way off. One way to correct the bias is to combine (mash up) your biased samples with unbiased samples.
To avoid going astray, treat the patterns you find in big data as hypotheses: confirm them with a representative sample of your target customers, and correct for bias before acting on the estimates.
[i] Terlep, Sharon “Focus Groups Fall Out of Favor, Technology-driven tools do a better job of explaining consumer behavior today,” Wall Street Journal, September 19, 2016.
[ii] Presented in his talk on accepting the Chicago Chapter of the American Statistical Association’s Statistician of the Year award for 2015-2016: “Statistical Paradises and Paradoxes of Big Data”, https://ww2.amstat.org/misc/XiaoLiMengBDSSG.pdf.
[iii] Meng, Xiao-Li and Xianchao Xie “I Got More Data, My Model is More Refined, but My Estimator is Getting Worse! Am I Just Dumb?,” Econometric Reviews, 2014, Vol 33 Issue 1-4, pages 218-250. This article requires some knowledge of probability theory and statistical modeling.
[iv] This calculation is based on the explanation given in “Statistical Paradises and Paradoxes of Big Data” referenced above, and using r=.1.