Don’t get led astray by the commonly overlooked downsides of big data
Big data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures.”
Edd Dumbill, Program Chair, O’Reilly Strata Conference
The Allure of Big Data
The allure of big data is the expectation that new types of data, in huge volumes, will give the product developer valuable new insights to drive product development. But there are downsides and cautions to consider.
As a baseline, consider traditional data sources that came of age in a time when data storage, data collection and data processing/analysis were expensive. Great thought went into defining the learning objective, choosing unbiased data sources, defining data structures, ensuring data integrity and choosing the data analysis techniques.
Now those constraints are mostly gone, and technology advances have created new types of data, such as:
- Machine-generated/sensor data – CDR (call detail records), weblogs, smart meters, manufacturing sensors, equipment logs, trading systems data, cell phone GPS
- Social media and other new formats – customer feedback, micro-blogging sites like Twitter, social media platforms like Facebook, audio, video, photos
Big data is often described by the 4 Vs:
- Volume – Increasing volumes of data being saved; machine-generated data is increasing faster than human-generated data
- Velocity – Fast inflow into the organization – many call it a firehose
- Variety – Traditional data was well-defined and structured, now formats are ever-changing and unstructured (text, video, audio, etc.)
- Value – Economic value of different types of data varies greatly; the question is how to identify and extract what is valuable
Some key characteristics of big data have implications for forecasting, learning and decision making:
- Big data tends to be a by-product of another activity, not the result of a planned and controlled study. For example:
  - People spontaneously interacting on Facebook and Twitter through text comments and sharing photos, pins, videos and URLs
  - GPS data generated while walking, driving and biking throughout the day
  - Random web-surfing, or researching a topic on the internet
- Big data is generally observational, meaning the analyst has not introduced a stimulus of some kind and recorded the response to that stimulus (i.e. an experiment – see an earlier post: The Different Flavors of Experimentation). The analyst merely observes activity and often does not know if some stimulus has triggered this activity
- Because the data is observational, the analyst generally does not control who is being observed; there is no guarantee of a representative, random sample. In fact, bias is usually present in big data
- Also, as a by-product, big data tends to be unstructured, requiring extensive clean-up and manipulation to transform it into a usable format. Different analysts will perform these activities differently, which potentially introduces more bias
- Data mashups are increasingly common. A data mashup combines data from disparate sources. Often this means the data sources cover different time frames and have different levels of validity, completeness and bias
Pitfall 1: Assuming insights in your big data set apply to your target customer
Social media monitoring has received a lot of attention. The premise is that by “listening in”[i] on social media activity, the product developer will identify new insights, unmet needs and valuable product ideas. This identification is based on finding patterns in the data.
As described in another post (Can an Algorithm Replace You), inductive reasoning finds patterns in the data, makes tentative hypotheses and finally presents a theory or rationale that explains the patterns. Theories generated through inductive reasoning are just that: theories. A theory needs to be tested before you rely on it. Inductive reasoning is valuable for “exploratory” data analysis, and the potentially valuable theories it produces can then be tested using confirmatory data analysis.
The temptation, and risk, is running with a new product idea before confirming the pattern with your target customer base.
Pitfall 2: Quantity does not compensate for quality
Continuing with the social media monitoring example, bias is another concern with big data. The gold standard in dealing with potential bias is to draw a simple random sample from the target population. In social media monitoring, keep in mind that the participants are likely a sub-segment of your target customer base and not representative of the whole. Compared with your typical customer, they may be more fanatical about your product, more likely to be early adopters, more technology-savvy, and more interested in sharing their opinions. Any pattern you identify needs to be tested for validity on a sample representative of your target customer.
Some people argue that a large amount of biased data is more valuable than a small sample with no bias. But Xiao-Li Meng, Dean of the Graduate School of Arts and Sciences at Harvard University, gave a clever demonstration based on estimating a mean to refute that argument[ii].
By the magic of math, the total error in the mean estimate (its mean squared error) is the sum of the estimate’s variance and its squared bias.
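Written out, the decomposition looks like this. The symbols are notation introduced here for illustration, not taken from the talk: the estimated mean is written as \hat{\mu} and the true population mean as \mu.

```latex
\mathrm{MSE}(\hat{\mu})
  = \mathbb{E}\big[(\hat{\mu} - \mu)^2\big]
  = \mathrm{Var}(\hat{\mu}) + \mathrm{Bias}(\hat{\mu})^2,
\qquad
\mathrm{Bias}(\hat{\mu}) = \mathbb{E}[\hat{\mu}] - \mu
```

More data shrinks the variance term, but it does nothing to the bias term.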
In a simple random sample (SRS), the bias is zero. In a very large sample of biased data (LSBD), the variance is negligible due to the large size of the data set. This means that we can compare total error for each sample by looking at the relative sizes of 1) the variance for the SRS and 2) the squared bias for the LSBD. After some math magic, we find the “Big Data Paradox” – the larger the data, the more pronounced the bias[iii].
To illustrate, suppose we have a simple random sample of 100 people in the US. How large does the biased data set have to be to produce a more accurate sample mean than the simple random sample?
The answer: more than 50% of the population, more than 160 million people[iv].
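The 50% figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that the r in note [iv] is the correlation between a person's value and whether that person shows up in the biased data set, and that the break-even point is where the two errors are equal: variance/n for the simple random sample against r² × (1 − f)/f × variance for the biased data, where f is the fraction of the population captured. Solving for f gives f/(1 − f) = n × r². The function name and the 320 million population figure are assumptions made for illustration.

```python
def breakeven_fraction(srs_n: int, r: float) -> float:
    """Fraction of the population a biased data set must capture to match
    the mean squared error of a simple random sample of size srs_n.

    Solves f / (1 - f) = srs_n * r**2 for f (finite-population
    correction for the small random sample is ignored).
    """
    k = srs_n * r ** 2
    return k / (1.0 + k)


# Assumed round figure for the US population, chosen to be consistent
# with the 160 million quoted in the text.
US_POPULATION = 320_000_000

f = breakeven_fraction(srs_n=100, r=0.1)
print(f"fraction needed: {f:.0%}")                   # prints "fraction needed: 50%"
print(f"people needed:   {f * US_POPULATION:,.0f}")  # prints "people needed:   160,000,000"
```

Trying other values of r shows how sensitive the result is: the stronger the link between a person's value and their presence in the data, the larger the share of the population needed to match a small random sample.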
So, as Dr. Meng concludes, size does not overcome bias; it just makes the bias more pronounced. Unless you can correct for that bias, your big data estimates are going to be off, perhaps way off. One way to correct the bias is to combine (mash up) your biased samples with unbiased samples.
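The claim that size does not overcome bias can also be checked with a small simulation. The sketch below is illustrative only and is not taken from Dr. Meng's talk: the population of satisfaction scores, the 0.6/0.4 chances of appearing in the big data set, and the function name `compare_samples` are all assumptions made up for this example.

```python
import random
import statistics


def compare_samples(pop_size=100_000, srs_n=100, srs_trials=1_000,
                    biased_trials=10, seed=42):
    """Compare the mean squared error of a small simple random sample
    with that of a very large but biased sample of the same population."""
    rng = random.Random(seed)

    # Hypothetical population: one satisfaction score per customer.
    population = [rng.gauss(50.0, 10.0) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    def mse(estimates):
        return statistics.fmean([(e - true_mean) ** 2 for e in estimates])

    # Small simple random sample: every customer is equally likely to be drawn.
    srs_means = [statistics.fmean(rng.sample(population, srs_n))
                 for _ in range(srs_trials)]

    # Large biased sample: customers scoring above 50 appear with
    # probability 0.6, everyone else with probability 0.4 (assumed numbers).
    biased_means, biased_sizes = [], []
    for _ in range(biased_trials):
        seen = [x for x in population
                if rng.random() < (0.6 if x > 50.0 else 0.4)]
        biased_means.append(statistics.fmean(seen))
        biased_sizes.append(len(seen))

    return {
        "srs_mse": mse(srs_means),
        "biased_mse": mse(biased_means),
        "biased_size": statistics.fmean(biased_sizes),
    }


if __name__ == "__main__":
    result = compare_samples()
    print(f"SRS of 100 customers: MSE = {result['srs_mse']:.2f}")
    print(f"Biased sample of about {result['biased_size']:,.0f} customers: "
          f"MSE = {result['biased_mse']:.2f}")
```

Under these assumed numbers the biased sample covers about half of the population, yet its mean is off by roughly 1.6 points every time, while the random sample of 100 has a standard error of roughly 1 point. Collecting more biased rows does not move that 1.6-point error.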
The Caution for Product Developers
To avoid going astray:
- Don’t be tempted to pick up a pattern in your observed data and run with a new product idea. Test it out to make sure that your hunch holds for your target customer
- Don’t be fooled by the suggestion that it’s a huge data set, so you don’t have to worry about error and bias. Quality is more important than quantity when it comes to making your product development decisions
[i] Terlep, Sharon, “Focus Groups Fall Out of Favor, Technology-driven tools do a better job of explaining consumer behavior today,” Wall Street Journal, September 19, 2016.
[ii] Presented in his talk on accepting the Statistician of the Year 2015-2016 award from the Chicago Chapter of the American Statistical Association: “Statistical Paradises and Paradoxes of Big Data,” https://ww2.amstat.org/misc/XiaoLiMengBDSSG.pdf.
[iii] Meng, Xiao-Li and Xianchao Xie, “I Got More Data, My Model is More Refined, but My Estimator is Getting Worse! Am I Just Dumb?,” Econometric Reviews, 2014, Vol 33 Issue 1-4, pages 218-250. This article requires some knowledge of probability theory and statistical modeling.
[iv] This calculation is based on the explanation given in “Statistical Paradises and Paradoxes of Big Data,” referenced above, using r = 0.1.