Modern Data-Driven Discovery vs. Scientific Discovery
Modern data-driven discovery has been fueled by tremendous growth in data availability and computing power. Gone are the days when analysts had to be careful to use as few data points and computing steps as possible. We’ve traded in our HP programmable calculators and mainframes running FORTRAN for fast, sophisticated, cloud-based systems containing more data than we know what to do with. Along with this shift in technologies and computing environments has come a booming interest in a philosophy of reasoning and prediction quite opposite to classical statistical inference. This comparison was discussed in a recent workshop given by Dr. Vladimir Cherkassky i of the University of Minnesota and sponsored by the Chicago Chapter of the American Statistical Association.
Scientific discovery uses deductive reasoning and is based on probability theory.
Modern data-driven discovery uses inductive reasoning based on patterns in large, complex data sets.
Deductive reasoning is based on the logic that if something is true for a class of things, it is also true for all members of that class ii (top-down). The investigator starts with a theory and makes predictions based on the theory. Controlled experiments are used to determine whether the predictions hold and the theory is confirmed. These experiments are referred to as “confirmatory” data analysis.
Deductive reasoning has been the bedrock of scientific discovery for centuries. New “assured” knowledge is built upon prior knowledge as hypotheses are validated through repeatable experiments. This works well with conceptually simple systems, where a model can describe the behavior of the system and make predictions. The natural sciences are a good example where scientists have been able to discover general laws or “first principles.” In physics, you may be familiar with Newton’s three laws of motion and the law of gravity iii.
Inductive reasoning makes broad generalizations from specific observations (bottom-up). The investigator collects observations, finds patterns in the data, makes tentative hypotheses and finally presents a theory or rationale that explains the patterns. When looking for patterns, having lots of data to sift through is valuable. Thus Big Data has helped spur the growth of inductive reasoning in the search for scientific knowledge.
Theories generated through inductive reasoning are not reliable on their own, because a pattern found in one data set may not hold in another. Inductive reasoning can, however, be used for “exploratory” data analysis, and potentially valuable theories can then be tested using confirmatory data analysis.
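To make the exploratory versus confirmatory distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data are pure random noise, and the sizes (10 observations, 200 candidate variables) and the seed are arbitrary. The exploratory step searches many unrelated variables for the one that best tracks a target; the confirmatory step checks that "discovery" against fresh observations.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
N_OBS = 10               # few observations per variable (illustrative)
N_CANDIDATES = 200       # many unrelated candidate variables (illustrative)

def noise_series(n):
    """A series of independent standard normal draws: no real signal."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

target = noise_series(N_OBS)
candidates = [noise_series(N_OBS) for _ in range(N_CANDIDATES)]

# Exploratory step: sift through every candidate and keep the strongest pattern.
correlations = [pearson(c, target) for c in candidates]
best_index = max(range(N_CANDIDATES), key=lambda i: abs(correlations[i]))
exploratory_r = correlations[best_index]

# Confirmatory step: collect fresh, independent observations of the same
# two "variables" and measure the correlation again.
fresh_target = noise_series(N_OBS)
fresh_candidate = noise_series(N_OBS)
confirmatory_r = pearson(fresh_candidate, fresh_target)

print(f"best correlation found while exploring: {exploratory_r:+.2f}")
print(f"same pair on fresh data:                {confirmatory_r:+.2f}")
```

Because the winner is chosen from 200 candidates, the exploratory correlation is strong even though no relationship exists. On fresh data the correlation is just that of two unrelated noise series, which is usually much weaker. That is why a pattern found by searching should be treated as a hypothesis to test rather than as a result.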
The Problem of Prediction
In the scientific approach, prediction requires system identification – a probabilistic model describing the system. When a system is not complex, this can be accomplished. But when there are many factors affecting a system, it can be very difficult, if not impossible, to build a model adequately describing all the relationships, complete with probability distributions. Making predictions and testing hypotheses may be impossible.
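For a system that is not complex, system identification can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not a method from the workshop: it assumes the system is a straight line plus Gaussian noise, simulates data from made-up "true" parameters, and then recovers the line and the noise level by ordinary least squares.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical "true" system, known here only because we simulate it.
TRUE_SLOPE, TRUE_INTERCEPT, TRUE_NOISE_SD = 2.0, 1.0, 0.5

xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, TRUE_NOISE_SD) for x in xs]

# Identify the system: ordinary least squares for the line...
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# ...and the residual standard deviation for the noise distribution.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
noise_sd = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

# Predict with the identified model. The interval is a rough 95% band that
# ignores uncertainty in the estimated slope and intercept.
new_x = 4.0
predicted = intercept + slope * new_x
low, high = predicted - 1.96 * noise_sd, predicted + 1.96 * noise_sd

print(f"estimated model: y = {intercept:.2f} + {slope:.2f} x, noise sd {noise_sd:.2f}")
print(f"prediction at x = {new_x}: {predicted:.2f} (roughly {low:.2f} to {high:.2f})")
```

With one input and a known functional form, the model and its probability distribution are easy to estimate. The difficulty described above arises when there are many interacting factors and no such form to assume.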
An economy is an example of a complex system. It is affected by many factors, including some that appear to have nothing to do with economics. For example, natural disasters such as hurricanes, tsunamis and earthquakes can strangle an economy. Our current knowledge is too limited, and our models too blunt, to allow us to control, or even reliably predict, the outcomes of such a complex system.
An alternative approach to prediction is system imitation. There is no need to understand the inner workings of the system. The goal is merely to imitate some aspect of the unknown system by finding and exploiting patterns. Classification is a common task. For example: is an email spam or not? Is the person in a picture male or female?
These kinds of tasks can be handled by machine learning techniques – algorithms that “learn” from data. As the algorithm makes predictions and compares them with the actual answers, it adapts and improves. There is no single “right” algorithm, and because strong patterns can arise from seemingly spurious correlations iv, it is fair to ask whether these predictive algorithms can be interpreted or trusted.
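The predict, compare and adapt loop can be sketched with a perceptron, one of the simplest learning algorithms. This Python sketch is an illustration only: the two features (a count of suspicious words and a count of links) and the eight training emails are invented, and a real spam filter would use far more data and a more capable model.

```python
# Toy training set: (suspicious_word_count, link_count) -> 1 for spam, 0 for not spam.
# Features and values are hypothetical, chosen so the classes are linearly separable.
TRAINING_DATA = [
    ((3, 2), 1), ((4, 1), 1), ((5, 3), 1), ((2, 3), 1),
    ((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 0),
]

def predict(weights, bias, features):
    """Return 1 (spam) if the weighted score is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(data, max_epochs=1000):
    """Perceptron rule: adjust the weights only when a prediction is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in data:
            error = label - predict(weights, bias, features)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias

weights, bias = train_perceptron(TRAINING_DATA)
training_accuracy = sum(
    predict(weights, bias, features) == label for features, label in TRAINING_DATA
) / len(TRAINING_DATA)

print("learned weights:", weights, "bias:", bias)
print("training accuracy:", training_accuracy)
print("new email with 4 suspicious words and 2 links ->", predict(weights, bias, (4, 2)))
```

Note that the learned weights imitate the labels without explaining why an email is spam. The algorithm has no model of the sender or the message, which is exactly the trade-off between system imitation and system identification.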
The Bottom Line
When systems are too complex to be represented by a well-defined probabilistic model, and prediction rather than understanding is the goal, machine learning techniques such as neural networks and support vector machines may be a useful option.
For true learning, however, the scientific deductive method is still the gold standard.
i See the recent book by Vladimir Cherkassky, Predictive Learning. https://www.amazon.com/Predictive-Learning-Vladimir-Cherkassky/dp/0988986906/ref=sr_1_1?ie=UTF8&qid=1464149112&sr=8-1&keywords=vladimir+cherkassky
ii Bradford, Alina, “Deductive Reasoning vs. Inductive Reasoning,” Live Science, as of 5-19-2016, https://www.livescience.com/21569-deduction-vs-induction.html
iii Jones, Andrew Zimmerman, “Major Laws of Physics,” as of 5-19-2016, https://physics.about.com/od/physics101thebasics/p/PhysicsLaws.htm
iv Spurious correlation describes a correlation between two variables where there is no actual relationship. This is one reason for the caution “correlation does not imply causation.” See this website for some amusing examples of spurious correlations https://www.tylervigen.com/spurious-correlation