Data Dredging: Torturing the Data Until It Confesses
Data Dredging
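Data dredging is the practice of searching a dataset for statistically significant patterns without a hypothesis fixed in advance, then reporting whatever turns up as if it had been predicted. It is also known as p-hacking or a fishing expedition.
It is dangerous. It generates false positives. It makes noise look like signal.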
The Infinite Monkey Theorem
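If you give a million monkeys typewriters and enough time, one will eventually type Shakespeare.
If you test a million variables against each other, some pairs will correlate strongly by chance alone. At the conventional 0.05 significance threshold, about 1 in 20 comparisons between unrelated variables will look significant.
This is not a discovery. It is close to a mathematical certainty.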
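The arithmetic can be sketched in a few lines of Python. This is a simplified illustration, not a full statistical treatment: it assumes the tests are independent, that none of the variables is truly related, and that each test uses a 0.05 threshold. The function name is made up for this example.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests
    comes out "significant" at level `alpha` when no real effect exists."""
    # Each test stays quiet with probability (1 - alpha); all of them
    # staying quiet has probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 20, 100, 1000):
    print(f"{m:>5} tests -> {chance_of_false_positive(m):.3f}")
```

Under these assumptions, 20 tests already give roughly a 64% chance of at least one spurious hit, and 100 tests push it above 99%.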
How It Happens
A researcher collects a massive dataset. They have hundreds of variables: diet, exercise, sleep, income, shoe size, favorite color.
They do not have a specific question. They just run a program to see what correlates with cancer.
The program finds a link: "People who wear red shoes have less cancer."
The researcher writes a paper: "Red Shoes Prevent Cancer."
This is data dredging. The correlation is real in the dataset. But it is random noise: with hundreds of variables tested, some were bound to line up with cancer by chance. It is unlikely to replicate in a new group of people.
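The red-shoes scenario can be simulated. The sketch below is an illustration, not an analysis of real data: it assumes every variable, including the outcome, is independent Gaussian noise, and the names `pearson`, `noise_sample`, and `dredge_then_replicate` are invented for this example. It scans all variables for the strongest correlate of the outcome, then measures that same variable in a fresh sample.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def noise_sample(rng, num_people, num_variables):
    """Return (variables, outcome). Every column is pure noise by construction."""
    variables = [[rng.gauss(0, 1) for _ in range(num_people)]
                 for _ in range(num_variables)]
    outcome = [rng.gauss(0, 1) for _ in range(num_people)]
    return variables, outcome


def dredge_then_replicate(rng, num_people=50, num_variables=200):
    # Step 1: dredge. Scan every variable and keep the strongest correlate.
    variables, outcome = noise_sample(rng, num_people, num_variables)
    correlations = [pearson(v, outcome) for v in variables]
    best = max(range(num_variables), key=lambda i: abs(correlations[i]))
    # Step 2: replicate. Measure the same variable in a new group of people.
    new_variables, new_outcome = noise_sample(rng, num_people, num_variables)
    return best, correlations[best], pearson(new_variables[best], new_outcome)


rng = random.Random(0)  # fixed seed so the run is repeatable
best, r_found, r_new = dredge_then_replicate(rng)
print(f"variable {best}: r = {r_found:+.2f} in the original data, "
      f"{r_new:+.2f} in the new sample")
```

In a typical run the winning correlation looks impressive in the first sample and shrinks toward zero in the second, because nothing real was there to begin with.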
The Texas Sharpshooter
This is the Texas Sharpshooter Fallacy. The gunman shoots at the barn. Then he paints the target around the bullet holes.
In science, you must paint the target before you shoot. You must state your hypothesis before you look at the data.
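The difference between the two orders can be estimated by simulation. This is a rough sketch under stated assumptions: all data are independent Gaussian noise, the significance cutoff for a correlation uses the Fisher z approximation rather than an exact test, and the helper names are hypothetical. It compares a researcher who names one variable in advance with one who reports whichever variable looks best afterwards.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def critical_r(num_people, z=1.96):
    """Approximate two-sided 5% cutoff for |r| via the Fisher z-transform."""
    return math.tanh(z / math.sqrt(num_people - 3))


def false_positive_rates(rng, trials=200, num_people=30, num_variables=50):
    """Fraction of noise-only studies that report a 'significant' result."""
    cutoff = critical_r(num_people)
    prespecified_hits = posthoc_hits = 0
    for _ in range(trials):
        outcome = [rng.gauss(0, 1) for _ in range(num_people)]
        rs = [abs(pearson([rng.gauss(0, 1) for _ in range(num_people)], outcome))
              for _ in range(num_variables)]
        # Target painted first: variable 0 was named before seeing the data.
        prespecified_hits += rs[0] > cutoff
        # Target painted afterwards: report whichever variable looks best.
        posthoc_hits += max(rs) > cutoff
    return prespecified_hits / trials, posthoc_hits / trials


pre_rate, post_rate = false_positive_rates(random.Random(0))
print(f"hypothesis stated first: {pre_rate:.0%} false positives")
print(f"hypothesis chosen after: {post_rate:.0%} false positives")
```

With the hypothesis fixed in advance, the false positive rate should stay near the nominal 5%. Choosing the best of 50 variables after the fact should produce a "finding" in the large majority of studies, even though no variable is related to the outcome.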
The Diagnosis
Modern "Big Data" makes this easier than ever. We have more data than theory.
We must be skeptical of any "exploratory" finding that is presented as a conclusion. If the researchers did not predict it, it may well be a coincidence until it replicates in new data.