Data-snooping bias

Overview

In statistics, data-snooping bias is a form of statistical bias that arises from the misuse of data mining techniques and can lead to spurious results in scientific research. Although data-snooping bias can occur in any field that uses data mining, it is a particular concern in finance and medical research, both of which make heavy use of data mining techniques.

In the process of data mining, huge numbers of hypotheses about a single data set can be tested in a very short time, by exhaustively searching for combinations of variables that might show a correlation.

Because conventional tests of statistical significance are based on the probability that an observation arose by chance, it is reasonable to expect that about 5% of hypotheses with no real effect behind them will turn out to be significant at the 5% level, about 0.1% will turn out to be significant at the 0.1% level, and so on, simply by chance.

Thus, given enough hypotheses tested, it is virtually certain that some of them will appear to be highly statistically significant, even on a data set with no real correlations at all. Researchers who are using data mining techniques can be easily misled by these apparently significant results, even though they are merely chance artifacts.
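The argument above can be illustrated with a small simulation. The following is a minimal sketch, not a definitive procedure: it assumes a data set of pure noise (100 observations, 1,000 candidate variables), uses the Fisher z-transformation as an approximate significance test for each correlation, and treats the tests as independent when computing the chance of at least one false positive. All names and sizes are illustrative choices, not part of any standard method.

```python
import math
import random
from statistics import NormalDist

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Approximate two-sided p-value for a correlation r from n observations.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal when the true correlation is zero.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(0)   # fixed seed so the run is repeatable
n_obs = 100              # observations in the data set (assumed)
n_hypotheses = 1000      # candidate variables tested (assumed)
alpha = 0.05             # significance level

# Every variable is independent noise, so no real correlation exists.
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
p_values = []
for _ in range(n_hypotheses):
    predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
    p_values.append(approx_p_value(pearson_r(predictor, outcome), n_obs))

false_positives = sum(p < alpha for p in p_values)

# Chance that at least one of m independent tests is significant by luck.
prob_at_least_one = 1 - (1 - alpha) ** n_hypotheses

print("tests run:", n_hypotheses)
print("significant at the 5% level:", false_positives)
print("smallest p-value:", min(p_values))
print("P(at least one false positive):", prob_at_least_one)
```

On noise alone, the count of "significant" correlations should be close to 5% of the tests run, and the smallest p-value will usually look impressive in isolation.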

Data-snooping bias most commonly occurs when researchers have not formed a hypothesis in advance, and are therefore open to any hypothesis suggested by the data; or when researchers narrow the data used in order to reduce the probability of the sample refuting a specific hypothesis.

Examples

Example 1: Hypothesis Suggested By Data

In a list of 367 people, at least two will have the same day and month of birth, because there are only 366 possible birthdays (counting February 29). Suppose Mary and John both celebrate birthdays on August 7.

Data snooping would, by design, try to find additional similarities between Mary and John, such as:

Are they the youngest and the oldest persons in the list?
Have they met in person once? Twice? Three times?
Do their fathers have the same first name, or mothers have the same maiden name?

By going through hundreds or thousands of potential similarities between John and Mary, each having a low probability of being true, we may eventually find apparent support for virtually any hypothesis.

Perhaps John and Mary are the only two persons in the list who switched minors three times in college, a fact we found out by exhaustively comparing their life histories. Our data-snooped hypothesis can then become, "People born on August 7 have a much higher chance of switching minors more than twice in college."

Within this list, the data appears to support that correlation very strongly, since no one with a different birthday switched minors three times in college.

However, when we turn to a larger sample of the general population and attempt to reproduce the results, we find that there is no statistical correlation between August 7 birthdays and changing college minors more than twice. The "fact" exists only for a very small, specific sample, not for the public as a whole.
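The birthday example can be sketched as a simulation. This is only an illustration under stated assumptions: birthdays are drawn uniformly from 366 days, and each person carries 2,000 hypothetical yes/no traits that are each true for 5% of people and entirely independent of birthday. The trait count, the trait rate, and the size of the replication population are all made-up parameters.

```python
import random

rng = random.Random(7)   # fixed seed so the run is repeatable
DAYS = 366               # possible birthdays, counting February 29
N_PEOPLE = 367
N_TRAITS = 2000          # hypothetical yes/no facts recorded per person
TRAIT_RATE = 0.05        # each trait is true for 5% of people (assumed)

people = [
    {
        "birthday": rng.randrange(DAYS),
        "traits": [rng.random() < TRAIT_RATE for _ in range(N_TRAITS)],
    }
    for _ in range(N_PEOPLE)
]

# Pigeonhole principle: 367 people and 366 days guarantee a shared birthday.
seen = {}
pair = None
for idx, person in enumerate(people):
    day = person["birthday"]
    if day in seen:
        pair = (seen[day], idx)
        break
    seen[day] = idx

first, second = people[pair[0]], people[pair[1]]
shared_day = first["birthday"]

# Snooping: scan every trait for ones that both people happen to have.
shared_traits = [
    t for t in range(N_TRAITS)
    if first["traits"][t] and second["traits"][t]
]

# Replication: draw a fresh, larger population. Because every trait is
# generated the same way, one fresh trait stands in for any "discovered" one.
fresh = [(rng.randrange(DAYS), rng.random() < TRAIT_RATE)
         for _ in range(200_000)]
same_day = [has_trait for day, has_trait in fresh if day == shared_day]
rate_same_day = sum(same_day) / len(same_day)
rate_overall = sum(has_trait for _, has_trait in fresh) / len(fresh)

print("pair sharing a birthday:", pair, "on day", shared_day)
print("traits both people have:", len(shared_traits))
print("fresh sample, trait rate on that birthday:", rate_same_day)
print("fresh sample, trait rate overall:", rate_overall)
```

Any trait the pair shares could be turned into a hypothesis about their birthday, yet in the fresh population the trait rate on that day stays close to the overall rate, because the traits were generated without reference to birthday.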

Example 2: Narrow Sample To Match Hypothesis

Suppose medical researchers examine a pool of data representing 10,000 lung cancer patients. They want to find information suggesting that smokers who develop lung cancer have a better chance of survival than non-smokers with lung cancer.

The researchers notice that 90 percent of the patients (9,000) smoked cigarettes. They note that among the smokers, about 5 percent (450) have a specific gene mutation, and of those with the mutation, about 4 percent (18 people) went into remission without chemotherapy.

Of the 10 percent (1,000) of patients who were not smokers, a similar 5 percent (50) have the same gene mutation, and of those, 8 percent (4 people) also went into remission without chemotherapy.

The data, as it stands, suggests that non-smokers with the gene mutation are twice as likely as smokers with it to go into remission without chemotherapy. But that result is not what the researchers desire, so they reduce the sample size to 1,000 patients to see whether that produces different results.

The new sample retains the 90 percent smoker rate (900). Among the smokers it also retains the 5 percent (45) gene mutation rate and the roughly 4 percent (2 people) remission rate. Of the non-smoking patients in the new sample (100), 5 percent (5) have the same gene mutation, and none went into remission without chemotherapy.

Thus, the researchers could claim that smokers with the mutation have some chance of going into remission without chemotherapy while non-smokers with it have none, even though the smaller sample contains only five non-smokers with the mutation.

By reducing the sample size and ignoring probability, causality and statistical significance, the researchers have produced data that seems to bear out some evidence of their premise.
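The remission counts quoted in this example can be checked with a significance test. The sketch below uses Fisher's exact test, which is a choice made here for illustration and is not named in the example itself. The helper function is hand-written with the standard library rather than taken from a statistics package.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (smokers, non-smokers); columns are outcomes
    (remission, no remission). Sums the probabilities of all tables with
    the same margins that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k remissions falling in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Full data set: patients with the gene mutation only.
rate_smokers_full = 18 / 450      # smokers in remission
rate_nonsmokers_full = 4 / 50     # non-smokers in remission
p_full = fisher_exact_two_sided(18, 450 - 18, 4, 50 - 4)

# Reduced sample: patients with the gene mutation only.
rate_smokers_small = 2 / 45
rate_nonsmokers_small = 0 / 5
p_small = fisher_exact_two_sided(2, 45 - 2, 0, 5 - 0)

print("full data:      smokers", rate_smokers_full,
      "non-smokers", rate_nonsmokers_full, "p =", p_full)
print("reduced sample: smokers", rate_smokers_small,
      "non-smokers", rate_nonsmokers_small, "p =", p_small)
```

With so few remissions in either sample, neither difference reaches the conventional 5% significance level, which is the point of the example: the apparent pattern changes with the sample chosen, not with any demonstrated effect.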
