Occupy Math is going to look at a simple piece of math that is ignored or, worse, abused by researchers in many fields. It is an example of how ignorance of statistics leads to publishing results that are bogus and so impossible to replicate. This problem is called the replication crisis because many important results seem to disappear when other researchers try to reproduce them. Occasionally this is the result of actual fraud, but more often ignorance of simple facts about statistics lets you publish a paper whose results cannot be replicated (because they are actually wrong) without even noticing you’re doing it. There is also a separate problem: it is very difficult to completely describe an experiment, which means that the people trying to reproduce your results may not be doing quite the same experiment. That last is a big problem, but not what Occupy Math is looking at today.
The core message of today’s post is that peer-reviewed results in a top journal are sometimes wrong because we don’t teach statistics properly.
The starting point for this discussion is a practice called p-hacking or data dredging. When you run a statistical test on data, it often returns what is called a p-value. In everyday terms, the p-value is the chance that (in the opinion of the test you are using) chance alone could have produced data at least as extreme as yours. Let’s look at a couple of examples. Suppose you flip a coin twenty times and count the number of heads. Then, if the coin is a normal coin, you expect the number of heads to be about ten, but anything in the range 8-12 probably isn’t that surprising. If, on the other hand, you get 18 or more (or two or fewer) heads, then that is evidence that the coin may not be normal. For this situation the p-value is simply the odds that a fair coin would cough up at least as many heads as you observed. Using a binomial distribution calculator, Occupy Math computed the odds of a fair coin flipping 18 or more heads in 20 flips, and it’s 0.0201% or about 2 in 10,000. (The chance of exactly 18 heads is 0.01812%; the p-value also counts the even more extreme outcomes of 19 and 20 heads.) That 2 in 10,000 is the p-value for this situation.
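The coin-flip arithmetic above can be checked without a calculator. The sketch below is only an illustration; the function name upper_tail is made up for this example. It adds up the exact binomial probabilities for 18, 19, and 20 heads. Exactly 18 heads on its own has probability 190/2^20, about 0.018%, and including 19 and 20 heads brings the tail to 211/2^20, about 0.020%.

```python
from math import comb

def upper_tail(flips, heads, p=0.5):
    """Chance of seeing `heads` or more heads in `flips` flips of a coin
    that lands heads with probability p (exact binomial sum)."""
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

if __name__ == "__main__":
    exactly_18 = comb(20, 18) / 2**20   # only the single outcome "18 heads"
    at_least_18 = upper_tail(20, 18)    # 18, 19, or 20 heads: the one-sided p-value
    print(f"exactly 18 heads: {exactly_18:.6%}")
    print(f"18 or more heads: {at_least_18:.6%}")
```

If you also want to count two or fewer heads as equally surprising, the distribution is symmetric for a fair coin, so the two-sided figure is simply double the one-sided one.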
Another important point is that “number of heads” is just one thing you can measure. Anything you can measure generates a test with a different range of reasonable values. Another thing you could measure is the largest number of flips in a row that were the same, something Occupy Math did in Three Probability Puzzles that will Fool You For Sure. This means that there are many tests that you could use on any set of data, something that is key to the rest of the post.
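To make the "many possible measurements" point concrete, here is a minimal sketch of the longest-run measurement turned into a test. Everything in it is an assumption made for illustration: the names longest_run and run_p_value, the choice of 20 flips, and the use of simulation instead of an exact formula. It estimates how often a fair coin produces a streak at least as long as the one you observed.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical outcomes in a sequence."""
    best = current = 0
    previous = None
    for flip in flips:
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best

def run_p_value(observed, flips=20, trials=20000, seed=0):
    """Monte Carlo estimate of the chance that a fair coin, flipped `flips`
    times, shows a longest run of at least `observed`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sequence = [rng.random() < 0.5 for _ in range(flips)]
        if longest_run(sequence) >= observed:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(longest_run("HHTHHHT"))   # the streak HHH has length 3
    print(run_p_value(8))           # estimated chance of a streak of 8 or more
```

The same twenty flips can look unremarkable by head count and surprising by streak length, or the other way around, which is exactly why the choice of test matters.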
If the p-value is small enough (smaller is better), then it is very unlikely that your data resulted from a chance process and you have what is called a statistically significant result. The possibility that your data occurred by chance is called the “null hypothesis,” and when a statistical test rejects the null hypothesis, it provides evidence that your data did not occur by chance. The problem is that different tests look at different things and rejection of the null hypothesis is not absolute.
In science a p-value of p=0.05 is the biggest value that is thought to be statistically significant. Since 0.05 is one-twentieth, that standard means that, when nothing but chance is at work, about one statistical test in twenty will still reject the null hypothesis by mistake. Particle physics researchers insist on a much smaller p-value before they believe their results: p=0.0000003 or 5-sigma. With that context, let’s look at how you do p-hacking.
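For readers wondering how "5-sigma" turns into p=0.0000003, here is a small sketch. It assumes the usual convention of a one-sided tail of a standard normal distribution; the function name sigma_to_p is invented for this example.

```python
from math import erfc, sqrt

def sigma_to_p(sigma):
    """One-sided tail probability of a standard normal beyond `sigma`
    standard deviations (assumed convention for quoting sigma levels)."""
    return 0.5 * erfc(sigma / sqrt(2))

if __name__ == "__main__":
    print(f"5 sigma     -> p = {sigma_to_p(5):.2e}")
    print(f"1.645 sigma -> p = {sigma_to_p(1.645):.3f}")  # close to the 0.05 standard
```

Under that assumption, the everyday p=0.05 standard sits at well under 2 sigma, which gives a feel for how much stricter the particle physicists are.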
One of the reasons you really want a statistician at least consulting with your research team is that they know which statistical test is appropriate for different types of data and different questions about those data. Statistical tests come with a lot of warning labels — conditions under which the test will or will not work. Many researchers ignore these warning labels and just apply a familiar test. This isn’t good, but it often comes out okay.
P-hacking happens when you apply many tests looking for one with a good p-value.
Suppose you want to roll three sixes on three dice. That only happens one time in 216 — but if you keep rolling the dice over and over you will get that triple six eventually. Using a large number of statistical tests and reporting the one that gives you the coveted p<0.05 result is exactly the same thing as rolling the dice over and over. It also means that it is very likely that your result is not statistically significant. Since a lot of people pass their basic statistics course and then go back to sleep, they don’t understand that this process is both dishonest and ineffective for figuring out if they have strong results. Occupy Math was once waiting for someone in a hallway (thirty years ago) and found a manual for students in a sociology department listing statistical tests to try to “prove” a result is significant. Occupy Math makes no charge of dishonesty here — the professor who compiled this manual was almost certainly ignorant rather than intentionally teaching unethical behavior. Part of the problem is that there is incredible pressure to publish good (significant) results. This means that there is a strong incentive to p-hack your results. In the presence of this incentive people can even p-hack their data unintentionally.
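The dice-rolling comparison can be simulated. The sketch below is a simplified stand-in for real p-hacking, not a model of any particular study: it assumes a researcher takes several independent measurements on a perfectly fair coin, tests each one with an exact two-sided binomial test, and reports only the smallest p-value. All function names and parameter values are illustrative.

```python
import random
from math import comb

def two_sided_p(heads, flips):
    """Exact two-sided p-value for a fair coin: the chance of a head count
    at least as far from flips/2 as the one observed."""
    distance = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2**flips

def best_of_many(tests, flips, rng):
    """Run `tests` experiments on a fair coin and keep only the smallest p-value."""
    p_values = []
    for _ in range(tests):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        p_values.append(two_sided_p(heads, flips))
    return min(p_values)

def false_alarm_rate(tests, flips=20, trials=2000, seed=1):
    """Fraction of trials in which the best p-value dips below 0.05,
    even though the coin is fair and there is nothing to find."""
    rng = random.Random(seed)
    hits = sum(best_of_many(tests, flips, rng) < 0.05 for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for tests in (1, 5, 20):
        print(tests, "tests ->", false_alarm_rate(tests))
```

With a single pre-chosen test the false alarm rate stays under 5% (about 4% here, because head counts come in whole numbers). With twenty tries it should climb to something over half. Tests run on the same data are correlated, so real p-hacking will not match these figures exactly, but the direction of the effect is the same.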
How on earth can you unintentionally p-hack your data?
Occupy Math is on record that one of the most powerful techniques available in mathematics is to change your point of view until a problem becomes easy, or at least possible. He stands firmly by this view; it has saved his bacon many times. Math, however, is an ethereal realm in which we are not working with data; in particular, there is no “noise” in a mathematical proof, just logic. When a scientist is working with data from the real world, those data have noise and bias. It is very likely that there are viewpoints from which the natural noise in the data has a strong bias. Suppose, for example, you measure how often people are either above the speed limit or not above the speed limit. During a traffic jam, that coin flips all tails. During normal driving, that coin flips three-quarters heads. The model that says that people are trying to drive the speed limit and so are above and below it with fairly even chances is just wrong. Worse, if the radar gun you’re using is worn out, it may report the speeds with an error of plus or minus ten: noise in the data.
Let’s think for a minute about what biases and errors, some obvious and some less than obvious, can do. Statistical tests performed from the point of view that highlights a data bias can create a false appearance of statistical significance. You may report a cure for cancer when all you have really detected is a flaw in one of your measuring devices. And yet a researcher can think, in all honesty, that they have found the signal in the data that supports their hypothesis and justifies their hard work and grant funding. The search for an effective point of view permits disguised p-hacking to take place. The name data dredging, mentioned at the top of the post, is a derisive way of speaking of the search for a point of view that makes the data look good. This leads to a cardinal law of experimental design for research:
You must choose your analysis technique before you take your data!
To be more precise — the analysis technique used to test for significance must be chosen ahead of time. Occupy Math does not only work on theorem-and-proof mathematics; he also does experiments with digital evolution and these require statistical tests. Now Occupy Math slices his data many different ways and displays many points of view, but only after he has demonstrated statistical significance with a pre-chosen test. The way you can tell that you’ve got a really solid result is that all the secondary viewpoints you use to get more perspective on your results are also significant. One of many tests significant? Result is probably bogus. First pre-chosen test and several follow-up tests all significant? Result is probably real. The word “probably” is why science insists on replication of results and why the replication crisis is such a big deal.
Are there other factors that work against published results?
Of course there are; otherwise Occupy Math would not have asked the question. The first is that journals prefer to publish startling breakthroughs over routine progress. Unexpected results generate more reads and more citations, which are the currency of scientific publication. One big reason for a result to be startling is that it runs counter to the current state of scientific knowledge in some way. In other words, it disagrees with everybody. There are examples where results like this have been right, and led to Nobel Prizes, in spite of early derision and mockery. A standard example of this is Barbara McClintock’s discovery of transposable elements. For every Barbara McClintock, there are twenty completely bogus papers that were statistical flukes or (probably unintentional) p-hacks. The strong publication bias for exciting results leads to a slight bias in favor of wrong results.
The second factor, and this is a huge one, is that people cannot publish negative results. Suppose that a prominent scientist gives a lecture at a national meeting where he outlines a possible explanation for a really interesting phenomenon, a mechanism for spontaneous remission of cancer, say. Then a lot of people are going to try to test his idea (heck, that’s why he put it in the lecture: to get a lot of free help). If his idea is wrong and 37 teams start trying to validate it, then, even without p-hacking, the odds are 85% that at least one team will get a p<0.05 result. Each team working independently has a 95% chance of correctly finding nothing, and the chance that all 37 find nothing is 0.95 multiplied by itself 37 times, which is only about 15%. This is just another version of rolling the dice until you get three sixes.
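The 85% figure follows from a one-line formula, sketched below. It assumes the teams work independently and that each has exactly a 5% chance of a false positive; the function names are made up for this illustration.

```python
def chance_of_false_positive(teams, alpha=0.05):
    """Chance that at least one of `teams` independent tests of a wrong idea
    comes out significant at level alpha."""
    return 1 - (1 - alpha) ** teams

def teams_for_even_odds(alpha=0.05):
    """Smallest number of independent teams that makes a false positive
    more likely than not."""
    teams = 1
    while chance_of_false_positive(teams, alpha) <= 0.5:
        teams += 1
    return teams

if __name__ == "__main__":
    print(f"37 teams: {chance_of_false_positive(37):.1%}")
    print("teams needed for better-than-even odds:", teams_for_even_odds())
```

Under these assumptions it takes only 14 teams before a spurious "confirmation" becomes more likely than not.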
This means that a well-known, well-received, but incorrect hypothesis will almost certainly be validated in a publication that cannot be replicated!
Occupy Math hopes that this week’s post has given you some perspective on how researchers should do experiments and why they might not. Another point this post tries to make is that even publications in Science and Nature need to be taken with a grain of salt until others have found independent support for a given result. A better understanding of statistics and the parts of it that speak to experimental design would make science a lot more solid. In other words, employ a statistician and listen to him. Occupy Math’s statistical colleagues all have stories about people who did the whole experiment, produced a mare’s nest of inappropriate data, and then came to them asking for help sorting out their results. Pre-planning with a statistical professional saves time and money and improves results. Probably.
The unwillingness to publish negative results is another serious problem with the way we do science now. If negative results were available, then a lot of people wouldn’t try experiments that are a waste of time and money. The funding structure also rewards positive, surprising results. When Occupy Math was discussing this post with his editor, she observed “The incentive structure seems wrong.” She surely got that right. Do you have examples of places where a little more math would save piles of time and money and make science more effective? Do not be shy: please comment or tweet!
I hope to see you here again,
University of Guelph,
Department of Mathematics and Statistics