Statisticians to the Barricades
Taking a Stand on P Less Than 0.05
While generalizations about any group are always perilous, it is probably fair to say that statisticians are not usually thought of as prone to vociferous public protest. Yet, when recently invited to sign on to a paper protesting the way scientists decide whether a finding is “statistically significant,” more than 800 scientists whose work depends on statistical modeling rushed to join. One signatory called the commentary, which appeared in the prestigious scientific journal Nature, a “surgical strike against thoughtless testing of statistical significance.”
The problem that got all of these scientists so worked up strikes at the very heart of how scientists decide whether or not something is to be believed as “true.” And on that score, scientists and statisticians are ready to do battle against the way “truth” has been determined in science for almost a century.
It’s called “significance testing” and is expressed by something called the “p-value.” Papers published in medical and other scientific journals almost always have a statement like “the difference was statistically significant” or “the association was not significant” somewhere in the results section. And whether a finding is determined to be “significant” or not often determines whether it gets published and whether popular media report on it to the public. It may be a study testing whether a new drug works or one that examines whether there is an association between what we eat and how long we live. That “significance” standard is almost always used to decide if what has been found is a real finding or just the result of mere chance.
How do scientists and statisticians determine what is significant or not? First, they take all of the data the study has churned out and submit them to a variety of ever more complicated mathematical formulas, many of which require sophisticated software packages to compute. Then, they take the results of these calculations and use another set of equations to come up with the all-important p-value. Finally, they hold their collective breath, hoping that the fruits of all their years of work will get them the coveted p-value that is less than 0.05. So critical is that cutoff that we have seen graduate students and research fellows become deliriously happy when an experiment of theirs yields a p-value of 0.01, while others descend into utter despair if their p-value is 0.06. What in the world is all of this about?
Stick with us now for a bit of elementary statistics, which we are going to lay out as simply as we can. Many statistics books begin with the tossing of coins. We are told that if you flip an ordinary, unadulterated quarter in the air, there is an equal chance it will land head-side or tail-side up. That is, there is a 50% (or 0.5) chance it will be heads and a 50% chance it will be tails. We are also told that because each toss is an independent event, those 50-50 odds are the same every time you toss the coin.
Now let’s say I have never read a statistics textbook and I have a theory (a hypothesis) that when a coin is tossed in the air it is more likely to land heads up than tails up. I decide to do an experiment to find out if my hypothesis is true and I toss a quarter in the air. Lo and behold, it comes up heads. Can I conclude now that heads is the more likely outcome of a coin toss? Of course not, because I have only tossed the coin in the air one time, or in statistical terms, my sample size is only one (abbreviated by statisticians as n=1, where n stands for sample size). So let’s toss the coin again. It’s heads again. Two out of two times I get heads. Now can I say the expected result of most coin tosses is heads? The answer is still no, of course, because even with a sample size of two (n=2), I just haven’t tossed enough times to come to a meaningful conclusion. Getting two heads in a row is still within the realm of mere chance.
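To see why two heads in a row proves so little, here is a minimal sketch of the arithmetic, assuming a fair coin and independent tosses. The function name is ours, chosen only for illustration.

```python
def prob_all_heads(n: int, p_heads: float = 0.5) -> float:
    """Chance that n independent tosses all land heads.

    Assumes every toss has the same probability of heads (0.5 for a fair coin).
    """
    return p_heads ** n


# A fair coin gives two heads in a row one time in four,
# so n=2 cannot distinguish a fair coin from a biased one.
for n in (1, 2, 5, 10):
    print(f"n={n}: probability of all heads = {prob_all_heads(n)}")
```

Only with a longer unbroken run does the fair-coin explanation start to look strained, which is why the sample size matters so much.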
Now we all know that if we toss that quarter hundreds of times, we are going to get some heads and some tails, and two questions arise. First of all, how many times do we have to toss that coin before we can decide if heads is the most common outcome or not? And second, what if we toss that coin 100 times and we get 52 heads and 48 tails? Does that mean that we have proven that heads is more likely than tails?
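The second question can be made concrete with a short calculation. The sketch below assumes a perfectly fair coin and asks how often 100 tosses would land at least as far from an even 50-50 split as 52 heads and 48 tails. The function name and the choice of an exact two-sided binomial calculation are ours, used only for illustration.

```python
from math import comb


def two_sided_binomial_p(heads: int, tosses: int) -> float:
    """Probability, for a fair coin, of a heads count at least as far
    from tosses/2 (in either direction) as the count observed."""
    observed_gap = abs(heads - tosses / 2)
    favorable = 0
    for k in range(tosses + 1):
        if abs(k - tosses / 2) >= observed_gap:
            favorable += comb(tosses, k)  # number of sequences with k heads
    return favorable / 2 ** tosses


p = two_sided_binomial_p(52, 100)
print(f"52 heads in 100 tosses: p = {p:.3f}")
```

The result is roughly 0.76: a fair coin produces a split at least this lopsided about three times out of four, so 52 heads proves nothing about the coin.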
Fortunately, somebody must have done a lot of coin tosses sometime in the distant past. We’ll invent Dr. Smith. She decided to do 100 tosses 10 separate times and record the results for each of the ten experimental trials. The results varied from one trial to the next. One time it was 53 heads and 47 tails, another time it was 51 tails and 49 heads, and yet another it was a wild 56 tails and 44 heads. Dr. Smith was at first perplexed, but she then decided to just add up all the heads and tails across all 1000 times she tossed the coin and…you know the result. Five hundred heads and five hundred tails. She concludes, correctly, that while there is random variation from trial to trial, if you toss coins enough times you discover that heads and tails are equally likely. So Dr. Smith has shown that my hypothesis is totally incorrect: heads are not more likely than tails.
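Anyone can repeat Dr. Smith’s experiment without a pocketful of quarters. Here is a minimal simulation sketch, assuming a fair coin; the function name and the fixed random seed are our own arbitrary choices, included only so the run is repeatable.

```python
import random


def run_trials(n_trials: int = 10, tosses_per_trial: int = 100, seed: int = 0):
    """Return the number of heads seen in each trial of a simulated fair coin."""
    rng = random.Random(seed)  # arbitrary fixed seed for a repeatable run
    return [
        sum(rng.random() < 0.5 for _ in range(tosses_per_trial))
        for _ in range(n_trials)
    ]


heads_per_trial = run_trials()
total_heads = sum(heads_per_trial)
total_tosses = 10 * 100

print("heads in each trial of 100:", heads_per_trial)
print("overall share of heads:", total_heads / total_tosses)
```

A simulated run will rarely land on exactly 500 heads the way our tidy fictional example does, but the overall share settles close to 0.5 even though the individual trials bounce around.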
The result of all of this experimenting by Dr. Smith is that we can make a very simple mathematical statement about coin tosses. What is the probability of getting heads each time we flip a coin? It is 0.5, and if we use “p” to stand for “probability” we can say with great confidence that in the case of getting heads when we toss a quarter in the air, p=0.5. That means that thanks to Dr. Smith, if you are dying to know what the probability of getting heads is when you flip a coin, you don’t have to toss it one thousand times to find out. You can just look it up; in this case, it is a certain and reliable p=0.5.
Why is everyone so preoccupied with coin tosses? Nobody’s health or safety depends on whether or not p=0.5 for coin tosses (pace athletes who flip coins to decide who goes first). The reason is simply that it is so easy to explain probability using this example and also so uncontroversial. Anyone can do this experiment, and no one doubts it.
Science is Messy
When we get to science that matters, of course, it gets more complicated and a lot messier. We easily dismissed the hypothesis that heads are more likely than tails. What happens, however, when we want to decide if a new drug works better than an identically appearing sugar pill (placebo) to treat patients with a medical condition, let’s say chronic fatigue syndrome (CFS; also known as myalgic encephalomyelitis)? Once again, if we give the new drug to one person and she gets better, it’s pretty obvious that this n=1 experiment tells us very little. Maybe she would have gotten better on placebo too. And the new drug has a few adverse side effects, so we want to be pretty sure it really works before we recommend it.
We have to do something much more extensive, analogous to tossing the coin a hundred times. In this fictional example, we will select 100 people with CFS and randomly assign 50 of them to get the new drug and 50 to get placebo. After 12 weeks of taking drug or placebo, we see who is better and who isn’t. To make things simpler than real life, we will assume that none of these 100 people drop out of the study before the 12-week endpoint.
If the new drug is really no better than placebo, we would expect the response rates to be close to equal. If one group has 26/50 responders and the other 25/50, common sense tells us that difference is so small it is unlikely the drug is really any better than a placebo. If on the other hand one group has 50 out of 50 responders and the other group has 2 out of 50 responders, we hardly need fancy statistics to conclude that 50 is so much more than 2 it can’t just be chance; there must be a real difference between drug and placebo.
Unfortunately, nature is rarely so accommodating as to give us such clear-cut outcomes. In fact, in this hypothetical experiment, our actual results are that in the new drug group, 32 of the 50 people (64%) with CFS are better at 12 weeks and in the placebo group 25 of the 50 (50%) people with CFS are better. Since 32 is a bigger number than 25, we might conclude that the drug is indeed better than placebo and start recommending that people with CFS try it.
But remember that with the coin toss example there was variability in the results each time we did 100 coin tosses. What if we repeat the experiment with the new drug and placebo on another 100 patients with CFS and this time we get different results? People with CFS are interested in getting a treatment that works for them as soon as possible and hope that it won’t take multiple experiments with thousands of patients to figure out if the new drug works.
This is where statistics becomes so useful. Without going into the technical details, we can apply statistical formulas to our drug study outcome data and calculate how likely it is that chance alone would produce a difference at least as large as the one we found if the drug were really no better than placebo. In the 1920s, the great statistician R.A. Fisher suggested setting the standard for statistical significance at a level of probability (or p) that is less than 0.05. A p-value less than 0.05 (usually written as p<0.05) means that, if there were truly no difference, a result at least this extreme would turn up by chance less than 5% of the time.
Returning to our new drug for CFS example, if 32/50 patients respond to the new drug versus 25/50 to placebo, the p-value is just a hair less than 0.05. With p<0.05, we declare statistical significance, write up our results, and publish them in a prestigious journal. Then, The New York Times and Washington Post write stories about the new drug that has just been shown to cure CFS. Doctors begin prescribing the new drug and a pharmaceutical company makes a fortune.
Step back for a second, however, and consider what would have happened if one fewer person had responded to the new drug. If only 31/50 people responded to the new drug and 25/50 to placebo, the p-value turns out to be 0.09. Just one fewer responder changes the conclusion from “the result is statistically significant” to “the result is not statistically significant.” Now, it is regarded as a “negative” study, the paper about the study cannot be published in a top-tier journal, newspapers do not notice, and the drug never hits the market.
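For readers who want to check the arithmetic, here is a minimal sketch that reproduces both p-values. It rests on an assumption of ours about how the figures were derived: the drug group’s responders and non-responders are compared with the 25 and 25 expected at the placebo group’s 50% response rate, using a chi-square test with one degree of freedom. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_p_vs_expected(responders: int, group_size: int,
                             expected_rate: float = 0.5) -> float:
    """One-degree-of-freedom chi-square test of one group's responder and
    non-responder counts against a fixed expected response rate."""
    expected_yes = group_size * expected_rate
    expected_no = group_size - expected_yes
    observed_no = group_size - responders
    chi2 = ((responders - expected_yes) ** 2 / expected_yes
            + (observed_no - expected_no) ** 2 / expected_no)
    # With one degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


for responders in (32, 31):
    p = chi_square_p_vs_expected(responders, 50)
    print(f"{responders}/50 responders on drug: p = {p:.3f}")
```

With 32 responders the p-value is about 0.048, and with 31 it is about 0.090. A test that also allowed for sampling variability in the placebo group would give larger p-values, so the figures are best read as an illustration of how sharply the verdict flips at the threshold.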
Here, then, we begin to see what has upset the statisticians so much. Somehow, the p<0.05 standard got to be regarded as a kind of magical sign that a result is true. Our CFS study example, however, shows that this is really kind of ridiculous. A drug can’t go from working to not working just because one fewer person out of 50 responded to it. Yet that is exactly what scientists who write papers and journal editors who decide whether to publish them have used as their guide for almost a century. And why did R. A. Fisher pick 0.05 in the first place? It turns out that decision just felt right to him at the time. What difference does it make whether chance alone would produce such a finding 5% of the time or 6% of the time? And yet, ever since Fisher’s pronouncement, which he himself warned should not be taken as gospel, if the results of an experiment give p=0.06 or p=0.07 or even p=0.051, the finding is said to be “not statistically significant.”
More than a decade ago, physician-scientist John Ioannidis wrote in a famous article titled “Why Most Published Research Findings Are False” that “Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values.”
“Let’s be clear about what we must stop:” the Nature article signed by more than 800 scientists declares, “we should never conclude that there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05.”
A real example of this problem can be seen in the much-discussed recent study supposedly showing that eating eggs increases the risk for heart disease and death. Many who read about the study immediately complained about it. First, they pointed out, we were told that because their yolks are high in cholesterol, we shouldn’t eat too many eggs or we will clog up our arteries and get heart attacks and strokes. Then, for the past few years, new studies supposedly showed that isn’t correct and in fact there is no relationship between how much cholesterol is in the food we eat and risk for heart attacks and strokes. But the paper in the Journal of the American Medical Association that appeared two months ago states “higher consumption of dietary cholesterol or eggs was significantly associated with a higher risk of CVD [cardiovascular disease] and all-cause mortality, in a dose-response manner” (p. 1088). There is that word “significantly” again. Indeed, the risk for dying did increase the more eggs the people in the study reported eating, and the statistical test of the association between egg consumption and heart disease and death did once again make it below that magical p<0.05 cutoff.
There are many reasons to be cautious when interpreting the study, however. Here’s one to consider before you stop eating eggs: each additional half egg eaten per day was associated with a risk of death only 1.93% higher. So let’s say 500 people who ate one egg a day died. That goes up to about 510 if they had eaten one and a half eggs a day. However, when the other sources of dietary cholesterol, like red meat, are taken into account, the association between egg consumption and death no longer even reached the p<0.05 level. Now it isn’t so clear exactly how dangerous eating eggs really is, especially when we keep in mind that there are other studies that find no relationship between eating eggs and bad outcomes.
Our made-up CFS drug study and the real-life egg study tell us that declaring something “significant” because the p-value is less than 0.05 can give us a false sense of confidence that absolute truth has been discovered. Like the rest of us, scientists and scientific journal editors do not like uncertainty. We all want to be able to say that a research study, especially one that is expensive and involves many volunteers, either does or doesn’t prove something to be true. And over the years since R.A. Fisher picked 0.05 as the standard for significance, many have assumed that we actually have a test for truth.
In fact, as we see in our CFS example, the mere difference of one person out of fifty responding or not responding to medication changes whether or not that 0.05 threshold is met. And in the egg study, it is really impossible to say with any certainty whether eating eggs is really risky enough to make anyone stop eating them.
Statisticians have developed other ways to evaluate whether the results of an experiment are likely to be important. We won’t go into the details now, but they are based on assessing the size of the finding or how large the difference is between the outcomes of two or more groups in an experiment. These methods, which include things like odds and risk ratios, absolute differences, and number needed to treat or harm, are used in conjunction with p-values to get a much more nuanced and accurate idea of the significance of findings.
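As a rough illustration, here is how a few of these measures could be computed for our hypothetical CFS study (32 of 50 responders on the drug, 25 of 50 on placebo). This is a minimal sketch: the function and variable names are ours, and the ratios here compare rates of a good outcome (response) rather than of a harm.

```python
def effect_sizes(drug_yes: int, drug_n: int, placebo_yes: int, placebo_n: int) -> dict:
    """Simple effect-size measures for a two-group responder analysis."""
    drug_rate = drug_yes / drug_n
    placebo_rate = placebo_yes / placebo_n
    absolute_difference = drug_rate - placebo_rate
    drug_odds = drug_yes / (drug_n - drug_yes)
    placebo_odds = placebo_yes / (placebo_n - placebo_yes)
    return {
        "absolute_difference": absolute_difference,
        "risk_ratio": drug_rate / placebo_rate,
        "odds_ratio": drug_odds / placebo_odds,
        # Patients who must be treated for one extra responder.
        "number_needed_to_treat": 1 / absolute_difference,
    }


for name, value in effect_sizes(32, 50, 25, 50).items():
    print(f"{name}: {value:.2f}")
```

On these made-up numbers the drug adds 14 percentage points to the response rate, which works out to about seven patients treated for each additional responder. That kind of figure says something about how much a finding matters, which a p-value alone cannot.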
And so the Nature article scientists conclude “we must learn to embrace uncertainty.” That is not an easy thing for scientists or any of us to accept. We understandably wish that every experiment gave us a solid answer, like p=0.5 for coin tosses. Unfortunately, almost no single study of importance ever gives us a definite answer. The best we can say about the egg situation is that eating a lot of them might be risky for some people who also eat a lot of other high-cholesterol foods and have other risk factors for heart disease. Maybe a three-egg omelet every day is too much, but a couple of eggs once or twice a week perhaps doesn’t increase risk enough to matter. And that is where we will have to leave it until the next study looking at egg consumption and heart disease risks comes out and either clarifies or muddles the situation further.
A great statistician started the whole 0.05 thing, and now eminent statisticians are telling us we are taking it much too seriously. The poet Walt Whitman understood this when he said, “I like the scientific spirit—the holding off, the being sure but not too sure, the willingness to surrender ideas when the evidence is against them.”
 The p-values in this purely hypothetical example, for those who are interested, are calculated from simple chi-square tests with one degree of freedom that compare the drug group’s responders and non-responders with the 50% response rate seen on placebo.
 Walt Whitman as quoted in Wineapple B: “I have let Whitman alone.” New York Review of Books, April 18, 2019, p. 20.