The hunt for p-values less than 0.05 has left many of science’s roadways riddled with potholes.

More than half of the statistically significant findings in the biomedical literature are wrong, according to one reckoning; maybe 30% or more wherever significance testing is used, according to another. It’s high, but we can’t know for sure how high, according to a third. It’s definitely part of the reason more than half of studies in psychology can’t be replicated. And the lowest estimate of wrongness I’ve seen is 14% in biomedical research – which is still pretty high. (Details **here**, **here**, and **here**.)

The American Statistical Association (ASA) is trying to stem the tide of misuse and misinterpretation. They issued a **statement on p-values** this year. It’s the first time they ever took a position on a statistical practice. They did so because, they said, it’s an important cause of science’s reproducibility crisis.

Perhaps the ASA’s intervention will help stop the p-value’s seemingly unstoppable advance in the sciences. In psychology, according to a study by Hubbard and Ryan **in 2000**, statistical hypothesis testing – testing that calculates statistical significance – was being used by around 17% of studies in major journals in the 1920s. It pretty much exploded in the 1940s and 1950s, spreading to 85% of studies by 1960, and passing 90% in the 1970s.

How did this get so out of hand? Hubbard and Ryan argue it’s because the p-value was simple and appealing for researchers, and there was “widespread unawareness” of the test’s limitations. There is no simple alternative to replace it with, and some argued for it fiercely. So it was easier to let it take over than fight it. Hubbard and Ryan call out “the failure of professional statisticians to effectively assist in debunking the appeal of these tests”.

The ASA takes a similar line: the “bright line” of p-values at 0.05 is taught because the scientific community and journals use it so much. And they use it so much because that’s what they’re taught.

The result is a bumpy ride in the literature. Here’s my choice of the top 5 things to keep in mind to avoid p-value potholes.

**1. “Significant” in “statistically significant” doesn’t mean “important”.**

You can have a statistically significant p-value of an utterly trivial difference – say, getting better from a week-long cold 10 minutes faster. You could call that “a statistically significant difference”, but it’s no reason to be impressed.

Back in Shakespeare’s day, **significance** still didn’t have the connotation of importance. “Signify” only referred to meaning something. And as **Regina Nuzzo explains**, that’s all the developers of these tests meant in the 1920s, too: a p-value less than 0.05 just signified a result worth studying further.

**2. A p-value is only a piece of a puzzle: it cannot prove whether a hypothesis is true or not.**

This gets to the heart of misuse of p-values. The **ASA statement** could not be clearer on this:

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

A statistical hypothesis test is measuring an actual result against a theoretical expectation. It can’t know if the hypothesis is true or not: it just assumes that it is *not* true (a null hypothesis). And then it measures whether or not the result is far enough away from this theoretical null to be worth more attention.
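To make that concrete, here’s a minimal sketch (with made-up data) of what such a test actually computes. This uses a permutation test, one simple way to build the “theoretical null” by shuffling:

```python
import random
import statistics

# Hypothetical data: two groups whose means differ slightly.
random.seed(1)
group_a = [random.gauss(10.0, 2.0) for _ in range(50)]
group_b = [random.gauss(10.5, 2.0) for _ in range(50)]
observed = statistics.mean(group_b) - statistics.mean(group_a)

# Permutation test: assume the null (no difference) and shuffle the labels.
# The p-value answers only: "if there were no real difference, how often
# would a gap at least this big appear by chance?" Nothing more than that.
pooled = group_a + group_b
trials = 2000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[50:]) - statistics.mean(pooled[:50])
    if abs(diff) >= abs(observed):
        count += 1
p_value = count / trials
print(f"observed difference: {observed:.2f}, p = {p_value:.3f}")
```

Note what the number is *not*: it says nothing about the probability that the hypothesis itself is true.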

Steve Goodman says the most common misconception about a p-value of 0.05 is that “the null hypothesis has only a 5% chance of being true” [**PDF**]. You can’t rest a case on it alone, that’s for sure. It’s hard to even nail exactly what the p-value’s implications actually are – which doesn’t mean that it’s always useless [**PDF**]. So where does that leave us?

There isn’t a simple answer. The best ways to analyze data are specific to the situation. But in general, you need to be looking for several things:

*Methodological quality of the research:* No amount of statistics in the world can make up for a study that is the wrong design for the question you care about – or that’s poorly done. What or who is included can skew the value of the results too, if you want to apply the knowledge to another situation.

*Effect size:* You need to understand exactly what is being measured and how big the apparent effect is to be able to get a result into perspective.

*Understanding the uncertainty:* If there’s a p-value, you need to know exactly what it was, not only that it is under 0.05 – is it *just* under, or are there more zeros? (Or how much it is over 0.05.) But even that’s not enough. In fact, you don’t really need the p-value. You need better ways to understand the uncertainty of the estimate: and that means standard deviations, margin of error, or confidence/credible intervals. (More on this **here**, **here**, and **here**.)

*More than one study:* Certainty doesn’t come from a one-off – and especially not from a surprising one. This is why we need systematic reviews and meta-analysis. (More on that **here**, and problems to look out for **here**.)

Some argue, on the other hand, that “the answer” is simply to have a more stringent level of statistical significance than 0.05 (which is the 95% mark, or 1 in 20). Particle physicists have taken this the furthest, expecting to get a **5 sigma result** (and at least twice) before being sure. In p-value terms, that would be **0.0000003**, or 1 in 3.5 million.
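The sigma-to-p conversion is just the tail area of a normal distribution. A quick sketch (assuming the one-sided convention physicists use for discovery claims):

```python
import math

def tail_p(z: float) -> float:
    """One-sided tail probability of a standard normal beyond z 'sigmas'."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for sigma in (2, 3, 5):
    p = tail_p(sigma)
    print(f"{sigma} sigma -> p = {p:.1e} (about 1 in {1/p:,.0f})")
# 5 sigma works out to roughly 3e-7 -- about 1 in 3.5 million.
```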

Very high levels are going to be unachievable for many kinds of research. For something as complicated as the effects of a drug on human beings, the feasible number of participants in a study wouldn’t come close to delivering that much certainty anyway. That said, Goodman is part of a **large group** now arguing for lowering the threshold to <0.005 – at least for claims of new effects.

Bayesian statistics offer more options, with the ability to incorporate what’s known about the probability of a hypothesis being true into analysis. (More on that **here**.)

**3. More is not necessarily better: more questions or bigger datasets increase the chances of p-value potholes.**

The more common an event is, the more likely it is to reach p <0.05 in a bigger dataset. That’s good news for not throwing babies out with bathwater. But it’s bad news for fishing out more false alarms and trivial differences.

The more tests are run on a data set, the higher the risk of p-value false alarms gets. There are tests to try to account for this. (More on this **here**.)
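A toy simulation of that risk: under a true null, p-values are uniformly distributed, so about 1 in 20 tests run on pure noise will cross the 0.05 line. (All numbers here are made up by construction.)

```python
import random

random.seed(42)

def fake_p_value() -> float:
    # Under a true null hypothesis, p-values are uniform on [0, 1],
    # so drawing a uniform random number stands in for a null test.
    return random.random()

n_tests = 1000
false_alarms = sum(1 for _ in range(n_tests) if fake_p_value() < 0.05)
print(f"{false_alarms} of {n_tests} null tests came out 'significant'")
# Expect roughly 50: about 1 in 20 tests on noise crosses the line.
```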

An alternative here is for researchers to use the **false discovery rate** (FDR), which is one way of trying to achieve what people think the test for statistical significance does. That said, Andrew Gelman described the FDR as just “trying to make the Bayesian omelette without breaking the eggs” [**PDF**].
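For illustration, here’s a minimal sketch of one standard FDR procedure, Benjamini–Hochberg, run on hypothetical p-values (the procedure is standard; the numbers are invented):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of p-values declared discoveries at FDR level q."""
    m = len(p_values)
    # Rank indices by ascending p-value.
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1
    # Find the largest rank k with p_(k) <= q * k / m.
    for rank, i in enumerate(ranked, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return sorted(ranked[:cutoff]) if cutoff > 0 else []

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(p_vals))  # -> [0, 1]: only the first two survive
```

Note how three of the raw p-values are under 0.05, but only two survive the FDR correction.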

As if this whole road isn’t already hard enough to negotiate, an awful lot of researchers are putting a lot of effort into digging potholes for the rest of us. It’s so common that many don’t even realize what they are doing is wrong.

It’s called p-hacking or data-dredging: hunting for p-values <0.05 in any dataset without a justified, pre-specified rationale for it – and without safeguards and being open about what they’ve analyzed. More on that and post hoc analyses **here**. Christie Aschwanden and FiveThirtyEight have provided a **great interactive tool** for you to see how you can p-hack your way to glory, too.

**4. A p-value higher than 0.05 could be an absence of evidence – not evidence of absence.**

This one is tricky terrain, too. People often choose the statistical hypothesis test as their main analysis – but then want to have their cake and eat it too if the result isn’t statistically significant. Matthew Hankins nails this practice of trying to disguise a non-statistically-significant result “as something more interesting” **here**.

On the other hand, if it’s important or possible that something is making a difference, you need something stronger than non-significance in a single study too (especially if it’s a small one). More on what it takes to be convincing about this **here**.

**5. Some potholes are deliberately hidden: shining the light only on p’s less than 0.05.**

This is called selective outcome reporting, and it adds outcome reporting bias – a “minus” for a study’s overall reliability. You can see it often in **abstracts**, where p-values <0.05 for factors that weren’t even the study’s primary question are highlighted – and the fact that the primary question came up short isn’t even mentioned.

In the biomedical literature, the number of studies reporting p-values in the abstract rose from **7% in 1990 to over 15% in 2014**, almost always claiming at least one p-value below 0.05 – and that’s not a good sign.

This is not always easy to spot, as researchers sometimes go to considerable lengths to hide it. Clinical trial hypotheses and planned outcome assessment are meant to be published before the trial is done and analyzed to prevent this – as well as make it obvious when a trial’s results are not published at all. (More on this at **All Trials**.) But even that isn’t enough to end the practice of biased selective reporting.

Ben Goldacre and colleagues from Oxford’s Centre for Evidence-Based Medicine are systematically studying and calling out outcome-switching in clinical trials. You can read more about this at **COMPARE**.

Pre-registering research is spreading from trials into other fields as well: read more about the campaign for **registered reports**. You can see the impact of unpublished negative results in psychology in an example in **Neuroskeptic’s post** this week.

In many ways, once you get the hang of it, the types of potholes that are out in plain sight are easier to handle – just like potholes in real life.

But especially because so many are hidden, it’s better to always go slowly and look for more solid roadway.

Wherever there are p-values, there can always be potholes.

~~~~

*And what are they measuring? Check out my 6 Tips for Deciphering Outcomes in Health Studies.*

*[Update] On 3 December 2017, I added a sentence on the support for a p-value of <0.005, published by David Benjamin and colleagues in September 2017.*

*More Absolutely Maybe posts related to statistical significance:*

**Statistical Significance and Its Part in Science’s Downfalls**

**Mind Your “ps”, RRs and NNTs: On Good Statistics Behavior**

**Biomedical Research: Believe It or Not?**

*And all Absolutely Maybe listicles.*

*[Update]: On 27 April I changed “measuring an actual result against a theoretical set of data” to “against a theoretical expectation”, after a comment by Doug Fletcher. It’s hard to come up with a description or conceptual analogy that works if you take it literally, but I hope this is better. (Thanks, Doug!) And I revisited this after a comment by Steve Taylor, and simplified the following sentence: instead of saying “It doesn’t know if the hypothesis it is ‘testing’ is true or not”, it then said “It can’t know if the hypothesis is true”. (Thanks, Steve!)*

*The cartoons are my own (CC-NC-ND-SA license). (More cartoons at Statistically Funny.)*

*The real “signifying nothing” speech is from Shakespeare’s Macbeth.*

*The road with potholes at the top of this post is in Iceland: the photo was taken by Hansueli Krapf (via Wikimedia Commons).*

*The flooded road with a car caught by a covered pothole is in Russia: the photo was taken by Ilya Plekhanov (via Wikimedia Commons).*

*The thoughts Hilda Bastian expresses here at* Absolutely Maybe *are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.*


P is for pandemonium. And a bit of that broke out recently when a psychology journal banned p-values and more, declaring the whole process of significance testing “invalid”.

There’s a good roundup of views about this development from statisticians over at the **Royal Statistical Society**.

Meanwhile, the **American Statistical Association** urged the journal’s editors – and anyone else who is concerned – to wait for their upcoming statement on p-values and drawing inferences. Two dozen “distinguished statistical professionals” are at work on it. It’ll be a page-turner for sure!

It’s a culture shock for many that statistical significance does not prove or disprove a hypothesis. Many of us were taught that it does. But it feels kind of silly once you think about it. How could there be a single statistical test that could deliver all that? (If you’re interested in reading more about hypothesis testing, **I wrote about it here**.)

There’s no doubt that statistical significance is widely misunderstood and misused. But to go from treating it like it’s infallible to banning it totally sure feels extreme. If we stopped using every statistic that’s widely misunderstood, there wouldn’t be much left!

We have a similar situation with relative risk (RR) increases/reductions. It’s a statistic that we encounter from every direction in daily life: “Our Biggest Ever Sale – 50% off storewide!”

I called it risk’s magnifying glass, and being careful when you use or encounter it was my choice for number 1 in a **post on keeping risks in perspective**.

You need to be careful with how you use and react to RRs, but the RR is an essential statistical tool. For one thing, it’s a key way to calculate an estimate of risk for people.

Let’s use a totally made up example (reduced to just the risk bit of the data). Say your risk of having a heart attack is 0.1% and mine is 10%. If a study showed that by knitting for half an hour a day, people could on average reduce their risk of a heart attack by 50%, for you that would mean your absolute risk (AR) dropping by half, from 0.1% to 0.05%. But mine could drop by a full 5 percentage points.

Or, to put just the AR reduction part of the information into natural frequencies: My risk of having a heart attack would be estimated to be about 10 out of 100 without knitting. With daily knitting, my chances of having a heart attack might go down to about 5 out of 100.
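That arithmetic, sketched in code (all numbers made up, as in the example above):

```python
def absolute_risk_reduction(baseline_risk: float, rrr: float) -> float:
    """Absolute drop in risk, given a baseline risk and a relative risk reduction."""
    return baseline_risk * rrr

rrr = 0.5  # the hypothetical 50% relative reduction from daily knitting
for baseline in (0.001, 0.10):  # your 0.1% baseline risk vs my 10%
    arr = absolute_risk_reduction(baseline, rrr)
    print(f"baseline {baseline:.1%}: risk falls to {baseline - arr:.2%} "
          f"(absolute reduction {arr:.2%})")
# Same relative risk reduction, very different absolute effect:
# 0.1% baseline -> down to 0.05%; 10% baseline -> down to 5%.
```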

Without using the RR, I could only know the average for people who might not be at all like me. If the studies only included people with a much lower risk than me, and all I heard was their small AR reduction, I would think it wouldn’t make much difference to me.

Yet, just because it can be misused, the RR gets vilified wholesale by many people. Some even advise people to disregard any clinical research that even *mentions* the relative risk. RRs should be available – just used properly, with enough data there to be able to keep things in perspective.

But here’s the rub: how do we keep perspective? How can we communicate statistical results in a way that people can understand them well enough for them to be helpful?

People want easy, one-size-fits-all answers to do this. That same urgent desire for something simple and quick to grasp is exactly what got us into this mess with p-values. Null hypothesis testing **“opened up the arcane domain of statistical calculation”** by providing an apparent shortcut to “the truth”.

Which brings me to another statistic that seems simple to many people, too: the NNT. That stands for “number needed to treat”, but you can also use it for other purposes (like number needed to harm, or **number needed to screen**).

NNTs have been used to communicate results from clinical trials since the late 1980s, although there’s always been a lot of pushback about them, too. You get an NNT by taking the average absolute risk reduction (ARR) (that 5% drop I got in my heart attack risk by knitting) and inverting it (1/ARR).

In this case, that’s 1 divided by 0.05 (5%): an NNT of 20. It’s a way to always get to an expression, “1 in x”. It’s called “number needed to treat” because it’s framed along the lines of, you need to treat 20 people to prevent 1 heart attack. A population perspective on “1 in x” for clinicians.
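That calculation, as a tiny sketch (using the made-up heart-attack numbers from above):

```python
def nnt(arr: float) -> float:
    """Number needed to treat = 1 / absolute risk reduction."""
    if arr <= 0:
        raise ValueError("NNT is undefined when there is no risk reduction")
    return 1 / arr

# The 5% (0.05) absolute drop in my heart-attack risk -> an NNT of 20.
print(nnt(0.05))
```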

This means you end up with a collection of “1 in x”s for **binary outcomes** (yes/no answers), but not for continuous ones, like blood pressure changes. The “x” changes from outcome to outcome for the same trial(s) – see for **example here**: the multiple results of possible benefits and harms end up 1 in 14 this, 1 in 43 that, 1 in 28 the other, and so on.

The level of certainty **won’t be the same** for these different outcomes – different amounts of data usually apply to individual outcomes. That’s usually stripped away, though, when people communicate the NNT. I’ve **explained here** why explaining the range around any outcome is important. You can do that with confidence intervals around NNTs – indeed, **Doug Altman recommended** we always do it.

But he also said it’s difficult to explain when the result isn’t statistically significant. I don’t think it’s difficult: I think an intelligible explanation is well nigh impossible. Because the NNT inverts the ARR, instead of describing it directly, one of the ends of the range is infinity (**Altman explains why**).
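A sketch of why, with hypothetical numbers: invert the ends of an ARR confidence interval that crosses zero, and the NNT interval has to pass through infinity:

```python
# A non-significant result: the ARR confidence interval includes 0.
arr_low, arr_high = -0.01, 0.03   # made-up interval, from -1% (harm) to +3% (benefit)

nnt_benefit = 1 / arr_high        # ~33: number needed to treat for one to benefit
nnt_harm = 1 / abs(arr_low)       # ~100: number needed to treat for one to be harmed
print(f"NNT interval: benefit {nnt_benefit:.0f} ... infinity ... harm {nnt_harm:.0f}")
# Because 1/x blows up as the ARR approaches 0, the interval runs from an
# NNT-to-benefit of about 33 out through infinity and back to an
# NNT-to-harm of about 100 -- which is why it defies a simple explanation.
```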

In practice, the NNT usually comes with the impression of randomness – **like this graphic**. As though health care is always a lottery, and people’s adherence and responses to treatment don’t vary. We’ll come back to this later.

This week, I wrote **a short post at MedPage Today** calling the NNT an overhyped and confusing statistic that’s not suitable for communicating with patients and consumers. David Newman, from **thennt.com**, replied **there in defense** of its value in the wider picture of communicating with patients.

How did I arrive at my conclusion? Let’s start with what we’re trying to achieve. Let’s assume we’re *not* trying to persuade or dissuade people from using a treatment because of our own interest or ideological position.

Then we want people to be able to:

- interpret data correctly
- get their situation and choices in perspective
- understand what’s involved to achieve their goal
- make decisions in accordance with their own values.

Is communicating NNTs the best basis for achieving understanding and good perspective?

There are several ways we can go about getting a handle on answers to that with research. We can do comparative studies the same way we do for other interventions. That has particular limitations in communication research, though, going well beyond the likelihood that the people involved are even more unrepresentative than usual. (Participation rates tend to be low, and a high rate of dropouts is common.)

The studies tend to be in very artificial circumstances: the situation tends to be hypothetical, so the same cognitive burdens from stress don’t apply. They tend to focus on exemplar statements with only 1 or 2 numbers – whereas in real life there are multiple pieces of data to juggle, and that in the context of a lot of other information that could be new:

*There’s a chance it could affect liver function? What’s that? What could that do to me and what should I do if it happens? Do you mean I take two tablets a day with one meal, or one tablet with two meals? Before or after?…*

Then there’s the quality of both the NNT intervention and what it’s compared with. That’s far from uniform. And you don’t want people adding lots of confounders – it doesn’t help us, for example, to pit a simple pictorial rendition of an NNT against a convoluted, wordy, number-packed written description of another statistic.

It’s not easy to make communication interventions comparable. Just because someone is doing a study on communication, you can’t rest assured they’re communication ninjas.

With those provisos in mind, let’s look at the randomized studies of the “1 in x”/NNT mode of presenting risks. For the question of accurately interpreting numbers, I’ve kept “1 in x” and NNT trials as one category.

The wording used in the interventions varies so much between the trials that they didn’t end up looking to me like very distinctly different ways of presenting numbers. People can **experience the words differently**, though, with the flipped perspective involved. If the “1 in x” trials were excluded, the numbers would drop, but still be substantial.

There’s no good up-to-date systematic review on this. The best review is by **Elie Akl and colleagues**. They concluded that ARs are better understood than NNTs, with no difference between consumers and health professionals. But their last search for relevant studies was in 2007.

There have now been several different groups of researchers, studying people in different countries, in different ways, and the direction of the findings stays bad news for the NNT/”1 in x”. I found 11 reports of trials among patients or consumers, with 26,879 participants, that have results relevant to understandability. I think it’s strong evidence now: NNTs are harder to understand than other statistics you can use to communicate the results of trials.

Details on the batches of trials I’ll be discussing, including links, are at the end of the post. But here’s a shorter overview. The first batch includes ones I could find that I believe meet the criteria of the Akl review. Except for studies with medical students, which I’ve kept separate. Akl defines them as consumers, but I don’t think they fit together.

I found 6 trials in this first batch, involving 20,419 people (mostly from one study with 16,133 people). None involve the stress of real life decisions (although there are some with clinic patients), and none with more than 2 data cases. The participants skewed predominantly to the well-educated, self-selecting to participate in a communication study. I don’t think we could expect performance as good as this in real life.

In no trial did NNTs get the best result for the questions about interpretation and understanding. In some comparisons, NNTs were no worse than some or all other options, and, rarely, they edged out another statistic. But the NNT is definitely at the bottom of the class, scoring very badly sometimes even on the least challenging multiple choice questions possible.

When people were asked about their preferences, only a small minority gave the NNT first place.

Was there anything else that gave a consistently stellar performance? No. On balance, though, everything else was better than the NNT/”1 in x”.

Batch 2 includes 2 studies with a total of 265 preclinical/medical students. One of the studies concluded that the medical students polarized around the question of whether the NNT was useful. They had already been engaged with the NNT in their professional training.

Batch 3 includes studies that compare NNTs with a time-to-event measure (but from the same clinical trial data). The events were hip fractures and heart attacks. To me, these are relevant to the question of which statistics from trials are best for conveying the chances of benefit to patients, even though they don’t seem to be comparisons of interest for the Akl review.

There were 3 of these studies, with 4,890 people – all in Nordic countries. The results here were more mixed, more complicated, and included little data on accuracy of interpretation. NNT didn’t outperform time-to-event. The only trial that tested interpretation of NNT showed poor understanding.

Batch 4 includes studies that look at whether giving people NNTs of different sizes affects their evaluation of a therapy’s potential benefits. This is relevant to understanding, it seems to me, if it can identify whether people understand the implications of NNTs. We get a repeat study here from batch 3, plus 3 new ones, an extra 2,771 people.

Whether an NNT was low or high didn’t affect decision making on average in 2 of the 3 trials, but people were sensitive to variations in the other statistics. The third trial only compared different NNTs to each other. Then they explained NNT in detail: 24% of people changed their minds about their decisions.

With so many people deeply invested in NNTs, we need an up-to-date systematic review.

What I would like to help me get to the next level on risk statistics, though, probably can’t come from trials on NNTs. Rather, it will come from trials that build on ways to improve the statistics that already have a head start with people.

And we need to get closer to research that is less rooted in hypothetical decision making. Here’s a trial comparing risk statistics for real. Charlotte Harmsen and colleagues published it **in 2014**. It was an arduous trial for them to pull off, by the looks of it. Respect!

They compared AR and a time-to-event outcome, prolongation of life (time-to-death from heart disease): 25% filled a statin prescription when informed by AR (the simpler-to-understand inverse of the NNT) – and only 5.4% of the others.

**Gerd Gigerenzer**, as well as **Gary Brase**, make persuasive cases that using frequencies is “ecologically rational” – that, in effect, this is closer to how our minds work with information. They go further with some evidence to back them up, to suggest that we can improve people’s ability to accurately interpret statistics by using frequencies.

**Stephen Senn** argues that we need not NNTs, but to use modeling if we want to generate robust data on clinical relevance from trials. Exploring that seems worthwhile to me. As does addressing specific issues such as how we can communicate **low probability events** in ways we’ve only touched on.

We also urgently need fundamental qualitative and social science work – especially to better understand what might be harmful. There are clues already that NNTs in particular might be misleading. To go so far in communicating randomness that people think all treatments are **just a lottery** because we obscure patient variance has the potential for harm.

On being the “1” who experiences harm: *“You could always tend to say, ‘It’s not going to be me’,”* said a woman in a qualitative study by **Marilyn Schapira** and colleagues. **Ensuring we communicate variance** – at least through confidence interval ranges – may help to counteract people thinking it doesn’t apply to them, because they’re uniquely invulnerable, at lower risk of harm than the average person.

A qualitative study with clinicians by **Robert Froud** and colleagues signals to me reasons to be a little concerned about NNTs in practice. If clinicians like it because they think it’s straightforward and strip it completely of complexity, then we may have a problem: *“[NNT]: far more effective than bloomin’ confidence intervals and all the rest of it!”*

Perhaps most of all, we need more people to stop encouraging the illusion that communicating statistics to patients is simple.

*[Update, 3 June 2017]: Diogo Mendes and colleagues published a review of articles using NNTs and concluded that NNTs are frequently incompletely and mis-reported, which can make them uninterpretable or misleading – and 29% don’t follow the basic recommendations for calculation.*

~~~~

*The NNT was developed in part to get around the complexity of understanding of the odds ratio (OR). I tackle this and the paternalism of manipulating with statistics here.*

*Want to learn more about understanding risks? There’s some evidence this book by Steven Woloshin, Lisa Schwartz, and Gilbert Welch is effective: Know Your Chances.*

*Recommended reviews/overviews:*

*The cartoons are my own (CC-NC license). (More at Statistically Funny.)*

*Declaration of interests: My current areas of research do not include evaluating the merits of risk communication. My most recent publication in that field was published in 2013. I used to be the Coordinating Editor of the Cochrane group responsible for the Akl review, but that was well before this review came into being. I have co-authored with Akl and other authors of that review, as well as having undertaken a systematic review in risk communication many years ago with Adrian Edwards. Steven Woloshin and Lisa Schwartz are current colleagues of mine, as I’m a faculty member of NIH Medicine in the Media, which they organize. Although I peer review in this field, I was not involved with any of the trials in this post.*

*I have commented on the Akl review at PubMed Commons.*

*The batches of studies I discussed:*

Batch 1, with key findings related to interpretation/understanding of risk:

- **Grimes (1999)**: 633 women, USA, not only native English speakers. Compared a rate per 1,000 with a “1 in *x*”. 73% correctly chose which was higher when it was the rate; 56% when it was “1 in *x*”.
- **Sheridan (2003)**: 357 clinic patients, USA. Compared RR, AR, NNT, and a combination. Choosing which of 2 treatments provides more benefit: RR 60% correct; AR 42%; NNT 30%. Their conclusion: “NNT is often misinterpreted by patients and should not be used alone to communicate risk to patients”.
- **Berry (2006)**: 268 women, UK. For adverse effects of two drugs, compared RR, AR, and “NNT”, with and without baseline risk information provided. Participants rated the ease of judging risk. No difference when baseline risk was provided. Without baseline risk, no difference between RR and “NNT”, but AR was better than both: AR mean rating 2.22 (SD 1.28); “NNT” 1.85 (1.10).
- **Cuite (2008)**: 16,133 people, USA. Tested whether people could make sense of operations with comparisons (*which of 2 is bigger?*), trade-offs, adding, halving, tripling, and understanding sequence (*10% of people benefited, 3% of those*…). Compared frequency, percentage, and “1 in *x*”. Incorrect answers outnumbered “don’t know” answers. Overall accuracy rates: frequency 55%, percentages 55%, “1 in *x*” 45%. Much higher for the comparison question: 81% mean, without major differences.
- **Carling (2009)**: 2,978 people, mostly USA and Norway. Comparisons of natural frequency, percentages, RR, AR, NNT, and TNT (tablets needed to take). Accuracy (inadequate data reported): between 65% and 75%, with 75% for natural frequency. (Most preferred presentation: natural frequency (31%), RR (30%), percentages (20%), NNT (10%), AR (5%), TNT (3%).)
- **Selinger (2013)**: 50 clinic patients, Australia. Compared RR, AR, NNT, and a graphical presentation. Participants’ ratings of understandability: RR (94%), AR (88%), graphic (74%), NNT (48%). (Most preferred presentation: RR (48%), graphic (28%), AR (20%), NNT (4%).)

Batch 2 includes studies in medical students, who are counted as lay people by the Akl review. Two studies with 265 students:

- **Sheridan (2002)**: 62 medical students, USA. Given baseline risk information for a hypothetical disease, plus RR, AR, NNT, or a combination. Correct interpretation: 25% for NNT compared with 75% for the non-NNT formats.
- **Chao (2003)**, also reported by **Studts (2005)**: 203 pre-clinical medical students, USA. RR, AR, NNT, and absolute survival benefit. Round 1 given 1; round 2 given all 4. Non-quantitative data: absolute survival benefit was the easiest to understand. Quantitative data: the NNT met with a response they categorize as “polarized”. (NNT rated as most helpful by 22–25%; in pairwise comparison rated more helpful than AR and RR.)

Batch 3 includes studies that compare NNTs for one outcome with a completely different outcome (but from the same clinical trial data). Three studies, 4,890 people:

- **Christensen (2003)**: 967 general population sample, Denmark. Compared NNT with a time-to-event measure (how long till hip fracture), assuming the time-to-hip-fracture measure is virtually universally understandable. Multiple choice question for accurate interpretation of NNT: 18% answered correctly.
- **Halvorsen (2007)**: 2,754 people, Norway. Compared NNT (22 words with 2 numbers) with a time-to-event measure (how long till heart attack) – version 1 with 44 words and 2 numbers, version 2 with 56 words and 4 numbers. No difference in participants’ ratings of understanding.
- **Stovring (2008)**: 1,169 general population sample, Denmark. Compared hypothetical acceptance of a treatment in 2 rounds. Round 1 was AR, RR, NNT, or a time-to-event measure (how long till heart attack); round 2 was after seeing them all with a graphic explanation. No statistically significant difference in concordance.

Batch 4 includes studies that look at whether giving people NNTs of different sizes affects their (hypothetical) decision making. We get a repeat study here from batch 3, plus 3 new ones, an extra 2,771 people:

- **Kristiansen (2002)**: 675 people from a general population sample, Denmark. The magnitude of the NNT was either 10, 25, 50, 100, 200, or 400. Around 80% of people would be willing to take the hypothetical drug discussed, regardless of the level of the NNT.
- **Christensen (2003)** *(repeat appearance)*: 967 people from a general population sample, Denmark. Compared NNT with a time-to-event measure (how long until a hip fracture). This included NNTs of 10, 50, 100, or 400, and durations to hip fracture of 1 month, 6 months, 1 year, or 4 years. Differences in the time measure changed people’s opinions; differences in NNTs did not.
- **Halvorsen (2005)**: 1,201 people from a general population sample, Norway. Magnitudes of NNT: 50, 100, 200, 400, 800, 1,600. Other variants included the disease and treatment costs. Participants decided whether or not they would hypothetically consent to a therapy before and after an explanation of the NNT. Comprehension of the NNT was not assessed. Proportions hypothetically consenting ranged from 76% at the lowest NNT to 67% when it was 1,600. 24% changed their decision after the NNT was explained in detail.
- **Gyrd-Hansen (2011)**: 895 people from the general population, Denmark. AR, RR, NNT, or a time-to-event measure (death from a heart attack). Participants were sensitive to changes in the scale of AR, RR, and time-to-event, but not of the NNT.
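The formats these studies compare are all arithmetic re-expressions of the same two numbers – the risk with and without treatment. A minimal sketch of the conversions, using invented risks rather than data from any study above:

```python
def risk_formats(control_risk: float, treatment_risk: float):
    """Express one treatment effect as AR reduction, RR, and NNT."""
    arr = control_risk - treatment_risk  # absolute risk reduction (AR)
    rr = treatment_risk / control_risk   # relative risk (RR)
    nnt = 1 / arr                        # number needed to treat (NNT)
    return arr, rr, nnt

# Invented example: 4 in 100 untreated people have the outcome,
# versus 2 in 100 treated people.
arr, rr, nnt = risk_formats(0.04, 0.02)
print(f"AR reduction: {arr:.0%}")  # absolute terms: 2 fewer cases per 100
print(f"RR: {rr:.2f}")             # relative terms: half the risk
print(f"NNT: {nnt:.0f}")           # treat about 50 people to prevent 1 case
```

The same effect sounds very different as “a 50% relative risk reduction”, “2 fewer cases per 100”, or “treat 50 people to help 1” – which is why these framing studies matter.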

*The thoughts Hilda Bastian expresses here at Absolutely Maybe are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.*

~~~~

Imagine if there were a simple single statistical measure everybody could use with any set of data and it would reliably separate true from false. Oh, the things we would know! Unrealistic to expect such wizardry though, huh?

Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and *abracadabra*! Get a “*p* value” over or under 0.05 and you can be **95% certain** it’s either a fluke or it isn’t. You can eliminate the play of chance! You can separate the signal from the noise!

Except that you can’t. That’s not really what testing for statistical significance does. And therein lies the rub.

Testing for statistical significance estimates the probability of getting a result at least as extreme as the one observed *if* the study hypothesis is assumed not to be true. It can’t on its own prove whether this assumption was right, or whether the results would be the same in different circumstances. It provides a limited picture of probability, taking limited information about the data into account and giving only “yes” or “no” as options.
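That definition can be made concrete with a small simulation. A sketch using a hypothetical coin-flip example (nothing here comes from the studies discussed): the p-value is simply the proportion of outcomes, generated assuming the null hypothesis is true, that are at least as extreme as the observed result.

```python
import random

random.seed(1)  # for reproducibility

# Hypothetical example: we observe 60 heads in 100 flips and ask how
# often a FAIR coin (the null hypothesis) would produce a result at
# least this far from 50. That tail proportion is the p-value.
observed_heads, n_flips = 60, 100
n_simulations = 20_000

extreme = 0
for _ in range(n_simulations):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed_heads - 50):  # two-sided
        extreme += 1

p_value = extreme / n_simulations
print(f"simulated p-value: {p_value:.3f}")
# Note what this is NOT: it is not the probability that the coin is
# fair -- only how surprising the data would be if the coin were fair.
```

The exact two-sided binomial p-value for 60 heads in 100 fair flips is about 0.057, so this example sits right at the conventional threshold.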

What’s more, the finding of statistical significance itself can be a “fluke,” and that becomes more likely in bigger data and when you run the test on multiple comparisons in the same data. You can read more about that here.

Statistical significance testing can easily sound as though it sorts the wheat from the chaff, telling you what’s “true” and what isn’t. But it can’t do that on its own. What’s more, “significant” doesn’t mean it’s important either. A sliver of an effect can reach the less-than-5% threshold. We’ll come back to what all this means practically shortly.
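One way to see how a sliver of an effect can cross the threshold: with a big enough sample, even a trivially small difference becomes “statistically significant”. A sketch using a two-proportion z-test with invented numbers (the figures and the test choice are illustrative only):

```python
import math

def two_proportion_p(p1: float, p2: float, n: int) -> float:
    """Two-sided z-test p-value for two proportions, n people per group."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail area

# A sliver of an effect: 50.0% vs 50.5%
for n in (1_000, 100_000, 1_000_000):
    p = two_proportion_p(0.500, 0.505, n)
    print(f"n per group = {n:>9,}: p = {p:.4f}")
# The identical tiny effect is "non-significant" at n = 1,000 but
# comfortably "significant" once n reaches 100,000.
```

The effect hasn’t changed at all between the three rows – only the sample size has.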

The common approach to statistical significance testing seemed so simple to grasp, though, and so easy to do even before there were computers, that it took the science world by storm. As Stephen Stigler explains in his piece on Fisher and the 5% level, “it opened the arcane domain of statistical calculation to a world of experimenters and research workers”.

But it also led to something of an avalanche of abuses. The over-simplistic approach to statistical significance has a lot to answer for. As John Ioannidis points out here, this is a serious player in science’s failure to replicate results.

Before we go any further, I need to ‘fess up. I’m not a statistician but I’ve been explaining statistical concepts for a long time. I took the easy way out on this subject for the longest time, too. But I now think the perpetuation of the over-simplified ways of explaining this in so much training is a major part of the problem.

The need for us to get better at communicating the complexity of what statistical significance does and does not mean burst forth in question time at our panel on numbers at the recent annual meeting of the National Association of Science Writers in Florida.

Fellow statistics enthusiast and SciAm blogger Kathleen Raven organized and led the panel of me, SciAm mathematician blogger Evelyn Lamb, statistics professor Regina Nuzzo, and mathematician John Allen Paulos. Raven is organizing an ongoing blog, Noise and Numbers, built around this fun-loving science-writing crew. (My slides for that day are here.)

Two of the points I was making there are relevant to this issue. Firstly, the need to avoid over-precision and take confidence intervals or standard deviations into account. When you have the data for the confidence intervals, you have a better picture than statistical significance’s *p* value can possibly provide. It’s far more interesting and far more intuitive, too. You can learn more about these concepts here and here.
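For instance, a 95% confidence interval for a single proportion can be sketched with the usual normal approximation (the patient numbers here are invented for illustration):

```python
import math

# Hypothetical result: 30 of 100 patients improved.
improved, n = 30, 100
p_hat = improved / n
se = math.sqrt(p_hat * (1 - p_hat) / n)           # standard error
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se  # 95% interval
print(f"estimate {p_hat:.0%}, 95% CI {low:.0%} to {high:.0%}")
# The interval (roughly 21% to 39%) shows how precise -- or imprecise --
# the estimate is, which a bare p-value cannot convey.
```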

Secondly, it’s important to not consider the information from one study in isolation, a topic I go into here. One study on its own is rarely going to provide “the” answer.

Which brings us at last to Thomas Bayes, the mathematician and minister from the 1700s whose thinking is critical to debates about calculating and interpreting probability. Bayes argued that we should consider our prior knowledge when we consider probabilities, not just count the frequency of the specific data set in front of us against a fixed, unvarying quantity regardless of the question.

You can read more about Bayesian statistics here on the Wikipedia. An example given there goes like this: suppose someone told you they were speaking to someone. The chances the person was a woman might ordinarily be 50%. But if they said they were speaking to someone with long hair, then that knowledge could increase the probability that the person is a woman. And you could calculate a new probability based on that knowledge.
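The updating in that example is just Bayes’ theorem. A sketch with made-up conditional probabilities (the Wikipedia example doesn’t pin down the exact numbers):

```python
# Prior: before hearing anything else, the speaker is a woman with
# probability 0.5. The hair probabilities below are assumptions,
# invented for illustration.
p_woman = 0.5
p_long_given_woman = 0.75  # assumed: long hair is common among women
p_long_given_man = 0.15    # assumed: long hair is rarer among men

# Bayes' theorem: P(woman | long hair)
#   = P(long | woman) * P(woman) / P(long hair)
p_long = (p_long_given_woman * p_woman
          + p_long_given_man * (1 - p_woman))
posterior = p_long_given_woman * p_woman / p_long
print(f"P(woman | long hair) = {posterior:.2f}")
# The extra knowledge lifts the probability from the prior of 0.50
# to about 0.83.
```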

Statisticians are often characterized as either Bayesians or frequentists. The statistician doing the ward rounds in my cartoon at the top of this post is definitely a Bayesian!

An absolute hewing to *p* < 0.05 (or 0.001) no matter what would be classically frequentist. Important reasons for holding to this are the weakness of much of our prior knowledge – and the knowledge that people can be very biased, and may play fast and loose with data if there aren’t fixed goal posts.

Bayesianism has risen and fallen several times, but increasing statistical sophistication and computing power are enabling it to come to the fore in the 21st century. Nor is everyone in one camp or the other: there’s a lot of “fusion” thinking.

Valen Johnson has just argued in PNAS (Proceedings of the National Academy of Sciences in the USA) that Bayesian methods for calculating statistical significance have evolved to the point that they are ready to influence practice. The implication, according to Johnson, is that the threshold for statistical significance needs to be ratcheted much, much lower – more like 0.005 than 0.05. Gulp. The implications of that for the sample sizes needed for clinical studies would be drastic.

It doesn’t really all come down to where the threshold for a *p* value is, though. Statistically significant findings may or may not be important, for a variety of reasons. One rule of thumb: when a result does reach that numerical level, the data are showing something, but the finding always needs to be embedded in a wider consideration. Factors such as how big and important the apparent effect is, and whether the confidence intervals suggest the estimate is an extreme long shot, matter too.

What the debate about the level of statistical significance doesn’t mean, though, is that not being statistically significant is irrelevant. Data that don’t reach statistical significance are generally too weak to support a conclusion. But just as being statistically significant doesn’t mean something is necessarily “true,” not having enough evidence doesn’t necessarily prove that something is “false.” More on that here.

The debate about Bayesians versus frequentists and hypothesis testing is a vivid reminder that the field of statistics is dynamic – just like other parts of science. Not every statistician will see things the same way. Theories and practices will be contested, and knowledge is going to develop. There are many ways to interrogate data and interpret their meaning, and it makes little sense to look at data through the lens of only one measure. The *p* value is not one number to rule them all.

*Update 7 March 2016:*

*The American Statistical Association released 6 statements of principle about p-values:*

- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

~~~~

*Click on Statistics in the cloud tag to the right to see other posts that are relevant.*

A good book free online to help with understanding health statistics is Know Your Chances by Steve Woloshin, Lisa Schwartz and Gilbert Welch.

See also Steven Goodman’s A Dirty Dozen: Twelve *P*-Value Misconceptions. Gerd Gigerenzer tackles the many limitations and “wishful thinking” about simple hypothesis and significance tests in his article, Mindless statistics. The Wikipedia is a good place to start to learn more too. Another good article on understanding probabilities is by Gerd Gigerenzer and Adrian Edwards here.

Relevant posts on Statistically Funny are:

- You will meet too much false precision
- Nervously approaching significance
- Don’t worry…it’s just a standard deviation
- Alleged effects include howling

The Statistically-Funny cartoons are my original work (Creative Commons, non-commercial, share-alike license).

The picture of the portrait claiming to depict Thomas Bayes is from Wikimedia Commons.

*The thoughts Hilda Bastian expresses here at Absolutely Maybe are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.*
