Published by Emerging Technologies Laboratory · via ETL Newswire

Why Clinical Research Has a Statistics Problem That Peer Review Keeps Passing

From p-value worship to underpowered trials, a pattern of numerical misuse is baked into the incentive structures of biomedical publishing, and spotting it requires skills most reviewers were never formally taught.

By Dr. Maya Iyer, Staff Reporter · Science Desk

There is a particular kind of sentence that appears in the discussion sections of clinical papers with remarkable frequency: 'Although the result did not reach statistical significance, a trend toward improvement was observed.' Read that carefully. A trend toward improvement. In most cases what the authors mean is that they ran an experiment, the data did not support rejecting the null hypothesis, and they would prefer you not notice.

This is not fraud. It is something subtler and in some ways more corrosive: statistical language being used to soften a null result into something publishable. And peer review, in its current form, is not reliably catching it.

The roots of this problem are structural. Most medical school curricula dedicate a modest block of time to biostatistics, enough for students to learn what a p-value is but not necessarily enough to understand what it is not. The p-value is not the probability that the null hypothesis is true. It is not a measure of effect size. It is not a signal of clinical relevance. It is the probability of seeing data at least as extreme as those observed, assuming the null hypothesis is true, and its 0.05 threshold was chosen by convention, not by any derivation from biology or medicine.
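A small simulation can make that definition concrete. The sketch below is illustrative and not drawn from any study: it assumes two arms of 30 patients sampled from the same normal distribution (so the null is true by construction), a known standard deviation of 1, and a two-sided z-test. The function names, trial count, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def p_value_two_sample_z(arm_a, arm_b, sd=1.0):
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    se = sd * sqrt(1 / len(arm_a) + 1 / len(arm_b))
    z_stat = (fmean(arm_a) - fmean(arm_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


def false_positive_rate(n_trials=2000, n_per_arm=30, alpha=0.05, seed=1):
    """Fraction of simulated null trials (no true effect) that give p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Both arms come from the same distribution: the treatment does nothing.
        arm_a = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        if p_value_two_sample_z(arm_a, arm_b) < alpha:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(f"Share of null trials with p < 0.05: {false_positive_rate():.3f}")
```

Under these assumptions the share should land near 5 percent. That is all the threshold controls: how often a true null produces a "significant" result. It says nothing about how likely the null is, or how large an effect might be.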

Yet the field organized itself around that threshold for decades. Trials were powered to hit it. Journals used it as a publication filter. Careers were shaped by whether results cleared it. The incentive to cross the line at any cost produced a literature with well-documented irregularities: selective outcome reporting, post-hoc subgroup fishing, and the routine omission of confidence intervals wide enough to contain both clinical benefit and clinical harm.

The American Statistical Association issued a statement on p-values in 2016. In 2019 its journal, The American Statistician, followed with a special issue whose lead editorial went further and urged researchers to stop using the label 'statistically significant' altogether. Both pushed away from binary significance testing toward reporting effect sizes with uncertainty intervals and grounding conclusions in the totality of evidence. The reception in clinical journals has been uneven. Some high-impact publications now require confidence intervals. Others still print phrases like 'highly significant' to describe a p of 0.003 without noting that the effect size in question was too small to matter at the bedside.
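The gap between a small p-value and a meaningful effect is easy to reproduce with arithmetic. The sketch below uses invented numbers, not data from any trial: it assumes a blood-pressure outcome with a standard deviation of 10 mmHg, 20,000 patients per arm, an observed difference of 0.3 mmHg, and a normal approximation. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def summarize_difference(mean_diff, sd, n_per_arm, alpha=0.05):
    """P-value, standardized effect size, and confidence interval for a
    difference in means between two equal-sized arms (normal approximation)."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_arm)              # standard error of the difference
    z_stat = mean_diff / se
    p_value = 2 * (1 - z.cdf(abs(z_stat)))     # two-sided
    half_width = z.inv_cdf(1 - alpha / 2) * se
    return {
        "p_value": p_value,
        "cohens_d": mean_diff / sd,            # effect size in SD units
        "ci_low": mean_diff - half_width,
        "ci_high": mean_diff + half_width,
    }


if __name__ == "__main__":
    # Hypothetical large trial: tiny difference, very large sample.
    result = summarize_difference(mean_diff=0.3, sd=10, n_per_arm=20_000)
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

With these inputs the p-value comes out near 0.003 while the standardized effect is 0.03 standard deviations, and the whole confidence interval sits below half a millimeter of mercury. The result is 'highly significant' and, under these assumed numbers, clinically negligible.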

Underpowering compounds everything. A trial enrolled to detect a large effect will, if the true effect is moderate, usually return an inconclusive result that gets reported as a trend or a pilot finding, then cited by a meta-analysis that pools it alongside adequately powered work. The meta-analysis carries the ghost of the original miscalculation forward.
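A rough power calculation shows how this happens. The sketch below assumes a two-arm trial comparing means with a two-sided z-test, and it borrows the conventional labels of 0.8 standard deviations for a 'large' effect and 0.5 for a 'moderate' one; those values, the 80 percent power target, and the function names are illustrative assumptions rather than figures from any particular trial.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def n_per_arm_for_power(effect_size_d, power=0.80, alpha=0.05):
    """Approximate patients per arm needed to detect a standardized
    mean difference with a two-sided, two-sample z-test."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / effect_size_d) ** 2)


def power_two_sample(effect_size_d, n_per_arm, alpha=0.05):
    """Approximate power of the same test for a given true effect size."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n_per_arm / 2)
    # Probability of landing in either rejection region.
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


if __name__ == "__main__":
    n_planned = n_per_arm_for_power(0.8)   # sized for a large effect
    n_needed = n_per_arm_for_power(0.5)    # what a moderate effect requires
    print(f"Planned per arm: {n_planned}, needed per arm: {n_needed}")
    print(f"Power if the true effect is moderate: "
          f"{power_two_sample(0.5, n_planned):.2f}")
```

Under these assumptions, a trial sized for the large effect enrolls 25 patients per arm, while a moderate effect would need 63. If the true effect is moderate, the smaller trial has well under a 50 percent chance of reaching significance, so an inconclusive result is the expected outcome rather than a surprise.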

Computational tools have not solved this. Software packages will calculate a p-value for any dataset regardless of whether the question being asked is coherent, the sample was representative, or the outcome measure was pre-registered. Automation made calculation easier without making interpretation better.

What would improve the situation is not a single correction but several operating simultaneously: mandatory pre-registration of primary outcomes before enrollment begins, statistical review by someone with dedicated training rather than a generalist reviewer reading the methods section in forty minutes, and a cultural shift in what counts as a publishable result. A well-executed null finding from an adequately powered trial is informative. A trend in a forty-patient study is closer to a hypothesis than a result.

Readers of clinical research are not always equipped to make that distinction, because the papers themselves often do not make it. That is the erosion worth naming: not a dramatic collapse, but a slow narrowing of the gap between what the numbers say and what the prose claims they say, until the two feel interchangeable.
