FEATURE
imply causality from nonrandomized
data can readily fail to recognize and label
random fluctuations. This tendency is
apparent in studies examining health-care
interventions in populations that have
high-risk disease characteristics. Because
the “high-risk” classification often implies
outlier values, individuals initially identi-
fied by their outlier values will likely have
lower values on remeasurement, with or
without intervention.
“If presenters take the time to explain
the story their data are telling, rather
than using jargon and statistics, it’s much
easier to avoid this kind of narrative
fallacy,” Dr. Machado said. Translating
statistical observations into simple lan-
guage (e.g., “as generations go on, people
who are worse off tend to improve and
those who are better off tend to worsen”)
lowers the likelihood of misinterpreta-
tion, she added.
The Problem With P Values
In scientific literature, a p value of less
than 0.05 generally is set as a benchmark
to determine whether findings are “statis-
tically significant,” but readers can make
the mistake of conflating that number
with clinical significance, according to
Grzegorz S. Nowakowski, MD, from the
Mayo Clinic in Rochester, Minnesota. “P
values don’t tell you anything about clini-
cal benefit. They only tell you how likely
your results are to be true and not a play
of chance,” he explained. Dr. Nowakowski
also serves as a co-chair of the American
Society of Hematology (ASH) Working
Group on Innovations in Clinical Trials.
Allan Hackshaw, PhD, an epidemi-
ologist at Cancer Research UK and the
University College London Cancer Trials
Centre who teaches clinical trial design and
consults with trialists, agreed. “A p value
just addresses the question, ‘Could the ob-
served result be a chance or spurious find-
ing in this particular trial, when in reality
the intervention is completely ineffective?’”
The answer to this question is always “yes,”
he said, but people tend not to ask this
question when interpreting p values.
In the same way that one could flip a
coin 10 times in a row and come up with
heads every time, despite there being
nothing wrong with the coin, a p value of
<0.05 does not necessarily indicate that
a treatment is effective. The commonly
used cutoff value of 0.05 means that,
even when a treatment truly has no
effect, one false-positive result is
expected in every 20 comparisons.
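As a back-of-the-envelope check (illustrative only, not part of the article), the coin-flip odds and the one-in-twenty false-positive rate work out as follows:

```python
# Probability of 10 heads in 10 flips of a fair coin: (1/2)^10
p_ten_heads = 0.5 ** 10  # about 1 in 1,024 -- rare, yet nothing is "wrong"

# At the conventional alpha = 0.05 threshold, a truly ineffective
# treatment still has a 5% chance of producing a "significant" result,
# so one false positive is expected in every 20 null comparisons.
alpha = 0.05
expected_false_positives = 20 * alpha  # 1.0

print(p_ten_heads, expected_false_positives)
```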
“The American Statistical Association
clearly states that p values should not be
used in making clinical decisions – or any
sort of decisions – and yet, that’s exactly
what journals and registration agencies
do,” said Dr. Tannock, who has worked
throughout his career to improve the
quality and reporting of clinical trials.
“They should be using effect size and
some measure of value.”
Researchers can conduct a trial in
2,000 patients and identify a difference of
a few days in survival, he offered as an ex-
ample. “It might be statistically significant,
but it’s not clinically important. More-
over, those individuals selected for the
trial were quite possibly heavily selected
to have high performance status,” Dr.
Tannock explained. “When you try to see
this same small difference in the general
patient population, the effect is smaller
and the toxicity is higher.”
“While we could conclude falsely that
a treatment is effective when actually
it is not, there also are examples where
there are clearly large benefits but, with a
p value just above 0.05, the authors may
conclude that there is no effect, and this is
plainly wrong,” added Dr. Hackshaw.
When results just miss statistical sig-
nificance, assessing the evidence requires
great care. “We all have different feelings
about the data,” Dr. Nowakowski said.
“I may see the data as being potentially
marginal, while someone else might see a
potentially huge benefit. There’s always a
degree of subjectivity, and this nuance is
often lost in transmission.”
Depending on the study design, trials
can be fragile, Dr. Tannock noted. “Some-
times it takes only moving two or three
patients from one side to the other, from
positive to negative, and you can com-
pletely lose the trial’s significance.”
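Dr. Tannock's point is formalized in the "fragility index": the number of patients whose outcomes would have to change to push a significant result past p = 0.05. A minimal sketch, with hypothetical function names and a pure-Python Fisher exact test:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def prob(x):  # hypergeometric probability of x events in row 1
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Count how many treatment-arm patients must switch from
    'no event' to 'event' before significance is lost."""
    p = fisher_two_sided(events_t, n_t - events_t, events_c, n_c - events_c)
    if p >= alpha:
        return 0  # not significant to begin with
    flips = 0
    while events_t < n_t:
        events_t += 1  # move one patient "from one side to the other"
        flips += 1
        p = fisher_two_sided(events_t, n_t - events_t,
                             events_c, n_c - events_c)
        if p >= alpha:
            return flips
    return flips
```

A fragility index of two or three means exactly what Dr. Tannock describes: a handful of reclassified patients erases the trial's significance.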
Other points of weakness exist but
can go unnoticed by the average reader.
For example, the inclusion of multiple
comparisons and endpoints increases the
likelihood of erroneous inferences. Also,
large biases in a trial’s design or conduct
might partially or fully explain the ob-
served treatment benefit, and these reveal
themselves only after a careful review of
an article’s methods section.
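The inflation from multiple comparisons is easy to quantify. With k independent endpoints each tested at alpha = 0.05, the chance of at least one spurious "hit" grows quickly; the classic (and conservative) Bonferroni remedy divides alpha by k. An illustrative calculation (the choice of 10 endpoints is invented here):

```python
alpha, k = 0.05, 10  # 10 endpoints is illustrative, not from the article

# Chance of at least one false positive across k independent null tests
fwer = 1 - (1 - alpha) ** k  # about 0.40 for k = 10

# Bonferroni correction: test each endpoint at alpha / k instead
per_test = alpha / k
fwer_bonferroni = 1 - (1 - per_test) ** k  # just under 0.05 again

print(round(fwer, 3), round(fwer_bonferroni, 4))
```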
Case in Point: Bad Blood
As any trialist can attest, designing, run-
ning, and interpreting a trial that produces
clinically meaningful and statistically sound
results is not easy. There are many oppor-
tunities for misinterpretation, as evidenced
by the 2016 case of a study of red blood
cell transfusions from younger and older
donors.
First, JAMA Internal Medicine pub-
lished a study from a team of Canadian
researchers that suggested that red blood
cell transfusions from younger donors
and from female donors were associated
with increased mortality in recipients.1
Using a time-dependent
survival model and data from 30,503
transfusion recipients, they determined
that patients who received blood from
donors aged 17 to 19.9 years had an 8
percent higher mortality risk than those
receiving blood from donors aged 40 to
49.9 years (adjusted hazard ratio [HR]
= 1.08; 95% CI 1.06-1.10; p<0.001).
Similarly, an 8 percent increase in risk
of death was noted for those receiving
blood transfusions from female donors
compared with male donors (adjusted
HR=1.08; 95% CI 1.06-1.09; p<0.001).
This publication was soon followed by
an observational, matched-cohort study
published in Blood the same year, where-
in investigators found no associations
between blood donor age and mortality
among 136,639 transfusion recipients.2
“The original researchers assumed that
the risk of death and the risk of multiple
transfusions were linear, when they really
were not,” explained Alan E. Mast, MD,
PhD, from the BloodCenter of Wisconsin
and a co-chair of the ASH Working Group
on Innovations in Clinical Trials. “The
data curved because the risk of getting
multiple transfusions increased the likeli-
hood of dying, but the risk of getting a
different transfusion from a young blood
donor increases over that time differently
than the risk of dying.”
In light of these discrepant findings,
investigators at the Karolinska Institute in
Stockholm conducted their own analysis,
using methods similar to those in
the Canadian study but taking a different
approach to control more rigorously for
potential confounding variables associated
with the total number of units transfused.3
Their findings: Neither donor age nor
sex was associated with recipient survival.
“Any comparison between common and
less common categories of transfusions will
inevitably be confounded by the number of
transfusions, which drives the probability
of receiving the less common blood com-
ponents,” the authors concluded.
“When you assume linearity between a
covariate and the dependent variable, you
are essentially averaging out the effect,”
Dr. Machado explained. “When people
receive multiple transfusions and there is
a true nonlinear effect, in a way, you are
attributing to each transfusion the average
effect of all transfusions.”
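Dr. Machado's point, that assuming linearity averages out a nonlinear effect, can be seen in a small simulation. All numbers below are invented for illustration: a quadratic "risk" is fit with an ordinary least-squares line, which reports a single averaged per-transfusion effect even though the true effect varies with each transfusion.

```python
import random

random.seed(0)
# Invented data: "risk" grows quadratically with number of transfusions,
# so the true per-transfusion effect (0.04 * x) varies from 0.04 to 0.40.
x = [random.randint(1, 10) for _ in range(500)]           # transfusions
y = [0.02 * xi ** 2 + random.gauss(0, 0.05) for xi in x]  # outcome

# Ordinary least-squares slope: one number that averages the effect
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))

# The fitted line attributes roughly the same "risk" to every
# transfusion, understating the tenth and overstating the first.
print(round(slope, 2))
```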
“[This case] is a good example of
researchers coming out and asserting
something, and their findings got a lot
of attention, but when other researchers
went back and used different statistical
techniques, they found it just wasn’t true,”
Dr. Mast added.
Torturing Data Into Confession
Misinterpretation of data can typically
be attributed to eagerness to transmit
findings. “We have many attractive new
agents and therapies that we would like
to move quickly to the clinic and to
patients, so it’s a tough balance to design
the most applicable studies that show
the true benefit but also to find ways to
finalize the study faster and move it to
clinical practice more quickly,” said Dr.
Nowakowski.
This tension between caution and en-
thusiasm plays out in the murky waters of
subgroup analyses.
It is common practice in clinical trials
to see whether treatment effects vary
according to specified patient or disease
characteristics, but the rigor of subgroup
analyses also varies, and most readers
aren’t prepared to spot the differences. In
the best-case scenarios, subgroup analy-
ses show homogeneity of effect across
multiple subgroups. Problems arise when
these analyses are used as “fishing expedi-
tions” in trials where no overall treatment
effect is found.
“There is no ‘standard’ approach for
subgroup analyses,” said Dr. Hackshaw.
He suggests that running an interaction
test alone – which can compare whether
a treatment effect is different between
subgroups, such as males and females, for
example – is insufficient. A safer practice
is to run both an interaction test and a
test for heterogeneity, because the latter
assesses whether the effects in the sub-
groups differ from the overall effect.
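A sketch of the two checks Dr. Hackshaw describes, assuming subgroup results are summarized as hazard ratios with standard errors on the log scale (function names and all numbers are hypothetical):

```python
from math import erf, log, sqrt

def two_sided_p(z):
    """Two-sided p value from a standard-normal z score."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def interaction_test(hr1, se1, hr2, se2):
    """Are two subgroup hazard ratios different from each other?
    se1 and se2 are standard errors of the log hazard ratios."""
    z = (log(hr1) - log(hr2)) / sqrt(se1 ** 2 + se2 ** 2)
    return two_sided_p(z)

def cochran_q(log_hrs, ses, overall_log_hr):
    """Heterogeneity statistic: do subgroup effects differ from the
    overall effect? Compare against chi-square with len(log_hrs) - 1
    df (95th-percentile critical values: 3.84, 5.99, 7.81 for 1-3 df)."""
    return sum((e - overall_log_hr) ** 2 / s ** 2
               for e, s in zip(log_hrs, ses))
```

For example, hazard ratios of 0.70 and 1.05 with standard errors of 0.15 and 0.18 give an interaction p of about 0.08: suggestive, but short of the evidence a subgroup claim would usually need.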
“Many researchers do multiple sub-
group analyses, often encouraged or
requested by journals, and few allow for
the multiplicity, so chance effects could
arise,” he continued. “Requiring both tests
to ‘pass’ would strengthen the evidence
when faced with multiplicity.”
Beyond these numerical evaluations
of a subgroup analysis, Dr. Hackshaw
said, “there also needs to be biological
plausibility and corroborating evidence
from independent studies when claiming
a subgroup effect.”
Also, by their nature, subgroup
analyses are based on smaller numbers
of patients and events, running the risk
that the balance in baseline characteristics
achieved by randomization might be lost.
“The hope is that the subgroups are
all consistent, but even failing to show
heterogeneity doesn’t prove that it doesn’t
exist,” Dr. Tannock commented. “We
should beware of subgroup analyses
because the trials are powered for the
overall effect; subgroup analyses can
show random fluctuations that can be
highly misleading.” If there is a subgroup
of interest, he added, a separate trial