Get More Sophisticated
When considering p values, we need not be constricted by “rigid thinking,” according to Dr. Kyriacou. Instead, consider the options. One such option is Bayesian inference, which is particularly useful when, for example, the data show an association between a particular exposure and a specific health-related outcome. Using this method, the investigator can assess the probability of a causal relationship using the data in conjunction with data from prior similar studies.
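To make that concrete, here is a toy sketch (our illustration, not anything from the article, with invented numbers) of how data from prior studies and a new study might be combined in a conjugate Beta-Binomial update in Python:

```python
# Toy Beta-Binomial update combining prior studies with new data.
# All numbers are invented for illustration.

# Hypothetical prior studies: the outcome occurred in 30 of 100 exposed patients.
prior_events, prior_n = 30, 100
alpha0 = 1 + prior_events             # Beta(alpha0, beta0) prior built from prior data
beta0 = 1 + (prior_n - prior_events)

# Hypothetical new study: 18 events in 40 exposed patients.
new_events, new_n = 18, 40

# Conjugate update: the posterior is also a Beta distribution.
alpha_post = alpha0 + new_events
beta_post = beta0 + (new_n - new_events)

posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"Posterior mean event rate: {posterior_mean:.3f}")  # ~0.345
```

The posterior blends the prior studies’ 30% event rate with the new study’s 45% rate, weighted by their sample sizes.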
Bayes factors have been proposed as more principled replacements for p values. Think of them as
a mini deadlift test, measuring the strength of the
relative evidence. Technically (and pardon the jargon), they are a weighted average likelihood ratio,
representing the weight of evidence for competing
hypotheses. Bayes factors represent the degree to
which the data shift the relative odds between two
hypotheses.
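As a toy illustration of that shift in relative odds (again ours, not the article’s), the snippet below compares two simple point hypotheses about a success rate; a full Bayes factor would average the likelihood over a prior distribution under each hypothesis, which is where the “weighted average likelihood ratio” comes in:

```python
from scipy.stats import binom

# Hypothetical data: 60 successes in 100 trials.
k, n = 60, 100

# Two simple point hypotheses about the success rate.
# (A full Bayes factor averages the likelihood over a prior under
# each hypothesis -- the "weighted average likelihood ratio".)
bf = binom.pmf(k, n, 0.6) / binom.pmf(k, n, 0.5)
print(f"Bayes factor, H1 (rate = 0.6) vs. H0 (rate = 0.5): {bf:.1f}")

# The Bayes factor shifts prior odds into posterior odds.
prior_odds = 1.0                     # no initial preference between hypotheses
posterior_odds = bf * prior_odds
print(f"Posterior odds favoring H1: {posterior_odds:.1f}")
```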
According to Dr. Kyriacou, calculating Bayes factors requires more computational steps than the p value, and the technique is not as widely known. Additionally, using Bayesian inference requires that prior information be known and precisely quantified, and Bayes factors have been criticized as being biased toward the null hypothesis in small samples.
Regardless of the technique(s) used, experts
tend to agree that automatic dichotomized hypothesis testing using a prearranged level of significance (i.e., p < 0.05) needs to be supplemented
(not supplanted) by more sophisticated methods,
which might include effect sizes and a range of
other tests; plus, some straightforward scientific
reasoning and judgment, even if that
injects a certain
subjectivity. Or, as Stone and Pocock put it: “A p value is no substitute for a brain.”8 They used that statement as a reminder that interpretation of a seemingly “positive” trial rests on more than just a significant p value.

Statisticians Outline a ‘Post p < 0.05 Era’ in New Scientific Statement
After decades of debate, the American Statistical Association (ASA) has released a statement on significance and p values. To discuss what a ‘post-p < 0.05 era’ will look like, CSWN Executive Editor Rick McGuire talks with Ron Wasserstein, PhD, first author of the new document and the executive director of the ASA.

Cherry-picking p Values
Ever notice how the abstract for a “positive” journal article often doesn’t show any non-significant results? Some may call it keeping an abstract to the journal’s specified
length, but the reality is it’s cheating.
John P. A. Ioannidis, MD, DSc, has been decrying bad statistics in biomedical research for a long time now, starting with a 2005 article refreshingly titled: “Why most published research findings are false.”9 Dr. Ioannidis is co-director of the Meta-Research Innovation Center at Stanford (METRICS) and holds the C.F. Rehnborg Chair in Disease Prevention at Stanford University.
In a recent JAMA article, Dr. Ioannidis and
colleagues studied how p values are reported in abstracts and full text of biomedical research articles
over the past 25 years.10 They used automated text
mining to identify more than 4.5 million p values in
1.6 million Medline abstracts and some 3.4 million
p values from more than 385,000 full-text articles.
In addition, the researchers manually assessed
reporting of p values in 1,000 sample abstracts
(analyzing only the 796 that reported empirical
data) and 100 full-text articles.
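The study’s actual extraction pipeline is not spelled out here, but the core idea of mining p values from text can be sketched in a few lines of Python (the pattern below is a deliberate simplification of whatever rules the authors used; the example abstract is invented):

```python
import re

# A single, deliberately simple pattern; the study's extraction rules
# were far more elaborate than this.
P_VALUE = re.compile(r"[Pp]\s*([<>=])\s*(0?\.\d+)")

abstract = ("Mortality was lower in group A (p < 0.001). "
            "LDL-C did not differ between groups (P = 0.21).")

for match in P_VALUE.finditer(abstract):
    operator, value = match.groups()
    print(f"found: p {operator} {value}")
```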
The resulting abstract proved to be the most
technical and dense opening this writer has seen
in a long time, but the bottom line was clear: they
found a strong “selection bias” towards significant p
values in the abstracts versus the text of the study.
They also found that, in abstracts, p values of 0.001 or less were “far more commonly reported than values of 0.05.”
Abstracts “appear to provide a somewhat distorted picture of the evidence,” wrote the authors,
particularly as “many readers focus primarily on
the abstracts.” The tendency to cherry pick lower
p values was particularly evident in meta-analyses
and reviews, “segments of the literature [that] are
influential in clinical medicine and practice,” and in
core medical journals that also carry extra influence.
Dr. Pocock suggested that this practice is less
of a problem in cardiology than in other fields,
particularly in the major journals. “I hope they
would never let you get away with it,” he said. “At
the same time, we don’t want to deny what is often
called ‘exploratory data analysis.’ We want to look
at new ideas, we want to look at secondary endpoints, and at subgroups and get ideas for future
research, but if you do p values at that exploratory
realm, they are more used as descriptive feelers to
see if something is worth taking seriously as opposed to leading to direct conclusions.”
Furthermore, the average p values reported overall are getting lower (more significant). This, Chavalarias et al. acknowledge, may be a result of big data offering larger sample sizes. Maybe, but one statistician we admire finds it extremely unlikely that big data will drive p values down en masse. More likely, as Chavalarias and colleagues put it, the trend “may reflect a combination of increasing pressure to deliver (ever more) significant results in the competitive publish-or-perish scientific environment as well as the recent conduct of more studies that test a very large number of hypotheses and thus can reach lower p values simply by chance.”10
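That last mechanism is easy to demonstrate. In the simulation below (our illustration, not the authors’), every null hypothesis is true, yet the smallest p value shrinks steadily as the number of tests grows, simply because under the null a p value is uniformly distributed:

```python
import random

# Every null hypothesis is true, so each p value is uniform on [0, 1];
# the smallest of many uniform draws still gets arbitrarily small.
random.seed(1)

for n_tests in (1, 10, 100, 1000):
    min_p = min(random.random() for _ in range(n_tests))
    print(f"{n_tests:5d} tests -> smallest p = {min_p:.4f}")
```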
They concluded their study by saying that the
p value < 0.05 has “lost its discriminating ability
for separating false from true hypotheses; more
stringent p value thresholds are probably warranted
across scientific fields.”
What the authors do not suggest is that p values be abandoned, but rather that they not be reported in isolation: “Articles should include effect sizes and uncertainty metrics.”
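As a sketch of what that fuller reporting might look like (simulated data, our own illustration), the snippet below prints an effect size and 95% confidence interval alongside the p value:

```python
import numpy as np
from scipy import stats

# Simulated trial data; the point is the reporting format, not the numbers.
rng = np.random.default_rng(0)
treated = rng.normal(1.0, 2.0, 80)
control = rng.normal(0.0, 2.0, 80)

diff = treated.mean() - control.mean()                 # effect size
se = np.sqrt(treated.var(ddof=1) / 80 + control.var(ddof=1) / 80)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se   # ~95% CI
p = stats.ttest_ind(treated, control).pvalue

print(f"difference = {diff:.2f}, "
      f"95% CI ({ci_low:.2f} to {ci_high:.2f}), p = {p:.4f}")
```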
P-hacking
While you would be hard pressed to find much statistical slang in Urban Dictionary, the term “p-hacking” was added by someone calling him/herself “PProf” in January 2012.
The definition: “Exploiting—perhaps unconsciously—researcher degrees of freedom until
p < 0.05.” The examples clarify what they mean:
“That finding seems to have been obtained through
p-hacking, the authors dropped one of the conditions
so that the overall p-value would be less than .05,” or
“She is a p-hacker, she always monitors data while it is
being collected.”
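That second example, monitoring data while it accrues, is worth simulating. In the sketch below (our illustration), every simulated study has a true effect of exactly zero, yet peeking at the p value after every 10 observations and stopping at the first p < 0.05 yields far more than 5% “significant” results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies, max_n, peek_every = 2000, 100, 10

false_positives = 0
for _ in range(n_studies):
    data = rng.standard_normal(max_n)        # true effect is exactly zero
    for n in range(peek_every, max_n + 1, peek_every):
        p = stats.ttest_1samp(data[:n], 0.0).pvalue
        if p < 0.05:                         # stop at the first "significant" peek
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_studies:.3f}")
```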
(One other statistical term found in this authoritative source is “statsporn: An arbitrarily
detailed statistical breakdown of information which
provides no greater understanding but fills reports
or especially weak assignments designed primarily
to give the reader something to look at.” As in, “This
report is not great—is there no statsporn we can fill it
with?”)
Some investigators have suggested that when
reported p values cluster around the 0.041 to 0.049
range, p-hacking may be to blame.
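A crude version of that diagnostic is easy to sketch: tally reported p values just under versus just over the 0.05 threshold (the values below are invented; real p-curve analyses are considerably more careful):

```python
from collections import Counter

# Invented p values for illustration only.
reported = [0.048, 0.043, 0.012, 0.047, 0.049, 0.21, 0.044, 0.051, 0.046]

bins = Counter()
for p in reported:
    if 0.041 <= p <= 0.049:
        bins["just under 0.05"] += 1
    elif 0.051 <= p <= 0.059:
        bins["just over 0.05"] += 1

# A heavy "just under" tail can hint at p-hacking.
print(dict(bins))
```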