Get More Sophisticated

When considering p values, we need not be constricted by “rigid thinking” according to Dr. Kyriacou. Instead, consider the options. One such option is Bayesian Inference, which is particularly useful when, for example, the data show an association between a particular exposure and a specific health-related outcome. Utilizing this method, the investigator can infer the possibility of a causal relationship using the data in conjunction with data from prior similar studies.

Bayes factors have been proposed as more principled replacements for p values. Think of them as a mini deadlift test, measuring the strength of the relative evidence. Technically (and pardon the jargon), they are a weighted average likelihood ratio, representing the weight of evidence for competing hypotheses. Bayes factors represent the degree to which the data shift the relative odds between two hypotheses. According to Dr. Kyriacou, calculating Bayes factors requires more computational steps than the p value and the technique is not as widely known. Additionally, using Bayesian inference requires that prior information is known and precisely quantified, and Bayes factors have been criticized as being biased towards the null hypothesis in small samples.

Regardless of the technique(s) used, experts tend to agree that automatic dichotomized hypothesis testing using a prearranged level of significance (i.e., p < 0.05) needs to be supplemented (not supplanted) by more sophisticated methods, which might include effect sizes and a range of other tests; plus, some straightforward scientific reasoning and judgment, even if that injects a certain subjectivity.
Or, as Stone and Pocock put it: “A p value is no substitute for a brain.”8 They used that statement as a reminder that interpretation of a seemingly “positive” trial rests on more than just a significant p value.

Statisticians Outline a ‘Post p < 0.05 Era’ in New Scientific Statement
After decades of debate, the American Statistical Association (ASA) has released a statement on significance and p-values. To discuss what a ‘post p < 0.05 era’ will look like, CSWN Executive Editor Rick McGuire talks with Ron Wasserstein, PhD, first author of the new document and the executive director of the ASA.

34 CardioSource WorldNews

Cherry-picking p Values

Ever notice how the abstract for a ‘positive’ journal article often doesn’t show any non-significant results? Some may call it keeping an abstract to the journal’s specified length, but the reality is it’s cheating.

John P. A. Ioannidis, MD, DSc, has been decrying bad statistics in biomedical research for a long time now, starting with a 2005 article refreshingly titled: “Why most published research findings are false.”9 Dr. Ioannidis is co-director of the Meta-Research Innovation Center at Stanford (METRICS) and holds the C.F. Rehnborg Chair in Disease Prevention at Stanford University.

In a recent JAMA article, Dr. Ioannidis and colleagues studied how p values are reported in abstracts and full text of biomedical research articles over the past 25 years.10 They used automated text mining to identify more than 4.5 million p values in 1.6 million Medline abstracts and some 3.4 million p values from more than 385,000 full-text articles. In addition, the researchers manually assessed reporting of p values in 1,000 sample abstracts (analyzing only the 796 that reported empirical data) and 100 full-text articles.
The resulting abstract proved to be the most technical and dense opening this writer has seen in a long time, but the bottom line was clear: they found a strong “selection bias” towards significant p values in the abstracts versus the text of the study. They also found in abstracts that p values of 0.001 or less were “far more commonly reported than values of 0.05.” Abstracts “appear to provide a somewhat distorted picture of the evidence,” wrote the authors, particularly as “many readers focus primarily on the abstracts.” The tendency to cherry pick lower p values was particularly evident in meta-analyses and reviews, “segments of the literature [that] are influential in clinical medicine and practice,” and in core medical journals that also carry extra influence.

Dr. Pocock suggested that this practice is less of a problem in cardiology than in other fields, particularly in the major journals. “I hope they would never let you get away with it,” he said. “At the same time, we don’t want to deny what is often called ‘exploratory data analysis.’ We want to look at new ideas, we want to look at secondary endpoints, and at subgroups and get ideas for future research, but if you do p values at that exploratory realm, they are more used as descriptive feelers to see if something is worth taking seriously as opposed to leading to direct conclusions.”

Furthermore, the average p values reported overall are getting lower (more significant). This, Chavalarias et al. acknowledge, may be a result of big data offering larger sample sizes. Maybe, but one statistician we admire finds it extremely unlikely that big data will drive p-values down en masse.
More likely, the fact that p values are getting lower, Chavalarias and colleagues say, “may reflect a combination of increasing pressure to deliver (ever more) significant results in the competitive publish-or-perish scientific environment as well as the recent conduct of more studies that test a very large number of hypotheses and thus can reach lower p values simply by chance.”10 They concluded their study by saying that a p value < 0.05 has “lost its discriminating ability for separating false from true hypotheses; more stringent p value thresholds are probably warranted across scientific fields.” The authors do not suggest that p values be abandoned, but rather that they not be reported in isolation: “Articles should include effect sizes and uncertainty metrics.”

P-hacking

While you would be hard pressed to find much statistical slang in Urban Dictionary, the term “p-hacking” was added by someone calling him/herself PProf in Jan. 2012. The definition: “Exploiting—perhaps unconsciously—researcher degrees of freedom until p < 0.05.” The examples clarify what they mean: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05,” or “She is a p-hacker, she always monitors data while it is being collected.” (One other statistical term found in this authoritative source is “statsporn: An arbitrarily detailed statistical breakdown of information which provides no greater understanding but fills reports or especially weak assignments designed primarily to give the reader something to look at.” As in, “This report is not great—is there no statsporn we can fill it with?”)

Some investigators have suggested that when reported p values cluster around the 0.041 to 0.049 range, p-hacking may be to blame.

May 2016
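For readers who want to see why testing a very large number of hypotheses can produce low p values simply by chance, here is a minimal Python sketch. It is an illustration, not anything taken from the Chavalarias paper. It rests on two simplifying assumptions: the tests are independent, and every null hypothesis is true, so each p value is uniformly distributed between 0 and 1. The function names are hypothetical.

```python
import random


def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests of a
    true null hypothesis gives p < alpha (assumption: independence)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate(num_tests, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo check of the formula above.

    Assumption: under a true null hypothesis each p value is uniform on
    [0, 1], so a p value is drawn directly instead of simulating patients.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Keep only the smallest p value, as a cherry-picker would.
        smallest_p = min(rng.random() for _ in range(num_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

Under these assumptions a single test crosses p < 0.05 by chance 5% of the time, while the chance that at least one of 20 tests does so is roughly 64%. Real studies rarely have fully independent tests, so treat the numbers as a rough guide and not as a correction formula.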