Readers who imply causality from nonrandomized data can readily fail to recognize and label random fluctuations. This tendency is apparent in studies examining health-care interventions in populations that have high-risk disease characteristics. Because the "high-risk" classification often implies outlier values, individuals initially identified by their outlier values will likely have lower values on remeasurement, with or without intervention, a phenomenon known as regression to the mean.

"If presenters take the time to explain the story their data are telling, rather than using jargon and statistics, it's much easier to avoid this kind of narrative fallacy," Dr. Machado said. Translating statistical observations into simple language (e.g., "as generations go on, people who are worse off tend to improve and those who are better off tend to worsen") lowers the likelihood of misinterpretation, she added.

The Problem With P Values

In scientific literature, a p value of less than 0.05 generally is set as a benchmark to determine whether findings are "statistically significant," but readers can make the mistake of conflating that number with clinical significance, according to Grzegorz S. Nowakowski, MD, from the Mayo Clinic in Rochester, Minnesota. "P values don't tell you anything about clinical benefit. They only tell you how likely your results are to be true and not a play of chance," he explained. Dr. Nowakowski also serves as a co-chair of the American Society of Hematology (ASH) Working Group on Innovations in Clinical Trials.

Allan Hackshaw, PhD, an epidemiologist at Cancer Research UK and the University College London Cancer Trials Centre who teaches clinical trial design and consults with trialists, agreed. "A p value just addresses the question, 'Could the observed result be a chance or spurious finding in this particular trial, when in reality the intervention is completely ineffective?'" The answer to this question is always "yes," he said, but people tend not to ask it when interpreting p values.

In the same way that one could flip a coin 10 times in a row and come up with heads every time despite there being nothing wrong with the coin (about a 1-in-1,000 chance), a p value of <0.05 does not necessarily indicate that a treatment is effective. The commonly used cutoff value of 0.05 means that one illegitimate effect (false-positive result) is expected in every 20 comparisons.

"The American Statistical Association clearly states that p values should not be used in making clinical decisions – or any sort of decisions – and yet, that's exactly what journals and registration agencies do," said Dr. Tannock, who has worked throughout his career to improve the quality and reporting of clinical trials. "They should be using effect size and some measure of value."

Researchers can conduct a trial in 2,000 patients and identify a difference of a few days in survival, he offered as an example. "It might be statistically significant, but it's not clinically important. Moreover, those individuals selected for the trial were quite possibly heavily selected to have high performance status," Dr. Tannock explained. "When you try to see this same small difference in the general patient population, the effect is smaller and the toxicity is higher."

"While we could conclude falsely that a treatment is effective when actually it is not, there also are examples where there are clearly large benefits but, with a p value just above 0.05, the authors may conclude that there is no effect, and this is plainly wrong," added Dr. Hackshaw.
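Both failure modes trace back to what a p value actually measures. A short simulation makes the false-positive side concrete. This is a minimal sketch with invented numbers, not anything from the studies discussed here: test a completely ineffective treatment enough times, and roughly 1 in 20 trials will cross p < 0.05 by chance alone.

```python
# Minimal sketch: how often does a completely ineffective treatment
# reach p < 0.05 purely by chance? (Illustrative; all numbers invented.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials = 10_000   # hypothetical repeated trials of the same null treatment
n_per_arm = 200     # hypothetical patients per arm

false_positives = 0
for _ in range(n_trials):
    # Both arms are drawn from the SAME distribution: the true effect is zero.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    treated = rng.normal(loc=0.0, scale=1.0, size=n_per_arm)
    _, p = stats.ttest_ind(control, treated)
    if p < 0.05:
        false_positives += 1

# Expect a rate near 0.05: one spurious "significant" trial in every 20.
print(f"False-positive rate: {false_positives / n_trials:.3f}")
```

Run repeatedly, the rate hovers around 5 percent, which is exactly the "one illegitimate effect in every 20 comparisons" that the 0.05 cutoff implies.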
When results just miss statistical significance, assessing the evidence requires great care. "We all have different feelings about the data," Dr. Nowakowski said. "I may see the data as being potentially marginal, while someone else might see a potentially huge benefit. There's always a degree of subjectivity, and this nuance is often lost in transmission."

Depending on the study design, trials can be fragile, Dr. Tannock noted. "Sometimes it takes only moving two or three patients from one side to the other, from positive to negative, and you can completely lose the trial's significance."

Other points of weakness exist but can go unnoticed by the average reader. For example, the inclusion of multiple comparisons and endpoints increases the likelihood of erroneous inferences. Also, large biases in a trial's design or conduct might partially or fully explain the observed treatment benefit, and these reveal themselves only after a careful review of an article's methods section.

Case in Point: Bad Blood

As any trialist can attest, designing, running, and interpreting a trial that produces clinically meaningful and statistically sound results is not easy. There are many opportunities for misinterpretation, as evidenced by the 2016 case of a study of red blood cell transfusions from younger and older donors.

First, JAMA Internal Medicine published a study from a team of Canadian researchers that suggested that red blood cell transfusions from younger donors and from female donors were statistically more likely to increase mortality in recipients.1 Using a time-dependent survival model and data from 30,503 transfusion recipients, they determined that patients who received blood from donors aged 17 to 19.9 years had an 8 percent higher mortality risk than those receiving blood from donors aged 40 to 49.9 years (adjusted hazard ratio [HR] = 1.08; 95% CI 1.06-1.10; p<0.001). Similarly, an 8 percent increase in risk of death was noted for those receiving blood transfusions from female donors compared with male donors (adjusted HR = 1.08; 95% CI 1.06-1.09; p<0.001).

This publication was soon followed by an observational, matched-cohort study published in Blood the same year, wherein investigators found no association between blood donor age and mortality among 136,639 transfusion recipients.2

"The original researchers assumed that the risk of death and the risk of multiple transfusions were linear, when they really were not," explained Alan E. Mast, MD, PhD, from the BloodCenter of Wisconsin and a co-chair of the ASH Working Group on Innovations in Clinical Trials. "The data curved because the risk of getting multiple transfusions increased the likelihood of dying, but the risk of getting a different transfusion from a young blood donor increases over that time differently than the risk of dying."

In light of these discrepant findings, investigators at the Karolinska Institute in Stockholm conducted their own analysis, using methods similar to those in the Canadian study but taking a different approach to control more rigorously for potential confounding variables associated with the total number of units transfused.3 Their findings: neither donor age nor sex was associated with recipient survival. "Any comparison between common and less common categories of transfusions will inevitably be confounded by the number of transfusions, which drives the probability of receiving the less common blood components," the authors concluded.
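The mechanism the Karolinska authors describe can be sketched in a few lines of Python. This is an illustrative toy simulation with invented numbers, not the study's actual model: sicker patients receive more units, and each additional unit raises the chance of receiving at least one unit from a less common donor group, so exposure to that group becomes a proxy for being sick.

```python
# Toy sketch of confounding by the number of transfusions (invented
# numbers, not the actual study data). Donor age has NO effect here,
# yet a crude comparison makes "young-donor" blood look harmful.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

severity = rng.uniform(0, 1, n)                       # unobserved illness severity
n_units = 1 + rng.poisson(1 + 6 * severity)           # sicker -> more transfusions
died = rng.uniform(0, 1, n) < 0.05 + 0.4 * severity   # sicker -> more deaths

# Each unit independently comes from a "young" donor with probability 0.1,
# so receiving MANY units makes exposure to a young-donor unit more likely.
any_young_unit = rng.binomial(n_units, 0.1) > 0

# Crude, unadjusted comparison of mortality by exposure.
print(f"Exposed to young-donor unit:   {died[any_young_unit].mean():.3f}")
print(f"Unexposed:                     {died[~any_young_unit].mean():.3f}")
# Exposed mortality looks higher only because exposure tracks the number
# of units transfused, which in turn tracks how sick the patient is.
```

Even though donor age is irrelevant in this toy model, the unadjusted comparison shows an apparent harm, which is why the Karolinska group controlled rigorously for the total number of units transfused.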
"When you assume linearity between a covariate and the dependent variable, you are essentially averaging out the effect," Dr. Machado explained. "When people receive multiple transfusions and there is a true nonlinear effect, in a way, you are attributing to each transfusion the average effect of all transfusions."

"[This case] is a good example of researchers coming out and asserting something, and their findings got a lot of attention, but when other researchers went back and used different statistical techniques, they found it just wasn't true," Dr. Mast added.

Torturing Data Into Confession

Misinterpretation of data can typically be attributed to eagerness to transmit findings. "We have many attractive new agents and therapies that we would like to move quickly to the clinic and to patients, so it's a tough balance to design the most applicable studies that show the true benefit but also to find ways to finalize the study faster and move it to clinical practice more quickly," said Dr. Nowakowski.

This tension between caution and enthusiasm plays out in the murky waters of subgroup analyses. It is common practice in clinical trials to see whether treatment effects vary according to specified patient or disease characteristics, but the rigor of subgroup analyses also varies, and most readers aren't prepared to spot the differences. In the best-case scenarios, subgroup analyses show homogeneity of effect across multiple subgroups. Problems arise when these analyses are used as "fishing expeditions" in trials where no overall treatment effect is found.

"There is no 'standard' approach for subgroup analyses," said Dr. Hackshaw. He suggests that running an interaction test alone – which can compare whether a treatment effect differs between subgroups, such as males and females – is insufficient. A safer practice is to run both an interaction test and a test for heterogeneity, because the latter assesses whether the effects in the subgroups differ from the overall effect.

"Many researchers do multiple subgroup analyses, often encouraged or requested by journals, and few allow for the multiplicity, so chance effects could arise," he continued. "Requiring both tests to 'pass' would strengthen the evidence when faced with multiplicity."

Beyond these numerical evaluations of a subgroup analysis, Dr. Hackshaw said, "there also needs to be biological plausibility and corroborating evidence from independent studies when claiming a subgroup effect." Also, by their nature, subgroup analyses are based on smaller numbers of patients and events, running the risk that the balance in baseline characteristics achieved by randomization might be lost.

"The hope is that the subgroups are all consistent, but even failing to show heterogeneity doesn't prove that it doesn't exist," Dr. Tannock commented. "We should beware of subgroup analyses because the trials are powered for the overall effect; subgroup analyses can show random fluctuations that can be highly misleading." If there is a subgroup of interest, he added, a separate trial should be conducted in that population.
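Dr. Tannock's warning about random fluctuations in subgroups is easy to quantify. A back-of-the-envelope sketch (assuming independent tests, which real subgroups only approximate) shows how quickly unadjusted subgroup analyses accumulate spurious findings:

```python
# Back-of-the-envelope sketch: the chance that at least one of k subgroup
# analyses is "significant" at p < 0.05 when NO true subgroup effect exists.
# Assumes independent tests, which real subgroups only approximate.
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - 0.95 ** k
    print(f"{k:>2} subgroup tests -> {p_any_false_positive:.0%} chance "
          "of at least one spurious 'significant' subgroup")
```

With 10 unadjusted subgroup tests, a trial of a totally ineffective treatment still has roughly a 40 percent chance of producing at least one "positive" subgroup, which is why Dr. Hackshaw stresses allowing for multiplicity and demanding corroborating evidence before claiming a subgroup effect.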