Name:
George Tomlinson, PhD, discusses how to use an article about hypothesis testing and appropriate interpretation of P values.
Description:
George Tomlinson, PhD, discusses how to use an article about hypothesis testing and appropriate interpretation of P values.
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/e10e25a4-e536-49c0-b02c-44eace017f2a/thumbnails/e10e25a4-e536-49c0-b02c-44eace017f2a.jpg?sv=2019-02-02&sr=c&sig=4f5XM2bXC2kTpL0%2BMhbUNSpIo4bCC3LqGYiptZfkJHY%3D&st=2022-05-27T18%3A54%3A19Z&se=2022-05-27T22%3A59%3A19Z&sp=r
Duration:
T00H12M38S
Embed URL:
https://stream.cadmore.media/player/e10e25a4-e536-49c0-b02c-44eace017f2a
Content URL:
https://asa1cadmoremedia.blob.core.windows.net/asset-0e929d14-ebaa-42aa-abdd-e01f4f01d755/16207281.mp3
Upload Date:
2022-02-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[ Music ] >> Hello, and welcome to JAMAevidence, our monthly podcast focused on core issues in evidence-based medicine. I'm Dr. Demetrios Kyriacou from Northwestern University. Today we are discussing hypothesis testing with Dr. George Tomlinson. Welcome, Dr. Tomlinson.
>> Thank you. >> Could you tell us a little bit about hypothesis testing in clinical trials? >> Of course. So in clinical trials there's usually a question about whether an intervention improves outcomes in patients. A trial's run, data are collected. At the end of the trial the question is did the intervention improve outcomes in patients, and one -- among the many possible things that could happen, you know, one of them is that findings could purely be a result of chance.
Now, you're hoping that the findings might be a result of the treatment actually improving outcomes in patients. The trial could have been run with problems so the results are a result of bias. But hypothesis testing deals primarily with the question could the results that you see be due to chance, or how likely is it that chance could give results like the ones you've seen. If chance is a primary and very clear explanation for what you saw, then you probably don't have a treatment that's beneficial. >> Thank you. What are the main problems with hypothesis testing from a researcher's perspective, and then from a clinician's perspective?
>> Well, when you're planning a clinical trial under the usual setting, where you have a null hypothesis, the treatment has no effect on outcomes, when you're planning the study you need to figure how big the study should be. In a hypothesis testing setting you need to usually say, well, you know, what is the real effect going to be, what do I think the real effect is, or what is the minimally clinically important effect, and you design the study around that, so that when at the end of the study you carry out your hypothesis test, you've got a good chance of -- I'm not sure if you'll get into the details of the technique yet, but you've got a good chance of rejecting the hypothesis if, in fact, there is a treatment effect.
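As an illustration of the planning step described above (not part of the recording), here is a minimal Python sketch of a sample-size calculation for comparing two proportions. It uses the standard normal-approximation formula; the function name and the event rates of 20%, 15%, and 17% are hypothetical values chosen only for illustration, not figures from the discussion.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test of two proportions.

    Uses the normal approximation with unpooled variances. All inputs are
    guesses that must be made before the trial starts.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical planning assumptions: 20% event rate on control,
# and either a 15% or a 17% event rate on the new treatment.
n_assumed = sample_size_per_group(0.20, 0.15)
n_smaller_effect = sample_size_per_group(0.20, 0.17)
print(n_assumed, n_smaller_effect)
```

The required size is very sensitive to the assumed effect: a modest change in the guessed treatment effect changes the number of patients needed several-fold, which is why guessing these quantities before any data exist is difficult.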
The problem is before the study starts you don't really know a lot of things about patient outcomes. You maybe don't know the proportion of patients who will benefit in either group, you don't know how many people -- you don't actually know the treatment effect either. So the hypothesis testing setting forces you to make some decisions at the very beginning when you're planning the trial on things you often don't have very good data about. The whole hypothesis testing approach to clinical trials puts some strong restrictions on what you can do with the data during the trial. So that's one problem with it from a researcher's point of view; it forces you to sort of make assumptions about things you maybe don't have very good data about, because that's why you're carrying out the trial.
From the interpretation point of view-- was that your second question? >> Yes. >> Yeah, the interpretation is -- there's a very long-winded technical definition of a p-value that is not usually what people like to think of it as. You know, I think things are getting better. There's been some recent clamor about the problems with p-values and what they really mean and their limitation. So I think maybe researchers today are more aware of those problems than they used to be. But there's been evidence shown that people misinterpret p-values and they think a p-value is the probability, for example, that the treatment works, which is not what it is at all.
So those are a couple of problems in the early design stages of a trial using hypothesis testing and then in interpretation. There are plenty of others. >> All right. So getting to the clinical interpretation of a trial, as a clinician, sometimes I have difficulty accepting that a p-value of .051 is different from a p-value of .049. So if it was .049, then it shows that the drug, for instance, works, but if it's .051, then we assume that the drug doesn't work.
Could you elaborate a little bit on that? >> Well, I can tell from the tone of your voice asking that question that's something that you don't think should be done, and I entirely agree. There's a -- I'm not sure if it's a published paper or a preprint by Andrew Gelman at Columbia titled something like the difference between significant and nonsignificant is not itself significant, and this is what you just said. You know, p-values of .049 and .051 are essentially the same level of evidence against a null hypothesis. If you're going to be in a very strict setting where you need to make a decision, you know, are we going to say this treatment works or it doesn't work, then you need a threshold like that, but that whole statement, does it or doesn't work, again, assumes that the treatment effect is at 0 or it's some other value, that there's no intermediate value.
And I really think that a more nuanced view is needed in that, you know, p-values are on a continuum, .05 is in the middle of those two values you gave, but it means essentially the same thing. This is not just me, obviously, but there's a whole body of work out there saying that people should not just focus on the p-value, but the actual effect size; what was the actual risk difference or relative risk comparing the two interventions. Whether it was significant at the strict threshold or not is only one piece of the evidence you need to use to decide, as a regulatory body whether this should be something that's approved or whether as a clinician you're going to believe that the treatment's beneficial.
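To make the continuum point concrete (an illustration, not part of the recording), the short Python sketch below converts two-sided p-values of .049 and .051 back into the z-statistics that would produce them under a normal test statistic. The function name is hypothetical, and the normal-test assumption is ours.

```python
from statistics import NormalDist


def z_from_two_sided_p(p_value):
    """Absolute z-statistic that yields the given two-sided p-value."""
    return NormalDist().inv_cdf(1 - p_value / 2)


z_049 = z_from_two_sided_p(0.049)
z_051 = z_from_two_sided_p(0.051)

# The two statistics sit on either side of the conventional cutoff
# (about 1.96) but differ by less than 0.02 standard errors.
print(round(z_049, 3), round(z_051, 3), round(z_049 - z_051, 3))
```

Under this assumption the two results differ by under two hundredths of a standard error, so the observed effects behind them would be nearly indistinguishable.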
>> Thank you, Doctor Tomlinson. Another question I have is does hypothesis testing take on a different context with noninferiority or equivalence trials? >> In one sense it does. The mathematics or the statistics are really very, very similar when you write out the pieces, but the interpretation is quite different. In my experience it's pretty uncommon for clinical medicine to run equivalence studies. They might be used to show that two drugs have a similar availability, you know, and the difference -- you know, one drug to be higher or lower than the other, either direction would be a problem, but mostly studies that are not superiority studies are aiming to show noninferiority, meaning there's a new intervention which you'd like to use because it's perhaps cheaper or more tolerated by the patient, but it might not be quite as good and you're willing to trade off some minor harm on the outcome side for all the benefits on the affordability, tolerability, and so on.
So whereas in a standard superiority study your starting point, your null hypothesis is that the two treatments have the same effect and you're hoping to show that that hypothesis isn't tenable, given the data being observed, that the data are not consistent with there being no benefit, the roles are reversed for the hypotheses in a noninferiority study, in the sense that your null hypothesis, your starting point, is that the new intervention is actually worse and you'd like to disprove that and come up with maybe the converse conclusion that the new intervention isn't worse.
Now, that's really the main difference in the hypothesis testing setting, but there are a couple of other things that make it a more difficult question. One is defining worse. Now, when we say that two treatments have the same benefit in a superiority trial we know that 0 is 0 is 0. Saying that one treatment is no better than another means that the outcomes are identical in the two groups. Now, to say that one treatment is not worse than another is a clinical judgment. Is a half a percent increase in mortality essentially the same; is a quarter of a percent increase in mortality the same, who knows, and that becomes a clinical judgment, and that clinical judgment becomes a key part of the hypothesis testing in the way that the null hypothesis of zero benefit isn't part of the hypothesis testing in a superiority trial.
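As a sketch of how the choice of margin enters a noninferiority test (an illustration, not part of the recording), the Python snippet below runs a one-sided z-test on a mortality risk difference. The null hypothesis is that the new intervention is worse by at least the margin. The function name, the trial counts, the margins of 2 and 3 percentage points, and the one-sided 0.025 level are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_p_value(deaths_new, n_new, deaths_std, n_std, margin):
    """One-sided p-value for H0: risk_new - risk_std >= margin.

    A small p-value is evidence that the new intervention is NOT worse
    than the standard by as much as the margin. Uses a normal
    approximation with an unpooled standard error.
    """
    risk_new = deaths_new / n_new
    risk_std = deaths_std / n_std
    diff = risk_new - risk_std
    se = sqrt(risk_new * (1 - risk_new) / n_new
              + risk_std * (1 - risk_std) / n_std)
    z = (diff - margin) / se
    return NormalDist().cdf(z)


# Hypothetical trial: identical observed mortality (10%) in both arms.
p_margin_2pct = noninferiority_p_value(100, 1000, 100, 1000, margin=0.02)
p_margin_3pct = noninferiority_p_value(100, 1000, 100, 1000, margin=0.03)
print(round(p_margin_2pct, 3), round(p_margin_3pct, 3))
```

With the same data, the conclusion at a one-sided 0.025 level flips depending on whether a 2-point or a 3-point increase in mortality is judged acceptable, which is why the clinical judgment about the margin is a key part of the test.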
The other thing is the key difference from a clinician's point of view in designing a noninferiority trial, you need to think very hard to say what it means to say that two things are similar enough that you can ignore the difference between them. >> Often as clinicians we read studies for clinical trials that look at several different outcomes and present multiple tests and p-values. Is hypothesis testing altered in the context of multiple testing within the same study or trial?
>> That's another question that people have been asking for a long time, and this is not an easy question to answer. So there's a concept of what's called a comparison-wise error rate. So if you just make one comparison and you use, say, an alpha of .05 to delineate significant findings from nonsignificant findings, if there's really no difference between your two treatments in a clinical trial, there's a 5% chance that you're going to come out and say there is a difference, and that's the comparison-wise alpha, it's for that one comparison, your probability of making the incorrect conclusion, saying there's a difference when there is [inaudible] 5%.
So that's fine. If you then go on to a second outcome and say for this outcome I'm going to set alpha at .05, and you test the hypothesis, you reject it if the p-value is less than .05. Again, you're in exactly the same situation as the first question; you got a 5% chance of rejecting the null hypothesis when, in fact, there's no benefit of the treatment for that outcome. So if you look question by question, there's no problem with doing multiple hypothesis testing, and if each question stands alone and is an important question in its own right, some people suggest that it's fine, you don't need to worry about the fact that you have multiple questions.
The problem arises when you have a number of questions and you're going to make a decision that a treatment works if any of the questions is answered in the affirmative. So if you run a trial where you're looking at, say, quality of life and you're looking at mortality and you're looking at hospital stay, and you're thinking if any of those things is beneficial, then I'm happy to say this treatment is a benefit for the patients, then amongst that family of hypotheses, or that family of outcomes that you're going to compare, there's obviously an increased chance that at least one of them is going to show a benefit when, in fact, the treatments are identical in terms of their efficacy.
So in the setting where you have an undifferentiated set of hypotheses and you're going to claim that treatment works, if any of them is rejected, then there's a problem because you're going to falsely claim the treatment's beneficial more than 5% of the time. If you want to make sure that it's only 5% of the time that you make that statement about the group of hypotheses, you need to think of what's called the family-wise error rate. So for that family of hypotheses you need to do something to correct the p-value so that amongst the whole set there's only a 5% chance that you'll say that one of them works, if the treatment is completely ineffective.
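The arithmetic behind the family-wise error rate can be sketched as follows (an illustration, not part of the recording). The calculation assumes the tests are independent, which is a simplification: outcomes such as quality of life, mortality, and hospital stay are usually correlated. The Bonferroni adjustment shown is one common correction chosen here for illustration; the discussion does not name a specific method, and the function name is hypothetical.

```python
def family_wise_error_rate(alpha_per_test, n_tests):
    """Chance of at least one false rejection across n independent tests
    when none of the null hypotheses is actually false."""
    return 1 - (1 - alpha_per_test) ** n_tests


n_outcomes = 3  # e.g. quality of life, mortality, hospital stay

# Testing each outcome at .05 and claiming success if any is significant.
fwer_uncorrected = family_wise_error_rate(0.05, n_outcomes)

# Bonferroni: test each outcome at alpha divided by the number of tests.
bonferroni_alpha = 0.05 / n_outcomes
fwer_bonferroni = family_wise_error_rate(bonferroni_alpha, n_outcomes)

print(round(fwer_uncorrected, 4), round(fwer_bonferroni, 4))
```

Under the independence assumption, three uncorrected tests give roughly a 14% chance of at least one false positive, while the Bonferroni threshold keeps that chance just under 5%.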
So the reason this is important to distinguish is that, you know, sometimes people will publish a paper on a clinical trial and there may be questions from a reviewer about multiple hypotheses, and then they might publish a second paper on a bunch of secondary outcomes. It's very rare for the second set of reviewers to go back and tell the authors to correct all the p-values for the comparisons in the first paper as well. The point is you need to think about what is the collection of hypotheses that I need to, if I want to, control the error rate for; is it all the hypotheses coming out of a trial, is it the secondary hypotheses, is it a group of hypotheses about clinical outcomes, a group of hypotheses about cost outcomes.
And it becomes very difficult to know how large to make the scope for that correction, and for that reason my personal view is not to correct the p-values, but to have a very small number of primary questions and then a number of questions which are more exploratory. >> So you put it in the context of what are the most important questions, and then the secondary questions? >> Yeah. That really sidesteps the issue of correcting for multiple comparisons. And, you know, most trials that I've seen and been involved with do go that route; they have a primary question they're trying to answer that if that question isn't answered in the positive, then the rest of the questions aren't really as important.
And the problem is if you start correcting p-values using any of the methods out there, it becomes very difficult to know how far to take that. You know, for example, I was just re-reviewing, working on a revision of a paper that was going to JAMA and there's one primary outcome, one secondary outcome, there's a collection of exploratory outcomes in one table, and then an appendix that is the collection of resource use outcomes. Now, I don't really know how many of those comparisons I need to pool into this family if I'm going to correct my p-values, so I'd rather not do that, and then if I publish the raw p-values, anybody that wants to do that can do that him or herself.
>> Dr. Tomlinson, is there a difference in the use of hypothesis testing in clinical trials versus observational studies?