Name:
Jing Cao, PhD, discusses multiple comparison procedures.
Description:
Jing Cao, PhD, discusses multiple comparison procedures.
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/5f71f5a7-1a73-44e3-99ad-633f0d98f699/thumbnails/5f71f5a7-1a73-44e3-99ad-633f0d98f699.jpg?sv=2019-02-02&sr=c&sig=ExQRz49p1oO5lQA%2BrShJ%2Bwd7cF4vbldEZ0T9uNFQb2U%3D&st=2022-05-27T19%3A07%3A47Z&se=2022-05-27T23%3A12%3A47Z&sp=r
Duration:
T00H22M47S
Embed URL:
https://stream.cadmore.media/player/5f71f5a7-1a73-44e3-99ad-633f0d98f699
Content URL:
https://asa1cadmoremedia.blob.core.windows.net/asset-2fb68f27-d8d7-4066-958c-7d108b35bffd/18822562.mp3
Upload Date:
2022-02-23T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[ Music ] >> Computer systems are so powerful these days that investigators can assemble huge databases and easily perform a multitude of statistical tests. In doing so, it's easy to come up with false positive results. How this can happen is not easily understood when using statistical jargon such as null and alternative hypotheses, family-wise error rates, et cetera. But there is an easy way to understand these concepts, and that's to explain them in terms of shooting arrows and the probability of hitting some target. In this JAMAevidence podcast, we talk about a statistical problem that occurs very frequently in the medical literature.
That of performing statistical comparisons many times and the risk of arriving at false positive conclusions. Today, we discuss these concepts in terms that anyone can understand. We're here to talk about the complexities of multiple comparison procedures. And that was described in an article published in the August sixth, 2014 issue of ''JAMA''. Today, in the JAMAevidence podcast, we have the expert on the topic who wrote the article. So, why don't we begin by having you introduce yourself? >> My name is Jing Cao. I'm Associate Professor of Statistics at Southern Methodist University.
>> We start with a traditional explanation of the risk of reaching false conclusions when performing too many statistical tests on the same data, looking at the same research questions. Bear with me. The easy explanation comes later. >> In the traditional application of hypothesis testing, one has a single primary outcome. And the goal is to define a procedure for testing the hypothesis about that outcome. And because of randomness, we want to control the chance of falsely concluding there is a significant effect.
So, the traditional type I error and the corresponding P-value are used to guard against that, so that there's only a very small chance of falsely concluding there is a significant effect when, in fact, there is no such effect. >> When discussing statistics, there's a tendency to get lost in the terminology. Terminology that's confusing because of awkward language, double negatives, and the like. So, before we get too deep into this discussion, let's define some terms.
These definitions come from the JAMAevidence glossary of statistical terminology found at jamaevidence.com. A null hypothesis is a statement used in statistics asserting that no true difference exists between comparison groups. False positives occur when somebody doesn't have the problem you're looking for, but when you do a statistical test, it comes up positive for the disorder. Type I error is also called alpha error. This is an error created by rejecting a null hypothesis when, in fact, it's true. In other words, investigators might conclude that an association exists among variables when it actually doesn't.
Type II error is also called beta error. That's an error created by accepting a null hypothesis when it's actually false. In other words, investigators might conclude that no association exists among variables when, in fact, there is an association. The false discovery rate is the expected proportion of false positives among all discoveries. The family-wise error rate is the probability that quantifies the risk of making any false positive inference across a group or family of tests.
Another way to look at this is to ask what the probability is of having at least one false positive test if one performs a multitude of statistical tests. So, these are the technical terms that you may be familiar with. But they're abstract concepts. How can you understand all this in terms that are easy to conceptualize? Think in terms of firing an arrow at a target. The analogy is that firing an arrow at a target is like performing a statistical test. The arrow has some probability of hitting that target. And its complement is the probability of missing the target.
This is how Doctor Cao explains it. >> Now, let me take archery, for example. Archery means a person shoots an arrow at a target. Okay. So, I want to know if that person is an amateur or a professional. The null hypothesis is the null effect: this person is an amateur who knows nothing about shooting an arrow. The alternative is: oh, this person is a professional who is really good at shooting an arrow.
Now, under the null hypothesis, the person is an amateur. There is a five percent chance the person will hit the target if he shoots randomly. That is what is called the type I error. Now, suppose this person shoots quite a few arrows, not just one arrow. If he shoots one arrow, this amateur has a five percent chance of hitting the target. You can imagine if we let this person shoot more than one arrow, dozens, then the chance of at least one arrow hitting the target will no longer be five percent.
Right. That's intuitive. So, that means if you allow this person to shoot arrows on and on and on, random chance will eventually let at least one arrow hit the target. That is the intuition that when you have more than one test, when you have multiple tests, the chance of making at least one discovery, of having at least one arrow hitting the target, will be more than five percent.
Actually, as the number of tests, the number of arrows you allow, increases, the chance of making at least one hit grows beyond five percent. It will increase with the number of comparisons or tests. >> So, if we have the amateur, and that amateur fires an arrow at a target, he or she has a five percent probability of hitting it just by chance, and a 95% probability of missing it, just because he or she is an amateur.
What's the probability of hitting the target if he fires the arrow twice? >> So, hitting at least one is the complement event of missing both. Right. Hitting with one arrow is five percent, so missing with one arrow is 95%. Missing with both is .95 squared. So, hitting at least one is one minus .95 squared, which is .0975. In other words, after rounding, hitting at least one is about 10% as opposed to five percent. That's double the individual error rate. Do we see that? >> So, essentially, if I look at the math in your paper, it winds up being 10%. >> Right. Exactly. >> Okay. So, what happens is if they fire two arrows, instead of a five percent chance of hitting the target, now it's 10%.
And if they fire three, based on the formula that you provided in your paper, which is one minus .95 cubed, it's 14%. >> Exactly. If we increase it to 10 arrows, it goes to 40%. >> Right. >> And if we include 20, it's way beyond 50%. It's 64%. And you can see the consequence. >> So, that is a fantastic explanation for why you need to correct for multiple comparisons.
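All of the percentages quoted in this exchange come from the same formula, one minus .95 raised to the number of arrows. As a quick illustrative sketch (the function name here is ours, not from the article):

```python
# Chance of at least one hit (at least one false positive) across n
# independent tests, each with a 5% individual error rate: 1 - (1 - alpha)^n
def familywise_error_rate(n_tests, alpha=0.05):
    return 1 - (1 - alpha) ** n_tests

for n in (1, 2, 3, 10, 20):
    print(f"{n:2d} arrows -> {familywise_error_rate(n):.1%}")
# Two arrows give about 10%, ten give about 40%, twenty give about 64%
```

Running this reproduces the progression described above: the chance of at least one hit climbs quickly as the number of arrows grows.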
Because if you keep firing arrows, or you keep looking at the data, you're eventually going to hit something that looks like it's statistically significant. If you fire enough arrows, you'll eventually hit the target. And if you keep testing the same data, you'll eventually find something that appears to be statistically significant, but only appears that way by chance. So, now knowing that we can't just keep looking at data and doing test after test after test, looking for significance, how do you correct for multiple comparisons?
>> If multiple comparisons are involved in the study, we no longer work with what is called the individual error rate. Now, we define something called the family-wise error rate. Let me start with the definition. Then, I will give you a vivid example. The family-wise error rate is the probability of making at least one false discovery. Okay. So, that is the probability of making at least one false discovery among all the tests you are considering.
Now, in the archery example, this family-wise error rate is the probability of having at least one arrow hitting the target when the person shoots a bunch of arrows. In this archery example, the null hypothesis is: this person is an amateur. And the alternative hypothesis is: this person is a professional. Now, if this person only gets to shoot one arrow, if we only allow this person to shoot one arrow, then we only control what is called the individual error rate.
And you can imagine this individual error rate can be translated into setting the target pretty big. Say, one meter in diameter, so that the chance of one arrow hitting the target is five percent. Okay. Now, if we let the person shoot 10 arrows, for example, and we say that as long as there is one arrow hitting the target, this person is not an amateur but a professional, then in order to control this family-wise error rate.
That is, at least one arrow out of 10 arrows hitting the target, we need to have a smaller target. Instead of using this one-meter-diameter target, probably we should use a quarter-meter-diameter target. A smaller target, such that the chance of an individual arrow hitting the target is much smaller than five percent. But when you think about all 10 arrows together, the chance of at least one arrow hitting the smaller target is still controlled at five percent.
So, the individual error rate can be thought of as a big target in the archery example. Now, in order to control the family-wise error rate, which is the chance of at least one arrow out of 10 hitting the target, the target needs to be smaller. Smaller means we cannot use the five percent individual error rate. At the level of each individual test, you need to use a smaller target so that the overall error rate is still controlled at five percent.
>> So, shooting arrows is a good way to explain the effect of performing multiple tests on the same data when looking at the same research question. If somebody who doesn't have any expertise in archery shoots arrows at a target, there's a five percent chance that the target will get hit when one arrow is fired. The probability of hitting the target goes up with each successive arrow. In fact, if 20 arrows are fired, there's a 64% chance that the target will be hit at least once. The same thing happens when performing a number of statistical tests within one experiment.
Because those tests typically have a five percent chance of producing a false positive result, if enough tests are performed, one of them will eventually be positive by chance alone. It's not uncommon for investigators to report that one positive finding when trying to identify an association between some risk factor and a clinical outcome. Thus, it's really important to consider whether multiple statistical tests were done when analyzing research findings. It's okay to do multiple comparisons if one accounts for the possibility that false positive findings might occur.
This is usually done by adjusting the threshold using a procedure called the Bonferroni method. The threshold for what counts as a positive test is changed. For example, instead of setting the threshold at a value of .05, which means there's a five percent probability of having a false positive test, if one does 20 tests, the corrected threshold of positivity is .05 divided by 20, or .0025. That means there's a .25 percent probability of having a false positive for any one test.
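In code, the Bonferroni adjustment is just a division of the overall error rate by the number of tests. The helper below is an illustrative sketch of that arithmetic, not code from the article:

```python
# Bonferroni correction: divide the desired family-wise error rate
# by the number of tests to get a stricter per-test threshold.
def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests

# 20 tests at an overall level of .05 -> per-test threshold of .0025
threshold = bonferroni_threshold(0.05, 20)

# Only p-values below the stricter threshold count as significant
p_values = [0.03, 0.0009, 0.04, 0.001]
significant = [p for p in p_values if p < threshold]
```

With the corrected threshold, the two p-values of .03 and .04, which would pass an uncorrected .05 cutoff, no longer count as discoveries.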
Using our archery example, if one wanted to identify who might be an expert at archery compared to someone who has no experience, and they both were to fire 20 arrows at the target, then by making the target itself much smaller, there'd be very little likelihood that the amateur would hit it, and a much higher probability that the expert would hit it at least once. When multiple comparisons are made in an experiment, do you always need to adjust for multiple comparisons? Or are there certain conditions where you do have to, and other conditions where you don't?
>> Yes and no. That means if your tests are used to address one question, like the archery example, where the person is allowed to shoot several arrows and all the arrows are used to judge whether this person is an amateur or not, then we do need to adjust for multiple comparisons. But suppose this person shoots an arrow, and then this person shoots a basketball, and then this person plays some other game.
So, there are different tests, but each test is used to address a different question. One is about archery, another test is about basketball, and there's another one about maybe another game, you know, hitting a tennis ball or something. If they are not related to answering the same question, then we don't need to adjust, even though there are multiple tests, because each of them respectively answers a different question.
In this case, we don't need to adjust. It's only when multiple tests are used to address the same question that we need to consider multiple comparisons. >> So, it's okay not to correct for multiple comparisons if different research questions are being answered. Say you have a big database, and in one analysis, you choose to look at mortality from some medication. And in another analysis, you choose to look at complications from that medication. If you did multiple statistical analyses looking at mortality as an outcome, and then multiple statistical analyses looking at complications as an outcome.
Each of those sets of analyses would need corrections for multiple comparisons. But if you did one test looking at the relationship between a risk factor and mortality, and another test looking at the relationship between some risk factor and complications, multiple comparison procedures are not needed because the outcomes analyzed are different. You only need to do multiple comparison adjustments if many statistical tests are performed looking at the same research question. I'd like to thank Doctor Jing Cao from the Department of Statistics at Southern Methodist University in Dallas, Texas, for talking with us today on this podcast.
And also, for writing the article on multiple comparisons published in the August sixth, 2014, issue of ''JAMA''. What you heard from Doctor Cao is that you can envision multiple comparisons as an archer firing arrows at a target. If they keep firing arrows, they'll eventually hit the target. And one way to adjust for that increased ability to hit the target by randomly firing arrows is to make the target smaller, just as we make the P-value threshold smaller in multiple comparison procedures.
This episode was produced by Shelly Stephens. Our audio team here at the JAMA Network includes Jesse McQuarters, Daniel Morrow, Maylyn Martinez from the University of Chicago, Lisa Hardin, and Mike Berkwits, the Deputy Editor for Electronic Media at the JAMA Network. I'm Ed Livingston, Deputy Editor for Clinical Reviews and Education at JAMA. Thanks for listening. [ Music ]