Name:
William J. Meurer, MD, MS, and Juliana Tolles, MD, MHS, discuss the use of logistic regression model diagnostics to determine how well a model predicts outcomes.
Description:
William J. Meurer, MD, MS, and Juliana Tolles, MD, MHS, discuss the use of logistic regression model diagnostics to determine how well a model predicts outcomes.
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/23b169bc-680a-44b5-b9b1-7be06c83e21e/thumbnails/23b169bc-680a-44b5-b9b1-7be06c83e21e.jpg?sv=2019-02-02&sr=c&sig=DuRtQfinMC%2BfnTGRVV0jkgQgd5GS7aXsdjr2i0jl3bM%3D&st=2025-01-03T01%3A17%3A52Z&se=2025-01-03T05%3A22%3A52Z&sp=r
Duration:
T00H32M47S
Embed URL:
https://stream.cadmore.media/player/23b169bc-680a-44b5-b9b1-7be06c83e21e
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/23b169bc-680a-44b5-b9b1-7be06c83e21e/18777439.mp3?sv=2019-02-02&sr=c&sig=hHyDMsNhk8su2Pv6p7SwICKXeu%2B88ce3qCU%2FJd6rKAc%3D&st=2025-01-03T01%3A17%3A52Z&se=2025-01-03T03%3A22%3A52Z&sp=r
Upload Date:
2022-02-28T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
[ Music ] >> Logistic regression is one of the most commonly used statistical methods in the medical literature, yet very few clinical readers understand it well. The purpose of the "JAMA" "Guide to Statistics and Methods" articles is to explain complicated statistical methods using language clinicians can understand. It's important for clinician readers of the medical literature to have a basic understanding of the methodologies used in medical research so that they can appreciate the strengths and limitations of the studies they're reading about.
In this "JAMAevidence" podcast, we spoke with authors of an article published in "JAMA" in 2017 entitled "Logistic Regression Diagnostics Understanding How Well a Model Predicts Outcomes." They are Drs. William Meuror and Julianna Tolles from the University of Michigan and UCLA. [ Music ] We're here to talk about logistic regression and logistic regression diagnostics. So can we start with a explanation of what logistic regression is, how it differs from the usual ordinary linear regression people are familiar with, and why that difference exists?
>> In logistic regression, unlike ordinary linear regression, you're trying to predict a binary outcome, which is the kind of outcome we're often interested in in medicine, you know, whether someone lived or died, was hospitalized or not. And that's unlike linear regression, which we may be more familiar with, where you're predicting more of a continuous outcome, like a blood pressure measured in millimeters of mercury. >> And when you do logistic regression, the process of doing the regression is different. It's maximum likelihood estimation instead of just least-squares regression.
And learners of logistic regression struggle with the difference. In least-squares regression, you're finding how close a line is to any individual point. And when you want to see how well your model fits the data, you just look at the square of the distance between the fitted line and the data points and see how big or small that difference is. That's not the case with logistic regression. Could you explain the basics of how you know whether a model's working or not when you do logistic regression?
>> So there are a variety of ways to assess the fit of a logistic regression model. What it basically does is come up with a predicted probability, based on the set of predictors, for each observation in the data set. So if there's a 60-year-old male who received aspirin, and you're looking at the probability that they're going to have a stroke within one year, there's an amount of that prediction that is kind of associated with being 60, being male, and the fact that they got aspirin.
It's going to come up with that predicted probability for each individual in the data set. And when it's assessing how well things fit, basically, it looks at people who really have events and asks whether they would have been predicted to have events, and usually it uses a predicted probability cutoff of about 0.5. So if somebody did have a stroke, and the model predicted that their likelihood of having a stroke was 50% or higher, then that would be counted as a correct prediction, or that the observed and expected are the same.
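As a rough sketch of that process, assuming hypothetical variable names (age, male, aspirin, stroke) and a data set already loaded into a pandas DataFrame, the fit-predict-classify steps might look like this in Python with statsmodels; this is an illustration, not the authors' code:

```python
# Sketch: fit a logistic regression, get a predicted probability for each
# observation, and classify with a 0.5 cutoff. All variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("stroke_cohort.csv")        # hypothetical data set

model = smf.logit("stroke ~ age + male + aspirin", data=df).fit()
print(model.summary())                       # coefficients on the log-odds scale

df["pred_prob"] = model.predict(df)          # predicted probability of stroke
df["pred_event"] = df["pred_prob"] >= 0.5    # the 0.5 "correct prediction" cutoff

# Classification table: predicted events vs. observed events
print(pd.crosstab(df["pred_event"], df["stroke"]))
```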
And this is what then goes on to create the area under the ROC curve, or C statistic, by basically giving us a summary measure as to whether the model is providing a good predicted estimate of risk, and whether the observed risk is well concordant with that. >> Let me take that a step back. Let's start by defining a receiver operating characteristic curve.
>> The way I would sort of explain it in lay terms is that it's basically, across all the cutoff points, the chance that you correctly rank two individuals, identifying the higher-risk one as higher risk than the lower-risk one. Will, do you have a thought? >> I think I was more under the impression of the idea of all of the possible cutoffs -- you know, as I talked about that cutoff of 0.5, how well it classifies based on different cutoffs of the summary of all of the predicted probabilities.
I know that many clinicians are more familiar with the use of the ROC in the more direct case, when you're looking at studies of diagnostic accuracy with sensitivity versus 1 minus the specificity. At one end of the range of cutoffs, the model is going to say nobody has the outcome; on the other end, it's going to say everybody has the outcome, based on some sum of the predictors. And in the middle, it's looking for the point where there's that balance between overcalling and undercalling. And the mass that is under the curve is this summary measure, ranging from, you know, 0 to 1, of how good the model is at discriminating.
Whereas a model that predicts completely by chance would have an area under the ROC curve of 0.5, the models that we see often in logistic regression have values of 0.7 or 0.8, and values over 0.9 are considered excellent. So when this is being used to assess whether the diagnostic tool -- or in this case, a prediction rule -- is useful, that can sometimes be used as a rule of thumb.
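As a minimal sketch of how that summary measure is typically computed, assuming the hypothetical fitted model and DataFrame from the earlier sketch, scikit-learn can sweep all possible cutoffs and report the area under the ROC curve:

```python
# Sketch: ROC curve and C statistic (area under the ROC curve) for the
# hypothetical model above.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = df["stroke"]
y_prob = df["pred_prob"]

# Sensitivity vs. 1 - specificity at every possible cutoff
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

c_statistic = roc_auc_score(y_true, y_prob)
print(f"C statistic: {c_statistic:.2f}")   # 0.5 = chance; values over 0.9 are often called excellent
```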
>> Logistic regression diagnostics are based in large part on calibration and discrimination of these models. So could you define for us what calibration is and what discrimination is? >> Sure. Discrimination is the ability to assign someone to the correct group. So, you know, you're able to predict that you assigned them to the outcome that they actually have. You know, whether it's a bird or a plane, correctly classify them. Calibration has to do with more quantitatively assigning the correct level of risk.
So maybe you correctly are able to classify people, but you overestimate their risk for an outcome. You know, you said it was 80% rather than 75% in terms of the probability of the outcome. So it has to do with kind of two different ways of thinking about what you're predicting. >> And how do you use discrimination and calibration in a research paper to assess the effectiveness of a logistic regression model, or some sort of predictive model?
>> A lot of it depends on the goals of the research. Sometimes it's very important for prognostication purposes to identify patients into higher and lower-risk categories, whereas the precise likelihood or proportion who are going to experience the outcome is less clinically relevant. So discrimination is often more useful for the types of models applicable to clinicians in clinical care.
However, the model has to have some degree of calibration, as it can be very hard to make predictions when you're on either end. If almost everybody is going to do very well, or almost everyone is going to do very poorly, it's hard for the models to improve upon that because the natural history is so strong. So if it's making small predictions, but your classifications are finding that somebody who's at 1% risk is your low-risk category and somebody who's at 1.2% risk is your high-risk category, then your model may discriminate well, but because the calibration is not finding people at a clinically meaningful increment of risk, it may not be as helpful.
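One common way to look at calibration, sketched below under the same hypothetical model and data as before, is to group people by predicted risk and compare the average predicted probability in each group with the observed event rate:

```python
# Sketch: simple calibration check -- compare mean predicted risk with the
# observed event rate within deciles of predicted probability.
df["risk_decile"] = pd.qcut(df["pred_prob"], 10, labels=False, duplicates="drop")
calibration = df.groupby("risk_decile").agg(
    mean_predicted=("pred_prob", "mean"),
    observed_rate=("stroke", "mean"),
    n=("stroke", "size"),
)
print(calibration)   # a well-calibrated model has mean_predicted close to observed_rate
```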
So it often depends on the context, but in medicine the context mostly comes down to getting the risk right. There are certain applications where it's important, in large data sets, to get the calibration right. An example is risk adjustment for, say, mortality, which is used to ensure that hospitals have some way of accounting for perhaps handling sicker populations without being penalized by payers.
So in those sorts of applications, calibration can be very important. >> How do you interpret the -- so, oftentimes models and the effect of adding or subtracting variables to, say, risk prediction models is assessed by changes in the C statistic, which is the same, basically, as the area under the curve. How does that work? >> I guess to sort of answer the utility question, I do sometimes feel that perhaps the C statistic is a little bit divorced from the clinical reality of how we're trying to use a model to discriminate.
You know, a lot of these sort of models that are designed to screen people into ultra low-risk categories, like the PERC rule for pulmonary embolism, you're really only using the model at a single cut point in practice clinically. And for these screening sort of tests, you would like it to be extremely sensitive and maybe not that specific, you know? So the C statistic can be hard to interpret in terms of understanding how useful this is to me clinically, because it really depends on how I intend to use the model.
>> And I would just add that because of its features, both in looking at a logistic regression model or looking at other sorts of tests of diagnostic accuracy, it is rare that we as clinicians are in a position where we are weighting sensitivity and specificity equally. A lot of times, as Dr. Tolles notes about the PERC rule, we really want to be sure that we are going to pick up anybody who may have a devastating condition, so we are willing to accept, and sort of rank, sensitivity higher than specificity, whereas the ROC curve kind of looks at them similarly.
And I think, in terms of model building, when we look at how fit changes by adding and subtracting variables, typically other methods, like likelihood ratio tests or the Bayesian information criterion, are used to assess whether additional predictors are adding more value to the model than they are adding noise, with some penalty, knowing that if you add lots and lots of variables, you can likely explain more of the variability in the process that you're observing.
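A small sketch of that kind of comparison, assuming a hypothetical extra predictor (diabetes) added to the earlier hypothetical model: fit the nested models, then compare them with a likelihood ratio test and with AIC/BIC rather than with changes in the C statistic alone.

```python
# Sketch: compare nested logistic models with a likelihood ratio test and AIC/BIC.
# "diabetes" is a hypothetical additional predictor.
from scipy import stats

base = smf.logit("stroke ~ age + male + aspirin", data=df).fit()
bigger = smf.logit("stroke ~ age + male + aspirin + diabetes", data=df).fit()

lr_stat = 2 * (bigger.llf - base.llf)          # likelihood ratio test statistic
p_value = stats.chi2.sf(lr_stat, df=1)         # one added parameter
print(f"Likelihood ratio test p = {p_value:.3f}")
print(f"AIC: {base.aic:.1f} vs. {bigger.aic:.1f}")   # lower is better
print(f"BIC: {base.bic:.1f} vs. {bigger.bic:.1f}")   # BIC penalizes extra parameters more heavily
```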
>> Yeah, that's a critically important point, because a lot of people use changes in the C statistic to determine whether a model is improving or not when variables are added or subtracted. And that's not an effective way to do it. It's done in the literature all the time, but as you pointed out, using some kind of likelihood-based statistic like the BIC or the AIC is really a better way to do that. >> I would also add -- and this is not really specific to logistic regression, and it's a little easier to understand in terms of maybe a diagnostic accuracy test, but even within logistic regression it potentially can be done.
Another thing that can be important to look at is how you're changing the misclassification in your model. How many -- you know, which of the observations are sort of shifting based on the addition of a variable -- how many more predictions end up being good based on the addition? How many are accurate in terms of true positives? But how many -- what is the cost in terms of how many additional false positives that additional variable is bringing into the model?
So that's another way to think about it, at least conceptually: at some level, that's what all of these types of summary measures are trying to do, get more true positives into the model without putting in too many false positives. >> I don't remember exactly what you said, but you said earlier something about risk adjustment and the need to have it well -- was it a calibrated or a discriminating model? >> In that case, I did say that it's important for that to be well calibrated.
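A minimal sketch of the reclassification idea just described, assuming the two nested hypothetical models above: count how many additional true positives the new predictor gains and how many extra false positives it brings in at the 0.5 cutoff.

```python
# Sketch: how does adding a predictor shift classifications at the 0.5 cutoff?
base_pos = base.predict(df) >= 0.5
bigger_pos = bigger.predict(df) >= 0.5
events = df["stroke"] == 1

newly_positive = bigger_pos & ~base_pos
print("Additional true positives: ", (newly_positive & events).sum())
print("Additional false positives:", (newly_positive & ~events).sum())
```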
>> The article you wrote in the March 14, 2017 issue of "JAMA" about logistic regression diagnostics was based on another "JAMA" publication that used these techniques. You found that the sample article by Zemek, which was published in the March 8, 2016 issue of "JAMA," would be a good one for investigators who want to use logistic regression to serve as an example for how to describe their methods and show their results. Can you tell us about that? >> When we wrote this article, we had the companion article from the journal that we were sort of using as an illustrative example, the article by Zemek, where they were trying to identify which pediatric patients with concussion were ultimately going to have prolonged post-concussion symptoms, which is important for prognosis, for developing treatments, and a variety of other things.
And the process that they went through in that article, the types of plots and the other types of diagnostics that they provided really aided the understanding for both methodologically savvy and well-trained readers, but I think also for readers who aren't as deeply familiar with these methods by having good data visualization and showing both measures of calibration and discrimination by kind of showing how many people were predicted to be in a risk category, and then what the ultimate event rates were in those risk categories.
So if you're in a position where you're reviewing a paper, or you're reading a paper, or you're thinking of writing a paper, I think looking at that article by Zemek, and considering using it as a model for how to set up plots and other measures to really back up the story of whether this is a useful model, can be very helpful, you know, to science and the clinical community. >> Logistic regression analyses are expressed in odds ratios, and odds ratios are kind of difficult for clinicians to understand.
Can you explain them for us? >> The way that -- I think part of it is giving this in sort of the simpler example of, you know, if you have 3-to-1 odds. So that is this idea of if your proportions add up to 100%, maybe you have a 75% risk of death versus a 25% risk, and that would be, you know, 3 to 1.
And that is the odds of mortality under one condition, noting that, you know, you have to have one or the other since it's a binary state, so 75% risk of death. So then you get an odds. But then if you're looking at a different proportion of death or risk of death on the absolute scale, let's say that that's somebody with 25%, then you have this 1-to-3 odds, and that's, you know, again, this is sometimes how people who do betting and gambling think.
But sometimes we think more about it as: you have a 75% chance of dying versus a 25% chance of dying. If you look at that on the odds ratio scale, the odds on top, for the 75% risk of death, are 3 to 1. The odds on the bottom, the denominator, are 1 to 3. So that gives you an odds ratio of 9. But you know that 75 divided by 25 is 3. So that's a risk ratio of 3.
What the problem is, and why people, I think, get hung up on this, is that when the event rates are somewhere in the range between 20 and 80%, the odds ratio diverges quite a bit from the risk ratio. But if the risks and the absolute proportions that you're looking at are smaller, like, say, a 2% versus a 3% risk, then the odds ratios and the risk ratios actually look a lot more similar. So I think at least having a sense as to where they come from and where they diverge can help you understand them more.
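The arithmetic in that example, written out as a tiny sketch:

```python
# Sketch: odds ratios vs. risk ratios at high and at low event rates.
def odds(p):
    return p / (1 - p)

# 75% vs. 25% risk: the two measures diverge a lot
print(odds(0.75) / odds(0.25))   # odds ratio = 9.0
print(0.75 / 0.25)               # risk ratio = 3.0

# 3% vs. 2% risk: the two measures are nearly the same
print(odds(0.03) / odds(0.02))   # odds ratio ~ 1.52
print(0.03 / 0.02)               # risk ratio = 1.5
```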
I think people often do have a follow-up question of why you would use an odds ratio at all when it does seem to have this element of taking away context. And in the unadjusted case, I would absolutely agree that the actual event rates are far more useful. However, when you start to build logistic regression models, because of the way that the variables and the outcomes and the predictors are transformed, the output necessarily comes on this odds ratio scale, and while you can convert things back to predicted probabilities for each predictor, that can get a little unwieldy because you have to set all the other ones equal and, you know, it can be done and sometimes it should be done, but it just is a little bit more complicated.
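As a rough illustration of that conversion, again under the hypothetical model sketched earlier: hold the other predictors at chosen reference values and ask the fitted model for predicted probabilities with and without the predictor of interest.

```python
# Sketch: turn model output back into predicted probabilities by holding the
# other predictors fixed at chosen (here arbitrary) reference values.
scenarios = pd.DataFrame({
    "age": [60, 60],
    "male": [1, 1],
    "aspirin": [0, 1],   # contrast: no aspirin vs. aspirin, all else equal
})
print(model.predict(scenarios))   # two predicted probabilities, on the risk scale
```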
So I think because the odds ratio is weird, I think it's reasonable for people to ask, "Why would we ever use it?" And a lot of the reason that we use it is because the output in logistic regression is most cleanly summarized using odds ratios. >> What about risk ratios? Aren't they easier to understand? >> I would definitely agree that I think risk ratios, when presented, are more interpretable.
But I think also some degree of -- and now I'm not talking really in the model assessment, but just to help the reader in terms of calibration is also helpful in that you need to know if the risk is going from 1% to 1.001%. Depending on the size of your data set, that may be -- that's maybe a 1% increase in risk, but it's still a small absolute increase in risk. So I think risk ratios, if they can be -- and oftentimes they can be -- incorporated into reports can be more helpful to the readers, but you do also just need to give them some roadmap to understand where they are and how much additional risk is really being conferred by that predictor.
Because that's another thing that sometimes in these multivariable models one loses sight of, that you do have this additional -- you know, let's say you're doing a prediction for disability after stroke, and you have age and stroke severity. You can find potentially interesting additional predictors, but it turns out age and stroke severity tend to predict a lot. So even pretty interesting risk ratios for additional things may represent relatively small absolute changes in risk.
So I think to whatever degree the authors can help put things onto a more intuitive scale, like risk, both in terms of the absolute risk and the risk ratios, I think that tremendously helps readers and the scientific community. >> So this may be a little technical. But you can't just use logistic regression for any data you have lying around. The statistical models have these assumptions, and those assumptions must be met before the technique can be applied. Can you tell us about those assumptions?
>> Some of the assumptions that are there for regular linear regression are present. So you've got to assume that your predictor variables, the independent variables, don't have a high degree of collinearity, meaning that they're highly correlated, because if you put two things in your independent predictors that are highly correlated, then there can be real problems with your model in estimating the coefficients associated with it because tiny changes in the inputs can change how much the model is attributing the outcomes to those variables.
So an example would be sort of, like, lactate and measures of shock. You know, you expect those to be correlated. So if you put them both in your model, you may have difficulty estimating their individual effects. So it's best to select things that you think are clinically valuable but not highly correlated. And there are diagnostics you can look at to see if you've accidentally included two things that are collinear. >> Yeah, that's a big deal. Investigators tend not to think about collinearity or correlation between variables, but it's really common in the papers we see at "JAMA," where two variables run together and cause all kinds of problems for regression analysis.
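One such diagnostic is the variance inflation factor; a minimal sketch using statsmodels, with lactate and shock_index as hypothetical, likely-correlated predictors:

```python
# Sketch: screen predictors for collinearity with variance inflation factors (VIF).
# "lactate" and "shock_index" are hypothetical predictors that may be correlated.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["age", "lactate", "shock_index"]])
for i, name in enumerate(X.columns):
    # VIFs well above roughly 5-10 suggest a collinearity problem
    print(name, variance_inflation_factor(X.values, i))
```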
And when you have collinearity, you can't really write up meaningful conclusions from the analysis. And it's really important not to have variables that are correlated in the same regression equation. And that's something investigators should be looking for before they do a regression. What about other assumptions? >> Another assumption is that the linear relationship holds across the whole range of the values for your variable. So one that can be quite problematic is if people just include age as a predictor and include it across the entire range of ages that they're examining.
You're assuming that the change in odds is the same across the whole range. But you could imagine, for some sort of outcome, you wouldn't expect the change in risk to be the same as your patients progress from age 30 to age 40 as it would be when they progress from age 70 to age 80. So one solution to that is to actually divide up your continuous variable into blocks and make it a categorical variable, to eliminate that assumption that the relationship is linear across the entire range.
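A small sketch of that solution under the hypothetical stroke model used above: replace the single linear age term with age categories so each age band can carry its own change in odds (a transformed term such as age squared, discussed next, is another option).

```python
# Sketch: relax the linearity-in-the-log-odds assumption by categorizing age.
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 55, 70, 120],
                         labels=["<40", "40-54", "55-69", "70+"])
categorical_model = smf.logit("stroke ~ C(age_group) + male + aspirin", data=df).fit()
print(categorical_model.summary())   # one coefficient per age band instead of one slope

# Alternative: keep age continuous but add a quadratic term
quadratic_model = smf.logit("stroke ~ age + I(age**2) + male + aspirin", data=df).fit()
```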
>> And to build on that, you know, I would ask, why wouldn't we always categorize? Sometimes you can do transformations; with age, maybe you want to use age squared or age cubed. And that, when you're building the model, uses fewer parameters to describe the data. And when you can use fewer parameters, that is typically good, particularly if you don't have a huge number of events. So there are some reasons to try to be judicious in the number of predictors you put in a model, and there is a balance between, you know, that and using categorization of a continuous variable.
But a lot of times, when you're doing epidemiologic research in medicine, you just really don't know what sort of form some of these relationships are going to have. And with age, there are definitely non-linear associations with almost everything. So you just have to be careful. And that's part of why you would see age being categorized at times. >> I think that's a very important point of discussion, because I tell investigators, "Avoid categorizing continuous variables as much as possible," because you lose power by doing that.
And also, you have these funny things happening at the transitions of categories. So if you break up lactate into above 10 and below 10, is 9 different than 11? But they're in two different categories, two very different categories. So as a general rule, I advise them not to do that, though what you just explained in terms of a need to categorize -- because the changes at different ages have different clinical ramifications -- is a very legitimate reason to categorize an otherwise continuous variable.
That was a great point. >> And certainly, I would agree. There are a lot of types of scales that I would say are what we would call ordinal, non-interval, so that a 1-point jump in lactate from 12 to 13 is very different from that 1-point jump in lactate from 2 to 3. That being said, most of the observations you're going to see are probably going to be between 1 and 4 anyway.
So there may be limited problems associated with using it continuously because the majority of the action is occurring in a relatively contracted part of the scale. And this is something that one should think about when assessing the work and looking at the thoughtfulness, and really why the investigators and authors made the decisions they did about these things. And the more transparency they can have in their methods as to why they made those choices, the better.
But in general, it is better to use all the information. You're making a trade-off, though: is it better to use the information continuously, or to understand that you're going to be violating an assumption of regression more because the relationship is very non-linear at the ends? So that's always something to think about. >> So something else that comes up all the time is interaction.
Can you explain interaction terms for us? >> When I try to think of interaction terms in sort of a linear regression space, let's say I'm predicting blood pressure, and if there's an interaction term between age and gender, you know, you can think of older males are going to have their blood pressure become higher at a greater rate than older females. But amongst younger people, males and females may have a very similar rate of increase of their blood pressure or expected blood pressure.
And I think when you start to -- it's hard enough to understand it in linear regression. But then when you move that step to having this binary outcome, I think one of the things that helps me kind of think about it is that, you know, under the hood of the logistic regression, you are coming up with these predicted probabilities for real observations as to what they are going to have for the outcome. And sometimes in papers I've given plots of, you know, here's the predicted probability of death for males versus females with an interaction term for age, and you can start to see the risk diverge if there is that presence of an interaction.
So I think interaction terms are important to consider as a way of explaining data sets and explaining observations, but they can be hard to understand in general, and they can become harder to understand even in logistic regression. So another important thing to keep in mind and just -- and try to think through -- oftentimes I think graphics tend to help me visualize these sorts of relationships when possible.
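A minimal sketch of an interaction term and that kind of plot, again with the hypothetical predictors used above (here age and sex):

```python
# Sketch: add an age-by-sex interaction and plot how predicted risk diverges.
import matplotlib.pyplot as plt

interaction_model = smf.logit("stroke ~ age * male + aspirin", data=df).fit()

ages = pd.DataFrame({"age": range(30, 91)})
for sex, label in [(1, "male"), (0, "female")]:
    grid = ages.assign(male=sex, aspirin=0)   # hold aspirin fixed
    plt.plot(grid["age"], interaction_model.predict(grid), label=label)
plt.xlabel("Age")
plt.ylabel("Predicted probability of the outcome")
plt.legend()
plt.show()   # diverging curves suggest an age-by-sex interaction
```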
>> When you're looking for interactions, do you just run the model with a bunch of interaction terms and just see which are significant or not, or do you do it prospectively, thinking, "Well, I think, you know, knowing what I know about physiology and medicine, that there should be an interaction between sex and blood pressure, so I will put that in my model"? Or how do you go about deciding what interactions to look for and which ones are important and which ones are significant? >> My take on it would be just like you would build other terms in the model.
Because of course the more variables you put in, the more chance that you're just sort of finding some noise in the data. But let me back up. The way I would do it is I would place them in the model based on your clinical understanding and opinion that there should be an interaction, just the way you would choose terms for the model that were not interaction terms. >> Logistic regression is a very commonly used statistical methodology in the medical literature. It's used when there's a binary outcome, something like life versus death, complication or not.
Logistic regression has many assumptions that must be met before it can be used, and these include that there's no collinearity or correlation between two variables in the same regression equation. When reading articles using logistic regression, you should look to see if investigators have examined for the presence or absence of collinearity. Logistic regression also yields odds ratios, and they're a little bit cumbersome to interpret. It's important to know that odds ratios approximate risk ratios, and it's risk ratios that we want to know for use in clinical medicine.
When the probability of an outcome is low, odds ratios and risk ratios are about the same. But when the probability of the outcome is large, say more than a 15% event rate, then the odds ratios obtained from logistic regression no longer approximate risk ratios, and the odds ratios derived from logistic regression should not be relied upon when the outcome is more common than about 15%. Logistic regression is also used all the time for predictive modeling. And when you're doing so, the diagnostics for those models are really important.
And you should be looking at a model's calibration, or the ability of a model to predict what an actual event rate is. You also should be looking at its discrimination, which is the ability of the model to predict which individual is more likely to have the outcome than another. Oftentimes, the ability of a model to discriminate between individuals who do or do not have an event is summarized using the C statistic. A C statistic of 0.5 means that the model doesn't discriminate at all. And if it's 1.0, it discriminates perfectly.
Usually, C statistics are on the order of about 0.6 or 0.7, suggesting not particularly great discrimination. But if a C statistic is around 0.9, the model discriminates fairly well. It's important also to remember that just because a model discriminates doesn't mean it's well calibrated. All the C statistic tells you is that when somebody who has an event is compared to someone who does not have an event, the probability of the event derived from the model is higher for the person who actually had the event.
Those probabilities can be very far away from the real probability of the event, and that real probability should be assessed by looking at the model's calibration. I'd like to thank Drs. Meurer and Tolles for speaking with us today, and most especially for writing the "JAMA" "Guide to Statistics and Methods" article that appeared in the March 14, 2017 issue of "JAMA," entitled "Logistic Regression Diagnostics: Understanding How Well a Model Predicts Outcomes." That article also appears in "JAMAevidence" in the "JAMA" "Guide to Statistics and Methods." This episode was produced by Daniel Morrow.
Our audio team here at the "JAMA" network includes Jessie McCorters and Shelley Stephens, Maylyn Martinez from the University of Chicago, Lisa Harden, and Mike Berkowitz, the Deputy Editor for Electonic Media at the "JAMA" network. I'm Ed Livingston, Deputy Editor for Clinical Reviews and Education at "JAMA." Thanks for listening.