Name:
JAMAevidence - Heather G. Allore, PhD, MS, FGSA, discusses the use of latent class analysis to identify hidden clinical phenotypes.
Description:
JAMAevidence - Heather G. Allore, PhD, MS, FGSA, discusses the use of latent class analysis to identify hidden clinical phenotypes.
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/eee8f9f7-9b21-4301-8030-7b5fda3b5cd8/thumbnails/eee8f9f7-9b21-4301-8030-7b5fda3b5cd8.jpeg?sv=2019-02-02&sr=c&sig=ZZ8dNT6V4mnutlG8wQdgNArh6m7DNifjcM%2Be3KwjDVc%3D&st=2025-05-11T17%3A43%3A21Z&se=2025-05-11T21%3A48%3A21Z&sp=r
Duration:
T00H17M33S
Embed URL:
https://stream.cadmore.media/player/eee8f9f7-9b21-4301-8030-7b5fda3b5cd8
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/eee8f9f7-9b21-4301-8030-7b5fda3b5cd8/allore_cut.mp3?sv=2019-02-02&sr=c&sig=ppKCHGlGzW5OxkkdJYHtG%2FYBr5diaeEZPeuD9Pm1Pbc%3D&st=2025-05-11T17%3A43%3A21Z&se=2025-05-11T19%3A48%3A21Z&sp=r
Upload Date:
2022-10-03T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
>> Hello, and welcome to JAMAevidence, our podcast series based on core issues in evidence-based medicine. Today we will be discussing using latent class analysis to identify hidden clinical phenotypes. I'm your host, Dr. Roger Lewis, Statistical Editor for JAMA and Co-editor of the JAMA Guide to Statistics and Methods series. I'm here with Professor Heather Allore and I'd like her to get a chance to introduce herself. >> Thank you very much. My name is Heather Allore and I am a Professor at the Yale School of Medicine in the Department of Internal Medicine, Geriatrics, and also a Professor of Biostatistics at the Yale School of Public Health.
>> Thank you, Dr. Allore, and welcome to our podcast. So let's start with a general introduction to the topic. In general terms, what is latent class analysis? >> A latent variable is an unobserved variable. We believe there's some sort of construct. We could think of maybe frailty in older adults, and we don't have an exact measure of frailty, but we think that there are elements that describe frailty that we might be able to observe, so we may use these observed variables, in a cross-sectional manner, to try and get at a construct that we might believe in, but we can't directly measure.
So 'latent' basically means unobservable, but we believe these variables exist. >> So the variable that we're thinking of is a variable that groups patients into categories. Is that fair? >> It can be categories, so some sort of trait. Or it could be longitudinal, if you want to ask how those patients might change over time. And there's really this assumption that there is a finite number of subpopulations in the group that we're studying.
>> And those subpopulations are what are called classes, is that correct? >> Yes, that's right. They're classes. You might use the word groups, but since it's called latent class analysis we can term them as classes. >> Great. Now you used in the article that you published an example from JAMA Cardiology in which this technique was used. Can you tell us a little bit about that example? >> That example used a group-based trajectory method or model and, in that, they wanted to look at how the different participants in this large study might change over time, which was twenty years in that particular study.
And to see if they either all followed the same pattern, so then there would really just be one class and a regular regression model could capture that, or if there were really subpopulations that may have different courses over time. And therefore we could help identify these classes, and that might give information for people in the future to be able to distinguish these classes, since their outcome was adverse changes in cardiac structure and ventricular function.
>> So, Dr. Allore, tell me a little bit about how the analysis process actually works. In other words, how does the process or the investigator determine the number of classes and in which class each participant is actually a member? >> That's a great question, and that's usually the first question everyone asks me. So let's think first if we had just a linear regression model, then that's one class. So we would basically first start by fitting a single class and this is especially a helpful technique if we have heterogeneity in our cohort or our population that we're studying.
And so, first, we'd fit one class and then we're going to use various fit statistics such as the Bayesian information criterion or the Akaike information criterion -- information that will help us decide how well that first model is fitting our data. Then we'll add another class. And possibly then we're going to start to put some curvature to see -- maybe it's not just a linear relationship. Maybe it has one curvature point, so it's quadratic, or maybe it actually goes up and down and has another bend.
So maybe it's actually a cubic effect. So we're adding classes and we're changing, basically is it just an intercept? Is it just a line? Is there a curve? Is it a curve that goes up and down so it's cubic? We're really trying to fit the data and we're using information criteria as well as some other fit statistics such as the proportions of the participants that have a probability of that group assignment.
We want that to be high. So we're using this sort of information as well as something that I can't overemphasize, which is that it needs to be not just mathematically guided. It needs to really have that biologic or clinical underpinning of what you're trying to study because we can overfit data and then that's not useful. So usually we require some minimum proportion of the participants in the smallest class so we're not overfitting, but part of this is a decision of, do these classes make clinical sense?
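The selection logic described here (compare candidate models on an information criterion, but require a minimum share of participants in the smallest class) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fit summaries and the 5% minimum class share are hypothetical values, not figures from the podcast or from any study, and the formulas are the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where lower is better.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical fit summaries for models with 1 to 4 classes.
# All numbers are made up for illustration only.
candidates = [
    {"classes": 1, "log_lik": -5200.0, "n_params": 3, "smallest_class": 1.00},
    {"classes": 2, "log_lik": -5050.0, "n_params": 7, "smallest_class": 0.35},
    {"classes": 3, "log_lik": -4990.0, "n_params": 11, "smallest_class": 0.12},
    {"classes": 4, "log_lik": -4975.0, "n_params": 15, "smallest_class": 0.02},
]
N_OBS = 1000            # hypothetical number of participants
MIN_CLASS_SHARE = 0.05  # assumed minimum share for the smallest class

def select_model(candidates, n_obs, min_share):
    """Pick the lowest-BIC model among those whose smallest class is large enough."""
    eligible = [c for c in candidates if c["smallest_class"] >= min_share]
    return min(eligible, key=lambda c: bic(c["log_lik"], c["n_params"], n_obs))

best_unconstrained = min(
    candidates, key=lambda c: bic(c["log_lik"], c["n_params"], N_OBS)
)
best = select_model(candidates, N_OBS, MIN_CLASS_SHARE)
print(best_unconstrained["classes"], best["classes"])
```

In this made-up example the four-class model has the lowest BIC, but its smallest class holds only 2% of participants, so the three-class model is chosen. The numerical rule is only a screen; whether the classes make clinical sense remains a judgment the code cannot make.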
And we don't want to have spurious classes or trajectories simply because we have a lot of data. So then, when we are happy with a model with -- let's just say there's three classes or trajectories to make this a simple example -- then we want to, again, be looking at the posterior probabilities for each group, so the probabilities after fitting. We want those, as a group, to generally be very high. That's why this is a probabilistic model.
Each person gets a probability of being on each trajectory. Now for some people, that probability may be incredibly close to zero for some trajectories, and for the one that they best fit it is very close to 0.99. That would be an ideal model, where you really have almost no probability of being on some trajectories and a very high probability of being on another. And so we want to have that group have a very high probability as a group, and then we look for individuals that basically may be following one curve, and sometimes something happens to them.
So they have a stroke or some sort of accident, and so they would maybe fall to another curve or they're no longer really part of a group because some unusual event happened to them. And so we want to see how often this is happening in the data to really see, is the model that we fit capturing the individuals that we're studying? So then each person gets these probabilities and then they're assigned based on their highest probability, and we consider it a poor fit for an individual if they have a probability of less than 0.7. And so those are the steps that we undertake in a nutshell.
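The assignment step described above can be sketched as follows. The posterior probabilities below are hypothetical values for five imaginary participants and three classes; only the rule of assigning each person to the class with the highest posterior probability and the 0.7 cutoff for flagging a poor individual fit come from the discussion. The function names are made up for this sketch.

```python
POOR_FIT_THRESHOLD = 0.7  # cutoff mentioned in the discussion

# Hypothetical posterior class probabilities (each row sums to 1).
posteriors = {
    "p1": [0.99, 0.01, 0.00],
    "p2": [0.05, 0.90, 0.05],
    "p3": [0.10, 0.15, 0.75],
    "p4": [0.45, 0.50, 0.05],
    "p5": [0.02, 0.08, 0.90],
}

def assign_classes(posteriors, threshold):
    """Assign each person to the most probable class and flag weak assignments."""
    result = {}
    for person, probs in posteriors.items():
        best = max(range(len(probs)), key=lambda k: probs[k])
        result[person] = {
            "class": best,
            "prob": probs[best],
            "poor_fit": probs[best] < threshold,
        }
    return result

def average_posterior_by_class(assignments):
    """Mean posterior probability among the people assigned to each class."""
    grouped = {}
    for info in assignments.values():
        grouped.setdefault(info["class"], []).append(info["prob"])
    return {k: sum(v) / len(v) for k, v in grouped.items()}

assignments = assign_classes(posteriors, POOR_FIT_THRESHOLD)
poor_fit = [p for p, info in assignments.items() if info["poor_fit"]]
class_means = average_posterior_by_class(assignments)
print(poor_fit, class_means)
```

Here person `p4` is assigned to the second class with a probability of only 0.50, so that person is flagged as a poor individual fit. The per-class means give the group-level view, where one would hope to see values close to 1.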
>> Dr. Allore, how would you describe the process for determining the number of classes when one is doing a cross-sectional study with latent class analysis? >> So this, of course, is driven by the hypothesis that the investigator is interested in. This approach is only appropriate when they really believe that there is some sort of heterogeneity in the data. We would typically start with two classes, taking what we're observing, our observed variables, and seeing how that falls into two classes.
And then we would look at some fit statistics, generally around information criteria, whether it's a Bayesian or Akaike or other criterion. And then we continue to add another class. We let the data estimate these, and once we start to see in that information criterion that the model overall, with a given number of classes, is no longer improving (and there are certain standards that we use to evaluate this), then we have arrived at our best-fitting model for that data set.
>> Great. Now I know that the work that you actually do focuses on the trajectories of patients over time, their clinical trajectories, and how those trajectories might differ. So how is the use of latent class analysis different when one is looking at trajectories over time as opposed to cross-sectional data on patients? >> That's very exciting for a lot of people, when they're really saying patients don't seem to have these uniform trajectories of how they're doing, their response to therapies, how they're recovering after a surgery.
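For the cross-sectional case discussed above, the class-enumeration loop (fit models with an increasing number of classes and stop where the information criterion no longer improves) can be sketched with a small expectation-maximization routine for binary observed variables. Everything here is an assumption made for illustration: the simulated data set with two true classes, the function names, the fixed number of iterations, and the single random start. Real analyses would use dedicated software, several starting values, and convergence checks. A one-class model is included only as a baseline.

```python
import math
import random

def e_step(data, weights, probs):
    """Return each person's posterior class probabilities and the log-likelihood."""
    posteriors, log_lik = [], 0.0
    for row in data:
        joint = []
        for w, item_probs in zip(weights, probs):
            p = w
            for y, q in zip(row, item_probs):
                p *= q if y == 1 else 1.0 - q
            joint.append(p)
        total = sum(joint)
        log_lik += math.log(total)
        posteriors.append([p / total for p in joint])
    return posteriors, log_lik

def fit_lca(data, n_classes, n_iter=100, seed=0):
    """Fit a latent class model for binary items by EM; return (log_lik, n_params)."""
    rng = random.Random(seed)
    n, m = len(data), len(data[0])
    weights = [1.0 / n_classes] * n_classes
    # Random starting item probabilities, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(n_classes)]
    for _ in range(n_iter):
        posteriors, _ll = e_step(data, weights, probs)
        for k in range(n_classes):
            nk = max(sum(post[k] for post in posteriors), 1e-12)
            weights[k] = nk / n
            for j in range(m):
                num = sum(post[k] * row[j] for post, row in zip(posteriors, data))
                probs[k][j] = min(max(num / nk, 1e-6), 1.0 - 1e-6)
    _, log_lik = e_step(data, weights, probs)
    n_params = (n_classes - 1) + n_classes * m  # class shares + item probabilities
    return log_lik, n_params

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

def simulate(n_per_class=200, n_items=5, seed=1):
    """Simulated data: two classes with item probabilities 0.9 and 0.1."""
    rng = random.Random(seed)
    data = []
    for item_prob in (0.9, 0.1):
        for _ in range(n_per_class):
            data.append([1 if rng.random() < item_prob else 0 for _ in range(n_items)])
    return data

data = simulate()
bic_by_k = {}
for k in (1, 2, 3):
    log_lik, n_params = fit_lca(data, k)
    bic_by_k[k] = bic(log_lik, n_params, len(data))
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

Because the simulated data contain two well-separated classes, the two-class model should give the lowest BIC; adding a third class adds parameters without a matching gain in fit.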
And that's why they really can be very interested in the longitudinal or the trajectory element of this approach. So they generally also, just like in the cross-sectional approach, they think that their patients or their participants in their study don't all follow the same path over time. And so we use a very similar approach where we start, in this case, with one class.
Do they all just follow one path, as a regression would tell us? And then we continue to add classes using the same information criteria that I just spoke about, and then we, again, look using the same sort of criteria to determine when we've hit our best-fitting model. So now we are basically doing the same thing, but we're doing it over time. >> It's obvious that one of the uses of this technique is to help us understand groupings of patients who share characteristics that wouldn't have otherwise been obvious to us.
But what are some of the limitations of the approach? Or what are some of the things that our readers ought to be aware of when they're reading a study that has used latent class analysis? >> Every statistical model has limitations and every method does. I'm afraid latent class analysis does not get to escape without these. These are probabilistic versus deterministic models. Right? So every person in that study, in that data set, they're getting a probability that they are part of a class.
They get to have very small probabilities of some classes, so you really have to look at your data and see, is it not only fitting the groups, but is it fitting the individuals? And so that is really the first thing I would tell everyone to do, to make sure it's still making sense, that it's following the clinical and biologic and social underpinnings. And, like everything in this case, selecting that number of classes is very important.
So you could end up overfitting the data, or underfitting the data, and that's why we use that approach of really trying to figure out how much information there is and not overfitting. Then you have to consider, also, if you have a very homogeneous cohort -- so they all had the same conditions, maybe they all had the same therapeutic interventions -- that what you're estimating, what your results are may only actually apply to very similar groups, just as in any sort of statistical model.
And if you looked at, maybe, a nationally representative group of people, you might have other latent classes result. And, of course, if you're looking over time, you always have to be very mindful of missing data. Is your cohort followed up over time? There are methods with which you can simultaneously model the missingness, whether from dropping out or deaths, if that's important to the study as well.
>> What would you expect to happen if latent class analysis was applied first to one study evaluating a particular patient population, and then applied separately to a much larger study that seemingly was studying the same population, but had just a much larger number of observations? >> If they essentially arose from the same population, so let's say you had all the people in one state and your first study selected a thousand of them.
And then your second study was really looking at all the other people except those thousand individuals, and that first group arose and was sampled in a way that made it really representative. Then I would expect that those classes that emerge should be representative of the larger study or group of people that arose from the same population. >> Now let's come back to the question about how this approach was used in the example that you cited in your article.
Can you tell us a little bit about how this was applied and what the authors actually found? >> Well, the authors used this longitudinal approach, a group-based trajectory model, and they wanted to categorize the participants of their large study that was conducted over twenty years. So they measured the urine albumin-to-creatinine ratio over twenty years at years five, fifteen, twenty, twenty-five, and thirty. So an important part here is they were each five years apart.
Okay? So we're getting equal intervals over time. And they wanted to find out, what were the trajectories? How did those evolve over time? And then each person would be assigned to one of those trajectories. Now you could characterize people, so not only are they getting assigned a trajectory, but you can start to characterize other elements, sociodemographic, or clinical factors. What's the likelihood of being on a trajectory?
And then they associated those trajectories with adverse changes in the cardiac structure and ventricular function. This was an approach with which they could ask -- is there more than one group? How many groups? And they found five classes or five trajectories. And does everybody have an equal chance of these adverse events, or are some of these groups at higher risk?
And if we can then start to see that these urine albumin-to-creatinine ratios are on a certain trajectory, then maybe we can look at people in the future and find those at greatest risk to try and intervene earlier in their life. >> Dr. Allore, how do you handle the problem of missing data when you're doing analyses of trajectories over a long period of time, and patients may be lost to follow-up? >> We know that chronic conditions may increase in severity. We know that accidents can happen to people. We know that, as they're aging, their likelihood of death and dropping out increases.
And so what I recommend is that people jointly model the missingness, whether it's through death or drop-out, with the trajectories. And so then you can look at those simultaneously, and they're co-estimated, as well as characterize each of those trajectories so that you can convince yourself and the readers that there is not a healthy survivor effect.
And also it may reveal to you if there's a trajectory or a class that has particularly poor outcomes. Those are individuals that may benefit most from future care. >> This is Roger Lewis and I'd like to thank our guest today, Professor Heather Allore at Yale. More information about this topic is available at the JAMA Guide to Statistics and Methods series which is published in JAMA and on our website, jamaevidence.com. This episode was produced by Jesse McQuarters and Shelly Stephens at the JAMA Network.
The audio team here also includes Daniel Morrow, Lisa Hardin, Audrey Forman, and Mary Lynn Ferkaluk. Dr. Robert Golub is the JAMA Executive Deputy Editor. To follow this and other JAMA Network podcasts, please visit us online at jamanetworkaudio.com. Thanks for listening. [ Music ]