Name:
Yun Liu, PhD, discusses how to read an article that uses machine learning.
Description:
Yun Liu, PhD, discusses how to read an article that uses machine learning.
Thumbnail URL:
https://cadmoremediastorage.blob.core.windows.net/0461ea54-d3fe-465a-8ba8-7432baf5b4aa/thumbnails/0461ea54-d3fe-465a-8ba8-7432baf5b4aa.jpg?sv=2019-02-02&sr=c&sig=arYFLV2OLgokvrfEzNADD7K9ipFz2FLXhtEjdua4QU0%3D&st=2024-09-08T23%3A26%3A53Z&se=2024-09-09T03%3A31%3A53Z&sp=r
Duration:
T00H25M29S
Embed URL:
https://stream.cadmore.media/player/0461ea54-d3fe-465a-8ba8-7432baf5b4aa
Content URL:
https://cadmoreoriginalmedia.blob.core.windows.net/0461ea54-d3fe-465a-8ba8-7432baf5b4aa/18737009.mp3?sv=2019-02-02&sr=c&sig=49CHkrF5srrZEfyuugfzHPqzDZm9jATjwhc%2BV6PbTIk%3D&st=2024-09-08T23%3A26%3A53Z&se=2024-09-09T01%3A31%3A53Z&sp=r
Upload Date:
2022-02-28T00:00:00.0000000
Transcript:
Language: EN.
Segment:0 .
>> Hello, I'm Gordon Guyatt. I'm going to be conducting a discussion today with Professor Liu about machine learning, and in particular about our users' guide to the medical literature addressing machine learning for diagnosis and what clinicians need to know to understand articles that present information about machine learning and diagnosis. Welcome, Professor Liu.
>> Thank you, Dr. Guyatt. It's great to be here. >> Great to have you. So I'll start off. I'm sure most of our audience has a notion of what machine learning is, but it may not be -- and this includes me -- perfectly accurate. Can you tell us what exactly is machine learning? And in particular, how does it work with respect to diagnostic testing? >> That's a great question.
First of all, there are three terms that have recently become very common in the literature and tend to be used a little bit interchangeably. The three terms are "artificial intelligence," "machine learning," and "deep learning." Artificial intelligence is a way to create tools that can perform what we think of as intelligent tasks. Machine learning is one way to create such a tool, by having an algorithm learn from data instead of being explicitly programmed to do something.
Deep learning is a kind of machine learning algorithm that learns from data using many layers of computation. All of these tools have started to be used for diagnostics, very commonly using medical images, but also in other fields. And the way they do so is by processing complex input data and providing an output that is some kind of interpretation of the input data. >> So it sounded like one of these terms was a subcategory of the other, which was a subcategory of the third. Or can we think of them all as synonymous?
>> The terms are indeed successive subcategories of each other, where deep learning is a subcategory of machine learning, and machine learning is a subcategory of artificial intelligence. For our purposes, we can think of them as being synonymous. >> How long has machine learning been around? >> Oh, gosh, it's been around for a really long time. What has made machine learning, and in particular deep learning, so much more popular recently is the fact that we have much more powerful computers nowadays and a lot more data than we used to, especially data that's stored in a digital format.
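[Editor's note: a minimal sketch, in Python with made-up numbers, of the distinction Dr. Liu draws above between explicit programming and learning from data. The features, labels, and model choice are illustrative assumptions, not anything from the studies discussed here.]

# Toy contrast between an explicitly programmed rule and a model learned from data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical inputs: two image-derived features per patient, and a label
# indicating referable disease (1) or not (0). Purely simulated.
X = rng.normal(size=(200, 2))
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# "Explicit programming": a threshold a person wrote down by hand.
def hand_coded_rule(features):
    return int(features[0] > 0.5)

# "Machine learning": the same kind of decision, but the parameters
# (coefficients) are estimated from the labelled examples.
model = LogisticRegression().fit(X, y)

print("hand-coded rule on one case:", hand_coded_rule(X[0]))
print("fitted model on one case:", int(model.predict(X[:1])[0]))
print("fitted coefficients:", model.coef_)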
>> It's been around for a long time, but the factors you mentioned now make it feasible to do things like use it for diagnostic testing, which were not feasible before we had all the data and all the powerful computation. Have I got it right? >> Yes, that's exactly right. >> Okay, great. So tell me, what exactly are the domains of diagnostic testing to which deep learning applies?
Radiology for sure. Anything beyond or other than radiology? >> Radiology, definitely, I think, was one of the earliest areas that machine learning and deep learning were applied to, in particular because radiology is very much an imaging-centric specialty, and it was also an early one to become digital. And so the digitization of films and other imaging studies in radiology really helped drive computer algorithms to analyze these images.
Nowadays, I think a few very popular applications of deep learning are in ophthalmology where fundus photography as well as OCT images are popular targets for application of these algorithms. Another domain has been in pathology, where we are gradually starting to see more and more places start to adopt digital pathology, which is where glass slides are digitized by high-resolution scanners. >> In each of these areas, is the idea to replace the human look at the eye or the slide or the X-ray, or somehow to complement it?
>> Yes, I think definitely the goal is always to complement the existing workflow. How that's done, I think, depends on the exact workflow and on what will work best to make that workflow more efficient or higher quality. So for example, one could imagine that, without any human intervention, an algorithm first helps to screen incoming images for ones that may have urgent findings for prioritized review.
So in some sense, that's working with the doctor but before the doctor actually sees the image. Another example is that, while the doctor is reviewing the image, the algorithm helps to provide some hints or to highlight certain things that may be easily missed. One last version is to apply the algorithm after the doctor has reviewed the image, essentially as an over-read, the way radiology tends to do this.
>> So it then would be a check to say, "Please have another look"? Is that the idea? >> That's correct. It can be used in many different ways, just depending on what the workflow requires. >> Although the scenario in your article actually has a fourth, which is kind of a replacement, where you don't have enough clinicians to do the readings, and so might use it as a replacement, or did I misunderstand? >> That's a great question. So in this particular case, the scenario highlights a situation where there are not enough graders to help review images to detect diabetic eye disease.
So in this scenario, because there weren't graders available to grade the images, the algorithm is filling that role. In reality, what tends to happen is that, when images are flagged for positive findings, they are then referred to an ophthalmologist, who follows up with the patient and actually examines the patient's eyes. So in some sense, the first step has been replaced by using an algorithm to process images more quickly, and that helps to save time so that the ophthalmologist doesn't need to look at as many images.
>> When you are using it, as you've described, you're really using it as a screening test. Would this be accurate? >> That's correct. In this setting, it is essentially a screening test. >> In which case, you'd be ready to, at least to some extent, sacrifice specificity for sensitivity. You want your sensitivity to be as close to 100% as possible, and this reduction in specificity would just mean more images would have to be checked by the human reviewer.
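[Editor's note: a minimal sketch, in Python with simulated scores and labels, of the screening trade-off Dr. Guyatt describes here: lowering the operating threshold raises sensitivity at the cost of specificity, so more negative images get passed on for human review. The prevalence, score distributions, and thresholds are all invented for illustration.]

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical screening population: 1,000 patients, about 10% with referable
# disease, and a continuous score produced by a machine learning model.
y_true = rng.random(1000) < 0.10
scores = np.where(y_true, rng.normal(0.7, 0.15, 1000), rng.normal(0.3, 0.15, 1000))

def sens_spec(threshold):
    y_pred = scores >= threshold
    sensitivity = (y_pred & y_true).sum() / y_true.sum()        # detected cases / all cases
    specificity = (~y_pred & ~y_true).sum() / (~y_true).sum()   # correctly cleared / all non-cases
    return sensitivity, specificity

# Lowering the threshold raises sensitivity (fewer missed cases) at the cost of
# specificity, which means more images sent on for human review.
for threshold in (0.6, 0.5, 0.4, 0.3):
    sens, spec = sens_spec(threshold)
    print(f"threshold {threshold:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")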
>> That's correct. In the screening setting, high sensitivity is more important, to catch all the potential cases for human review. >> Great. You start by saying machine learning diagnostic methods require studies similar to those for other diagnostic tests. What are those standards? >> Yes. In this particular case, if you use machine learning to create a diagnostic test, then it is indeed just like any other diagnostic test and has to be evaluated in a similar way.
Amongst these evaluations are, A, are the results of the diagnostic test valid; B, what were the results; and C, will having these results help me in caring for my patients? So all three elements of this have to be evaluated as per any other diagnostic test. >> And what about the first one, are the results of the diagnostic test valid? What should clinicians be looking for there? >> Regarding whether the test results are valid, the primary questions there are first to understand what the reference standard was and whether it was independent and blinded with respect to what the diagnostic test was.
So this ensures that the so-called ground truth that the machine learning diagnostic test is being evaluated against is both independent and unbiased. The second part of this evaluation is to ask, did the sample contain enough diversity across the full spectrum of patients that you are interested in? This is to ensure that the patient population you are interested in is represented in the particular study you're reading, and that the study sample is representative of that population.
The third element is whether there was a completely independent validation set. This is key because, in machine learning, the dataset tends to be split in a few different ways to evaluate the machine learning model. And so one has to be careful to understand which set is the truly independent validation set to draw conclusions from. >> It seems to me that the criterion of having the test independent of the gold or reference standard is very easily met with machine learning.
What about the gold standard and having an acceptable or adequate gold standard? What is typically done when you look at machine learning as a diagnostic test in terms of the gold standard? >> Yeah, that's a great question, and it's quite nuanced because this varies depending on the specific clinical scenario that we are looking at. For diabetic eye disease, for diabetic retinopathy specifically, what tends to be done is that the reference standard is based on expert interpretation of the same images.
For certain aspects of diabetic eye disease, one has to get secondary diagnostic imaging -- for example, OCT imaging -- to better understand whether there is diabetic macular edema. But generally speaking, it's based on expert interpretation. >> You describe validation. If I've understood it correctly, validation means taking a completely independent data set -- your first data set looks great; sensitivity 98%, specificity 98%.
You then take a completely independent dataset -- for instance, with the eye, a completely different set of images -- and apply the machine learning approach and see if you get those similar excellent results. Do I understand correctly? >> That's correct. And here, there are different ways of getting new independent datasets. One way is what we would call a split sample, or random splits, where we started off with one cohort of patients and they were split into two independent subsets.
The machine learning diagnostic test was developed using one of the subsets and then tested on the second subset. Now, statistically, because these were random splits, both subsets come from the same underlying population, so one would expect that the performance of the machine learning test on the first subset will be quite similar to that on the second subset. So this on its own is insufficient. In many cases, we want a totally external validation set, and one can get that by approaching a different institution, a different country, or a different continent entirely.
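[Editor's note: a minimal sketch, in Python with simulated cohorts, of the two kinds of evaluation contrasted above: a random split of a single cohort versus a fully external cohort from a different (here, simulated) site. The features, model, and population shift are illustrative assumptions only.]

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def make_cohort(n, shift=0.0):
    # Simulate one cohort of patients; `shift` mimics a different population.
    X = rng.normal(loc=shift, size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n) > shift).astype(int)
    return X, y

# One cohort, randomly split into a development subset and a held-out subset.
X, y = make_cohort(2000)
X_dev, X_split, y_dev, y_split = train_test_split(X, y, test_size=0.3, random_state=0)

# A second, fully external cohort from a (simulated) different site.
X_ext, y_ext = make_cohort(1000, shift=0.8)

model = RandomForestClassifier(random_state=0).fit(X_dev, y_dev)

# Performance on the random split usually tracks development performance closely,
# because both halves come from the same population; the external cohort is the
# more demanding test of generalization.
print("random-split AUC:", roc_auc_score(y_split, model.predict_proba(X_split)[:, 1]))
print("external AUC:    ", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))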
>> Great. So you go to another place with a different situation and see if it works there. Is one enough? Is one of these independent validations enough? And if not, how many are required? >> Yeah, that's a very interesting question, and I don't think there's a single right answer here. The first part is easy. One is likely not enough, mostly because, with one validation data set, you would know only that it works well on that particular set; it's unclear whether it generalizes to yet another dataset.
So having two or more is highly preferable. How many is enough is a really challenging question. I think this is somewhat subjective and depends on the original datasets the test was evaluated on. So for example, if the original datasets -- let's say there were three of them -- came from three different continents with very different patient populations, one might expect that the test will generalize fairly well to new populations because it has already worked well on several different populations.
>> So the variability in clinical setting is a determinant of how many datasets you need. And what strikes me is, if I'm thinking of applying it in my setting, I want a validation set that was somehow similar to my setting. Would that be true? >> Yes, that's exactly right. If you were, for example, looking at the diabetic retinopathy diagnostic test, then whether it was evaluated in a setting similar to the one you are thinking of applying it to will be very important.
For example, if you were thinking of applying it in a primary care clinic, then it will be important that the patient population is reflective of such a spectrum. >> Okay. That's very clear. That's great. Now, we started out by saying that previous users' guides for diagnostic tests have certain criteria, and you went through them and said that they needed to be met here. But having been involved in the production of the original users' guides for diagnostic tests, I can tell you they did not require the sort of validation and additional samples that you are suggesting for machine learning.
What do you think? Were the previous guides lacking, or is there some reason that machine learning needs to be validated but other tests -- laboratory, clinician-read imaging -- do not? >> So I think that the answer is kind of in the middle in that machine learning tests do indeed need to be validated more carefully to some extent. The main reason is that they are very complex. And so machine learning algorithms can learn to identify patterns that we do not expect.
And so, for example, if we consider a risk score that is being used to predict whether a person will develop a stroke, then one might expect that age, sex, hypertension, and a few other factors are correlated with the outcome. Now, a machine learning model that is looking at all of the data might learn correlates within the data that we do not expect. So the reason we really want machine learning algorithms to be validated carefully is to ensure that they have not learned such confounding factors as actual predictors.
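[Editor's note: a minimal sketch, in Python with simulated tabular data, of one way a reader or developer might probe for this kind of shortcut learning; the "marker" variable plays the role of a suspected confounder. Nothing here reproduces the stroke or dermatology examples discussed; the data, model, and numbers are assumptions for illustration.]

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 3000

# Simulated data: the outcome depends on a true risk factor, but a ruler-like
# marker was recorded mostly for the cases clinicians already suspected.
true_risk = rng.normal(size=n)
outcome = (true_risk + rng.normal(scale=1.0, size=n) > 1.0).astype(int)
marker = (rng.random(n) < np.where(outcome == 1, 0.8, 0.1)).astype(float)

X = np.column_stack([true_risk, marker])
model_full = LogisticRegression().fit(X, outcome)
model_marker_only = LogisticRegression().fit(marker.reshape(-1, 1), outcome)

# If a model built on the suspected confounder alone performs almost as well as
# the full model, much of the "signal" may be the confounder. (A proper
# train/test split is omitted here to keep the sketch short.)
print("full model AUC:   ", roc_auc_score(outcome, model_full.predict_proba(X)[:, 1]))
print("marker-only AUC:  ", roc_auc_score(outcome, model_marker_only.predict_proba(marker.reshape(-1, 1))[:, 1]))

# Stratified check: within patients who all lack the marker, does the full model
# still separate outcomes? A large drop suggests reliance on the confounder.
no_marker = marker == 0
print("full model AUC, no-marker subgroup:",
      roc_auc_score(outcome[no_marker], model_full.predict_proba(X[no_marker])[:, 1]))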
>> Can you give us an example of where such an unexpected finding happened? >> There were several published studies. I'll give two examples. The first example is that, when we were applying machine learning to better understand color fundus photography, what we found was that the machine learning model could actually predict the age of the person whose fundus was photographed to within about three years, on average.
This sort of accuracy was not known to be possible at all from a photograph, and so it was important that we got this validated. In fact, we actually had external validation for this post publication by a different group who evaluated our predictions against ground truth that only they had access to. So essentially, this was an external, blinded validation that gave us the same performance that we observed in our own datasets and, thus, validated the findings.
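[Editor's note: "to within about three years, on average" reads like a mean absolute error; below is a minimal sketch, in Python with simulated predicted and ground-truth ages, of how such a figure is computed when an external group holds the ground truth. The numbers are invented and do not reproduce the published study's results.]

import numpy as np

rng = np.random.default_rng(4)

# Simulated ground-truth ages (held by the independent group) and simulated
# model predictions.
true_age = rng.uniform(40, 80, size=500)
predicted_age = true_age + rng.normal(scale=3.5, size=500)

mean_absolute_error = np.mean(np.abs(predicted_age - true_age))
print(f"mean absolute error: {mean_absolute_error:.1f} years")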
>> By looking at the fundus, you could establish the patient's age. Am I understanding correctly? >> That is correct. It was quite an unexpected finding. >> No kidding. In retrospect, is there an explanation? Obviously, the fundus must be changing as you get older in some very specific ways. >> So here's what's interesting. Ophthalmologists do know that younger subjects have what they call a "sheen of youth" that is correlated with actual physiological changes within the eye, and that sheen of youth tends to disappear with age.
What was not known was that you could actually quantify the age of the subject via the image itself. Now, we always try to understand what the model is doing with the images. However, it's quite challenging. In this particular case, we did isolate it down to changes in the vessel and the optic disc and other areas of the fundus photograph that the machine learning model is looking at. However, it's really hard for us to precisely explain exactly what is predictive and then have people reproduce that.
That last part is one of the challenges that we are actively trying to solve. >> Very interesting, although slightly depressing to know that, as I age, my fundus, in addition to the rest of me, is getting old. What is the -- You were going to give me a second example. Sorry to have interrupted you. Apologies. What's the second example? >> The second example was in some other studies where they were looking at what the machine learning model was using to make its prediction.
In this case, it was a study about using images of the skin to predict whether a lesion was actually cancer. What they found was that the machine learning model was actually using the ruler that was in the image as part of its prediction. This was a little alarming because the ruler had been placed there by the physician, who was trying to measure the lesion to understand whether it might be malignant. And so the aspect of the image that the machine learning model had learned was not one that the physicians wanted it to be learning.
>> Did the placement of the ruler really predict? I mean, did the physicians somehow subconsciously put the ruler down in a way that distinguished between the malignant and nonmalignant? >> That's a great question. In this case, I think what the authors believe is that it is the presence of the ruler that the machine learning model had learned to associate with the potential malignancy. >> So how did the folks doing the machine learning and trying to develop their machine learning model, how did they handle the situation?
>> So I think in this particular case, the analysis was done after the study was complete, and so it was an interesting finding about a completed study. In principle, the right way to set this up would be to be more careful about the selection of the images included in the study. So, for example, if one had the option of doing prospective data collection to develop a machine learning model, one might request that the images of the lesions be taken before any sort of intervention, including measurement, was performed.
In that way, one can ensure that the images that later prove malignant and those that do not will not have any overt differences apart from the lesions themselves. >> You seem to say, at the end of the scenario in your article, that you would only use the machine learning test if it had been validated in one's own setting. Am I correct in reading it that way? And if so, what does that imply? >> In this particular article, we were trying to emphasize that it is important that the clinical scenario in which the diagnostic test was validated matches the one where one is planning to use the test.
So the specific scenario, though, can mean many different things. So one aspect of it might be whether it's an urban setting or a rural setting. Another aspect might be the specific devices that are being used, whether it's a portable camera versus a stationary, larger device. And lastly, the patient population in terms of either patient demographics or in terms of coexisting morbidities and other conditions.
So as a quick example, consider that a general diabetic eye disease screening population will be quite different from the UK Biobank's more general study population, which comprises not only people with diabetes. >> So in the scenario, was the conclusion that something didn't match with respect to those factors? >> Yes. So I think in this specific scenario, it was possible that this particular reader's scenario was not a perfect match for either the French dataset or the EyePACS dataset from the original study.
>> One needs a perfect match? >> That's a great question. I don't think one always needs a perfect match, but one needs to have reasonable confidence that it might work. So that will involve looking at the totality of evidence, where, for example, if it works on three different scenarios and there's some overlap in the scenarios with the target scenario that one wants to apply the algorithm to, then it is likely that it will work. >> In this particular one, there was enough skepticism to make one hesitate.
Is that right? >> I believe at the time that the article was written, there may not have been enough validation studies for this specific algorithm. Since then, I believe there's been several more validation studies on a broader array of patients. >> In the current situation, your administrator might now say, "There's been enough validation and enough settings close enough to mine that we can go ahead," correct? >> That's correct. >> Okay, great.
If you would like the audience to remember somewhere between one and three points -- I'll leave it for you to choose the number -- what would they be? >> The first and most important point here is to understand that machine learning development happens differently from more traditional approaches in the way the data are split: there is one dataset used to develop the model and a second dataset that is used to tune the model.
Tuning the model refers to setting specific hyperparameters of the model. The hyperparameters are certain settings that affect what the machine learning model learns. And so tuning of the hyperparameters has a large effect on what the model learns and on its final accuracy. What this means is that, when one looks at these studies and at the evaluation dataset, one needs to understand whether the evaluation dataset was independent of both the training dataset and the tuning dataset.
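[Editor's note: a minimal sketch, in Python with simulated data, of the three-way split Dr. Liu describes: one set to fit the model, a second to tune a hyperparameter, and a fully held-out evaluation set touched only once at the end. The data, model, and hyperparameter grid are assumptions for illustration.]

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=3000) > 0).astype(int)

# Split once into development vs. evaluation, then split development into
# training vs. tuning. The evaluation set is touched exactly once, at the end.
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# "Tuning" here means choosing a hyperparameter (tree depth) by performance on
# the tuning set, never on the evaluation set.
best_depth, best_auc = None, -np.inf
for depth in (2, 4, 8, None):
    candidate = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_tune, candidate.predict_proba(X_tune)[:, 1])
    if auc > best_auc:
        best_depth, best_auc = depth, auc

# The honest performance estimate comes from the untouched evaluation set.
final_model = RandomForestClassifier(max_depth=best_depth, random_state=0).fit(X_dev, y_dev)
print("chosen max_depth:", best_depth)
print("evaluation AUC:  ", roc_auc_score(y_eval, final_model.predict_proba(X_eval)[:, 1]))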
>> Okay. Well, thanks very much. That was extremely informative. I learned a great deal. Professor Liu, thanks very much. It's been great having you with us for this podcast. >> Dr. Guyatt, it was a pleasure to be on. Thank you so much for the questions, and I look forward to chatting more. >> Thanks for listening. For more podcasts, visit us at jamanetworkaudio.com. You can subscribe to our podcast wherever you get your podcasts.
[ Music ]