PI-CAI Challenge: AI vs Radiologists in Prostate Cancer MRI Detection, Journal Club - Rashid Sayyid & Zachary Klaassen
July 9, 2024
Rashid Sayyid and Zach Klaassen discuss the PI-CAI Challenge, an international study comparing artificial intelligence (AI) to radiologists in prostate cancer detection using MRI. The study develops and validates an AI system using a large dataset of prostate MRIs, comparing its performance to both study radiologists and real-world radiology readings. The AI system demonstrates superiority to study radiologists in detecting clinically significant prostate cancer and non-inferiority to routine radiology practice. Key findings include the AI system's ability to detect 6.8% more clinically significant cancers at the same specificity and to generate 50.4% fewer false positives at the same sensitivity, compared with radiologists. The study highlights the potential of AI as a supportive tool in prostate cancer diagnosis, while acknowledging limitations such as retrospective data curation and the need for prospective validation.
Biographies:
Rashid Sayyid, MD, MSc, Urologic Oncology Fellow, Division of Urology, University of Toronto, Toronto, ON
Zachary Klaassen, MD, MSc, Urologic Oncologist, Assistant Professor Surgery/Urology at the Medical College of Georgia at Augusta University, Wellstar MCG Health, Georgia Cancer Center, Augusta, GA
Related Content:
Artificial intelligence and radiologists in prostate cancer detection on MRI (PI-CAI): an international, paired, non-inferiority, confirmatory study
EAU 2023: Discussant: Artificial Intelligence and Radiologists at Prostate Cancer Detection on MRI: Preliminary results from the PI-CAI Challenge
Read the Full Video Transcript
Rashid Sayyid: Hello everyone, and thank you for joining us today in this UroToday Journal Club recording, where we'll be discussing the recently published paper, the PI-CAI Challenge, which looks at artificial intelligence and radiologists in prostate cancer detection on MRI and was an international, paired, non-inferiority, confirmatory study. I'm Rashid Sayyid, a urologic oncology fellow at the University of Toronto, and I'm joined today by Zach Klaassen, Associate Professor and Program Director at Wellstar MCG Health.
This paper, a very important paper, was recently published in The Lancet Oncology with Dr. Anindo Saha as the first author. We've seen over the last decade or so that pre-biopsy MRI has emerged and is now endorsed by numerous international guidelines. This is for a number of reasons, including improved detection of clinically significant prostate cancer when it's incorporated in the pre-biopsy setting. It reduces the diagnosis of grade group one disease and can also potentially reduce the number of unnecessary biopsies. But there are issues that arise with the adoption of MRI, and one of them is that it's labor-intensive. Not only do we need more machines available and more personnel to run them, but the more scans that are done, the more that need to be read as well. From a radiologist's standpoint, that significantly increases the workload. There's also the issue of high inter-reader variability. These are some of the issues that we encounter and that we need to address with novel strategies.
One potential strategy is incorporating AI models, which have been shown to match expert clinicians in medical image analysis across numerous specialties, particularly in prostate and breast cancer. AI-assisted image interpretation can potentially address this rising demand in medical imaging worldwide. Before they're adopted, however, the efficacy of these AI models needs to be tested in order to allow for wide-scale adoption in the prostate cancer diagnostic space.
To this end, the authors and study investigators hypothesized that state-of-the-art AI models trained using thousands of patient exams could potentially be non-inferior to radiologists for detecting clinically significant prostate cancer on MRI, which remains the ultimate goal. They designed the Prostate Imaging: Cancer AI (PI-CAI) challenge, whereby they did a pretty comprehensive job of developing, training, and then externally validating an AI system for detecting clinically significant prostate cancer using a large international multi-center cohort. They then compared the performance of this AI model to, first, radiologists participating in the study and, second, the radiology readings from the actual radiologists who read the images when they were performed in the setting of multidisciplinary routine practice. We'll talk about this further in the methods section. I just want to highlight that although the methodology here is quite intense and dense, it's important to go over it in order to understand how this model works and how we as clinicians can potentially incorporate it into our practice in the future.
This was an international, paired, non-inferiority, confirmatory study. Essentially, it is making sure this AI model, at the very least, is as good as a radiologist's reading. As a first step, algorithm developers designed AI models using a sample of about 10,000 cases from 9,000 patients, with these images collected from four European tertiary care centers over a decade between 2012 and 2021. We'll talk about this further, but amongst the hundreds of models submitted, the top five performing models were selected and then combined into one algorithm.
At the same time, in parallel to these algorithms being developed, 62 radiologists were invited to participate in a multi-reader, multi-case observer study. This is the study group of radiologists, which is different from the real-world radiologists who read the images at the time they were performed. We'll talk about this distinction later. These algorithm developers and radiologists were invited to participate through referrals, outreach programs, conference presentations, and, most importantly, an open call on the grand-challenge.org platform.
In terms of the patient inclusion criteria, these are the patients for whom the images were performed, and these were the selection criteria that were applied. All patients who underwent imaging were adults, with a median age of 66, who had a high suspicion of prostate cancer, meaning either an abnormal rectal exam or a PSA of three or higher. These patients were allowed to have had prior biopsies performed, but they could not have had prior treatment for prostate cancer or a known history of grade group two or higher disease. It's important, obviously, that all these patients who were selected for inclusion had MRI images available with complete reporting and with high image quality.
In terms of the MRIs, these were performed on commercially available 1.5 or 3 Tesla scanners. All these images, when they were performed as clinically indicated in routine practice, were read by at least one of 18 radiologists from the participating centers. These were quite experienced radiologists who had anywhere between one and 21 years of experience reading prostate MRIs, and they reported their findings using the PI-RADS classification.
It's important to note this is the real-world radiology cohort and so as you would expect, the patient history from the charts, as well as peer consultations, a second set of eyes or a third set of eyes, were available to aid in the diagnosis. Patients with positive MRIs defined as PI-RADS three or higher underwent biopsies, and the targeted number of cores was two to four per lesion. And amongst those who had a negative MRI, meaning either no lesion at all or PI-RADS one to two, they either had no biopsy performed or only a systematic biopsy performed with six to 16 cores. This essentially was based on clinician preference.
In this study, clinically significant prostate cancer, the ultimate outcome, was defined as grade group two to five disease. And how was this defined, using the biopsy or the radical prostatectomy? Well, if patients underwent a radical prostatectomy, the study investigators used the whole-mount specimen to assign the grade; otherwise, the biopsy specimen was used. And then in patients who had a negative MRI, in order to ensure that they truly had negative disease, a minimum follow-up period of three years was applied to confirm the absence of clinically significant prostate cancer. Again, this speaks to the fidelity and validity of this study, ensuring that negative is truly negative. A very important detail to be aware of in this study.
Now, let's just take a step back and talk about how the AI system was developed. We're not going to go into the specific details. We understand this is quite challenging and very intense from a methodology standpoint, but essentially the first step of the study was to invite AI algorithm developers to join the study. The way this was done was through the PI-CAI challenge being hosted on the grand-challenge.org platform. It's interesting to note that this challenge will be continuously hosted until May 2027, so this gives the study investigators a chance to further optimize their model. Based on the information on the website, AI developers worldwide could opt in and download an annotated public data set of about 1,500 MRI cases. Their AI models were trained to detect clinically significant prostate cancer using bi-parametric MRI, which is an important detail.
These AI models really had to complete two tasks. First, they had to localize lesions and classify each one in terms of its likelihood of harboring clinically significant cancer on a scale from zero to 100, and then classify the overall case using the same zero to 100 likelihood score, so two different tasks. It's important to know that, in addition to the imaging data, the bi-parametric MRI, the models could also use several metadata that were made available to these algorithm developers: age, PSA level, prostate volume, and the MRI scanner name.
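To make the two tasks concrete, here is a minimal Python sketch of the expected output format, assuming, purely for illustration, that a model returns a list of candidate lesions with 0 to 100 likelihoods and that the case-level score is taken as the maximum lesion likelihood. The data structures and the aggregation rule are assumptions; individual PI-CAI teams were free to derive their case-level scores differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LesionCandidate:
    # Hypothetical container for one detected lesion: its location in the
    # MRI volume plus a 0-100 likelihood of clinically significant cancer.
    center_xyz: Tuple[int, int, int]
    likelihood: float  # 0 = almost certainly benign, 100 = almost certainly csPCa

@dataclass
class CasePrediction:
    lesions: List[LesionCandidate]
    case_likelihood: float  # overall 0-100 case-level score

def predict_case(lesions: List[LesionCandidate]) -> CasePrediction:
    # Assumption for this sketch: the case-level score is the maximum
    # lesion-level likelihood, or 0 if no candidate lesion was detected.
    case_score = max((lesion.likelihood for lesion in lesions), default=0.0)
    return CasePrediction(lesions=lesions, case_likelihood=case_score)

# Example: two candidate lesions yield a case-level score of 83.
example = predict_case([
    LesionCandidate(center_xyz=(120, 88, 14), likelihood=83.0),
    LesionCandidate(center_xyz=(64, 102, 9), likelihood=22.0),
])
print(example.case_likelihood)  # 83.0
```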
Next, once these algorithm developers finished their AI models at the end of the development cycle, they submitted them, and the study investigators then validated these AI models. The first step is creating the models; the next step is validating them, here in a set of a thousand cases. This was done remotely and offline, and everybody was fully masked to the results. Again, histopathology and a follow-up period of at least three years were used to establish the reference standard.
Out of all the models submitted, the study investigators independently retrained the five top performing AI models using over 9,000 cases. Once they were retrained, the nice thing is they combined these five different models into one model using equal weighting. At this point, we have our AI model developed. In addition to the AI model that was developed and then validated by the study investigators, they had to recruit the radiologists. It isn't fair to compare the AI model only to the real-world radiologists; you need a third sample. We can think about this as a three-arm study. You have the AI model in one arm. The second arm is the radiologists participating just for the purpose of the study. And then you have the third cohort, the radiologists who read the images in a real-world, real-time setting.
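To illustrate what combining five models with equal weighting can look like in practice, the sketch below simply averages the case-level scores and the voxel-level detection maps across models. This is one straightforward way to build such an ensemble, offered as an assumption; the organizers' actual merging code is not described in this discussion.

```python
import numpy as np

def ensemble_equal_weight(case_scores, detection_maps):
    """Combine predictions from several models with equal weighting.

    case_scores: list of floats, one 0-100 case-level score per model.
    detection_maps: list of numpy arrays of identical shape, one
        voxel-wise likelihood map per model.
    Returns the averaged case-level score and the averaged detection map.
    """
    ensembled_score = float(np.mean(case_scores))
    ensembled_map = np.mean(np.stack(detection_maps, axis=0), axis=0)
    return ensembled_score, ensembled_map

# Example with five hypothetical models on a toy 4x4x4 volume.
rng = np.random.default_rng(0)
scores = [72.0, 65.0, 80.0, 70.0, 68.0]
maps = [rng.uniform(0, 100, size=(4, 4, 4)) for _ in range(5)]
score, det_map = ensemble_equal_weight(scores, maps)
print(round(score, 1))  # 71.0
```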
These radiologists, the second cohort, were also invited using the grand-challenge.org platform, and they read 400 MRI exams that were randomly sampled from the testing cohort. It's important that the investigators selected for radiologists who were experienced: all of them had extensive experience reading multi-parametric MRI and scoring it using the PI-RADS system, with a median experience of seven years. It's important to note that none of these radiologists had participated in or was working at any of the participating centers. If we look at their expertise, based on the ESUR/ESUI consensus statements, 74% of them were self-designated as experts. So really, an experienced cohort here.
These radiologists weren't asked to read all 400 MRIs. That's obviously labor-intensive and would decrease the chance that these radiologists would be willing to partake in the study. So they used a split-plot design, where the readers and cases were randomly distributed into four different blocks of 100 cases each, and each radiologist then read their images in two sequential rounds. First, they read the bi-parametric imaging along with the metadata that was also available to the AI system, so the prostate volume, PSA, and so on. Then, as a second step, these radiologists read the full multi-parametric MRI study, so not bi-parametric but multi-parametric, and the readers could use the additional information from the additional sequences to update their findings.
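As a toy illustration of the split-plot idea, the sketch below randomly distributes 400 case IDs into four blocks of 100 and spreads readers across the blocks in a round-robin fashion. The balancing rule and random seed here are assumptions for illustration, not the study's actual randomization protocol.

```python
import random

def split_plot_assignment(case_ids, reader_ids, n_blocks=4, seed=42):
    """Randomly split cases into equal blocks and spread readers across them.

    A simplified stand-in for the study's split-plot design: 400 cases are
    shuffled into 4 blocks of 100, and readers are assigned to blocks so that
    no single reader has to read all 400 exams.
    """
    rng = random.Random(seed)
    cases = list(case_ids)
    rng.shuffle(cases)
    block_size = len(cases) // n_blocks
    blocks = [cases[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

    readers = list(reader_ids)
    rng.shuffle(readers)
    # Round-robin assignment of readers to block indices (an assumption,
    # not the protocol's exact balancing rule).
    assignment = {reader: i % n_blocks for i, reader in enumerate(readers)}
    return blocks, assignment

blocks, assignment = split_plot_assignment(range(400), [f"R{i:02d}" for i in range(62)])
print([len(b) for b in blocks])  # [100, 100, 100, 100]
```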
But it's important to note that these readers did not have access to patient history, nor could they consult with their peers. That's different from the cohort of radiologists who read these images in a real-time setting. The study readers really had less information, which helps when parsing these comparisons. It's also important to note that, in the context of this study, only the multi-parametric MRI readings were considered for the analysis.
And so in terms of the statistics, really we can think about this in terms of two pairwise comparisons. We compare the AI system to the 62 radiologists who participated just for the purpose of the study, and then we compare the AI system as well to the historical radiology readings that were made during clinical practice.
The primary hypothesis was that the standalone AI system would be non-inferior to both sets of radiologists. And then if it was non-inferior, then potentially we could also test for the superiority of the AI system, which is pretty standard in these non-inferiority designs.
When we talk about the test statistics for the comparisons, when the investigators compared the AI model to the 62 radiologists, they used the area under the receiver operating characteristic curve (AUROC). And when they compared the AI model to the historical radiology readings, they looked at the difference in specificity when the AI system was set to the same sensitivity as the PI-RADS three or greater threshold.
And when we talk about non-inferiority, non-inferiority would be concluded if the test statistic, the performance difference, was greater than zero and the lower boundary of its 95% confidence interval was greater than negative 0.05. And if non-inferiority was concluded based on these criteria, then the superiority of the AI system over either set of radiologists was assessed.
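Putting the second comparison and the decision rules together, here is a minimal sketch of how such an analysis could be run on paired case-level data: the AI threshold is chosen to match the sensitivity of the PI-RADS three or greater reads, the specificity difference is the test statistic, and the margin check from above is applied. The function names, the way the threshold is searched, and the superiority criterion (confidence interval excluding zero) are assumptions for illustration rather than the authors' actual statistical code.

```python
import numpy as np

def specificity_difference_at_matched_sensitivity(y_true, ai_scores, rad_positive):
    """Specificity of the AI system minus that of the routine reads,
    with the AI threshold chosen to match the radiologists' sensitivity.

    y_true: 1 if clinically significant cancer, else 0.
    ai_scores: AI case-level scores (higher = more suspicious).
    rad_positive: 1 if the routine read was PI-RADS >= 3, else 0.
    """
    y_true, ai_scores, rad_positive = map(np.asarray, (y_true, ai_scores, rad_positive))
    pos, neg = y_true == 1, y_true == 0

    rad_sens = rad_positive[pos].mean()
    rad_spec = 1.0 - rad_positive[neg].mean()

    # Walk thresholds from high to low until the AI sensitivity reaches the
    # radiologists' sensitivity, then read off the AI specificity there.
    for threshold in np.sort(np.unique(ai_scores))[::-1]:
        if (ai_scores[pos] >= threshold).mean() >= rad_sens:
            break
    ai_spec = (ai_scores[neg] < threshold).mean()
    return ai_spec - rad_spec

def noninferiority_decision(diff, ci_lower, margin=0.05):
    """Apply the decision rules described above."""
    non_inferior = diff > 0 and ci_lower > -margin
    superior = non_inferior and ci_lower > 0  # assumed superiority criterion
    return non_inferior, superior
```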
At this point, thank you for bearing with us through this dense methodology section, but it's very important to understand in order to contextualize the results. Zach will now go over the results, starting with the characteristics of the cohort and then looking at how this AI model compared to the radiologists.
Zach Klaassen: Rashid, thanks so much for that great introduction and overview of the methodology. So before we look at the table on the right, we're going to highlight that between June 12, 2022 and November 28, 2022, there were 809 individuals from 53 countries that opted into the development of the AI system. This resulted in 293 AI algorithms that were submitted. And as Rashid mentioned, the top five models used included the University of Sydney, the University of Science and Technology in China, the Guerbet Research Center in France, the Istanbul Technical University, and Stanford University.
When we look at the patient distribution across the cohorts, there was a total of 9,129 patients with a median age of 66 years. The median PSA at the time of the study was eight, and the median prostate volume was 61 mL. When we look at the MR scanners, the majority of these were Siemens and Philips medical system scanners. And when we look at the field strength, roughly half were 1.5 Tesla and the other half were three Tesla. When we look at the cases, 76% of patients had benign findings or indolent prostate cancer. And the important part here is that 24% of patients had clinically significant prostate cancer, defined as greater than or equal to Gleason grade group two.
With regards to the positive MRI lesions, 16% had PI-RADS three, 47% had PI-RADS four, and 37% of patients had PI-RADS five. Lastly, when we look at the ISUP-based lesions, we see that 40% of patients were Gleason grade group one, 31% were grade group two, 14% of patients were grade group three, 6% of patients were grade group four, and 8% of patients were Gleason grade group five.
This is the ROC curve of the AI system and the pool of 62 radiologists. As Rashid laid out, this is the reader study of 400 testing cases, and we see that the teal line here is the AI system with an area under the ROC curve (AUROC) of 0.91. The radiologists are in red with an AUROC of 0.86. And as Rashid nicely laid out in the methods, the key finding in this study is that the AI system in the reader study was non-inferior, was thus tested for superiority, and was deemed superior to the radiologists for this outcome. When we look at this, this is the difference in the AUROC metric between the AI system and the pool of 62 radiologists. This is a little bit more visually appealing in terms of understanding non-inferiority and superiority. We can see here that the line is to the right of both the non-inferiority margin and the superiority margin, favoring the AI system.
So what does this really mean in terms of real-world outcomes, and how does this operationalize to the clinic? What this means is that the AI system, compared with radiologists at their PI-RADS three or greater operating point, detected 6.8% more clinically significant prostate cancers at the same specificity, and generated 50.4% fewer false positives and 20% fewer Gleason grade group one cancer diagnoses at the same sensitivity as the radiologists.
This is the ROC curve of the AI system and the PI-RADS operating points of the radiology reads made during routine multidisciplinary practice, and this was the thousand testing cases. This is the "real-world experience." We see here that the AI system had an AUROC of 0.93, and when we compare this to the radiologists, the AI system was non-inferior to radiologists in routine multidisciplinary practice, but was not deemed superior.
And again, when we look at this from almost like a forest plot point of view, we see the difference in specificity at matched sensitivity was negative 0.2. Again, we see that this line and its confidence interval are to the right of the non-inferiority margin, but not to the right of the superiority margin. So again, non-inferior for the AI system compared to radiologists in routine clinical practice.
What does this all mean? Basically, the PI-CAI challenge showed that a state-of-the-art AI system was superior in discriminating patients with clinically significant prostate cancer on MRI compared to 62 radiologists using PI-RADS version 2.1 in an international reader study. Secondly, it was non-inferior when compared to the standard of routine care in radiology practice. Why was it not superior? There are several possible reasons. One is that these radiologists had access to patient history. They were able to consult with their peers if it was a difficult case; they could ask other radiologists, attend meetings, and have case presentations, and there is also potentially greater protocol familiarity. This is perhaps why there was no superiority of the AI system over radiologists in routine radiology practice.
What we see in this study is that the predictive values for the AI system were very high: 89.5% sensitivity at a 79.1% specificity, as well as a 93.8% negative predictive value at an estimated 33% prevalence. Although it's somewhat difficult to compare these results to radiologists in previous analyses, when we do look at multi-parametric MRI in the PROMIS trial, there was 88% sensitivity at 45% specificity for the radiologists and a 76% negative predictive value at an estimated 53% prevalence.
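As a quick sanity check on how these numbers fit together, the negative predictive value can be recovered from the reported sensitivity, specificity, and prevalence with the standard formula (this is just arithmetic on the figures quoted above, not an additional result from the paper):

\[
\mathrm{NPV} = \frac{\mathrm{Spec}\,(1-\mathrm{Prev})}{\mathrm{Spec}\,(1-\mathrm{Prev}) + (1-\mathrm{Sens})\,\mathrm{Prev}}
= \frac{0.791 \times 0.67}{0.791 \times 0.67 + 0.105 \times 0.33} \approx 0.939,
\]

which lines up with the reported 93.8% negative predictive value at a 33% prevalence, with the small difference attributable to rounding of the inputs.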
When we look at two meta-analyses of 42 studies of radiologists, the sensitivity was 96% at a 29% specificity, with a 90.8% negative predictive value. When we look at the limitations, there are several. This was a well-designed study, but the authors did a great job of highlighting several limitations. The first is that the data set was retrospectively curated over several years and multiple sites, which led to a mix of consecutive patients and sampled patients.
Secondly, radiologists provided their analysis for retrospective data through an online reading environment, and this may have differed significantly from their day-to-day native workflow.
Third, biopsy planning and histological verification were guided by the original radiology read and not prospectively by either the radiologist or the artificial intelligence system.
Fourth, this study may have been hampered by differential verification bias, meaning that patient examinations were verified by multiple standards, such as biopsies, prostatectomies, and follow-up, which were combined to establish the presence or absence of significant cancer.
And finally, there was no data on ethnicity, and 93.4% of all MRI exams were acquired from one MRI manufacturer. Certainly, reproducibility needs to be confirmed on other MRI systems, as well as in other races across the prostate cancer spectrum.
In conclusion, an AI system was superior to radiologists using PI-RADS version 2.1 at detecting clinically significant prostate cancer and comparable to standard of care in routine radiology practice. Such a system shows the potential to be a supportive tool within a primary diagnostic setting with several potential associated benefits for patients and radiologists.
And finally, prospective validation, which is ongoing in the CHANGE trial, is needed to test the clinical applicability of this system.
Thank you very much for your attention. We hope you enjoyed this UroToday Journal Club discussion of the PI-CAI study published recently in The Lancet Oncology.