GPT-4 in Urology: Revolutionizing Renal Surgery Data Extraction - Jessica Hsueh & Mark Ball
August 26, 2024
Mark Ball and Jessica Hsueh discuss a study on using GPT-4 for data extraction from renal surgery operative notes. The research compares GPT-4's performance to human-curated data across five variables: laterality, surgery type, approach, estimated blood loss, and ischemia time. Results show high accuracy for categorical variables but lower accuracy for continuous variables and heterogeneously written information. The study suggests GPT-4 can be an efficient first step in data extraction, with human oversight for improvement. Dr. Ball notes the experience has led to more templated operative notes at his institution. The discussion explores potential future applications, including AI analysis of surgical videos and automated operative note generation. The researchers emphasize the transformative potential of AI in clinical practice while acknowledging privacy concerns and the need for regulations.
Biographies:
Jessica Hsueh, Medical Student, Georgetown University School of Medicine, Washington, DC
Mark Ball, MD, Associate Program Director of the Urologic Oncology Fellowship Program, Urologic Oncology Branch, National Cancer Institute, Bethesda, MD
Ruchika Talwar, MD, Assistant Professor of Urology, Urologic Oncologist, and Associate Medical Director in Population Health, Vanderbilt University Medical Center, Nashville, TN
Read the Full Video Transcript
Ruchika Talwar: Hi everyone. Welcome back to UroToday's Health Policy Center of Excellence. As always, my name is Ruchika Talwar, and I'm a urologic oncologist in Nashville, Tennessee. I'm really excited today to be joined by Dr. Mark Ball from the NCI and Jessica Hsueh, who's a fourth-year medical student at Georgetown. They'll be here discussing some of their recent work, exploring the use of ChatGPT for renal surgery operative notes. Thanks to both of you for making the time to join us. We appreciate it.
Mark Ball: Thank you so much for having us.
Jessica Hsueh: Yeah, thank you so much for having us talk today about our research project looking at GPT-4 as a data extraction tool for renal surgery operative notes.
So just to provide a little bit of background. Data extraction, as we know, is a vital first step in any clinical data analysis we do, and clinical data analysis is very important in informing how we can better care for our patients. But data extraction, as it stands right now, is very tedious and requires a lot of manual effort. So we were interested in whether we could use large language models to accelerate this data extraction process or make it more efficient. And one of the biggest large language models around is GPT-4. Many published studies have already shown that GPT-4 has a lot of applications in urology, from writing clinical notes and generating discharge summaries to answering standardized exam questions and serving as a medical education tool. What we noticed is that no studies had yet looked at GPT-4 as a data extraction tool in urology specifically, particularly for extracting operative data.
So what we did in our study was we compiled around 1,500 renal surgery operative notes from 2003 to 2023, and we had five variables that we were particularly interested in looking at: laterality, so whether the surgery was left versus right; surgery, so whether the surgery was a radical versus partial nephrectomy; approach, so looking at whether the surgery was open, laparoscopic, or robotic; and then we also had two continuous variables that we were interested in, estimated blood loss and ischemia time.
We de-identified our operative notes through a third-party Python-based program. Then we put the operative notes into GPT-4 and asked it to extract data for the five variables I described. We used several prompting strategies to refine the data extraction process, and after that, we compared our GPT-4 findings to a human-curated database that we had compiled over several years with the same information on the same variables. To compare GPT-4 and our human-curated database, we looked at match rates, meaning how many data points matched up with each other directly. We then manually reviewed all the non-matching data points to determine accuracy rates for each variable. Finally, we looked at Cohen's kappa and the intraclass correlation coefficient (ICC) as another way to assess inter-rater reliability between GPT-4 and our human-curated database.
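To make the extraction step concrete, here is a minimal sketch of what a single extraction call might look like. This is illustrative only: the study's actual prompts and code are not shown in the discussion, so the prompt wording, model settings, and output schema below are all assumptions. It assumes the OpenAI Python client and an already de-identified note as input.

```python
# Illustrative sketch only; the study's actual prompts and code were not published.
# Assumes the OpenAI Python client (pip install openai) and an already
# de-identified operative note as input.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Extract the following fields from the operative note below and
return them as JSON with exactly these keys:
  laterality: "left" or "right"
  surgery: "radical nephrectomy" or "partial nephrectomy"
  approach: "open", "laparoscopic", or "robotic"
  estimated_blood_loss_ml: number, or null if not stated
  ischemia_time_min: number, or null if not stated

Operative note:
{note}
"""

def extract_variables(note: str) -> dict:
    """Ask GPT-4 to pull the five study variables out of one de-identified note."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep extraction output as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    return json.loads(response.choices[0].message.content)
```

In practice, a pipeline like this would also need to validate the reply, since the model is not guaranteed to return well-formed JSON; a retry loop or a structured-output mode would handle malformed responses.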
And what we can see here from our table is, well, we had several findings that we took away from this. First, when we look at match rates for the different variables, there were pretty high match rates between GPT-4 and human-curated extraction for laterality, surgery, and approach, all above 85%. Estimated blood loss had a slightly lower match rate of 77%, and ischemia time had a match rate of approximately 26%. After manually reviewing the data points and calculating accuracy rates, we noticed something interesting: GPT-4 actually had a higher accuracy rate for estimated blood loss than our human-curated database, and the two had similar accuracy rates for laterality. But human curation was more accurate for the other three variables: surgery, approach, and ischemia time. And when we look at Cohen's kappa and ICC for inter-rater reliability, GPT-4 and human-curated extraction had near-perfect or substantial agreement for all variables except ischemia time.
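For readers who want to see how those agreement statistics are typically computed, here is a minimal sketch. The file name and column layout are assumptions, and the study's actual analysis code is not shown here; the point is only to illustrate match rate, Cohen's kappa, and ICC on paired extractions.

```python
# Illustrative sketch of the comparison metrics; the CSV file and its
# column names ("laterality_gpt4", "ebl_human", etc.) are assumptions.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("extractions.csv")  # hypothetical: one row per operative note

# Match rate: fraction of notes where GPT-4 and the human-curated
# database recorded exactly the same value.
match_rate = (df["laterality_gpt4"] == df["laterality_human"]).mean()

# Cohen's kappa: chance-corrected agreement for a categorical variable.
kappa = cohen_kappa_score(df["laterality_gpt4"], df["laterality_human"])

# ICC for a continuous variable such as estimated blood loss: reshape to
# long format (one row per note-rater pair), then compute the ICC table.
long = pd.melt(
    df.reset_index(),
    id_vars="index",
    value_vars=["ebl_gpt4", "ebl_human"],
    var_name="rater",
    value_name="ebl",
).rename(columns={"index": "note_id"})
icc = pg.intraclass_corr(data=long, targets="note_id", raters="rater", ratings="ebl")

print(f"laterality: match rate {match_rate:.1%}, kappa {kappa:.2f}")
print(icc[["Type", "ICC"]])
```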
So some of the takeaways we concluded from our study were that GPT-4 was very accurate in extracting data for variables that tended to be categorical. For example, it was pretty good at extracting laterality, because with laterality you really only have the option of left versus right. With continuous variables, where the numbers exist on an infinite spectrum, GPT-4 may have had a bit more of a challenge extracting that sort of data. And we also noticed that GPT-4 struggled to extract data for variables that were written more heterogeneously.
What we mean by that is, when you have 20 years' worth of operative notes, everyone writes their notes a little bit differently. With laterality, for example, there are only so many ways you can write that a surgery was on the left side or the right side, but there are a lot of different ways to phrase ischemia time. So we think that, because of the variation in text patterns across notes, GPT-4 had a little more trouble extracting information like ischemia time. But overall, based on our findings, we believe GPT-4 can still be utilized, especially as a first step to efficiently extract data from operative notes, and then, with human feedback and oversight, the data extraction process can be improved overall; one simple way to set up that oversight step is sketched below.
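As a sketch of that human-in-the-loop step, one could route only the disagreements to a reviewer, so human effort is concentrated where the two sources conflict. The file and column names are the same assumptions as in the earlier sketches.

```python
# Human-in-the-loop pass: send only disagreements to a manual review queue.
# "extractions.csv" and its column names are assumptions carried over from above.
import pandas as pd

df = pd.read_csv("extractions.csv")
mismatch = df["laterality_gpt4"] != df["laterality_human"]
df[mismatch].to_csv("manual_review_queue.csv", index=False)
print(f"{mismatch.sum()} of {len(df)} notes flagged for manual review")
```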
Ruchika Talwar: Thank you. Super interesting study. When I read your paper, the first thing I started thinking about was, wow, just another way that we can use ChatGPT in medicine to enhance efficiency. And I interpreted this as a great way to be able to query clinical databases, for example, or take operative notes and create a clinical database. Dr. Ball, I'm curious, can you share a bit of your thoughts on utility?
Mark Ball: Well, I think this study was very eye-opening for me as a surgeon, just thinking about the way I document things. As a direct result of this study, we have gone to a more templated operative note, or at least one with a templated section, so things like ischemia time and laterality are broken out separately from the rest of our narrative description of the surgery. If you're tracking clinical outcomes from a research perspective, that's an important thing to do. I also think this is just where we are in 2024. Large language models are only going to get better and better, and it's only a matter of time before interpreting heterogeneous legacy text becomes so easy that human-curated data extraction is on its last legs.
Ruchika Talwar: Yeah, really exciting time, I think. Every time I hear a new article being published on the utility of GPT and large language models, it's like you said, eye-opening, and it really makes you think about what's next. You mentioned standardizing your own operative notes to be able to more efficiently pull this data using this method. The thing that I think about is even outside of the research space, surgeons who are interested in perhaps tracking their own outcomes in terms of decreasing clamp time or trends in various aspects of surgery with difficult masses, things like that.
So I think the possibilities are truly endless, both from a clinical perspective and a research perspective. But I have to ask the question: do you think we'll ever get to a place where we can correlate, for example, surgical videos with templated operative notes to enhance our efficiency even more? We're already using AI scribes in clinic that simply listen to your conversation. I'm wondering if there's a world in which, while I'm operating, I can dictate the surgery to a large language model, or it can even pull some of that data from video, obviously using AI and machine learning more than large language models. Do you think that's a possibility?
Mark Ball: I think absolutely. We're still in the very early stages of video analysis. We know that AI does a great job with static images from radiology and pathology, and video is really just a series of static images. So I think that will come, and our research group is interested in AI applications for surgical videos, especially in minimally invasive surgery, where cases are often recorded by default. So we have thousands of hours on which we can train AI models, and we'll see how they do prospectively.
Ruchika Talwar: Really interested to see the work that is coming out of your group. I think that, again, the possibilities are endless here. As we wrap up, what is your big takeaway for the urologic community based on your study?
Jessica Hsueh: To me, one of the biggest takeaways from our study is, as Dr. Ball alluded to, that it's 2024. New technologies and new tools are coming out every single day. Artificial intelligence is here, and a lot of it is really going to transform how we practice and how we care for our patients. We can acknowledge that artificial intelligence still has limitations, especially around privacy concerns and the lack of universal regulations. But nevertheless, it is a very powerful tool that, as urologists, we can embrace and at least learn how to use so that we can better care for our patients.
Mark Ball: I don't think I could say it any better than that.
Ruchika Talwar: Absolutely. I think a lot of the points you bring up are really important. On privacy concerns, there is emerging legislation about ethical AI use, but it has not, I would say, been directly applied to what we do in medicine, particularly in surgery. There has been more and more focus on making sure we maintain patient privacy and HIPAA protections when it comes to surgical videos, and it'll be interesting to see where legislation goes. I really applaud you both for taking this head on, because we need research and data to inform the health policy regulations that come out. So thank you for your work on this study, and thanks for taking some time to share it with the UroToday audience.
Mark Ball: Our pleasure. Thank you so much for having us and for your interest in our study.
Ruchika Talwar: Thanks all for tuning in. We'll see you next time.