Artificial Intelligence for Urology Research: The Holy Grail of Data Science or Pandora's Box of Misinformation? - Beyond the Abstract
For example, an endourologist practicing in Metro Detroit (where this author is from) might access an AI interface and pose the prompt “access XYZ (example) publicly available database and tell me what the incidence of kidney stone disease is for patients who live in zip codes 48302 and 48343” or “for patients living in zip codes 48302 and 48343, tell me which demographic is most at risk for kidney stone disease,” so that the clinician can provide more tailored care to their patients. Effectively, this could allow for the clinical application of customized “local epidemiology.”
In order to get to this idealized technological future, we needed to start with a much more basic premise. Therefore, we decided to test whether currently available AI interfaces are even capable of performing a relatively straightforward urologic epidemiologic calculation. We chose to request the calculation of the incidence and prevalence of kidney stone disease, as this has well-established published data from the Urologic Diseases in America Project against which we could compare results. We chose the chatbot large language models (LLMs) ChatGPT and Bard because they are popular, free, and easy to use; a more complicated or less accessible interface would defeat part of the purpose, as community urologists with fewer resources would be less likely to utilize it.

Initially, this approach seemed promising, as we discovered that Bard has access to the internet and appeared capable of accessing the National Health and Nutrition Examination Survey (NHANES) database. We had planned to compare the numbers produced by the LLMs with those published in prior studies. Bard gave convincing, detailed responses about accessing the NHANES database, downloading datasets, and performing the requested calculations, and it provided plausible-sounding numbers with confidence intervals. However, in several of the responses, the numbers appeared markedly similar to those published in our reference studies. Despite detailing the calculation, Bard eventually admitted to having pulled the numbers from published sources once we explicitly asked whether it had done so. As a secondary attempt, we asked ChatGPT and Bard to generate customized code that could then be executed on a given dataset; however, neither was able to accomplish this successfully (a sketch of the kind of script we had in mind appears below). Perhaps someone more versed in computer code would be able to troubleshoot the generated code, but that too would defeat the purpose of being easily usable by the lay urologist.
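For illustration, a minimal sketch of the kind of script we had hoped an LLM could generate is shown below. This is our own illustrative example, not output from either chatbot, and it rests on several assumptions: the file and variable names (DEMO_J.XPT, KIQ_U_J.XPT, SEQN, KIQ026, WTINT2YR) are taken from public NHANES documentation for the 2017–2018 cycle, and the script computes only a crude weighted prevalence point estimate rather than a full survey-design analysis.

```python
# Illustrative sketch only: roughly the kind of script we had hoped an LLM could
# produce. Assumes the NHANES 2017-2018 demographics (DEMO_J.XPT) and kidney
# conditions questionnaire (KIQ_U_J.XPT) files have already been downloaded locally;
# file and variable names are taken from public NHANES documentation.
import pandas as pd

demo = pd.read_sas("DEMO_J.XPT")   # demographics, includes interview weight WTINT2YR
kiq = pd.read_sas("KIQ_U_J.XPT")   # kidney conditions questionnaire, includes KIQ026

# Merge the two files on the respondent sequence number.
df = kiq.merge(demo[["SEQN", "WTINT2YR"]], on="SEQN", how="inner")

# KIQ026: "Have you ever had kidney stones?" (1 = Yes, 2 = No); drop refused,
# don't know, and missing responses.
df = df[df["KIQ026"].isin([1.0, 2.0])]
df["ever_stone"] = (df["KIQ026"] == 1.0).astype(float)

# Survey-weighted lifetime prevalence (point estimate only; confidence intervals
# would also require the stratum and PSU design variables, e.g., SDMVSTRA and SDMVPSU).
prevalence = (df["ever_stone"] * df["WTINT2YR"]).sum() / df["WTINT2YR"].sum()
print(f"Weighted lifetime prevalence of kidney stones: {prevalence:.1%}")
```

Even a short script like this requires locating the correct files and variables, merging them, and applying the survey weights appropriately, and it is precisely this kind of detail that the chatbots were unable to execute reliably.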
Overall, we learned that in their current state, easily accessible, free LLMs such as ChatGPT and Bard are incapable of performing basic epidemiologic calculations, let alone the more complex calculations for customized local epidemiology that would be useful to clinicians. Both ChatGPT and Bard warn that their information may be inaccurate or that they may confabulate; however, we found the degree of convincing detail that went into these “calculations” especially concerning, as it could easily mislead users. Although the results were not what we had anticipated, given the current buzz surrounding AI in urology, we thought our findings represented an important counterpoint and a caution for those incorporating AI into their work. We anticipate that future iterations of LLMs, or other forms of AI designed with specific tasks in mind, will be better able to accomplish some of our aspirations.
Written by: Ryan Matthew Blake, MD, & Johnathan Alexander Khusid, MD, Icahn School of Medicine at Mount Sinai, New York, NY
Read the Abstract