Since the release of ChatGPT, there have been a lot of concerns about the dangers it poses. Some are justified, but most of them are so ridiculous that you wonder how we got this far as a species in the first place.
One very legitimate concern, however, was how these large language models (LLMs) would be used in places like hospitals and doctor’s offices. The stakes can be a matter of life and death, which means you might not want chatbots that are known to get things wrong, a lot of the time, to be making big decisions about your health. That risk hasn’t stopped people from trying to build a medical chatbot anyway, launching a race that involves everyone from pharma bros to the tech giants.
That includes Google. Meet Med-PaLM 2, an LLM the company announced in May as a chatbot dedicated to answering medical questions. AI researchers at the company published a paper Wednesday in Nature detailing how Med-PaLM, the chatbot’s first iteration, works, and they also introduced a set of benchmarks that they say can help evaluate AI chatbots for their efficacy and accuracy in medical settings. The authors say these metrics can help prevent, or at least cut down on, instances of bias and harm caused by LLMs.
Earlier this week, The Wall Street Journal reported that the chatbot is already being tested at the Mayo Clinic in Rochester, Minnesota. So, while you might feel uncomfortable with a bot helping your doctor answer questions, the reality is that it’s already happening, and at one of the largest and most prestigious medical group practices in the world.
“Medicine is a humane endeavor in which language enables key interactions for and between clinicians, researchers and patients,” the study’s authors write. “Yet, today’s artificial intelligence (AI) models for applications in medicine and healthcare have largely failed to fully utilize language.”
To solve what the authors describe as a “discordance between what today’s models can do and what may be expected of them” in medical settings, the team introduced a clinical benchmark for AI called MultiMedQA. The benchmark lets clinicians, hospitals, and researchers test different LLMs for accuracy before putting them to use, which can cut down on instances of chatbots hallucinating harmful misinformation or reinforcing biases in medical settings.
MultiMedQA combines six existing question-and-answer datasets spanning professional medicine, research, and consumer health queries, along with a new dataset developed by Google called HealthSearchQA that contains 3,173 of the most commonly searched medical questions online.
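To get a feel for what “running a model through” a question-and-answer benchmark actually involves, here is a minimal, hypothetical sketch in Python. The dataset format, the ask_model() stub, and the toy questions are all invented for illustration and are not Google’s actual MultiMedQA harness; the real evaluation also graded long-form answers with panels of human raters rather than relying on multiple-choice accuracy alone.

```python
# Illustrative sketch of benchmark-style scoring: run a model over a set of
# multiple-choice medical questions and report accuracy. The dataset format,
# ask_model(), and the questions are placeholders, not Google's MultiMedQA code.

from typing import Callable

# Toy items in the style of USMLE multiple-choice questions (invented, not real exam content).
QUESTIONS = [
    {
        "question": "Which vitamin deficiency classically causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
        "answer": "C",
    },
    {
        "question": "Which electrolyte abnormality is most associated with peaked T waves on ECG?",
        "options": {"A": "Hypokalemia", "B": "Hyperkalemia", "C": "Hyponatremia", "D": "Hypercalcemia"},
        "answer": "B",
    },
]


def ask_model(prompt: str) -> str:
    """Stand-in for a call to an LLM; a real harness would query a model API here."""
    return "C"  # dummy behavior: always picks option C


def score(model: Callable[[str], str], questions: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = model(prompt).strip().upper()
        if reply.startswith(q["answer"]):
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    print(f"Accuracy: {score(ask_model, QUESTIONS):.0%}")
```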
The researchers then ran PaLM, the underlying LLM that Google incorporated into its Bard chatbot, and a variant dubbed FLAN-PaLM through the benchmark. They found that the latter performed fairly well, even surpassing previous models on U.S. Medical Licensing Exam-style questions.
However, a panel of human clinicians evaluated the model’s long-form answers and found that just 62 percent of the responses aligned with scientific consensus—an obvious problem in places like hospitals where the wrong answer could mean someone dies.
So the researchers refined the model with a technique called instruction prompt tuning, which adapts a frozen model to a task using a small set of learned prompt parameters along with carefully written instructions and example answers from clinicians, rather than retraining the whole network. The result was Med-PaLM, and the improvement was substantial: the panel found that 92.6 percent of the model’s answers aligned with scientific consensus, essentially on par with answers given by human clinicians (92.9 percent).
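To make the idea a little more concrete, here is a bare-bones sketch of the prompt-tuning concept in PyTorch: the base model stays frozen and only a short run of “soft prompt” embeddings gets trained. Everything here (the tiny stand-in model, the dimensions, the dummy data) is invented for illustration and is not Google’s training code; the paper’s actual technique additionally conditions the model on clinician-written instructions and exemplars. The appeal of the approach is that the expensive base model never has to be retrained; only a handful of prompt parameters are updated for the medical domain.

```python
# A minimal sketch of the prompt-tuning idea: keep the base model frozen and
# learn only a short sequence of "soft prompt" embeddings prepended to every
# input. The tiny frozen model below is a stand-in for a real LLM such as
# FLAN-PaLM; nothing here is Google's actual training code.

import torch
import torch.nn as nn

EMBED_DIM = 64
VOCAB_SIZE = 1000
NUM_PROMPT_TOKENS = 8

# Stand-in "frozen LLM": an embedding layer plus a single transformer block.
embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
block = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4, batch_first=True)
head = nn.Linear(EMBED_DIM, VOCAB_SIZE)
for module in (embed, block, head):
    for p in module.parameters():
        p.requires_grad = False  # the base model's weights never change

# The only trainable parameters: a handful of soft prompt vectors.
soft_prompt = nn.Parameter(torch.randn(NUM_PROMPT_TOKENS, EMBED_DIM) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)


def forward(token_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the learned prompt to the input embeddings, then run the frozen model."""
    x = embed(token_ids)                                    # (batch, seq, dim)
    prompt = soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
    x = torch.cat([prompt, x], dim=1)                       # (batch, prompt+seq, dim)
    return head(block(x))                                   # logits over the vocabulary


# One illustrative training step on dummy data.
tokens = torch.randint(0, VOCAB_SIZE, (2, 16))
targets = torch.randint(0, VOCAB_SIZE, (2, 16 + NUM_PROMPT_TOKENS))
logits = forward(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()    # gradients flow only into soft_prompt, since everything else is frozen
optimizer.step()
```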
Additionally, while the panel said that 29.7 percent of FLAN-PaLM’s answers could have led to harmful outcomes, it found that just 5.8 percent of Med-PaLM’s answers could do so, slightly better than the 6.5 percent rate for answers given by human clinicians.
Of course, as with anything AI, there are caveats to keep in mind. The study’s authors note several limitations, including the relatively limited database of medical knowledge the benchmark draws on, the constantly shifting state of scientific consensus, and the fact that the human clinicians evaluating Med-PaLM still did not rate it at “clinical expert level” on several metrics.
The model, like so many other LLMs, also runs the risk of reinforcing biases in medicine. Bias is a perennial issue in AI, and it threatens real-world harm if its use goes unchecked, especially in an industry prone to scientific and medical racism and sexism.
“The use of LLMs to answer medical questions can cause harms that contribute to health disparities,” the authors wrote. They added, “This could lead to systems that produce differences in behavior or performance across populations that result in downstream harms in medical decision-making or reproduce racist misconceptions regarding the cause of health disparities.”
Perhaps the biggest issue here, though, is that Google seems to be setting up the goalposts for a game they’re playing in themselves. They created Med-PaLM, and they also created the new benchmark they say could be used to evaluate other medical LLMs. That’s an inherent conflict of interest, one that calls into question whether they should be the ones doing the judging.
For now, the bots are already starting to roll out in our hospitals. Time will tell whether they’ll truly help doctors and other clinicians save lives, or cause a whole lot of harm in the process.