Key Takeaways:
- Large Language Models (LLMs) have the potential to improve medical literacy and reduce misinformation online
- LLMs can provide more accurate and reliable health information than traditional search engines like Google
- However, LLMs also come with risks, including sycophancy and hallucination, which can spread medical misinformation
- Studies suggest LLMs answer medical questions correctly around 85% of the time, but accuracy falls sharply on harder, open-ended problems
- The development of health-specific LLMs, such as ChatGPT Health, may help to mitigate these risks and provide more accurate and trustworthy health information
Introduction to LLMs in Healthcare
The rise of Large Language Models (LLMs) could change how people access and understand medical information online. With the vast amount of health-related content on the internet, patients can struggle to distinguish high-quality sources from dubious websites. LLMs such as ChatGPT can help bridge this gap. According to Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist, LLMs can reduce patient anxiety and misinformation by providing more accurate and trustworthy information.
The Potential Benefits of LLMs
The release of ChatGPT Health and other health-specific LLMs signals that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. LLMs come with risks, including sycophancy and hallucination, but they also promise real benefits: they can ease the burden of medical misinformation and unnecessary health anxiety that the internet has created. As Amulya Yadav, an associate professor at Pennsylvania State University, notes, LLMs can provide more accurate and reliable health information than traditional search engines like Google, and studies have shown that they answer medical questions correctly around 85% of the time.
Evaluating the Effectiveness of LLMs
Evaluating LLMs for consumer health is a complex task, however. As Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system, notes, an open-ended chatbot like ChatGPT is difficult to evaluate. LLMs score well on medical licensing examinations, but those exams use multiple-choice questions that don’t reflect how people actually use chatbots to look up medical information. To address this gap, researchers have tested LLMs with more realistic prompts and scenarios. In one study, Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, found that GPT-4o answered licensing exam questions correctly only about half of the time when it was not given the list of possible answers.
Limitations and Risks of LLMs
Alongside these benefits, LLMs carry significant limitations and risks. They can be sycophantic and prone to hallucination, both of which can spread medical misinformation. As Reeva Lederman, a professor at the University of Melbourne, notes, patients who don’t like a diagnosis or treatment recommendation may seek a second opinion from an LLM, which can encourage them to reject their doctor’s advice. LLMs also struggle with more complex problems and cannot be counted on for accurate information in every case. And because the internet is awash in medically dubious diagnoses and treatments, models trained on that material can amplify misinformation, particularly if people see LLMs as inherently trustworthy.
The Development of Health-Specific LLMs
To address these limitations and risks, health-specific LLMs such as ChatGPT Health may offer more accurate and trustworthy health information. OpenAI has reported that the GPT-5 series of models is markedly less sycophantic and less prone to hallucination than its predecessors. The company has also evaluated the model that powers ChatGPT Health on health-specific questions using the HealthBench benchmark, which rewards models that express uncertainty when appropriate and that recommend users seek medical attention when necessary. These developments are promising, but continued evaluation and improvement will be needed to ensure that such models provide accurate and reliable health information.
Future Directions
As LLMs continue to evolve, they are likely to play an increasingly important role in healthcare. Realizing that potential responsibly will require addressing their limitations and risks: developing evaluation methods, building on benchmarks such as HealthBench, that reflect how people actually use chatbots, and educating patients and healthcare professionals about what these models can and cannot do. Done well, this could harness LLMs to improve medical literacy and reduce misinformation online, ultimately leading to better health outcomes and more informed decision-making.