Key Takeaways
- A Penn State study found that large language models (LLMs) answer everyday health‑related queries with only 76 % accuracy, translating to an error rate above 20 %—nearly double the mistake rate of human physicians.
- The research employed a real‑world, crowdsourced prompting approach: 34 participants generated 212 health‑concern prompts using their preferred LLMs over a week‑long event.
- Physician reviewers highlighted that while LLM responses can be “undeniably impressive,” the technology remains unreliable for self‑diagnosis or health‑related decision‑making.
- Errors are especially concerning for underrepresented patient populations and rare conditions, raising equity issues that require deliberate data‑collection and model‑development efforts.
- False‑positive outputs may cause psychological harm, increase unnecessary medical visits, and erode trust in professional care.
- The authors urge caution in integrating LLMs into healthcare applications, stressing that they should support, not replace, clinical expertise and that reliance on them could mask systemic workforce shortages.
Introduction
Asking medical questions of AI using the kind of language people actually type—“in the wild”—produces answers with mediocre accuracy, according to a new study from Penn State. The researchers report that the models got the correct answer only 76 % of the time, leaving an error rate that exceeds 20 %. As one of the study’s authors put it, “If acted upon, such errors could lead to harmful clinical outcomes.” This performance is troubling because it is close to double the mistake rate of human physicians, suggesting that LLMs are not yet ready to serve as frontline medical advisors.
Study Design and Participant Recruitment
To capture everyday usage, the team turned a university‑level competition into a crowdsourced prompting experiment. Informatics PhD candidate Bonam Mingole, MEng, and colleagues recruited 34 volunteers—faculty, staff, and students—who were instructed to “choose the LLM of their choice and use it as they would on a normal day.” Over the course of a weeklong event, these participants entered a total of 212 prompts that expressed both actual and imagined health concerns. The prompts were directed at four popular LLMs, allowing the researchers to observe how each model performed under realistic, unscripted conditions.
Evaluation Methodology
After the prompting phase, a panel of nine board‑certified physicians reviewed the AI‑generated responses for accuracy. The physicians applied a standardized rubric that judged whether each answer was medically correct, partially correct, or incorrect. This expert assessment aimed to mirror the kind of scrutiny a clinician would give to patient‑submitted information. By grounding the evaluation in professional judgment, the study sought to bridge the gap between technical performance metrics and real‑world clinical relevance.
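As a rough illustration of how three-level rubric ratings like these can be rolled up into accuracy and error figures, here is a minimal Python sketch. The function name `summarize_ratings`, the example ratings, and the choice to count only fully correct answers toward accuracy are all assumptions made for illustration; the article does not say how the study scored partially correct answers, and none of the numbers below come from the study.

```python
from collections import Counter

# The three rubric levels named in the article.
RUBRIC = ("correct", "partially correct", "incorrect")


def summarize_ratings(ratings):
    """Return the share of each rubric label among the given ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    unknown = set(ratings) - set(RUBRIC)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    return {label: counts[label] / total for label in RUBRIC}


# Hypothetical ratings for ten responses (NOT data from the study).
example = ["correct"] * 7 + ["partially correct"] * 2 + ["incorrect"]
shares = summarize_ratings(example)

# Assumption: only fully correct answers count toward accuracy.
strict_accuracy = shares["correct"]
error_rate = 1 - strict_accuracy

print(f"accuracy: {strict_accuracy:.0%}, error rate: {error_rate:.0%}")
```

Counting partially correct answers as errors is the stricter of the possible conventions; a more lenient tally would add `shares["partially correct"]` to the accuracy figure.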
Accuracy Results and Comparison to Clinicians
The physicians’ ratings revealed that the LLMs achieved an overall accuracy of 76 %, meaning that roughly one in four answers was flawed. The authors note that an error rate above 20 % “far exceeds the acceptable margin in most healthcare settings.” For context, human physicians typically err at a rate closer to 10 % in comparable diagnostic tasks, placing the AI’s performance well below the standard expected of trained clinicians. Even the best‑performing model, GPT‑4o, produced invalid responses in roughly one out of every five cases, according to the paper.
Researchers’ Cautionary Message
Given these findings, the authors urge that the integration of LLMs into healthcare applications be “approached with great caution.” They emphasize that the technology’s current error profile could lead to misguided self‑diagnosis, inappropriate treatment choices, or delayed care if users place undue trust in the output. The study concludes with a direct recommendation: “it is crucial that users exercise judicious scrutiny when employing these tools for self‑diagnosis or health‑related decision‑making.” This warning is intended for developers, clinicians, and the general public alike.
Discussion: Impressive Yet Unreliable
In their discussion, Mingole and co‑authors acknowledge that LLMs’ ability to respond to health‑related queries in ways physicians often find satisfactory is “undeniably impressive.” However, they swiftly caution that this impressiveness should not be mistaken for reliability. The paper quotes the team: “On the contrary… we hope that these results highlight that even the best‑performing model (GPT‑4o) generates invalid responses in roughly one out of every five cases.” This nuanced view recognizes the models’ linguistic fluency while underscoring their shortcomings in medical accuracy.
Potential for Harmful Clinical Outcomes
The authors further warn that acting on inaccurate LLM advice could have tangible harms. They state, “If acted upon, such errors could lead to harmful clinical outcomes.” This statement serves as a stark reminder that erroneous information—whether about medication dosage, symptom interpretation, or disease risk—can translate into real‑world danger, especially for patients who lack immediate access to professional clarification.
Equity Concerns: Underrepresented Populations and Rare Conditions
A secondary but critical insight from the study is that LLM performance drops for underrepresented patient groups and rare medical conditions. The researchers argue that this disparity risks exacerbating existing healthcare inequities. They write: “Addressing this issue requires more than technical mechanisms; it calls for a broader commitment to equity in the data collection, model development and evaluation processes.” Ensuring that training data reflect diverse demographics and that evaluation includes rare diseases is presented as essential to prevent the technology from widening gaps in care quality.
Psychological Burden of False Positives
The paper also highlights the psychological costs of false‑positive AI outputs, in which a model raises an alarm about a problem that is not there. The authors note that the resulting psychological burden “could lead to increased health‑related preoccupation, additional medical consultations and even avoidance of professional healthcare altogether due to fear or mistrust.” They argue that these mental‑health impacts must be weighed when assessing the net value of deploying LLMs in user‑facing health apps.
Societal Implications and Workforce Considerations
Finally, the researchers caution against viewing LLMs as a stopgap for physician shortages. They warn that reliance on AI might create a false sense of sufficiency, diverting attention from the structural need to expand the health‑professional workforce. The study quotes: “While LLMs may offer temporary support, they should not be viewed as replacements for clinical expertise. In areas already facing shortages of physicians, the reliance on LLMs may create a false sense of sufficiency, deflecting attention from the structural need to increase the supply of health professionals.” This perspective urges policymakers to treat AI as a complement, not a substitute, for human expertise.
Participatory Research Approach
Mingole emphasizes that the study’s strength lies in its real‑world, participatory design. He is quoted as saying, “[We told] participants to choose the LLM of their choice and use it as they would on a normal day.” This approach, he adds, “is so important for understanding how the public uses AI in their daily life.” By letting volunteers use the models organically, the research captured authentic prompting behaviors that laboratory‑only studies often miss.
Conclusion and Outlook
The Penn State investigation delivers a sobering snapshot of LLMs’ current capability to answer everyday health questions: modest accuracy, notable error rates, and significant equity and safety concerns. While the models’ linguistic fluency is undeniably impressive, the evidence suggests they are not yet trustworthy substitutes for professional medical advice. Moving forward, the authors call for rigorous, in‑the‑wild evaluations, greater attention to data diversity, and clear communication to users about the limitations of AI‑generated health information. Only with such safeguards can the potential benefits of LLMs be realized without compromising patient safety or exacerbating existing disparities.
https://healthexec.com/topics/artificial-intelligence/when-used-wild-conditions-real-world-large-language-ai-stumbles

