Top AI models fail spectacularly when faced with slightly altered medical questions

Shocking (not):

Artificial intelligence systems often perform impressively on standardized medical exams—but new research suggests these test scores may be misleading. A study published in JAMA Network Open indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns. When those patterns were slightly altered, the models’ performance dropped significantly—sometimes by more than half.

Link to report

Pascal Finette @radical