Towards Reliable Clinical AI: Evaluating Factuality, Robustness, and Real-World Performance of Large Language Models
Large language models are increasingly deployed in clinical settings, but their reliability remains uncertain: they hallucinate facts, behave inconsistently across different instruction phrasings, and struggle with evolving medical terminology. In this talk, I present methods for systematically evaluating clinical LLM reliability along four dimensions aligned with how healthcare professionals actually work: verifying concrete facts (FactEHR), ensuring stable guidance across instruction variations (an instruction-sensitivity study showing up to 0.6 AUROC variation), integrating up-to-date knowledge (BEACON, which improves biomedical NER by 15%), and assessing real patient conversations (PATIENT-EVAL, which reveals that models abandon safety warnings when patients seek reassurance). Together, these contributions establish evaluation standards spanning factuality, robustness, knowledge integration, and patient-centered communication, charting a path toward clinical AI that is safer, more equitable, and more trustworthy.