MedPI Eval Whitepaper: Interaction-First Clinical AI Evaluation
Lumos AI has developed a new framework for assessing medical AI in realistic multi-turn patient conversations, in which AI judges score each conversation against 105 clinical criteria.
Single-turn evaluation doesn’t capture real conversations
Most evaluations test only single-turn answers
Patient conversations are multi-turn and require clinician-aligned judgment
Safety depends on more than correctness
Real evaluation requires richer measures
Evaluating AI requires full patient conversations, not single-turn answers.
We introduce MedPI Eval, an interaction-first framework for assessing medical AI.
Four foundational components of the framework:
We evaluated 9 leading language models in 7,010 simulated patient conversations.
Each case included a rich patient history and a realistic clinical scenario. Models were assessed across 31 competencies and 105 evaluation dimensions, providing a comprehensive view of conversational competence.
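To make the setup concrete, the sketch below shows how an interaction-first evaluation loop could be wired together. It is an illustrative assumption, not the MedPI Eval implementation: the names (Rubric, Transcript, run_encounter, score_encounter) and the patient-simulator, candidate-model, and judge callables are hypothetical placeholders. A simulated patient and the candidate model alternate turns, then an AI judge rates the transcript on each evaluation dimension, and dimension scores roll up into competency scores.

```python
# Illustrative sketch only; all names and interfaces here are hypothetical,
# not the published MedPI Eval API.
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable

@dataclass
class Rubric:
    competency: str          # e.g. a competency label (hypothetical)
    dimensions: list[str]    # the evaluation dimensions judged under this competency

@dataclass
class Transcript:
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, utterance)

def run_encounter(patient_sim: Callable[[Transcript], str],
                  candidate_model: Callable[[Transcript], str],
                  max_turns: int = 8) -> Transcript:
    """Simulate a multi-turn encounter: the patient agent and the model alternate turns."""
    transcript = Transcript()
    for _ in range(max_turns):
        transcript.turns.append(("patient", patient_sim(transcript)))
        transcript.turns.append(("model", candidate_model(transcript)))
    return transcript

def score_encounter(transcript: Transcript,
                    judge: Callable[[Transcript, str], float],
                    rubrics: list[Rubric]) -> dict[str, float]:
    """An AI judge rates every dimension; each competency score is the mean of its dimensions."""
    return {
        r.competency: mean(judge(transcript, dim) for dim in r.dimensions)
        for r in rubrics
    }
```

In a full run, competency scores would in turn be grouped into the six domain scores shown in the views below.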
High-level performance overview
Competency domain overview
This view compares models across six core domains that group the 31 underlying competencies, providing an at-a-glance perspective on key areas of clinical performance.
Encounter types overview
This view shows how models perform across clinical encounter types, based on tasks simulating each setting.
Conditions overview
This view shows how models perform across different medical conditions. Each score combines results from multiple tasks within the condition.
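As a simple illustration of that roll-up (an assumed aggregation, not a documented formula), a condition-level score can be taken as the mean of its task-level scores; the conditions and numbers below are placeholders, not real results.

```python
# Placeholder data and a simple mean roll-up; not actual MedPI Eval results
# or its exact aggregation rule.
from statistics import mean

task_scores = {
    "hypertension": [0.82, 0.76, 0.88],    # hypothetical per-task scores
    "type 2 diabetes": [0.71, 0.79],
}

condition_scores = {condition: mean(scores) for condition, scores in task_scores.items()}
print(condition_scores)  # mean score per condition
```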
Detailed competency comparison
This view breaks down performance across individual competencies and their underlying dimensions.
Model strengths and weaknesses
This view highlights the five highest-scoring and five lowest-scoring competencies for the selected model.