Whitepaper

MedPI Eval Whitepaper: Interaction-First Clinical AI Evaluation

Lumos AI has developed a robust new framework for assessing medical AI in realistic multi-turn patient conversations, evaluated by AI judges across 105 clinical criteria.

Single-turn evaluation doesn’t capture real conversations

Most evaluations test only single-turn answers

Patient conversations are multi-turn and clinician-aligned

Safety depends on more than correctness

Real evaluation requires richer measures

Evaluating AI requires full patient conversations, not single-turn answers.

We introduce MedPI Eval, an interaction-first framework for assessing medical AI.

By simulating multi-turn dialogues with patients, it evaluates models on their ability to conduct clinically sensible conversations — identifying red flags, clarifying history, maintaining state, and escalating when necessary.

Four foundational components of the framework (a sketch of how they fit together follows the list):

01. Patient packets: realistic, longitudinal, controllable profiles
02. AI-patient simulator: natural conversation with human nuances
03. Task-to-rubric decomposition: granular, observable clinical workflows
04. LLM-as-judge protocol: systematic scoring calibrated against human raters
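The sketch below illustrates, under stated assumptions, how these four components could connect in an evaluation harness: a patient packet seeds the encounter, the AI-patient simulator and the model under test alternate turns, and an LLM judge scores the finished transcript against rubric items. All class, function, and parameter names (PatientPacket, run_encounter, judge.score, and so on) are hypothetical; MedPI Eval does not publish this API, and callers would supply their own model, simulator, and judge clients.

```python
from dataclasses import dataclass

# All names below are illustrative assumptions, not the MedPI Eval API.

@dataclass
class PatientPacket:
    """01 Patient packet: a controllable, longitudinal patient profile."""
    demographics: dict
    history: list                 # prior encounters, medications, labs
    presenting_complaint: str
    hidden_red_flags: list        # facts the model should elicit or act on

@dataclass
class RubricItem:
    """03 Rubric item: one observable criterion decomposed from a clinical task."""
    competency: str               # e.g. "red-flag identification"
    description: str              # e.g. "asks whether the chest pain radiates"

def ends_encounter(reply: str) -> bool:
    # Simplified stop condition; a real harness would detect closure or referral.
    return "follow up" in reply.lower() or "emergency department" in reply.lower()

def run_encounter(model, patient_simulator, packet: PatientPacket, max_turns: int = 20):
    """02 AI-patient simulator: multi-turn dialogue with the model under test."""
    transcript = [("patient", packet.presenting_complaint)]
    for _ in range(max_turns):
        reply = model.respond(transcript)                     # model under test
        transcript.append(("clinician_ai", reply))
        if ends_encounter(reply):
            break
        transcript.append(("patient", patient_simulator.respond(packet, transcript)))
    return transcript

def judge_encounter(judge, transcript, rubric: list) -> dict:
    """04 LLM-as-judge: score every rubric item against the full transcript."""
    return {item.description: judge.score(transcript, item) for item in rubric}
```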

We evaluated 9 leading language models in 7,010 simulated patient conversations.

Each case included rich patient histories and realistic clinical scenarios. Models were assessed across 31 competencies and 105 evaluation dimensions — providing a comprehensive view of conversational competence.
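As a rough illustration of how 105 dimension-level judgments might roll up into 31 competency scores, the snippet below takes per-dimension judge scores and a dimension-to-competency mapping and averages them. Both the example mapping and the unweighted-mean aggregation are assumptions for illustration, not the whitepaper's actual method.

```python
from collections import defaultdict
from statistics import mean

def competency_scores(dimension_scores, dimension_to_competency):
    """Roll per-dimension judge scores (0-1) up into competency scores.

    The unweighted mean is an assumed aggregation rule; the whitepaper does
    not specify how its 105 dimensions combine into the 31 competencies.
    """
    grouped = defaultdict(list)
    for dimension, score in dimension_scores.items():
        grouped[dimension_to_competency[dimension]].append(score)
    return {competency: mean(scores) for competency, scores in grouped.items()}

# Toy example with made-up dimensions.
print(competency_scores(
    {"asks about onset": 1.0, "asks about radiation": 0.0, "escalates red flag": 1.0},
    {"asks about onset": "history taking",
     "asks about radiation": "history taking",
     "escalates red flag": "safe escalation"},
))
# {'history taking': 0.5, 'safe escalation': 1.0}
```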

High-level performance overview

Competency domain overview

This view compares models across six core domains, grouping 31 underlying competencies to provide an at-a-glance perspective on key areas of clinical performance.
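The same roll-up, applied one level higher, can illustrate how such a domain view might be built: each model's competency scores are grouped by a competency-to-domain mapping and averaged. The mapping, the example scores, and the mean aggregation below are all placeholders, not the framework's published grouping.

```python
from statistics import mean

# Hypothetical competency-to-domain grouping; the actual six-domain mapping
# is defined by MedPI Eval and not reproduced here.
DOMAIN_OF = {
    "history taking": "Clinical reasoning",
    "red-flag identification": "Safety",
    "safe escalation": "Safety",
}

def domain_view(competency_scores_by_model):
    """Per-model domain scores (assumed: unweighted mean of member competencies)."""
    view = {}
    for model, scores in competency_scores_by_model.items():
        by_domain = {}
        for competency, score in scores.items():
            by_domain.setdefault(DOMAIN_OF[competency], []).append(score)
        view[model] = {domain: round(mean(s), 2) for domain, s in by_domain.items()}
    return view

print(domain_view({
    "model-a": {"history taking": 0.8, "red-flag identification": 0.6, "safe escalation": 0.9},
    "model-b": {"history taking": 0.7, "red-flag identification": 0.9, "safe escalation": 0.8},
}))
# {'model-a': {'Clinical reasoning': 0.8, 'Safety': 0.75},
#  'model-b': {'Clinical reasoning': 0.7, 'Safety': 0.85}}
```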


Encounter types overview

This view shows how models perform across clinical encounter types, based on tasks simulating each setting.


Conditions overview

This view shows how models perform across different medical conditions. Each score combines results from multiple tasks within the condition.


Detailed competency comparison

This view breaks down performance across individual competencies and their underlying dimensions.


Model strengths and weaknesses

This view highlights the five highest-scoring and five lowest-scoring competencies for the selected model.