MedPI Eval Leaderboard

Benchmarking leading AI models in realistic multi-turn clinical conversations with AI patients, with AI judges scoring performance across 105 criteria.

Published: Sep 29, 2025
Rank  Model                   Score
1     GPT 5                   53.6%
2     o3                      50.5%
3     Gemini 2.5 Pro          50%
4     Claude Opus 4.1         45.6%
5     GPT oss 120b            44.3%
6     Claude Sonnet 4         42.5%
7     Grok 4                  42.2%
8     MedGemma                41.1%
9     Llama 3.3 70b Instruct  34.8%

Learn more about MedPI Eval

MedPI Eval by Lumos AI is the first evaluation framework for clinical AI in realistic multi-turn conversations. Built on thousands of simulated patient dialogues and judged across 105 rubric criteria, it sets a new standard for robust evaluation in medicine. Read the whitepaper for a more detailed overview of the data.