MedPI Eval Leaderboard

Benchmarking leading AI models in realistic multi-turn clinical conversations, with AI patients and judges scoring performance across 105 criteria.

Published: Sep 29, 2025

GPT 5: 53.6%
Gemini 2.5 Pro: 50%
Claude Opus 4.1: 45.6%
GPT oss 120b: 44.3%
Claude Sonnet 4: 42.5%
Grok 4: 42.2%
MedGemma: 41.1%
Llama 3.3 70b Instruct: 34.8%

Learn more about MedPI Eval

MedPI Eval by Lumos AI is the first evaluation framework for clinical AI in realistic multi-turn conversations. Built on thousands of simulated patient dialogues and judged across 105 rubric criteria, it sets a new standard for robust evaluation in medicine. Read the whitepaper for a more detailed overview of the data.
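This page does not say how per-criterion judgments are combined into the single percentage shown for each model. The Python sketch below illustrates one plausible roll-up, purely as an assumption: each dialogue is scored on every rubric criterion, the criterion scores are averaged per dialogue, and the dialogue averages are averaged per model. The function names, the 0 to 1 score scale, and the toy data are all hypothetical and are not taken from MedPI Eval.

```python
from statistics import mean

def dialogue_score(criterion_scores, max_score=1.0):
    """Average the per-criterion scores of one dialogue, as a percentage.

    Assumes every criterion is scored on the same 0..max_score scale
    (a hypothetical choice; the real rubric scale is not given here).
    """
    if not criterion_scores:
        raise ValueError("a dialogue needs at least one criterion score")
    return 100.0 * mean(s / max_score for s in criterion_scores)

def model_score(dialogues, max_score=1.0):
    """Average the dialogue-level percentages into one model-level score."""
    if not dialogues:
        raise ValueError("a model needs at least one scored dialogue")
    return mean(dialogue_score(d, max_score) for d in dialogues)

# Toy data: 2 dialogues x 4 criteria. The real rubric has 105 criteria
# and thousands of dialogues; these numbers are made up.
toy = [
    [1.0, 0.5, 0.0, 0.5],  # dialogue 1 averages to 50%
    [1.0, 1.0, 1.0, 0.0],  # dialogue 2 averages to 75%
]
print(f"{model_score(toy):.1f}%")  # prints 62.5%
```

Averaging per dialogue first gives every conversation equal weight. A benchmark could just as well weight criteria or clinical categories differently, so the whitepaper remains the reference for the actual method.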

Read Whitepaper
Read Technical Paper