ScribeBench: Clinical documentation AI fidelity benchmark

Released June 2026

A public audit bench for AI scribe fidelity.

ScribeBench exists to answer one practical question: did the generated note preserve the source encounter, or did it invent care that never happened? Clinicians can inspect failure modes; builders can run the harness and submit powered rows.

Current public artifact

Benchmark snapshot

Official cases: --
Ranked systems: --
Primary metric: Dangerous fab
Data policy: Scores only
Narrative | Fidelity | Fabrication | Leaks

Why this exists

Not a consumer app. Not a model popularity contest.

The website is the public face of an evaluation harness for clinical AI scribes. A visitor should be able to see the benchmark claim, inspect what a dangerous fabrication looks like, run a single-note lab check, and find the exact command for a real PriMock57 benchmark run.

For clinicians and buyers

See why fluent notes are not enough: unsupported CTs, diagnoses, orders, and workups can look polished while changing care.

For builders

Run the scoring harness on your generated notes, then submit aggregate scores without publishing closed-model note text.

For the public record

Separate real powered runs from smoke tests, state the judge model, and make every leaderboard row auditable.

Launch baselines

PriMock57 powered rows, not a current market ranking

These n=57 rows prove the benchmark path and show the failure gradient across the launch baselines. They do not claim that GPT-4o, GPT-4.1, or the older Claude models are what people should use today. Current-model rows should be added through powered PriMock57 runs using the backend commands below.

Raw results JSON
Inspect the ranked powered rows as historical baselines, use the demo to understand dangerous fabrication, then run or submit a current model on PriMock57. The site is useful only when smoke tests, launch baselines, and current powered rows stay separate.
Rank | System | Dataset | n | Dangerous fab | Narrative | Fidelity | Leak | Judge

Synthetic smoke tests

Visible, but not ranked

These rows use the bundled n=3 synthetic cases to exercise the scoring path and demo behavior. A 0% or 33% rate at this size should never outrank a powered benchmark run.
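
The arithmetic behind that rule can be sketched with a confidence interval on a proportion. The Wilson score method, the 95% level, and the name `wilsonInterval` are assumptions made for illustration; the page says leaderboard rows use aggregate confidence intervals but does not say how they are computed.

```typescript
// Sketch only: Wilson score interval for k flagged cases out of n.
// Assumption: ScribeBench may compute its intervals differently.
function wilsonInterval(k: number, n: number, z = 1.96): [number, number] {
  if (n <= 0 || k < 0 || k > n) {
    throw new RangeError("need n > 0 and 0 <= k <= n");
  }
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point rounding at the edges.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

const smoke = wilsonInterval(0, 3);    // zero flagged cases out of 3
const powered = wilsonInterval(0, 57); // zero flagged cases out of 57
```

Under these assumptions, a 0% rate on 3 cases is still compatible with a true rate above 50%, while the same 0% on 57 cases bounds the rate below 7%.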

Status | System | Dataset | n | Dangerous fab | Narrative | Fidelity | Leak | Judge

Live lab

Score a note against a source encounter

Paste a source encounter and generated note, choose a provider/model, and run a fast ScribeBench-style check for narrative quality, fidelity, fabrications, and leaks. The lab is exploratory and single-pass; leaderboard rows use fixed datasets, repeats, and aggregate confidence intervals.

Walk-up demo

Inspect what the benchmark catches

The demo uses the bundled synthetic candidate. One case intentionally inserts a false head CT and syncope workup so the fabrication judge has something to find.

Method

Three checks, one public scoring harness

ScribeBench separates quality from safety. A note can be fluent and complete while still inventing something clinically meaningful, so the fabrication review is run as a separate adversarial judge.

01 Narrative quality

Six physician-style dimensions: story, completeness, flow, artifacts, readability, and input fidelity.

02 Fabrication review

Flags unsupported clinical content and separates dangerous invention from standard charting.
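
As a rough sketch of how such findings might roll up into a headline rate: the `Finding` shape, the severity labels, and the per-case definition of the rate below are all assumptions for illustration, since the page does not publish the judge's output schema or the exact formula behind "Dangerous fab".

```typescript
// Hypothetical shape of one fabrication-review finding. Field names and
// severity labels are illustrative, not ScribeBench's actual schema.
type Severity = "dangerous" | "standard_charting";

interface Finding {
  caseId: string;
  claim: string; // unsupported content flagged in the generated note
  severity: Severity;
}

// Assumed definition: share of cases with at least one dangerous finding.
function dangerousFabRate(findings: Finding[], totalCases: number): number {
  if (totalCases <= 0) throw new RangeError("totalCases must be positive");
  const flaggedCases = new Set(
    findings.filter((f) => f.severity === "dangerous").map((f) => f.caseId),
  );
  return flaggedCases.size / totalCases;
}

const exampleFindings: Finding[] = [
  { caseId: "case-1", claim: "head CT ordered", severity: "dangerous" },
  { caseId: "case-1", claim: "syncope workup", severity: "dangerous" },
  { caseId: "case-2", claim: "patient appears comfortable", severity: "standard_charting" },
];
// Two dangerous findings in the same case count once: one of three cases.
const exampleRate = dangerousFabRate(exampleFindings, 3);
```

Counting cases rather than individual findings keeps one badly fabricated note from dominating the rate, though a per-finding count would be an equally defensible choice.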

03 Leak detection

Deterministic scan for raw template placeholders and internal metadata tokens.
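
A deterministic scan of this kind can be sketched with a fixed list of regular expressions. The three patterns and the names `LEAK_PATTERNS` and `scanForLeaks` are hypothetical; the real ScribeBench pattern list is not shown on this page.

```typescript
// Hypothetical leak patterns, chosen only to illustrate the technique.
const LEAK_PATTERNS: { name: string; regex: RegExp }[] = [
  { name: "mustache placeholder", regex: /\{\{[^{}]*\}\}/g },
  { name: "bracket placeholder", regex: /\[(?:INSERT|TODO|PLACEHOLDER)[^\]]*\]/gi },
  { name: "metadata token", regex: /<\|[^|>]*\|>/g },
];

interface Leak {
  name: string;  // which pattern matched
  match: string; // the leaked text
  index: number; // character offset in the note
}

// Same note in, same leaks out: no model call, no randomness.
function scanForLeaks(note: string): Leak[] {
  const leaks: Leak[] = [];
  for (const { name, regex } of LEAK_PATTERNS) {
    for (const m of note.matchAll(regex)) {
      leaks.push({ name, match: m[0], index: m.index ?? 0 });
    }
  }
  return leaks.sort((a, b) => a.index - b.index);
}

const leakyNote = "Patient seen for {{chief_complaint}}. <|meta:run_7|> Plan: rest.";
const foundLeaks = scanForLeaks(leakyNote);
```

Because the check is pattern-based rather than judged by a model, repeated runs over the same note give identical results.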

Use it

Run smoke tests, then a powered benchmark

The 3-case synthetic set is only a plumbing check. A public leaderboard row should come from the powered PriMock57 run with aggregate scores only.

Powered leaderboard run

npm install

export SCRIBEBENCH_BACKEND=baseten
export BASETEN_API_KEY=...
export SCRIBEBENCH_JUDGE_MODEL=deepseek-ai/DeepSeek-V4-Pro

npx tsx eval/run_benchmark.ts \
  --dataset data/primock57/cases \
  --candidate your_primock57_notes.json \
  --system "your-system" \
  --repeats 2 \
  --out leaderboard/_pending.json
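
Before starting a long powered run, it can help to confirm the three variables exported above are actually set. The helper below is a hypothetical pre-flight check, not part of the ScribeBench harness, and it assumes the baseten backend shown here; other backends may need different variables.

```typescript
// Variable names are taken from the export lines above. The check itself
// is an illustrative helper and is not shipped with ScribeBench.
const REQUIRED_ENV = [
  "SCRIBEBENCH_BACKEND",
  "BASETEN_API_KEY",
  "SCRIBEBENCH_JUDGE_MODEL",
] as const;

// Returns the names that are unset, empty, or whitespace-only.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]?.trim());
}

const exampleMissing = missingEnv({
  SCRIBEBENCH_BACKEND: "baseten",
  SCRIBEBENCH_JUDGE_MODEL: "deepseek-ai/DeepSeek-V4-Pro",
});
// exampleMissing lists only BASETEN_API_KEY
```

In a Node script, passing `process.env` to `missingEnv` and exiting when the result is non-empty avoids discovering a missing key partway through the run.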

Smoke test only

npx tsx eval/run_benchmark.ts \
  --dataset data/synthetic/cases \
  --candidate data/synthetic/example_candidate.json \
  --system "example-baseline" \
  --out leaderboard/_pending.json