For clinicians and buyers
See why fluent notes are not enough: unsupported CTs, diagnoses, orders, and workups can look polished while changing care.
Released June 2026
ScribeBench exists to answer one practical question: did the generated note preserve the source encounter, or did it invent care that never happened? Clinicians can inspect failure modes; builders can run the harness and submit powered rows.
Current public artifact
Why this exists
The website is the public face of an evaluation harness for clinical AI scribes. A visitor should be able to see the benchmark claim, inspect what a dangerous fabrication looks like, run a single-note lab check, and find the exact command for a real PriMock57 benchmark run.
Run the scoring harness on your generated notes, then submit aggregate scores without publishing closed-model note text.
Separate real powered runs from smoke tests, state the judge model, and make every leaderboard row auditable.
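One way to keep powered runs and smoke tests apart, and to reject rows that do not state their judge model, is sketched below. This is only an illustration: the `LeaderboardRow` shape, its field names, the `splitRows` helper, and the `MIN_POWERED_N` cutoff are assumptions that mirror the leaderboard columns, not the actual ScribeBench schema.

```typescript
// Hypothetical row shape mirroring the leaderboard columns; field names are assumptions.
interface LeaderboardRow {
  system: string;
  dataset: string;
  n: number;                // number of cases scored
  dangerousFabRate: number; // fraction of notes with a dangerous fabrication
  judge: string;            // judge model used for scoring
}

// Assumed cutoff: anything smaller than the 57-case PriMock57 set counts as a smoke test.
const MIN_POWERED_N = 57;

function splitRows(rows: LeaderboardRow[]): {
  powered: LeaderboardRow[];
  smoke: LeaderboardRow[];
} {
  const powered: LeaderboardRow[] = [];
  const smoke: LeaderboardRow[] = [];
  for (const row of rows) {
    // A row that does not name its judge model cannot be audited, so reject it.
    if (row.judge.trim() === "") {
      throw new Error(`row "${row.system}" does not state its judge model`);
    }
    (row.n >= MIN_POWERED_N ? powered : smoke).push(row);
  }
  return { powered, smoke };
}
```

Keeping the two groups in separate arrays means a smoke-test row can never be sorted into the ranked table by accident.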
Launch baselines
These n=57 rows demonstrate the benchmark pipeline end to end and show the failure gradient across the launch baselines. They are not meant to claim that GPT-4o, GPT-4.1, or the older Claude models are the ones people should use today. Rows for current models should be added through powered PriMock57 runs using the backend commands below.
| Rank | System | Dataset | n | Dangerous fab | Narrative | Fidelity | Leak | Judge |
|---|---|---|---|---|---|---|---|---|
Synthetic smoke tests
These rows use the bundled n=3 synthetic cases to exercise the scoring path and demo behavior. With only three cases, a single note moves the rate by 33 percentage points, so a 0% or 33% rate at this size should never outrank a powered benchmark run.
| Status | System | Dataset | n | Dangerous fab | Narrative | Fidelity | Leak | Judge |
|---|---|---|---|---|---|---|---|---|
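To see why an n=3 rate carries so little weight, it helps to put a confidence interval around it. The sketch below uses the Wilson score interval, a standard interval for binomial proportions. It is an assumption that this resembles how ScribeBench computes its aggregate intervals, and `wilsonInterval` is a hypothetical helper, not part of the harness.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95% coverage).
// Hypothetical helper for illustration; not a ScribeBench function.
function wilsonInterval(events: number, n: number, z = 1.96): [number, number] {
  if (n <= 0 || events < 0 || events > n) {
    throw new RangeError("need n > 0 and 0 <= events <= n");
  }
  const p = events / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  // Clamp to [0, 1] to absorb floating-point drift at the boundaries.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Zero observed fabrications at two sample sizes.
const [, upperSmoke] = wilsonInterval(0, 3);
const [, upperPowered] = wilsonInterval(0, 57);
console.log(`0/3  upper bound: ${upperSmoke.toFixed(2)}`);
console.log(`0/57 upper bound: ${upperPowered.toFixed(2)}`);
```

Under these assumptions, zero fabrications in 3 cases is still compatible with a true rate of roughly 0.56, while zero in 57 cases bounds it at roughly 0.06. A clean smoke test therefore says almost nothing about safety.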
Live lab
Paste a source encounter and generated note, choose a provider/model, and run a fast ScribeBench-style check for narrative quality, fidelity, fabrications, and leaks. The lab is exploratory and single-pass; leaderboard rows use fixed datasets, repeats, and aggregate confidence intervals.
Walk-up demo
The demo uses the bundled synthetic candidate. One case intentionally inserts a false head CT and syncope workup so the fabrication judge has something to find.
Method
ScribeBench separates quality from safety. A note can be fluent and complete while still inventing something clinically meaningful, so the fabrication review is run by a separate adversarial judge.
The narrative review scores six physician-style dimensions: story, completeness, flow, artifacts, readability, and input fidelity.
The fabrication judge flags unsupported clinical content and separates dangerous invention from standard charting.
The leak check is a deterministic scan for raw template placeholders and internal metadata tokens.
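A deterministic leak scan of this kind can be as simple as a fixed list of regular expressions run over the note text. The sketch below is a minimal illustration: the three patterns, the `LEAK_PATTERNS` table, and the `scanForLeaks` helper are all assumptions, and the actual ScribeBench rules may differ.

```typescript
// Hypothetical leak patterns; the real ScribeBench scanner's rules are not shown here.
const LEAK_PATTERNS: { name: string; pattern: RegExp }[] = [
  // Unfilled mustache-style template slots, e.g. {{patient_name}}
  { name: "mustache placeholder", pattern: /\{\{[^{}]*\}\}/g },
  // Upper-case bracket slots with an underscore, e.g. [PATIENT_NAME]
  { name: "bracket placeholder", pattern: /\[[A-Z]+(?:_[A-Z]+)+\]/g },
  // Chat-template style control tokens, e.g. <|end_of_note|>
  { name: "metadata token", pattern: /<\|[a-z_]+\|>/g },
];

interface LeakHit {
  name: string;  // which pattern fired
  match: string; // the leaked text
  index: number; // character offset in the note
}

function scanForLeaks(note: string): LeakHit[] {
  const hits: LeakHit[] = [];
  for (const { name, pattern } of LEAK_PATTERNS) {
    // matchAll requires a global regex; every pattern above carries the g flag.
    for (const m of note.matchAll(pattern)) {
      hits.push({ name, match: m[0], index: m.index ?? 0 });
    }
  }
  // Report hits in the order they appear in the note.
  return hits.sort((a, b) => a.index - b.index);
}
```

Because the scan involves no model call, the same note always yields the same hits, which makes a leak column cheap to reproduce and audit.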
Use it
The 3-case synthetic set is only a plumbing check. A public leaderboard row should come from the powered PriMock57 run with aggregate scores only.
# Install dependencies and configure the judge backend
npm install
export SCRIBEBENCH_BACKEND=baseten
export BASETEN_API_KEY=...
export SCRIBEBENCH_JUDGE_MODEL=deepseek-ai/DeepSeek-V4-Pro

# Powered PriMock57 run (use this for a public leaderboard row)
npx tsx eval/run_benchmark.ts \
  --dataset data/primock57/cases \
  --candidate your_primock57_notes.json \
  --system "your-system" \
  --repeats 2 \
  --out leaderboard/_pending.json

# Synthetic smoke test (plumbing check only)
npx tsx eval/run_benchmark.ts \
  --dataset data/synthetic/cases \
  --candidate data/synthetic/example_candidate.json \
  --system "example-baseline" \
  --out leaderboard/_pending.json