Released June 2026

A public audit bench for AI scribe fidelity.

ScribeBench exists to answer one practical question: did the generated note preserve the source encounter, or did it invent care that never happened? Clinicians can inspect failure modes; builders can run the harness and submit rows from powered benchmark runs.


Current public artifact

Benchmark snapshot

Official cases: --
Ranked systems: --
Primary metric: Dangerous fabrication rate ("Dangerous fab")
Data policy: Scores only

Narrative | Fidelity | Fabrication | Leaks

Why this exists

Not a consumer app. Not a model popularity contest.

The website is the public face of an evaluation harness for clinical AI scribes. A visitor should be able to see the benchmark claim, inspect what a dangerous fabrication looks like, run a single-note lab check, and find the exact command for a real PriMock57 benchmark run.

For clinicians and buyers

See why fluent notes are not enough: unsupported CTs, diagnoses, orders, and workups can look polished while changing care.

For builders

Run the scoring harness on your generated notes, then submit aggregate scores without publishing closed-model note text.

For the public record

Separate real powered runs from smoke tests, state the judge model, and make every leaderboard row auditable.

Launch baselines

PriMock57 powered rows, not a current market ranking

These n=57 rows prove the benchmark path and show the failure gradient across the launch baselines. They do not claim that GPT-4o, GPT-4.1, or the older Claude models are what people should use today. Add current-model rows through powered PriMock57 runs using the backend commands below.

Raw results JSON

What visitors should do here: inspect the ranked powered rows as historical baselines, use the demo to understand dangerous fabrication, then run or submit a current model on PriMock57. The site is useful only when smoke tests, launch baselines, and current powered rows stay separate.

Rank	System	Dataset	n	Dangerous fab	Narrative	Fidelity	Leak	Judge

Synthetic smoke tests

Visible, but not ranked

These rows use the bundled n=3 synthetic cases to exercise the scoring path and demo behavior. At this size a single case moves the rate by 33 points, so a 0% or 33% rate should never outrank a powered benchmark run.
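To make the sample-size point concrete, here is a minimal sketch that compares 95% intervals for a smoke-test rate and a powered rate. It uses the Wilson score interval as an assumption; the text does not say which interval method the leaderboard uses, and the 3-of-57 count is a hypothetical input, not a reported result.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
// Assumption: the harness's actual interval method is not stated in the text.
function wilsonInterval(
  events: number,
  n: number,
  z = 1.96
): { low: number; high: number } {
  if (n <= 0) throw new Error("n must be positive");
  const p = events / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return {
    low: Math.max(0, center - half),
    high: Math.min(1, center + half),
  };
}

// 0 of 3 synthetic cases flagged (a "0%" smoke-test row).
const smoke = wilsonInterval(0, 3);
// Hypothetical powered row: 3 of 57 cases flagged.
const powered = wilsonInterval(3, 57);

console.log("smoke n=3:", smoke);
console.log("powered n=57:", powered);
```

Under these assumptions the n=3 interval still reaches above 50% even with zero flagged cases, while the n=57 interval is far narrower, which is why the two kinds of rows are listed separately.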

Status	System	Dataset	n	Dangerous fab	Narrative	Fidelity	Leak	Judge

Live lab

Score a note against a source encounter

Paste a source encounter and generated note, choose a provider/model, and run a fast ScribeBench-style check for narrative quality, fidelity, fabrications, and leaks. The lab is exploratory and single-pass; leaderboard rows use fixed datasets, repeats, and aggregate confidence intervals.

Waiting for a note

The result will show a normalized narrative score, six dimensions, dangerous fabrication items, standard items, deterministic leaks, and model usage.

Walk-up demo

Inspect what the benchmark catches

The demo uses the bundled synthetic candidate. One case intentionally inserts a false head CT and syncope workup so the fabrication judge has something to find.

Source encounter

Candidate note

Why this matters

Method

Three checks, one public scoring harness

ScribeBench separates quality from safety. A note can be fluent and complete while still inventing something clinically meaningful, so the fabrication review is run as a separate adversarial judge.

Narrative quality

Six physician-style dimensions: story, completeness, flow, artifacts, readability, and input fidelity.
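As an illustration of how six dimension scores could roll up into one normalized narrative score, here is a small sketch. The 1 to 5 scale, the equal weighting, and the names `DimensionScores` and `normalizedNarrativeScore` are assumptions for the example; the text does not specify the harness's scale or weighting.

```typescript
// The six dimensions named in the text.
const DIMENSIONS = [
  "story",
  "completeness",
  "flow",
  "artifacts",
  "readability",
  "inputFidelity",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type DimensionScores = Record<Dimension, number>;

// Assumption: each dimension is scored from 1 to maxScore and weighted equally.
// Returns a value in [0, 1].
function normalizedNarrativeScore(
  scores: DimensionScores,
  maxScore = 5
): number {
  let total = 0;
  for (const d of DIMENSIONS) {
    const s = scores[d];
    if (s < 1 || s > maxScore) {
      throw new Error(`score for ${d} is out of range`);
    }
    total += (s - 1) / (maxScore - 1);
  }
  return total / DIMENSIONS.length;
}

// Hypothetical scores for a fluent note with weak input fidelity.
const exampleScores: DimensionScores = {
  story: 5,
  completeness: 4,
  flow: 4,
  artifacts: 5,
  readability: 5,
  inputFidelity: 3,
};

const narrative = normalizedNarrativeScore(exampleScores);
console.log(narrative.toFixed(3));
```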

Fabrication review

Flags unsupported clinical content and separates dangerous invention from standard charting.
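A rough sketch of how flagged items might roll up into a dangerous fabrication rate follows. It assumes the rate is the fraction of cases with at least one dangerous item, which is consistent with the 0% or 33% rates quoted for n=3 but is not confirmed by the text. The type names and the sample items are hypothetical.

```typescript
type Severity = "dangerous" | "standard";

interface FlaggedItem {
  text: string;
  severity: Severity;
}

interface CaseReview {
  caseId: string;
  items: FlaggedItem[];
}

// Assumption: a case counts once if it contains any dangerous fabrication.
function dangerousFabRate(reviews: CaseReview[]): number {
  if (reviews.length === 0) return 0;
  const flaggedCases = reviews.filter((r) =>
    r.items.some((item) => item.severity === "dangerous")
  ).length;
  return flaggedCases / reviews.length;
}

// Hypothetical reviews shaped like the 3-case synthetic set.
const reviews: CaseReview[] = [
  { caseId: "synthetic-1", items: [] },
  {
    caseId: "synthetic-2",
    items: [{ text: "Reviewed medication list", severity: "standard" }],
  },
  {
    caseId: "synthetic-3",
    items: [
      { text: "Head CT ordered", severity: "dangerous" },
      { text: "Syncope workup initiated", severity: "dangerous" },
    ],
  },
];

const rate = dangerousFabRate(reviews);
console.log(`${(rate * 100).toFixed(0)}%`); // prints "33%"
```

Counting cases rather than items keeps one heavily fabricated note from dominating the rate, though an item-level count would be an equally reasonable design.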

Leak detection

Deterministic scan for raw template placeholders and internal metadata tokens.
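A deterministic leak scan can be as simple as a fixed list of regular expressions run over the note text. The sketch below is illustrative only: the pattern list, the `Leak` shape, and `findLeaks` are hypothetical, and the harness's real patterns are not given in the text.

```typescript
// Hypothetical leak patterns; the harness's actual list is not stated.
const LEAK_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "mustache_placeholder", pattern: /\{\{\s*[\w.]+\s*\}\}/g },
  {
    name: "bracket_placeholder",
    pattern: /\[(?:INSERT|TODO|PLACEHOLDER)[^\]]*\]/gi,
  },
  { name: "special_token", pattern: /<\|[^|>]+\|>/g },
];

interface Leak {
  name: string;
  match: string;
  index: number;
}

// Returns every pattern hit, ordered by position in the note.
function findLeaks(note: string): Leak[] {
  const leaks: Leak[] = [];
  for (const { name, pattern } of LEAK_PATTERNS) {
    // Fresh copy so the global regex always starts scanning at index 0.
    const re = new RegExp(pattern.source, pattern.flags);
    let m: RegExpExecArray | null;
    while ((m = re.exec(note)) !== null) {
      leaks.push({ name, match: m[0], index: m.index });
    }
  }
  return leaks.sort((a, b) => a.index - b.index);
}

const sampleNote =
  "Patient {{patient_name}} reports headache. <|end_of_note|>";
const leaks = findLeaks(sampleNote);
console.log(leaks.map((l) => `${l.name}: ${l.match}`));
```

Because the scan uses no model calls, the same note always yields the same leaks, which is what makes this check cheap to rerun and easy to audit.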

Use it

Run smoke tests, then a powered benchmark

The 3-case synthetic set is only a plumbing check. A public leaderboard row should come from the powered PriMock57 run with aggregate scores only.

Powered leaderboard run

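# Install dependencies
npm install

# Configure the model backend, its API key, and the judge model
export SCRIBEBENCH_BACKEND=baseten
export BASETEN_API_KEY=...
export SCRIBEBENCH_JUDGE_MODEL=deepseek-ai/DeepSeek-V4-Pro

# Score your PriMock57 notes with two repeats; output goes to leaderboard/_pending.json
npx tsx eval/run_benchmark.ts \
  --dataset data/primock57/cases \
  --candidate your_primock57_notes.json \
  --system "your-system" \
  --repeats 2 \
  --out leaderboard/_pending.json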

Smoke test only

# Score the bundled example candidate against the n=3 synthetic cases
npx tsx eval/run_benchmark.ts \
  --dataset data/synthetic/cases \
  --candidate data/synthetic/example_candidate.json \
  --system "example-baseline" \
  --out leaderboard/_pending.json

Submission guide | Model backends | Methodology | Fabrication taxonomy
