
Evaluations

How we measure quality across the engine and keep it from regressing as prompts evolve.

State of automation, by surface:

| Surface | Automated? | Where it lives |
|---|---|---|
| Entity extraction (NER) | Yes — eval harness | extractor/eval.py + scripts/eval_entity_extractor.py |
| Outcomes extraction (structured tuples) | Yes — eval harness | extractor/outcomes_eval.py + scripts/eval_outcomes_extractor.py |
| KIQ-answer schema | Yes — runtime gate | engine/verification.py — runs inside the synthesizer retry loop |
| Briefing-level quality (framing, specificity, calibration) | No | manual checklist below |

The briefing-level harness is still aspirational. The sections below cover (a) what we manually check before shipping a synthesizer prompt change, (b) how the entity / outcomes eval harnesses are structured (so the briefing-level one can mirror them), and (c) the proposed path to automation.


What "good" means for a briefing

Four axes, in decreasing order of importance:

1. Factual grounding (hard constraint)

Every claim in the executive summary, signal analyses, and strategic implications must map to a specific Signal in the DB. If the briefing says "BMS terminated MRTX0902", the DB must contain an NCT-linked Signal with that event. Hallucinated drugs, trials, or events are the only inexcusable failure mode.

Check: grep the briefing text for drug names, NCT numbers, PMIDs — each must resolve in Signal.raw_data or signal_analyses[].signal_id.
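The grep check can be scripted. A minimal sketch, assuming NCT numbers and PMIDs follow their standard formats; the `Signal.raw_data` lookup is stubbed as a plain set of known references here, since the DB layer isn't shown:

```python
import re

def extract_trial_refs(briefing_text: str) -> set[str]:
    """Pull NCT numbers and PMIDs out of briefing prose."""
    ncts = set(re.findall(r"NCT\d{8}", briefing_text))
    pmids = set(re.findall(r"PMID:?\s*(\d{7,8})", briefing_text))
    return ncts | {f"PMID{p}" for p in pmids}

def unresolved_refs(briefing_text: str, known_refs: set[str]) -> set[str]:
    """Refs mentioned in the briefing but absent from the Signal DB."""
    return extract_trial_refs(briefing_text) - known_refs
```

Any non-empty result from `unresolved_refs` is a hard failure under axis 1.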

2. Strategic framing

Does the briefing lead with what changed and what it means (good), or does it rank raw events chronologically (bad)? Does it connect signals across sources — e.g. "phase transition + new publication + regulatory event tell the same story"?

The synthesizer prompt (synthesizer.py:19) explicitly demands this framing, but LLMs sometimes drop it under heavy signal load.

3. Specificity

Does it name: drug generic names, companies, NCT numbers, target/MoA, quantified timeline impact ("narrows first-mover window by 6 months")? Vague strategic narratives are low-value.

4. Confidence calibration

Every signal_analyses[] entry has a confidence field (high | medium | low). An FDA approval should be high; a phase transition inferred from a trial registration should be medium or low. Mis-calibration (everything high) is a common failure mode when the prompt lacks examples.
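A quick distribution sanity check can be scripted; the 0.8 "too many highs" threshold below is an arbitrary illustration, not a project constant:

```python
from collections import Counter

def confidence_distribution(signal_analyses: list[dict]) -> Counter:
    """Tally the high/medium/low spread across a briefing's analyses."""
    return Counter(a["confidence"] for a in signal_analyses)

def looks_miscalibrated(signal_analyses: list[dict],
                        max_high_share: float = 0.8) -> bool:
    """Flag the 'everything is high' failure mode."""
    dist = confidence_distribution(signal_analyses)
    total = sum(dist.values())
    return total > 0 and dist.get("high", 0) / total > max_high_share
```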


The reference golden briefing

examples/nsclc_briefing_2026-04-03.json is our current reference output — generated against a real NSCLC landscape snapshot on 2026-04-03.

It includes the BMS MRTX0902 / CC-90003 co-termination analysis that shows the engine working as intended:

  • Two independent trial-termination signals (from ClinicalTrials.gov) with different NCT numbers
  • The synthesizer detected they were co-terminated within the same briefing cycle
  • It connected them to a broader narrative (BMS exit from internal KRAS combination development)
  • It surfaced the forward implication: Eli Lilly's LY3537982 benefits from reduced competition

How we use this today: as a read-through reference when editing the synthesizer prompt. If we make a change and a fresh briefing from the same DB state loses the cross-source connection, we know the change was a regression.

How we should use it (not built yet): as a fixture for a pytest assertion — run the synthesizer against a frozen DB snapshot, compare the new output's signal_analyses[].signal_id list to the golden set, fail if required signals are missing.


Cost-per-run budget

Per briefing cycle on NSCLC-sized signal volume (~1,500 signals, ~20 after classification):

| Stage | Model | Tokens in | Tokens out | ~USD |
|---|---|---|---|---|
| Classifier (batched) | Haiku | ~30 k | ~500 | $0.002 |
| Synthesizer | Sonnet | ~40 k | ~6 k | $0.05–0.20 |
| Per-tab analyzer × 3 | Haiku | ~15 k each | ~1 k each | $0.01 each |
| QueryEngine (per ad-hoc question) | Haiku | ~15 k | ~500 | $0.001 |

Signs of drift to watch for:

  • Classifier token-in spike ⇒ signal volume exploded (check a source for a new bug producing duplicates).
  • Synthesizer token-out spike ⇒ response no longer JSON-structured; likely the model started emitting markdown fences (see _strip_fences in synthesizer.py).

No automated budget-enforcement exists. Prompt changes should include a manual token-count diff.
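Until enforcement exists, the token-count diff could itself be scripted. A sketch; the per-million-token rates are placeholders, not real pricing, and the stage/tuple shape is an assumption:

```python
# Hypothetical (input, output) USD rates per 1M tokens — substitute current pricing.
RATES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00)}

def stage_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Estimated USD for one pipeline stage."""
    rate_in, rate_out = RATES[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

def cost_diff(before: dict, after: dict) -> dict:
    """Per-stage USD delta between two runs; stages map to (model, in, out)."""
    return {stage: round(stage_cost(*after[stage]) - stage_cost(*before[stage]), 4)
            for stage in before}
```

A prompt change whose diff shows a positive synthesizer delta would warrant a closer look before shipping.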


Manual check before shipping a prompt change

  1. Run the full pipeline against the current DB: make briefing.
  2. diff the new latest_briefing_nsclc_001.json against the previous one.
  3. Look specifically at:
    • executive_summary — still leads with what changed?
    • signal_analyses[].signal_id — still references real IDs?
    • signal_analyses[].confidence — distribution across high/medium/low looks sane?
    • watchlist[] — still forward-looking, not retrospective?
    • strategic_implications — 3+ paragraphs, or did it collapse to bullets?
  4. Spot-check 3 signals against the DB with the query engine:
    make query Q="Is NCT12345678 really terminated, per the data?"
    
  5. Pull the HTML render: make briefing-html and eyeball it end-to-end.

This is manual. It's fine for a small team. It won't scale.


Eval harnesses that already exist

Entity extraction (NER)

The entity extractor (entity_extractor.py) is graded against a 25-sample gold set derived from the BIOPSY benchmark (see research.md).

uv run python scripts/eval_entity_extractor.py

Reports per-label precision / recall / F1 plus a side-by-side diff between GLiNER and the Claude fallback. Gold set + golden output committed under evals/extraction/. Current baseline: GLiNER F1 ≈ 0.88, Claude F1 ≈ 0.91 (Claude wins on rare biomarker compounds, GLiNER wins on speed).
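For reference, per-label precision / recall / F1 over exact-match (label, text) pairs can be computed as below — a sketch of the scoring idea, not the actual harness code:

```python
from collections import defaultdict

def per_label_f1(gold: list[tuple], pred: list[tuple]) -> dict:
    """Score entities as (label, text) pairs with exact matching."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for label, text in pred_set:
        (tp if (label, text) in gold_set else fp)[label] += 1
    for label, _text in gold_set - pred_set:
        fn[label] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

Exact matching is deliberately strict; partial-span credit is a separate design decision.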

Outcomes extraction (structured tuples)

outcomes_eval.py compares extractor output against a 25-sample gold set of (endpoint, arm, value, unit, CI, HR, p, comparator) tuples drawn from real ASCO / ESMO / NEJM abstracts.

uv run python scripts/eval_outcomes_extractor.py

Scoring is field-aware: (primary_endpoint, arm, value) are required; (unit, CI, HR, p, comparator) are partial credit. The harness emits both an aggregate F1 and a per-tuple diff so prompt iterations are auditable.
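The partial-credit idea can be sketched per tuple as follows; the 0.1-per-field weight is illustrative — the real weights live in outcomes_eval.py:

```python
REQUIRED = ("primary_endpoint", "arm", "value")
PARTIAL = ("unit", "ci", "hr", "p", "comparator")

def tuple_score(gold: dict, pred: dict) -> float:
    """0.0 if any required field is wrong; otherwise start at 1.0 and
    deduct for each gold partial-credit field the prediction missed."""
    if any(pred.get(f) != gold.get(f) for f in REQUIRED):
        return 0.0
    missed = sum(1 for f in PARTIAL
                 if gold.get(f) is not None and pred.get(f) != gold.get(f))
    return max(0.0, 1.0 - 0.1 * missed)
```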

KIQ-answer schema (runtime gate, not a harness)

validate_kiq_answers is not an eval — it's a deterministic shape-and-confidence-enum check that runs inside the synthesizer retry loop. If the synthesizer emits malformed KIQ answers (missing field, wrong confidence value, undersized prose), the validator's error messages are appended to the next prompt and the synthesizer retries up to MAX_SCHEMA_RETRIES times. Pass/fail state persists on the Briefing row as schema_valid + schema_errors so the API can surface gate failures to operators. See architecture.md §4.8.
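The gate's control flow, sketched with a stubbed LLM call and validator — only the name MAX_SCHEMA_RETRIES comes from the source; its value and the function names here are assumptions:

```python
MAX_SCHEMA_RETRIES = 2  # assumed value

def synthesize_with_gate(prompt: str, call_llm, validate):
    """Returns (briefing, schema_valid, schema_errors) — the triple
    the source says is persisted on the Briefing row."""
    errors: list[str] = []
    briefing = None
    for _ in range(MAX_SCHEMA_RETRIES + 1):
        if errors:
            # Append the validator's messages to the next attempt's prompt.
            prompt = prompt + "\n\nFix these schema errors:\n" + "\n".join(errors)
        briefing = call_llm(prompt)
        errors = validate(briefing)
        if not errors:
            return briefing, True, []
    return briefing, False, errors
```

Being deterministic, the gate costs nothing extra on the happy path and only burns retries when the model actually misbehaves.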


Proposed automation (briefing-level, not built)

Regression harness

Store a frozen DB snapshot (e.g. tests/fixtures/golden_nsclc.db). A new test module tests/eval/test_briefing_regression.py:

import pytest

# `settings` and `run_briefing_pipeline` are project-local imports,
# elided here because the module layout isn't fixed yet.

@pytest.mark.skipif(not settings.anthropic_api_key, reason="requires real LLM")
def test_briefing_contains_required_cross_source_connections():
    # Run the synthesizer against the frozen DB snapshot
    briefing = run_briefing_pipeline("nsclc-001")
    # Required signal IDs from the golden set
    required = {"mrtx0902-terminated", "cc-90003-terminated"}
    actual = {a["signal_id"] for a in briefing.signal_analyses}
    assert required.issubset(actual)

This would be an integration test (live LLM call, not part of make test).

LLM-as-judge

For the narrative-quality axes (strategic framing, specificity), send a grading prompt to Opus that returns a 1–5 score with rationale. Aggregate across N runs to get stability estimates.

The risk is judge drift — Opus's preferences change between model versions. Pin the judge model version and re-calibrate on model upgrades.
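Aggregating judge scores across N runs is straightforward; the instability threshold below is a guess, to be calibrated once real score distributions exist:

```python
from statistics import mean, stdev

def aggregate_judge_scores(scores: list[int]) -> dict:
    """Stability estimate across N judged runs of the same briefing."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "n": len(scores),
        "mean": mean(scores),
        "stdev": spread,
        "unstable": spread > 1.0,  # threshold is a guess
    }
```

A high stdev on an unchanged prompt would point at judge (or synthesizer) nondeterminism rather than a real quality change.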

Dataset augmentation

We need more golden landscapes than just NSCLC. Candidates:

  • Immunology — already seeded (immunology-001), no golden briefing yet.
  • Oncology — KRAS subtype — spin-off from NSCLC.
  • Rare disease — test the "sparse signals" edge case.

What we don't evaluate

  • Signal coverage — we don't measure whether sources surfaced every relevant event. Doing so would require a ground-truth pipeline database (e.g. Citeline) to compare against.
  • Latency — the pipeline runs in ~60 s end-to-end and nobody has complained yet.
  • User-reported quality — no feedback loop from the frontend to the engine. Phase 3 full ("client memory — per-landscape alert thresholds, BD context") will be the right surface for this.

Prior art we're watching

  • BIOPSY (Kognitic / EMNLP 2025) — a named-entity-recognition benchmark for pharma pipeline data. Relevant for grading source ingestion accuracy. See research.md.
  • G-Eval / Bespoke pharma benchmarks — not yet surveyed rigorously. Open question: does anyone else in pharma CI publish briefing-quality benchmarks? (As of writing, no public ones found.)