ADR-0004: Classifier Phase 3 (graph boost) and Phase 4 (eval framework)¶
Status: Proposed
Date: 2026-04-29
Driver: Khalil
Related: PR #82 (Phase 3 draft), PR #84 (Exhibit 99.1 fetch — merged), PR #86 (Deals-tab SEC wiring — merged), Phases 0–2 of the classifier rework (PRs #79, #80, #81 — merged).
Context¶
The classifier rework split into five phases (per the build plan in the parent thread). Phases 0–2 are merged on main and behaving as designed in production:
- Phase 0 (#79) — hard-fail on total LLM-classifier failure instead of silent severity-only fallback.
- Phase 1 (#80) — deterministic rule tier in front of the LLM. Live ratio on immunology-001: 50.3% of signals rule-scored (target was 60–80%).
- Phase 2 (#81) — annotates DetectedChange.context with linked_drugs and matched_kiq_ids, richer LLM input lines, chunk size 50→20.
Phase 3 (graph-derived boost) and Phase 4 (eval framework) remain. Live measurements during the close-out of Phases 1 and 2 surfaced concrete reasons to revise the original Phase 3 design before merging it, and to scope Phase 4 around the gaps Phases 1 and 3 cannot self-validate. This ADR records those decisions so the work can resume cleanly without re-tracing the investigation.
Findings that motivate this ADR¶
Two live runs against immunology-001 (7,536 signals, 28 candidate drugs) produced the findings below.
Phase 1 rule-tier ratio is below target — and that's fine¶
total signals : 7,536
rule_scored : 3,789 (50.3%) ← below the 60–80% plan target
llm_residual : 3,747 (49.7%)
The gap is concentrated in TRIAL_REGISTERED + Phase 2/Unknown (2,546 signals deferring to LLM). The plan's 60–80% target was probably calibrated on an oncology-style corpus where Phase 1 trial registrations dominate; immunology has a different mix. Tuning the rule weights without labels would move the ratio without us knowing whether precision held — a Phase 4 question.
Phase 3 boost rules don't fire on real data — except recency¶
After PR #84 (Exhibit 99.1 fetch) the SignalDrug graph went from 0 SEC→dupilumab edges to 29. Even with that lift, the boost distribution on immunology-001 looks like:
boost = +0 : 6,948 (92.2%)
boost = +1 : 586 ( 7.8%) ← almost all from recency
boost = +2 : 2
boost = +3 (capped) : 0
Per-rule eligibility breakdown on the post-#84 corpus:
| rule | lift | live behaviour |
|---|---|---|
| Density (≥3 candidate drugs linked) | +2 | 0 signals corpus-wide |
| KIQ alignment (operational KIQ mentions linked candidate) | +2 | depends on KIQ count; only 2 active KIQs in immunology landscape today |
| Franchise source ({sec, lens} + ≥1 candidate link) | +1 | 38 SEC signals eligible (was 8 pre-#84) |
| Recency (event_date within 30 days) | +1 | ~588 signals eligible — does ~80% of all non-zero boost work |
In practice, 8-K press releases are single-product: a Sanofi 8-K names Dupixent, a J&J 8-K names Stelara. The original density rule's mental model — "an SEC filing references the full pipeline" — does not match how 8-K Exhibit 99.1 attachments are actually scoped. Lowering the threshold to ≥2 helps marginally (one extra signal). The rule needs a different definition or removal.
The Deals tab regression caught a similar issue¶
While verifying Phase 3 we discovered the asset-page Deals tab filtered on isDealType() only, dropping 69% of an asset's SEC 8-Ks (the press_release and leadership_change classifications the SEC source emits when Item codes don't pin to a specific deal type). PR #86 fixed it by adding a source-conditioned predicate. This is the same shape of problem as the density rule: plan-level intuitions about how SEC data is structured drifted from the actual data shape; only a live run revealed it.
This ADR therefore commits to making the Phase 4 eval framework cover enough of the pipeline that the next "plan vs reality" gap is caught at landing, not in the field.
Decision¶
Phase 3 — proceed with surgery, not the original spec¶
Merge a revised Phase 3 that makes three changes relative to PR #82 as currently drafted:
- Drop the density rule. It fires on 0 signals on real data and — given the single-product-press-release pattern — won't fire meaningfully on most landscapes. Removing it shrinks the boost surface to three rules.
- Replace it with a target/MoA cluster overlap rule. A signal that mentions or is linked to ≥2 drugs sharing a tracked target (e.g. multiple IL-13 antagonists) earns +2. This captures the strategic relevance the original density rule was reaching for — "this signal touches several competitors at once" — but does so at the target level, where the data actually clusters.
- Lower the recency window to 14 days and cut the lift to +0.5. Currently recency is +1 for anything within 30 days, doing 80% of the boost's work. Halving the lift and tightening the window keeps recency as a tiebreaker instead of the dominant signal.
Final boost rule set after revision:
| rule | lift | precondition |
|---|---|---|
| Target/MoA cluster | +2 | linked drugs share a tracked target with ≥2 other candidates |
| KIQ alignment | +2 | linked candidate drug mentioned by an active OPERATIONAL-horizon KIQ |
| Franchise source | +1 | source ∈ {sec, lens} AND ≥1 linked drug is a candidate |
| Recency | +0.5 (rounded) | event_date within 14 days |
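For concreteness, a minimal sketch of how the revised lifts might stack under the existing +3 cap (the cap itself is revisited under Open questions); the boolean predicates here are stand-ins, not the drafted rule implementations in classifier_boost.py:

```python
# Sketch only: shows how the revised lifts stack and clamp. The real rule
# predicates live in ogur/engine/classifier_boost.py; these booleans are stand-ins.
BOOST_CAP = 3.0

def combined_boost(target_cluster: bool, kiq_aligned: bool,
                   franchise_source: bool, recent_14d: bool) -> float:
    """Sum the revised lifts and clamp to the cap."""
    total = 0.0
    total += 2.0 if target_cluster else 0.0    # target/MoA cluster
    total += 2.0 if kiq_aligned else 0.0       # KIQ alignment
    total += 1.0 if franchise_source else 0.0  # franchise source (sec/lens)
    total += 0.5 if recent_14d else 0.0        # recency within 14 days
    return min(total, BOOST_CAP)               # uncapped max is 5.5; the cap still bites
```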
The revision can ship as a follow-up PR on top of #82 once the original is reviewed for the wiring (threshold-after-boost, context-cache reuse, orchestrator integration). The wiring is correct and worth landing; the rule definitions can iterate.
Phase 4 — eval framework scoped to validate Phases 1 and 3¶
Phase 4 ships an offline eval harness that answers three questions every classifier-rework PR after this ADR will be expected to clear:
- Per-rule precision. For Phase 1 rules and Phase 3 boost rules, what fraction of triggered signals would a human analyst keep at the same score (±1)?
- Top-K agreement. What is the overlap between the classifier's top-20 and a human-curated top-20 over the same signal set?
- Cost-per-briefing. How many Haiku/Sonnet calls per briefing? Latency? Token cost?
Out of scope for Phase 4 (defer to later): online A/B against analyst feedback, end-to-end synthesizer-quality eval, automated landscape sweep beyond immunology-001.
Eval data¶
A frozen fixture of 150 hand-labeled signals drawn from the immunology-001 corpus across all sources, with the following per-signal labels:
| field | shape | source |
|---|---|---|
| expected_score | int 1–10 | analyst rating (Khalil + 1 reviewer) |
| expected_in_top20 | bool | analyst selection |
| rationale | string | one-line free text explaining the score |
The fixture lives at evals/classifier/immunology_v1.jsonl and is committed to the repo. Labeling is the cost — ~3 hours of analyst time. Refresh annually or after a major rule change.
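For illustration, a single fixture record might look like the sketch below. The key names are an assumption (the three label fields come from the table above; title/source/signal_type carry the re-anchor tuple described under Schema changes), and the values are invented:

```python
import json

# Hypothetical fixture record for evals/classifier/immunology_v1.jsonl; key names
# and values are illustrative, not the committed format.
record = {
    "signal_id": 48213,                          # stable on the immunology-001 DB
    "title": "Phase 3 trial registered for lebrikizumab",
    "source": "lens",
    "signal_type": "TRIAL_REGISTERED",
    "expected_score": 7,                         # int 1-10, analyst rating
    "expected_in_top20": True,                   # bool, analyst selection
    "rationale": "Late-stage registration from a direct competitor in the IL-13 cluster",
}
print(json.dumps(record))                        # one JSON object per line in the fixture
```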
Eval runner¶
A new CLI: scripts/eval/eval_classifier.py. It runs the live classifier (rule tier + LLM tier + boost) against the fixture and emits:
== Phase 1 rule tier ==
rule-tier coverage: 50.7% (76/150)
rule-tier precision @ ±1: 0.91 (69/76)
rule-tier precision @ ±2: 0.97 (74/76)
== Phase 3 boost ==
signals with non-zero boost: 18/150
boost precision (analyst agrees the boosted signal belongs higher): 0.83 (15/18)
per-rule firing counts:
target_cluster: 11
kiq_alignment: 3
franchise: 8
recency: 5
== Top-20 agreement ==
jaccard(classifier_top20, analyst_top20): 0.71
positions changed by ≥5: 4
== Cost ==
haiku_calls: 4
sonnet_calls: 1
total_tokens: 31_240
estimated_cost_usd: 0.18
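The two headline metrics reduce to simple comparisons against the fixture labels. A sketch of how the runner might compute them, assuming it has (predicted, expected) score pairs for the rule-scored signals and the two top-20 ID sets:

```python
def precision_within(pairs: list[tuple[int, int]], tolerance: int = 1) -> float:
    """Fraction of scored signals whose predicted score is within `tolerance`
    points of the analyst's expected_score."""
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, expected in pairs if abs(predicted - expected) <= tolerance)
    return hits / len(pairs)

def top_k_jaccard(classifier_top: set[int], analyst_top: set[int]) -> float:
    """Jaccard overlap between the classifier's and the analyst's top-K signal ID sets."""
    union = classifier_top | analyst_top
    return len(classifier_top & analyst_top) / len(union) if union else 0.0
```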
Add make eval-classifier and run it in CI on PRs that touch ogur/engine/classifier* or ogur/engine/agents/.
Eval thresholds for landing¶
A classifier change cannot regress these without explicit acknowledgement:
- Rule-tier precision @ ±1 ≥ 0.85
- Top-20 jaccard ≥ 0.65
- Cost-per-briefing ≤ $0.30
Sub-threshold runs fail the CI job; the PR author either fixes the regression or amends a # WHY block in the PR description explaining why the regression is intentional.
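A hypothetical version of that gate, assuming EvalReport exposes the three headline numbers under the attribute names used here (they are illustrative):

```python
import sys

# Illustrative CI gate; the attribute names on the report are assumptions.
def check_thresholds(report) -> int:
    """Return a non-zero exit code if any landing threshold regresses."""
    failures = []
    if report.rule_precision_at_1 < 0.85:
        failures.append(f"rule-tier precision @ ±1 {report.rule_precision_at_1:.2f} < 0.85")
    if report.top20_jaccard < 0.65:
        failures.append(f"top-20 jaccard {report.top20_jaccard:.2f} < 0.65")
    if report.estimated_cost_usd > 0.30:
        failures.append(f"cost per briefing ${report.estimated_cost_usd:.2f} > $0.30")
    for failure in failures:
        print(f"eval threshold regression: {failure}", file=sys.stderr)
    return 1 if failures else 0
```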
Why these specific revisions¶
Why drop density entirely vs. shrink the threshold¶
We considered ≥2 as a fallback. The corpus shows 1 signal with 2 candidate links (out of 7,536). Even if a future SEC parsing improvement doubled this, the rule would be a long-tail trigger, not a primary boost vector. The MoA-cluster rule captures the same intent (multi-competitor relevance) at the right granularity for actual press release content.
Why target/MoA cluster instead of company cluster¶
A single 8-K names one product per filing. But the landscape has many drugs sharing a target — IL-13 has dupilumab, lebrikizumab, tralokinumab, cendakimab. A press release naming dupilumab is implicitly about the IL-13 cluster's competitive structure. Target/MoA-level density is the level at which signals about one drug are still informative about other drugs — the right altitude for re-ranking.
Why halve recency¶
Recency at +1 within 30 days currently does ~80% of all boost work. That makes the boost effectively a recency tiebreaker, which is not its purpose. The synthesizer's KIQ-aware prompt already weights freshness; the boost should add structural-graph signal, not duplicate temporal signal. Halving the lift keeps recency as a marginal tiebreaker without dominating.
Why 150 labeled signals, not 50 or 500¶
50 was the original Phase 1 plan target — too few to break out per-rule precision (a rule that fires on 5% of signals would land 2-3 examples in the eval set). 500 is a serious labeling commitment (~10 hours). 150 keeps per-source representation viable (≥10 signals per source minimum) without ballooning analyst time. Refresh frequency yearly is sufficient as long as the rule set is stable; rule changes invalidate the fixture and require re-labeling.
Why not online evaluation¶
An online A/B against analyst feedback (briefings shown to users, votes captured) would be the gold-standard signal. It requires user instrumentation we don't have, statistical-power planning, and a feedback corpus that doesn't exist yet. Phase 4's offline eval is the floor; online eval is a Phase 5 decision once this lands and survives a quarter.
Architecture¶
Phase 3 revisions to existing module structure¶
ogur/engine/classifier_boost.py
graph_boost(change, *, linked_drugs, candidate_drugs, operational_kiqs,
target_clusters) ← new param
Rules:
target_cluster_boost(change, target_clusters) ← new
kiq_alignment_boost(change, linked_drugs, kiqs) ← unchanged
franchise_boost(change, linked_drugs, candidates) ← unchanged
recency_boost(change) ← window 30→14, lift 1.0→0.5
target_clusters is a dict[str, set[str]] keyed on target name, value is the set of candidate drugs known to act on it. Built once per orchestrator call from DrugTarget rows.
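A sketch of the new pieces under the structure above. The dict[str, set[str]] shape and the +2 lift follow the Decision table; the DrugTarget attribute names and the linked_drugs lookup via the Phase 2 context annotation are assumptions:

```python
from collections import defaultdict

def build_target_clusters(drug_target_rows) -> dict[str, set[str]]:
    """Target name -> candidate drugs known to act on it; built once per orchestrator call."""
    clusters: dict[str, set[str]] = defaultdict(set)
    for row in drug_target_rows:                 # DrugTarget rows; attribute names assumed
        clusters[row.target_name].add(row.drug_name)
    return dict(clusters)

def target_cluster_boost(change, target_clusters: dict[str, set[str]]) -> float:
    """+2 when a linked drug shares a tracked target with at least 2 other candidates."""
    linked = set(change.context.get("linked_drugs", []))   # Phase 2 annotation
    for cluster in target_clusters.values():
        for drug in linked & cluster:
            if len(cluster - {drug}) >= 2:
                return 2.0
    return 0.0
```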
Phase 4 module structure (new)¶
evals/
classifier/
immunology_v1.jsonl ← labeled fixture (150 signals)
README.md ← labeling guide
scripts/eval/
eval_classifier.py ← CLI runner
ogur/eval/
classifier.py ← reusable eval primitives:
run_eval(fixture) -> EvalReport
EvalReport (rule-tier metrics, boost metrics,
top-K agreement, cost)
fixtures.py ← load/save/validate JSONL
tests/unit/eval/
test_classifier_eval.py
test_fixtures.py
The runner imports the live classifier; it does not duplicate scoring logic. CI invocation runs against a throwaway in-memory DB seeded from the fixture.
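The runner's core flow, sketched under the assumption that fixtures.py exposes a load_fixture helper and that the threshold gate from the landing section lives alongside run_eval (both naming assumptions):

```python
# Rough shape of scripts/eval/eval_classifier.py, not the actual implementation.
import sys

from ogur.eval.classifier import run_eval, check_thresholds   # names assumed
from ogur.eval.fixtures import load_fixture                    # name assumed

def main(fixture_path: str = "evals/classifier/immunology_v1.jsonl") -> int:
    fixture = load_fixture(fixture_path)
    # In CI this step would first seed a throwaway in-memory DB from the fixture records.
    report = run_eval(fixture)        # runs the live rule tier + LLM tier + boost
    print(report)                     # the four report sections shown in the sample output
    return check_thresholds(report)   # landing gate from "Eval thresholds for landing"

if __name__ == "__main__":
    sys.exit(main())
```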
Schema changes¶
None for Phase 3.
For Phase 4, the only new artifact is the fixture file evals/classifier/immunology_v1.jsonl. Signal IDs are stable on the immunology-001 DB but will need a re-anchor if the seed changes shape; the fixture stores (signal.title, signal.source, signal.signal_type) tuples alongside the ID for fallback resolution.
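A sketch of that fallback resolution; the DB accessors are hypothetical:

```python
def resolve_signal(record: dict, db):
    """Resolve a fixture record to a live signal: by stored ID first, then by the
    (title, source, signal_type) tuple if the seed has been reshaped."""
    signal = db.get_signal_by_id(record["signal_id"])        # hypothetical accessor
    if signal is not None:
        return signal
    return db.find_signal(title=record["title"],             # hypothetical accessor
                          source=record["source"],
                          signal_type=record["signal_type"])
```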
Consequences¶
Wins:
- Phase 3's wiring (threshold-after-boost, context-cache reuse) lands and provides a foundation, even with revised rules.
- The MoA-cluster rule replaces a known-dead density rule with one that has a plausible firing surface.
- Phase 4 gives every future classifier change a numeric pass/fail signal at PR time, not after a customer demo.
- Plan-vs-reality drift caught early. The Deals-tab regression and the density-rule surprise both came from "the data isn't shaped like the plan assumed". The eval fixture institutionalizes catching this.
Costs:
- ~3 hours analyst labeling for the 150-signal fixture, plus repeat for any rule change that invalidates it.
- One extra DB query per orchestrator run to build target_clusters from DrugTarget (cached per call; negligible cost).
- CI overhead: ~30s for make eval-classifier. Run only on classifier-touching PRs.
Risks:
- Fixture overfitting. If we tune rules to maximize fixture metrics, we drift toward the labeling biases of the analyst who built it. Mitigation: expected_in_top20 is the analyst's pick from a blinded signal set (no scores shown); rules are not allowed to read the fixture during scoring.
- Single-corpus eval. immunology-001 isn't oncology, isn't rare disease. Phase 1's 50% rule-tier ratio gap was explicitly a different-corpus issue. Mitigation: when the next landscape (likely NSCLC) is seeded with depth, build nsclc_v1.jsonl and require both fixtures pass before merging.
Implementation phases¶
Phase 3a — drop density, ship rest of #82 (1 day)¶
Remove the density rule from classifier_boost.py; merge the wiring (threshold-after-boost, context-cache reuse, orchestrator integration) and the franchise / KIQ / recency rules. Tests adjust accordingly.
Phase 3b — target/MoA cluster rule + recency tightening (2 days)¶
Add target_cluster_boost. Halve recency lift, tighten window. Add unit tests. Live re-measure on immunology-001.
Phase 4a — eval primitives (3 days)¶
ogur/eval/classifier.py, ogur/eval/fixtures.py, basic CLI runner without thresholds. Wire to make eval-classifier. Smoke-test against existing immunology data.
Phase 4b — fixture labeling (3 hours analyst + 0.5 day engineering)¶
Sample 150 signals from immunology-001 stratified by source. Khalil + 1 reviewer label expected_score and expected_in_top20. Commit evals/classifier/immunology_v1.jsonl.
Phase 4c — CI thresholds + first run (0.5 day)¶
Wire make eval-classifier into CI on classifier-touching PRs. Bake in the precision @ ±1 ≥ 0.85, top-20 jaccard ≥ 0.65, cost ≤ $0.30 thresholds. First run will probably fail one threshold; either tune or document why.
Phase 4d (deferred) — second-corpus fixture¶
Once NSCLC has depth, repeat 4b for NSCLC. CI requires both pass before merge of classifier-touching PRs.
Open questions¶
- Should Phase 1's rule weights be tuned now or after Phase 4 lands? Tuning blind risks regressing precision. The eval framework is the right time. ADR position: hold rule changes until 4c.
- Does the boost cap stay at +3 if recency drops to +0.5? Max stack post-revision is target_cluster (2) + kiq (2) + franchise (1) + recency (0.5) = 5.5. Cap at +3 still bites. Reasonable to keep +3 as the ceiling so the boost stays a re-rank tool, not a score generator.
- Eval fixture in repo or out-of-tree? This ADR commits to in-repo (evals/classifier/). The labels are not sensitive; signal IDs reference public data. Out-of-tree (S3, separate repo) is overkill at this size.
- What happens when SignalDrug coverage improves (e.g. a smarter SEC parser, or LLM-based drug extraction)? The target_cluster rule's firing rate is a function of graph density. We accept that boost behaviour will shift as the graph improves; the eval framework will catch regressions either direction.