
ADR-0004: Classifier Phase 3 (graph boost) and Phase 4 (eval framework)

Status: Proposed Date: 2026-04-29 Driver: Khalil Related: PR #82 (Phase 3 draft), PR #84 (Exhibit 99.1 fetch — merged), PR #86 (Deals-tab SEC wiring — merged), Phase 0–2 of the classifier rework (PRs #79, #80, #81 — merged).

Context

The classifier rework was split into five phases (per the build plan in the parent thread). Phases 0–2 are merged on main and behaving as designed in production:

  • Phase 0 (#79) — hard-fail on total LLM-classifier failure instead of silent severity-only fallback.
  • Phase 1 (#80) — deterministic rule tier in front of the LLM. Live ratio on immunology-001: 50.3% of signals rule-scored (target was 60–80%).
  • Phase 2 (#81) — annotates DetectedChange.context with linked_drugs and matched_kiq_ids, richer LLM input lines, chunk size 50→20.

Phase 3 (graph-derived boost) and Phase 4 (eval framework) remain. Live measurements during the close-out of Phases 1 and 2 surfaced concrete reasons to revise the original Phase 3 design before merging it, and to scope Phase 4 around the gaps Phases 1 and 3 cannot self-validate. This ADR records those decisions so the work can resume cleanly without re-tracing the investigation.

Findings that motivate this ADR

Two live runs against immunology-001 (7,536 signals, 28 candidate drugs):

Phase 1 rule-tier ratio is below target — and that's fine

total signals    : 7,536
rule_scored      : 3,789  (50.3%)   ← below the 60–80% plan target
llm_residual     : 3,747  (49.7%)

The gap is concentrated in TRIAL_REGISTERED + Phase 2/Unknown (2,546 signals deferring to LLM). The plan's 60–80% target was probably calibrated on an oncology-style corpus where Phase 1 trial registrations dominate; immunology has a different mix. Tuning the rule weights without labels would move the ratio without us knowing whether precision held — a Phase 4 question.

Phase 3 boost rules don't fire on real data — except recency

After PR #84 (Exhibit 99.1 fetch) the SignalDrug graph went from 0 SEC→dupilumab edges to 29. Even with that lift, the boost distribution on immunology-001 looks like:

boost = +0          : 6,948 (92.2%)
boost = +1          :   586 ( 7.8%)   ← almost all from recency
boost = +2          :     2
boost = +3 (capped) :     0

Per-rule eligibility breakdown on the post-#84 corpus:

rule                                                        lift   live behaviour
Density (≥3 candidate drugs linked)                          +2    0 signals corpus-wide
KIQ alignment (operational KIQ mentions linked candidate)    +2    depends on KIQ count; only 2 active KIQs in immunology landscape today
Franchise source ({sec, lens} + ≥1 candidate link)           +1    38 SEC signals eligible (was 8 pre-#84)
Recency (event_date within 30 days)                          +1    ~588 signals eligible — does ~80% of all non-zero boost work

8-K press releases are effectively single-product: a Sanofi 8-K names Dupixent, a J&J 8-K names Stelara. The original density rule's mental model — "an SEC filing references the full pipeline" — does not match how 8-K Exhibit 99.1 attachments are scoped in practice. Lowering the threshold to ≥2 helps marginally (one extra signal). The rule needs a different definition or removal.

The Deals tab regression caught a similar issue

While verifying Phase 3 we discovered the asset-page Deals tab filtered on isDealType() only, dropping 69% of an asset's SEC 8-Ks (the press_release and leadership_change classifications the SEC source emits when Item codes don't pin to a specific deal type). PR #86 fixed it by adding a source-conditioned predicate. This is the same shape of problem as the density rule: plan-level intuitions about how SEC data is structured drifted from the actual data shape; only a live run revealed it.

This ADR therefore commits to making the Phase 4 eval framework cover enough of the pipeline that the next "plan vs reality" gap is caught at landing, not in the field.

Decision

Phase 3 — proceed with surgery, not the original spec

Merge a revised Phase 3 with three changes relative to PR #82 as currently drafted:

  1. Drop the density rule. It fires on 0 signals on real data and — given the single-product-press-release pattern — won't fire meaningfully on most landscapes. Removing it shrinks the boost surface to three rules.
  2. Replace it with a target/MoA cluster overlap rule. A signal whose linked drug shares a tracked target with ≥2 other candidate drugs (e.g. multiple IL-13 antagonists) earns +2. This captures the strategic relevance the original density rule was reaching for — "this signal touches several competitors at once" — but does so at the target level, where the data actually clusters.
  3. Lower the recency window to 14 days, give it +0.5. Currently recency is +1 for anything within 30 days, doing 80% of the boost's work. Halving the lift and tightening the window keeps recency as a tiebreaker instead of the dominant signal.

Final boost rule set after revision:

rule                 lift            precondition
Target/MoA cluster   +2              linked drugs share a tracked target with ≥2 other candidates
KIQ alignment        +2              linked candidate drug mentioned by an active OPERATIONAL-horizon KIQ
Franchise source     +1              source ∈ {sec, lens} AND ≥1 linked drug is a candidate
Recency              +0.5 (rounded)  event_date within 14 days

The revision can ship as a follow-up PR on top of #82 once the original is reviewed for the wiring (threshold-after-boost, context-cache reuse, orchestrator integration). The wiring is correct and worth landing; the rule definitions can iterate.
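
A minimal sketch of how the revised rule set could stack, assuming the +3 cap from open question 2 is kept. The Change dataclass and parameter shapes below are illustrative stand-ins, not the PR #82 signatures; the real ones live in ogur/engine/classifier_boost.py.

# Illustrative sketch only — stand-in types, not the PR #82 implementation.
from dataclasses import dataclass, field
from datetime import date, timedelta

BOOST_CAP = 3.0  # assumes open question 2 resolves to keeping the +3 ceiling

@dataclass
class Change:  # stand-in for DetectedChange
    source: str                                   # e.g. "sec", "lens", "ctgov"
    event_date: date
    linked_drugs: set[str] = field(default_factory=set)

def target_cluster_boost(change: Change, target_clusters: dict[str, set[str]]) -> float:
    # +2 if a linked drug shares a tracked target with >=2 other candidates
    for cluster in target_clusters.values():
        for drug in change.linked_drugs & cluster:
            if len(cluster - {drug}) >= 2:
                return 2.0
    return 0.0

def kiq_alignment_boost(change: Change, operational_kiqs: list[set[str]]) -> float:
    # +2 if an active OPERATIONAL-horizon KIQ mentions a linked candidate drug
    # (each KIQ modelled here as the set of drug names it mentions)
    return 2.0 if any(change.linked_drugs & kiq for kiq in operational_kiqs) else 0.0

def franchise_boost(change: Change, candidates: set[str]) -> float:
    # +1 for sec/lens signals with >=1 linked candidate drug
    return 1.0 if change.source in {"sec", "lens"} and change.linked_drugs & candidates else 0.0

def recency_boost(change: Change, today: date) -> float:
    # window tightened 30 -> 14 days, lift halved 1.0 -> 0.5
    return 0.5 if today - change.event_date <= timedelta(days=14) else 0.0

def graph_boost(change: Change, *, candidates: set[str],
                operational_kiqs: list[set[str]],
                target_clusters: dict[str, set[str]], today: date) -> float:
    total = (target_cluster_boost(change, target_clusters)
             + kiq_alignment_boost(change, operational_kiqs)
             + franchise_boost(change, candidates)
             + recency_boost(change, today))
    return min(total, BOOST_CAP)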

Phase 4 — eval framework scoped to validate Phases 1 and 3

Phase 4 ships an offline eval harness that answers three questions every classifier-rework PR after this ADR will be expected to clear:

  1. Per-rule precision. For Phase 1 rules and Phase 3 boost rules, what fraction of triggered signals would a human analyst keep at the same score (±1)?
  2. Top-K agreement. What is the overlap between the classifier's top-20 and a human-curated top-20 over the same signal set?
  3. Cost-per-briefing. How many Haiku/Sonnet calls per briefing? Latency? Token cost?

Out of scope for Phase 4 (defer to later): online A/B against analyst feedback, end-to-end synthesizer-quality eval, automated landscape sweep beyond immunology-001.

Eval data

A frozen fixture of 150 hand-labeled signals drawn from the immunology-001 corpus across all sources, with the following per-signal labels:

field               shape     source
expected_score      int 1–10  analyst rating (Khalil + 1 reviewer)
expected_in_top20   bool      analyst selection
rationale           string    one-line free text explaining the score

The fixture lives at evals/classifier/immunology_v1.jsonl and is committed to the repo. Labeling is the cost — ~3 hours of analyst time. Refresh annually or after a major rule change.
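
For concreteness, a sketch of one fixture record and a loader, assuming the label fields above plus the (title, source, signal_type) fallback tuple described under Schema changes. FixtureSignal and load_fixture are illustrative names, not the fixtures.py API.

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class FixtureSignal:              # illustrative record shape
    signal_id: int
    title: str                    # fallback-resolution fields
    source: str
    signal_type: str
    expected_score: int           # 1-10 analyst rating
    expected_in_top20: bool
    rationale: str                # one-line free text

def load_fixture(path: Path) -> list[FixtureSignal]:
    records = []
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        rec = FixtureSignal(**json.loads(line))
        assert 1 <= rec.expected_score <= 10, f"bad score for signal {rec.signal_id}"
        records.append(rec)
    return records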

Eval runner

A new CLI: scripts/eval/eval_classifier.py. It runs the live classifier (rule tier + LLM tier + boost) against the fixture and emits:

== Phase 1 rule tier ==
  rule-tier coverage: 52.3% (76/150)
  rule-tier precision @ ±1: 0.91 (69/76)
  rule-tier precision @ ±2: 0.97 (74/76)

== Phase 3 boost ==
  signals with non-zero boost: 18/150
  boost precision (analyst agrees the boosted signal belongs higher): 0.83 (15/18)
  per-rule firing counts:
    target_cluster: 11
    kiq_alignment:   3
    franchise:       8
    recency:         5

== Top-20 agreement ==
  jaccard(classifier_top20, analyst_top20): 0.71
  positions changed by ≥5: 4

== Cost ==
  haiku_calls: 4
  sonnet_calls: 1
  total_tokens: 31_240
  estimated_cost_usd: 0.18

Add make eval-classifier and run it in CI on PRs that touch ogur/engine/classifier* or ogur/engine/agents/.
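
A sketch of the two agreement metrics behind the report above; the function names are illustrative, not the shipped ogur/eval API.

def precision_at_tolerance(pairs: list[tuple[int, int]], tol: int = 1) -> float:
    """Fraction of (classifier_score, expected_score) pairs within ±tol."""
    if not pairs:
        return 0.0
    return sum(1 for got, want in pairs if abs(got - want) <= tol) / len(pairs)

def topk_jaccard(classifier_topk: set[int], analyst_topk: set[int]) -> float:
    """Jaccard overlap of the two top-K signal-ID sets."""
    union = classifier_topk | analyst_topk
    return len(classifier_topk & analyst_topk) / len(union) if union else 1.0

# 69 of 76 rule-scored signals within ±1 reproduces the 0.91 in the sample report
assert round(precision_at_tolerance([(5, 5)] * 69 + [(5, 9)] * 7), 2) == 0.91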

Eval thresholds for landing

A classifier change cannot regress these without explicit acknowledgement:

  • Rule-tier precision @ ±1 ≥ 0.85
  • Top-20 jaccard ≥ 0.65
  • Cost-per-briefing ≤ $0.30

Sub-threshold runs fail the CI job; the PR author either fixes the regression or amends a # WHY block in the PR description explaining why the regression is intentional.
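
A sketch of the CI gate, assuming the runner exposes its metrics as a flat dict; the names and exit-code convention are illustrative.

import sys

MIN_THRESHOLDS = {"rule_tier_precision_at_1": 0.85, "top20_jaccard": 0.65}
MAX_COST_USD = 0.30

def gate(report: dict[str, float]) -> int:
    failures = [name for name, floor in MIN_THRESHOLDS.items() if report[name] < floor]
    if report["estimated_cost_usd"] > MAX_COST_USD:
        failures.append("estimated_cost_usd")
    for name in failures:
        print(f"eval threshold breached: {name}", file=sys.stderr)
    return 1 if failures else 0   # non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(gate({"rule_tier_precision_at_1": 0.91,
                   "top20_jaccard": 0.71,
                   "estimated_cost_usd": 0.18}))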

Why these specific revisions

Why drop density entirely vs. shrink the threshold

We considered ≥2 as a fallback. The corpus shows 1 signal with 2 candidate links (out of 7,536). Even if a future SEC parsing improvement doubled this, the rule would be a long-tail trigger, not a primary boost vector. The MoA-cluster rule captures the same intent (multi-competitor relevance) at the right granularity for actual press release content.

Why target/MoA cluster instead of company cluster

An 8-K names one product per filing. But the landscape has many drugs sharing a target — IL-13 has dupilumab, lebrikizumab, tralokinumab, cendakimab. A press release naming dupilumab is implicitly about the IL-13 cluster's competitive structure. Target/MoA-level density is the level at which signals about one drug are still informative about other drugs — the right altitude for re-ranking.

Why halve recency

Recency at +1 within 30 days currently does ~80% of all boost work. That makes the boost effectively a recency tiebreaker, which is not its purpose. The synthesizer's KIQ-aware prompt already weights freshness; the boost should add structural-graph signal, not duplicate temporal signal. Halving the lift keeps recency as a marginal tiebreaker without dominating.

Why 150 labeled signals, not 50 or 500

50 was the original Phase 1 plan target — too few to break out per-rule precision (a rule that fires on 5% of signals would land only 2–3 examples in the eval set). 500 is a serious labeling commitment (~10 hours). 150 keeps per-source representation viable (≥10 signals per source minimum) without ballooning analyst time. A yearly refresh is sufficient as long as the rule set is stable; rule changes invalidate the fixture and require re-labeling.

Why not online evaluation

An online A/B against analyst feedback (briefings shown to users, votes captured) would be the gold-standard signal. It requires user instrumentation we don't have, statistical-power planning, and a feedback corpus that doesn't exist yet. Phase 4's offline eval is the floor; online eval is a Phase 5 decision once this lands and survives a quarter.

Architecture

Phase 3 revisions to existing module structure

ogur/engine/classifier_boost.py
    graph_boost(change, *, linked_drugs, candidate_drugs, operational_kiqs,
                target_clusters)      ← new param
    Rules:
      target_cluster_boost(change, target_clusters)        ← new
      kiq_alignment_boost(change, linked_drugs, kiqs)      ← unchanged
      franchise_boost(change, linked_drugs, candidates)    ← unchanged
      recency_boost(change)                                ← window 30→14, lift 1.0→0.5

target_clusters is a dict[str, set[str]] keyed on target name, value is the set of candidate drugs known to act on it. Built once per orchestrator call from DrugTarget rows.
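
A sketch of that per-call construction, assuming DrugTarget rows reduce to (drug_name, target_name) pairs; build_target_clusters is an illustrative name.

from collections import defaultdict

def build_target_clusters(drug_target_rows: list[tuple[str, str]],
                          candidate_drugs: set[str]) -> dict[str, set[str]]:
    clusters: dict[str, set[str]] = defaultdict(set)
    for drug_name, target_name in drug_target_rows:
        if drug_name in candidate_drugs:        # only candidates participate in boosting
            clusters[target_name].add(drug_name)
    return dict(clusters)

# e.g. the IL-13 cluster used as an example earlier in this ADR:
rows = [("dupilumab", "IL-13"), ("lebrikizumab", "IL-13"),
        ("tralokinumab", "IL-13"), ("cendakimab", "IL-13")]
clusters = build_target_clusters(rows, candidate_drugs={d for d, _ in rows})
# clusters == {"IL-13": {"dupilumab", "lebrikizumab", "tralokinumab", "cendakimab"}}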

Phase 4 module structure (new)

evals/
    classifier/
        immunology_v1.jsonl          ← labeled fixture (150 signals)
        README.md                    ← labeling guide
scripts/eval/
    eval_classifier.py               ← CLI runner
ogur/eval/
    classifier.py                    ← reusable eval primitives:
                                       run_eval(fixture) -> EvalReport
                                       EvalReport (rule-tier metrics, boost metrics,
                                                   top-K agreement, cost)
    fixtures.py                      ← load/save/validate JSONL
tests/unit/eval/
    test_classifier_eval.py
    test_fixtures.py

The runner imports the live classifier; it does not duplicate scoring logic. CI invocation runs against a throwaway in-memory DB seeded from the fixture.

Schema changes

None for Phase 3.

For Phase 4, the only schema-adjacent addition is the fixture file evals/classifier/immunology_v1.jsonl. Signal IDs are stable on the immunology-001 DB but will need re-anchoring if the seed changes shape, so the fixture stores (signal.title, signal.source, signal.signal_type) tuples alongside the ID for fallback resolution.
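
A sketch of that fallback resolution, with the lookup dicts standing in for the live DB queries; the helper name is illustrative.

def resolve_signal_id(row: dict,
                      signals_by_id: dict[int, dict],
                      signals_by_key: dict[tuple[str, str, str], int]) -> int | None:
    # prefer the stored ID; fall back to the (title, source, signal_type) tuple
    if row["signal_id"] in signals_by_id:
        return row["signal_id"]
    key = (row["title"], row["source"], row["signal_type"])
    return signals_by_key.get(key)   # None -> fixture row needs re-labeling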

Consequences

Wins:

  • Phase 3's wiring (threshold-after-boost, context-cache reuse) lands and provides a foundation, even with revised rules.
  • The MoA-cluster rule replaces a known-dead density rule with one that has a plausible firing surface.
  • Phase 4 gives every future classifier change a numeric pass/fail signal at PR time, not after a customer demo.
  • Plan-vs-reality drift caught early. The Deals-tab regression and the density-rule surprise both came from "the data isn't shaped like the plan assumed". The eval fixture institutionalizes catching this.

Costs:

  • ~3 hours of analyst labeling for the 150-signal fixture, plus a repeat for any rule change that invalidates it.
  • One extra DB query per orchestrator run to build target_clusters from DrugTarget (cached per call; negligible cost).
  • CI time: ~30s added for make eval-classifier. Run only on classifier-touching PRs.

Risks:

  • Fixture overfitting. If we tune rules to maximize fixture metrics, we drift toward the labeling biases of the analyst who built it. Mitigation: expected_in_top20 is the analyst's pick from a blinded signal set (no scores shown); rules are not allowed to read the fixture during scoring.
  • Single-corpus eval. immunology-001 isn't oncology and isn't rare disease. Phase 1's 50% rule-tier ratio gap was explicitly a different-corpus issue. Mitigation: when the next landscape (likely NSCLC) is seeded with depth, build nsclc_v1.jsonl and require both fixtures to pass before merging.

Implementation phases

Phase 3a — drop density, ship rest of #82 (1 day)

Remove the density rule from classifier_boost.py; merge the wiring (threshold-after-boost, context-cache reuse, orchestrator integration) and the franchise / KIQ / recency rules. Tests adjust accordingly.

Phase 3b — target/MoA cluster rule + recency tightening (2 days)

Add target_cluster_boost. Halve recency lift, tighten window. Add unit tests. Live re-measure on immunology-001.

Phase 4a — eval primitives (3 days)

ogur/eval/classifier.py, ogur/eval/fixtures.py, basic CLI runner without thresholds. Wire to make eval-classifier. Smoke-test against existing immunology data.

Phase 4b — fixture labeling (3 hours analyst + 0.5 day engineering)

Sample 150 signals from immunology-001 stratified by source. Khalil + 1 reviewer label expected_score and expected_in_top20. Commit evals/classifier/immunology_v1.jsonl.

Phase 4c — CI thresholds + first run (0.5 day)

Wire make eval-classifier into CI on classifier-touching PRs. Bake in the precision @ ±1 ≥ 0.85, top-20 jaccard ≥ 0.65, cost ≤ $0.30 thresholds. First run will probably fail one threshold; either tune or document why.

Phase 4d (deferred) — second-corpus fixture

Once NSCLC has depth, repeat 4b for NSCLC. CI requires both pass before merge of classifier-touching PRs.

Open questions

  1. Should Phase 1's rule weights be tuned now or after Phase 4 lands? Tuning blind risks regressing precision. The eval framework is the right time. ADR position: hold rule changes until 4c.

  2. Does the boost cap stay at +3 if recency drops to +0.5? Max stack post-revision is target_cluster (2) + kiq (2) + franchise (1) + recency (0.5) = 5.5. Cap at +3 still bites. Reasonable to keep +3 as the ceiling so the boost stays a re-rank tool, not a score generator.

  3. Eval fixture in repo or out-of-tree? This ADR commits to in-repo (evals/classifier/). The labels are not sensitive; signal IDs reference public data. Out-of-tree (S3, separate repo) is overkill at this size.

  4. What happens when SignalDrug coverage improves (e.g. a smarter SEC parser, or LLM-based drug extraction)? The target_cluster rule's firing rate is a function of graph density. We accept that boost behaviour will shift as the graph improves; the eval framework will catch regressions either direction.