ADR-0003: Candidate-drug ranking strategy

Status: Proposed
Date: 2026-04-28
Driver: Khalil
Related: ADR-0001 (monitoring-setup primitives), ADR-0002 (knowledge-graph roadmap), PR #74 (Phase 4+5 demo run)

Context

ADR-0001 §"Decision" gave us a discovery primitive: discover_competitors.py reads (indications × targets × moa × horizon) from the Landscape row, queries Open Targets (OT), and writes Landscape.candidate_drugs as a sorted JSON list. The Phase 4+5 demo against the immunology-001 landscape confirmed the design: 88.9% recall against the canonical atopic dermatitis (AD) set, with zero placebo, generic-derm, or aspirin contamination.

But the demo also surfaced a second-order problem the original ADR didn't address: the persisted candidate_drugs list is alphabetically sorted, and the downstream _select_top_competitors(top_n) reads candidate_drugs[:top_n] as-is. With 28 candidates discovered for the immunology-001 landscape:

0. abrocitinib   ← canonical AD
1. amlitelimab   ← real, not canonical (OX40L mAb, early-stage)
2. anrukinzumab  ← real, not canonical (anti-IL-13, discontinued)
3. baricitinib   ← canonical AD
4. benralizumab  ← asthma, not AD canonical
5. brepocitinib  ← real, not canonical (TYK2/JAK1)
6. delgocitinib  ← approved JP topical, not canonical
7. depemokimab   ← anti-IL-5, asthma
8. dupilumab     ← THE FRANCHISE ANCHOR (Sanofi's product)
...

top_n=6 evidence-extraction picks drugs 0–5, excluding dupilumab entirely. The franchise asset isn't covered. For a budget-constrained run (the Phase 5 demo had --top-n 6 --limit-trials 1 because of the $3 cap), this misallocates spend on noise rows ahead of the very drug the analyst is monitoring against.
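
The slicing problem can be reproduced with just the nine names listed above. This is a minimal sketch: `select_top` is a hypothetical stand-in for `_select_top_competitors`, whose real implementation is not shown in this ADR, and the list is truncated to the entries quoted here.

```python
# First nine alphabetically sorted candidates from the immunology-001 run.
candidate_drugs = [
    "abrocitinib", "amlitelimab", "anrukinzumab", "baricitinib",
    "benralizumab", "brepocitinib", "delgocitinib", "depemokimab",
    "dupilumab",
]


def select_top(candidates: list[str], top_n: int) -> list[str]:
    """Hypothetical stand-in: take the first top_n entries as persisted."""
    return candidates[:top_n]


top = select_top(candidate_drugs, top_n=6)
print("dupilumab" in top)  # False: the anchor sits at index 8
```

Any fix therefore has to change the persisted order, since the selector itself only slices.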

The Phase 5 demo worked around this by directly editing Landscape.candidate_drugs in SQLite to put canonical drugs first. That's a hack, and it gets overwritten every time discovery runs.

Why "minimum human input"

The ADR-0001 design philosophy was: analysts describe the competitive boundary (indication + target + MoA + horizon), not maintain a drug roster. The drug list is derived. We want to preserve that — adding back manual drug curation would defeat the monitoring premise.

So the question is: how do we rank derived candidates so that the most strategically-relevant drugs land at the front of candidate_drugs, without re-introducing curation tax?

Considered approaches

1. Manual anchor drug per landscape

Add Landscape.anchor_drug: str | None. Discovery writes:

candidate_drugs = (
    [anchor_drug] if anchor_drug else []
) + [d for d in sorted_others if d != anchor_drug]

Pros: Trivial to implement. Captures the most important domain knowledge (which drug is the franchise asset) in one optional field. For Sanofi's immunology landscape, anchor_drug="dupilumab" is the only configuration needed.

Cons: Only solves the anchor case. The rest of the list is still alphabetical. For a non-franchise-defense landscape (e.g. a broad oncology landscape monitoring 50 drugs), the analyst still gets noise drugs alphabetically ahead of high-relevance ones.

Verdict: Necessary but insufficient.

2. Open Targets disease-association score

Open Targets exposes a numeric score ∈ [0, 1] per (drug, disease) pair, computed from clinical / regulatory / literature evidence. Sort candidate_drugs by score desc, with name as a stable tie-breaker.

Pros: Zero human input. OT is the authoritative source on drug-disease relevance. Approved drugs naturally float up; preclinical noise sinks. Idempotent (OT scores are stable enough for our purposes).

Cons: Currently we don't query the score field. Schema change in discover_competitors.py's GraphQL fragment needed. OT scores are sometimes noisy for early-stage drugs (eblasakimab — when OT eventually indexes it — may score below well-established noise drugs).

Verdict: Strong sort key. Adopted as the secondary key, behind phase (approach 3).

3. Phase-weighted ordering

Sort by (phase_int desc, ot_score desc, name asc):

phase_4_or_approved > phase_3 > phase_2 > phase_1 > preclinical

Pros: Aligns with what an analyst expects ("show me approved/late-phase drugs first"). Already encoded in the _OT_STAGE_TO_INT dict in discover_competitors.py. Free.

Cons: Two drugs in the same phase still need a tie-breaker — leading us back to OT score or alphabetical.

Verdict: Use as the primary sort key, with OT score as the secondary.

4. Quality-aware signal volume (post-PR-#72)

PR #72's SignalDrug table has link_method and confidence columns. We can compute "high-confidence results-bearing signal volume per drug":

SELECT d.name, COUNT(*) AS quality_signals
FROM signaldrug sd
JOIN signal s ON sd.signal_id = s.id
-- Assumed join: the drug table and sd.drug_id are not spelled out in this ADR.
JOIN drug d ON sd.drug_id = d.id
WHERE sd.link_method = 'source_column'
  AND sd.confidence > 0.7
  AND s.signal_type IN ('PUBLICATION', 'CONFERENCE_ABSTRACT', 'TRIAL_REGISTERED')
GROUP BY d.name;

Pros: Captures "what the analyst actually wants" (drugs with real coverage in the landscape) without the placebo-promotion bug. The placebo-promotion bug came from counting all signals including low-quality ones (sponsorship-only mentions, registrations without results); filtering by signal_type and confidence removes that failure mode.

Cons: Bootstrap problem — first run has no signals yet, so this metric is zero everywhere. Requires PR #72 merged. Requires re-running discovery after seeding (couples discovery to ingestion order).

Verdict: Strong third-tier sort key for re-runs on a seeded DB. Initial-seed run falls back to phase + OT score.

5. MoA cluster diversity

For each MoA cluster (anti-IL-13, JAK family, anti-TSLP, etc.), pick the top-K drugs. Ensures the top-N covers the major mechanisms without over-representing one cluster.

Pros: Addresses a real concern: in immunology AD, the top-6 by phase+score might be dominated by JAK inhibitors (4 active, all with similar profiles). A MoA-diverse top-6 surfaces the IL-13 antagonists, the OX40L mAbs, the IL-31 mAbs.

Cons: Complex to tune (cluster definitions matter). Can deprioritize the franchise anchor if it's in an over-represented cluster.

Verdict: Future enhancement, not in the MVP.
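
For reference, one simple way to get MoA diversity is a round-robin over clusters. The sketch below is only an illustration of the idea, not a proposed implementation: `diverse_top_n` is a hypothetical name, the cluster labels are hand-assigned for the example, and the input order stands in for whatever ranking precedes it.

```python
from collections import defaultdict
from itertools import zip_longest


def diverse_top_n(ranked: list[tuple[str, str]], top_n: int) -> list[str]:
    """Round-robin across MoA clusters, keeping rank order inside each cluster.

    `ranked` holds (drug_name, moa_cluster) pairs, best first.
    """
    by_cluster: dict[str, list[str]] = defaultdict(list)
    for name, cluster in ranked:
        by_cluster[cluster].append(name)
    picked: list[str] = []
    # Each iteration is one "round": the i-th best drug of every cluster.
    for round_ in zip_longest(*by_cluster.values()):
        picked.extend(name for name in round_ if name is not None)
    return picked[:top_n]


# Illustrative input only; cluster labels are assumptions for the example.
ranked = [
    ("abrocitinib", "JAK"),
    ("baricitinib", "JAK"),
    ("brepocitinib", "JAK"),
    ("dupilumab", "IL-4R"),
    ("anrukinzumab", "anti-IL-13"),
    ("amlitelimab", "OX40L"),
]
print(diverse_top_n(ranked, top_n=4))
# ['abrocitinib', 'dupilumab', 'anrukinzumab', 'amlitelimab']
```

Note that the result depends entirely on how clusters are defined, which is the tuning cost named in the cons above.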

Decision

A two-tier strategy, with an optional third tier once PR #72 lands:

Tier 1 — Optional manual anchor (the franchise pin)

Add Landscape.anchor_drug: str | None. When set, discovery writes the anchor drug first in candidate_drugs, regardless of any other ranking. This captures the single piece of analyst knowledge that's hard to derive: "this is my product." For Sanofi's immunology landscape, that's "dupilumab". For a non-franchise-defense landscape (broad monitoring), leave it None.

Tier 2 — Derived ranking for the rest

Sort the remaining candidates by (phase_int desc, ot_score desc, drug_name asc):

| Sort key | Source | Why |
|---|---|---|
| 1. Phase as int (desc) | `_OT_STAGE_TO_INT[row.maxClinicalStage]` | Approved/Phase-4 drugs are more competitively relevant than preclinical. Already computed. |
| 2. OT disease-association score (desc) | New: extend GraphQL query to fetch `score` from `disease.associatedDrugs` | OT's authoritative measure of drug-disease relevance. |
| 3. Drug name (asc) | string compare | Stable tie-breaker: keeps the output deterministic / idempotent. |

Tier 3 — Quality-aware re-rank (after PR #72)

Once PR #72 lands and SignalDrug is populated, add an optional secondary pass that re-ranks the post-Tier-2 list by signal-quality count for re-runs against a seeded DB. Tier-2 ordering remains the source-of-truth on first seed; Tier-3 is a refinement.
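
One way to keep Tier-2 as the source of truth is to make the Tier-3 pass a stable sort on signal count alone. A minimal sketch, assuming the per-drug counts have already been fetched into a dict; `tier_3_rerank` and `quality_signals` are hypothetical names.

```python
def tier_3_rerank(tier_2_sorted: list[str], quality_signals: dict[str, int]) -> list[str]:
    """Re-rank by quality-signal count (desc); ties keep their Tier-2 order.

    Python's sorted() is stable, so drugs with equal counts stay in the
    order Tier 2 gave them. Drugs with no entry count as zero.
    """
    return sorted(tier_2_sorted, key=lambda drug: -quality_signals.get(drug, 0))


tier_2 = ["abrocitinib", "baricitinib", "dupilumab"]
print(tier_3_rerank(tier_2, {}))                # unseeded DB: order unchanged
print(tier_3_rerank(tier_2, {"dupilumab": 5}))  # dupilumab moves to the front
```

On an unseeded DB every count is zero, so the pass is a no-op and the initial-seed run falls back to phase + OT score without a special case.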

`Landscape.explicit_drugs` for OT misses

Discovery alone misses drugs OT hasn't indexed yet (eblasakimab today, every novel-mechanism startup forever). Add Landscape.explicit_drugs: list[str] — analyst-curated additions to the candidate list. Discovery output becomes:

candidate_drugs = (
    [anchor_drug] if anchor_drug else []
) + explicit_drugs + [
    d for d in tier_2_sorted_others
    if d not in {anchor_drug, *explicit_drugs}
]

This is the safety valve for the eblasakimab case. For each explicit_drug not yet in DrugSynonym, the seed should also insert a manual DrugSynonym row so PR #72's Pass B picks up text mentions. One row of SQL per missed drug — small price for completeness.

Schema additions

Two new fields on Landscape, both nullable / defaulted:

anchor_drug: str | None = None       # Tier 1: franchise asset, single name
explicit_drugs: str = "[]"           # OT misses: JSON-encoded list of names not in OT

Consequences

Easier:
- Top-N evidence extraction lands on the right drugs even at low budget caps.
- The Phase 5 demo's manual reorder hack disappears; discovery output is correct as persisted.
- Eblasakimab and similar OT misses become a one-row-of-SQL configuration, not a re-architecture.

Harder:
- discover_competitors.py now requires the OT GraphQL fragment to fetch the score field. A small change, but it is another OT-schema dependency.
- The Tier-3 pass couples discovery ordering to whether SignalDrug has been seeded: re-running discovery between seed runs may produce different orderings. Mitigation: Tier-3 is opt-in, off by default.

Still unsolved:
- Anchor drug as analyst input: the analyst still has to type one drug name. Unavoidable: the alternative is heuristics ("largest market cap incumbent in the indication"), which are fragile and external-data-hungry.
- MoA diversity: top-6 by phase × score on an immunology-AD landscape may still bias toward one cluster. ADR-0004 territory.

Implementation phases

Phase A — Schema (1 day, gateable for the demo)

Add anchor_drug and explicit_drugs fields to Landscape.
Update conftest factory + a few unit tests for round-trip.
Update seed_immunology.py to set anchor_drug="dupilumab", explicit_drugs=["eblasakimab"].

Phase B — Discovery integration (1 day)

discover_competitors.py reads anchor_drug, explicit_drugs from the landscape.
GraphQL fragment extended to include score.
filter_candidates returns a sorted-by-(phase, score, name) list.
Output composition: [anchor] + explicit + tier_2_sorted.

Phase C — Manual `DrugSynonym` seeding (15 min)

Small CLI scripts/add_drug_synonym.py or part of the seed: for each explicit_drug not in DrugSynonym, insert with link_method='manual_curation'.
Re-run make signal-drug-links (PR #72) — eblasakimab signals from CT.gov become attributable.
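
The seeding step might look roughly like the following. This is a sketch under loud assumptions: the `drugsynonym` table name and its `synonym` and `link_method` columns are guesses made for the example (the real DrugSynonym schema is not shown in this ADR), and `seed_manual_synonyms` is a hypothetical helper.

```python
import sqlite3


def seed_manual_synonyms(conn: sqlite3.Connection, explicit_drugs: list[str]) -> int:
    """Insert a manual synonym row for each explicit drug that has none yet.

    Table and column names are assumed for illustration.
    """
    inserted = 0
    for name in explicit_drugs:
        exists = conn.execute(
            "SELECT 1 FROM drugsynonym WHERE lower(synonym) = lower(?)", (name,)
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO drugsynonym (synonym, link_method) VALUES (?, ?)",
                (name, "manual_curation"),
            )
            inserted += 1
    conn.commit()
    return inserted


# Demo against an in-memory DB with an assumed two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drugsynonym (synonym TEXT NOT NULL, link_method TEXT)")
first = seed_manual_synonyms(conn, ["eblasakimab"])
second = seed_manual_synonyms(conn, ["eblasakimab"])  # second run inserts nothing
print(first, second)  # 1 0
```

Checking for an existing row first keeps the seed safe to re-run, which matters because seed scripts tend to be executed more than once.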

Phase D — Tier-3 re-rank (optional, post PR #72 merge)

Add --rerank-by-signals flag to discover_competitors.py.
After Tier-2 sort, re-rank by SignalDrug quality-signal count.
Off by default. Documented as a "warm DB" optimization.
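
The opt-in flag could be wired with argparse as below. A sketch only: it assumes discover_competitors.py uses argparse, which this ADR does not state, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="discover_competitors.py")
    parser.add_argument(
        "--rerank-by-signals",
        action="store_true",  # defaults to False, so Tier 3 stays opt-in
        help="After the Tier-2 sort, re-rank by SignalDrug quality-signal "
             "count (warm DB optimization).",
    )
    return parser


args = build_parser().parse_args([])
print(args.rerank_by_signals)  # False
```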

Open questions

Multiple anchors. For Sanofi-Regeneron's immunology franchise, both dupilumab and itepekimab are pipeline assets. Should anchor_drug be list[str] from day one? Recommend: yes, anchor_drugs: list[str]. JSON-encoded like the other fields.
Eblasakimab DrugSynonym source attribution. Should manually-curated synonyms have source='manual' or source='analyst:khalil'? (The PR-#72 link_method='manual_curation' already covers the "how was this linked" question.) Recommend: source='manual' with no person-attribution; this isn't a journal.
Tier-3 coupling concerns. If discovery is re-run with --rerank-by-signals against a stale DB (signals from 6 months ago), does that produce a stale ranking? Yes. Recommend: emit a warning when the re-rank is requested but last_signal_ingest_at < last_discovered_at, i.e. no signals have been ingested since the previous discovery run.
OT score availability. Verify that disease.associatedDrugs(...) { score } returns scores for all drugs we currently get from disease.drugAndClinicalCandidates. If the field is missing for clinical-only candidates, treat missing-score as 0.0 (so they sort below any scored drug at the same phase).
Interaction with MoA filter. Currently discover_competitors.filter_candidates accepts an optional moa filter. Does the MoA filter apply before or after the new ranking? Recommend: before. The ranking is over the post-filter candidate set.

Why this is the right shape now

We are in the position of having shipped Phase 4+5 with a hack-fix and a clear-eyed account of why it's a hack. ADR-0003 codifies the proper fix in the same vocabulary as ADR-0001 (analyst describes the boundary; pipeline derives the list) without slipping into curation-tax territory. The two-tier design separates the irreducible domain knowledge (which drug is yours) from everything that's truly derivable. That's the same architectural cut ADR-0001 made for indications and targets — extending it to the candidate ranking is the principled continuation.