ADR-0003: Candidate-drug ranking strategy
Status: Proposed
Date: 2026-04-28
Driver: Khalil
Related: ADR-0001 (monitoring-setup primitives), ADR-0002 (knowledge-graph roadmap), PR #74 (Phase 4+5 demo run)
Context
ADR-0001 §"Decision" gave us a discovery primitive: discover_competitors.py reads (indications × targets × moa × horizon) from the Landscape row, queries Open Targets, and writes Landscape.candidate_drugs as a sorted JSON list. The Phase 4+5 demo against the immunology-001 landscape confirmed the design: 88.9% recall against the canonical AD set, zero placebo / generic-derm / aspirin contamination.
But the demo also surfaced a second-order problem the original ADR didn't address: the persisted candidate_drugs list is alphabetically sorted, and the downstream _select_top_competitors(top_n) reads candidate_drugs[:top_n] as-is. With 28 candidates discovered for the immunology-001 landscape:
0. abrocitinib ← canonical AD
1. amlitelimab ← real, not canonical (OX40L mAb, early-stage)
2. anrukinzumab ← real, not canonical (anti-IL-13, discontinued)
3. baricitinib ← canonical AD
4. benralizumab ← asthma, not AD canonical
5. brepocitinib ← real, not canonical (TYK2/JAK1)
6. delgocitinib ← approved JP topical, not canonical
7. depemokimab ← anti-IL-5, asthma
8. dupilumab ← THE FRANCHISE ANCHOR (Sanofi's product)
...
top_n=6 evidence-extraction picks drugs 0–5, excluding dupilumab entirely. The franchise asset isn't covered. For a budget-constrained run (the Phase 5 demo had --top-n 6 --limit-trials 1 because of the $3 cap), this misallocates spend on noise rows ahead of the very drug the analyst is monitoring against.
The Phase 5 demo worked around this by directly editing Landscape.candidate_drugs in SQLite to put canonical drugs first. That's a hack, and it gets overwritten every time discovery runs.
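The failure mode can be reproduced in a few lines. This is a minimal sketch that uses only the nine candidates listed above (the rest of the 28 are elided in the listing and left out here); select_top is a hypothetical stand-in for _select_top_competitors, assumed to be a plain prefix slice as described.

```python
# First nine discovered candidates, in persisted (alphabetical) order.
candidate_drugs = [
    "abrocitinib", "amlitelimab", "anrukinzumab", "baricitinib",
    "benralizumab", "brepocitinib", "delgocitinib", "depemokimab",
    "dupilumab",
]


def select_top(candidates: list[str], top_n: int) -> list[str]:
    """Hypothetical stand-in for _select_top_competitors: a prefix slice."""
    return candidates[:top_n]


top6 = select_top(candidate_drugs, 6)
anchor_covered = "dupilumab" in top6  # False: the anchor sits at index 8
```

Any top_n below 9 leaves the franchise anchor out of evidence extraction for this landscape.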
Why "minimum human input"
The ADR-0001 design philosophy was: analysts describe the competitive boundary (indication + target + MoA + horizon), not maintain a drug roster. The drug list is derived. We want to preserve that — adding back manual drug curation would defeat the monitoring premise.
So the question is: how do we rank derived candidates so that the most strategically-relevant drugs land at the front of candidate_drugs, without re-introducing curation tax?
Considered approaches
1. Manual anchor drug per landscape
Add Landscape.anchor_drug: str | None. Discovery writes:
candidate_drugs = (
    [anchor_drug] if anchor_drug else []
) + [d for d in sorted_others if d != anchor_drug]
Pros: Trivial to implement. Captures the most important domain knowledge (which drug is the franchise asset) in one optional field. For Sanofi's immunology landscape, anchor_drug="dupilumab" is the only configuration needed.
Cons: Only solves the anchor case. The rest of the list is still alphabetical. For a non-franchise-defense landscape (e.g. broad oncology landscape monitoring 50 drugs), the analyst still gets noise drugs alphabetically ahead of high-relevance ones.
Verdict: Necessary but insufficient.
2. Open Targets disease-association score
Open Targets exposes a numeric score ∈ [0, 1] per (drug, disease) pair, computed from clinical / regulatory / literature evidence. Sort candidate_drugs by score desc, with name as a stable tie-breaker.
Pros: Zero human input. OT is the authoritative source on drug-disease relevance. Approved drugs naturally float up; preclinical noise sinks. Idempotent (OT scores are stable enough for our purposes).
Cons: Currently we don't query the score field. Schema change in discover_competitors.py's GraphQL fragment needed. OT scores are sometimes noisy for early-stage drugs (eblasakimab — when OT eventually indexes it — may score below well-established noise drugs).
Verdict: Strong sort key, used as the secondary key behind phase (see approach 3).
3. Phase-weighted ordering
Sort by (phase_int desc, ot_score desc, name asc).
Pros: Aligns with what an analyst expects ("show me approved/late-phase drugs first"). Already encoded in the _OT_STAGE_TO_INT dict in discover_competitors.py. Free.
Cons: Two drugs in the same phase still need a tie-breaker — leading us back to OT score or alphabetical.
Verdict: Use as the primary sort key, with OT score as the secondary.
4. Quality-aware signal volume (post-PR-#72)
PR #72's SignalDrug table has link_method and confidence columns. We can compute "high-confidence results-bearing signal volume per drug":
SELECT drug.name, COUNT(*) AS quality_signals
FROM signaldrug sd
JOIN signal s ON sd.signal_id = s.id
JOIN drug ON sd.drug_id = drug.id  -- assumes signaldrug.drug_id references drug.id
WHERE sd.link_method = 'source_column'
  AND sd.confidence > 0.7
  AND s.signal_type IN ('PUBLICATION', 'CONFERENCE_ABSTRACT', 'TRIAL_REGISTERED')
GROUP BY drug.name;
Pros: Captures "what the analyst actually wants" (drugs with real coverage in the landscape) without the placebo-promotion bug. The placebo-promotion bug came from counting all signals including low-quality ones (sponsorship-only mentions, registrations without results); filtering by signal_type and confidence removes that failure mode.
Cons: Bootstrap problem — first run has no signals yet, so this metric is zero everywhere. Requires PR #72 merged. Requires re-running discovery after seeding (couples discovery to ingestion order).
Verdict: Strong third-tier sort key for re-runs on a seeded DB. Initial-seed run falls back to phase + OT score.
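To make the filter behaviour concrete, here is a minimal sketch that runs the quality-signal query against an in-memory SQLite database. The table layout is an assumption that covers only the columns the query touches, including a hypothetical signaldrug.drug_id join column; the 'text_match' link method, the 'SPONSOR_MENTION' signal type, and all rows are made up for illustration.

```python
import sqlite3

# Hypothetical minimal schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE signal (id INTEGER PRIMARY KEY, signal_type TEXT);
CREATE TABLE signaldrug (
    signal_id INTEGER, drug_id INTEGER, link_method TEXT, confidence REAL
);
INSERT INTO drug VALUES (1, 'dupilumab'), (2, 'placebo');
INSERT INTO signal VALUES
    (1, 'PUBLICATION'), (2, 'CONFERENCE_ABSTRACT'), (3, 'SPONSOR_MENTION');
INSERT INTO signaldrug VALUES
    (1, 1, 'source_column', 0.9),
    (2, 1, 'source_column', 0.8),
    (3, 2, 'source_column', 0.9),  -- signal type not in the allow-list
    (1, 2, 'text_match', 0.95);    -- link method is not 'source_column'
""")

QUALITY_SIGNALS = """
SELECT d.name, COUNT(*) AS quality_signals
FROM signaldrug sd
JOIN signal s ON sd.signal_id = s.id
JOIN drug d ON sd.drug_id = d.id
WHERE sd.link_method = 'source_column'
  AND sd.confidence > 0.7
  AND s.signal_type IN ('PUBLICATION', 'CONFERENCE_ABSTRACT', 'TRIAL_REGISTERED')
GROUP BY d.name
"""

# Drugs with no qualifying signal are simply absent from the result.
counts = dict(conn.execute(QUALITY_SIGNALS).fetchall())
```

In this toy data only dupilumab survives the filters; placebo has two linked rows but neither passes, which is the behaviour the placebo-promotion fix relies on.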
5. MoA cluster diversity
For each MoA cluster (anti-IL-13, JAK family, anti-TSLP, etc.), pick the top-K drugs. Ensures the top-N covers the major mechanisms without over-representing one cluster.
Pros: Addresses a real concern: in immunology AD, top-6 by phase+score might be all JAK inhibitors (4 active, all with similar profiles). MoA-diverse top-6 surfaces the IL-13 antagonists, the OX40L mAbs, the IL-31 mAbs.
Cons: Complex to tune (cluster definitions matter). Can deprioritize the franchise anchor if it's in an over-represented cluster.
Verdict: Future enhancement, not in the MVP.
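For reference, the top-K-per-cluster idea could look roughly like the sketch below. It assumes a pre-ranked list and a drug-to-cluster mapping are already available; the function name, the cluster labels, and the input ranking are hypothetical.

```python
from collections import Counter


def top_k_per_cluster(ranked: list[str], moa_of: dict[str, str], k: int) -> list[str]:
    """Keep at most k drugs per MoA cluster, preserving the incoming rank order."""
    taken: Counter[str] = Counter()
    kept = []
    for drug in ranked:
        cluster = moa_of[drug]
        if taken[cluster] < k:
            taken[cluster] += 1
            kept.append(drug)
    return kept


# Illustrative ranking and cluster labels (not real pipeline output).
ranked = ["abrocitinib", "baricitinib", "brepocitinib", "anrukinzumab", "amlitelimab"]
moa_of = {
    "abrocitinib": "JAK",
    "baricitinib": "JAK",
    "brepocitinib": "JAK",
    "anrukinzumab": "anti-IL-13",
    "amlitelimab": "OX40L",
}
diverse = top_k_per_cluster(ranked, moa_of, k=1)
```

With k=1 the three JAK-family entries collapse to the best-ranked one, which also illustrates the stated risk: an anchor that is not the top drug in its cluster would be dropped unless it is pinned separately.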
Decision
A two-tier strategy, plus an optional third tier once PR #72 lands and an escape hatch for drugs OT has not indexed:
Tier 1: Optional manual anchor (the franchise pin)
Add Landscape.anchor_drug: str | None. When set, discovery writes the anchor drug first in candidate_drugs, regardless of any other ranking. This captures the single piece of analyst knowledge that's hard to derive: "this is my product." For Sanofi's immunology landscape, that's "dupilumab". For a non-franchise-defense landscape (broad monitoring), leave it None.
Tier 2: Derived ranking for the rest
Sort the remaining candidates by (phase_int desc, ot_score desc, drug_name asc):
| Sort key | Source | Why |
|---|---|---|
| 1. Phase as int (desc) | _OT_STAGE_TO_INT[row.maxClinicalStage] | Approved/Phase-4 drugs are more competitively relevant than preclinical. Already computed. |
| 2. OT disease-association score (desc) | New: extend GraphQL query to fetch score from disease.associatedDrugs | OT's authoritative measure of drug-disease relevance. |
| 3. Drug name (asc) | string compare | Stable tie-breaker; keeps the output deterministic / idempotent. |
Tier 3: Quality-aware re-rank (after PR #72)
Once PR #72 lands and SignalDrug is populated, add an optional secondary pass that re-ranks the post-Tier-2 list by signal-quality count for re-runs against a seeded DB. Tier-2 ordering remains the source-of-truth on first seed; Tier-3 is a refinement.
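One way to keep Tier-2 as the source of truth is to run the refinement as a stable sort over the Tier-2 output, so drugs with equal signal counts keep their Tier-2 order. A minimal sketch, assuming the quality-signal counts arrive as a name-to-count mapping (the function name and the counts are hypothetical):

```python
def rerank_by_signals(tier2_ranked: list[str], quality_signals: dict[str, int]) -> list[str]:
    """Re-rank by signal count (desc); ties keep their Tier-2 order."""
    # sorted() is stable, so equal keys preserve the incoming order.
    return sorted(tier2_ranked, key=lambda drug: -quality_signals.get(drug, 0))


tier2 = ["abrocitinib", "baricitinib", "brepocitinib"]

# Unseeded DB: every count is 0, so the Tier-2 order passes through unchanged.
cold = rerank_by_signals(tier2, {})

# Seeded DB (invented counts): brepocitinib moves up, the others keep relative order.
warm = rerank_by_signals(tier2, {"brepocitinib": 5})
```

The cold case shows why the bootstrap problem is harmless here: with no signals the pass is a no-op.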
Landscape.explicit_drugs for OT misses
Discovery alone misses drugs OT hasn't indexed yet (eblasakimab today, every novel-mechanism startup forever). Add Landscape.explicit_drugs: list[str] — analyst-curated additions to the candidate list. Discovery output becomes:
candidate_drugs = (
    [anchor_drug] if anchor_drug else []
) + explicit_drugs + [
    d for d in tier_2_sorted_others
    if d not in {anchor_drug, *explicit_drugs}
]
This is the safety valve for the eblasakimab case. For each explicit_drug not yet in DrugSynonym, the seed should also insert a manual DrugSynonym row so PR #72's Pass B picks up text mentions. One row of SQL per missed drug — small price for completeness.
Schema additions
Two new fields on Landscape, both nullable / defaulted:
anchor_drug: str | None = None   # Tier 1: franchise asset, single name
explicit_drugs: str = "[]"       # Analyst additions: JSON-encoded list of names not in OT
Consequences
Easier:
- Top-N evidence extraction lands on the right drugs even at low budget caps.
- The Phase 5 demo's manual reorder hack disappears; discovery output is correct as-persisted.
- Eblasakimab and similar OT-misses become a one-row-of-SQL configuration, not a re-architecture.
Harder:
- discover_competitors.py now requires the OT GraphQL fragment to fetch the score field. Small change, but it's another OT-schema dependency.
- The Tier-3 pass couples discovery ordering to whether SignalDrug has been seeded: re-running discovery between seed runs may produce different orderings. Mitigation: Tier-3 is opt-in, off by default.
Still unsolved:
- Anchor-drug as analyst input: still requires the analyst to type one drug name. Unavoidable: the alternative is heuristics ("largest market cap incumbent in the indication"), which are fragile and external-data-hungry.
- MoA diversity: top-6 by phase × score on an immunology-AD landscape may still bias toward one cluster. ADR-0004 territory.
Implementation phases
Phase A: Schema (1 day, gateable for the demo)
- Add anchor_drug and explicit_drugs fields to Landscape.
- Update conftest factory + a few unit tests for round-trip.
- Update seed_immunology.py to set anchor_drug="dupilumab", explicit_drugs=["eblasakimab"].
Phase B: Discovery integration (1 day)
- discover_competitors.py reads anchor_drug, explicit_drugs from the landscape.
- GraphQL fragment extended to include score.
- filter_candidates returns a sorted-by-(phase, score, name) list.
- Output composition: [anchor] + explicit + tier_2_sorted.
Phase C: Manual DrugSynonym seeding (15 min)
- Small CLI scripts/add_drug_synonym.py or part of the seed: for each explicit_drug not in DrugSynonym, insert with link_method='manual_curation'.
- Re-run make signal-drug-links (PR #72); eblasakimab signals from CT.gov become attributable.
Phase D: Tier-3 re-rank (optional, post PR #72 merge)
- Add --rerank-by-signals flag to discover_competitors.py.
- After Tier-2 sort, re-rank by SignalDrug quality-signal count.
- Off by default. Documented as a "warm DB" optimization.
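The opt-in flag from Phase D could be wired as a store_true option, which makes "off by default" the natural behaviour. This sketch assumes discover_competitors.py parses its arguments with argparse, which the ADR does not state; build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring; assumes the script uses argparse."""
    parser = argparse.ArgumentParser(prog="discover_competitors.py")
    parser.add_argument(
        "--rerank-by-signals",
        action="store_true",  # absent on the command line means False
        help="Warm-DB optimization: re-rank the Tier-2 list by "
             "SignalDrug quality-signal count.",
    )
    return parser


default_args = build_parser().parse_args([])
opted_in = build_parser().parse_args(["--rerank-by-signals"])
```

argparse exposes the flag as args.rerank_by_signals, so the Tier-3 pass can be gated on that single boolean.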
Open questions
- Multiple anchors. For Sanofi-Regeneron's immunology franchise, both dupilumab and itepekimab are pipeline assets. Should anchor_drug be list[str] from day one? Recommend: yes, anchor_drugs: list[str]. JSON-encoded like the other fields.
- Eblasakimab DrugSynonym source attribution. Should manually-curated synonyms have source='manual' or source='analyst:khalil'? (The PR-#72 link_method='manual_curation' already covers the "how was this linked" question.) Recommend: source='manual' with no person-attribution; this isn't a journal.
- Tier-3 coupling concerns. If discovery is re-run with --rerank-by-signals against a stale DB (signals from 6 months ago), does that produce a stale ranking? Yes. Recommend: emit a warning when the re-rank is requested but last_discovered_at < last_signal_ingest_at.
- OT score availability. Verify that disease.associatedDrugs(...) { score } returns scores for all drugs we currently get from disease.drugAndClinicalCandidates. If the field is missing for clinical-only candidates, treat missing-score as 0.0 (so they sort below any scored drug at the same phase).
- Interaction with MoA filter. Currently discover_competitors.filter_candidates accepts an optional moa filter. Does the MoA filter apply before or after the new ranking? Recommend: before. The ranking is over the post-filter candidate set.
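The warning recommended under "Tier-3 coupling concerns" could be a small guard that runs before the re-rank. The sketch below implements the comparison exactly as written in that recommendation; the function name and warning text are hypothetical, and both timestamps are assumed to be available as datetime values.

```python
import warnings
from datetime import datetime


def warn_on_rerank_mismatch(
    rerank_by_signals: bool,
    last_discovered_at: datetime,
    last_signal_ingest_at: datetime,
) -> bool:
    """Warn and return True when the recommended condition holds."""
    if rerank_by_signals and last_discovered_at < last_signal_ingest_at:
        warnings.warn(
            "--rerank-by-signals requested, but last_discovered_at is earlier "
            "than last_signal_ingest_at; check that the ranking inputs are current.",
            stacklevel=2,
        )
        return True
    return False
```

Returning a boolean as well as warning lets a caller log or surface the condition without having to capture warnings.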
Why this is the right shape now
We are in the position of having shipped Phase 4+5 with a hack-fix and a clear-eyed account of why it's a hack. ADR-0003 codifies the proper fix in the same vocabulary as ADR-0001 (analyst describes the boundary; pipeline derives the list) without slipping into curation-tax territory. The two-tier design separates the irreducible domain knowledge (which drug is yours) from everything that's truly derivable. That's the same architectural cut ADR-0001 made for indications and targets — extending it to the candidate ranking is the principled continuation.