ADR-0005: Sponsor / asset discovery harvester for modality-target landscapes
Status: Accepted
Date: 2026-06-08
Driver: Khalil
Related: PR #107 (closed-world coverage test that this work flips into discovery), ADR-0001 (Landscape.candidate_drugs is "derived, not authored"), docs/sirna_landscape_gap_analysis_brief.md §0 (closed-world vs discovery framing)
Context
PR #107 introduced scope_type="modality_target" landscapes and the staging row cardiometabolic-rnai-001. The PR's seed (scripts/seed/seed_cardiometabolic_rnai.py) hand-curates candidate_drugs (30 entries) and companies (40 entries) from a reference deck, then asks every Ogur source "do you have any signal for each named entity?" — a closed-world coverage test, not an Explore-mode discovery run.
This ADR specifies the discovery harvester that flips that mechanism: given only a thesis (modality + targets + indications), produce a ranked sponsor + drug list by projecting source hits onto applicant / sponsor / author-affiliation / filer fields, with source_diversity as the trust signal.
The harvester is the real Explore-mode answer to "the analyst doesn't know the players yet." It also closes the asymmetry called out in docs/sirna_landscape_gap_analysis_brief.md §0: PR #107's "13/32 assets, 2/12 deals" figure can only be interpreted relative to a list someone else collected. With the harvester, both axes of the gap analysis are inferred from the same input.
Decision
Build ogur/engine/discovery_modality.py — the modality-target analogue of scripts/seed/discover_competitors.py (which is indication-axis, OT-only). Five source adapters project hits onto a sponsor-like field; a merge stage produces ranked DiscoveredEntity rows; an async /api/explore/discover endpoint exposes the harvester to the Explore wizard.
Five design decisions resolved up front, captured here so the implementation has a single reference.
Decision 1 — Per-landscape patent CPC config (was §4.2)
Chosen: add Landscape.source_filters JSON column. Default {"patents": {"cpc_prefixes": ["A61"]}} preserves existing NSCLC / immunology behaviour. cardiometabolic-rnai-001 gets {"patents": {"cpc_prefixes": ["A61", "C12N"]}} because siRNA composition patents file under C12N15 (nucleic acid biotechnology) more often than A61 (therapeutics).
Not chosen:
- Global broadening to (A61 OR C12N): would add noise to immunology / NSCLC landscapes, which doesn't matter much in practice but is the wrong shape long-term.
- Defer until the call site reveals which is cleaner: the call site is unambiguous (_CPC_PHARMA_PREFIX is referenced in _search_lens_with_must_clause and _search_epo_with_cql; both need to switch from a single prefix to a list).
Consequence: this PR pulls in what was "Proposal 2" from PR #107's discussion. Schema migration mirrors _ensure_modality_target_schema in seed_cardiometabolic_rnai.py — idempotent ALTER TABLE ... ADD COLUMN with a JSON default. Future per-source filter knobs (date windows, max-results caps) live in the same JSON blob without further schema churn.
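A minimal sketch of what such an idempotent migration could look like. It assumes a SQLite database and a table named landscapes; both are assumptions, since this ADR only states that the migration mirrors _ensure_modality_target_schema. The helper name ensure_source_filters_column is hypothetical.

```python
import json
import sqlite3

# Default from Decision 1: preserves the existing A61-only behaviour.
DEFAULT_SOURCE_FILTERS = {"patents": {"cpc_prefixes": ["A61"]}}


def ensure_source_filters_column(conn: sqlite3.Connection, table: str = "landscapes") -> bool:
    """Add source_filters if it is missing. Returns True when the column was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "source_filters" in existing:
        return False  # already migrated, so a second run is a no-op
    default = json.dumps(DEFAULT_SOURCE_FILTERS).replace("'", "''")
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN source_filters TEXT NOT NULL DEFAULT '{default}'"
    )
    conn.commit()
    return True
```

Existing rows pick up the JSON default, so landscapes that never set a filter keep the A61-only behaviour.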
Plumbing notes:
- _CPC_PHARMA_PREFIX = "A61" becomes a method that reads from the landscape's source_filters.
- Both _search_lens_* (class_cpc.symbol prefix-match) and _search_epo_* (CQL cl=<prefix> operator) accept a list and OR the prefixes together ({"bool": {"should": [{"prefix": ...}, ...]}} for Lens; cl=A61 OR cl=C12N for EPO CQL).
Decision 2 — Async API with in-memory job store (was §4.3)
Chosen: POST /api/explore/discover returns 202 + {job_id, status_url}. GET /api/explore/discover/{job_id} returns {status, drugs, companies, ...}. Job store is cachetools.TTLCache keyed by job_id with a 15-minute TTL; a separate keyed cache (hash of (scope_type, modalities, targets, indications) tuple) memoizes completed results so a re-run of the same thesis returns instantly.
Not chosen:
- Synchronous endpoint: a run targets ~60 s (5 sources × 10 queries, with a 200 ms sleep per query plus parse overhead), and a 60-second blocking HTTP request behind a wizard form is bad UX.
- Sync first, async in a follow-up — costs us nothing to ship async now and the frontend shape (polling, progress indicator) becomes the right shape from M1.
Consequence: the FastAPI route uses BackgroundTasks (mirroring the existing briefings.py pattern for /briefing/{landscape_id} trigger). No Redis / Celery added; if the API ever scales horizontally, the only change is swapping TTLCache for a shared store. The endpoint contract stays identical.
Job store shape:
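from __future__ import annotations  # DiscoveryResult is referenced before it is imported here

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

from cachetools import TTLCache

@dataclass
class DiscoveryJob:
    job_id: str
    status: Literal["running", "complete", "failed"]
    submitted_at: datetime
    completed_at: datetime | None
    result: DiscoveryResult | None
    error: str | None

_JOBS: TTLCache[str, DiscoveryJob] = TTLCache(maxsize=200, ttl=900)  # 15 min
_RESULT_CACHE: TTLCache[str, DiscoveryResult] = TTLCache(maxsize=50, ttl=900)  # keyed by thesis hash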
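One way the key for the result cache could be derived is sketched below. The function name thesis_cache_key is hypothetical, and making the key insensitive to list order and letter case is an assumption; the decision above only says the key is a hash of the (scope_type, modalities, targets, indications) tuple.

```python
import hashlib
import json


def thesis_cache_key(scope_type, modalities, targets, indications) -> str:
    """Stable hash of the thesis; list order and letter case do not change the key."""

    def norm(values):
        # Assumption: the same thesis typed in a different order should hit the cache.
        return sorted({v.strip().lower() for v in values})

    payload = json.dumps(
        [scope_type, norm(modalities), norm(targets), norm(indications)],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A POST handler would check the result cache with this key before starting a new background job, so a repeated thesis returns without touching the sources.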
Decision 3 — OpenTargets modality filter via keyword-match on mechanismOfAction (was §4.1)
Chosen: for each target in landscape.targets, query Open Targets' target → known-drugs path (target(ensemblId: ...) { knownDrugs { rows { drug { name, mechanismsOfAction { rows { mechanismOfAction } } } } } }); filter rows to those whose mechanismOfAction strings contain any landscape-configured modality keyword (siRNA, RNA interference, small interfering RNA, antisense, etc.), with a noise-token blocklist.
Not chosen:
- Filter by drug.drugType: OT's drugType is sparse for preclinical/disclosed-only candidates (the regime this PR cares about most). It would give false negatives on exactly the Chinese-biotech assets PR #107 highlighted.
- Both (drugType first, mechanism fallback): extra LOC for marginal precision gain when keyword-match is already permissive on purpose.
Consequence: noise-token blocklist mirrors _DEAL_FOCUS_NOISE_TOKENS in scripts/utils/inspect_sirna_seed.py — receptor / cytokine / checkpoint tokens that look like modality keywords but aren't (e.g. "RNA polymerase II", "RNA polymerase III" — not siRNA).
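The keyword filter with its blocklist might look roughly like this. The row shape follows the GraphQL selection quoted in this decision; the function names, the exact keyword tuple and the two-entry noise list are illustrative assumptions, not the contents of _DEAL_FOCUS_NOISE_TOKENS.

```python
MODALITY_KEYWORDS = ("sirna", "rna interference", "small interfering rna", "antisense")
NOISE_TOKENS = ("rna polymerase ii", "rna polymerase iii")


def matches_modality(mechanisms, keywords=MODALITY_KEYWORDS, noise=NOISE_TOKENS) -> bool:
    """True if any mechanism string contains a modality keyword and no noise token."""
    for text in mechanisms:
        lowered = text.lower()
        if any(token in lowered for token in noise):
            continue  # looks like a modality keyword but is not one
        if any(keyword in lowered for keyword in keywords):
            return True
    return False


def modality_drug_names(rows: list[dict]) -> list[str]:
    """Keep the drugs from knownDrugs rows whose mechanisms match the modality."""
    names = set()
    for row in rows:
        drug = row.get("drug") or {}
        moa_rows = (drug.get("mechanismsOfAction") or {}).get("rows") or []
        mechanisms = [m.get("mechanismOfAction") or "" for m in moa_rows]
        if drug.get("name") and matches_modality(mechanisms):
            names.add(drug["name"])
    return sorted(names)
```

The blocklist only starts to matter once a landscape configures a broad keyword such as "RNA"; with the narrower keywords above it is a safety net rather than an active filter.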
Important side-finding: ogur/sources/opentargets.py::_EFO_IDS has no cardiometabolic entries (hyperlipidemia, ASCVD, hypertension, MASH, obesity, cardio-renal). The existing OpenTargetsSource.fetch() on cardiometabolic-rnai-001 falls back to settings.opentargets_disease_id (NSCLC), which is why the PR #107 dump reported "OT: 1/32 assets, 578 signals" — those signals are mostly NSCLC noise tagged with the wrong landscape_id. Out of scope for this PR, but flagged in docs/sirna_landscape_discovery_vs_deck.md as a known gap. The harvester itself sidesteps this because it queries OT by target, not by indication.
Decision 4 — UI filter: ≥2 source diversity by default + "show single-source" toggle (was §4.4)
Chosen: the wizard renders only entities with source_diversity ≥ 2 by default, with an explicit toggle to surface single-source candidates. The harvester returns all candidates — the filter happens client-side in ClarifyingQuestionsPanel.tsx. Each entity carries sources: list[str] so the chip can show provenance on hover.
Not chosen:
- Show everything, sort by diversity — risks 50+ chip overload when 15 entities are the real signal.
- ≥3 sources — misses entities only confirmed by 2 sources, which the §7 acceptance criteria explicitly target (Suzhou Ribo + BeBetter often only hit EPO + CT.gov, both still trustworthy).
Consequence: honors §2 principle #4 ("Human-in-the-loop is a feature") — the analyst sees what the system filtered out via the toggle, not by guessing. The single-source view is the right place to find long-tail sponsors (Chinese-only biotechs whose only Western footprint is one OpenAlex affiliation).
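The filter predicate itself is small. The real filter runs client-side in ClarifyingQuestionsPanel.tsx; the Python sketch below only illustrates the rule, with a hypothetical function name and the assumption that source_diversity is the number of distinct entries in sources.

```python
def visible_entities(
    entities: list[dict],
    show_single_source: bool = False,
    min_diversity: int = 2,
) -> list[dict]:
    """Apply the default diversity threshold; the toggle lowers it to one source."""
    threshold = 1 if show_single_source else min_diversity
    return [e for e in entities if len(set(e["sources"])) >= threshold]
```

Because the harvester returns every candidate, flipping the toggle needs no second request: the same payload is re-filtered with a threshold of 1.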
Decision 5 — Hand-curated alias normaliser (was §4.5)
Chosen: module-level helper that pipes name candidates through (1) lowercase, (2) strip legal-suffix tokens (Inc / Ltd / Co / LLC / Pharma / Therapeutics / Pharmaceuticals / GmbH / plc), (3) lookup in a hand-curated _ALIASES dict for known multi-spelling entities ("Suzhou Ribo Life Science" / "Suzhou Ribo" / "苏州瑞博" → canonical "Suzhou Ribo Life Science"). Lifted from the existing _company_match_tokens logic in scripts/utils/inspect_sirna_seed.py.
Not chosen:
- Open Targets drug.crossReferences.companies[] as canonical ID space: better long-term, but misses entities OT doesn't know about (small private Chinese biotechs, preclinical-only sponsors). Bigger refactor for marginal initial gain.
- Alias table + expose unmerged variants under entity.aliases[]: ~5 LOC for a feature whose primary consumer (the analyst eyeballing the wizard) doesn't need it.
Consequence: the alias dict ships small (~20 entries for the cardiometabolic landscape) and grows organically per landscape. Future migration to OT-canonical IDs is a swap of the normaliser implementation without changing the call site.
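A minimal sketch of the three-step normaliser, under stated assumptions: the function name normalise_company is hypothetical, only trailing legal-suffix tokens are stripped, and names with no alias entry fall back to their lowercased, suffix-stripped form. The suffix list and the Suzhou Ribo aliases are the ones named in this decision.

```python
import re

_LEGAL_SUFFIXES = {
    "inc", "ltd", "co", "llc", "pharma", "therapeutics", "pharmaceuticals", "gmbh", "plc",
}

# Keys are already lowercased and suffix-stripped; values are canonical names.
_ALIASES = {
    "suzhou ribo life science": "Suzhou Ribo Life Science",
    "suzhou ribo": "Suzhou Ribo Life Science",
    "苏州瑞博": "Suzhou Ribo Life Science",
}


def normalise_company(name: str) -> str:
    """Lowercase, strip trailing legal suffixes, then map known aliases to one name."""
    tokens = re.findall(r"[^\W_]+", name.lower())  # splits on punctuation, keeps CJK
    while len(tokens) > 1 and tokens[-1] in _LEGAL_SUFFIXES:
        tokens.pop()  # always keep at least one token
    key = " ".join(tokens)
    return _ALIASES.get(key, key)
```

Swapping in OT-canonical IDs later would mean replacing the body of this one function; callers keep passing a raw name and receiving a merge key.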
Why not extend discover_competitors.py
scripts/seed/discover_competitors.py is indication-axis (queries OT's disease(efoId: ...) { drugAndClinicalCandidates }) and OT-only. The harvester is modality-axis (queries 5 sources, projects onto sponsor fields) and target-axis (queries OT's target(ensemblId: ...) { knownDrugs }). The two scripts share zero query bodies. They share a philosophy — candidate_drugs is derived, not authored — and the harvester's output writes into the same column.
discover_competitors.py stays unchanged. Existing indication-axis landscapes (NSCLC, immunology) keep calling it; modality-target landscapes call discover_for_modality_target instead. The validation function in ogur/engine/discovery_concepts.py raises if a caller mismatches scope_type.
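The scope_type guard could be as small as the sketch below. The function name require_scope_type and the choice of ValueError are assumptions; this ADR only says the validation function raises on a mismatch.

```python
def require_scope_type(landscape: dict, expected: str) -> None:
    """Raise if the landscape is routed to the wrong discovery path."""
    actual = landscape.get("scope_type")
    if actual != expected:
        raise ValueError(f"expected scope_type={expected!r}, got {actual!r}")
```

discover_for_modality_target would call this with "modality_target" before issuing any source query, so an indication-axis landscape fails fast instead of producing an empty result.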
Architecture
Landscape row
(modalities, targets, indications,
 source_filters, scope_type=="modality_target")
                    │
                    ▼
┌────────── discovery_modality ──────────┐
│                                        │
│ asyncio.gather over 5 source adapters: │
│ ────────────────────────────────────── │
│ patents._harvest_applicants_by_concept │
│ clinicaltrials._harvest_sponsors_…     │
│ openalex._harvest_affiliations_…       │
│ sec._harvest_filers_by_concept         │
│ opentargets._harvest_drugs_by_target_… │
│                                        │
│                   ▼                    │
│         raw entities by source         │
│                   │                    │
│                   ▼                    │
│    alias-normalise + merge by name     │
│                   │                    │
│                   ▼                    │
│     rank (source_diversity, count)     │
└───────────────────┬────────────────────┘
                    │
DiscoveryResult { drugs, companies, by_source }
                    │
                    ▼
POST /api/explore/discover → job_id (202)
GET /api/explore/discover/{job_id} → full result
                    │
                    ▼
ResearchWizard.tsx
  → editable chips in ClarifyingQuestionsPanel
  → confirm writes to landscape.candidate_drugs
    + landscape.companies
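The gather, merge and rank stages in the diagram can be sketched as follows. The DiscoveredEntity fields, the adapter signature (an async callable taking the landscape and returning raw names) and the decision to treat a failing source as an empty result are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DiscoveredEntity:
    name: str
    sources: list[str] = field(default_factory=list)
    count: int = 0  # total raw hits across all sources

    @property
    def source_diversity(self) -> int:
        return len(set(self.sources))


async def harvest(adapters: dict, landscape: dict) -> dict[str, list[str]]:
    """Run every adapter concurrently; a failing source yields no hits."""
    results = await asyncio.gather(
        *(fn(landscape) for fn in adapters.values()), return_exceptions=True
    )
    return {
        source: ([] if isinstance(hits, BaseException) else hits)
        for source, hits in zip(adapters, results)
    }


def merge_and_rank(by_source: dict[str, list[str]], normalise=str.strip) -> list[DiscoveredEntity]:
    """Merge raw names by normalised key, then rank by (source_diversity, count)."""
    merged: dict[str, DiscoveredEntity] = {}
    for source, names in by_source.items():
        for raw in names:
            key = normalise(raw)
            entity = merged.setdefault(key, DiscoveredEntity(name=key))
            entity.count += 1
            if source not in entity.sources:
                entity.sources.append(source)
    return sorted(merged.values(), key=lambda e: (-e.source_diversity, -e.count, e.name))
```

Passing the alias normaliser as the normalise argument keeps the merge stage independent of how canonical names are produced.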
Open questions (deferred)
- Auto-refresh on schedule? Once a landscape's chips are confirmed, should the harvester re-run weekly to flag new candidates? No, for this PR. Auto-mutation would surprise the analyst and break Monitor-mode briefing reproducibility. A future "alert on new candidate" feature can read from a separate discovery-log table without touching the confirmed list.
- CDE / Chinese-aggregator integration. These are new sources, separately scoped. The harvester scaffolding here accepts them when added: each new source plugs into the same _harvest_*_by_concept(landscape) shape and the merger picks them up automatically.
- Cross-modality landscapes (siRNA + antibody, etc.). The concept-query registry in ogur/engine/discovery_concepts.py is keyed on (scope_type, modality); combining modalities is a union over the two query lists. To be tested when the second modality-target landscape ships.
Acceptance criteria (mirrors plan §7)
- discover_for_modality_target(cardiometabolic-rnai-001) returns ≥30 unique companies (the deck listed 40; missing ≤10 is the noise floor).
- ≥80% of the deck's 40 companies appear with source_diversity ≥ 2.
- Chinese-origin sponsors (Suzhou Ribo, BeBetter, CSPC, Hengrui, Sanegene Bio) all surface; the patents-applicant axis is load-bearing here.
- Discovery completes in ≤90 seconds end-to-end on a warm cache.
- POST /api/explore/discover round-trip + frontend chip render works.
- The closed-world seed path (--no-discovery) still produces the PR #107 dump byte-identically when run against the same DB snapshot.
- Full test suite passes (currently 1057 tests; the harvester adds ~30 new ones).