ADR-0005: Sponsor / asset discovery harvester for modality-target landscapes

Status: Accepted
Date: 2026-06-08
Driver: Khalil
Related: PR #107 (closed-world coverage test that this work flips into discovery), ADR-0001 (Landscape.candidate_drugs is "derived, not authored"), docs/sirna_landscape_gap_analysis_brief.md §0 (closed-world vs discovery framing)

Context

PR #107 introduced scope_type="modality_target" landscapes and the staging row cardiometabolic-rnai-001. The PR's seed (scripts/seed/seed_cardiometabolic_rnai.py) hand-curates candidate_drugs (30 entries) and companies (40 entries) from a reference deck, then asks every Ogur source "do you have any signal for each named entity?" — a closed-world coverage test, not an Explore-mode discovery run.

This ADR specifies the discovery harvester that flips that mechanism: given only a thesis (modality + targets + indications), produce a ranked sponsor + drug list by projecting source hits onto applicant / sponsor / author-affiliation / filer fields, with source_diversity as the trust signal.

The harvester is the real Explore-mode answer to "the analyst doesn't know the players yet." It also closes the asymmetry called out in docs/sirna_landscape_gap_analysis_brief.md §0: PR #107's "13/32 assets, 2/12 deals" figure can only be interpreted relative to a list someone else collected. With the harvester, both axes of the gap analysis are inferred from the same input.

Decision

Build ogur/engine/discovery_modality.py — the modality-target analogue of scripts/seed/discover_competitors.py (which is indication-axis, OT-only). Five source adapters project hits onto a sponsor-like field; a merge stage produces ranked DiscoveredEntity rows; an async /api/explore/discover endpoint exposes the harvester to the Explore wizard.

Five design decisions were resolved up front and are captured here so the implementation has a single reference.


Decision 1 — Per-landscape patent CPC config (was §4.2)

Chosen: add Landscape.source_filters JSON column. Default {"patents": {"cpc_prefixes": ["A61"]}} preserves existing NSCLC / immunology behaviour. cardiometabolic-rnai-001 gets {"patents": {"cpc_prefixes": ["A61", "C12N"]}} because siRNA composition patents file under C12N15 (nucleic acid biotechnology) more often than A61 (therapeutics).

Not chosen:

  • Global broadening to (A61 OR C12N) — would add noise to immunology / NSCLC landscapes, which doesn't matter much in practice but is the wrong shape long-term.
  • Defer until the call site reveals which is cleaner — the call site is unambiguous (_CPC_PHARMA_PREFIX is referenced in _search_lens_with_must_clause and _search_epo_with_cql; both need to switch from a single prefix to a list).

Consequence: this PR pulls in what was "Proposal 2" from PR #107's discussion. Schema migration mirrors _ensure_modality_target_schema in seed_cardiometabolic_rnai.py — idempotent ALTER TABLE ... ADD COLUMN with a JSON default. Future per-source filter knobs (date windows, max-results caps) live in the same JSON blob without further schema churn.
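
The idempotent migration described above can be sketched as follows. This is a minimal illustration, not the project's migration: it assumes a SQLite database and a table named landscapes (both assumptions, since the ADR names neither), and the helper name ensure_source_filters_column is hypothetical.

```python
import json
import sqlite3

# Default taken from Decision 1: preserves the existing A61-only behaviour.
DEFAULT_SOURCE_FILTERS = {"patents": {"cpc_prefixes": ["A61"]}}


def ensure_source_filters_column(conn: sqlite3.Connection, table: str = "landscapes") -> bool:
    """Add source_filters if it is missing; return True when the column was added.

    The table name and the SQLite backend are assumptions made for this sketch.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "source_filters" in columns:
        return False  # already migrated, so a second run is a no-op
    default = json.dumps(DEFAULT_SOURCE_FILTERS)
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN source_filters TEXT NOT NULL DEFAULT '{default}'"
    )
    conn.commit()
    return True
```

Existing rows read back the JSON default, so landscapes that never set source_filters keep the A61-only filter without a data backfill.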

Plumbing notes:

  • _CPC_PHARMA_PREFIX = "A61" becomes a method that reads from the landscape's source_filters.
  • Both _search_lens_* (class_cpc.symbol prefix-match) and _search_epo_* (CQL cl=<prefix> operator) accept a list and OR the prefixes together ({"bool": {"should": [{"prefix": ...}, ...]}} for Lens; cl=A61 OR cl=C12N for EPO CQL).
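
A rough sketch of the prefix plumbing, under stated assumptions: the three helper names below are hypothetical, and the parentheses around the EPO clause are an added precaution for when the clause is combined with other CQL terms. Only the class_cpc.symbol field, the bool/should shape, and the cl= operator come from the notes above.

```python
from typing import Optional

DEFAULT_CPC_PREFIXES = ["A61"]  # matches the documented default


def cpc_prefixes(source_filters: Optional[dict]) -> list[str]:
    """Read the CPC prefix list from a landscape's source_filters blob."""
    patents = (source_filters or {}).get("patents") or {}
    return list(patents.get("cpc_prefixes") or DEFAULT_CPC_PREFIXES)


def lens_cpc_clause(prefixes: list[str]) -> dict:
    """OR the prefixes together as a bool/should of prefix matches (Lens)."""
    return {"bool": {"should": [{"prefix": {"class_cpc.symbol": p}} for p in prefixes]}}


def epo_cpc_clause(prefixes: list[str]) -> str:
    """OR the prefixes together with the cl= operator (EPO CQL)."""
    joined = " OR ".join(f"cl={p}" for p in prefixes)
    # Parenthesise multi-prefix clauses so they can be ANDed with other terms.
    return f"({joined})" if len(prefixes) > 1 else joined
```

With this shape, a landscape that has no source_filters value behaves exactly like the current single-prefix code path.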

Decision 2 — Async API with in-memory job store (was §4.3)

Chosen: POST /api/explore/discover returns 202 + {job_id, status_url}. GET /api/explore/discover/{job_id} returns {status, drugs, companies, ...}. Job store is cachetools.TTLCache keyed by job_id with a 15-minute TTL; a separate keyed cache (hash of (scope_type, modalities, targets, indications) tuple) memoizes completed results so a re-run of the same thesis returns instantly.

Not chosen:

  • Synchronous endpoint: a run is expected to take ~60s (5 sources × 10 queries, each followed by a 200ms sleep, plus parse overhead), and a 60-second blocking HTTP request behind a wizard form is bad UX.
  • Sync first, async in a follow-up — costs us nothing to ship async now and the frontend shape (polling, progress indicator) becomes the right shape from M1.

Consequence: the FastAPI route uses BackgroundTasks (mirroring the existing briefings.py pattern for /briefing/{landscape_id} trigger). No Redis / Celery added; if the API ever scales horizontally, the only change is swapping TTLCache for a shared store. The endpoint contract stays identical.

Job store shape:

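from __future__ import annotations  # DiscoveryResult is defined elsewhere and resolved lazily

from dataclasses import dataclass
from datetime import datetime
from typing import Literal

from cachetools import TTLCache

@dataclass
class DiscoveryJob:
    job_id: str
    status: Literal["running", "complete", "failed"]
    submitted_at: datetime
    completed_at: datetime | None  # None while the job is still running
    result: DiscoveryResult | None  # set once status == "complete"
    error: str | None  # set once status == "failed"

# Polling state keyed by job_id; entries expire 15 minutes after insertion.
_JOBS: TTLCache[str, DiscoveryJob] = TTLCache(maxsize=200, ttl=900)  # 15 min
# Completed results keyed by the thesis hash, so an identical re-run returns instantly.
_RESULT_CACHE: TTLCache[str, DiscoveryResult] = TTLCache(maxsize=50, ttl=900)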
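
One way the thesis hash and the job lifecycle might fit together is sketched below. Several things here are assumptions for illustration: plain dicts stand in for the two TTLCache instances, the harvest runs inline rather than as a background task, list order within the thesis is treated as irrelevant, and the names thesis_cache_key and submit are hypothetical.

```python
import hashlib
import json
import uuid

jobs: dict[str, dict] = {}     # stand-in for _JOBS (no TTL in this sketch)
results: dict[str, dict] = {}  # stand-in for _RESULT_CACHE


def thesis_cache_key(scope_type, modalities, targets, indications) -> str:
    """Stable hash of the thesis tuple; lists are sorted so order does not matter."""
    payload = [scope_type, sorted(modalities), sorted(targets), sorted(indications)]
    return hashlib.sha256(json.dumps(payload).encode("utf-8")).hexdigest()


def submit(thesis: dict, run_harvest) -> dict:
    """Register a job and return the 202-style body {job_id, status_url}."""
    key = thesis_cache_key(**thesis)
    job_id = uuid.uuid4().hex
    if key in results:
        # Same thesis seen before: the job is complete immediately.
        jobs[job_id] = {"status": "complete", "result": results[key], "error": None}
    else:
        jobs[job_id] = {"status": "running", "result": None, "error": None}
        # A real route would schedule this instead of running it inline.
        try:
            results[key] = run_harvest(thesis)
            jobs[job_id].update(status="complete", result=results[key])
        except Exception as exc:
            jobs[job_id].update(status="failed", error=str(exc))
    return {"job_id": job_id, "status_url": f"/api/explore/discover/{job_id}"}
```

Keeping the result cache separate from the job store means every POST still gets its own job_id to poll, even when the result is served from cache.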

Decision 3 — OpenTargets modality filter via keyword-match on mechanismOfAction (was §4.1)

Chosen: for each target in landscape.targets, query Open Targets' target → known-drugs path (target(ensemblId: ...) { knownDrugs { rows { drug { name, mechanismsOfAction { rows { mechanismOfAction } } } } } }); filter rows to those whose mechanismOfAction strings contain any landscape-configured modality keyword (siRNA, RNA interference, small interfering RNA, antisense, etc.), with a noise-token blocklist.

Not chosen:

  • Filter by drug.drugType — OT's drugType is sparse for preclinical/disclosed-only candidates (the regime this PR cares about most). It'd give us false negatives on exactly the Chinese-biotech assets PR #107 highlighted.
  • Both (drugType first, mechanism fallback) — extra LOC for marginal precision gain when keyword-match is already permissive on purpose.

Consequence: noise-token blocklist mirrors _DEAL_FOCUS_NOISE_TOKENS in scripts/utils/inspect_sirna_seed.py — receptor / cytokine / checkpoint tokens that look like modality keywords but aren't (e.g. "RNA polymerase II", "RNA polymerase III" — not siRNA).
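
The keyword filter with a blocklist could look roughly like the sketch below. The keyword tuple comes from the list above; the single noise token, the masking strategy (blank out noise phrases, then match), and both function names are assumptions for illustration. The row shape follows the knownDrugs query quoted in this decision.

```python
MODALITY_KEYWORDS = ("sirna", "rna interference", "small interfering rna", "antisense")
NOISE_TOKENS = ("rna polymerase",)  # illustrative; the real blocklist is longer


def matches_modality(moa: str, keywords=MODALITY_KEYWORDS, noise=NOISE_TOKENS) -> bool:
    """True if a mechanismOfAction string hits a keyword outside any noise phrase."""
    text = moa.lower()
    for token in noise:
        text = text.replace(token, " ")  # mask noise so it cannot satisfy a keyword
    return any(keyword in text for keyword in keywords)


def filter_known_drugs(rows: list[dict], keywords=MODALITY_KEYWORDS, noise=NOISE_TOKENS) -> list[str]:
    """Return drug names whose mechanisms of action match the modality."""
    names: list[str] = []
    for row in rows:
        drug = row.get("drug") or {}
        moa_rows = (drug.get("mechanismsOfAction") or {}).get("rows") or []
        moas = [m.get("mechanismOfAction", "") for m in moa_rows]
        name = drug.get("name")
        if name and name not in names and any(matches_modality(m, keywords, noise) for m in moas):
            names.append(name)
    return names
```

The blocklist only earns its keep when the keyword list includes permissive terms: with a hypothetical bare "rna" keyword, "RNA polymerase II inhibitor" would match unless the noise phrase is masked first.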

Important side-finding: ogur/sources/opentargets.py::_EFO_IDS has no cardiometabolic entries (hyperlipidemia, ASCVD, hypertension, MASH, obesity, cardio-renal). The existing OpenTargetsSource.fetch() on cardiometabolic-rnai-001 falls back to settings.opentargets_disease_id (NSCLC), which is why the PR #107 dump reported "OT: 1/32 assets, 578 signals" — those signals are mostly NSCLC noise tagged with the wrong landscape_id. Out of scope for this PR, but flagged in docs/sirna_landscape_discovery_vs_deck.md as a known gap. The harvester itself sidesteps this because it queries OT by target, not by indication.

Decision 4 — UI filter: ≥2 source diversity by default + "show single-source" toggle (was §4.4)

Chosen: the wizard renders only entities with source_diversity ≥ 2 by default, with an explicit toggle to surface single-source candidates. The harvester returns all candidates — the filter happens client-side in ClarifyingQuestionsPanel.tsx. Each entity carries sources: list[str] so the chip can show provenance on hover.

Not chosen:

  • Show everything, sort by diversity — risks 50+ chip overload when 15 entities are the real signal.
  • ≥3 sources — misses entities only confirmed by 2 sources, which the §7 acceptance criteria explicitly target (Suzhou Ribo + BeBetter often only hit EPO + CT.gov, both still trustworthy).

Consequence: honors §2 principle #4 ("Human-in-the-loop is a feature") — the analyst sees what the system filtered out via the toggle, not by guessing. The single-source view is the right place to find long-tail sponsors (Chinese-only biotechs whose only Western footprint is one OpenAlex affiliation).
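
To make the default filter and the toggle concrete, here is a small sketch of merge, rank, and filter. It is written in Python for consistency with the rest of this ADR even though the real filter lives client-side. The DiscoveredEntity fields beyond sources, the hit_count tiebreak, and the function names are assumptions; names are taken to be alias-normalised already.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveredEntity:
    name: str
    sources: list[str] = field(default_factory=list)  # provenance shown on hover
    hit_count: int = 0  # assumed tiebreak: total hits across sources

    @property
    def source_diversity(self) -> int:
        return len(set(self.sources))


def merge_and_rank(hits_by_source: dict[str, list[str]]) -> list[DiscoveredEntity]:
    """Merge per-source name lists, then rank by (source_diversity, hit_count)."""
    merged: dict[str, DiscoveredEntity] = {}
    for source, names in hits_by_source.items():
        for name in names:
            entity = merged.setdefault(name, DiscoveredEntity(name=name))
            entity.hit_count += 1
            if source not in entity.sources:
                entity.sources.append(source)
    return sorted(merged.values(), key=lambda e: (-e.source_diversity, -e.hit_count, e.name))


def visible(entities, show_single_source: bool = False, min_diversity: int = 2):
    """Default view keeps entities seen by >= 2 sources; the toggle shows all."""
    return [e for e in entities if show_single_source or e.source_diversity >= min_diversity]
```

Because the harvester returns every candidate and the threshold is applied only at display time, flipping the toggle needs no second request.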

Decision 5 — Hand-curated alias normaliser (was §4.5)

Chosen: module-level helper that pipes name candidates through (1) lowercase, (2) strip legal-suffix tokens (Inc / Ltd / Co / LLC / Pharma / Therapeutics / Pharmaceuticals / GmbH / plc), (3) lookup in a hand-curated _ALIASES dict for known multi-spelling entities ("Suzhou Ribo Life Science" / "Suzhou Ribo" / "苏州瑞博" → canonical "Suzhou Ribo Life Science"). Lifted from the existing _company_match_tokens logic in scripts/utils/inspect_sirna_seed.py.

Not chosen:

  • Open Targets drug.crossReferences.companies[] as canonical ID space — better long-term, but misses entities OT doesn't know about (small private Chinese biotechs, preclinical-only sponsors). Bigger refactor for marginal initial gain.
  • Alias table + expose unmerged variants under entity.aliases[] — ~5 LOC for a feature whose primary consumer (the analyst eyeballing the wizard) doesn't need it.

Consequence: the alias dict ships small (~20 entries for the cardiometabolic landscape) and grows organically per landscape. Future migration to OT-canonical IDs is a swap of the normaliser implementation without changing the call site.
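
The three-step normaliser might be sketched as below. The suffix list and the Suzhou Ribo aliases come from the description above; stripping suffixes only from the end of the name, splitting on whitespace, commas, and periods, and returning the lowercased key for unknown names are assumptions of this sketch, as is the function name.

```python
import re

_LEGAL_SUFFIXES = {
    "inc", "ltd", "co", "llc", "pharma", "therapeutics", "pharmaceuticals", "gmbh", "plc",
}

# Keys are already lowercased and suffix-stripped.
_ALIASES = {
    "suzhou ribo life science": "Suzhou Ribo Life Science",
    "suzhou ribo": "Suzhou Ribo Life Science",
    "苏州瑞博": "Suzhou Ribo Life Science",
}


def normalise_company(name: str) -> str:
    """Lowercase, strip trailing legal suffixes, then map known aliases."""
    tokens = [t for t in re.split(r"[\s,.]+", name.lower()) if t]
    stripped = list(tokens)
    while stripped and stripped[-1] in _LEGAL_SUFFIXES:
        stripped.pop()  # drop "co", "ltd", ... from the end only
    # If the whole name consisted of suffix tokens, keep the original tokens.
    key = " ".join(stripped or tokens)
    return _ALIASES.get(key, key)
```

Because callers only see normalise_company, swapping the dict lookup for an OT-canonical ID lookup later would not change any call site.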

Why not extend discover_competitors.py

scripts/seed/discover_competitors.py is indication-axis (queries OT's disease(efoId: ...) { drugAndClinicalCandidates }) and OT-only. The harvester is modality-axis (queries 5 sources, projects onto sponsor fields) and target-axis (queries OT's target(ensemblId: ...) { knownDrugs }). The two scripts share zero query bodies. They share a philosophy (candidate_drugs is "derived, not authored"), and the harvester's output writes into the same column.

discover_competitors.py stays unchanged. Existing indication-axis landscapes (NSCLC, immunology) keep calling it; modality-target landscapes call discover_for_modality_target instead. The validation function in ogur/engine/discovery_concepts.py raises if a caller mismatches scope_type.

Architecture

                                       Landscape row
                              (modalities, targets, indications,
                              source_filters, scope_type=="modality_target")
                                         │
                                         ▼
                       ┌────────── discovery_modality ──────────┐
                       │                                        │
                       │  asyncio.gather over 5 source adapters:│
                       │  ─────────────────────────────────────│
                       │  patents._harvest_applicants_by_concept│
                       │  clinicaltrials._harvest_sponsors_…    │
                       │  openalex._harvest_affiliations_…      │
                       │  sec._harvest_filers_by_concept        │
                       │  opentargets._harvest_drugs_by_target_…│
                       │                                        │
                       │            ▼                           │
                       │      raw entities by source            │
                       │            │                           │
                       │            ▼                           │
                       │  alias-normalise + merge by name       │
                       │            │                           │
                       │            ▼                           │
                       │  rank (source_diversity, count)        │
                       └─────────────────┬──────────────────────┘
                                         │
                                         ▼
                              DiscoveryResult { drugs, companies, by_source }
                                         │
                                         ▼
                            POST /api/explore/discover  →  job_id (202)
                            GET  /api/explore/discover/{job_id}  →  full result
                                         │
                                         ▼
                                 ResearchWizard.tsx
                          → editable chips in ClarifyingQuestionsPanel
                          → confirm writes to landscape.candidate_drugs
                                  + landscape.companies
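
The fan-out step at the top of the box can be sketched as follows. This is an illustration under assumptions: adapters are modelled as async callables taking the landscape, the name harvest_all is hypothetical, and treating a failed source as an empty list (via return_exceptions=True) is a design choice of this sketch, not something the ADR specifies.

```python
import asyncio


async def harvest_all(landscape: dict, adapters: dict) -> dict[str, list[str]]:
    """Run every source adapter concurrently and collect raw entities by source."""
    names = list(adapters)
    outcomes = await asyncio.gather(
        *(adapters[name](landscape) for name in names),
        return_exceptions=True,  # assumption: one failing source must not sink the run
    )
    by_source: dict[str, list[str]] = {}
    for name, outcome in zip(names, outcomes):
        # A failed adapter contributes no entities rather than aborting the harvest.
        by_source[name] = [] if isinstance(outcome, Exception) else list(outcome)
    return by_source
```

A new source then only needs to be registered in the adapters mapping; the merge stage consumes the by_source dict without knowing which adapters produced it.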

Open questions (deferred)

  1. Auto-refresh on schedule? Once a landscape's chips are confirmed, should the harvester re-run weekly to flag new candidates? No, for this PR. Auto-mutation would surprise the analyst and break Monitor-mode briefing reproducibility. A future "alert on new candidate" feature can read from a separate discovery-log table without touching the confirmed list.
  2. CDE / Chinese-aggregator integration. These are new sources, separately scoped. The harvester scaffolding here accepts them when added — each new source plugs into the same _harvest_*_by_concept(landscape) shape and the merger picks them up automatically.
  3. Cross-modality landscapes (siRNA + antibody, etc.). The concept-query registry in ogur/engine/discovery_concepts.py is keyed on (scope_type, modality); combining modalities is a union over the two query lists. This will be tested when the second modality-target landscape ships.
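
As a rough illustration of item 3, a registry keyed on (scope_type, modality) with union semantics might look like this. The registry contents, the query strings, and the function name queries_for are all hypothetical; only the key shape and the union behaviour come from the text above.

```python
# Hypothetical registry contents, for illustration only.
CONCEPT_QUERIES: dict[tuple[str, str], list[str]] = {
    ("modality_target", "siRNA"): ["siRNA", "RNA interference", "small interfering RNA"],
    ("modality_target", "antibody"): ["monoclonal antibody", "RNA interference"],
}


def queries_for(scope_type: str, modalities: list[str]) -> list[str]:
    """Union of the per-modality query lists, de-duplicated in first-seen order."""
    combined: list[str] = []
    for modality in modalities:
        key = (scope_type, modality)
        if key not in CONCEPT_QUERIES:
            # Fail loudly when a caller asks for an unregistered combination.
            raise ValueError(f"no concept queries registered for {key}")
        for query in CONCEPT_QUERIES[key]:
            if query not in combined:
                combined.append(query)
    return combined
```

De-duplicating while preserving order keeps the number of outbound queries bounded when two modalities share a concept term.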

Acceptance criteria (mirrors plan §7)

  • discover_for_modality_target(cardiometabolic-rnai-001) returns ≥30 unique companies (the deck listed 40; missing ≤10 is the noise floor).
  • ≥80% of the deck's 40 companies appear with source_diversity ≥ 2.
  • Chinese-origin sponsors (Suzhou Ribo, BeBetter, CSPC, Hengrui, Sanegene Bio) all surface — the patents-applicant axis is load-bearing here.
  • Discovery completes in ≤90 seconds end-to-end on a warm cache.
  • POST /api/explore/discover round-trip + frontend chip render works.
  • The closed-world seed path (--no-discovery) still produces the PR #107 dump byte-identically when run against the same DB snapshot.
  • Full test suite passes (currently 1057 tests; the harvester adds ~30 new ones).