Autonomous Explore-mode discovery — pressure-test read-out

What this is. A read-out of the open-world discovery pressure test built in scripts/eval/sirna_autonomous_discovery.py. It answers a question the closed-world coverage test (scripts/utils/inspect_sirna_seed.py, PR #107/#113) structurally cannot: given only a one-sentence thesis — not the analyst's hand-curated list of 30 drugs + 40 companies — can Ogur independently re-derive the entity coverage of the "Dual Targeting siRNA" consulting deck (32 assets, 12 deals)?

Run it.

export OGUR_GAP_ASSETS_CSV="…/16_benchmark_ogur_vs_sanofi_siRNA…01_Assets.csv"
export OGUR_GAP_DEALS_CSV="…/16_benchmark_ogur_vs_sanofi_siRNA…02_Deals.csv"
# Fast, deterministic, free — concept-searches the existing 11,614-signal corpus:
uv run python scripts/eval/sirna_autonomous_discovery.py \
  "Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease" \
  --corpus db --rounds 3
# The real test — fires concept queries at the live source APIs:
uv run python scripts/eval/sirna_autonomous_discovery.py "<thesis>" --corpus live --rounds 2

Artifacts land in archived_data/: sirna_discovered_companies_<date>.csv, sirna_discovered_assets_<date>.csv, sirna_gap_analysis_<date>.md (the per-run gap report, with a reproducibility line). --no-llm swaps the LLM decomposition for the deterministic fallback so a run is free and reproducible.


How it works

thesis ─▶ (1) decompose ─▶ (2) generate concept queries ─▶ (3) discover across 5 sources
              (5) score vs deck ◀─ (4) iterate: round N+1 queries seeded by round N entities
  1. Decompose (Sonnet, or a deterministic fallback) the sentence into a structured scope — modalities × modality_subclass × targets × indications × horizon. This is the analyst-overridable intermediate (mirrors the frontend's ClarifyingQuestionsPanel); it is written to the gap report's reproducibility line.
  2. Generate concept queries — cross-product of modality × {target, indication} plus the subclass phrasings ("dual-targeting", "divalent", "bispecific"), capped per round.
  3. Discover — fire the queries at the five sources via their existing concept-capable entry points, projecting each hit onto a sponsor-like field: CT.gov leadSponsor.name, patent applicants[].extracted_name, OpenAlex institutions[].display_name, SEC filer display_names[0], OpenTargets drugs by target (target(ensemblId){knownDrugs}, filtered on mechanismOfAction). Entities are alias-normalised and research orgs dropped.
  4. Iterate — round 2 seeds per-discovered-sponsor queries, gated by a trust threshold (source diversity) so round-1 noise isn't amplified. Converge when M consecutive rounds add no new entity.
  5. Score — reuse the closed-world matcher (scripts/eval/sirna_discovery/ shared with inspect_sirna_seed.py) in reverse: recall vs the deck, precision of what we surfaced, the four-bucket matrix, and the FN gap partitioned into corpus ceiling vs discovery-missed.
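Step 2 can be illustrated with a minimal sketch. It assumes the decomposed scope is available as plain lists; the function name, its arguments, and the choice to put subclass phrasings ahead of the cross-product are hypothetical, not taken from sirna_autonomous_discovery.py.

```python
from itertools import product


def generate_concept_queries(modalities, targets, indications, subclasses, cap=20):
    """Build one round of concept queries from a decomposed scope (illustrative only)."""
    queries = []
    # Subclass phrasings go first so the cap never drops the narrowest concept
    # (an ordering chosen for this sketch, not taken from the script).
    for subclass, modality in product(subclasses, modalities):
        queries.append(f"{subclass} {modality}")
    # Cross-product of modality x {target, indication}.
    for modality, facet in product(modalities, [*targets, *indications]):
        queries.append(f"{modality} {facet}")
    # De-duplicate while preserving order, then apply the per-round cap.
    return list(dict.fromkeys(queries))[:cap]


scope = {
    "modalities": ["siRNA"],
    "targets": ["PCSK9", "ANGPTL3"],
    "indications": ["cardiometabolic disease"],
    "subclasses": ["dual-targeting", "divalent", "bispecific"],
}
round_1 = generate_concept_queries(**scope, cap=5)
print(round_1)
```

With a cap of 5 the indication query is the one that gets cut, which shows why the ordering ahead of the cap matters as much as the cross-product itself.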

Results

| metric | live (real test) | db (corpus proxy) | closed-world baseline |
| --- | --- | --- | --- |
| company recall | 4/36 (11%) | 9/36 (25%) | |
| asset recall (sponsor-attribution) | 2/32 (6%) | 6/32 (19%) | 15/32 (47%) |
| asset recall (code-level) | 4/32 (12%) | 3/32 (9%) | |
| deal recall | 0/12 (0%) | 0/12 (0%) | 5/12 (42%) |
| company precision | 50% | 33% | |

Two recall bars are reported deliberately: sponsor-attribution (surfacing "Suzhou Ribo" credits its deck assets) vs code-level (must surface "SR122" itself). The gap between them isolates discovery failures from code-disclosure failures.
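The difference between the two bars can be made concrete with a small sketch. The deck below is toy data (the asset codes and sponsor names are invented placeholders), and recall_bars is a hypothetical helper, not the shared matcher.

```python
def recall_bars(deck_assets, surfaced_sponsors, surfaced_codes):
    """Return (sponsor-attribution recall, code-level recall).

    deck_assets is a list of (asset_code, sponsor) pairs.
    """
    n = len(deck_assets)
    # Sponsor-attribution: an asset counts once its sponsor has been surfaced.
    sponsor_hits = sum(1 for _code, sponsor in deck_assets if sponsor in surfaced_sponsors)
    # Code-level: the asset code itself must have been surfaced.
    code_hits = sum(1 for code, _sponsor in deck_assets if code in surfaced_codes)
    return sponsor_hits / n, code_hits / n


# Toy deck: placeholder codes and sponsors, not the real 32 assets.
toy_deck = [
    ("TOY-001", "Toy Bio A"),
    ("TOY-002", "Toy Bio A"),
    ("TOY-003", "Toy Bio B"),
    ("TOY-004", "Toy Bio C"),
]
sponsor_recall, code_recall = recall_bars(toy_deck, {"Toy Bio A"}, {"TOY-003"})
print(sponsor_recall, code_recall)
```

In the toy run, finding one sponsor credits two assets while only one code was surfaced directly, so the bars diverge exactly where codes are undisclosed in the sources.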

Post-review refinement (PR #117 Codex review)

Three changes from the review, and what each measured — instructive because two of them moved nothing, which is the point:

  • EPO fallback projection bug fixed. _live_patents() kept the Lens projector after falling back to EPO, whose flat record shape ({applicant, …}) the Lens projector silently drops — so the patent axis was doubly dead (Lens 401 and mis-projection). Fixed (_project_epo()). Live recall unmoved (11%) — the patent axis is now honestly measured as weak on this landscape (EPO returns almost nothing for these concept phrases), not silently broken.
  • Trust-gate relaxed + round-2 kept on-modality. Letting recurring single-source sponsors seed round 2 floods the results unless the per-sponsor sweep is filtered to the sponsor's siRNA records (unfiltered, it returns their whole trial book). With the on-modality filter, live precision rose 36% → 50% with recall held; without it, precision cratered to 5%.
  • db snapped back to baseline. The on-modality filter pruned exactly the entities the relaxed gate tried to add — because the addressable misses lack a modality token in their text. This is the same retrieval wall, proven from both directions: every keyword knob trades precision against recall but cannot break the ~11% live ceiling. Only ingestion-time enrichment (tagging modality on the signal, where the raw text doesn't) escapes it — see §"Verdict".
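The interaction between the relaxed gate and the on-modality filter can be sketched as follows. Everything here is an assumption made for illustration: the token list, the thresholds, the function names, and the record shape are not taken from the script.

```python
from collections import defaultdict

# Assumed token list; the real filter's vocabulary is not shown in this read-out.
MODALITY_TOKENS = ("sirna", "rnai", "small interfering rna")


def seed_sponsors(hits, min_sources=2, min_recurrences=None):
    """Pick round-2 seeds from round-1 (sponsor, source) hits.

    Strict gate: sponsor seen in >= min_sources distinct sources.
    Relaxed gate: additionally admit sponsors recurring >= min_recurrences times.
    """
    sources = defaultdict(set)
    counts = defaultdict(int)
    for sponsor, source in hits:
        sources[sponsor].add(source)
        counts[sponsor] += 1
    seeds = set()
    for sponsor in counts:
        diverse = len(sources[sponsor]) >= min_sources
        recurring = min_recurrences is not None and counts[sponsor] >= min_recurrences
        if diverse or recurring:
            seeds.add(sponsor)
    return seeds


def on_modality(records):
    """Keep only per-sponsor records whose text carries a modality token."""
    return [r for r in records if any(tok in r["text"].lower() for tok in MODALITY_TOKENS)]


hits = [("A", "ctgov"), ("A", "openalex"), ("B", "ctgov"), ("B", "ctgov"), ("C", "ctgov")]
strict = seed_sponsors(hits)
relaxed = seed_sponsors(hits, min_recurrences=2)

records = [
    {"text": "Phase 1 study of an siRNA targeting ANGPTL3"},
    {"text": "Phase 3 oncology antibody trial"},
    {"text": "TOY-001 in dyslipidemia"},  # relevant asset, but no modality token
]
kept = on_modality(records)
```

The third record is the failure mode described above: it may well be an on-deck asset, but with no modality token in its text the filter prunes it along with the noise.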

What worked

  • Decomposition is the strong link. From the one sentence alone the system recovered 5 of the deck's 6 target genes (PCSK9, ANGPTL3, APOC3, LPA, AGT) and the load-bearing "dual-targeting" subclass — the hard inference the naïve modalities=["siRNA"], indications=["cardiometabolic disease"] call misses.
  • It found a player the deck missed. The FP-interest pass surfaced Eddingpharm's EDP167, described as "a double-stranded siRNA targeting ANGPTL3 … for dyslipidemia": a genuine dual-axis RNAi-cardiometabolic asset absent from the analyst's 32. This is the headline value of an open-world tool: it is not bounded by the curator's prior knowledge. (Off-deck, the run also surfaced Silence Therapeutics, Ionis/Isis, and The Medicines Company, inclisiran's originator, all of which a manual analyst pass would promote.)
  • Precision is respectable once research-institution noise is filtered (16% → 36% live after the affiliation filter learned non-English institution forms — Inserm, Erasmus MC, Hôpitaux, etc.).

What didn't — and why

  • Live recall is lower than the corpus proxy (11% vs 25% companies). This is the central finding. The 11,614-signal corpus was built by entity-keyed closed-world queries plus broad per-indication sweeps, so it already contains more of the right signals than a fresh concept-keyed live sweep reaches. The bottleneck is not the corpus — it is translating scope into queries that hit.
  • The patent axis is dead. Lens.org has been 401-ing since PR #107; the EPO fallback 404s on concept phrases and is weak on Chinese applicants. Patents first-surfaced 0 deck companies. For a China-heavy landscape this is the single most damaging gap — composition patents are where preclinical Chinese siRNA assets would otherwise appear.
  • OpenTargets-by-target returns 0 asset codes. OT's drugType/known-drugs data is sparse for the disclosed-only, preclinical Chinese candidates this deck is built from — exactly the regime where it adds least.
  • Deal recall is 0. Deals need co-occurrence evidence (both counterparties in one filing, or bilateral SEC cross-filings). Concept-filtered backing signals don't carry that, and SEC EDGAR is US-listed only — 7 of 12 deals have a China-only counterparty.
  • Only CT.gov (3) and OpenAlex (1) first-surfaced any deck company.

The gap is mostly not a discovery failure

Partitioning the 32 assets:

  • 16 — corpus ceiling. 11 code-only Chinese biotech assets (Pattern A) + 5 placeholder/private-sponsor assets (Pattern B) have no Western public footprint in any of our five sources. No query strategy recovers these; they need CDE / Chinese aggregators / HKEX (new sources, out of scope here). This caps achievable recall at ~50% against the current source set.
  • 11 — discovery-missed (addressable). Evidence for these exists in the corpus but the concept queries didn't reach it (e.g. BEBT-701, ARO-DIMER-PA, RNS681, SR122). This is the real "discovery intelligence" gap and the right place to invest.
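The partition and the ceiling arithmetic can be written out as a short sketch. The helper name and its inputs are hypothetical; only the counts (32 assets, 16 ceiling, 11 addressable) come from this read-out.

```python
def partition_misses(missed_assets, assets_with_corpus_evidence):
    """Split missed assets into corpus-ceiling vs discovery-missed.

    An asset is 'ceiling' when none of the current sources hold any evidence for it,
    and 'addressable' when evidence exists but the concept queries did not reach it.
    """
    ceiling = [a for a in missed_assets if a not in assets_with_corpus_evidence]
    addressable = [a for a in missed_assets if a in assets_with_corpus_evidence]
    return ceiling, addressable


# Toy check of the helper (placeholder asset names).
ceiling, addressable = partition_misses(["X1", "X2", "X3"], {"X2"})

# Ceiling arithmetic using the counts reported above.
deck_size, ceiling_count, addressable_count = 32, 16, 11
max_recall_current_sources = (deck_size - ceiling_count) / deck_size
addressable_share = addressable_count / deck_size
print(max_recall_current_sources, round(addressable_share, 2))
```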

Verdict and what would move the number

An autonomous Explore backend is viable for scope capture and net-new discovery, not yet for recall parity with a hand-curated deck on this landscape. The thesis→scope step is production-grade; the scope→query→reach step is where recall leaks, and the dominant single fix is external (Lens renewal).

Ranked levers for the addressable 11:

  1. Renew Lens.org — restores the patent-applicant axis, the load-bearing one for Chinese composition filings. Highest expected uplift, lowest engineering.
  2. Sharper CT.gov concept→param mapping — drive query.intr=<modality> × query.cond=<indication> rather than a single query.term, and promote round-2 per-sponsor sweeps more aggressively (the in-corpus evidence for the 11 discovery-missed assets is mostly CT.gov).
  3. A co-occurrence pass for deals — score acquirer×licensor co-mention directly (as the closed-world deal matcher does) instead of relying on concept-filtered backing signals.
  4. New sources for the ceiling 16 — CDE (chinadrugtrials.org.cn), Chinese pharma aggregators, HKEX. Separately scoped; the harness already accepts new _harvest_*_by_concept adapters.
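Lever 2 could look roughly like the sketch below, which only builds request URLs and sends nothing. It assumes the ClinicalTrials.gov v2 studies endpoint and its query.intr, query.cond, and pageSize parameters; the function name and the page size are illustrative choices, not the harness's current behaviour.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed base URL for the ClinicalTrials.gov v2 studies endpoint.
CTGOV_STUDIES = "https://clinicaltrials.gov/api/v2/studies"


def ctgov_concept_urls(modalities, indications, page_size=50):
    """One URL per modality x indication pair, instead of a single query.term."""
    urls = []
    for modality, indication in product(modalities, indications):
        params = {
            "query.intr": modality,    # intervention axis carries the modality
            "query.cond": indication,  # condition axis carries the indication
            "pageSize": page_size,
        }
        urls.append(f"{CTGOV_STUDIES}?{urlencode(params)}")
    return urls


urls = ctgov_concept_urls(["siRNA"], ["dyslipidemia", "cardiometabolic disease"])
for url in urls:
    print(url)
```

Splitting the concept across the two fields lets each token match where a trial record actually stores it, rather than hoping both appear in the free-text index.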

See archived_data/sirna_gap_analysis_<date>.md for the per-run matrix and docs/sirna_landscape_gap_analysis_brief.md for the Pattern A/B/C source-gap analysis this read-out's ceiling partition cross-references.


De-circularization experiment (2026-06-19)

Question: the headline db numbers search a corpus that was seeded by querying the deck's own 30 drug codes + 40 company names by name (across all sources, incl. the entity-keyed HKEX/CNInfo). So "open-world db recall" partly re-finds what entity-keyed seeding put there. How much of it is circularity?

Method: run live discovery (concept queries → live APIs, no pre-seeded corpus, LLM decompose, round-1 coverage fix) and compare to the entity-seeded db (22.4k corpus).

| metric | circular db (entity-seeded, all sources) | de-circularized live (concept-only, no corpus) |
| --- | --- | --- |
| asset recall (sponsor) | 38% (12/32) | 9% (3/32) |
| company recall | 39% (14/36) | 19% (7/36) |
| deal recall | 42% (5/12) | 0% (0/12) |

Findings:

  • ~75% of asset recall and 100% of deal recall was circularity: the corpus was pre-stocked with the answer. The honest open-world number is 9% asset / 19% company / 0% deal.
  • Fixing EPO (#119) did not move the open-world number (9%→9%). EPO's working coverage is entity-keyed (applicant search): round-2 applicant queries returned thousands of hits for big-pharma names (Novartis 6,442; BMS 4,590) while concept full-text queries ("dual-targeting siRNA PCSK9") return ~zero. EPO/HKEX/CNInfo are closed-world (you-know-the-company) sources; they cannot surface an unknown company from a concept.
  • The cap on open-world recall is the concept→company bridge. Deals (0%) live only in entity-keyed filings; recovering them requires deriving the counterparty list from the concept scope (via OT / CT.gov / SEC full-text / the Target→Company graph), then feeding those names to the entity-keyed sources in round 2. This is the next lever.
  • Round-2 expansion is currently near-useless (+1 novel): it expands high-diversity big-pharma names, not the dual-siRNA innovators; the OT-by-target axis leaks on LLM-proposed genes with no Ensembl mapping (DGAT2, LDLR, ANGPTL4, HSD17B13).
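The circularity shares follow directly from the hit counts in the table. As a small worked check (the helper name is hypothetical):

```python
def circular_share(db_hits, live_hits):
    """Fraction of the db-mode hits that the concept-only live run did not reproduce."""
    return (db_hits - live_hits) / db_hits


asset_share = circular_share(12, 3)    # 12/32 in db mode vs 3/32 live
company_share = circular_share(14, 7)  # 14/36 in db mode vs 7/36 live
deal_share = circular_share(5, 0)      # 5/12 in db mode vs 0/12 live
print(asset_share, company_share, deal_share)
```

By the same arithmetic, half of the db-mode company recall is attributable to the entity-seeded corpus.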

Artifacts: archived_data/sirna_decirc_live_{companies,gap}_20260619.*.


The concept→company bridge (2026-06-19)

Why: open-world deal recall was 0/12 because the entity-keyed China filing sources (where China-China deals live) weren't in the discovery loop, and no free external target→company ontology exists (OpenTargets has no company field; verified by schema introspection). So we built the bridge ourselves, in three parts:

  1. cn_filings (HKEX/CNInfo) wired into round-2 per-sponsor (_live_cn_filings).
  2. Self-built Target→Company derivation (derive_companies_from_corpus): scan our corpus for companies co-occurring with a modality + target/indication token → ~40 companies (vs round-1's ~7) → seed round-2. Query-seeds only; never scored as discovered (no recall inflation).
  3. Filing sources exempt from the round-2 on-modality gate — the deal filings name the deal, not the modality (D06 Merck×Hengrui: 27 co-occurring filings, 0 modality tokens), so gating them blocked deal recovery.
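Part 2 can be pictured with the sketch below. It is a stand-in written for illustration, not the body of derive_companies_from_corpus: the signal shape (a dict with "company" and "text"), the substring matching, and the token lists are all assumptions.

```python
def derive_seed_companies(signals, modality_tokens, scope_tokens):
    """Collect companies whose signals mention a modality AND a target/indication token.

    The result is meant to seed round-2 queries only; it should be kept apart
    from the set of discovered entities so it never inflates recall.
    """
    seeds = set()
    for signal in signals:
        text = signal["text"].lower()
        has_modality = any(tok.lower() in text for tok in modality_tokens)
        has_scope = any(tok.lower() in text for tok in scope_tokens)
        if has_modality and has_scope:
            seeds.add(signal["company"])
    return seeds


# Toy corpus: company names and texts are invented placeholders.
toy_signals = [
    {"company": "Toy Bio A", "text": "siRNA candidate against PCSK9 enters phase 1"},
    {"company": "Toy Bio B", "text": "Monoclonal antibody against PCSK9"},
    {"company": "Toy Bio C", "text": "RNAi platform update"},
]
query_seeds = derive_seed_companies(
    toy_signals,
    modality_tokens=["siRNA", "RNAi"],
    scope_tokens=["PCSK9", "ANGPTL3", "dyslipidemia"],
)
discovered = set()  # scored entities live in a separate set from query seeds
```

Requiring both token classes is what keeps the seed list near the scope: a target mention alone or a modality mention alone is not enough.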

Measured (db + derivation — note: SEMI-CIRCULAR, see caveat):

| metric | before bridge | after bridge (+ precision guard + AGT) | closed-world |
| --- | --- | --- | --- |
| company recall | 39% | 50% (18/36) | |
| asset recall (sponsor) | 38% | 47% (15/32) | 47% (15/32, ceiling) |
| deal recall | 42% | 75% (9/12) | 92% (11/12) |
| precision | 26% | 23% | |

(Raw bridge before the precision guard was 56% company / 7% precision, because round-2 filing per-sponsor projected every co-named filer. A cardio-relevance guard on the filing company-hits restored precision to 23% with deal recall + the asset ceiling held. The AGT decompose fix, which covers all cardiometabolic axes, recovered 5/6 deck targets, up from 4/6.)

What this proves: the bridge mechanism works end-to-end — derive companies from the corpus ontology → query the entity-keyed filing sources in round-2 → deal co-occurrence → deal recall. Open-world db asset recall now equals the closed-world entity-keyed ceiling, and deal recall reaches 9/12.
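The last hop, deal co-occurrence, can be sketched as a plain co-mention test over filing text. The alias lists, filing strings, and function name are invented for illustration and do not reproduce the closed-world deal matcher.

```python
def cooccurring_filings(filings, aliases_a, aliases_b):
    """Return the filings that mention at least one alias of each counterparty."""

    def mentions(text, aliases):
        lowered = text.lower()
        return any(alias.lower() in lowered for alias in aliases)

    return [f for f in filings if mentions(f, aliases_a) and mentions(f, aliases_b)]


# Toy filings with placeholder counterparties.
toy_filings = [
    "Acquirer Co enters exclusive licence agreement with Licensor Ltd",
    "Licensor Ltd annual results announcement",
    "ACQUIRER CO and Licensor Limited announce upfront payment",
]
evidence = cooccurring_filings(
    toy_filings,
    aliases_a=["Acquirer Co"],
    aliases_b=["Licensor Ltd", "Licensor Limited"],
)
print(len(evidence))
```

Note that the test never looks for a modality token, which matches the observation that deal filings name the deal rather than the modality.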

Two honest caveats:

  • Semi-circular. This is db mode over the entity-seeded corpus, and the derivation reads that same corpus. It proves the mechanism, not a clean open-world number (the honest live, no-corpus number is still 9%/19%/0%, gated on surfacing the China sponsors live, which needs CDE + patents, not the entity-seeded corpus). The bridge becomes genuinely open-world precisely as those sources feed real sponsors in.
  • Precision is still low. The raw bridge cratered to 7% (284 companies, 20 on deck) because round-2 filing per-sponsor projects every co-named filer. The cardio-relevance guard on those filing company hits brings it back to 23%, still below the 26% measured before the bridge.