Autonomous Explore-mode discovery: pressure-test read-out
What this is. A read-out of the open-world discovery pressure test built in
scripts/eval/sirna_autonomous_discovery.py. It answers a question the
closed-world coverage test (scripts/utils/inspect_sirna_seed.py, PR #107/#113)
structurally cannot: given only a one-sentence thesis — not the analyst's
hand-curated list of 30 drugs + 40 companies — can Ogur independently re-derive
the entity coverage of the "Dual Targeting siRNA" consulting deck (32 assets,
12 deals)?
Run it.
export OGUR_GAP_ASSETS_CSV="…/16_benchmark_ogur_vs_sanofi_siRNA…01_Assets.csv"
export OGUR_GAP_DEALS_CSV="…/16_benchmark_ogur_vs_sanofi_siRNA…02_Deals.csv"
# Fast, deterministic, free — concept-searches the existing 11,614-signal corpus:
uv run python scripts/eval/sirna_autonomous_discovery.py \
"Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease" \
--corpus db --rounds 3
# The real test — fires concept queries at the live source APIs:
uv run python scripts/eval/sirna_autonomous_discovery.py "<thesis>" --corpus live --rounds 2
Artifacts land in archived_data/: sirna_discovered_companies_<date>.csv,
sirna_discovered_assets_<date>.csv, sirna_gap_analysis_<date>.md (the
per-run gap report, with a reproducibility line). --no-llm swaps the LLM
decomposition for the deterministic fallback so a run is free and reproducible.
How it works
thesis ─▶ (1) decompose ─▶ (2) generate concept queries ─▶ (3) discover across 5 sources
│
(5) score vs deck ◀─ (4) iterate: round N+1 queries seeded by round N entities
- Decompose (Sonnet, or a deterministic fallback) the sentence into a structured scope: modalities × modality_subclass × targets × indications × horizon. This is the analyst-overridable intermediate (it mirrors the frontend's ClarifyingQuestionsPanel), and it is written to the gap report's reproducibility line.
- Generate concept queries: the cross-product of modality × {target, indication} plus the subclass phrasings ("dual-targeting", "divalent", "bispecific"), capped per round.
- Discover: fire the queries at the five sources via their existing concept-capable entry points, projecting each hit onto a sponsor-like field: CT.gov leadSponsor.name, patent applicants[].extracted_name, OpenAlex institutions[].display_name, SEC filer display_names[0], and OpenTargets drugs by target (target(ensemblId){knownDrugs}, filtered on mechanismOfAction). Entities are alias-normalised and research orgs are dropped.
- Iterate: round 2 seeds per-discovered-sponsor queries, gated by a trust threshold (source diversity) so round-1 noise isn't amplified. Converge when M consecutive rounds add no new entity.
- Score: reuse the closed-world matcher (scripts/eval/sirna_discovery/, shared with inspect_sirna_seed.py) in reverse: recall vs the deck, precision of what we surfaced, the four-bucket matrix, and the FN gap partitioned into corpus ceiling vs discovery-missed.
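The query-generation step can be illustrated with a minimal sketch. The function name, the default cap, and the exact phrase templates below are assumptions for illustration, not the harness's actual implementation; only the cross-product shape and the three subclass phrasings come from the description above.

```python
from itertools import product

# Subclass phrasings named in the read-out.
SUBCLASS_PHRASINGS = ("dual-targeting", "divalent", "bispecific")


def generate_concept_queries(modalities, targets, indications,
                             subclass_phrasings=SUBCLASS_PHRASINGS, cap=20):
    """Hypothetical sketch: modality x {target, indication} plus subclass phrasings."""
    queries = []
    # Cross each modality with every target and every indication.
    for modality, facet in product(modalities, list(targets) + list(indications)):
        queries.append(f"{modality} {facet}")
    # Add the subclass phrasings, one per modality.
    for phrasing, modality in product(subclass_phrasings, modalities):
        queries.append(f"{phrasing} {modality}")
    # De-duplicate while preserving order, then apply the per-round cap.
    return list(dict.fromkeys(queries))[:cap]


example = generate_concept_queries(
    ["siRNA"], ["PCSK9", "ANGPTL3"], ["dyslipidemia"], cap=5
)
```

With a cap of 5, the example keeps the three modality × facet queries and the first two subclass phrasings, which shows how a tight cap can silently drop phrasings late in the list.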
Results
| metric | live (real test) | db (corpus proxy) | closed-world baseline |
|---|---|---|---|
| company recall | 4/36 (11%) | 9/36 (25%) | — |
| asset recall (sponsor-attribution) | 2/32 (6%) | 6/32 (19%) | 15/32 (47%) |
| asset recall (code-level) | 4/32 (12%) | 3/32 (9%) | — |
| deal recall | 0/12 (0%) | 0/12 (0%) | 5/12 (42%) |
| company precision | 50% | 33% | — |
Two recall bars are reported deliberately: sponsor-attribution (surfacing "Suzhou Ribo" credits its deck assets) vs code-level (must surface "SR122" itself). The gap between them isolates discovery failures from code-disclosure failures.
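To make the two bars concrete, here is a minimal sketch of how they could be computed side by side. The data layout (a list of `(asset_code, sponsor)` pairs) and the function name are assumptions; the real matcher in scripts/eval/sirna_discovery/ also does alias normalisation, which this sketch omits.

```python
def recall_bars(deck_assets, surfaced_sponsors, surfaced_codes):
    """Hypothetical sketch of the two recall bars.

    deck_assets: list of (asset_code, sponsor) pairs from the deck.
    surfaced_sponsors / surfaced_codes: sets of names the run discovered.
    """
    total = len(deck_assets)
    # Sponsor-attribution: an asset counts if its sponsor was surfaced.
    sponsor_hits = sum(1 for _, sponsor in deck_assets if sponsor in surfaced_sponsors)
    # Code-level: an asset counts only if its own code was surfaced.
    code_hits = sum(1 for code, _ in deck_assets if code in surfaced_codes)
    return {
        "sponsor_attribution": sponsor_hits / total,
        "code_level": code_hits / total,
    }


# Placeholder deck: two assets from one sponsor, one from another.
deck = [("CODE-1", "Sponsor A"), ("CODE-2", "Sponsor A"), ("CODE-3", "Sponsor B")]
bars = recall_bars(deck, surfaced_sponsors={"Sponsor A"}, surfaced_codes={"CODE-3"})
```

In the placeholder run, Sponsor A was found but neither of its codes was, while CODE-3 was found without its sponsor, so the two bars diverge in opposite directions for different assets.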
Post-review refinement (PR #117 Codex review)
Three changes from the review, and what each measured — instructive because two of them moved nothing, which is the point:
- EPO fallback projection bug fixed. _live_patents() kept the Lens projector after falling back to EPO, whose flat record shape ({applicant, …}) the Lens projector silently drops, so the patent axis was doubly dead (Lens 401 and mis-projection). Fixed (_project_epo()). Live recall unmoved (11%): the patent axis is now honestly measured as weak on this landscape (EPO returns almost nothing for these concept phrases), not silently broken.
- Trust-gate relaxed + round-2 kept on-modality. Letting recurring single-source sponsors seed round 2 floods unless the per-sponsor sweep is filtered to the sponsor's siRNA records (it otherwise returns their whole trial book). With the on-modality filter, live precision rose 36% → 50% with recall held; without it, precision cratered to 5%.
- db snapped back to baseline. The on-modality filter pruned exactly the entities the relaxed gate tried to add — because the addressable misses lack a modality token in their text. This is the same retrieval wall, proven from both directions: every keyword knob trades precision against recall but cannot break the ~11% live ceiling. Only ingestion-time enrichment (tagging modality on the signal, where the raw text doesn't) escapes it — see §"Verdict".
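The projection bug in the first item is easy to reproduce in miniature. The sketch below assumes the two record shapes named in this read-out (Lens: applicants[].extracted_name; EPO: a flat applicant field); the function names and the dispatch table are illustrative, not the code in _live_patents().

```python
def project_lens(record):
    """Lens-shaped record: a list of applicant dicts with 'extracted_name'."""
    return [a["extracted_name"] for a in record.get("applicants", [])
            if a.get("extracted_name")]


def project_epo(record):
    """EPO-shaped record (assumed flat): a single 'applicant' string."""
    name = record.get("applicant")
    return [name] if name else []


# Pick the projector from the source that actually answered, so a fallback
# to EPO can never be paired with the Lens projector.
PROJECTORS = {"lens": project_lens, "epo": project_epo}


def project_patent_hits(records, source):
    projector = PROJECTORS[source]
    return [name for record in records for name in projector(record)]


epo_records = [{"applicant": "Example Pharma"}]
buggy = project_patent_hits(epo_records, "lens")  # wrong projector: drops every hit
fixed = project_patent_hits(epo_records, "epo")
```

Keying the projector on the responding source, rather than on the source that was requested first, is one way to make this class of silent drop impossible.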
What worked
- Decomposition is the strong link. From the one sentence alone the system recovered 5 of the deck's 6 target genes (PCSK9, ANGPTL3, APOC3, LPA, AGT) and the load-bearing "dual-targeting" subclass, the hard inference that the naïve modalities=["siRNA"], indications=["cardiometabolic disease"] call misses.
- It found a player the deck missed. The FP-interest pass surfaced Eddingpharm's EDP167 ("a double-stranded siRNA targeting ANGPTL3 … for dyslipidemia"), a genuine dual-axis RNAi-cardiometabolic asset absent from the analyst's 32. This is the headline value of an open-world tool: it is not bounded by the curator's prior knowledge. (Off-deck, the run also surfaced Silence Therapeutics, Ionis/Isis, and The Medicines Company, inclisiran's originator, which a manual analyst pass would promote.)
- Precision is respectable once research-institution noise is filtered (16% → 36% live after the affiliation filter learned non-English institution forms — Inserm, Erasmus MC, Hôpitaux, etc.).
What didn't, and why
- Live recall is lower than the corpus proxy (11% vs 25% companies). This is the central finding. The 11,614-signal corpus was built by entity-keyed closed-world queries plus broad per-indication sweeps, so it already contains more of the right signals than a fresh concept-keyed live sweep reaches. The bottleneck is not the corpus — it is translating scope into queries that hit.
- The patent axis is dead. Lens.org has been 401-ing since PR #107; the EPO fallback 404s on concept phrases and is weak on Chinese applicants. Patents first-surfaced 0 deck companies. For a China-heavy landscape this is the single most damaging gap — composition patents are where preclinical Chinese siRNA assets would otherwise appear.
- OpenTargets-by-target returns 0 asset codes. OT's drugType/known-drugs data is sparse for the disclosed-only, preclinical Chinese candidates this deck is built from, which is exactly the regime where it adds least.
- Deal recall is 0. Deals need co-occurrence evidence (both counterparties in one filing, or bilateral SEC cross-filings). Concept-filtered backing signals don't carry that, and SEC EDGAR is US-listed only; 7 of 12 deals have a China-only counterparty.
- Only CT.gov (3) and OpenAlex (1) first-surfaced any deck company.
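The co-occurrence requirement behind the deal result can be sketched as follows. This is a simplified illustration under stated assumptions: filings are plain text strings, matching is a case-insensitive substring test, and the function name and party names are hypothetical. The closed-world deal matcher is not shown in this read-out and may work differently.

```python
def deal_cooccurrence(deals, filings):
    """Count, per deal, the filings that name BOTH counterparties.

    deals: {deal_id: (party_a, party_b)}
    filings: iterable of filing texts (assumed plain strings).
    """
    lowered = [text.lower() for text in filings]
    counts = {}
    for deal_id, (party_a, party_b) in deals.items():
        a, b = party_a.lower(), party_b.lower()
        # A filing is evidence only if both names appear in the same text.
        counts[deal_id] = sum(1 for text in lowered if a in text and b in text)
    return counts


deals = {"D-X": ("Alpha Bio", "Beta Pharma")}
filings = [
    "Alpha Bio entered a licence agreement with Beta Pharma.",
    "Alpha Bio reported quarterly results.",  # one party only: not evidence
]
counts = deal_cooccurrence(deals, filings)
```

A concept-filtered sweep tends to return texts like the second filing, which mention a relevant company but not its counterparty, so the per-deal count stays at zero.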
The gap is mostly not a discovery failure
Partitioning the 32 assets:
- 16 — corpus ceiling. 11 code-only Chinese biotech assets (Pattern A) + 5 placeholder/private-sponsor assets (Pattern B) have no Western public footprint in any of our five sources. No query strategy recovers these; they need CDE / Chinese aggregators / HKEX (new sources, out of scope here). This caps achievable recall at ~50% against the current source set.
- 11 — discovery-missed (addressable). Evidence for these exists in the corpus but the concept queries didn't reach it (e.g. BEBT-701, ARO-DIMER-PA, RNS681, SR122). This is the real "discovery intelligence" gap and the right place to invest.
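The partition itself is simple set arithmetic. A minimal sketch, assuming each deck asset can be checked for "has evidence somewhere in the corpus" (the function and argument names are hypothetical):

```python
def partition_misses(deck, discovered, in_corpus):
    """Split the deck into found / discovery-missed / corpus-ceiling sets.

    deck: set of deck asset ids.
    discovered: asset ids the open-world run surfaced.
    in_corpus: asset ids with any evidence in the current source set.
    """
    found = deck & discovered
    missed = deck - discovered
    discovery_missed = missed & in_corpus  # evidence exists, queries didn't reach it
    corpus_ceiling = missed - in_corpus    # no footprint in any current source
    # Best recall achievable without adding new sources.
    max_recall = 1 - len(corpus_ceiling) / len(deck)
    return found, discovery_missed, corpus_ceiling, max_recall


deck = {"A", "B", "C", "D"}
found, discovery_missed, corpus_ceiling, max_recall = partition_misses(
    deck, discovered={"A"}, in_corpus={"A", "B"}
)
```

Applied to the counts above, a ceiling of 16 out of 32 assets gives `1 - 16/32 = 0.5`, which is the ~50% cap quoted for the current source set.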
Verdict and what would move the number
An autonomous Explore backend is viable for scope capture and net-new discovery, not yet for recall parity with a hand-curated deck on this landscape. The thesis→scope step is production-grade; the scope→query→reach step is where recall leaks, and the dominant single fix is external (Lens renewal).
Ranked levers for the addressable 11:
- Renew Lens.org: restores the patent-applicant axis, the load-bearing one for Chinese composition filings. Highest expected uplift, lowest engineering.
- Sharper CT.gov concept→param mapping: drive query.intr=<modality> × query.cond=<indication> rather than a single query.term, and promote round-2 per-sponsor sweeps more aggressively (the in-corpus evidence for the 11 discovery-missed assets is mostly CT.gov).
- A co-occurrence pass for deals: score acquirer×licensor co-mention directly (as the closed-world deal matcher does) instead of relying on concept-filtered backing signals.
- New sources for the ceiling 16: CDE (chinadrugtrials.org.cn), Chinese pharma aggregators, HKEX. Separately scoped; the harness already accepts new _harvest_*_by_concept adapters.
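For the CT.gov lever, the difference between the two query shapes can be shown by building the request URLs. This is a sketch that assumes the ClinicalTrials.gov API v2 studies endpoint and its query.term / query.intr / query.cond / pageSize parameters; the helper names are hypothetical and no request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: ClinicalTrials.gov API v2 studies search.
CTGOV_STUDIES = "https://clinicaltrials.gov/api/v2/studies"


def ctgov_term_url(phrase, page_size=50):
    """Single free-text query: the shape the read-out suggests moving away from."""
    return f"{CTGOV_STUDIES}?{urlencode({'query.term': phrase, 'pageSize': page_size})}"


def ctgov_scoped_url(modality, indication, page_size=50):
    """Intervention x condition query: modality and indication in separate fields."""
    params = {
        "query.intr": modality,
        "query.cond": indication,
        "pageSize": page_size,
    }
    return f"{CTGOV_STUDIES}?{urlencode(params)}"


term_url = ctgov_term_url("siRNA dyslipidemia")
scoped_url = ctgov_scoped_url("siRNA", "dyslipidemia")
```

Splitting the scope across the intervention and condition fields lets the registry match each token against the field it belongs to, instead of requiring both to appear in whatever text the free-text search indexes.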
See archived_data/sirna_gap_analysis_<date>.md for the per-run matrix and
docs/sirna_landscape_gap_analysis_brief.md for the Pattern A/B/C source-gap
analysis this read-out's ceiling partition cross-references.
De-circularization experiment (2026-06-19)
Question: the headline db numbers search a corpus that was seeded by querying the deck's own 30 drug codes + 40 company names by name (across all sources, incl. the entity-keyed HKEX/CNInfo). So "open-world db recall" partly re-finds what entity-keyed seeding put there. How much of it is circularity?
Method: run live discovery (concept queries → live APIs, no pre-seeded corpus, LLM decompose, round-1 coverage fix) and compare to the entity-seeded db (22.4k corpus).
| metric | circular db (entity-seeded, all sources) | de-circularized live (concept-only, no corpus) |
|---|---|---|
| asset recall (sponsor) | 38% (12/32) | 9% (3/32) |
| company recall | 39% (14/36) | 19% (7/36) |
| deal recall | 42% (5/12) | 0% (0/12) |
Findings:
- ~75% of asset recall and 100% of deal recall was circularity — the corpus
pre-stocked with the answer. The honest open-world number is 9% asset / 19% company /
0% deal.
- Fixing EPO (#119) did not move the open-world number (9%→9%). EPO's working coverage
is entity-keyed (applicant search): round-2 applicant queries returned thousands of
hits for big-pharma names (Novartis 6,442; BMS 4,590) while concept full-text queries
("dual-targeting siRNA PCSK9") return ~zero. EPO/HKEX/CNInfo are closed-world
(you-know-the-company) sources — they cannot surface an unknown company from a concept.
- The cap on open-world recall is the concept→company bridge. Deals (0%) live only in
entity-keyed filings; recovering them requires deriving the counterparty list from the
concept scope (via OT / CT.gov / SEC full-text / the Target→Company graph), then
feeding those names to the entity-keyed sources in round 2. This is the next lever.
- Round-2 expansion is currently near-useless (+1 novel) — it expands high-diversity
big-pharma names, not the dual-siRNA innovators; the OT-by-target axis leaks on
LLM-proposed genes with no Ensembl mapping (DGAT2, LDLR, ANGPTL4, HSD17B13).
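The circularity shares in the first finding follow directly from the table counts. A small worked calculation (the function name is illustrative):

```python
def circular_share(seeded_hits, open_world_hits):
    """Fraction of entity-seeded recall that disappears in the concept-only run."""
    if seeded_hits == 0:
        return 0.0
    return (seeded_hits - open_world_hits) / seeded_hits


# Hit counts from the table; denominators (32 assets, 36 companies, 12 deals)
# are the same in both columns, so they cancel.
asset_share = circular_share(12, 3)    # 12/32 seeded vs 3/32 live
company_share = circular_share(14, 7)  # 14/36 seeded vs 7/36 live
deal_share = circular_share(5, 0)      # 5/12 seeded vs 0/12 live
```

This gives 0.75 for assets and 1.0 for deals, matching the finding; the same arithmetic puts the company share at 0.5.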
Artifacts: archived_data/sirna_decirc_live_{companies,gap}_20260619.*.
The concept→company bridge (2026-06-19)
Why: open-world deal recall was 0/12 because the entity-keyed China filing sources
(where China-China deals live) weren't in the discovery loop, and no free external
target→company ontology exists (OpenTargets has no company field — verified by
schema introspection). So we built the bridge ourselves, three parts:
- cn_filings (HKEX/CNInfo) wired into round-2 per-sponsor (_live_cn_filings).
- Self-built Target→Company derivation (derive_companies_from_corpus): scan our corpus for companies co-occurring with a modality + target/indication token → ~40 companies (vs round-1's ~7) → seed round-2. Query-seeds only; never scored as discovered (no recall inflation).
- Filing sources exempt from the round-2 on-modality gate: the deal filings name the deal, not the modality (D06 Merck×Hengrui: 27 co-occurring filings, 0 modality tokens), so gating them blocked deal recovery.
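The derivation in the second part can be sketched as a co-occurrence scan. This is a simplified stand-in, not the body of derive_companies_from_corpus: the signal layout (dicts with "company" and "text" keys), the substring matching, and the function name are all assumptions.

```python
def derive_company_seeds(signals, modality_tokens, scope_tokens):
    """Return companies whose signals mention a modality AND a target/indication.

    signals: iterable of dicts with 'company' and 'text' keys (assumed layout).
    The result is meant to seed round-2 queries only, never to be scored
    as discovered entities.
    """
    modality_tokens = [t.lower() for t in modality_tokens]
    scope_tokens = [t.lower() for t in scope_tokens]
    seeds = []
    for signal in signals:
        text = signal["text"].lower()
        has_modality = any(token in text for token in modality_tokens)
        has_scope = any(token in text for token in scope_tokens)
        if has_modality and has_scope and signal["company"] not in seeds:
            seeds.append(signal["company"])
    return seeds


signals = [
    {"company": "Alpha Bio", "text": "Phase 1 siRNA candidate targeting PCSK9"},
    {"company": "Beta Pharma", "text": "Oral small molecule for PCSK9"},  # no modality
    {"company": "Alpha Bio", "text": "siRNA programme in dyslipidemia"},  # duplicate
]
seeds = derive_company_seeds(signals, ["siRNA", "RNAi"], ["PCSK9", "dyslipidemia"])
```

Requiring both a modality token and a scope token is what keeps the seed list near the thesis; the same requirement is why filing sources had to be exempted from the gate, since their texts carry no modality token.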
Measured (db + derivation — note: SEMI-CIRCULAR, see caveat):
| metric | before bridge | after bridge (+ precision guard + AGT) | closed-world |
|---|---|---|---|
| company recall | 39% | 50% (18/36) | — |
| asset recall (sponsor) | 38% | 47% (15/32) | 47% (15/32, ceiling) |
| deal recall | 42% | 75% (9/12) | 92% (11/12) |
| precision | 26% | 23% | — |
(The raw bridge, before the precision guard, measured 56% company recall / 7% precision, because the round-2 per-sponsor filing sweep projected every co-named filer. A cardio-relevance guard on the filing company-hits restored precision to 23% with deal recall and the asset ceiling held. The AGT decompose fix, which covers all cardiometabolic axes, recovered 5/6 deck targets, up from 4/6.)
What this proves: the bridge mechanism works end-to-end — derive companies from the corpus ontology → query the entity-keyed filing sources in round-2 → deal co-occurrence → deal recall. Open-world db asset recall now equals the closed-world entity-keyed ceiling, and deal recall reaches 9/12.
Two honest caveats:
- Semi-circular. This is db mode over the entity-seeded corpus, and the derivation reads that same corpus. It proves the mechanism, not a clean open-world number (the honest live, no-corpus number is still 9%/19%/0%, gated on surfacing the China sponsors live, which needs CDE + patents, not the entity-seeded corpus). The bridge becomes genuinely open-world precisely as those sources feed real sponsors in.
- Precision is the cost. The raw bridge cratered to 7% (284 companies, 20 on deck) because the round-2 per-sponsor filing sweep projects every co-named filer. The relevance guard on the round-2 filing company hits brings it back to 23%, still below the 26% pre-bridge figure, so tightening that projection further remains the next lever.