Live siRNA discovery with the CDE China-registry bridge: recall readout
Date: 2026-06-24 · Branch: eval/sirna-live-discovery
Thesis (the only input): "Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease."
Ground truth: the hand-built Sanofi-CI reference deck — 32 assets, 12 deals, and the companies behind them.
This is the live version of the closed-world discovery report: it gives the system one sentence and lets it query the live public web cold, now with the newly merged CDE China trial registry (#138) and CJK entity normalization (#147) in the loop. The question: how much closer to the closed-world ceiling can we get when the system can finally read the China registry?
TL;DR
- CDE breaks the cold-start chicken-and-egg for company/asset discovery. Cold-live company recall 10→13 / 36, asset-by-sponsor 6→10 / 32, asset-by-code 5→7 / 32. Every gain is China-registry-driven: the system cold-surfaced Suzhou Ribo and Visirna (and their codes QLC7401, BSA204, SR122) that no Western source exposes by topic.
- Live deal recall stays 0/12 — and we know exactly why (three stacked, addressable blockers, §5). It is not a wiring bug and not something to fudge: closing it needs the CJK layer to go from exact-match to contains-match, plus a discovered-entity→stock-listing resolver.
- The single highest-value next lever is CJK normalization. CDE already recovered three more deck firms — Hansoh, Hengrui, CSPC — but their full legal CDE names don't reduce to the canonical token, so they score as misses. Fixing that alone is worth ~+3 companies and is the gateway to the China deal sources.
- Anti-cheating verified. The cold-live run loads no seeded DB, derives no companies from prior corpora, and never feeds the deck's known asset codes into CDE. Recall uses the same matchers as the closed-world test — only the signal set differs. (§2)
1. Headline numbers
| Metric | Cold live, no CDE | Cold live, +CDE | Closed-world (db, bridge) | Closed-world ceiling* |
|---|---|---|---|---|
| Company recall | 10/36 (28%) | 13/36 (36%) | 18/36 (50%) | — |
| Asset recall — by sponsor | 6/32 (19%) | 10/32 (31%) | 16/32 (50%) | 15/32 |
| Asset recall — by code | 5/32 (16%) | 7/32 (22%) | 9/32 (28%) | — |
| Deal recall | 0/12 (0%) | 0/12 (0%) | 8/12 (67%) | 11/12 |
| Company precision | 4% | 4% | 28% | — |
* Ceiling = "given the analyst's list, is each item public at all?" — the best any system could do with today's sources.
Read it as: CDE moves cold-live discovery roughly 40% of the way from the no-CDE baseline toward the closed-world bridge on companies and assets (companies close 3 of the 8-firm gap; sponsor recall goes 19% → 31%, toward 50%). Deals are the untouched frontier (§5).
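The gap-closure reading can be checked directly from the headline table. The sketch below is illustrative only: the helper name `gap_closed` is hypothetical and not part of the eval scripts; the counts are copied from the table above.

```python
from fractions import Fraction


def gap_closed(baseline: int, with_cde: int, target: int) -> Fraction:
    """Fraction of the baseline-to-target gap that the +CDE run closes."""
    return Fraction(with_cde - baseline, target - baseline)


# (cold live no CDE, cold live +CDE, closed-world bridge) hit counts from the table
ROWS = {
    "company": (10, 13, 18),
    "asset_by_sponsor": (6, 10, 16),
    "asset_by_code": (5, 7, 9),
}

for name, (base, cde, bridge) in ROWS.items():
    share = gap_closed(base, cde, bridge)
    print(f"{name}: {share} of the gap closed")
```

Deal recall is left out on purpose: it is 0/12 in both cold-live columns, so no part of that gap is closed.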
The cold-live `+CDE` results are identical with and without the SEC rate-limit fix, which confirms the deal gap is architectural, not SEC throttling. SEC pacing is kept in (it stops the SEC axis silently dying under the concurrent fan-out), but it does not change recall here.
2. Why this is an honest cold-open-world number (anti-cheating)
The whole point of the live test is that nothing leaks the answer. Audited and enforced:
- No seeded DB. `--corpus live` never loads `ogur.db`; `needs_corpus` is false on the cold path. The 22k-signal corpus that powers the closed-world column is untouched.
- No corpus-derived seeds. Round-2 expansion uses only entities discovered this run; the self-built Target→Company derivation defaults off for live.
- CDE is driven by topic, not by answers. In discovery, CDE runs two axes only:
  - round 1: Chinese indication ∩ `drugs_type=chemical`. Carries no company or asset names. Anything it returns is genuinely discovered.
  - round ≥2: `drugs_name` recovery on asset codes discovered live from other sources earlier this run (capped, code-shaped only). It never receives the deck's known codes.

  The seed config's `drugs_name = [BSA204, HS-02, …]` (the deck's codes) is coverage probing: legitimate for the closed-world column, a leak for discovery. The live path uses a different code source by construction; see `harvest_cde` in `scripts/eval/sirna_discovery/sources.py`.
- Same matchers as closed-world. Recall/deal scoring calls the identical `sirna_match.signal_matches_asset` / `classify_deal_match`. Only the set of signals differs.
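The round ≥2 rule ("capped, code-shaped only", fed by this run's discoveries) can be pictured as a small filter. This is a minimal sketch, not the code in `harvest_cde`: the function name, the regex defining "code-shaped", and the default cap are all assumptions made for illustration.

```python
import re

# Assumption: "code-shaped" means 2-5 letters, an optional hyphen, then 2-6 digits
# (covers forms like BSA204, HS-02, BW-00112). The real rule may differ.
CODE_SHAPED = re.compile(r"^[A-Z]{2,5}-?\d{2,6}$")


def select_cde_recovery_codes(discovered_this_run, cap=20):
    """Choose round>=2 drugs_name queries from live-discovered strings only.

    The function deliberately takes no seed-config argument, so the deck's
    known codes cannot reach CDE through this path.
    """
    seen = set()
    picked = []
    for raw in discovered_this_run:
        code = raw.strip().upper()
        if code in seen or not CODE_SHAPED.match(code):
            continue  # skip duplicates and anything that is not code-shaped
        seen.add(code)
        picked.append(code)
        if len(picked) >= cap:
            break  # hard cap on the number of CDE queries
    return picked


queries = select_cde_recovery_codes(["bsa204", "Suzhou Ribo", "HS-02", "BSA204", "siRNA"])
print(queries)
```

A code found live may still coincide with a deck code; the guarantee sketched here is about where the query list comes from, not about its values being disjoint from the deck.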
3. What CDE recovered cold (the wins)
Companies surfaced by source (deck firms first-surfaced): clinicaltrials 6, openalex 3, cde 3, sec 1. CDE is tied for the #2 discovery source despite being a single registry — because it is the only topic-searchable window into China-only filings.
Deck assets CDE put on the board (cold):
| Asset | Code | Company | How CDE surfaced it |
|---|---|---|---|
| A09 | SR122 | Suzhou Ribo | sponsor via cde:indications (苏州瑞博 → CJK-normalized → Suzhou Ribo) |
| A10 | BSA204 | Visirna | code via signal:cde — the BSA204 recovery the source was built for, now cold |
| A25 | SR126 | Suzhou Ribo | sponsor via cde:indications |
| A29 | BW-00112 (+combo) | — | code via signal:cde |
| A31 | Enlicitide / MK-7262 | Merck | sponsor via cde:indications |
CJK normalization (#147) working as designed — distinct raw surface forms merged to one entity:
- Visirna ← Visirna Therapeutics HK Limited, 维亚臻生物技术(上海), 维亚臻生物技术(苏州)
- Qilu Pharma ← Qilu Pharmaceutical Co., Ltd., 齐鲁制药有限公司
Non-deck players the system flagged as real (the open-world bonus — "who did the deck miss?"): Eddingpharm (EDP167, an ANGPTL3 siRNA for dyslipidemia — a genuine cardiometabolic asset absent from the deck), plus Ionis, Dicerna, Mirna Therapeutics (CFB siRNA). These are the kind of leads an analyst would want surfaced.
4. Errors (where it's still weak)
- Precision is 4% — dominated by OpenAlex author-affiliation noise: hospitals, universities and unrelated orgs that co-author siRNA papers leak through as "companies" (Borgess Health, Houston Methodist, Stryker, Oldham Council, …). This is a precision/affiliation-filter problem, orthogonal to CDE, and is why the open-world precision looks much worse than closed-world's 28%.
- CDE indication coverage is partial. It found Hansoh/Hengrui/CSPC trials but not Innovent, Akeso, BeBetter, Sino Biopharma, Jumpcan — their siRNA assets either aren't registered under the lipid indications queried or aren't in CDE yet. Broadening the Chinese indication set (e.g. obesity 肥胖症, hypertension subtypes) would widen the net.
- 15 of 32 deck assets are corpus-ceiling FNs — pre-clinical, no public footprint anywhere (no trial, filing, or listing). Even a perfect system caps near the 15/32 ceiling on by-sponsor recall; the gap partition shows 0 "discovery-missed" assets in the live+CDE run (everything reachable that we missed, we now reach).
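For the affiliation noise described in the first bullet, one possible filter keys on the institution `type` that OpenAlex institution records carry (values such as `company`, `education`, `healthcare`). The sketch below assumes each surfaced affiliation has already been reduced to a dict with `name` and `type`; the function name and the sample records are hypothetical placeholders, not run output.

```python
def keep_commercial(affiliations, allowed_types=frozenset({"company"})):
    """Keep only affiliations whose institution type is in allowed_types.

    Records with a missing or empty type are dropped, which favours
    precision over recall.
    """
    kept = []
    for aff in affiliations:
        inst_type = (aff.get("type") or "").lower()
        if inst_type in allowed_types:
            kept.append(aff)
    return kept


# Placeholder records for illustration only.
sample = [
    {"name": "Example Therapeutics", "type": "company"},
    {"name": "Example University", "type": "education"},
    {"name": "Example General Hospital", "type": "healthcare"},
    {"name": "Example Borough Council", "type": "government"},
    {"name": "Untyped Org", "type": None},
]

print([a["name"] for a in keep_commercial(sample)])
```

A type filter alone would not remove organisations that are commercial but off-topic, so a landscape-relevance check would still be needed on top of it.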
5. Why live deals are still 0, and what it would take
The closed-world test gets 8/12 deals; cold-live gets 0. Three blockers stack, none of them fudgeable:
- CJK exact-match gap (the gateway). CDE did surface the deal counterparties — but under full legal CJK names the normalizer can't reduce:
| Deck firm (FN) | CDE surface form | Why it misses | Deals it gates |
|---|---|---|---|
| Hansoh Pharma | 上海翰森生物医药科技 / 江苏豪森药业 | table has 翰森, not these compounds | D07 Regeneron×Hansoh |
| Jiangsu Hengrui | 山东盛迪医药 / 上海拓界生物医药 | subsidiaries, not 恒瑞 | D06 Merck×Hengrui |
| CSPC Pharma | 石药集团中奇制药技术 | contains 石药, ≠ core | D05 AstraZeneca×CSPC |
The #147 layer does exact core-key match; these need contains/subsidiary→parent resolution. Fixing this is worth ~+3 companies and unlocks the firms that are cleanly listed on HKEX/cninfo.
- China deal sources are resolution-bound. Even once a firm is recognized: HKEX matches by token-subset (Suzhou Ribo lists as `RIBOLIFE-B`, which has no token overlap with "Suzhou Ribo Life Science"; Qilu Pharma isn't HK-listed), and cninfo (where Merck×Hengrui D06 lives) needs a curated English→stock-code map the live discovery path intentionally doesn't carry (carrying one would re-introduce the known-company leak). A discovered-entity → listing resolver is the missing component.
- Live filings are metadata-thin. The live SEC/HKEX projections store only the filing description; the counterparty name lives in the filing body. The closed-world corpus has enriched body/prospectus text (that's how it scores deals), so the live path would need body enrichment (e.g. the HKEX prospectus path, which names a biotech's whole deal book) to match co-occurrence.
Net: deals are a real, scoped follow-up, not a quick eval tweak — and forcing them (hand-mapping deck firms or feeding stock codes) would be exactly the cheating we audited out.
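The first blocker (exact core-key match versus contains + subsidiary→parent) can be illustrated with the surface forms from the table in this section. This is a minimal sketch under stated assumptions: the two lookup tables, the subsidiary stems (豪森, 盛迪, 拓界) and both function names are illustrative, and the real #147 layer may reduce names differently before matching.

```python
# Core keys named in the table: the normalizer knows these tokens.
CORE_KEYS = {
    "翰森": "Hansoh Pharma",
    "恒瑞": "Jiangsu Hengrui",
    "石药": "CSPC Pharma",
}

# Hypothetical subsidiary stems, taken from the CDE surface forms in the table.
SUBSIDIARY_TO_PARENT = {
    "豪森": "Hansoh Pharma",
    "盛迪": "Jiangsu Hengrui",
    "拓界": "Jiangsu Hengrui",
}


def resolve_exact(name):
    """Exact-match behaviour: the whole name must equal a core key."""
    return CORE_KEYS.get(name)


def resolve_contains(name):
    """Contains-match, then subsidiary->parent fallback; longest key wins."""
    for table in (CORE_KEYS, SUBSIDIARY_TO_PARENT):
        for key in sorted(table, key=len, reverse=True):
            if key in name:
                return table[key]
    return None


for form in ["上海翰森生物医药科技", "江苏豪森药业", "山东盛迪医药", "石药集团中奇制药技术"]:
    print(form, resolve_exact(form), resolve_contains(form))
```

How a subsidiary table would be populated at runtime without re-introducing a known-company leak is left open here; the sketch only shows the matching step.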
6. Closed-world re-run with CDE (what the registry adds to the seed)
(The user also asked: does querying CDE recover anything in the closed-world corpus?)
scripts/eval/seed_cde_into_db.py crawls CDE for the landscape — drugs_name coverage-probe over
the deck's known codes (legitimate here: measuring recoverability, exactly what inspect_sirna_seed
does per source) plus the validated Chinese indication queries — and upserts into ogur.db. Then
a --corpus db discovery run measures the delta.
57 CDE signals were added to the corpus (22,422 → 22,479).
Discovery-harvester recall — unchanged:
| Metric | Closed-world (db) pre-CDE | Closed-world (db) +CDE | Δ |
|---|---|---|---|
| Company recall | 18/36 (50%) | 18/36 (50%) | 0 |
| Asset — by sponsor | 16/32 (50%) | 16/32 (50%) | 0 |
| Asset — by code | 9/32 (28%) | 9/32 (28%) | 0 |
| Deal recall | 8/12 (67%) | 8/12 (67%) | 0 |
But coverage does improve — and that's the real answer to "can we recover anything from CDE?"
Running the per-source matcher (the inspect_sirna_seed methodology) over just the 57 CDE signals:
- CDE matches 4 deck assets: A01, A10, A29, A30.
- 1 is CDE-unique — matched by no other source in the entire 22k corpus: A10 BSA204 (Visirna) — precisely the asset CDE was built to recover.
So why didn't db-discovery pick up A10? The db discovery harvester gates concept/per-sponsor
matches on a modality token (siRNA/RNAi) in the signal text — and a CDE trial registration
reads "BSA204注射液 … 高胆固醇血症", never "siRNA". The coverage is in the corpus; the harvester's
relevance gate can't see it. Fix: treat CDE like the filing sources (sec/hkex/cninfo),
which are already exempt from the modality gate, or tag CDE signals with the landscape modality at
seed time. That single change would convert CDE's coverage into discovery recall (A10 → +1 asset,
and more as the CDE crawl widens).
Net: in closed-world, CDE adds 1 net-new recoverable deck asset (A10/Visirna) today, gated behind a one-line harvester change — not a re-architecture.
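The gate behaviour described above can be sketched as follows. The names `MODALITY_TOKENS`, `GATE_EXEMPT_SOURCES` and `passes_modality_gate` are hypothetical and do not come from the harvester; only the exempt sources (sec/hkex/cninfo), the modality tokens (siRNA/RNAi) and the CDE registration text are taken from this section.

```python
MODALITY_TOKENS = ("sirna", "rnai")

# Filing sources that this section says are already exempt from the gate.
GATE_EXEMPT_SOURCES = frozenset({"sec", "hkex", "cninfo"})


def passes_modality_gate(signal, exempt=GATE_EXEMPT_SOURCES):
    """Return True if the signal may be used for concept/per-sponsor matching."""
    if signal["source"] in exempt:
        return True  # exempt sources skip the text check entirely
    text = signal["text"].lower()
    return any(token in text for token in MODALITY_TOKENS)


cde_signal = {"source": "cde", "text": "BSA204注射液 … 高胆固醇血症"}

# Today: the CDE registration never says "siRNA", so the gate hides it.
print(passes_modality_gate(cde_signal))

# Proposed change, sketched: add "cde" to the exempt set.
print(passes_modality_gate(cde_signal, exempt=GATE_EXEMPT_SOURCES | {"cde"}))
```

The alternative mentioned above, tagging CDE signals with the landscape modality at seed time, would leave the gate unchanged and instead make the text check succeed.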
7. Reproducibility
```bash
# gold deck (set once)
export OGUR_GAP_ASSETS_CSV=".../16_benchmark_ogur_vs_sanofi_siRNA_vDRAFT.xlsx - 01_Assets.csv"
export OGUR_GAP_DEALS_CSV=".../16_benchmark_ogur_vs_sanofi_siRNA_vDRAFT.xlsx - 02_Deals.csv"
uv sync --extra dev --extra visual && uv run playwright install chromium   # CDE needs Playwright

THESIS="Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease"

# cold-live baseline (no CDE)
uv run python scripts/eval/sirna_autonomous_discovery.py "$THESIS" --corpus live --no-cde --no-llm --rounds 3

# cold-live + CDE bridge (the headline)
uv run python scripts/eval/sirna_autonomous_discovery.py "$THESIS" --corpus live --no-llm --rounds 3

# closed-world bridge (db)
uv run python scripts/eval/sirna_autonomous_discovery.py "$THESIS" --corpus db --no-llm --rounds 3

# closed-world + CDE: add CDE to the corpus, then re-run db
uv run python scripts/eval/seed_cde_into_db.py
uv run python scripts/eval/sirna_autonomous_discovery.py "$THESIS" --corpus db --no-llm --rounds 3
```
Scope is the deterministic floor (--no-llm) for reproducibility and to hold scope identical across
all arms, so the only variables are CDE on/off and corpus. It equals the analyst's recommended
cardiometabolic breadth (6 targets, the lipid/CV indication set).
8. Recommendations (ranked by recall-per-effort)
- Exempt CDE from the db-discovery harvester's modality-token gate (one line: treat it like the `sec`/`hkex`/`cninfo` filing sources, which are already exempt). CDE trial text never says "siRNA", so `_harvest_db` hides coverage already in the corpus (proven: A10/Visirna). Cheapest possible win; converts coverage → recall in the closed-world/db arm. (Live already bypasses this gate: `harvest_cde` projects CDE hits directly, which is why the live arm shows the CDE gains.)
- CJK normalization: exact-match → contains + subsidiary→parent. Unlocks Hansoh/Hengrui/CSPC (~+3 companies) and is the prerequisite for every China-deal source. Highest leverage. (#127)
- Discovered-entity → stock-listing resolver. Map a discovered firm to its HKEX short name / cninfo org-id at runtime (no pre-baked company list), so the deal sources fire on cold-discovered firms. Gateway to live deals.
- OpenAlex affiliation filter. Drop non-commercial affiliations (the bulk of the 4% precision drag) so the open-world company list is analyst-usable.
- Broaden CDE Chinese indications + add the HKEX-prospectus body path for co-occurrence-grade deal text.
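To make the resolver recommendation concrete, the sketch below shows why a token-subset match cannot connect "Suzhou Ribo Life Science" to its HKEX short name `RIBOLIFE-B`. The tokenizer and the either-direction subset rule are assumptions about how "token-subset" works, and the second company pair is a hypothetical example, not a real listing.

```python
import re


def tokens(name):
    """Lowercased alphanumeric tokens of a company or listing name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def token_subset_match(entity, listing):
    """True if one name's token set is contained in the other's."""
    ent, lst = tokens(entity), tokens(listing)
    if not ent or not lst:
        return False
    return ent <= lst or lst <= ent


# The case from section 5: {"suzhou","ribo","life","science"} vs {"ribolife","b"}.
print(token_subset_match("Suzhou Ribo Life Science", "RIBOLIFE-B"))

# Hypothetical pair where the short name does share whole tokens.
print(token_subset_match("Example Bio", "EXAMPLE BIO-B"))
```

A runtime resolver would need some signal beyond shared whole tokens to bridge the first case, which is what the discovered-entity → listing component in the third recommendation is meant to supply.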