siRNA dual-targeting: competitive landscape gap analysis
For: the analyst working the gap analysis.
Status: Real coverage baseline (post-plumbing). Numbers reflect commit d2fe602 + the PR #113 matcher fixes.
0. Read this first: what this exercise is and isn't
This is the closed-world coverage ceiling against our current sources, not an Explore-mode discovery run. That distinction matters for how to read the numbers.
The mechanism in this PR is:
- Read the 32 asset codes and 40 company names off the reference deck by hand.
- Store them on the landscape row as `candidate_drugs` + `companies`.
- Every source iterates those lists and runs per-drug / per-sponsor queries against its API (CT.gov Pass 3 adds a per-sponsor sweep on top of the per-drug pass).
- The inspect script tests whether matching signals came back, applying hyphen normalisation, whitespace-tolerant company-token comparison, and SEC-source-restricted cross-filing.
Implication for the numbers. The "15/32 assets, 5/12 deals" figure answers: "for each row in the deck that we've named by hand, do public sources have any signal carrying that name?" It does NOT answer: "what does Ogur think the dual-siRNA cardiometabolic landscape is, from first principles?" That second question — concept-keyed discovery, projecting hits onto applicant / sponsor / affiliation fields, ranking by source diversity — is the real Explore-mode work, and it's deferred to a follow-up PR (#109, the sponsor-discovery harvester).
This is the closed-world ceiling. The Explore-mode harvester is evaluated against THIS number (15/32 + 5/12), not the original PR #107 baseline (13/32 + 2/12).
What the closed-world test is good for: revealing source-level data gaps. A row with zero signals means public sources don't carry that asset under the code we used, which is a real coverage gap to close with new sources (CDE, Chinese aggregators) or query strategies (Pattern A in §3). A row with 200+ signals means the asset's components are mature comparators absorbing the signal pool (Pattern D in §3).
What it isn't good for: answering whether Ogur would surface the same 32 assets if you handed it only the thesis (modality + targets + indications). That requires the discovery harvester.
When you read §2 ("coverage measured") and §3 ("where the gaps are"), keep this framing in mind: the gaps are real in the sense that there's no public signal for an asset we know exists, but the evaluation framework is one-sided. It measures recall against a hand-named list only; the discovery side of the question is for the dev team to build next.
1. What we did
We seeded a closed-world modality-target staging landscape against the reference deck ("Dual Targeting siRNA — Competitive Landscape", 03-Apr-2026; 32 assets, 12 deals, 9 insights, 7 view archetypes) and tested whether the existing Ogur source plumbing surfaces signals for each row.
The landscape — cardiometabolic-rnai-001 (scope_type="modality_target") — was hand-curated from the deck:
- Modalities: `siRNA`, `RNAi`
- Targets: PCSK9 (anchor) + ANGPTL3, APOC3, LPA, AGT, CFB (the 6 gene symbols on the deck's `Target pair` column, normalised to HGNC `approvedSymbol` form)
- Indications: hyperlipidemia, ASCVD, hypertension, MASH, obesity, cardio-renal (6 canonical labels collapsing the deck's `Indication` column)
- Candidate drugs: the 26 codable assets from the deck + 4 comparator anchors (Inclisiran, Zilebesiran, Evolocumab, HRS-5346)
- Companies: 40 originators + counterparties from the deck's `Company`, `Partner/Licensor`, `Acquirer/Licensee`, `Target/Licensor` columns
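As a rough illustration, the hand-curated row could be pictured as the Python mapping below. Only `scope_type`, `candidate_drugs`, `companies`, the landscape id and the `source_filters.clinicaltrials.enable_sponsor_sweep` path are named in this brief; the remaining key names (`modalities`, `targets`, `indications`) and the helper `sponsor_sweep_enabled` are assumptions made for the sketch, and the two long lists are truncated.

```python
# Hypothetical picture of the landscape row. Key names not quoted in the
# brief (modalities, targets, indications) are assumptions.
landscape = {
    "landscape_id": "cardiometabolic-rnai-001",
    "scope_type": "modality_target",
    "modalities": ["siRNA", "RNAi"],
    # PCSK9 is the anchor target, so it is listed first.
    "targets": ["PCSK9", "ANGPTL3", "APOC3", "LPA", "AGT", "CFB"],
    "indications": [
        "hyperlipidemia", "ASCVD", "hypertension",
        "MASH", "obesity", "cardio-renal",
    ],
    # Truncated: the real row carries 26 deck codes + 4 comparator anchors.
    "candidate_drugs": ["BEBT-701", "ARO-DIMER-PA", "Inclisiran", "HRS-5346"],
    # Truncated: the real row carries 40 companies.
    "companies": ["Innovent", "SanegeneBio", "Alnylam"],
    "source_filters": {"clinicaltrials": {"enable_sponsor_sweep": True}},
}


def sponsor_sweep_enabled(row):
    """Return True only when the opt-in flag is present and truthy."""
    filters = row.get("source_filters", {})
    return bool(filters.get("clinicaltrials", {}).get("enable_sponsor_sweep"))


print(sponsor_sweep_enabled(landscape))  # True for the row above
```

Treating a missing flag as "off" matches the opt-in wording used for Pass 3; the actual default lives in the ADR, not in this sketch.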
Plumbing changes vs PR #107:
- CT.gov Pass 3 (per-sponsor sweep, ADR-0005 opt-in via `source_filters.clinicaltrials.enable_sponsor_sweep`). For each company in `landscape.companies`, indication-scoped queries on `query.lead` and `query.spons`. Recovers trials whose candidate-drug code isn't yet in CT.gov but whose sponsor is; confirmed firing in the re-seed (CT.gov signal pool grew 3,391 → 8,745, +5,354).
- Deal matcher rewrite: per-deal bucket pass with three evidence paths: same-signal (acq + lic in one signal, any source), SEC cross-filing (both counterparties have SEC presence), code path (deal-focus drug code in any signal).
- Hyphen-normalised code matching: deck `ARO-DIMER-PA` ↔ CT.gov `ARO-DIMERPA` now bind.
- Parenthesised-target placeholder recognition: A27 Alnylam's `(AGT/ANGPTL3)` is treated as a target-pair descriptor and fires the company+target backstop.
- Whitespace-tolerant company tokens: `SanegeneBio` (deck) ↔ `Sanegene Bio` (signal text) emit a common token set.
- SEC-source-restricted cross-filing: `cross_filing_match` requires SEC presence on both sides, eliminating the 11/12 false-positive blowup observed under loose semantics.
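The two string-level fixes in that list can be sketched in a few lines. This is a minimal illustration, not the matcher shipped in #113: the function names `normalise_code` and `company_tokens` are hypothetical, and splitting on camel-case boundaries is an assumed way of getting `SanegeneBio` and `Sanegene Bio` onto the same token set.

```python
import re


def normalise_code(code):
    """Uppercase and drop hyphens so deck and registry spellings compare equal."""
    return code.upper().replace("-", "")


def company_tokens(name):
    """Lowercased token set, tolerant of missing whitespace.

    A space is inserted at every lower-to-upper case boundary before
    splitting, so 'SanegeneBio' and 'Sanegene Bio' yield the same tokens.
    """
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return frozenset(token.lower() for token in spaced.split())


# Deck spelling vs CT.gov spelling of the same asset code.
print(normalise_code("ARO-DIMER-PA") == normalise_code("ARO-DIMERPA"))
# Deck spelling vs signal-text spelling of the same company.
print(company_tokens("SanegeneBio") == company_tokens("Sanegene Bio"))
```

Hyphen stripping keeps distinct programs apart (`BEBT-701` and `BEBT-706` still differ), which is why it is safer than a fuzzy edit-distance match for short asset codes.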
We ran all 8 functional Ogur sources (Holo3 stays at zero — playwright not installed locally; openFDA is naturally empty for unapproved assets; Lens.org returned 401 so patents fell back to EPO OPS, which itself returned 403/404 for many Chinese applicants). Two coverage CSVs in archived_data/ are the empirical answer.
2. Coverage: measured
Corpus: 11,614 signals in landscape_id="cardiometabolic-rnai-001" (up from 8,480 in PR #107).
Per-source totals
| Source | Signals | Notes |
|---|---|---|
| ClinicalTrials.gov | 8,745 | +5,354 vs #107 from Pass 3 sponsor sweep |
| SEC EDGAR | 1,655 | Roughly unchanged from #107 |
| Open Targets | 566 | ~Unchanged |
| EPO patents | 552 | Lens 401 — EPO fallback; many Chinese-applicant searches returned 403/404 |
| OpenAlex | 89 | Slight uplift |
| Conferences (Europe PMC) | 5 | Same |
| PubMed | 2 | Same |
| openFDA | 0 | Naturally empty (no approved assets) |
| Holo3 (conference + pipeline) | 0 | playwright not installed locally |
Top-line
| | After #107 | Post-fix | Net |
|---|---|---|---|
| Assets (any source) | 13 / 32 (41%) | 15 / 32 (47%) | +2 (A02, A27) |
| Deals (strict evidence) | 2 / 12 (17%) | 5 / 12 (42%) | +3 (D01, D04, D12) |
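The figures in the two tables above can be cross-checked with a few lines of arithmetic. The snippet only re-derives numbers already quoted in this section; the variable and function names are illustrative.

```python
# Per-source signal counts, copied from the per-source table.
per_source = {
    "ClinicalTrials.gov": 8745,
    "SEC EDGAR": 1655,
    "Open Targets": 566,
    "EPO patents": 552,
    "OpenAlex": 89,
    "Conferences (Europe PMC)": 5,
    "PubMed": 2,
    "openFDA": 0,
    "Holo3 (conference + pipeline)": 0,
}

corpus_total = sum(per_source.values())
# CT.gov pool before the Pass 3 sponsor sweep was 3,391 signals.
pass3_delta = per_source["ClinicalTrials.gov"] - 3391


def pct(hits, total):
    """Coverage as a whole-number percentage."""
    return round(100 * hits / total)


print(corpus_total)               # 11614
print(pass3_delta)                # 5354
print(pct(13, 32), pct(15, 32))   # 41 47
print(pct(2, 12), pct(5, 12))     # 17 42
```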
Assets covered (15)
A01 BEBT-701 · A02 ARO-DIMER-PA · A03 RNS681 · A09 SR122 · A11 HS-02 · A16 Hengrui PCSK9/Lp(a) · A17 Hengrui PCSK9/ANGPTL3 · A25 SR126 · A26 VERVE-101/102 · A27 Alnylam (AGT/ANGPTL3) · A29 BW-00112 · A30 YS2302018+AZD0780 · A31 Enlicitide+MK-7262 · A32 obicetrapib+Repatha · A33 BEECH-217
Deals covered (5)
| Deal | Path | Evidence |
|---|---|---|
| D01 Innovent–SanegeneBio | same-signal | Innovent SEC mentions SanegeneBio; recovered via whitespace-tolerant tokens |
| D04 Novartis–Argo | same-signal | 2 signals contain both names |
| D07 Regeneron–Hansoh | SEC cross-filing + code path | both SEC-listed + 8 signals carry HRS-5346 |
| D08 Lilly–Verve | same-signal + SEC cross-filing | strongest evidence (11 same-signal + 41/74 SEC) |
| D12 Alnylam–Tenaya | SEC cross-filing only | 99 / 97 SEC sigs each, zero co-occurrence; bilateral 8-K pattern |
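The three evidence paths in the table can be sketched as one pass over a deal's signals. This is an illustration under stated assumptions, not the production matcher: the signal shape (`source`, `text`), the source label `"sec_edgar"`, the function names, and the rule that any single path is enough for `deal_matched` are all assumed. The company names and code in the example are invented placeholders.

```python
import re


def squash(text):
    """Lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def match_deal(acquirer, licensor, focus_code, signals, sec_source="sec_edgar"):
    """Evaluate the same-signal, SEC cross-filing and code paths for one deal."""
    acq, lic = squash(acquirer), squash(licensor)
    code = squash(focus_code) if focus_code else ""
    same_signal = acq_in_sec = lic_in_sec = code_hit = False
    for sig in signals:
        body = squash(sig["text"])
        has_acq, has_lic = acq in body, lic in body
        # Path 1: both counterparties named in one signal, any source.
        same_signal = same_signal or (has_acq and has_lic)
        # Path 2: each counterparty must appear in an SEC signal.
        if sig["source"] == sec_source:
            acq_in_sec = acq_in_sec or has_acq
            lic_in_sec = lic_in_sec or has_lic
        # Path 3: the deal-focus drug code appears in any signal.
        code_hit = code_hit or bool(code and code in body)
    cross_filing = acq_in_sec and lic_in_sec
    return {
        "same_signal_match": same_signal,
        "cross_filing_match": cross_filing,
        "code_match": code_hit,
        "deal_matched": same_signal or cross_filing or code_hit,
    }


# Toy signals with placeholder names: a bilateral-filing pattern with
# no co-occurrence, similar in shape to the D12 row.
toy = [
    {"source": "sec_edgar", "text": "Alpha Pharma 8-K: license agreement."},
    {"source": "sec_edgar", "text": "Beta Bio 10-K: collaboration revenue."},
]
print(match_deal("Alpha Pharma", "Beta Bio", "XY-123", toy))
```

Restricting path 2 to SEC signals is what keeps a counterparty that merely shows up as a trial sponsor from counting as filing evidence.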
3. Where the gaps are, and why
The 17 missing assets and 7 missing deals are not random. The new dump's deal_matched, same_signal_match, cross_filing_match, and code_match boolean columns plus the per-counterparty signal counts let the analyst disambiguate "no corpus presence" from "presence without co-occurrence" without re-querying the DB.
Pattern A: Code-bearing assets whose codes don't appear anywhere in the corpus
Aggressive corpus-wide search (strip all non-alphanumerics, substring-match across all 11,614 signals, i.e. the 7 sources with non-zero counts in §2) returns zero hits for the following deck codes:
| Asset | Code | Sponsor | What CT.gov has for this sponsor | Most likely recovery path |
|---|---|---|---|---|
| A04 | BEBT-706 | BeBetter Med | BEBT-701 (a different program) | CDE registry |
| A05/A07/A15 | HJY-21/22/23 | Sino Biopharma / CTTQ | (CTTQ trials don't carry HJY codes) | CDE registry |
| A06/A12 | STP271G / STP237G | Sirnaomics | STP705 (a different program) | CDE registry |
| A10 | BSA204 | Visirna | VSA003 (different prefix — B↔V transcription error?) | Analyst verification needed |
| A14 | 2MW7141 | Kalexo Bio / Mabwell | (zero CT.gov footprint) | CDE registry |
| A19 | SGB-BS01 | Sanegene Bio | SGB-3403 / SGB-3908 / SGB-7342 (different programs) | CDE registry |
| A21 | Csl103 | Curigin (Korean) | (zero CT.gov footprint) | Korean MFDS |
| A23 | SNK-3468 | SynerK | (zero CT.gov footprint) | CDE registry |
Why missing: these codes appear in none of CT.gov, SEC, patents, PubMed, or OpenAlex. The deck author most likely sourced them from channels our sources don't index: corporate pipeline slides, IR presentations, internal IND filings. The Pass 3 sponsor sweep confirmed that BeBetter, Sirnaomics, Visirna, and Sanegene Bio do have CT.gov trial footprints, but for other programs in their pipelines.
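The aggressive search described at the top of this pattern amounts to the following check. It is a minimal sketch: the function names are hypothetical, and the signal text in the example is a made-up stand-in for a real corpus row.

```python
import re


def squash(text):
    """Uppercase and strip every non-alphanumeric character."""
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


def codes_absent_from_corpus(codes, signal_texts):
    """Return the codes with no substring hit in any squashed signal text."""
    squashed_texts = [squash(text) for text in signal_texts]
    return [
        code
        for code in codes
        if not any(squash(code) in text for text in squashed_texts)
    ]


# Made-up signal text: the sponsor's other program is present, the deck code is not.
texts = ["Phase 1 study of BEBT 701 in healthy volunteers"]
print(codes_absent_from_corpus(["BEBT-701", "BEBT-706"], texts))  # ['BEBT-706']
```

Because the match is a raw substring test on squashed text, it errs toward false positives for short codes, so a zero-hit result is strong evidence of absence while a hit still deserves a manual look.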
Where the analyst should look:
- CDE (China Center for Drug Evaluation) — chinadrugtrials.org.cn — the China-native trial registry. Most of these have published INDs there that never cross-register to CT.gov.
- Chinese pharma aggregators — PharmCube, Yaozh, Insight DataBase, BioSpace-China — for IND announcements and corporate pipeline updates.
- Renewed Lens.org subscription — Lens has direct CNIPA indexing; EPO catches Chinese filings only via PCT equivalents and is currently rate-limited for Chinese applicants in our corpus.
Pattern B: Placeholder-code assets with no public sponsor footprint
A08 Youngen · A13 Basecure · A18/A20 Thalia Therapeutics · A22 Argonaute RNA · A28 Corsera Health.
Why missing: the placeholder-code + company + target backstop fires correctly for these, but the sponsor has zero CT.gov / SEC / patent footprint in our corpus (small / private / preclinical-only firms). The sponsor-discovery harvester is the recovery mechanism.
Pattern C: Deals that don't reach evidence threshold
Looking at the per-counterparty signal counts in the dump:
| Deal | Acq sigs | Lic sigs | Reason no evidence |
|---|---|---|---|
| D02 Qilu–Suzhou Ribo | 32 | 4 | Pure China-China; never co-occur |
| D03 Boehringer–Suzhou Ribo | 327 | 4 | Boehringer in corpus, Ribo too, no co-occurrence; Ribo has no SEC |
| D05 AZ–CSPC | 423 | 16 | AZ has SEC; CSPC matches via name but no co-occurrence and no CSPC SEC |
| D06 Merck–Hengrui | 716 | 16 | Hengrui appears as CT.gov sponsor only; no SEC presence |
| D09 Lilly–SanegeneBio | 408 | 4 | Both names appear separately; no Lilly filing mentions SanegeneBio in our corpus |
| D10 Madrigal–Suzhou Ribo | 18 | 4 | Madrigal has SEC; Ribo doesn't; no co-occurrence |
| D11 Akeso–Hubei Jumpcan | 0 | 0 | Both China-only, zero corpus footprint |
Where the analyst should look:
- HKEX disclosures (hkexnews.hk): for CSPC (HK-listed), Hansoh (HK-listed), Hengrui (HK-listed since 2025), and any Chinese counterparty with an HK secondary listing.
- Chinese pharma news aggregators — same as Pattern A. The only Western-accessible feeds for China-only deals.
- PR Newswire / GlobalNewswire archives — paid, but they often carry the bilingual press release for cross-border deals when SEC misses them.
Pattern D: Comparators / combination arms absorbing huge signal counts
A32 obicetrapib+Repatha (242 signals) · A26 VERVE-101/102 (43) · A29 BW-00112 (30) · A30 YS2302018+AZD0780 (26) · A31 Enlicitide+MK-7262 (25).
These look like wins, but read the deck closely: these are combination arms and gene-editing assets the deck author included as context, not as primary dual-targeting siRNA candidates. They light up because their components (obicetrapib, Repatha/evolocumab, AZD0780, MK-7262) are mature drugs with rich trial / SEC / publication corpora. The dual-siRNA story does NOT depend on these — the analyst should treat their coverage as a sanity check, not as gap-analysis signal.
4. Unfixed obvious next steps
PR #107 named four deferred items. Status now:
- ✅ Patent CPC broadening (A61 + C12N) — shipped with ADR-0005 / #111.
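- ✅ CT.gov sponsor sweep: Pass 3 implemented in #113, opt-in per landscape. Recovered A02 directly and confirmed CT.gov sponsor footprints, under other program codes, for several of the 11 still-missing specific-code assets (BeBetter, Sirnaomics, Visirna, Sanegene Bio; see Pattern A in §3).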
- ✅ Smarter deal matcher — per-deal bucket pass with same-signal / SEC cross-filing / code paths. Recovered D01, D04, D12 (PR #107 named D04, D09, D12 as targets; D09 still unmatched because Lilly's SEC filings genuinely don't mention SanegeneBio in our corpus).
- ⏸ Renew Lens.org API access — manual account action. Probe returned 401 on 2026-06-09. Without renewal, CNIPA-indexed Chinese composition patents stay missing; EPO fallback is partial (403/404 on Chinese applicants).
- ⏸ Sponsor-discovery harvester — real Explore-mode answer to "analyst doesn't know the players yet". Scoped to PR #109. Targets the 17 still-missing assets (11 specific-code gaps + 6 placeholder-sponsor gaps) and the China-counterparty deals (D02, D03, D06, D11). Evaluated against this brief's 15/32 + 5/12 baseline.
5. The analyst's task
Two CSVs live in archived_data/:
- `sirna_coverage_dump_<date>.csv`: 352 rows, one per (asset_id, source). Columns include `signal_count`, `latest_signal_date`, `sample_signal_summary` (truncated to 300 chars), `target_pair_in_raw_text`, `modality_subclass_in_raw_text`, and a `notes` column flagging placeholder-code rows.
- `sirna_deals_coverage_dump_<date>.csv`: 12 rows, one per deal. Columns include `deal_matched` (strict yes/no), `same_signal_match` / `cross_filing_match` / `code_match` (which path fired), `matched_acquirer_signal_count` / `matched_licensor_signal_count` (coverage diagnostic), `matched_signal_count` (broad union), `matched_sources`, and the `has_upfront_in_text` / `has_total_in_text` / `has_geo_terms_in_text` flags.
Both load cleanly in Excel / Numbers. The match-path booleans let the analyst quickly partition deals into "covered with bilateral evidence" (deal_matched=True), "both names in corpus but no co-occurrence" (deal_matched=False, both per-counterparty counts > 0 — a Pattern C / B signal), and "one or both counterparties absent" (one count = 0 — pure data gap).
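For analysts who prefer a script to a spreadsheet filter, the three-way partition just described could look like this. The sketch assumes a `deal_id` column (not confirmed above) and accepts either `True`/`False` or `yes`/`no` in `deal_matched`, since both spellings appear in this brief. The sample rows are transcribed from the tables in §2 and §3 for illustration; the real dump may differ.

```python
import csv
import io

TRUTHY = {"true", "yes", "1"}


def classify_deal(row):
    """Map one deals-dump row onto the three buckets described above."""
    matched = row["deal_matched"].strip().lower() in TRUTHY
    acq = int(row["matched_acquirer_signal_count"])
    lic = int(row["matched_licensor_signal_count"])
    if matched:
        return "covered"
    if acq > 0 and lic > 0:
        return "present_no_cooccurrence"
    return "counterparty_absent"


def partition_deals(csv_file):
    """Group deal ids by bucket; csv_file is any open text file object."""
    buckets = {"covered": [], "present_no_cooccurrence": [], "counterparty_absent": []}
    for row in csv.DictReader(csv_file):
        buckets[classify_deal(row)].append(row["deal_id"])
    return buckets


# Illustrative rows only; 'deal_id' is an assumed column name.
SAMPLE = """deal_id,deal_matched,matched_acquirer_signal_count,matched_licensor_signal_count
D12,True,99,97
D02,False,32,4
D11,False,0,0
"""

print(partition_deals(io.StringIO(SAMPLE)))
```

To run it against the real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.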
What to do with the CSVs
- Fill the yellow columns of the original reference-deck CSVs (sheets `01_Assets`, `02_Deals`, `03_Insights`, `04_Structure_Views`) using the dumps. The gap-type column (G1 missing-automatable, G2 partial, G3 wrong-value, G4 missing-provenance, G5 out-of-scope) is where the analytical work lives.
- Verify the A10 BSA204 ↔ VSA003 question. Visirna's CT.gov trials carry `VSA003`. The deck has `BSA204`. The B↔V prefix swap with a 3-digit numerical change (204 → 003) is suspicious: it could be a transcription error in the deck, a different Visirna program, or an internal-vs-clinical code distinction. Worth direct sponsor / IR verification before declaring it a coverage gap.
- For each missing asset / deal, propose the recovery option from §3. Be concrete: "would surface via CDE for ~6 of the 8 specific-code-missing assets" beats "needs Chinese sources".
- For each "covered" row, sanity-check the `sample_signal_summary`: does the signal actually substantiate the deck's claim, or did the matcher over-match? D01 Innovent-SanegeneBio specifically merits a manual SEC-text check.
- Flag the Pattern D over-coverage: A32 etc. are not siRNA dual-target stories. Mark them as out-of-scope context.
Reporting back
Final write-up in this file's replacement (docs/sirna_landscape_gap_analysis.md). Include: per-field coverage table (the measured per-row reality replacing §2 here); ranked recovery options with estimated uplift, effort, cost, recommendation; China-coverage sub-audit per §3 Patterns A + B + C; open questions for the dev team.
Appendix: quick reference
- Deck date: 03-Apr-2026; data cutoff ~Q1 2026.
- Asset count: 32 (A01–A33 with A24 skipped by the deck author).
- Deal count: 12 (D01–D12).
- Acknowledged deck gaps: the IND-Enabling detail slides 6/7 and the second deal-details page (slide 2/2) are missing. Rows tagged `Map (slide 4)` or `Timeline (detail = missing slide 2/2)` have less ground-truth depth.
- Target-pair normalisation in the deck: `A + B` with PCSK9 listed first when present; expect fuzzy matching across `/`, `x`, `&`, `and`, `Lp(a)` vs `LPA`, `APOC` vs `APOC3`.
- Ogur reference docs: `CLAUDE.md` (repo overview), `docs/architecture.md` (source contracts, signal model), `docs/data-sources.md` (per-source auth + rate limits), `docs/adr/0005-sponsor-discovery-harvester.md` (per-landscape `source_filters` contract).
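The target-pair variants listed in this appendix can be collapsed with a small normaliser when comparing `target_pair_in_raw_text` against the deck. This is a sketch under assumptions: the function name is hypothetical, the alias table covers only the two variants named above, and sorting the non-PCSK9 symbols alphabetically is an assumed tie-break (the deck only specifies that PCSK9 comes first).

```python
import re

# Only the two spelling variants named in the appendix.
ALIASES = {"LP(A)": "LPA", "APOC": "APOC3"}

# Separators seen in the deck: '/', '+', '&', a standalone 'x', and 'and'.
SEPARATORS = re.compile(r"\s*(?:/|\+|&|\bx\b|\band\b)\s*", re.IGNORECASE)


def normalise_target_pair(raw):
    """Return the pair in 'A + B' form with PCSK9 first when present."""
    parts = [part.strip().upper() for part in SEPARATORS.split(raw) if part.strip()]
    symbols = [ALIASES.get(part, part) for part in parts]
    # PCSK9 sorts first; the rest are alphabetical (assumed convention).
    symbols.sort(key=lambda symbol: (symbol != "PCSK9", symbol))
    return " + ".join(symbols)


print(normalise_target_pair("Lp(a) x PCSK9"))     # PCSK9 + LPA
print(normalise_target_pair("PCSK9/ANGPTL3"))     # PCSK9 + ANGPTL3
print(normalise_target_pair("APOC and ANGPTL3"))  # ANGPTL3 + APOC3
```

Parenthesised placeholders such as `(AGT/ANGPTL3)` are not handled here, because blanket parenthesis stripping would also damage `Lp(a)`.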