siRNA dual-targeting — competitive landscape gap analysis

For: the analyst working the gap analysis. Status: Real coverage baseline (post-plumbing). Numbers reflect commit d2fe602 + the PR #113 matcher fixes.


0. Read this first — what this exercise is and isn't

This is the closed-world coverage ceiling against our current sources, not an Explore-mode discovery run. That distinction matters for how to read the numbers.

The mechanism in this PR is:

  1. Read the 32 asset codes and 40 company names off the reference deck by hand.
  2. Store them on the landscape row as candidate_drugs + companies.
  3. Every source iterates those lists and runs per-drug / per-sponsor queries against its API (CT.gov Pass 3 adds a per-sponsor sweep on top of the per-drug pass).
  4. The inspect script tests whether matching signals came back, applying hyphen normalisation, whitespace-tolerant company-token comparison, and SEC-source-restricted cross-filing.
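
The per-sponsor sweep in step 3 can be pictured as a query plan. The sketch below is a minimal illustration, not the harvester's code. The landscape dict shape and the function name sponsor_sweep_queries are hypothetical, and scoping by indication through query.cond is an assumption: this brief names only query.lead, query.spons, and the source_filters.clinicaltrials.enable_sponsor_sweep flag.

```python
from itertools import product


def sponsor_sweep_queries(landscape: dict) -> list[dict]:
    """Build CT.gov query-parameter dicts for the opt-in per-sponsor sweep."""
    ct_filters = landscape.get("source_filters", {}).get("clinicaltrials", {})
    if not ct_filters.get("enable_sponsor_sweep", False):
        return []  # opt-in: without the flag, no sweep is planned

    queries = []
    for company, indication in product(landscape["companies"],
                                       landscape["indications"]):
        # One query per sponsor field named in the brief.
        for field in ("query.lead", "query.spons"):
            # ASSUMPTION: query.cond carries the indication scope.
            queries.append({field: company, "query.cond": indication})
    return queries


# Hypothetical landscape row, trimmed to two companies for illustration.
landscape = {
    "companies": ["Visirna", "Sirnaomics"],
    "indications": ["hyperlipidemia", "hypertension", "MASH"],
    "source_filters": {"clinicaltrials": {"enable_sponsor_sweep": True}},
}
plan = sponsor_sweep_queries(landscape)
print(len(plan))  # 2 companies x 3 indications x 2 fields = 12
```

Under the same assumptions, the deck's 40 companies and 6 indication labels would produce 40 x 6 x 2 = 480 queries per sweep.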

Implication for the numbers. The "15/32 assets, 5/12 deals" figure answers: "for each row in the deck that we've named by hand, do public sources have any signal carrying that name?" It does NOT answer: "what does Ogur think the dual-siRNA cardiometabolic landscape is, from first principles?" That second question — concept-keyed discovery, projecting hits onto applicant / sponsor / affiliation fields, ranking by source diversity — is the real Explore-mode work, and it's deferred to a follow-up PR (#109, the sponsor-discovery harvester).

This is the closed-world ceiling. The Explore-mode harvester is evaluated against THIS number (15/32 + 5/12), not the original PR #107 baseline (13/32 + 2/12).

What the closed-world test is good for: revealing source-level data gaps. An asset row with zero signals means public sources don't carry that asset under the code we used, which is a real coverage gap to close with new sources (CDE, Chinese aggregators) or new query strategies (Pattern A in §3). A row with 200+ signals means the asset's components are mature comparators absorbing the signal pool (Pattern D in §3).

What it isn't good for: answering whether Ogur would surface the same 32 assets if you handed it only the thesis (modality + targets + indications). That requires the discovery harvester.

When you read §2 ("coverage measured") and §3 ("where the gaps are"), keep this framing in mind: the gaps are real in the sense that there's no public signal for an asset we know exists, but the evaluation framework is one-sided. Building the discovery side of the question is the dev team's next task.

1. What we did

We seeded a closed-world modality-target staging landscape against the reference deck ("Dual Targeting siRNA — Competitive Landscape", 03-Apr-2026; 32 assets, 12 deals, 9 insights, 7 view archetypes) and tested whether the existing Ogur source plumbing surfaces signals for each row.

The landscape — cardiometabolic-rnai-001 (scope_type="modality_target") — was hand-curated from the deck:

  • Modalities — siRNA, RNAi
  • Targets — PCSK9 (anchor) + ANGPTL3, APOC3, LPA, AGT, CFB (the 6 gene symbols on the deck's Target pair column, normalised to HGNC approvedSymbol form)
  • Indications — hyperlipidemia, ASCVD, hypertension, MASH, obesity, cardio-renal (6 canonical labels collapsing the deck's Indication column)
  • Candidate drugs — the 26 codable assets from the deck + 4 comparator anchors (Inclisiran, Zilebesiran, Evolocumab, HRS-5346)
  • Companies — 40 originators + counterparties from the deck's Company, Partner/Licensor, Acquirer/Licensee, Target/Licensor columns

Plumbing changes vs PR #107:

  • CT.gov Pass 3 (per-sponsor sweep, ADR-0005 opt-in via source_filters.clinicaltrials.enable_sponsor_sweep). For each company in landscape.companies, indication-scoped queries on query.lead and query.spons. Recovers trials whose candidate-drug code isn't yet in CT.gov but whose sponsor is — confirmed firing in the re-seed (CT.gov signal pool grew 3,391 → 8,745, +5,354).
  • Deal matcher rewrite — per-deal bucket pass with three evidence paths: same-signal (acq + lic in one signal, any source), SEC cross-filing (both counterparties have SEC presence), code path (deal-focus drug code in any signal).
  • Hyphen-normalised code matching — deck ARO-DIMER-PA ↔ CT.gov ARO-DIMERPA now bind.
  • Parenthesised-target placeholder recognition — A27 Alnylam's (AGT/ANGPTL3) is treated as a target-pair descriptor, fires the company+target backstop.
  • Whitespace-tolerant company tokens — SanegeneBio (deck) ↔ Sanegene Bio (signal text) emit a common token set.
  • SEC-source-restricted cross-filing — cross_filing_match requires SEC presence on both sides, eliminating the 11/12 false-positive blowup observed under loose semantics.
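
The matcher changes above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the code from PR #113: the signal and deal dict shapes, the "sec_edgar" source label, the camel-case token split, and the rule that deal_matched is the OR of the three paths are all hypothetical choices made for the example.

```python
import re


def norm_code(text: str) -> str:
    """Case-fold and drop hyphens so ARO-DIMER-PA and ARO-DIMERPA compare equal."""
    return text.lower().replace("-", "")


def tokens(text: str) -> set[str]:
    """Split on whitespace, punctuation, and lower-to-upper case boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", text)
    return {p.lower() for p in parts}


def mentions(company: str, signal: dict) -> bool:
    """True when every token of the company name occurs in the signal text."""
    return tokens(company) <= tokens(signal["text"])


def match_deal(deal: dict, signals: list[dict]) -> dict:
    acq = [s for s in signals if mentions(deal["acquirer"], s)]
    lic = [s for s in signals if mentions(deal["licensor"], s)]
    # Path 1: both counterparties named in one signal, any source.
    same_signal = any(mentions(deal["licensor"], s) for s in acq)
    # Path 2: each counterparty has at least one SEC-sourced signal.
    cross_filing = (any(s["source"] == "sec_edgar" for s in acq)
                    and any(s["source"] == "sec_edgar" for s in lic))
    # Path 3: the deal-focus drug code appears in any signal.
    code = deal.get("code")
    code_match = bool(code) and any(
        norm_code(code) in norm_code(s["text"]) for s in signals)
    return {
        "same_signal_match": same_signal,
        "cross_filing_match": cross_filing,
        "code_match": code_match,
        # ASSUMPTION: any one path is enough for a strict match.
        "deal_matched": same_signal or cross_filing or code_match,
    }


# Invented signal texts, for illustration only.
signals = [
    {"source": "sec_edgar", "text": "Innovent entered a collaboration with Sanegene Bio."},
    {"source": "sec_edgar", "text": "Alnylam files 8-K."},
    {"source": "sec_edgar", "text": "Tenaya files 8-K."},
    {"source": "clinicaltrials", "text": "Sponsor: Hengrui. Intervention: HRS5346"},
    {"source": "sec_edgar", "text": "Merck annual report."},
]
d01 = match_deal({"acquirer": "Innovent", "licensor": "SanegeneBio"}, signals)
d12 = match_deal({"acquirer": "Alnylam", "licensor": "Tenaya"}, signals)
d06 = match_deal({"acquirer": "Merck", "licensor": "Hengrui"}, signals)
print(d01["same_signal_match"], d12["cross_filing_match"], d06["deal_matched"])
```

In this toy corpus D01 matches on the same-signal path, D12 matches on SEC cross-filing alone, and D06 stays unmatched because Hengrui has no SEC-sourced signal. Subset-of-tokens matching can over-match short or generic company names, so a real matcher would likely need a stop-list or phrase check.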

We ran all 8 functional Ogur sources (Holo3 stays at zero — playwright not installed locally; openFDA is naturally empty for unapproved assets; Lens.org returned 401 so patents fell back to EPO OPS, which itself returned 403/404 for many Chinese applicants). Two coverage CSVs in archived_data/ are the empirical answer.

2. Coverage — measured

Corpus: 11,614 signals in landscape_id="cardiometabolic-rnai-001" (up from 8,480 in PR #107).

Per-source totals

Source | Signals | Notes
ClinicalTrials.gov | 8,745 | +5,354 vs #107 from Pass 3 sponsor sweep
SEC EDGAR | 1,655 | Roughly unchanged from #107
Open Targets | 566 | Roughly unchanged
EPO patents | 552 | Lens returned 401, so EPO fallback; many Chinese-applicant searches returned 403/404
OpenAlex | 89 | Slight uplift
Conferences (Europe PMC) | 5 | Same
PubMed | 2 | Same
openFDA | 0 | Naturally empty (no approved assets)
Holo3 (conference + pipeline) | 0 | playwright not installed locally

Top-line

Metric | After #107 | Post-fix | Net
Assets (any source) | 13 / 32 (41%) | 15 / 32 (47%) | +2 (A02, A27)
Deals (strict evidence) | 2 / 12 (17%) | 5 / 12 (42%) | +3 (D01, D04, D12)
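
As a quick arithmetic check on the figures in this section, the short script below re-adds the per-source totals and recomputes the top-line percentages. It uses only numbers quoted in this brief; the variable names are illustrative.

```python
# Per-source signal counts as reported in the per-source totals.
per_source = {
    "ClinicalTrials.gov": 8745,
    "SEC EDGAR": 1655,
    "Open Targets": 566,
    "EPO patents": 552,
    "OpenAlex": 89,
    "Conferences (Europe PMC)": 5,
    "PubMed": 2,
    "openFDA": 0,
    "Holo3": 0,
}
corpus_total = sum(per_source.values())
ctgov_growth = 8745 - 3391  # CT.gov pool after Pass 3 minus the pool before it


def pct(hits: int, total: int) -> int:
    """Coverage as a whole-number percentage."""
    return round(100 * hits / total)


print(corpus_total)               # 11614
print(ctgov_growth)               # 5354
print(pct(13, 32), pct(15, 32))   # 41 47
print(pct(2, 12), pct(5, 12))     # 17 42
```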

Assets covered (15)

A01 BEBT-701 · A02 ARO-DIMER-PA · A03 RNS681 · A09 SR122 · A11 HS-02 · A16 Hengrui PCSK9/Lp(a) · A17 Hengrui PCSK9/ANGPTL3 · A25 SR126 · A26 VERVE-101/102 · A27 Alnylam (AGT/ANGPTL3) · A29 BW-00112 · A30 YS2302018+AZD0780 · A31 Enlicitide+MK-7262 · A32 obicetrapib+Repatha · A33 BEECH-217

Deals covered (5)

Deal | Path | Evidence
D01 Innovent–SanegeneBio | same-signal | Innovent SEC mentions SanegeneBio; recovered via whitespace-tolerant tokens
D04 Novartis–Argo | same-signal | 2 signals contain both names
D07 Regeneron–Hansoh | SEC cross-filing + code path | both SEC-listed + 8 signals carry HRS-5346
D08 Lilly–Verve | same-signal + SEC cross-filing | strongest evidence (11 same-signal + 41/74 SEC)
D12 Alnylam–Tenaya | SEC cross-filing only | 99 / 97 SEC sigs each, zero co-occurrence; bilateral 8-K pattern

3. Where the gaps are — and why

The 17 missing assets and 7 missing deals are not random. The new dump's deal_matched, same_signal_match, cross_filing_match, and code_match boolean columns plus the per-counterparty signal counts let the analyst disambiguate "no corpus presence" from "presence without co-occurrence" without re-querying the DB.

Pattern A — Code-bearing assets whose codes don't appear anywhere in the corpus

Aggressive corpus-wide search (strip all non-alphanumerics, substring-match across all 11,614 signals from every source) returns zero hits for the following deck codes:

Asset | Code | Sponsor | What CT.gov has for this sponsor | Most likely recovery path
A04 | BEBT-706 | BeBetter Med | BEBT-701 (a different program) | CDE registry
A05/A07/A15 | HJY-21/22/23 | Sino Biopharma / CTTQ | (CTTQ trials don't carry HJY codes) | CDE registry
A06/A12 | STP271G / STP237G | Sirnaomics | STP705 (a different program) | CDE registry
A10 | BSA204 | Visirna | VSA003 (different prefix; B↔V transcription error?) | Analyst verification needed
A14 | 2MW7141 | Kalexo Bio / Mabwell | (zero CT.gov footprint) | CDE registry
A19 | SGB-BS01 | Sanegene Bio | SGB-3403 / SGB-3908 / SGB-7342 (different programs) | CDE registry
A21 | Csl103 | Curigin (Korean) | (zero CT.gov footprint) | Korean MFDS
A23 | SNK-3468 | SynerK | (zero CT.gov footprint) | CDE registry
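
The aggressive search described above (strip every non-alphanumeric, then substring-match) can be sketched as follows. The function names and the sample signal texts are invented for illustration; only the technique and the deck codes come from this brief.

```python
import re


def squash(text: str) -> str:
    """Lower-case and remove everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def zero_hit_codes(codes: list[str], signal_texts: list[str]) -> list[str]:
    """Return the codes whose squashed form occurs in no squashed signal text."""
    haystacks = [squash(t) for t in signal_texts]
    return [c for c in codes if not any(squash(c) in h for h in haystacks)]


# Invented signal texts standing in for the corpus.
texts = [
    "A Phase 1 study of BEBT 701 in healthy volunteers",
    "STP705 injection for squamous cell carcinoma",
]
print(zero_hit_codes(["BEBT-706", "BEBT-701", "STP271G"], texts))
# ['BEBT-706', 'STP271G']
```

Because squashing also joins neighbouring words, a short code can match by accident, so this search is best read one way: a zero-hit result is strong evidence of absence, while a hit still needs a look at the underlying signal.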

Why missing: these codes appear in none of CT.gov, SEC, patents, PubMed, or OpenAlex. The deck author most likely sourced them from channels our sources don't index: corporate pipeline slides, IR presentations, internal IND filings. The Pass 3 sponsor sweep confirmed that BeBetter, Sirnaomics, Visirna, and Sanegene Bio do have CT.gov trial footprints, but only for other programs in their pipelines.

Where the analyst should look:

  • CDE (China Center for Drug Evaluation), chinadrugtrials.org.cn — the China-native trial registry. Most of these have published INDs there that never cross-register to CT.gov.
  • Chinese pharma aggregators — PharmCube, Yaozh, Insight DataBase, BioSpace-China — for IND announcements and corporate pipeline updates.
  • Renewed Lens.org subscription — Lens has direct CNIPA indexing; EPO catches Chinese filings only via PCT equivalents and is currently rate-limited for Chinese applicants in our corpus.

Pattern B — Placeholder-code assets with no public sponsor footprint

A08 Youngen · A13 Basecure · A18/A20 Thalia Therapeutics · A22 Argonaute RNA · A28 Corsera Health.

Why missing: the placeholder-code + company + target backstop fires correctly for these, but the sponsor has zero CT.gov / SEC / patent footprint in our corpus (small / private / preclinical-only firms). The sponsor-discovery harvester is the recovery mechanism.

Pattern C — Deals that don't reach evidence threshold

Looking at the per-counterparty signal counts in the dump:

Deal | Acq sigs | Lic sigs | Reason no evidence
D02 Qilu–Suzhou Ribo | 32 | 4 | Pure China-China; never co-occur
D03 Boehringer–Suzhou Ribo | 327 | 4 | Boehringer in corpus, Ribo too, no co-occurrence; Ribo has no SEC
D05 AZ–CSPC | 423 | 16 | AZ has SEC; CSPC matches via name but no co-occurrence and no CSPC SEC
D06 Merck–Hengrui | 716 | 16 | Hengrui appears as CT.gov sponsor only; no SEC presence
D09 Lilly–SanegeneBio | 408 | 4 | Both names appear separately; no Lilly filing mentions SanegeneBio in our corpus
D10 Madrigal–Suzhou Ribo | 18 | 4 | Madrigal has SEC; Ribo doesn't; no co-occurrence
D11 Akeso–Hubei Jumpcan | 0 | 0 | Both China-only, zero corpus footprint

Where the analyst should look:

  • HKEX disclosures (hkexnews.hk) — for CSPC (HK-listed), Hansoh (HK-listed), Hengrui (HK-listed since 2025), and any Chinese counterparty with an HK secondary listing.
  • Chinese pharma news aggregators — same as Pattern A. These are the only Western-accessible feeds for China-only deals.
  • PR Newswire / GlobalNewswire archives — paid, but they often carry the bilingual press release for cross-border deals when SEC misses them.

Pattern D — Comparators / combination arms absorbing huge signal counts

A32 obicetrapib+Repatha (242 signals) · A26 VERVE-101/102 (43) · A29 BW-00112 (30) · A30 YS2302018+AZD0780 (26) · A31 Enlicitide+MK-7262 (25).

These look like wins, but read the deck closely: these are combination arms and gene-editing assets the deck author included as context, not as primary dual-targeting siRNA candidates. They light up because their components (obicetrapib, Repatha/evolocumab, AZD0780, MK-7262) are mature drugs with rich trial / SEC / publication corpora. The dual-siRNA story does NOT depend on these — the analyst should treat their coverage as a sanity check, not as gap-analysis signal.

4. Unfixed obvious next steps

PR #107 named four deferred items; the list below adds a fifth. Status now:

  1. Patent CPC broadening (A61 + C12N) — shipped with ADR-0005 / #111.
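  2. CT.gov sponsor sweep — Pass 3 implemented in #113, opt-in per landscape. Recovered A02 directly and confirmed that several sponsors behind the 11 still-missing specific-code assets (BeBetter, Sirnaomics, Visirna, Sanegene Bio) have CT.gov footprints, though only for other programs.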
  3. Smarter deal matcher — per-deal bucket pass with same-signal / SEC cross-filing / code paths. Recovered D01, D04, D12 (PR #107 named D04, D09, D12 as targets; D09 still unmatched because Lilly's SEC filings genuinely don't mention SanegeneBio in our corpus).
  4. Renew Lens.org API access — manual account action. Probe returned 401 on 2026-06-09. Without renewal, CNIPA-indexed Chinese composition patents stay missing; EPO fallback is partial (403/404 on Chinese applicants).
  5. Sponsor-discovery harvester — real Explore-mode answer to "analyst doesn't know the players yet". Scoped to PR #109. Targets the 17 still-missing assets (11 specific-code gaps + 6 placeholder-sponsor gaps) and the China-counterparty deals (D02, D03, D06, D11). Evaluated against this brief's 15/32 + 5/12 baseline.

5. The analyst's task

Two CSVs live in archived_data/:

  • sirna_coverage_dump_<date>.csv — 352 rows, one per (asset_id, source). Columns include signal_count, latest_signal_date, sample_signal_summary (truncated 300 chars), target_pair_in_raw_text, modality_subclass_in_raw_text, and a notes column flagging placeholder-code rows.
  • sirna_deals_coverage_dump_<date>.csv — 12 rows, one per deal. Columns include deal_matched (strict yes/no), same_signal_match / cross_filing_match / code_match (which path fired), matched_acquirer_signal_count / matched_licensor_signal_count (coverage diagnostic), matched_signal_count (broad union), matched_sources, and the has_upfront_in_text / has_total_in_text / has_geo_terms_in_text flags.
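
For analysts who prefer a script to a spreadsheet, the asset dump can be rolled up per asset with the standard library alone. The sketch below is illustrative: it assumes the asset_id, source, and signal_count columns described above, the source labels are hypothetical, and the sample rows are invented (a real run would open the CSV in archived_data/ instead of the inline string).

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the shape of the asset coverage dump.
SAMPLE = """asset_id,source,signal_count
A01,clinicaltrials,3
A01,sec_edgar,0
A04,clinicaltrials,0
A04,sec_edgar,0
A32,clinicaltrials,200
A32,sec_edgar,42
"""


def asset_totals(rows) -> dict[str, int]:
    """Sum signal_count across sources for each asset_id."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["asset_id"]] += int(row["signal_count"])
    return dict(totals)


totals = asset_totals(csv.DictReader(io.StringIO(SAMPLE)))
covered = sorted(a for a, n in totals.items() if n > 0)
missing = sorted(a for a, n in totals.items() if n == 0)
print(covered, missing)  # ['A01', 'A32'] ['A04']
```

Sorting the totals in descending order is a quick way to spot Pattern D rows, whose counts sit far above the rest.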

Both load cleanly in Excel / Numbers. The match-path booleans let the analyst quickly partition deals into "covered with bilateral evidence" (deal_matched=True), "both names in corpus but no co-occurrence" (deal_matched=False, both per-counterparty counts > 0; a Pattern C / B signal), and "one or both counterparties absent" (at least one count = 0; a pure data gap).
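
The three-way partition can be expressed as a small function over the deals dump. This is a sketch under assumptions: the deal_id column name and the True/False text encoding of deal_matched are guesses about the CSV, while the other column names and the sample counts for D02, D11, and D12 are taken from this brief.

```python
import csv
import io

# Three rows in the shape of the deals dump; deal_id is a hypothetical column.
SAMPLE = """deal_id,deal_matched,matched_acquirer_signal_count,matched_licensor_signal_count
D02,False,32,4
D11,False,0,0
D12,True,99,97
"""


def bucket(row: dict) -> str:
    """Classify one deal row into the three partitions described above."""
    # ASSUMPTION: booleans are serialised as text such as "True" or "yes".
    if row["deal_matched"].strip().lower() in {"true", "yes", "1"}:
        return "covered"
    acq = int(row["matched_acquirer_signal_count"])
    lic = int(row["matched_licensor_signal_count"])
    if acq > 0 and lic > 0:
        return "no_co_occurrence"
    return "counterparty_absent"


buckets = {row["deal_id"]: bucket(row)
           for row in csv.DictReader(io.StringIO(SAMPLE))}
print(buckets)
```

Here D12 lands in "covered", D02 in "no_co_occurrence", and D11 in "counterparty_absent".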

What to do with the CSVs

  1. Fill the yellow columns of the original reference-deck CSVs (sheets 01_Assets, 02_Deals, 03_Insights, 04_Structure_Views) using the dumps. The gap-type column (G1 missing-automatable, G2 partial, G3 wrong-value, G4 missing-provenance, G5 out-of-scope) is where the analytical work lives.
  2. Verify the A10 BSA204 ↔ VSA003 question. Visirna's CT.gov trials carry VSA003. The deck has BSA204. The B↔V prefix swap with a 3-digit numerical change (204 → 003) is suspicious — could be a transcription error in the deck, a different Visirna program, or an internal-vs-clinical code distinction. Worth direct sponsor / IR verification before declaring it a coverage gap.
  3. For each missing asset / deal, propose the recovery option from §3. Be concrete: "would surface via CDE for ~6 of the 8 specific-code-missing assets" beats "needs Chinese sources".
  4. For each "covered" row, sanity-check the sample_signal_summary — does the signal actually substantiate the deck's claim, or did the matcher over-match? D01 Innovent-SanegeneBio specifically merits a manual SEC-text check.
  5. Flag the Pattern D over-coverage — A32 etc. are not siRNA dual-target stories. Mark them as out-of-scope context.

Reporting back

Final write-up in this file's replacement (docs/sirna_landscape_gap_analysis.md). Include: per-field coverage table (the measured per-row reality replacing §2 here); ranked recovery options with estimated uplift, effort, cost, recommendation; China-coverage sub-audit per §3 Patterns A + B + C; open questions for the dev team.


Appendix — quick reference

  • Deck date: 03-Apr-2026; data cutoff ~Q1 2026.
  • Asset count: 32 (A01–A33 with A24 skipped by the deck author).
  • Deal count: 12 (D01–D12).
  • Acknowledged deck gaps: the IND-Enabling detail slides 6/7 and the second deal-details page (slide 2/2) are missing. Rows tagged Map (slide 4) or Timeline (detail = missing slide 2/2) have less ground-truth depth.
  • Target-pair normalisation in the deck: A + B with PCSK9 listed first when present; expect fuzzy matching across the separators /, x, &, and "and", plus symbol variants such as Lp(a) vs LPA and APOC vs APOC3.
  • Ogur reference docs: CLAUDE.md (repo overview), docs/architecture.md (source contracts, signal model), docs/data-sources.md (per-source auth + rate limits), docs/adr/0005-sponsor-discovery-harvester.md (per-landscape source_filters contract).
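
The target-pair normalisation noted above could be handled with a small helper like the one below. It is a hedged sketch, not the inspect script's logic: the function name, the alias table (including reading APOC as shorthand for APOC3), and the separator list are assumptions drawn from the variants listed in this appendix.

```python
import re

# ASSUMPTION: alias table built from the variants mentioned in the appendix.
ALIASES = {"LP(A)": "LPA", "APOC": "APOC3"}
SEPARATORS = re.compile(r"\s*(?:/|&|\+|\bx\b|\band\b)\s*", re.IGNORECASE)


def normalise_pair(raw: str) -> tuple[str, ...]:
    """Return the target symbols in canonical order, PCSK9 first when present."""
    text = raw.strip()
    # Unwrap a fully parenthesised descriptor such as "(AGT/ANGPTL3)".
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    parts = [p.strip().upper() for p in SEPARATORS.split(text) if p.strip()]
    symbols = {ALIASES.get(p, p) for p in parts}
    # False sorts before True, so PCSK9 leads; the rest are alphabetical.
    return tuple(sorted(symbols, key=lambda s: (s != "PCSK9", s)))


print(normalise_pair("Lp(a) & PCSK9"))      # ('PCSK9', 'LPA')
print(normalise_pair("PCSK9 x APOC"))       # ('PCSK9', 'APOC3')
print(normalise_pair("(AGT/ANGPTL3)"))      # ('AGT', 'ANGPTL3')
print(normalise_pair("ANGPTL3 and PCSK9"))  # ('PCSK9', 'ANGPTL3')
```

Comparing normalised tuples rather than raw strings lets a deck entry and a signal's target_pair_in_raw_text value be matched regardless of separator or symbol spelling.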