Dual-targeting siRNA landscape: how much can the system map from a single sentence?

What this report measures

Our analyst built, by hand, a reference list for the dual-targeting siRNA space: 32 drugs and 12 deals, plus the companies behind them. That hand-built list is our "right answer" — the thing we score against.

This report asks: how much of that list can the system reproduce on its own, given only a one-sentence brief — "Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease" — and nothing else?

To keep the result honest, we ran it three different ways. They sound similar but mean very different things, and the difference between them is the main point of this report. Each is defined below.

How we score

"Found 15 of 32 drugs" means: of the 32 drugs on the analyst's list, the system surfaced 15. We report this as a fraction and a percentage. Higher is better.
Precision means: of everything the system flagged, how much was actually on the analyst's list rather than noise. (A system that flags 1,000 companies to catch 15 real ones scores well on "found" but terribly on precision.)
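
The two scores above can be sketched as set arithmetic. This is a minimal illustration, not the report's actual scoring code: the function names `found_rate` and `precision` and the toy item names are assumptions made up for the example.

```python
def found_rate(reference: set[str], flagged: set[str]) -> float:
    """Share of the analyst's list that the system surfaced (often called recall)."""
    if not reference:
        return 0.0
    return len(reference & flagged) / len(reference)


def precision(reference: set[str], flagged: set[str]) -> float:
    """Share of everything the system flagged that is on the analyst's list."""
    if not flagged:
        return 0.0
    return len(reference & flagged) / len(flagged)


# Hypothetical toy data, not the real reference list.
reference = {"drug_a", "drug_b", "drug_c", "drug_d"}
flagged = {"drug_a", "drug_b", "noise_1", "noise_2", "noise_3", "noise_4"}

print(f"found: {found_rate(reference, flagged):.0%}")     # 2 of 4 reference items
print(f"precision: {precision(reference, flagged):.0%}")  # 2 of 6 flagged items
```

Flagging more items can only raise the "found" score, but each extra miss lowers precision, which is why the report gives both.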

The three tests (defined)

Test 1 — Given the analyst's list: "is each item public at all?" We take the finished list and, for each drug and deal, ask the public sources: is there any public record of this specific named item? Because we already hold the answer, this isn't really discovery — it measures whether the public sources contain the information at all, i.e. the best any system could do with today's sources. Like an open-book exam where you're handed the answer key and just confirm each answer exists.

Test 2 — From the one-sentence brief, searching data we had already collected. We throw the list away and give the system only the one sentence. It works out the targets, diseases and drug type the sentence implies, then searches a database we had built up over earlier runs. The catch: that database was filled, in part, by previously looking up those very company and drug names — so finding them again is partly re-finding what we had already filed away. This number is therefore flattering: it shows the method works, but it is helped by data that already knew the answer. Like a closed-book exam where the textbook on your desk happens to have been written from the answer key.

Test 3 — From the one-sentence brief, searching the live web from scratch. The strictest and truest test. Give the system only the one sentence and let it query the live public sources fresh, with no pre-built database to lean on. Whatever it finds, it found cold. Like a closed-book exam with no notes — you find everything from scratch.

Results

What the system found	Test 1: given the list	Test 2: brief + saved data	Test 3: brief + from scratch
Companies	—	18 of 36 (50%)	7 of 36 (19%)
Drugs (via their company)	17 of 32 (53%)	15 of 32 (47%)	3 of 32 (9%)
Deals	11 of 12 (92%)	9 of 12 (75%)	0 of 12 (0%)
Precision (share that was real)	—	~23%	~14%

Read Tests 2 and 3 together:

Test 2 nearly matches Test 1's ceiling on drugs (47% vs the 53% maximum the sources even allow) and finds 9 of 12 deals. The method — read the sentence, work out the companies, look up their filings, match the deals — clearly works end to end.
Test 3 (cold, from scratch) drops to 9% of drugs and 0 deals. That collapse is the real finding — and it is expected, not a fault. The next section explains exactly why.
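
As a check on the arithmetic, the sketch below recomputes the percentages from the counts in the table and expresses Tests 2 and 3 as a share of the Test 1 ceiling. The "share of ceiling" figure is derived here and is not a number from the report. It assumes everything Test 2 or Test 3 found is also among the items Test 1 confirmed as public, which the report does not state.

```python
# Counts copied from the results table: (found, total) per test.
results = {
    "drugs": {"test1": (17, 32), "test2": (15, 32), "test3": (3, 32)},
    "deals": {"test1": (11, 12), "test2": (9, 12), "test3": (0, 12)},
}


def pct(found: int, total: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * found / total)


def share_of_ceiling(kind: str, test: str) -> int:
    """Items found in `test` as a percentage of what Test 1 shows is public.

    Assumption: the items found in `test` are a subset of Test 1's items.
    """
    found, _ = results[kind][test]
    ceiling, _ = results[kind]["test1"]
    return round(100 * found / ceiling)


for kind, tests in results.items():
    row = ", ".join(f"{name}: {pct(*counts)}%" for name, counts in tests.items())
    print(f"{kind}: {row}")
    print(f"  test2 vs ceiling: {share_of_ceiling(kind, 'test2')}%")
    print(f"  test3 vs ceiling: {share_of_ceiling(kind, 'test3')}%")
```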

Why finding things "from scratch" is so hard (and why that's expected)

Five reasons that stack on each other:

1. You can't look up a company whose name you don't have yet. The richest sources for deals and for Chinese drugs (Hong Kong exchange filings, mainland-China filings, US regulatory filings, and patents-by-owner) are searched by company name. There is no "show me every siRNA company" button; you request a specific company's documents. Starting from a sentence, we don't have the names yet. That is the core chicken-and-egg problem.
2. The "search-by-topic" sources only work if the document literally says the words. Clinical-trial registries and scientific papers can be searched by topic, but an early-stage Chinese record typically reads "a study of BEBT-701 in high cholesterol," never "dual-targeting siRNA against PCSK9." The words we would search for aren't in the text, so the company never appears.
3. There is no free directory mapping "biological target → companies working on it." We checked: the main open drug database (OpenTargets) lists the drugs for a target but not the companies. The directories that do map targets to companies (Cortellis, Pharmaprojects, GlobalData) are paid products, the very tools Ogur competes with. So we can't cheaply turn the brief's targets into a company list; we have to build that directory ourselves.
4. Nearly half the list isn't public anywhere. 15 of the 32 drugs (47%) are early-stage (pre-human-trial) with no public footprint at all: no trial, no filing, no stock listing. Even a perfect system caps at 17 of 32, or ~53% (that is Test 1's ceiling).
5. The flattering 47% / 75% leans on the pre-built database. Strip that away and search cold, and you get the true 9% / 0%.

Plain version: the hard part isn't reading the documents — it's knowing which companies to look up in the first place, when the best sources require the name to search and no free directory connects the biology to the companies.
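
The difference between by-topic and by-name lookup can be shown with a toy example. This is a sketch under stated assumptions: the two records, the sponsor names "Company A" and "Company B", and the functions `topic_search` and `name_search` are hypothetical. Only the first title echoes the example record quoted above.

```python
# Hypothetical records; sponsor names are placeholders, not real companies.
records = [
    {"sponsor": "Company A", "title": "A study of BEBT-701 in high cholesterol"},
    {"sponsor": "Company B", "title": "siRNA against PCSK9 in hypercholesterolemia"},
]


def topic_search(records, terms):
    """Return sponsors whose title literally contains every search term."""
    hits = []
    for rec in records:
        title = rec["title"].lower()
        if all(term.lower() in title for term in terms):
            hits.append(rec["sponsor"])
    return hits


def name_search(records, sponsor):
    """Return titles filed under one named sponsor (a by-name lookup)."""
    return [rec["title"] for rec in records if rec["sponsor"] == sponsor]


# The topic search reaches only the record that spells out the words.
print(topic_search(records, ["siRNA", "PCSK9"]))
# The by-name lookup reaches the other record, but only once the name is known.
print(name_search(records, "Company A"))
```

No rephrasing of the topic query reaches the first record, because the words are not in the text. Only a name, or a tag added at collection time, gets there.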

Does casting a wider net help? (we tested it)

A natural question: Test 3 fires specific searches ("siRNA against PCSK9"). What if we cast a deliberately broad net instead — pull every siRNA trial, filing and paper across the cardiometabolic diseases, with no company names — and rebuild the company list from scratch that way?

We ran exactly that. It surfaced 50 companies, of which 9 were on the analyst's list — 23% of the 39 named companies, versus 19% for the targeted searches. Two things to read carefully:

This is a company count, not a drug count. The "9% of drugs" elsewhere in this report and this "23% of companies" are different measures — it is not a jump from 9 to 23. On the same measure (companies), broadening moved 19% → 23%: a small lift, then a plateau around 20–25%, nowhere near the 50% that leans on saved data.
The broad net is good at mapping the field, just not this deck. Among the 50 it surfaced were real RNAi-cardiometabolic players the analyst's deck doesn't even list — Arbutus, Dicerna, Quark, Silence, Sirius, Sylentis, AdARx, Eddingpharm. That's useful intelligence in its own right. But the deck's preclinical / Chinese picks (BeBetter, Rona, Visirna, Hengrui, …) have no siRNA-cardiometabolic record in any topic-searchable source, so no breadth of query reaches them.

Why it plateaus: the ceiling is which sources you can search by topic at all — not how the search is phrased. About half the list lives only on China's drug-trial registry (CDE) and on by-name exchange filings, which cannot be searched by topic. Reaching those companies means adding those registries (the plan below), not broadening the search.

What we built to lift Test 2 toward the ceiling

To get from the cold 9% / 0% up to Test 2's 47% / 75%, we built a four-step chain:

1. Break the one sentence into the targets, diseases and drug type it implies. (The model did this well: it worked out 5 of the 6 target genes from the sentence alone.)
2. Build a company list from our own collected data, since no free external directory exists. This surfaced ~40 candidate companies from the targets.
3. Look up those companies' filings in the by-name sources (Hong Kong / mainland-China / US), which is how the China-side deals (Qilu–Suzhou Ribo, Merck–Hengrui) finally come back.
4. Apply a noise filter so the company list isn't mostly junk, and a fix so no target is silently dropped.

This is why Test 2 comes close to the source ceiling (47% vs 53% of drugs, 9 vs 11 of 12 deals). It becomes a genuine "from scratch" result the moment that database is filled by cold sources instead of by looking names up, which is the plan below.
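
The chain can be sketched end to end with in-memory stand-ins. Everything here is hypothetical and for illustration only: the `Brief` class, the `DIRECTORY`, `FILINGS` and `NOISE` data, the placeholder targets `TARGET_B` and `TARGET_C`, and the fixed output of `parse_brief` (the real system uses a model for that step). It is not the system's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    targets: list[str]
    diseases: list[str]
    drug_type: str


# Hypothetical stand-ins for the saved database and the by-name filing sources.
DIRECTORY = {
    "PCSK9": {"Company A", "Company B"},
    "TARGET_B": {"Company B", "Noise Co"},
}
FILINGS = {"Company A": ["licensing deal with Company Z"]}
NOISE = {"Noise Co"}


def parse_brief(sentence: str) -> Brief:
    """Step 1 stand-in: returns fixed values instead of calling a model."""
    return Brief(["PCSK9", "TARGET_B", "TARGET_C"], ["cardiometabolic disease"], "siRNA")


def companies_for(brief: Brief) -> tuple[set[str], list[str]]:
    """Step 2: collect companies per target and report targets with no entry."""
    companies: set[str] = set()
    unmatched: list[str] = []
    for target in brief.targets:
        if target in DIRECTORY:
            companies |= DIRECTORY[target]
        else:
            unmatched.append(target)  # surfaced rather than silently dropped
    return companies, unmatched


def run(sentence: str) -> tuple[dict[str, list[str]], list[str]]:
    brief = parse_brief(sentence)
    companies, unmatched = companies_for(brief)
    kept = companies - NOISE  # step 4: noise filter, applied before the lookup
    filings = {c: FILINGS.get(c, []) for c in sorted(kept)}  # step 3: by-name lookup
    return filings, unmatched


filings, unmatched = run(
    "Map the competitive landscape of dual-targeting siRNA therapies for cardiometabolic disease"
)
print(filings)
print(unmatched)
```

Returning the unmatched targets alongside the filings is one simple way to make a dropped target visible. How the actual fix works is not described in this report.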

How to make the "from scratch" number real: the plan

Ranked by impact:

1. Build the "target → company" directory ourselves (the key enabler). No free external one exists, so we aggregate open feeds into our own: patents-by-owner (which includes Chinese owners), the China drug-trial registry (CDE), and trial sponsors. Once this directory exists, the system can go brief → targets → companies → look up their filings, breaking the chicken-and-egg. This is the same asset the paid incumbents spent years building, by aggregation rather than magic.
2. Add the early-stage registries for the 15 missing drugs: China's CDE, Korea's MFDS, and company pipeline pages.
3. Turn the patents-by-owner feed back on: renew our patents subscription (currently failing) or add the free Google Patents dataset, which indexes Chinese owners. This is the strongest cold way to get from a topic to a company name, and it directly feeds the directory in item 1.
4. Tag every record at collection time with its drug type and target, so topic-search works even when the document doesn't spell out "siRNA."
5. Add a dated press-release archive. This closes the one remaining deal and turns the news feed into a backward-looking source, not just a forward monitor.
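
The directory and the collection-time tags can be sketched together as an inverted index. This is a minimal illustration under assumptions: the feed labels, company names, record texts, the placeholder target `TARGET_B` and the function `build_directory` are all hypothetical, and the tags are written by hand here rather than produced by any real tagging step.

```python
from collections import defaultdict

# Hypothetical records from three open feeds, each tagged at collection time
# with a drug type and targets even when the text never says "siRNA".
records = [
    {"feed": "patents_by_owner", "company": "Company A",
     "text": "Oligonucleotide compositions for lipid disorders",
     "drug_type": "siRNA", "targets": ["PCSK9"]},
    {"feed": "cde_registry", "company": "Company B",
     "text": "A study of XYZ-1 in high cholesterol",
     "drug_type": "siRNA", "targets": ["PCSK9", "TARGET_B"]},
    {"feed": "trial_sponsors", "company": "Company C",
     "text": "Antibody study in high cholesterol",
     "drug_type": "antibody", "targets": ["PCSK9"]},
]


def build_directory(records, drug_type):
    """Invert tagged records into a target -> companies mapping for one drug type."""
    directory = defaultdict(set)
    for rec in records:
        if rec["drug_type"] != drug_type:
            continue  # keep only the drug type the brief asks for
        for target in rec["targets"]:
            directory[target].add(rec["company"])
    return dict(directory)


directory = build_directory(records, "siRNA")
print(sorted(directory["PCSK9"]))
```

Because the lookup runs on tags rather than on the record text, Company B is reachable from the target alone, without knowing its name first.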

The common thread: the finding machinery is built and proven, since Test 2 comes close to the source ceiling. What's left is upstream: feeding the company directory and the early-stage data from sources that don't require knowing the company name first.

One-line summary

Given a single sentence, the system rebuilds the landscape close to the limit of what public sources contain (47% of drugs against a 53% ceiling, 9 of 12 deals against 11) when it can lean on data we had already collected. Searching the live web cold, from scratch, it finds 9% of drugs and 0 deals. That gap is not a bug: it is the basic difficulty of finding companies you can't yet name, in sources that require the name to search, with no free directory linking biology to companies. Closing it means building that directory ourselves from patents, China/Korea registries, and trial sponsors, which is exactly the asset the established players built.