ADR-0002: Knowledge Graph Roadmap for Asset-Centric Intelligence

- Status: Proposed
- Date: 2026-04-27
- Driver: Khalil
- Related: ADR-0001 (monitoring-setup inputs), PR introducing SignalDrug (this branch), CLAUDE.md §"Core principles" (knowledge graph is infrastructure, never UI)

Context

The Signal table records every piece of intelligence the platform ingests, but it carries entity references as scalar nullable columns — drug_name, company, target, indication, moa, phase. That schema reflects 2025-era reality: each source had one canonical attribution, and the UI was a flat feed.

Two things have changed since:

Sources increasingly produce multi-entity signals. A Sanofi 10-K mentions a full pipeline. A patent on "anti-IL-4Rα antibodies" is relevant to dupilumab, lebrikizumab, and every IL-4Rα candidate. A review paper on biologics in atopic dermatitis covers 5+ drugs. Scalar columns force a lossy choice or NULL — and NULL is the dominant outcome (647/665 SEC signals, 100/100 patents in the immunology corpus had drug_name=NULL before this PR).
The product framing is asset-centric and franchise-aware. Asset Detail (/asset/dupilumab) needs every signal touching that drug. FranchisePortfolio needs all signals across the type-2 inflammation axis. Inspector needs signals reachable from the entity in focus. None of these views are answered well by an exact-match scalar filter.

ADR-0001 reshapes the retrieval layer (which drugs to fetch); this ADR reshapes the attribution layer (which entities each signal is about). The two are complementary: better retrieval reduces fan-out volume, better attribution makes the residual fan-out queryable.

A "knowledge graph" framing is the natural endpoint, but committing to a graph database (Neo4j, RDF) would over-engineer the current scale. The path adopted here is typed entity tables + typed edge tables in SQL + a small Python traversal helper — the same shape Palantir's Foundry ontology uses under the hood.

Decision

Adopt an entity-edge model expressed in the existing SQL schema, evolved one edge type at a time. Stop adding scalar entity columns to Signal. New entity links go into dedicated edge tables with provenance and confidence as first-class fields.

The SignalDrug linkage table introduced in this PR is the first edge type. Subsequent edges follow the same pattern.

Entities and edges (immunology-focused)

The current schema already has most entity tables. The gap is edges.

Existing entity tables

| Table | Identity | Has rows today? | Notes |
|---|---|---|---|
| `Drug` (`DrugProfile`) | `normalized_name` | yes (~50) | brand, company, target, MoA, phase as columns |
| `Target` | `normalized_name` | yes (~12) | display_name, target_class, uniprot_id |
| `Company` (`CompanyProfile`) | `normalized_name` | yes (sparse) | hq, stage_focus, market_cap_tier |
| `Landscape` | `id` | yes (2) | now carries `indications[]`, `targets[]`, `moa[]`, `horizon` after ADR-0001 |
| `Signal` | `id` | yes (~4k) | source-of-truth signal record |
| `KIQ` | `id` | yes | analyst-authored questions |
| `EvidenceRecord` | `id` | yes | extracted endpoint results |

Existing edge tables

| Table | Cardinality | Notes |
|---|---|---|
| `DrugSynonym` | drug ↔ alias | name resolution, not a real edge |
| `DrugTarget` | drug ↔ target (M2M) | populated by `build_target_graph.py` |
| `SignalDrug` | signal ↔ drug (M2M) | new in this PR |

Missing edge tables (proposed, in priority order)

| Edge | Why it matters in immunology | Example query it enables |
|---|---|---|
| `SignalTarget` | Patents and review papers describe the target, not the drug. The IL-4Rα patent without a drug name still belongs on the dupilumab page through the target. | "Patents touching IL-4Rα filed in the last 90 days." |
| `SignalCompany` | Replace the scalar `Signal.company` column. Co-marketed assets (Sanofi/Regeneron) and parent/subsidiary structures (Sanofi → Genzyme) need M2M. | "All Sanofi or Regeneron 10-K activity touching the franchise." |
| `SignalIndication` | A trial often spans indications (Dupixent CRSwNP + Asthma); a press release may cover several. | "Cross-indication phase transitions in the atopic march." |
| `DrugIndication` | A drug → many indications, and each pair has its own phase, label status, payer status. | "Drugs approved in AD that are in phase 2+ for asthma." |
| `IndicationCluster` | The atopic march is real and analyst-meaningful but absent from the schema today. | "Comorbidity-adjacent indications for AD." |
| `TargetPathway` | The type-2 axis (IL-4, IL-5, IL-13, TSLP, IL-33) and the JAK family are pathway groupings that drive competitive thinking. | "All drugs blocking the type-2 axis with phase 2+ activity." |
| `DrugMoA` | MoA differentiates within a target: JAK1-selective vs. dual JAK1/2 is a real positioning axis. ADR-0001 §"Why target and/or MoA" calls this out. | "JAK1-selective candidates vs. dual JAKi." |
| `DrugDrug` | Combos and head-to-head comparators. | "Head-to-head trials against dupilumab." |
| `TrialDrug` | One trial → many arms; today inferred from `Signal.drug_name`. | "Active phase 3 trials with ≥1 arm we monitor." |

Edge schema (canonical shape)

Every edge follows the SignalDrug shape — composite PK, link method, confidence, optional evidence pointer:

from datetime import datetime

from sqlmodel import Field, SQLModel

# Template: substitute <FromEntity>, <ToEntity>, <from>, and <to> for each concrete edge type.
class <FromEntity><ToEntity>(SQLModel, table=True):
    from_id: str = Field(foreign_key="<from>.id", primary_key=True)
    to_id:   str = Field(foreign_key="<to>.id",   primary_key=True)
    link_method: str       # "source_column" | "synonym_text_match" | "llm_extraction" | "open_targets" | "manual"
    confidence: float      # 0.0–1.0; downstream UI thresholds via <ConfidenceBadge>
    matched_evidence: str | None = None   # the snippet, OT id, or PMID that justified the edge
    created_at: datetime

The constants link_method accepts are deliberately a small enum, not free text — every method has a calibrated confidence range, and adding a method is a code change, not a config change.
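
One way to enforce the closed `link_method` set and its per-method confidence range is sketched below. This is an illustration under assumptions, not the project's code: `LinkMethod`, `CONFIDENCE_RANGE`, and `validate_edge` are hypothetical names, and the numeric ranges are placeholders (only the 0.7 for synonym text match appears in this ADR).

```python
from enum import Enum


class LinkMethod(str, Enum):
    """Closed set of ways an edge can be derived (values from the edge schema)."""

    SOURCE_COLUMN = "source_column"
    SYNONYM_TEXT_MATCH = "synonym_text_match"
    LLM_EXTRACTION = "llm_extraction"
    OPEN_TARGETS = "open_targets"
    MANUAL = "manual"


# ASSUMPTION: placeholder (low, high) ranges for illustration, not calibrated values.
CONFIDENCE_RANGE = {
    LinkMethod.SOURCE_COLUMN: (0.9, 1.0),
    LinkMethod.SYNONYM_TEXT_MATCH: (0.6, 0.8),
    LinkMethod.LLM_EXTRACTION: (0.3, 0.9),
    LinkMethod.OPEN_TARGETS: (0.5, 1.0),
    LinkMethod.MANUAL: (1.0, 1.0),
}


def validate_edge(link_method: str, confidence: float) -> LinkMethod:
    """Reject unknown methods and confidences outside the method's range."""
    method = LinkMethod(link_method)  # raises ValueError for free-text methods
    low, high = CONFIDENCE_RANGE[method]
    if not low <= confidence <= high:
        raise ValueError(
            f"confidence {confidence} outside [{low}, {high}] for {method.value}"
        )
    return method
```

Under these placeholder ranges, `validate_edge("synonym_text_match", 0.7)` passes, while an unlisted method such as `"fuzzy_guess"` raises `ValueError`. Adding a method therefore means touching both the enum and the range table, which is the "code change, not a config change" property described above.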

Why immunology specifically

Immunology is the most demanding test of this model in our current portfolio because competition is rarely target-exclusive. Three structural features force graph thinking:

Pathway-level competition. The type-2 inflammation axis is a network — IL-4, IL-5, IL-13, TSLP, IL-33, and their receptors all converge on Th2 effector function. Dupilumab (IL-4Rα, blocks IL-4 + IL-13 signaling) competes with tezepelumab (TSLP), mepolizumab (IL-5), tralokinumab (IL-13), and lebrikizumab (IL-13) — not because they share a target, but because they're all knocking down overlapping bits of the same axis. A target-only edge model misses this. A target → pathway edge plus a drug → target → pathway traversal answers "who else is in this axis."
MoA differentiation within a target. JAK is the cleanest example: JAK1 (abrocitinib, upadacitinib), JAK1/2 (baricitinib), JAK1/3 (tofacitinib), TYK2 (deucravacitinib) all hit the JAK family, but the safety/efficacy profile and the regulatory framing differ enough that an analyst sorting by "JAK inhibitor" gets a useless bucket. ADR-0001 already requires moa as a landscape input; the graph needs a drug ↔ moa edge to honor it at query time.
Indication overlap and the atopic march. AD ↔ asthma ↔ allergic rhinitis ↔ EoE ↔ CRSwNP are clinically linked: patients with severe AD are 2–3× likelier to develop asthma; the Dupixent franchise is the commercial expression of that biology. A drug studied in 4 of these 5 indications belongs in the same competitive cluster across all of them. Today this is encoded only in the frontend mock (SIMULATED_ASSETS reuses company: 'Sanofi / Regeneron' as a stringly-typed proxy). It needs a real IndicationCluster table with edges into Indication.

These three axes — pathway, MoA, indication cluster — are the first-order queries the Asset Detail and FranchisePortfolio views should answer cheaply. Today none of them are queryable without ad-hoc Python loops.
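
The drug → target → pathway traversal described above can be sketched with two edge tables in SQLite. This is a minimal illustration, assuming hypothetical table names (`drug_target`, `target_pathway`), hypothetical ids, and a hypothetical function name (`competitors_by_pathway`); the seed rows restate the drug/target pairs named in this section, and the provenance columns are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug_target (
    drug_id TEXT NOT NULL,
    target_id TEXT NOT NULL,
    PRIMARY KEY (drug_id, target_id)
);
CREATE TABLE target_pathway (
    target_id TEXT NOT NULL,
    pathway_id TEXT NOT NULL,
    PRIMARY KEY (target_id, pathway_id)
);
""")
conn.executemany("INSERT INTO drug_target VALUES (?, ?)", [
    ("dupilumab", "il4ra"),
    ("tezepelumab", "tslp"),
    ("mepolizumab", "il5"),
    ("tralokinumab", "il13"),
    ("lebrikizumab", "il13"),
    ("deucravacitinib", "tyk2"),
])
conn.executemany("INSERT INTO target_pathway VALUES (?, ?)", [
    ("il4ra", "type2_axis"),
    ("tslp", "type2_axis"),
    ("il5", "type2_axis"),
    ("il13", "type2_axis"),
    ("tyk2", "jak_family"),
])


def competitors_by_pathway(conn, drug_id):
    """Other drugs whose targets share a pathway with any target of drug_id."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.drug_id
        FROM drug_target AS mine
        JOIN target_pathway AS tp_mine ON tp_mine.target_id = mine.target_id
        JOIN target_pathway AS tp_other ON tp_other.pathway_id = tp_mine.pathway_id
        JOIN drug_target AS other ON other.target_id = tp_other.target_id
        WHERE mine.drug_id = ? AND other.drug_id != ?
        ORDER BY other.drug_id
        """,
        (drug_id, drug_id),
    ).fetchall()
    return [row[0] for row in rows]


print(competitors_by_pathway(conn, "dupilumab"))
# ['lebrikizumab', 'mepolizumab', 'tezepelumab', 'tralokinumab']
```

None of the four returned drugs shares dupilumab's target, so a target-only join would return nothing here. The pathway edge is what puts them in the same competitive set.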

Architecture

        ┌─────────────────────────────────────────────────────┐
        │              Entity tables (typed nodes)            │
        │  Drug · Target · Company · Indication · Pathway     │
        │  MoA · Trial · Endpoint · KIQ · Landscape           │
        └─────────────────────────────────────────────────────┘
                               △
                               │  FK references
                               │
        ┌─────────────────────────────────────────────────────┐
        │              Edge tables (typed M2M)                │
        │  SignalDrug ✅      SignalTarget                    │
        │  SignalCompany      SignalIndication                │
        │  DrugIndication     IndicationCluster               │
        │  TargetPathway      DrugMoA                         │
        │  DrugDrug           TrialDrug                       │
        └─────────────────────────────────────────────────────┘
                               △
                               │  reads
                               │
        ┌─────────────────────────────────────────────────────┐
        │     ogur/graph/ — thin Python traversal helper      │
        │                                                     │
        │  signals_for_drug(drug, hops=1)                     │
        │  signals_for_drug(drug, via="target", hops=2)       │
        │  competitors_of(drug, by="pathway")                 │
        │  drugs_in_franchise(landscape)                      │
        └─────────────────────────────────────────────────────┘
                               △
                               │  used by
                               │
        ┌─────────────────────────────────────────────────────┐
        │   API routes · briefing engine · analyzers · UI     │
        └─────────────────────────────────────────────────────┘

Key architectural choices:

No graph DB. All edges live in SQL. Composite PK + indexes give us cheap lookups; Python composes traversals. The cost of adding Neo4j (operational complexity, model duplication, migration churn) is not justified at our scale (~4k signals, ~50 drugs).
Edges carry provenance. link_method + confidence + matched_evidence mean every edge can be defended on the page where it surfaces — directly serving the "provenance is non-negotiable" principle in CLAUDE.md.
The traversal helper is thin. A few well-named functions, not a query language. If we ever outgrow that, the helper boundary is the migration point — swap the implementation, keep the call sites.
Scalar columns on Signal stay during migration. Signal.drug_name remains the source-authored attribution; the SignalDrug table is a derived index built on top of it. Same pattern when we deprecate Signal.company later: edge first, column removal last.

Implementation phases

Each phase is one PR. Phases are independently shippable: none is blocked on another beyond the dependencies stated in each phase.

Phase 0: `SignalDrug` (this PR, complete)

SignalDrug model + idempotent backfill enricher (column copy + word-boundary synonym match).
/api/signals?drug_name=X reads column ∪ linkage table.
~21 new tests; full suite green.
Demo win: unblocks SEC/conference signals on Asset Detail. Patent visibility waits for ADR-0001 phase 2 to widen retrieval.
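
The two backfill passes (column copy, then word-boundary synonym match) might look roughly like the sketch below. It is not the enricher shipped in this PR: the table layouts, the lowercase-name-as-id convention, the 1.0 confidence for column copies, and the `abc-1` alias are all assumptions made for illustration. Only the 0.7 for synonym matches comes from this ADR.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signal (id TEXT PRIMARY KEY, drug_name TEXT, text TEXT);
CREATE TABLE drug_synonym (alias TEXT PRIMARY KEY, drug_id TEXT NOT NULL);
CREATE TABLE signal_drug (
    signal_id TEXT NOT NULL,
    drug_id TEXT NOT NULL,
    link_method TEXT NOT NULL,
    confidence REAL NOT NULL,
    matched_evidence TEXT,
    PRIMARY KEY (signal_id, drug_id)
);
""")
conn.executemany("INSERT INTO signal VALUES (?, ?, ?)", [
    ("s1", "dupilumab", "Phase 3 readout in atopic dermatitis"),
    ("s2", None, "10-K: Dupixent net sales grew year over year"),
    ("s3", None, "Update on the ABC-12 program"),  # must NOT match alias abc-1
])
conn.executemany("INSERT INTO drug_synonym VALUES (?, ?)", [
    ("dupixent", "dupilumab"),
    ("abc-1", "hypothetical_drug"),  # made-up alias to show the word boundary
])


def backfill_signal_drug(conn):
    """Idempotent backfill: INSERT OR IGNORE skips edges that already exist."""
    # Pass 1: copy the source-authored scalar column into the edge table.
    conn.execute(
        """
        INSERT OR IGNORE INTO signal_drug
        SELECT id, lower(drug_name), 'source_column', 1.0, drug_name
        FROM signal WHERE drug_name IS NOT NULL
        """
    )
    # Pass 2: word-boundary synonym match over the signal text.
    synonyms = conn.execute("SELECT alias, drug_id FROM drug_synonym").fetchall()
    signals = conn.execute("SELECT id, text FROM signal").fetchall()
    for alias, drug_id in synonyms:
        pattern = re.compile(rf"\b{re.escape(alias)}\b", re.IGNORECASE)
        for signal_id, text in signals:
            match = pattern.search(text or "")
            if match:
                conn.execute(
                    "INSERT OR IGNORE INTO signal_drug VALUES (?, ?, ?, ?, ?)",
                    (signal_id, drug_id, "synonym_text_match", 0.7, match.group(0)),
                )
    conn.commit()


backfill_signal_drug(conn)
backfill_signal_drug(conn)  # second run inserts nothing new
```

Because the composite primary key is `(signal_id, drug_id)`, rerunning the backfill cannot duplicate an edge, and an edge created from the source column is never overwritten by a lower-confidence synonym match.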

Phase 1: `SignalTarget` and LLM extraction pass (~3 days)

Add SignalTarget table (signal ↔ target M2M).
Backfill via target name regex + Haiku-based extraction over signals where deterministic match misses (target families, code names, indirect references like "anti-Th2 antibody").
Demo win: patents about IL-4Rα/IL-13 surface on dupilumab/lebrikizumab even when the abstract names no drug — closes the residual gap from ADR-0001 phase 2 retrieval.

Phase 2: `SignalCompany` (~2 days)

Supersede the scalar Signal.company column with the edge table; the column itself stays until Phase 5 (edge first, column removal last).
Includes a normalization pass — "REGENERON PHARMACEUTICALS, INC." and "Regeneron Pharmaceuticals" become one company id.
Demo win: Sanofi/Regeneron co-ownership of the Dupixent franchise becomes queryable; SEC filings find the right asset(s).
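
A normalization pass of the kind described could start from something as small as the sketch below. The function name `normalize_company`, the suffix list, and the underscore-joined id format are assumptions for illustration. A real pass would need a longer suffix list and probably a manual override table.

```python
import re

# ASSUMPTION: a small hand-picked suffix list, for illustration only.
_LEGAL_SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "plc", "sa", "ag", "co"}


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation and trailing legal suffixes, join with '_'."""
    # Drop periods first so "S.A." collapses to "sa" rather than "s a".
    cleaned = name.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9 ]+", " ", cleaned).split()
    while tokens and tokens[-1] in _LEGAL_SUFFIXES:
        tokens.pop()
    return "_".join(tokens)


print(normalize_company("REGENERON PHARMACEUTICALS, INC."))  # regeneron_pharmaceuticals
print(normalize_company("Regeneron Pharmaceuticals"))        # regeneron_pharmaceuticals
```

Both spellings from the example above collapse to the same key, which can then serve as (or map to) the company id referenced by `SignalCompany`.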

Phase 3: `IndicationCluster` and `TargetPathway` (~3 days)

Two tables encoding the atopic march and the type-2 axis respectively.
Seeded by hand for immunology-001 from clinical literature; documented in docs/data-sources.md.
Demo win: FranchisePortfolio can answer "all signals across the type-2 axis" without per-target unions.

Phase 4: `ogur/graph/` traversal helper (~2 days)

Replace the ad-hoc joins in API routes and analyzers with named primitives.
Ship signals_for_drug(drug, hops=1, min_confidence=0.5) first; expand on demand.
Outcome: API code stops reasoning about edge tables directly; future edges are additive, not invasive.
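
As a rough sketch of what the first primitive might look like, the function below resolves direct edges at `hops=1` and adds signals reachable through the drug's targets at `hops=2`. The table layouts and sample rows are assumptions, the `via` parameter from the architecture diagram is omitted, and `min_confidence` is applied only to the signal-side edge. How confidence should combine along a multi-hop path is left open.

```python
import sqlite3


def signals_for_drug(conn, drug, hops=1, min_confidence=0.5):
    """Return sorted signal ids linked to `drug`.

    hops=1: direct signal_drug edges.
    hops=2: additionally, signals attached to any target of the drug.
    """
    if hops not in (1, 2):
        raise ValueError("this sketch only supports hops=1 or hops=2")
    query = """
        SELECT signal_id FROM signal_drug
        WHERE drug_id = :drug AND confidence >= :min_conf
    """
    if hops == 2:
        query += """
        UNION
        SELECT st.signal_id
        FROM drug_target AS dt
        JOIN signal_target AS st ON st.target_id = dt.target_id
        WHERE dt.drug_id = :drug AND st.confidence >= :min_conf
        """
    query += " ORDER BY 1"
    params = {"drug": drug, "min_conf": min_confidence}
    return [row[0] for row in conn.execute(query, params)]


# Hypothetical sample data to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signal_drug (signal_id TEXT, drug_id TEXT, confidence REAL,
                          PRIMARY KEY (signal_id, drug_id));
CREATE TABLE signal_target (signal_id TEXT, target_id TEXT, confidence REAL,
                            PRIMARY KEY (signal_id, target_id));
CREATE TABLE drug_target (drug_id TEXT, target_id TEXT,
                          PRIMARY KEY (drug_id, target_id));
INSERT INTO signal_drug VALUES ('sec-1', 'dupilumab', 1.0), ('pr-1', 'dupilumab', 0.3);
INSERT INTO signal_target VALUES ('patent-1', 'il4ra', 0.8);
INSERT INTO drug_target VALUES ('dupilumab', 'il4ra');
""")

print(signals_for_drug(conn, "dupilumab"))          # ['sec-1']
print(signals_for_drug(conn, "dupilumab", hops=2))  # ['patent-1', 'sec-1']
```

Call sites see only the function signature, so swapping the SQL for another backend later would not touch route handlers or analyzers.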

Phase 5: column deprecations (~1 day)

Once edges supersede them, remove Signal.drug_name, Signal.company, Signal.target, Signal.indication, Signal.moa columns.
Migration script: backfill then drop. Behind a feature flag if any live consumer remains.
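
Before the drop step, a migration could verify that no scalar attribution would be lost. The check below is a sketch under assumptions: the table layouts are simplified, `uncovered_drug_names` is a hypothetical name, and it assumes the edge's `drug_id` is the lowercased `drug_name`. The actual column drop is left to the migration tooling.

```python
import sqlite3


def uncovered_drug_names(conn):
    """Signals whose scalar drug_name has no matching signal_drug edge."""
    return conn.execute(
        """
        SELECT s.id, s.drug_name
        FROM signal AS s
        LEFT JOIN signal_drug AS sd
          ON sd.signal_id = s.id AND sd.drug_id = lower(s.drug_name)
        WHERE s.drug_name IS NOT NULL AND sd.signal_id IS NULL
        ORDER BY s.id
        """
    ).fetchall()


# Hypothetical sample data: s3 has a drug_name but no edge yet.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signal (id TEXT PRIMARY KEY, drug_name TEXT);
CREATE TABLE signal_drug (signal_id TEXT, drug_id TEXT,
                          PRIMARY KEY (signal_id, drug_id));
INSERT INTO signal VALUES ('s1', 'Dupilumab'), ('s2', NULL), ('s3', 'lebrikizumab');
INSERT INTO signal_drug VALUES ('s1', 'dupilumab');
""")

print(uncovered_drug_names(conn))  # [('s3', 'lebrikizumab')]
```

An empty result is the precondition for dropping `Signal.drug_name`. A non-empty one means the backfill has to run again first.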

Consequences

Easier:

- Multi-entity signals stop being a lossy ingestion problem.
- Confidence becomes a first-class concept, hooking directly into the <ConfidenceBadge> UI primitive (docs/design/ux-spec.md §6.2).
- Adding a new entity type or edge type is local: one model, one enricher, one test file. No global refactor.
- Asset Detail / FranchisePortfolio can ask graph-shaped questions without ad-hoc SQL in route handlers.

Harder:

- More tables to keep in sync across enrichers, seeders, and tests.
- Edge backfills have to run in the right order (e.g. SignalDrug before any traversal that joins through it). This PR's `make signal-drug-links` target is the model.
- Confidence calibration is an ongoing concern: the 0.7 for synonym text match was eyeballed; a labeled eval set should pin it.

Still unsolved (deliberately deferred):

- Generic edge table. If we end up with 6+ typed edge tables that all share the same shape, an Edge(from_type, from_id, to_type, to_id, type, confidence, evidence) table becomes worth considering. Premature today; revisit after Phase 4.
- Temporal edges. The competitive landscape changes: a drug discontinued in AD is still relevant to the franchise as a historical comparator. Edges may need valid_from / valid_to. Not in scope for the next year of phases.
- Cross-landscape edges. A drug in oncology that flips into immunology (e.g. anti-CD20 in lupus) needs two-landscape membership. Currently Landscape.candidate_drugs is a flat list per landscape; cross-landscape sharing is unmodeled.
- Entity resolution as a first-class step. DrugSynonym covers drug aliases; companies and targets need analogous resolution layers. Phase 2 (SignalCompany) builds the first; the rest can wait.

Concrete query examples (after Phase 4)

| Question an analyst asks | Today | After Phase 4 |
|---|---|---|
| "All signals about dupilumab in the last 30 days." | `WHERE drug_name='dupilumab'` (misses SEC/patents) | `signals_for_drug('dupilumab', hops=1)` |
| "All signals about IL-4Rα antagonism." | not possible | `signals_for_drug('dupilumab', via='target', hops=2)` |
| "All signals about the type-2 inflammation axis." | not possible | `signals_for_drug('dupilumab', via='pathway', hops=3)` |
| "Has anything happened to a Dupixent comparator this week?" | manual roster | `signals_for_franchise('dupixent', signal_types=[…])` |
| "Patents threatening dupilumab IP." | not possible | `signals_for_drug('dupilumab', via='target', source='lens')` filtered by claim text |
| "JAK1-selective candidates with phase 2+ activity in AD." | not possible | `drugs_in_pathway('JAK', moa='JAK1-selective', indication='AD', phase_min=2)` |

Open questions

Confidence calibration. What labeled eval set do we use to tune the per-link_method confidence values? Proposal: a 100-signal manual labeling pass on the immunology corpus, seeded by analyst review. Scope target: end of next demo cycle.
LLM extraction model. Haiku is the default for cheap classification; a smaller open model (e.g. gliner for entity extraction, already in the codebase for the BIOPSY pipeline) might be enough for the deterministic-miss fallback. Decision deferred to Phase 1.
Edge invalidation on source updates. When a signal is re-ingested (raw_data changes), do edges re-derive from scratch or merge? Current SignalDrug enricher only inserts; it doesn't reconcile. A clean answer is "edges are deterministic functions of (signal, enricher version) — bump the version and rebuild." Out of scope here.
UI surface for confidence. Does the frontend expose confidence to analysts (a slider), or threshold it server-side? Current <ConfidenceBadge> shows it, but the API hides edge confidence behind a binary "is this a match" today. Worth a UX spec amendment alongside Phase 1.
Generic vs. typed edge tables. When does the cost of N typed tables exceed the cost of one generic edge table with discriminator columns? Likely after Phase 4; revisit then.