Architecture¶
Full system walkthrough. If a code reference below conflicts with what's in the repo, trust the code — ping the doc owner to fix.
§1 Ten-thousand-foot view¶
┌──────────────────────────────────────────────────────────────────────────┐
│ INGESTION (seed scripts) │
│ │
│ ClinicalTrials.gov ─┐ │
│ PubMed ─┤ │
│ OpenFDA ─┤ │
│ Open Targets ─┼──► Source adapters ──► normalize to Signal ──► │
│ Europe PMC ─┤ (ogur/sources/) content_hash dedup │
│ OpenAlex ─┤ │
│ SEC EDGAR ─┤ │
│ Lens / EPO OPS ─┤ │
│ Holo3 (Playwright) ─┘ │
│ │
└──────────────────────────────────┬───────────────────────────────────────┘
│
▼ SQLite (ogur.db) — Signals + Profiles + Targets
│
┌──────────────────────────────────┼───────────────────────────────────────┐
│ INTELLIGENCE ENGINE (ogur/engine/) │
│ │
│ ChangeDetector ──► AgentOrchestrator ──► Enricher ──► Synthesizer │
│ (pure Python) (5 DomainAgents, (pure (Sonnet — │
│ Haiku scoring) Python) streaming, │
│ KIQ-aware) │
│ │ │
│ KIQs + ENTITY CATALOG ─────────────┤ │
│ ▼ │
│ Verification gate (KIQ shape + │
│ entity-ref check) │
│ │
│ Evidence pipeline (CLI-only) ──► extractor/ ──► EvidenceRecord + │
│ (Haiku + GLiNER) ProtocolProfile │
│ │
│ QueryEngine (Haiku, ad-hoc Q&A) │
└──────────────────────────────────┬───────────────────────────────────────┘
│
▼ Briefing rows, per-tab analyses, ask responses
│
┌──────────────────────────────────┼───────────────────────────────────────┐
│ API (ogur/api/ — FastAPI) │
│ /health · /api/signals · /api/briefing/… · /api/ask · │
│ /api/landscapes/{id}/evidence/comparative │
└──────────────────────────────────┬───────────────────────────────────────┘
│
▼ HTTP/JSON
│
┌──────────────────────────────────┼───────────────────────────────────────┐
│ FRONTEND (frontend/ — Vite + React) │
│ Portfolio · Asset Detail (6 tabs) · Global Signals · Global Ask │
└──────────────────────────────────────────────────────────────────────────┘
Three hard boundaries:
1. Ingestion is separate from synthesis. scripts/seed_* scrapes; scripts/generate_briefing reads. The pipeline never re-scrapes.
2. DetectedChange is a value object. It references a Signal row — it never creates one. The dedup invariant (SHA-256 content hash, unique in DB) is preserved through the pipeline.
3. Engine writes to DB only through store/. FastAPI routes call engine functions, not DB queries, except for read-only list endpoints.
§2 Data model¶
SQLModel tables in ogur/models/:
Signal (signal.py)¶
The atomic unit of intelligence. Every source normalizes to this.
id UUID, primary key
source "clinicaltrials" | "pubmed" | "openfda" | "opentargets" |
"conferences" | "openalex" | "sec" | "lens" | "epo" |
"holo_conference" | "holo_pipeline"
source_id Original source ID (NCT, PMID, accession, etc.)
signal_type Enum — 20+ types (see below)
severity "high" | "medium" | "low"
drug_name Normalized generic (e.g. "pembrolizumab")
drug_brand_name "Keytruda"
company "Merck"
indication "Non-Small Cell Lung Cancer"
target "PD-1"
moa One-sentence mechanism of action
phase "Phase 3", "Approved", etc.
title str
summary str
raw_data JSON string — full source payload (re-parse if normalization changes)
detected_at timestamp ingestion saw it
event_date timestamp the underlying event happened (if known)
landscape_id FK to Landscape
content_hash SHA-256 truncated to 16 hex — UNIQUE constraint
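The field inventory above can be sketched as a plain dataclass — a stand-in for the real SQLModel class in `signal.py` (field names follow the inventory; types, defaults, and ordering are assumptions):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4


@dataclass
class Signal:
    source: str                      # "clinicaltrials", "pubmed", ...
    source_id: str                   # original source ID (NCT, PMID, accession)
    signal_type: str                 # one of the 20+ SignalType values
    severity: str                    # "high" | "medium" | "low"
    landscape_id: str                # FK to Landscape
    content_hash: str                # sha256(...)[:16] — UNIQUE in the DB
    drug_name: Optional[str] = None  # normalized generic, e.g. "pembrolizumab"
    title: str = ""
    summary: str = ""
    raw_data: str = "{}"             # full source payload as a JSON string
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    id: str = field(default_factory=lambda: str(uuid4()))


sig = Signal(
    source="clinicaltrials",
    source_id="NCT01234567",
    signal_type="phase_transition",
    severity="high",
    landscape_id="nsclc-001",
    content_hash="ab12cd34ef56ab78",
)
```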
SignalType enum (signal.py:7):
| Category | Types |
|---|---|
| Trial lifecycle | phase_transition, trial_registered, trial_amendment, trial_status_change, trial_enrollment, protocol_amendment |
| Literature | publication, conference_abstract |
| Regulatory | fda_approval, label_change, safety_signal, regulatory_event |
| Pipeline | pipeline_update, early_pipeline |
| Corporate | press_release, ma_announcement, licensing_deal, investment_round, leadership_change |
| IP | patent_filing |
| Visual intelligence | earnings_narrative, job_posting, kol_activity |
DrugProfile + DrugSynonym (drug.py)¶
Assembled from signals via scripts/build_drug_profiles.py. DrugSynonym maps every known alias ("Keytruda", "MK-3475", "lambrolizumab", "pembro") to the canonical generic name, so signals from different sources coalesce.
CompanyProfile (company.py)¶
Primary-keyed by normalized_name (no UUID — company name is the natural identifier). Aggregates recent_deals JSON list capped at 20.
Target + DrugTarget (target.py)¶
Bipartite graph: Target is a gene/protein node (HGNC symbol PK), DrugTarget is a weighted edge where evidence_count increments when a new source confirms the same pair. Built by scripts/build_target_graph.py from DrugProfile data.
Briefing (briefing.py)¶
Stores the synthesizer's structured JSON as columns. landscape_id is overloaded to encode composite identifiers:
| landscape_id format | Produced by |
|---|---|
| `nsclc-001` | Landscape-level briefing (`generate_briefing.py`) |
| `nsclc-001-pembrolizumab` | Drug-level briefing (`generate_drug_briefing.py`) |
| `nsclc-001-pembrolizumab-overview` | Per-tab analyzer (overview) |
| `nsclc-001-pembrolizumab-trials` | Per-tab analyzer (trials) |
| `nsclc-001-pembrolizumab-competitive` | Per-tab analyzer (competitive) |
Full column inventory:
- Synthesis fields — `executive_summary`, `signal_analyses` (JSON list), `strategic_implications`, `watchlist` (JSON list), `predictions` (JSON list). These mirror the synthesizer's structured output.
- Harness fields (added with the verification gate):
  - `kiq_answers` — JSON list of structured KIQ responses (one per active KIQ; see §4.8).
  - `schema_valid` — tri-state (None/True/False). `None` means no validator ran (e.g. a briefing for a landscape with no KIQs); `True`/`False` mean the verification gate passed/failed after synthesis retries.
  - `schema_errors` — JSON list of error strings emitted by the verification gate (see §4.8).
- Window + metadata — `period_start`, `period_end` (the briefing's lookback window), `signals_count` (post-classification count), `model_used` (the synthesizer model ID at generation time, e.g. `claude-sonnet-4-6`), `generated_at` (default `datetime.utcnow()` at insert).
- Identity — `id` (UUID PK), `landscape_id` (indexed; overloaded for composite keys per the table above).
KIQ (kiq.py)¶
A Key Intelligence Question — the intent-capture layer. One row per question per scope: question text, time_horizon enum (TACTICAL / OPERATIONAL / STRATEGIC), priority int, active bool. Seeded by scripts/seed_kiqs.py with stable IDs, so re-runs idempotently merge instead of duplicating. Every briefing is generated against the active KIQs for its scope, and the synthesizer produces one structured answer block per active KIQ.
Path C — landscape-level vs. drug-specific KIQs. KIQ scoping mirrors the Briefing landscape_id overload. Class-level questions ("How is the JAK inhibitor class evolving?") attach to the parent landscape (immunology-001). Drug-specific questions ("How is dupilumab differentiating against lebrikizumab?") attach to the drug-composite ID (immunology-001-dupilumab). When generate_drug_briefing.py runs for dupilumab, it loads KIQs keyed on immunology-001-dupilumab only — class-level KIQs do not bleed into individual drug briefings. The one-time migration that moved legacy parent-only KIQs into the new scopes is scripts/migrate_immunology_kiqs.py.
EvidenceRecord + ProtocolProfile (evidence.py)¶
Stored separately from Briefing — these are the durable "structured trial outcomes" tables.
- `ProtocolProfile` — one row per trial (PK `trial_id`). Holds parsed protocol fields: population, line of therapy, primary endpoint, biomarker selection, blinding, etc.
- `EvidenceRecord` — N rows per trial. Each is a single (arm, endpoint, value, unit, CI, HR, p, comparator) outcome with a `raw_excerpt` provenance string so the UI can show "this number came from here." Content-hashed on `(trial_id, drug, endpoint, arm, subgroup, source_id)` for dedup.
Briefing references trials by NCT ID inside signal_analyses / predictions prose; it does not hold foreign keys into Evidence. The relationship is intentionally one-way — Evidence is a query target, not a Briefing dependency.
Landscape (landscape.py)¶
Scope definition — indication, therapeutic area, tracked conditions/targets/companies (all stored as JSON strings). Currently two are seeded: nsclc-001 (oncology) and immunology-001 (dupilumab / atopic dermatitis).
§3 Ingestion¶
The Source base class (ogur/sources/base.py)¶
Abstract fetch(landscape) → list[Signal] contract. Each source:
- Gets a shared httpx.AsyncClient with 30 s timeout
- Uses tenacity retry: 3 attempts, exponential backoff 2→10 s, does not retry on 4xx client errors except as noted per source
- Computes content_hash via Source.compute_hash(source, source_id, signal_type) — deterministic, 16 hex chars from SHA-256
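A minimal sketch of that hash contract (the exact joining of the three strings inside `Source.compute_hash` — separators, casing — is an assumption; the triple, the SHA-256, and the 16-hex truncation are from the invariants in §8):

```python
import hashlib


def compute_hash(source: str, source_id: str, signal_type: str) -> str:
    """Deterministic 16-hex-char dedup key from the identity triple."""
    payload = f"{source}{source_id}{signal_type}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]


# Same triple ⇒ same key, so re-running a source collapses to a dedup hit;
# a different signal_type for the same source row produces a new key.
h1 = compute_hash("pubmed", "PMID12345", "publication")
h2 = compute_hash("pubmed", "PMID12345", "publication")
assert h1 == h2 and len(h1) == 16
```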
Source catalog¶
Ten production sources implemented in ogur/sources/, one file per source — except patents.py, which is a single adapter producing two Signal.source values (lens and epo) depending on which upstream returned the row. The base.py and visual_base.py files are abstract bases, not sources themselves.
See data-sources.md for per-source authentication, rate limits, and external documentation links.
Seed scripts¶
- `scripts/seed_nsclc.py` — runs every source concurrently for the `nsclc-001` landscape, logs counts per source, writes rows via `upsert_signal` (ignores duplicates on `content_hash`).
- `scripts/seed_immunology.py` — same pattern for `immunology-001`.
- Source failures are caught and logged — one failing API never crashes the whole run.
DrugProfile assembly¶
scripts/build_drug_profiles.py reads the signals already in the DB and aggregates one DrugProfile per distinct (normalized_name, landscape) — picking the most advanced phase seen, the most frequent company attribution, and the first target hit. It never hits the network.
§4 Intelligence engine¶
The core pipeline lives in ogur/engine/pipeline.py and chains four stages.
§4.1 Detect (detector.py)¶
Pure Python, no LLM calls. ChangeDetector.detect(since):
- Reads all Signals + DrugProfiles from the DB (one transaction each).
- Runs seven detectors producing `DetectedChange` value objects:
  - `_detect_new_drugs` — drugs with signals but no DrugProfile row
  - `_detect_phase_changes` — signal phase > profile phase (uses `_PHASE_RANK` table)
  - `_detect_trial_status_changes` — `TRIAL_STATUS_CHANGE` signals within window
  - `_detect_regulatory_events` — approvals, label changes, safety signals
  - `_detect_new_publications` — publications tagged to tracked drugs
  - `_detect_corporate_events` — SEC filings, M&A, licensing deals, leadership changes
  - `_detect_patent_filings` — Lens / EPO patent rows within window
- Sorts by severity (`high` → `medium` → `low`).
- Returns `list[DetectedChange]` — each holds a reference to the existing Signal row.
Why a value object and not a DB table? Because the same underlying Signal can be a "change" multiple times (e.g., phase transition + regulatory event). A transient in-memory object keeps the dedup invariant simple.
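The shape of that value object, as a hedged sketch (field names are guesses from the prose; the real class lives in `detector.py`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedChange:
    change_type: str   # e.g. "phase_change", "regulatory_event"
    severity: str      # "high" | "medium" | "low"
    signal_id: str     # reference to the existing Signal row — never a new row
    description: str = ""


# The same Signal can back multiple distinct changes without touching
# the dedup key, because nothing here is persisted:
c1 = DetectedChange("phase_change", "high", "sig-123")
c2 = DetectedChange("regulatory_event", "high", "sig-123")
```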
§4.2 Classify (AgentOrchestrator)¶
Routes each DetectedChange to a DomainAgent based on signal.source, then each agent scores its own batch with a Haiku-class LLM using a domain-specific prompt suffix.
| Domain | Sources | Agent class |
|---|---|---|
| clinical | clinicaltrials | ClinicalAgent |
| regulatory | openfda | RegulatoryAgent |
| scientific | pubmed, conferences | ScientificAgent |
| biological | opentargets | BiologicalAgent |
| company | sec | CompanyAgent |
Scoring: 1 = noise, 10 = critical. Threshold: ≥ 5. After scoring, the orchestrator rebalances by source quota (configured in settings.min_signals_per_source) so e.g. PubMed always has ≥ 2 slots in the final cut, then caps at max_signals_for_synthesis (default 20).
Concurrent chunked dispatch. SignalClassifier splits each agent's batch into 50-entry chunks and dispatches them in parallel through a ThreadPoolExecutor (5 workers). One Haiku timeout no longer tanks the whole run — only the affected chunk falls back to neutral score 5, and the classifier only fails outright if every chunk fails. This was added when batch sizes started bumping into the SDK's non-streaming token ceiling.
Fallback: If the Haiku call throws for a whole agent, each agent drops to severity-ordering so the pipeline completes. See DomainAgent.classify.
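The threshold → quota → cap sequence can be sketched roughly like this (the helper name, tuple shape, and the two-pass quota handling are assumptions — the real logic lives in the orchestrator and reads `settings.min_signals_per_source` / `max_signals_for_synthesis`):

```python
def select_for_synthesis(scored, min_per_source=2, cap=20):
    """scored: list of (source, score, change) tuples, already Haiku-scored."""
    # Keep only scores at or above the noise threshold, best first.
    kept = sorted((s for s in scored if s[1] >= 5),
                  key=lambda s: s[1], reverse=True)
    final, per_source = [], {}
    # Pass 1: honor each source's quota off the top of its own ranking.
    for item in kept:
        if per_source.get(item[0], 0) < min_per_source:
            final.append(item)
            per_source[item[0]] = per_source.get(item[0], 0) + 1
    # Pass 2: fill remaining slots by score until the cap.
    for item in kept:
        if len(final) >= cap:
            break
        if item not in final:
            final.append(item)
    return final[:cap]
```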
§4.3 Enrich (enricher.py)¶
Pure Python, DB reads only. For each scored DetectedChange, produces an EnrichedChange with:
- `related_signals` — other signals on the same drug within the window
- `drug_profile` — the cached DrugProfile row
- `company_profile` — the cached CompanyProfile row (if any)
- `competitor_context` — other drugs in the same landscape at comparable phase
- `enrichment_sources` — names of the sources that contributed to this change
§4.4 Synthesize (synthesizer.py)¶
One streamed Sonnet call. Takes enriched changes + drug profiles + active KIQs + a KnownIds catalog → structured JSON with:
executive_summary
signal_analyses[] { signal_id, drug, headline, what_happened, why_it_matters,
cross_source_connections, confidence, severity }
strategic_implications
watchlist[]
predictions[]
kiq_answers[] { kiq_id, finding, evidence, uncertainty, implication, confidence }
metadata { signals_count, model, generated_at }
The prompt (see synthesizer.py:19) emphasizes: lead with what changed, connect across sources, be specific (drug names, NCT, dates), assess confidence, flag what to watch next.
ENTITY CATALOG. The pipeline pre-loads canonical drug, company, and target IDs from the DB (see §4.8) and injects them as a structured block in the system prompt. The synthesizer is told to reference entities only by these IDs, which keeps EntityChips on the frontend resolvable without fuzzy matching.
Streaming transport. Synthesis switched to streaming after intermittent pre-response timeouts on long contexts (issue #32). Streaming also keeps the --mock-llm / --capture-fixture dev-loop short-circuits cheap — fixtures replay token-by-token in test mode.
Schema-retry loop. The synthesizer wraps generation in a verification gate (see §4.8). If validate_kiq_answers rejects the output, the synthesizer retries up to MAX_SCHEMA_RETRIES (= 2) with the validator's error messages appended to the prompt. Entity-reference mismatches are logged but no longer drive retries — they were too aggressive a gate, since prose can legitimately mention an entity not in the catalog.
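The retry loop reduces to a few lines — here `generate` and `validate` stand in for the real synthesizer call and `validate_kiq_answers`; the tri-state return mirrors the `schema_valid` / `schema_errors` columns:

```python
MAX_SCHEMA_RETRIES = 2  # per the prose above


def synthesize_with_gate(generate, validate, prompt):
    """Retry generation with validator errors appended to the prompt."""
    errors: list = []
    output = None
    for _attempt in range(1 + MAX_SCHEMA_RETRIES):
        output = generate(prompt, errors)  # errors folded into the re-prompt
        errors = validate(output)
        if not errors:
            return output, True, []        # persisted with schema_valid=True
    return output, False, errors           # persisted with schema_valid=False
```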
Upsert to DB via briefings store — the primary key is landscape_id, so the latest briefing replaces the previous one. schema_valid and schema_errors are persisted alongside so the API can surface gate state.
§4.5 Query (query.py)¶
Ad-hoc Q&A endpoint. Given a question + landscape_id:
- Extract keywords (stopword-stripped).
- Pre-filter signals by `DrugSynonym` lookup + landscape target list (this is the vector-search seam — when we swap to embeddings, only this step changes).
- Score remaining signals by keyword overlap + recency decay.
- Stuff top-N signals into a Haiku prompt.
- Return `{answer, key_signals[], sources_used}`.
Also pulls the latest landscape Briefing into context so answers inherit recent synthesis.
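Step 3's score can be sketched as keyword overlap multiplied by an exponential recency decay — the half-life and the bare product are assumptions; the real formula lives in `query.py`:

```python
import math
from datetime import datetime, timezone


def score_signal(keywords, text, detected_at, half_life_days=30.0):
    """Keyword-overlap relevance, discounted by signal age."""
    words = set(text.lower().split())
    overlap = sum(1 for k in keywords if k.lower() in words)
    age_days = (datetime.now(timezone.utc) - detected_at).days
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return overlap * decay
```

A 90-day-old signal with the same keyword overlap scores one eighth of a fresh one under a 30-day half-life, which is the seam the embedding swap would replace.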
§4.6 Per-tab analyzers (ogur/engine/analyzers/)¶
For each drug, three parallel analyzers produce AssetDetail tab content — each uses Haiku, because this is structured extraction, not cross-drug synthesis:
| Analyzer | Output shape (schemas.py) |
|---|---|
| OverviewAnalyzer | OverviewOut — MoA, drug class, key differentiators, safety signals, data gaps |
| TrialsAnalyzer | TrialsOut — list of TrialRecord (NCT, phase, status, primary endpoint) |
| CompetitiveAnalyzer | CompetitiveOut — competitors, route_matrix, threat_register, white_space |
Results are cached as Briefing rows with composite landscape_id (see §2 data model).
§4.7 Evidence pipeline (evidence_pipeline.py)¶
A separate, CLI-only orchestrator that builds the durable structured-trial-outcomes layer. Run via scripts/run_evidence_pipeline.py; not invoked during a normal briefing.
For each landscape, it:
- Reads top-N drugs by signal volume from the DB.
- Pre-filters signals before any LLM call via the
_OUTCOME_BEARING_SIGNAL_TYPESfrozenset. Currently narrow: onlyPUBLICATIONandCONFERENCE_ABSTRACTpass through to the outcomes extractor. Press releases, licensing deals, patent filings, label changes, regulatory events, and trial-registration / status-change rows are all skipped here — they either yielded zero-outcome payloads or are handled by the CT.gov protocol-parser path below. The narrow allow-list dominated ~60% of immunology-001 spend before it was added; expanding it (to e.g.LABEL_CHANGE,FDA_APPROVAL) is intentional future work, not an oversight. The eligibility split (signals_processed/signals_skipped_ineligible) is surfaced in the per-drug pilot report so the filter doesn't silently swallow signal volume. - For ClinicalTrials.gov v2 signals, parses the protocol JSON via
parsers/ct_gov_v2.pyand upserts aProtocolProfile. - For abstract / label / conference text, calls
OutcomesExtractor(Haiku tool-use) and upsertsEvidenceRecordrows. - Idempotent: dedup is via
compute_evidence_content_hash(...).--force-replaceoverwrites existing rows for prompt-iteration runs. - Emits a per-drug + aggregate markdown report (confidence distribution, null-field rate, error breakdown) — written to
evals/pilot_immunology/report.mdby default.
Budget gates: --budget-usd halts the run if the live API spend crosses a threshold, and --dry-run calls the extractors without DB writes for cost estimation.
Why a separate pipeline? Evidence is expensive to build (Haiku per outcome) and rarely changes once written. Coupling it to the briefing pipeline would re-pay that cost every cycle. Evidence accumulates in its own tables and is queried on demand by /api/assets/{drug}/competitors/evidence and the matcher (which scores trial-protocol similarity for competitive-context callouts).
§4.8 KIQs + verification gate¶
KIQs (kiq.py, kiqs store) are the intent-capture surface for the briefing — see §2 data model. Loaded by the pipeline at the start of each run via list_kiqs(landscape_id, active_only=True).
Verification (verification.py) is two pure-Python validators that run between the synthesizer's response and persistence:
- `validate_kiq_answers(kiq_answers, expected_kiqs)` — checks every active KIQ has an answer with required keys (`kiq_id`, `finding`, `evidence`, `uncertainty`, `implication`, `confidence`), that confidence is in `{high, medium, low}`, and that `finding` / `evidence` clear minimum lengths. Drives retries in the synthesizer.
- `validate_entity_references(text_blocks, known_ids)` — ensures drug / company / target IDs in synthesizer prose resolve in the catalog and any NCT IDs match the regex. Warn-only post-issue-#32 — logs to `schema_errors` but doesn't retry.
KnownIds is a TypedDict prepared once by the pipeline (pipeline.py) using list_drugs / list_companies / list_targets from the stores, then handed to both the synthesizer (for ENTITY CATALOG injection) and the verification gate. Single source of truth ⇒ no drift between what the prompt advertises and what the validator accepts.
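An assumed shape for that shared catalog, plus the NCT-pattern check the entity validator performs (key names and the warn-only helper are guesses consistent with the prose; NCT IDs are "NCT" followed by 8 digits):

```python
import re
from typing import TypedDict


class KnownIds(TypedDict):
    drugs: list      # canonical generic names
    companies: list  # normalized company names
    targets: list    # HGNC symbols


NCT_RE = re.compile(r"NCT\d{8}")  # ClinicalTrials.gov registry-ID shape


def check_nct_ids(text: str) -> list:
    """Warn-only: return NCT-like tokens that don't match the pattern."""
    return [t for t in re.findall(r"NCT\w+", text) if not NCT_RE.fullmatch(t)]


known: KnownIds = {
    "drugs": ["pembrolizumab"],
    "companies": ["merck"],
    "targets": ["PDCD1"],
}
```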
§4.9 Entity & outcomes extractors (ogur/engine/extractor/)¶
Sub-package owned by the evidence pipeline. None of these run during a normal briefing.
| Module | Role |
|---|---|
| entity_extractor.py | Biomarker / mutation / drug-target / indication NER. Dual backend: GLiNER local model (F1 ≈ 0.88 on the BIOPSY-derived gold set) or a Claude fallback when ML deps aren't installed. |
| compound_span_postprocessor.py | Splits gene+mutation compounds (KRAS G12C-mutated → adds standalone KRAS biomarker span). Curated oncology gene list; never replaces original spans. |
| outcomes_extractor.py | Structures (endpoint, arm, value, unit, CI, HR, p, comparator) tuples from abstract / label / conference text via Haiku tool-use. Preserves raw_excerpt for UI provenance. Handles non-efficacy categories (safety, PK, PRO) and "not reached" time-to-event patterns. |
| parsers/ct_gov_v2.py | Pure-Python parser over ClinicalTrials.gov v2 JSON → ProtocolProfile fields. No LLM. |
| matcher.py | Compute-on-read similarity score between two trial protocols, with weighted field comparisons (endpoint > comparator > population > biomarker > histology > LoT > blinding). |
| eval.py, outcomes_eval.py | Eval harnesses. See evals.md. |
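The matcher's weighted comparison can be sketched as a weighted fraction of matching protocol fields — the weights below preserve the stated ordering (endpoint > comparator > population > biomarker > histology > LoT > blinding) but their exact values are assumptions:

```python
_WEIGHTS = {
    "primary_endpoint": 7, "comparator": 6, "population": 5,
    "biomarker": 4, "histology": 3, "line_of_therapy": 2, "blinding": 1,
}


def protocol_similarity(a: dict, b: dict) -> float:
    """0–1 score: weighted share of fields where both protocols agree."""
    total = sum(_WEIGHTS.values())
    hit = sum(w for f, w in _WEIGHTS.items()
              if a.get(f) and a.get(f) == b.get(f))
    return hit / total
```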
§5 API layer (ogur/api/)¶
Single FastAPI app constructed in app.py. create_tables() on startup creates the SQLite schema if missing. Four routers under /api: signals, briefings, query, and evidence (head-to-head card route — see below). See api-reference.md for every endpoint.
Background tasks. POST endpoints that trigger generation (briefing, drug briefing, per-tab analyzer) return 202 Accepted and enqueue the work via FastAPI BackgroundTasks — synchronous for now; APScheduler is Phase 3 full.
Comparative evidence (routes/evidence.py). GET /api/landscapes/{landscape_id}/evidence/comparative produces head-to-head cards from EvidenceRecord rows. The route is deterministic, no LLM — for each (drug, headline endpoint) pair it picks the highest-confidence row, then most-recent on tie. Drugs without paired evidence are omitted entirely; rows without a comparator_value are filtered out (single-arm rows would render misleading H2H cards). Endpoint-name fuzzing handles trial-arm variations ("EASI-75 at week 16" → "EASI-75"). Per-landscape config (headline endpoints, included drugs) is currently a v1 hardcoded dict in the route module — keyed by landscape_id. Move to DB-stored config when more than two landscapes need cards.
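The deterministic row-pick reduces to a filter plus a two-key `max` — row keys here are assumptions standing in for `EvidenceRecord` columns:

```python
_CONF_RANK = {"high": 2, "medium": 1, "low": 0}


def pick_h2h_row(rows):
    """Best row for one (drug, headline endpoint) pair, or None to omit."""
    # Single-arm rows (no comparator value) would render misleading cards.
    eligible = [r for r in rows if r.get("comparator_value") is not None]
    if not eligible:
        return None  # drug omitted from the card entirely
    # Highest confidence wins; most recent event_date breaks ties.
    return max(eligible,
               key=lambda r: (_CONF_RANK[r["confidence"]], r["event_date"]))
```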
§6 Frontend (frontend/)¶
Vite + React 18 + TypeScript. State via Zustand + TanStack Query. Four top-level views, one Inspector panel. See frontend.md and design/ux-spec.md.
The frontend talks to the backend via the Vite dev-server proxy (/api/* → http://localhost:8000). The universal Inspector panel is a 380 px right rail with a Zustand-driven content slot that context-switches on object type (drug, trial, signal, company).
§7 Model tiering & cost¶
| Stage | Model | Cost per run (approx) | Why this tier |
|---|---|---|---|
| Classifier + DomainAgents | claude-haiku-4-5-20251001 | ~$0.002 | Batch scoring; each signal scored once |
| Per-tab analyzers | Haiku | ~$0.01 per (drug, tab) | Structured extraction, not cross-drug reasoning |
| QueryEngine | Haiku | ~$0.001 per question | Interactive, needs to feel fast |
| Synthesizer | claude-sonnet-4-6 | ~$0.05–0.20 | Cross-source synthesis, long context |
| Holo3 visual extraction | holo3-122b-a10b | Varies by screenshot count | Vision over IR pages + congress portals |
Model IDs live in config.py — swapping any tier's model is a one-line change.
§8 Dedup invariants¶
The dedup contract is load-bearing. Three invariants:
1. Every Signal has a `content_hash` and it's UNIQUE at the DB level. Enforced by the SQLModel `Field(unique=True)` constraint. Violation ⇒ IntegrityError on insert, which `upsert_signal` catches and treats as a dedup hit.
2. `content_hash = sha256(source + source_id + signal_type)[:16]`. Deterministic. Two runs of the same source producing the same signal collapse. A signal that genuinely changes (phase transition, new abstract) produces a new hash because `source_id` or `signal_type` differs.
3. The engine never inserts Signal rows. It only reads them and wraps them in `DetectedChange` / `EnrichedChange`. If this ever changes, the invariant breaks and the dedup contract must be re-derived.
§9 Testing architecture¶
In-memory SQLite + StaticPool so every Session shares one connection — required because get_session() is called in multiple places (detector, enricher, query, stores) and each call gets its own Session.
Fixtures in tests/conftest.py:
| Fixture | Provides |
|---|---|
| db_engine | Fresh in-memory engine per test, all tables created |
| db_session | A Session bound to db_engine |
| mock_get_session | A context manager that wraps db_engine — injected via patch |
| patch_db | Patches get_session in every engine + store module |
| patch_sources_db | Patches get_session in source modules that call the DB |
| api_client | FastAPI TestClient with get_db overridden to use the test DB |
Golden rule: patch where the symbol is used (e.g. ogur.engine.detector.get_session), not where it's defined (ogur.store.database.get_session). Otherwise imports happen before the patch takes effect.
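A self-contained demonstration of why the golden rule holds, using two stand-in modules built at runtime (the `*_demo` names are illustrative — the real pair is `ogur.store.database` and `ogur.engine.detector`):

```python
import sys
import types
from unittest.mock import patch

# "store_demo" defines get_session; "detector_demo" imported it at import
# time, the way `from ogur.store.database import get_session` does.
store = types.ModuleType("store_demo")
store.get_session = lambda: "real-db"
sys.modules["store_demo"] = store

detector = types.ModuleType("detector_demo")
detector.get_session = store.get_session
sys.modules["detector_demo"] = detector

# Patching the definition site does NOT touch the detector's reference:
with patch("store_demo.get_session", lambda: "fake-db"):
    assert detector.get_session() == "real-db"

# Patching where the symbol is used does:
with patch("detector_demo.get_session", lambda: "fake-db"):
    assert detector.get_session() == "fake-db"
```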
See testing.md for usage patterns.
§10 What's deliberately not built¶
- No node-edge graph visualization in UI. The knowledge graph (Target ↔ Drug ↔ Signal) is infrastructure, used for retrieval + classification. Users see typed cards and EntityChips. See design/ux-spec.md §6.4.
- No COGS, KOL sentiment, or internal clinical data. Public sources only.
- No chatbot. Synthesis is structured JSON → rendered cards. QueryEngine answers ad-hoc questions but doesn't hold state.
- No scheduled ingestion yet. Cron-triggered seed + briefing is Phase 3 full.