Research
Research and prior art that directly informs Ogur's design. Every reference here is something we've read and can defend — not a bibliography.
Provenance-first design
"Every claim the product displays must carry a source chip and be one click from the raw source. No orphan text." — design/ux-spec.md §2
This principle descends from enterprise knowledge-graph practice (Palantir Foundry / Gotham) rather than from any single paper. The specific motivation: pharma CI analysts are liable for the decisions their briefings support. A briefing is a signed document, not a recommendation engine. Every claim needs a provenance edge.
Why it matters for the engine: we never discard raw_data on a Signal — if normalization logic changes, we re-parse. The DB schema is immutable-facts-plus-derived-views; synthesis is a view over signals, never the other way around.
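A minimal sketch of that invariant, with hypothetical field and function names (the real schema lives in the DB models): raw_data is an immutable fact, and normalization is a derived view we can recompute at any time.

```python
from dataclasses import dataclass

# Hypothetical sketch of immutable-facts-plus-derived-views.
# Field and function names here are illustrative assumptions, not the real schema.
@dataclass(frozen=True)
class Signal:
    source: str
    raw_data: str  # kept verbatim forever; never overwritten by normalization

def normalize_v1(s: Signal) -> str:
    # First-pass normalization: naive uppercase of the leading drug token.
    return s.raw_data.split(";")[0].upper()

def normalize_v2(s: Signal) -> str:
    # Normalization logic changed? No migration needed: re-parse the same raw_data.
    return s.raw_data.split(";")[0].strip().upper()

sig = Signal(source="ct_gov", raw_data=" mrtx0902 ; phase 1")
assert normalize_v2(sig) == "MRTX0902"
```

Because the raw fact is immutable, swapping `normalize_v1` for `normalize_v2` is a re-parse, not a data migration.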
Named-entity recognition on pharma pipeline data
BIOPSY (Kognitic, EMNLP 2025)
- Paper: Biomedical Information for Patients, Studies, and You — a benchmark dataset of annotated clinical trial descriptions for pharma-specific NER (drug, target, indication, biomarker, trial design).
- Relevance: the right benchmark for moving from regex-based normalization to ML-based extraction (e.g. pulling "HER2-low positive advanced breast cancer" as structured entities).
- Status — actively used. Phase B of the evidence-synthesis work shipped: entity_extractor.py is graded against a BIOPSY-derived 25-sample gold set, with both GLiNER (local, F1 ≈ 0.88) and Claude (fallback, F1 ≈ 0.91) backends benchmarked against the same gold. See evals.md — Entity extraction for the harness and architecture.md §4.9 for how the extractor slots into the evidence pipeline.
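As an illustration of the grading step, a set-based F1 over (text, label) entity pairs might look like the following; the function name and tuple format are assumptions, not the actual harness (which lives in evals.md — Entity extraction).

```python
# Hypothetical sketch of set-based F1 grading against a gold set.
# compute_f1 and the (text, label) tuple format are illustrative assumptions.
def compute_f1(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # exact-match true positives
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("HER2", "target"), ("breast cancer", "indication")}
pred = {("HER2", "target"), ("advanced breast cancer", "indication")}
# tp = 1, precision = 0.5, recall = 0.5 → F1 = 0.5
assert compute_f1(pred, gold) == 0.5
```

The same `compute_f1` runs against both backends' predictions, which is what makes the GLiNER-vs-Claude comparison apples-to-apples.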
Agent architecture
Ogur's domain-agent pattern (AgentOrchestrator routing to ClinicalAgent / RegulatoryAgent / etc.) draws from the coordinator/executor split observed in Claude Code's architecture.
The structural inspiration is documented in design/agentic-architecture-notes.md. Key patterns we adopted:
- Specialist agents with declared contracts. Each DomainAgent owns a frozenset of Signal.source values — explicit ownership, not implicit routing.
- Coordinator/executor split. The pipeline script is the coordinator; the seed scripts are the executor. They never mix responsibilities.
- Fallback by design. Every LLM call has a non-LLM fallback (severity-sort for classifier, keyword scoring for query). A flaky model never takes down the pipeline.
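The contract-plus-fallback pattern could be sketched like this; the class and method names are illustrative assumptions, not the actual DomainAgent API.

```python
from typing import Iterable

# Hedged sketch of declared contracts plus fallback-by-design.
# Names are illustrative assumptions, not the real agent classes.
class DomainAgent:
    sources: frozenset = frozenset()  # explicit ownership of Signal.source values

    def owns(self, source: str) -> bool:
        return source in self.sources

    def score(self, signals: Iterable) -> list:
        try:
            return self._score_llm(signals)
        except Exception:
            # Fallback by design: a flaky model never takes down the pipeline.
            return self._score_fallback(signals)

class ClinicalAgent(DomainAgent):
    sources = frozenset({"clinicaltrials_gov", "asco_abstracts"})  # assumed values

    def _score_llm(self, signals):
        raise RuntimeError("model down")  # simulate a flaky model

    def _score_fallback(self, signals):
        return sorted(signals)  # stand-in for severity-sort

agent = ClinicalAgent()
assert agent.owns("clinicaltrials_gov")
assert agent.score([3, 1, 2]) == [1, 2, 3]  # LLM failed, fallback answered
```

The frozenset makes routing a membership check the orchestrator can verify up front, rather than behavior discovered at call time.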
What we did NOT adopt (relevant because it's interesting and we considered it):
- KAIROS-style always-on agents. Scheduled ingestion is Phase 3 full — we're still batch-only.
- Swarm bus / inter-agent messaging. Our agents don't communicate — they hand back scored pairs to the orchestrator, which merges. We'd add a bus if we introduced agents that negotiate (e.g. a regulatory agent asking a clinical agent for trial status context). Not needed yet.
Model tiering
- Haiku for classification and extraction. Batched scoring, structured output, cost-sensitive. We use claude-haiku-4-5-20251001 (model ID in config.py).
- Sonnet for synthesis. Long-context cross-source reasoning. We use claude-sonnet-4-6.
This split descends directly from the observation that synthesis is a few-per-day operation where quality dominates cost, whereas classification is a many-per-run operation where cost dominates quality (and quality is bounded by the 1–10 integer output space anyway).
Paper-grounded justification: there is no single paper to cite, but the pattern recurs throughout the Anthropic model-selection documentation (https://docs.anthropic.com/en/docs/intelligence-classification) — use the cheapest model that clears the bar for the task.
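The tiering reduces to a small task-to-model table; the dict and function names below are assumptions for illustration — the real IDs live in config.py.

```python
# Hypothetical sketch of the model-tiering config; names are assumptions,
# model IDs are the ones stated in config.py.
MODEL_TIERS = {
    # many-per-run, cost-dominated, output bounded to a 1-10 integer space
    "classification": "claude-haiku-4-5-20251001",
    "extraction": "claude-haiku-4-5-20251001",
    # few-per-day, quality-dominated, long-context cross-source reasoning
    "synthesis": "claude-sonnet-4-6",
}

def model_for(task: str) -> str:
    # Central lookup so a tier change is a one-line config edit.
    return MODEL_TIERS[task]

assert model_for("synthesis") == "claude-sonnet-4-6"
```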
Competitive positioning vs. Citeline
Citeline (Informa Pharma Intelligence) is the incumbent pharma pipeline database. Its strength is exhaustive static snapshot — every trial, every drug, every company, curated by humans. Its weakness is the analyst workflow downstream of the snapshot: cross-source linking, temporal monitoring, synthesis, briefing generation.
Ogur's bet: we don't try to match Citeline's pipeline completeness. We build the workflow that Citeline forces analysts to do manually. The sale is:
"Add Ogur as one line item. Cut the consultancy retainer that currently does your weekly briefings."
Not:
"Rip out Citeline."
This framing originates from internal customer-development conversations (Q1 2026) with pharma CI leads who said they'd never champion removing Citeline internally — it's their safety net for pipeline coverage — but would happily add a tool that cuts the post-snapshot work.
See CLAUDE.md §Competitive positioning for the canonical statement.
Visual intelligence — why screenshots, not scrapers
Our Holo3-backed sources screenshot pages and send them to a vision model. Alternative approaches we considered and rejected:
| Approach | Why not |
|---|---|
| Per-site CSS selectors | ASCO, ESMO, AACR all change structure year-over-year. Selectors rot faster than we can maintain them. |
| Headless Chrome + readability-mode | Loses tables — which is exactly what we need from pipeline pages. |
| PDF parsing (pipeline PDFs) | 70% of company IR pipeline pages are HTML, not PDF. Need a uniform strategy. |
| GPT-4V / Claude Vision | Works, but Holo3 has zero-data-retention and pharma-specific fine-tuning. Pharma CI confidentiality is non-negotiable. |
For the per-source implementation (visual_base + holo_conference + holo_pipeline adapters), see data-sources.md — Holo3.
Retrieval
Current: keyword + synonym pre-filter
QueryEngine scores signals by:
1. Look up every drug mention in the question against DrugSynonym → canonical drug names.
2. Filter signals to those canonical drugs plus the landscape's target list.
3. Score remaining signals by keyword overlap + recency decay.
4. Top-N → Haiku.
The key design decision: structured pre-filters sit before scoring. When we swap keyword scoring for embedding similarity (Phase 3 full), steps 1–2 stay, only step 3 changes. This is the "vector-search seam" called out in architecture.md §4.5.
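Steps 1–4 can be sketched as follows. DRUG_SYNONYMS, score_keywords, and the signal dict shape are assumptions, not the real QueryEngine; the point is that step 3 is isolated behind a function boundary, so embedding similarity can later replace it without touching steps 1–2.

```python
import math
import time

# Illustrative sketch of the four QueryEngine steps; all names and the
# 30-day decay constant are assumptions, not the real implementation.
DRUG_SYNONYMS = {"mrtx0902": "MRTX0902", "adagrasib": "adagrasib"}  # synonym -> canonical

def canonical_drugs(question: str) -> set:
    # Step 1: map every drug mention in the question to a canonical name.
    return {DRUG_SYNONYMS[w] for w in question.lower().split() if w in DRUG_SYNONYMS}

def score_keywords(question: str, signal: dict, now: float) -> float:
    # Step 3: keyword overlap + recency decay — the "vector-search seam"
    # where embedding similarity would be swapped in later.
    overlap = len(set(question.lower().split()) & set(signal["text"].lower().split()))
    age_days = (now - signal["ts"]) / 86400
    return overlap * math.exp(-age_days / 30)

def top_n(question: str, signals: list, targets: set, n: int = 5) -> list:
    drugs = canonical_drugs(question)
    # Step 2: structured pre-filter runs before any scoring.
    pool = [s for s in signals if s["drug"] in drugs | targets]
    now = time.time()
    # Step 4: only the top-N survivors go to the model.
    return sorted(pool, key=lambda s: score_keywords(question, s, now), reverse=True)[:n]
```

Swapping in embeddings means replacing only `score_keywords`; `canonical_drugs` and the pre-filter in `top_n` are unchanged, which is exactly the seam architecture.md §4.5 describes.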
Considered: pure vector search
We didn't jump straight to embeddings because:
- Current signal volume (~1,500 per landscape) is small enough that keyword scoring is instant.
- Structured pre-filtering does more work than a cosine-similarity ranker would do. A question about "KRAS inhibitors" reliably surfaces adagrasib / sotorasib / MRTX0902 via DrugSynonym, whereas raw embedding similarity would also surface generic "RAS pathway" papers.
- Embeddings add infrastructure (Voyage AI, or a self-hosted model) and cost we don't yet need.
When we cross ~10k signals per landscape, this reasoning flips.
External references
Linked here because they're cited elsewhere in the docs and we don't want broken internal anchors.
| Topic | Link |
|---|---|
| uv package manager | https://docs.astral.sh/uv/ |
| SQLModel | https://sqlmodel.tiangolo.com/ |
| FastAPI | https://fastapi.tiangolo.com/ |
| TanStack Query | https://tanstack.com/query/latest |
| TanStack Virtual | https://tanstack.com/virtual/latest |
| shadcn/ui | https://ui.shadcn.com/ |
| Framer Motion | https://www.framer.com/motion/ |
| Zustand | https://github.com/pmndrs/zustand |
| Anthropic model IDs | https://docs.anthropic.com/en/docs/about-claude/models |
| Playwright (Python) | https://playwright.dev/python/ |