Data sources¶

Every public source Ogur integrates with. Each entry has: what it is, auth, rate limit, signal types it produces, the file that implements it, and a link to the upstream documentation.

Provenance principle: every Signal stored in the DB records its source verbatim — so a row's provenance is one join away. This is non-negotiable (see design/ux-spec.md §2).

Catalogue¶

Source	Backend impl	Upstream docs	Auth
ClinicalTrials.gov v2	clinicaltrials.py	https://clinicaltrials.gov/data-api/api	None
PubMed (NCBI Entrez)	pubmed.py	https://www.ncbi.nlm.nih.gov/books/NBK25501/	Optional API key
OpenFDA	openfda.py	https://open.fda.gov/apis/	None
Open Targets Platform	opentargets.py	https://platform-docs.opentargets.org/	None
Europe PMC	conferences.py	https://europepmc.org/RestfulWebService	None
OpenAlex	openalex.py	https://docs.openalex.org/	Optional API key
SEC EDGAR	sec.py	https://www.sec.gov/search-filings/edgar-application-programming-interfaces	`User-Agent` required
Lens.org	patents.py	https://docs.api.lens.org/	API key
EPO OPS	patents.py	https://developers.epo.org	Consumer key + secret
H Company Holo3	visual_base.py, holo_conference.py, holo_pipeline.py	https://hcompany.ai	API key

All sources share the Source base class — common retry policy (tenacity, 3 attempts, 2→10 s backoff, don't retry 4xx), shared httpx.AsyncClient, and the compute_hash(source, source_id, signal_type) helper that produces the 16-hex dedup key.

ClinicalTrials.gov¶

Docs: https://clinicaltrials.gov/data-api/api
Auth: None
Rate limit: 10 req/s (configured in config.py)
Signal types produced: trial_registered, trial_status_change (the source emits exactly these two; phase-transition / amendment / enrollment changes surface as fresh trial_status_change rows when CT.gov updates the study, not as their own types)
Normalization: intervention names are checked against DrugSynonym — a Keytruda mention becomes pembrolizumab so downstream joins work.
Query strategy: iterate landscape.conditions (JSON array), call /studies?query.cond=…, parse status changes from the study history.

Notes¶

The v2 API replaced v1 in mid-2024; we target v2 exclusively.
The API returns full study records — we extract what we need and park the rest in Signal.raw_data so re-parsing is cheap when normalization logic evolves.

PubMed¶

Docs: https://www.ncbi.nlm.nih.gov/books/NBK25501/ (NCBI Entrez Programming Utilities)
Client: BioPython Bio.Entrez — NCBI requires a contact email, set to ogur@clinicalsim.ai (pubmed.py:27)
Auth: optional — PUBMED_API_KEY in .env raises rate limit from 3 req/s to 10 req/s
Signal types produced: publication
Query strategy (pubmed.py:44):
For each tracked DrugProfile (capped at 30 per run to respect rate limits), search title/abstract for "{drug}" AND "{indication}", 90-day lookback.
If no profiles exist yet (cold-start), fall back to indication-only search restricted to Clinical Trial[pt].

OpenFDA¶

Docs: https://open.fda.gov/apis/
Auth: None (API key optional — higher rate limit)
Rate limit: 4 req/s
Signal types produced: fda_approval, label_change, safety_signal
Query strategy: search drugsfda by indication text (not the full condition list — mixing drug names and targets produces noise), then search drug labels by the same indication. Searching by target/MoA is unreliable because pharm_class_epc values are verbose descriptions.

Endpoints used¶

/drug/drugsfda.json — approval events
/drug/label.json — label changes
/drug/event.json — adverse-event signals (FAERS)

Open Targets¶

Docs: https://platform-docs.opentargets.org/
GraphQL endpoint: https://api.platform.opentargets.org/api/v4/graphql
Auth: None
Signal types produced: pipeline_update, phase_transition (from maxClinicalStage changes)
Key role: seeds DrugProfile + DrugSynonym tables that every other source uses for normalization. Also provides uniprot_id for Target nodes.
Retry: 6-attempt exponential backoff 5–60 s specifically to recover from OpenTargets' common 503 responses.
Disease ID: configured per landscape — NSCLC uses EFO_0003060. Look up others at https://www.ebi.ac.uk/ols/ontologies/efo.

This source replaced the proprietary Gosset API (see GOSSET_API_KEY in .env.example, kept for legacy reasons but unused).

Europe PMC¶

Docs: https://europepmc.org/RestfulWebService
Auth: None
Signal types produced: conference_abstract
Query strategy: search with STYPE:"conference abstract" filter. Europe PMC aggregates from PubMed + EuropePMC-specific sources, including ASCO/ESMO/AACR abstract books.

Caveat¶

Europe PMC's conference-abstract ingestion lags 2+ weeks after the event. For real-time conference coverage during ASCO/ESMO season, see the Holo3 conference scraper — it screenshots the live portal.

OpenAlex¶

Docs: https://docs.openalex.org/
Auth: optional — OPENALEX_API_KEY recommended; otherwise uses polite-pool email (openalex_email in config.py)
Why OpenAlex alongside PubMed? OpenAlex indexes 250M+ works spanning PubMed, CrossRef, MAG, and DOAJ — giving broader systematic literature coverage than any single database.
Signal types produced: publication, conference_abstract
Two-pass strategy (openalex.py):
Drug-aware: for each DrugProfile (capped at 30), search title+abstract for drug + indication.
Conference pass: filter type:conference-paper by indication to surface posters (ASCO, ESMO, AACR, AAD).

SEC EDGAR¶

Endpoint: EDGAR Full-Text Search (EFTS) — https://efts.sec.gov/LATEST/search-index
Docs: https://www.sec.gov/search-filings/edgar-application-programming-interfaces
Auth: None, but a descriptive User-Agent header is mandatory. Requests without one receive HTTP 403.
Rate limit: polite default 8 req/s per SEC guidelines (https://www.sec.gov/os/webmaster-faq#developers)
Signal types produced: ma_announcement, licensing_deal, investment_round, leadership_change, press_release
Query strategy (sec.py):
EFTS search for 8-K filings mentioning landscape indication or tracked companies.
For each hit, fetch the primary exhibit (EX-99.1 / EX-99.2) from the EDGAR archive.
Regex-classify exhibit text into one of the five signal types.

Lens.org¶

Docs: https://docs.api.lens.org/
Sign-up / key: https://www.lens.org/lens/user/subscriptions — free 14-day trial, then paid
Auth: LENS_API_KEY in .env
Signal types produced: patent_filing
Why Lens: best global coverage including direct CNIPA (China) indexing, which EPO OPS only catches via PCT equivalents. Preferred backend when LENS_API_KEY is present.
Source field in DB: "lens" (so the UI's source chip correctly identifies provenance).

EPO OPS¶

Docs: https://developers.epo.org
Sign-up: instant, free — https://developers.epo.org/apis/ops
Auth: EPO_CONSUMER_KEY + EPO_CONSUMER_SECRET (OAuth 2 client credentials → bearer token, 20-min TTL)
Signal types produced: patent_filing
Coverage: EP natively + worldwide via INPADOC patent families. Good CN coverage for major players via PCT equivalents.
Role: fallback when LENS_API_KEY isn't configured. Source field is "epo".

See patents.py for the backend-selection logic (Lens first, EPO second, skip gracefully with a warning if neither key is set).

Holo3¶

Provider: H Company (https://hcompany.ai)
Docs: OpenAI-compatible API at https://api.hcompany.ai/v1 — model switching only requires changing the model ID
Model used: holo3-122b-a10b (configurable via HOLO3_MODEL)
Auth: HOLO3_API_KEY in .env
Privacy: zero data retention by default — important for pharma CI confidentiality
Requires: uv sync --extra visual && uv run playwright install chromium

Why it exists¶

Some data lives on pages that have no API and change structure between years: ASCO's abstract portal, company IR pipeline pages (PDF or dynamic HTML), WHO INN proposed lists. Rather than writing per-site CSS selectors that rot, we screenshot and let Holo3's vision model extract structured data.

Visual sources¶

File	Purpose	Signal types
visual_base.py	Base class: Playwright screenshot + Holo3 extraction helpers (`_screenshot`, `_extract_visual`, `_extract_pdf`)	(base only)
holo_conference.py	Live scraping of ASCO / ESMO / AACR abstract portals — real-time coverage without the Europe PMC lag	`conference_abstract`, `kol_activity`
holo_pipeline.py	Screenshot company IR pipeline pages, extract rows into structured pipeline data	`pipeline_update`, `early_pipeline`

For the original design rationale (alternatives considered, prompt strategy, eval criteria), see research.md — Visual intelligence.

Failure semantics¶

Source adapters are allowed to fail partially. In scripts/seed_nsclc.py:

try:
    signals = await source.fetch(landscape)
except Exception as exc:
    logger.warning("Source %s failed: %s", source.name, exc)
    signals = []

This means one flaky API never takes down a whole seed run. The trade-off: you must check per-source counts (make inspect) after every seed, because a silent-zero from a flaky source looks identical to a genuine empty result.

Adding a new source¶

See CONTRIBUTING — "Adding a new data source".

Short version: 1. Create ogur/sources/<name>.py, subclass Source, implement fetch(). 2. Add a test file tests/unit/sources/test_<name>.py. 3. Register in the seed script. 4. If the source has its own domain (not clinical/regulatory/scientific/biological/company), add a DomainAgent subclass in ogur/engine/agents/. 5. Document it in this file with auth, rate limit, and upstream link.