Skip to content

Data sources

Every public source Ogur integrates with. Each entry has: what it is, auth, rate limit, signal types it produces, the file that implements it, and a link to the upstream documentation.

Provenance principle: every Signal stored in the DB records its source verbatim — so a row's provenance is one join away. This is non-negotiable (see design/ux-spec.md §2).


Catalogue

Source Backend impl Upstream docs Auth
ClinicalTrials.gov v2 clinicaltrials.py https://clinicaltrials.gov/data-api/api None
PubMed (NCBI Entrez) pubmed.py https://www.ncbi.nlm.nih.gov/books/NBK25501/ Optional API key
OpenFDA openfda.py https://open.fda.gov/apis/ None
Open Targets Platform opentargets.py https://platform-docs.opentargets.org/ None
Europe PMC conferences.py https://europepmc.org/RestfulWebService None
OpenAlex openalex.py https://docs.openalex.org/ Optional API key
SEC EDGAR sec.py https://www.sec.gov/search-filings/edgar-application-programming-interfaces User-Agent required
Lens.org patents.py https://docs.api.lens.org/ API key
EPO OPS patents.py https://developers.epo.org Consumer key + secret
H Company Holo3 visual_base.py, holo_conference.py, holo_pipeline.py https://hcompany.ai API key

All sources share the Source base class — common retry policy (tenacity, 3 attempts, 2→10 s backoff, don't retry 4xx), shared httpx.AsyncClient, and the compute_hash(source, source_id, signal_type) helper that produces the 16-hex dedup key.


ClinicalTrials.gov

  • Docs: https://clinicaltrials.gov/data-api/api
  • Auth: None
  • Rate limit: 10 req/s (configured in config.py)
  • Signal types produced: trial_registered, trial_status_change (the source emits exactly these two; phase-transition / amendment / enrollment changes surface as fresh trial_status_change rows when CT.gov updates the study, not as their own types)
  • Normalization: intervention names are checked against DrugSynonym — a Keytruda mention becomes pembrolizumab so downstream joins work.
  • Query strategy: iterate landscape.conditions (JSON array), call /studies?query.cond=…, parse status changes from the study history.

Notes

  • The v2 API replaced v1 in mid-2024; we target v2 exclusively.
  • The API returns full study records — we extract what we need and park the rest in Signal.raw_data so re-parsing is cheap when normalization logic evolves.

PubMed

  • Docs: https://www.ncbi.nlm.nih.gov/books/NBK25501/ (NCBI Entrez Programming Utilities)
  • Client: BioPython Bio.Entrez — NCBI requires a contact email, set to ogur@clinicalsim.ai (pubmed.py:27)
  • Auth: optional — PUBMED_API_KEY in .env raises rate limit from 3 req/s to 10 req/s
  • Signal types produced: publication
  • Query strategy (pubmed.py:44):
  • For each tracked DrugProfile (capped at 30 per run to respect rate limits), search title/abstract for "{drug}" AND "{indication}", 90-day lookback.
  • If no profiles exist yet (cold-start), fall back to indication-only search restricted to Clinical Trial[pt].

OpenFDA

  • Docs: https://open.fda.gov/apis/
  • Auth: None (API key optional — higher rate limit)
  • Rate limit: 4 req/s
  • Signal types produced: fda_approval, label_change, safety_signal
  • Query strategy: search drugsfda by indication text (not the full condition list — mixing drug names and targets produces noise), then search drug labels by the same indication. Searching by target/MoA is unreliable because pharm_class_epc values are verbose descriptions.

Endpoints used

  • /drug/drugsfda.json — approval events
  • /drug/label.json — label changes
  • /drug/event.json — adverse-event signals (FAERS)

Open Targets

  • Docs: https://platform-docs.opentargets.org/
  • GraphQL endpoint: https://api.platform.opentargets.org/api/v4/graphql
  • Auth: None
  • Signal types produced: pipeline_update, phase_transition (from maxClinicalStage changes)
  • Key role: seeds DrugProfile + DrugSynonym tables that every other source uses for normalization. Also provides uniprot_id for Target nodes.
  • Retry: 6-attempt exponential backoff 5–60 s specifically to recover from OpenTargets' common 503 responses.
  • Disease ID: configured per landscape — NSCLC uses EFO_0003060. Look up others at https://www.ebi.ac.uk/ols/ontologies/efo.

This source replaced the proprietary Gosset API (see GOSSET_API_KEY in .env.example, kept for legacy reasons but unused).


Europe PMC

  • Docs: https://europepmc.org/RestfulWebService
  • Auth: None
  • Signal types produced: conference_abstract
  • Query strategy: search with STYPE:"conference abstract" filter. Europe PMC aggregates from PubMed + EuropePMC-specific sources, including ASCO/ESMO/AACR abstract books.

Caveat

Europe PMC's conference-abstract ingestion lags 2+ weeks after the event. For real-time conference coverage during ASCO/ESMO season, see the Holo3 conference scraper — it screenshots the live portal.


OpenAlex

  • Docs: https://docs.openalex.org/
  • Auth: optional — OPENALEX_API_KEY recommended; otherwise uses polite-pool email (openalex_email in config.py)
  • Why OpenAlex alongside PubMed? OpenAlex indexes 250M+ works spanning PubMed, CrossRef, MAG, and DOAJ — giving broader systematic literature coverage than any single database.
  • Signal types produced: publication, conference_abstract
  • Two-pass strategy (openalex.py):
  • Drug-aware: for each DrugProfile (capped at 30), search title+abstract for drug + indication.
  • Conference pass: filter type:conference-paper by indication to surface posters (ASCO, ESMO, AACR, AAD).

SEC EDGAR

  • Endpoint: EDGAR Full-Text Search (EFTS) — https://efts.sec.gov/LATEST/search-index
  • Docs: https://www.sec.gov/search-filings/edgar-application-programming-interfaces
  • Auth: None, but a descriptive User-Agent header is mandatory. Requests without one receive HTTP 403.
  • Rate limit: polite default 8 req/s per SEC guidelines (https://www.sec.gov/os/webmaster-faq#developers)
  • Signal types produced: ma_announcement, licensing_deal, investment_round, leadership_change, press_release
  • Query strategy (sec.py):
  • EFTS search for 8-K filings mentioning landscape indication or tracked companies.
  • For each hit, fetch the primary exhibit (EX-99.1 / EX-99.2) from the EDGAR archive.
  • Regex-classify exhibit text into one of the five signal types.

Lens.org

  • Docs: https://docs.api.lens.org/
  • Sign-up / key: https://www.lens.org/lens/user/subscriptions — free 14-day trial, then paid
  • Auth: LENS_API_KEY in .env
  • Signal types produced: patent_filing
  • Why Lens: best global coverage including direct CNIPA (China) indexing, which EPO OPS only catches via PCT equivalents. Preferred backend when LENS_API_KEY is present.
  • Source field in DB: "lens" (so the UI's source chip correctly identifies provenance).

EPO OPS

  • Docs: https://developers.epo.org
  • Sign-up: instant, free — https://developers.epo.org/apis/ops
  • Auth: EPO_CONSUMER_KEY + EPO_CONSUMER_SECRET (OAuth 2 client credentials → bearer token, 20-min TTL)
  • Signal types produced: patent_filing
  • Coverage: EP natively + worldwide via INPADOC patent families. Good CN coverage for major players via PCT equivalents.
  • Role: fallback when LENS_API_KEY isn't configured. Source field is "epo".

See patents.py for the backend-selection logic (Lens first, EPO second, skip gracefully with a warning if neither key is set).


Holo3

  • Provider: H Company (https://hcompany.ai)
  • Docs: OpenAI-compatible API at https://api.hcompany.ai/v1 — model switching only requires changing the model ID
  • Model used: holo3-122b-a10b (configurable via HOLO3_MODEL)
  • Auth: HOLO3_API_KEY in .env
  • Privacy: zero data retention by default — important for pharma CI confidentiality
  • Requires: uv sync --extra visual && uv run playwright install chromium

Why it exists

Some data lives on pages that have no API and change structure between years: ASCO's abstract portal, company IR pipeline pages (PDF or dynamic HTML), WHO INN proposed lists. Rather than writing per-site CSS selectors that rot, we screenshot and let Holo3's vision model extract structured data.

Visual sources

File Purpose Signal types
visual_base.py Base class: Playwright screenshot + Holo3 extraction helpers (_screenshot, _extract_visual, _extract_pdf) (base only)
holo_conference.py Live scraping of ASCO / ESMO / AACR abstract portals — real-time coverage without the Europe PMC lag conference_abstract, kol_activity
holo_pipeline.py Screenshot company IR pipeline pages, extract rows into structured pipeline data pipeline_update, early_pipeline

For the original design rationale (alternatives considered, prompt strategy, eval criteria), see research.md — Visual intelligence.


Failure semantics

Source adapters are allowed to fail partially. In scripts/seed_nsclc.py:

try:
    signals = await source.fetch(landscape)
except Exception as exc:
    logger.warning("Source %s failed: %s", source.name, exc)
    signals = []

This means one flaky API never takes down a whole seed run. The trade-off: you must check per-source counts (make inspect) after every seed, because a silent-zero from a flaky source looks identical to a genuine empty result.


Adding a new source

See CONTRIBUTING — "Adding a new data source".

Short version: 1. Create ogur/sources/<name>.py, subclass Source, implement fetch(). 2. Add a test file tests/unit/sources/test_<name>.py. 3. Register in the seed script. 4. If the source has its own domain (not clinical/regulatory/scientific/biological/company), add a DomainAgent subclass in ogur/engine/agents/. 5. Document it in this file with auth, rate limit, and upstream link.