Data sources¶
Every public source Ogur integrates with. Each entry has: what it is, auth, rate limit, signal types it produces, the file that implements it, and a link to the upstream documentation.
Provenance principle: every Signal stored in the DB records its source verbatim — so a row's provenance is one join away. This is non-negotiable (see design/ux-spec.md §2).
Catalogue¶
| Source | Backend impl | Upstream docs | Auth |
|---|---|---|---|
| ClinicalTrials.gov v2 | clinicaltrials.py | https://clinicaltrials.gov/data-api/api | None |
| PubMed (NCBI Entrez) | pubmed.py | https://www.ncbi.nlm.nih.gov/books/NBK25501/ | Optional API key |
| OpenFDA | openfda.py | https://open.fda.gov/apis/ | None |
| Open Targets Platform | opentargets.py | https://platform-docs.opentargets.org/ | None |
| Europe PMC | conferences.py | https://europepmc.org/RestfulWebService | None |
| OpenAlex | openalex.py | https://docs.openalex.org/ | Optional API key |
| SEC EDGAR | sec.py | https://www.sec.gov/search-filings/edgar-application-programming-interfaces | User-Agent required |
| Lens.org | patents.py | https://docs.api.lens.org/ | API key |
| EPO OPS | patents.py | https://developers.epo.org | Consumer key + secret |
| H Company Holo3 | visual_base.py, holo_conference.py, holo_pipeline.py | https://hcompany.ai | API key |
All sources share the Source base class — common retry policy (tenacity, 3 attempts, 2→10 s backoff, don't retry 4xx), shared httpx.AsyncClient, and the compute_hash(source, source_id, signal_type) helper that produces the 16-hex dedup key.
ClinicalTrials.gov¶
- Docs: https://clinicaltrials.gov/data-api/api
- Auth: None
- Rate limit: 10 req/s (configured in config.py)
- Signal types produced:
trial_registered,trial_status_change(the source emits exactly these two; phase-transition / amendment / enrollment changes surface as freshtrial_status_changerows when CT.gov updates the study, not as their own types) - Normalization: intervention names are checked against
DrugSynonym— a Keytruda mention becomespembrolizumabso downstream joins work. - Query strategy: iterate
landscape.conditions(JSON array), call/studies?query.cond=…, parse status changes from the study history.
Notes¶
- The v2 API replaced v1 in mid-2024; we target v2 exclusively.
- The API returns full study records — we extract what we need and park the rest in
Signal.raw_dataso re-parsing is cheap when normalization logic evolves.
PubMed¶
- Docs: https://www.ncbi.nlm.nih.gov/books/NBK25501/ (NCBI Entrez Programming Utilities)
- Client: BioPython
Bio.Entrez— NCBI requires a contact email, set toogur@clinicalsim.ai(pubmed.py:27) - Auth: optional —
PUBMED_API_KEYin.envraises rate limit from 3 req/s to 10 req/s - Signal types produced:
publication - Query strategy (pubmed.py:44):
- For each tracked
DrugProfile(capped at 30 per run to respect rate limits), search title/abstract for"{drug}"AND"{indication}", 90-day lookback. - If no profiles exist yet (cold-start), fall back to indication-only search restricted to
Clinical Trial[pt].
OpenFDA¶
- Docs: https://open.fda.gov/apis/
- Auth: None (API key optional — higher rate limit)
- Rate limit: 4 req/s
- Signal types produced:
fda_approval,label_change,safety_signal - Query strategy: search
drugsfdaby indication text (not the full condition list — mixing drug names and targets produces noise), then search drug labels by the same indication. Searching by target/MoA is unreliable becausepharm_class_epcvalues are verbose descriptions.
Endpoints used¶
/drug/drugsfda.json— approval events/drug/label.json— label changes/drug/event.json— adverse-event signals (FAERS)
Open Targets¶
- Docs: https://platform-docs.opentargets.org/
- GraphQL endpoint: https://api.platform.opentargets.org/api/v4/graphql
- Auth: None
- Signal types produced:
pipeline_update,phase_transition(frommaxClinicalStagechanges) - Key role: seeds
DrugProfile+DrugSynonymtables that every other source uses for normalization. Also providesuniprot_idfor Target nodes. - Retry: 6-attempt exponential backoff 5–60 s specifically to recover from OpenTargets' common 503 responses.
- Disease ID: configured per landscape — NSCLC uses
EFO_0003060. Look up others at https://www.ebi.ac.uk/ols/ontologies/efo.
This source replaced the proprietary Gosset API (see GOSSET_API_KEY in .env.example, kept for legacy reasons but unused).
Europe PMC¶
- Docs: https://europepmc.org/RestfulWebService
- Auth: None
- Signal types produced:
conference_abstract - Query strategy: search with
STYPE:"conference abstract"filter. Europe PMC aggregates from PubMed + EuropePMC-specific sources, including ASCO/ESMO/AACR abstract books.
Caveat¶
Europe PMC's conference-abstract ingestion lags 2+ weeks after the event. For real-time conference coverage during ASCO/ESMO season, see the Holo3 conference scraper — it screenshots the live portal.
OpenAlex¶
- Docs: https://docs.openalex.org/
- Auth: optional —
OPENALEX_API_KEYrecommended; otherwise uses polite-pool email (openalex_emailin config.py) - Why OpenAlex alongside PubMed? OpenAlex indexes 250M+ works spanning PubMed, CrossRef, MAG, and DOAJ — giving broader systematic literature coverage than any single database.
- Signal types produced:
publication,conference_abstract - Two-pass strategy (openalex.py):
- Drug-aware: for each
DrugProfile(capped at 30), search title+abstract for drug + indication. - Conference pass: filter
type:conference-paperby indication to surface posters (ASCO, ESMO, AACR, AAD).
SEC EDGAR¶
- Endpoint: EDGAR Full-Text Search (EFTS) —
https://efts.sec.gov/LATEST/search-index - Docs: https://www.sec.gov/search-filings/edgar-application-programming-interfaces
- Auth: None, but a descriptive
User-Agentheader is mandatory. Requests without one receive HTTP 403. - Rate limit: polite default 8 req/s per SEC guidelines (https://www.sec.gov/os/webmaster-faq#developers)
- Signal types produced:
ma_announcement,licensing_deal,investment_round,leadership_change,press_release - Query strategy (sec.py):
- EFTS search for 8-K filings mentioning landscape indication or tracked companies.
- For each hit, fetch the primary exhibit (EX-99.1 / EX-99.2) from the EDGAR archive.
- Regex-classify exhibit text into one of the five signal types.
Lens.org¶
- Docs: https://docs.api.lens.org/
- Sign-up / key: https://www.lens.org/lens/user/subscriptions — free 14-day trial, then paid
- Auth:
LENS_API_KEYin.env - Signal types produced:
patent_filing - Why Lens: best global coverage including direct CNIPA (China) indexing, which EPO OPS only catches via PCT equivalents. Preferred backend when
LENS_API_KEYis present. - Source field in DB:
"lens"(so the UI's source chip correctly identifies provenance).
EPO OPS¶
- Docs: https://developers.epo.org
- Sign-up: instant, free — https://developers.epo.org/apis/ops
- Auth:
EPO_CONSUMER_KEY+EPO_CONSUMER_SECRET(OAuth 2 client credentials → bearer token, 20-min TTL) - Signal types produced:
patent_filing - Coverage: EP natively + worldwide via INPADOC patent families. Good CN coverage for major players via PCT equivalents.
- Role: fallback when
LENS_API_KEYisn't configured.Sourcefield is"epo".
See patents.py for the backend-selection logic (Lens first, EPO second, skip gracefully with a warning if neither key is set).
Holo3¶
- Provider: H Company (https://hcompany.ai)
- Docs: OpenAI-compatible API at
https://api.hcompany.ai/v1— model switching only requires changing the model ID - Model used:
holo3-122b-a10b(configurable viaHOLO3_MODEL) - Auth:
HOLO3_API_KEYin.env - Privacy: zero data retention by default — important for pharma CI confidentiality
- Requires:
uv sync --extra visual && uv run playwright install chromium
Why it exists¶
Some data lives on pages that have no API and change structure between years: ASCO's abstract portal, company IR pipeline pages (PDF or dynamic HTML), WHO INN proposed lists. Rather than writing per-site CSS selectors that rot, we screenshot and let Holo3's vision model extract structured data.
Visual sources¶
| File | Purpose | Signal types |
|---|---|---|
| visual_base.py | Base class: Playwright screenshot + Holo3 extraction helpers (_screenshot, _extract_visual, _extract_pdf) |
(base only) |
| holo_conference.py | Live scraping of ASCO / ESMO / AACR abstract portals — real-time coverage without the Europe PMC lag | conference_abstract, kol_activity |
| holo_pipeline.py | Screenshot company IR pipeline pages, extract rows into structured pipeline data | pipeline_update, early_pipeline |
For the original design rationale (alternatives considered, prompt strategy, eval criteria), see research.md — Visual intelligence.
Failure semantics¶
Source adapters are allowed to fail partially. In scripts/seed_nsclc.py:
try:
signals = await source.fetch(landscape)
except Exception as exc:
logger.warning("Source %s failed: %s", source.name, exc)
signals = []
This means one flaky API never takes down a whole seed run. The trade-off: you must check per-source counts (make inspect) after every seed, because a silent-zero from a flaky source looks identical to a genuine empty result.
Adding a new source¶
See CONTRIBUTING — "Adding a new data source".
Short version:
1. Create ogur/sources/<name>.py, subclass Source, implement fetch().
2. Add a test file tests/unit/sources/test_<name>.py.
3. Register in the seed script.
4. If the source has its own domain (not clinical/regulatory/scientific/biological/company), add a DomainAgent subclass in ogur/engine/agents/.
5. Document it in this file with auth, rate limit, and upstream link.