Testing¶
Unit tests live under tests/unit/ and mirror the ogur/ package layout. Shared fixtures live in tests/conftest.py.
tests/
    __init__.py
    conftest.py         Shared fixtures + factories
    unit/
        api/            /api/signals, /api/briefing, /api/ask, /health
        engine/         detector, classifier, enricher, synthesizer, query,
                        pipeline, orchestrator, analyzers,
                        evidence_pipeline, verification
        extractor/      entity_extractor, outcomes_extractor,
                        compound_span_postprocessor, ct_gov_v2_parser,
                        matcher, outcomes_eval
        sources/        One test per source + retry/hash base tests
        store/          signals, briefings, companies, targets,
                        drug_store, evidence_store, kiqs
        scripts/        run_evidence_pipeline_cli (script smoke-tests)
Current state: 785 passing + 1 skipped (test_gliner_extraction_quality, needs the GLiNER ML stack) + 1 deselected (integration marker). make test runs in ≈ 5 seconds because every test uses in-memory SQLite and mocks Anthropic.
Running tests¶
make test # full suite
make test-v # verbose
make test-fast # stop on first failure
make test-file F=tests/unit/engine/test_detector.py
Under the hood, make test runs uv run --extra dev python -m pytest tests/. Pytest auto-discovers conftest.py files at any directory level above the test files.
Integration tests¶
Any test marked @pytest.mark.integration makes live network calls and is excluded from make test. Run them explicitly:
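The exact command isn't reproduced here; a typical invocation (an assumption — check the Makefile or pytest config for the canonical one) selects by marker:

```shell
# Run only tests carrying the integration marker (live network calls)
uv run --extra dev python -m pytest tests/ -m integration
```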
These are for hand-verifying source contracts against real APIs. Don't rely on them in CI.
The DB fixture pattern¶
The tricky part of testing is that production code calls get_session() in several modules (detector, enricher, query, every store). For tests to share data between these calls, they must share a single connection — the default SQLAlchemy pool gives each Session its own connection to :memory:, which gives each Session its own empty database.
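The underlying SQLite behavior is easy to see with the stdlib sqlite3 module alone (no SQLAlchemy involved):

```python
import sqlite3

# Each connection to ":memory:" is a brand-new, private database.
a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
a.execute("CREATE TABLE t (x INTEGER)")
a.execute("INSERT INTO t VALUES (1)")
a.commit()

try:
    b.execute("SELECT x FROM t")
    shared = True
except sqlite3.OperationalError:  # "no such table: t"
    shared = False

assert shared is False  # b never sees a's table
```

This is exactly what happens when each SQLAlchemy Session checks out its own connection from the default pool.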
Solution: StaticPool on in-memory SQLite so every Session hits the same in-process DB.
# tests/conftest.py
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

engine = create_engine(
    "sqlite:///:memory:",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)
patch_db then replaces get_session in every module that imports it, pointing them all at this shared engine. See conftest.py:58.
Patch-where-used, not where-defined¶
The load-bearing rule. Python imports the name into the calling module at import time. If detector does from ogur.store.database import get_session, then detector.get_session and ogur.store.database.get_session are two different names bound to the same function. Patching the second one doesn't affect the first.
Correct: patch the name where it is used, i.e. target ogur.engine.detector.get_session.
Wrong — the test passes against the mock but prod code still hits the real DB: patching only where it is defined, ogur.store.database.get_session.
Every engine and store module that calls get_session is enumerated in the patch_db fixture so tests don't have to remember the list.
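The rule can be demonstrated end-to-end with toy modules standing in for ogur.store.database and the detector (module names here are stand-ins, not real ogur paths):

```python
import sys
import types
from unittest.mock import patch

# Stand-in for ogur.store.database
database = types.ModuleType("fake_database")
database.get_session = lambda: "real-session"
sys.modules["fake_database"] = database

# Stand-in for the detector: `from ... import get_session` copies the binding
detector = types.ModuleType("fake_detector")
detector.get_session = database.get_session
sys.modules["fake_detector"] = detector

def run_detector():
    return detector.get_session()  # name is looked up in detector's module

# Patching where it's DEFINED leaves the detector's binding untouched:
with patch("fake_database.get_session", lambda: "mock-session"):
    assert run_detector() == "real-session"

# Patching where it's USED is what the tests need:
with patch("fake_detector.get_session", lambda: "mock-session"):
    assert run_detector() == "mock-session"
```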
Fixtures¶
| Fixture | Provides |
|---|---|
| db_engine | Fresh in-memory engine with all tables |
| db_session | Session bound to db_engine |
| mock_get_session | A context manager that yields sessions on db_engine |
| patch_db | Patches get_session in every engine + store module |
| patch_sources_db | Patches get_session in source modules |
| api_client | FastAPI TestClient with get_db dependency-overridden |
Factories¶
Import from tests.conftest (the absolute path still works from subdirs):
- make_signal(drug_name=…, source=…, signal_type=…, severity=…, …) — default is a Merck pembrolizumab Phase 3 NSCLC trial-registered signal.
- make_drug_profile(normalized_name=…, phase=…, company=…, target=…, indication=…)
- make_briefing(landscape_id=…, executive_summary=…)
- make_landscape(id=…, name=…, indication=…, …)
- make_company_profile(normalized_name=…, display_name=…, …)
- make_target(normalized_name=…, display_name=…, …)
- make_drug_target(drug_name=…, target_name=…, …)
- make_kiq(landscape_id=…, question=…, time_horizon=…, priority=…, active=…, id=…) — KIQ row with sensible defaults; mirrors the others.
- make_anthropic_mock(response_text) — builds a MagicMock Anthropic client that returns response_text from messages.create.
All factories accept overrides — if you need a specific severity or source, pass it as a kwarg.
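The pattern is defaults-plus-overrides. A minimal sketch — the field defaults below beyond the documented pembrolizumab baseline are assumptions, and the real factories return ORM model instances rather than dicts:

```python
def make_signal(**overrides):
    # Baseline from the docs: a Merck pembrolizumab Phase 3 NSCLC
    # trial-registered signal (severity default here is assumed).
    defaults = {
        "drug_name": "pembrolizumab",
        "company": "Merck",
        "phase": "Phase 3",
        "indication": "NSCLC",
        "signal_type": "trial_registered",
        "severity": "medium",
    }
    return {**defaults, **overrides}

sig = make_signal(severity="high", source="pubmed")
assert sig["severity"] == "high"            # override applied
assert sig["drug_name"] == "pembrolizumab"  # untouched defaults kept
```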
API tests¶
Use the api_client fixture (TestClient with get_db overridden):
def test_list_signals(api_client, db_session):
    db_session.add(make_signal(drug_name="pembrolizumab"))
    db_session.commit()

    response = api_client.get("/api/signals?drug_name=pembrolizumab")
    assert response.status_code == 200
    assert len(response.json()) == 1
Source tests¶
Use pytest-httpx to intercept HTTP without patching. The httpx_mock fixture is auto-loaded:
@pytest.mark.asyncio
async def test_source_fetches_and_normalizes(httpx_mock):
    httpx_mock.add_response(
        url="https://api.example.com/search",
        json={"results": [...]},
    )
    signals = await MySource().fetch(make_landscape())
    assert signals[0].source == "mysource"
Sources that hit the DB directly (e.g. PubMed looking up DrugProfiles) need patch_sources_db in addition.
LLM-calling tests¶
Use make_anthropic_mock:
def test_classifier_scores_and_filters(patch_db, db_session):
    mock_client = make_anthropic_mock('{"scores":[8,3,6]}')
    with patch.object(SignalClassifier, "client", mock_client):
        classifier = SignalClassifier()
        result = classifier.classify([change_a, change_b, change_c])
        assert len(result) == 2  # scores 8 and 6 pass the ≥5 threshold
The rule: we do not test prompt wording. We test that the classifier correctly parses the response, applies the ≥5 threshold, falls back on API error, and rebalances by source quota. Prompt content is a product decision, not a test surface.
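What "parses the response and applies the threshold" amounts to can be sketched in isolation (the function name and shape here are hypothetical; the real logic lives in the classifier):

```python
import json

THRESHOLD = 5  # from the text: scores >= 5 survive

def filter_by_scores(changes, response_text):
    """Parse the (mocked) LLM JSON response and keep changes at or above THRESHOLD."""
    scores = json.loads(response_text)["scores"]
    return [c for c, s in zip(changes, scores) if s >= THRESHOLD]

kept = filter_by_scores(["change_a", "change_b", "change_c"], '{"scores":[8,3,6]}')
assert kept == ["change_a", "change_c"]  # 8 and 6 pass; 3 is dropped
```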
What's not covered¶
- No end-to-end tests (seed → briefing → API → frontend). The pipeline test in tests/unit/engine/test_pipeline.py runs the detect/classify/enrich/synthesize chain against an in-memory DB with mocked LLMs — that's the closest we have.
- No load tests. The API is currently single-tenant, and signal volume per landscape is ~1,500 rows.
- No frontend tests. The React codebase has no test harness. This is on the backlog.
Known warnings¶
The suite emits ~671 "DeprecationWarning: datetime.datetime.utcnow() is deprecated" warnings. datetime.utcnow() is scheduled for removal in a future Python version; migrating to datetime.now(timezone.utc) is a separate cleanup task. Not blocking.
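For reference, the eventual migration is mechanical:

```python
from datetime import datetime, timezone

# Old (deprecated, returns a naive datetime): datetime.utcnow()
# New (returns a timezone-aware datetime):
now = datetime.now(timezone.utc)
assert now.tzinfo is timezone.utc
```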