Life sciences and RWE at scale

Build repeatable cohort-to-feature pipelines from FHIR and OMOP on Apache Spark

Pontegra helps life sciences and real-world evidence teams turn longitudinal clinical data into analysis-ready datasets that are fast to build, repeatable to run, and scalable to production volumes.

Powered by: Studyfyr – The Governed Clinical Research Toolkit and Platform on FHIR and OMOP.

What you can achieve

Accelerate time-to-dataset

Move from data access to analysis-ready tables without one-off flattening scripts.

Make studies reproducible

Define cohort logic, sampling rules, and feature definitions as rerunnable pipelines.

Scale to real-world volumes

Run consistent logic across millions of patients and large event histories with Spark-native execution.

Support iterative study development

Start with a scoped sample for early validation, then scale to full runs.

Why Studyfyr for RWE processing

Most RWE teams lose time on:

  • normalizing longitudinal events,
  • aligning timelines to index dates,
  • generating covariates and outcomes consistently,
  • keeping refresh cycles stable as data grows.

Studyfyr reduces this burden with:

  • FHIR-native query semantics (FHIR search and FHIRPath), also adapted for OMOP,
  • FHIR/OMOP compatible cohort and dataset workflows,
  • Spark scale-out execution,
  • reusable building blocks for cohorts, sampling, features, and outcomes.

Core workflow

Select data source

  • FHIR API (server)
  • FHIR lake (NDJSON/Parquet/Delta)
  • OMOP database/lake extracts

Define cohort logic

  • Entry and exit criteria
  • Eligibility period(s), for example the first diagnosis, first prescription, or all medication exposures
  • Inclusion and exclusion rules
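As a minimal sketch, the entry/eligibility/exclusion steps above can be expressed over toy longitudinal records in plain Python. The field names, diagnosis/medication codes, and rules here are illustrative assumptions, not the Studyfyr schema or API:

```python
from datetime import date

# Toy longitudinal records; field names and codes are illustrative only.
patients = {
    "p1": {"birth": date(1970, 5, 1),
           "events": [("dx:E11", date(2020, 3, 1)), ("rx:metformin", date(2020, 4, 1))]},
    "p2": {"birth": date(2010, 1, 1),
           "events": [("dx:E11", date(2021, 6, 1))]},
    "p3": {"birth": date(1985, 9, 9),
           "events": [("rx:insulin", date(2019, 1, 1)), ("dx:E11", date(2020, 7, 1))]},
}

def first_event(events, code):
    dates = sorted(d for c, d in events if c == code)
    return dates[0] if dates else None

def in_cohort(p, today=date(2022, 1, 1)):
    idx = first_event(p["events"], "dx:E11")          # entry: first qualifying diagnosis
    if idx is None:
        return None
    age = (idx - p["birth"]).days // 365
    if age < 18:                                      # inclusion: adult at index
        return None
    if any(c == "rx:insulin" and d < idx for c, d in p["events"]):
        return None                                   # exclusion: prior insulin exposure
    return {"index_date": idx, "exit_date": today}    # exit: end of observation

cohort = {pid: m for pid in patients if (m := in_cohort(patients[pid]))}
print(sorted(cohort))  # → ['p1']
```

Only `p1` survives: `p2` is a minor at index, and `p3` has the excluded exposure before index.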

Define sampling strategy

  • Periodic sampling (monthly, quarterly)
  • Event-aligned sampling (for example pre/post treatment or admissions/discharges)
  • Outcome-aligned sampling (for example 1, 2, … hours before a complication in the ICU)
  • Rolling windows such as “since index” or “since last timepoint”
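The sampling strategies above boil down to generating timepoints per patient. A hedged plain-Python sketch (dates, intervals, and the 30-day step are illustrative, not Studyfyr defaults):

```python
from datetime import date, datetime, timedelta

index_date = date(2021, 1, 15)   # illustrative index date
exit_date = date(2021, 4, 15)

def periodic(start, end, every_days):
    """Periodic sampling: one timepoint every `every_days` from index to exit."""
    t, out = start, []
    while t <= end:
        out.append(t)
        t += timedelta(days=every_days)
    return out

def outcome_aligned(outcome_ts, hours_before):
    """Outcome-aligned sampling: timepoints N hours before an observed outcome."""
    return [outcome_ts - timedelta(hours=h) for h in hours_before]

monthly = periodic(index_date, exit_date, 30)
windows = [(index_date, t) for t in monthly]            # rolling "since index" windows
pre = outcome_aligned(datetime(2021, 2, 3, 14, 0), [1, 2])
print(len(monthly))  # → 4
```

Event-aligned sampling follows the same shape as `outcome_aligned`, anchored to treatment or admission/discharge timestamps instead of outcomes.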

Compute features and endpoints

  • Covariates (labs, vitals, diagnoses, medications, utilization)
  • Endpoints and outcomes (events, counts, time-to-event derivations)
  • Feature enumeration over temporal windows (tumbling, extending, session, etc.) and aggregations (avg, max, min, last, count)
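To make the windowing and aggregation bullet concrete, here is a tumbling-window sketch in plain Python over toy lab values (the data, window width, and naming are illustrative assumptions):

```python
from datetime import date

# Illustrative lab measurements (date, value); not a Studyfyr schema.
labs = [(date(2021, 1, 5), 7.1), (date(2021, 1, 20), 6.8),
        (date(2021, 2, 10), 6.5), (date(2021, 3, 1), 6.2)]
index_date = date(2021, 1, 1)

def tumbling(events, start, width_days):
    """Assign each event to a fixed-width (tumbling) window after `start`."""
    buckets = {}
    for d, v in events:
        w = (d - start).days // width_days
        buckets.setdefault(w, []).append(v)
    return buckets

def aggregate(values):
    # The avg/max/min/last/count aggregations named above.
    return {"avg": sum(values) / len(values), "max": max(values),
            "min": min(values), "last": values[-1], "count": len(values)}

features = {w: aggregate(vs) for w, vs in tumbling(labs, index_date, 30).items()}
print(features[0]["count"])  # → 2 labs fall in the first 30-day window
```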

Produce analysis-ready outputs

  • Relational and feature datasets for HEOR, epidemiology, and AI workflows
  • Reproducible derived tables for downstream R/Python/ML pipelines

Optional: promote and publish to the organizational level

  • Promote approved pipelines/artifacts for reuse
  • Publish periodic dataset snapshots for governed consumption

Typical RWE pipeline patterns

Cohort feasibility and characterization

  • Define cohort logic once
  • Generate counts and baseline characteristics
  • Iterate quickly with scoped runs

Confounders and covariates

  • Compute time-windowed features before index date
  • Derive exposure, comorbidity, and laboratory summary variables
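A time-windowed baseline covariate, as described above, is just a flag for whether a code appears in a lookback window ending at the index date. A minimal sketch (codes, dates, and the 12-month window are illustrative assumptions):

```python
from datetime import date, timedelta

index_date = date(2021, 6, 1)             # illustrative index date
lookback = timedelta(days=365)            # illustrative 12-month baseline window

# (code, date) diagnosis history for one patient; codes are illustrative.
history = [("I10", date(2020, 9, 1)),     # inside the lookback window
           ("E78", date(2019, 1, 1)),     # before the lookback window
           ("N18", date(2021, 7, 1))]     # after index: must not leak in

def baseline_flags(events, index, window, codes):
    """Binary covariates: 1 if the code was observed in [index - window, index)."""
    seen = {c for c, d in events if index - window <= d < index}
    return {c: int(c in seen) for c in codes}

covariates = baseline_flags(history, index_date, lookback, ["I10", "E78", "N18"])
print(covariates)  # → {'I10': 1, 'E78': 0, 'N18': 0}
```

Excluding events on or after the index date (`d < index`) is what prevents outcome information from leaking into baseline covariates.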

Endpoints and outcomes

  • Generate event tables and derived outcomes
  • Build longitudinal patient-by-timepoint modeling tables

Refreshable production pipelines

  • Rerun with newly arrived data
  • Maintain stable definitions for repeated analysis cycles

Execution environments

Runs wherever Spark runs, including Kubernetes, YARN, Databricks, EMR, and on-prem clusters.
You can use the Studyfyr SDKs in notebook environments via:

  • Pontegra CRT (open source): Spark/Scala SDK for clinical data processing
  • Pontegra Studyfyr (enterprise): governed execution layer with
    • managed service API and control plane UI
    • Python SDK (thin client) for workflow execution

Example notebook snippets (SDK)

Example 1: Cohort pipeline
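The actual SDK API is not reproduced here. As a stand-in, this plain-Python sketch shows the shape such a notebook cell takes: a source loader, cohort logic, and a pipeline defined once and rerun as a unit. Every name below is hypothetical:

```python
# Hypothetical pipeline shape only; the real Studyfyr SDK names differ.
from datetime import date

def load_source(_config):
    # Stand-in for reading a FHIR lake / OMOP extract.
    return [{"id": "p1", "dx_dates": [date(2020, 3, 1)]},
            {"id": "p2", "dx_dates": []}]

def apply_cohort(rows):
    # Entry criterion: at least one qualifying diagnosis; index = first one.
    return [{"id": r["id"], "index_date": min(r["dx_dates"])}
            for r in rows if r["dx_dates"]]

def run_pipeline(config):
    # Steps defined once, rerunnable against refreshed data.
    return apply_cohort(load_source(config))

cohort = run_pipeline({"source": "fhir-lake"})
print([r["id"] for r in cohort])  # → ['p1']
```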

Example 2: Feature dataset pipeline
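Again with hypothetical names only, a feature dataset pipeline joins cohort rows to timepoints and computes per-timepoint features, yielding a patient-by-timepoint modeling table:

```python
# Hypothetical pipeline shape only; the real Studyfyr SDK names differ.
from datetime import date, timedelta

cohort = [{"id": "p1", "index_date": date(2021, 1, 1)}]
labs = {"p1": [(date(2021, 1, 10), 6.9), (date(2021, 2, 5), 6.4)]}

def timepoints(index, n, step_days=30):
    # Periodic timepoints from the index date.
    return [index + timedelta(days=step_days * i) for i in range(n)]

def last_value_before(events, t):
    # "Last" aggregation over all events up to the timepoint.
    prior = [v for d, v in events if d <= t]
    return prior[-1] if prior else None

# Patient-by-timepoint modeling table: one row per (patient, timepoint).
table = [{"id": r["id"], "t": t,
          "hba1c_last": last_value_before(labs[r["id"]], t)}
         for r in cohort for t in timepoints(r["index_date"], 3)]
print(len(table))  # → 3 rows for one patient and three timepoints
```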

Governance by default

Even in processing-first deployments, Studyfyr builds governance signals directly into execution.

For EHDS/SPE/TRE governance details, see the EHDS-aligned solution page.

  • Run manifests capture inputs, definitions, versions, outputs, and timing.
  • Metadata-first audit records support operational traceability (no PHI in audit by default).
  • Controlled publish/export patterns enforce review before organizational or external release.
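The run-manifest idea above can be sketched as a metadata-only record with a content digest over the definition payload; the field names and versioning scheme here are illustrative assumptions, not the Studyfyr manifest format:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_manifest(inputs, definitions, outputs):
    """Metadata-only run manifest: what ran, on what, producing what. No PHI."""
    body = {
        "inputs": inputs,               # source identifiers, not row-level data
        "definitions": definitions,     # cohort/feature definition versions
        "outputs": outputs,             # table names and content digests
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    # Digest covers only the reproducibility-relevant fields, not the timestamp,
    # so identical definitions over identical inputs yield identical digests.
    payload = {k: body[k] for k in ("inputs", "definitions", "outputs")}
    body["digest"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return body

m = make_manifest({"source": "omop_extract_v5"},      # illustrative identifiers
                  {"cohort_def": "t2dm@1.3.0"},
                  {"features": "sha256:abc123"})
print(len(m["digest"]))  # → 64 (hex SHA-256)
```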

Frequently asked questions

Does this support both FHIR and OMOP?

Yes. Pipelines can run on FHIR and OMOP sources, including API, database, and lake-based patterns.

Do we need to change our R/Python modeling stack?

No. Studyfyr/CRT produces analysis-ready datasets for your existing R/Python/ML workflows.

Where does it run?

Wherever Spark runs (for example Databricks, EMR, Kubernetes, YARN, and on-prem clusters).

How are reruns and updates handled?

Pipeline runs are reproducible by design, with manifest-backed execution and refresh/recompute lifecycle patterns.

Can this operate in governed environments (SPE/TRE)?

Yes. Controlled execution and output review patterns support secure secondary-use workflows.

Pilot delivery plan

A typical pilot focuses on one therapeutic area and delivers the following.

  • connection to one approved source (FHIR or OMOP)
  • one cohort definition with entry/exit/eligibility logic
  • one sampling strategy (periodic or event-aligned)
  • two to three analysis-ready dataset tables
  • run documentation (definitions, parameters, outputs, manifest references)