How DataSurface Works

From model definition to data flowing in minutes. No custom pipeline code required.

What You Get

🐳

2 Docker Images

Deploy on your infrastructure—on-premise, AWS, Azure, or hybrid. You control where it runs.

🐍

Python SDK

Define your data model using a clean Python DSL. Integrate with your existing tooling and workflows.

📚

Complete Documentation

Comprehensive guides, API reference, and examples to get your team up and running quickly.

Runs wherever you need it: on-premise data centers, AWS, Azure, GCP, or any combination. Your data never leaves your control.

Three Simple Steps

1

Define Your Data Model in Git

Declare what data you have (producers), what data you need (consumers), and any transformations using a Python DSL. Store in Git with standard pull request workflows. All changes validated automatically.

# Data Producer declares availability
team.addDatastore(
    name="CustomerDB",
    datasets=[
        Dataset(name="customers", columns=[...]),
        Dataset(name="orders", columns=[...])
    ],
    ingestion=SQLSnapshot(
        credential="db_readonly",
        schedule=CronTrigger("*/5 * * * *")  # Every 5 min
    )
)

# Data Consumer declares requirements  
workspace.addDatasetSink(
    datastore="CustomerDB",
    dataset="customers",
    requirements=ForensicHistory()  # SCD Type 2
)

Pull Request Validation: Automated linting checks authorization, backward compatibility, policy compliance, and schema correctness before merge.
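The validation step can be pictured as a chain of checks run over the proposed model change. A minimal sketch in plain Python (the check names and model shape here are illustrative, not the actual DataSurface internals):

```python
# Illustrative sketch of PR-time model validation.
# The check names and the "change" dict structure are hypothetical.

def check_authorization(change):
    # The submitting repo may only touch its own zone.
    return change["repo"] == change["zone_owning_repo"]

def check_backward_compat(change):
    # Removing columns is a breaking change and must be blocked.
    removed = set(change["old_columns"]) - set(change["new_columns"])
    return not removed

def validate_pull_request(change):
    checks = [check_authorization, check_backward_compat]
    failures = [c.__name__ for c in checks if not c(change)]
    return failures  # an empty list means the PR may merge

change = {
    "repo": "github.com/company/finance-data",
    "zone_owning_repo": "github.com/company/finance-data",
    "old_columns": ["customer_id", "name"],
    "new_columns": ["customer_id", "name", "email"],  # additive: OK
}
print(validate_pull_request(change))  # → []
```

The same chain rejects a change that drops a column, returning the name of the failed check instead of an empty list.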

2

Infrastructure Team Assigns Workspaces

Central ops team assigns consumer Workspaces to DataPlatforms (dev/qa/prod environments). DataSurface automatically generates all infrastructure: Airflow DAGs, database schemas, ingestion jobs, merge logic, consumer views.

Merge PR (model change committed) → Auto-Generate (DAGs, schemas, views)


Minutes from commit to data flowing

Infrastructure detects model change → regenerates DAGs → jobs start running → data begins ingesting automatically. No manual pipeline building required.
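The "auto-generate" step can be sketched as rendering pipeline definitions directly from the merged model. A hypothetical illustration that emits an Airflow DAG file as text (the template and naming convention are assumptions, not DataSurface's actual generator output):

```python
# Hypothetical sketch of generating an Airflow DAG from the merged model.
# The template, dag_id convention, and CLI command are illustrative only.

DAG_TEMPLATE = """\
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="{dag_id}", schedule="{schedule}") as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="datasurface ingest {store}",
    )
"""

def render_ingestion_dag(store_name, cron):
    # One DAG per datastore, scheduled from the model's CronTrigger.
    return DAG_TEMPLATE.format(
        dag_id=f"ingest_{store_name.lower()}",
        schedule=cron,
        store=store_name,
    )

print(render_ingestion_dag("CustomerDB", "*/5 * * * *"))
```

Because the DAG is derived from the model, a merged model change regenerates it with no hand-written pipeline code.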

3

Consumers Query Through Governed Views

Consumers access data through automatically generated SQL views. Views enforce workspace permissions and abstract away storage complexity. Data can be replicated to optimal query engines via CQRS for performance.

-- Query live records only (SCD Type 1)
SELECT customer_id, name, email, city
FROM analytics_workspace.customers_live
WHERE region = 'US-WEST';

-- Query full history (SCD Type 2)
SELECT customer_id, name, valid_from, valid_to
FROM analytics_workspace.customers_all
WHERE customer_id = 12345
ORDER BY valid_from;

Automatic Replication: High-volume consumers get their own database via CQRS. Data replicated from primary storage with batch consistency maintained.

The Data Logistics Architecture

📁 DataSurface Model (Git Repository)
Producers + Consumers + Policies + Platform Assignments · Validated via pull requests · Generates the infrastructure below

⚡ Primary Storage Layers (multiple; each runs independently, data stays where it belongs)

  • ☁️ AWS Primary: Aurora · S3 · Redshift (Ingest → Transform → Merge)
  • 🏢 On-Premise Primary: Postgres · Oracle · DB2 (Ingest → Transform → Merge)
  • ❄️ Snowflake Primary: Native Snowflake tables (Ingest → Transform → Merge)

CQRS Replication to consumer databases (optimized per workload · scaled independently)

  • ☁️ AWS Aurora: 10 Workspaces (Analytics Team)
  • 🏢 Oracle (On-Site): 25 Workspaces (Finance Team)
  • 🔷 Azure SQL Server: 8 Workspaces (Ops Team)
  • ❄️ Snowflake: 15 Workspaces (Data Science)
  • 📊 Consumer DB N: scale as needed

👁️ Consumers query isolated views.

The Key Insight: DataSurface supports multiple Primary Storage Layers across AWS, on-premise, and Snowflake—deploy as many of each as you need. Each consumer gets data in their optimal format and technology without impacting producers or other consumers.

The Complete Workflow

🏛️ GovernanceZone / Team Owner Perspective

1. Control Your Domain

Each GovernanceZone is managed by a specific Git repository. Zone owners control their teams, datastores, and workspaces, independently of other zones. The central team declares your zone; you then manage everything within it.

# Central team creates declaration
ecosystem.addGovernanceZoneDeclaration(
    name="Finance",
    owningRepo="github.com/company/finance-data"
)

# Finance team defines their zone (in their repo)
zone = GovernanceZone(name="Finance")
team = zone.addTeam(name="Treasury")

2. Set Policies for Your Data

Define who can access your data and for what purpose. Policies are enforced automatically during PR validation and at runtime.

zone.addPolicy(
    AllowDisallowPolicy(
        allow=["FinanceTeam", "AuditTeam"],
        disallow=["*"],  # Block all others
        purpose="Regulatory reporting only"
    )
)
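How a policy like the one above gets evaluated can be sketched in a few lines. This is an illustration of the allow/disallow semantics described here, not the shipped DataSurface class:

```python
# Sketch of AllowDisallowPolicy evaluation at access-check time.
# The class body and precedence rule are illustrative assumptions.

class AllowDisallowPolicy:
    def __init__(self, allow, disallow, purpose=""):
        self.allow = set(allow)
        self.disallow = set(disallow)
        self.purpose = purpose

    def permits(self, team):
        # An explicit allow wins over a wildcard disallow.
        if team in self.allow:
            return True
        return "*" not in self.disallow and team not in self.disallow

policy = AllowDisallowPolicy(
    allow=["FinanceTeam", "AuditTeam"],
    disallow=["*"],  # block everyone not explicitly allowed
    purpose="Regulatory reporting only",
)
assert policy.permits("AuditTeam")
assert not policy.permits("MarketingTeam")
```

The explicit-allow-beats-wildcard rule is what makes `disallow=["*"]` a safe default: adding a team to the allow list is the only way to grant access.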

3. Submit Pull Requests to Central

Your changes are submitted as PRs to the central ecosystem repository. Automated validation ensures you only modified your authorized zone. After merge, your data and policies are live across all environments.

Federated Governance at Scale

Each zone operates independently with its own repo and team. M&A integration? Just add a new GovernanceZone. Acquired company retains control of their data while sharing according to your policies.

📤 Data Producer Perspective

1. Declare Your Data

Define your database schema, specify how to connect (credentials, hostname), and set ingestion frequency. Commit to your team's Git repository.

datastore = team.addDatastore(
    name="SalesDB",
    datasets=[customers, orders, products]
)

2. Submit Pull Request

Your changes are validated: schemas checked, authorization verified, policies enforced. Central team reviews and merges.

3. Done

DataSurface starts ingesting your data automatically. You control who can access it through policies. No pipeline code to write or maintain.

📥 Data Consumer Perspective

1. Request the Data You Need

Create a Workspace, specify which datasets you need, and define your requirements (live-only vs. full history, latency, retention period).

workspace = team.addWorkspace(
    name="CustomerAnalytics",
    sinks=[
        DatasetSink("SalesDB", "customers", LiveOnly()),
        DatasetSink("SalesDB", "orders", ForensicHistory())
    ]
)

2. Get Infrastructure Assignment

Central ops reviews and assigns your Workspace to a DataPlatform (production, dev, or specific consumer database for high-volume use).

3. Query Your Views

DataSurface creates workspace-specific SQL views. Connect and query. Views enforce permissions—you see only your authorized data.

-- Connection provided by ops team
SELECT * FROM customer_analytics.customers_live
WHERE signup_date > '2025-01-01';

⚙️ Infrastructure Team Perspective

1. Define Runtime Environments

Configure DataPlatforms (dev/qa/prod), specify infrastructure (Kubernetes namespace, databases, Airflow), set version selectors (which git tags to deploy).

2. Map Workspaces to Platforms

Assign consumer Workspaces to appropriate DataPlatforms. High-priority or high-volume consumers can get dedicated Consumer Replica Groups (CQRS).

3. Monitor & Scale

Watch batch metrics, latency, and load. Add more consumer databases as needed. DataSurface handles replication automatically.

What Happens Automatically

🔄 Ingestion

Scheduled jobs pull data from source systems (SQL databases, APIs, files). Supports snapshot, watermark-based, and CDC ingestion.

📊 Schema Management

Tables, columns, indexes automatically created and maintained. Schema evolution validated—forward-compatible changes flow through, breaking changes blocked.

🔀 Merge Logic

SCD Type 1 (live records only) or Type 2 (full forensic history) maintained automatically. Handles inserts, updates, deletes transactionally.
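The Type 2 bookkeeping can be sketched in plain Python. The real merge runs as SQL inside primary storage; this illustrative version just shows how `valid_from`/`valid_to` windows are maintained:

```python
# Minimal sketch of an SCD Type 2 (forensic history) merge.
# Row shapes and timestamps are simplified for illustration.

def merge_scd2(history, incoming, batch_ts):
    # Index the currently-open (live) record per key.
    live = {r["key"]: r for r in history if r["valid_to"] is None}
    for row in incoming:
        current = live.get(row["key"])
        if current and current["value"] == row["value"]:
            continue  # unchanged: keep the open record as-is
        if current:
            current["valid_to"] = batch_ts  # close the old version
        history.append({"key": row["key"], "value": row["value"],
                        "valid_from": batch_ts, "valid_to": None})
    return history

history = [{"key": 1, "value": "NYC", "valid_from": "t0", "valid_to": None}]
merge_scd2(history, [{"key": 1, "value": "SF"}], "t1")
# Key 1 now has a closed NYC version and an open SF version.
```

Type 1 is the degenerate case: instead of closing the old version, it is overwritten, so only the live record survives.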

🎯 Consumer Views

Workspace-specific views generated automatically. Permissions enforced at view level. Name translation handles database differences.
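Generating a workspace-scoped "live" view reduces to rendering SQL from the model. A sketch, assuming the naming convention shown in the query examples above (the function and table names are hypothetical):

```python
# Illustrative generator for a workspace-scoped live view.
# The merge-table name and view convention are assumptions.

def live_view_sql(workspace, dataset, columns, merge_table):
    cols = ", ".join(columns)
    return (
        f"CREATE VIEW {workspace}.{dataset}_live AS\n"
        f"SELECT {cols} FROM {merge_table}\n"
        f"WHERE valid_to IS NULL;"  # live records only
    )

sql = live_view_sql("analytics_workspace", "customers",
                    ["customer_id", "name", "email"],
                    "merge.customers_scd2")
print(sql)
```

Because consumers only ever see the view, the underlying merge table can move or be renamed without breaking their queries.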

📈 CQRS Replication

High-volume consumers replicated to dedicated databases. Batch-consistent replication preserves transactional integrity.
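Batch consistency means the consumer copy only ever advances by whole batches, so a query never observes a half-applied batch. A minimal sketch of that invariant (data structures are illustrative):

```python
# Sketch of batch-consistent CQRS replication: apply a full batch,
# then advance the cursor. Structures are illustrative assumptions.

def replicate(primary_batches, consumer_state):
    applied = consumer_state["last_batch"]
    for batch_id in sorted(b for b in primary_batches if b > applied):
        rows = primary_batches[batch_id]
        consumer_state["rows"].extend(rows)       # apply the whole batch
        consumer_state["last_batch"] = batch_id   # then advance the cursor
    return consumer_state

primary = {1: ["a", "b"], 2: ["c"]}
state = {"last_batch": 0, "rows": []}
replicate(primary, state)
assert state == {"last_batch": 2, "rows": ["a", "b", "c"]}
```

Re-running the replicator is a no-op for already-applied batches, which is what lets each consumer database lag and catch up independently.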

🔐 Governance

Policies enforced automatically. Access controls, data residency, purpose restrictions all validated during PR review and at runtime.

Real-World Example

Scenario: Regulatory Reporting System

A bank needs to consolidate data from 50 source systems, run 20 transformations for data cleanup/enrichment, and serve 100 downstream reports (some real-time, some historical).

❌ Traditional Approach

  • 50 × 100 = 5,000 custom pipelines to maintain
  • Each producer change breaks downstream consumers
  • 3-6 months to onboard new data source
  • 12-18 months to migrate to new database technology
  • 80% of engineering time on maintenance

✅ With DataSurface

  • 50 + 20 + 100 = 170 model definitions (97% less code)
  • Schema changes automatically propagated with validation
  • Hours to onboard new source (add to model, merge PR)
  • Configuration change to swap database vendors
  • 20% engineering time on maintenance, 80% on value

Outcome: The bank reduces data platform costs by 60%, delivers new reports in days instead of months, and can migrate from on-premise to AWS or Azure with a configuration change instead of an $8M rewrite project.

The Technology Layer Is Replaceable

Start on Postgres today. Add SQL Server consumers next month. Migrate primary storage to Aurora next year. Move to Azure in 3 years. Zero pipeline rewrites.

Your business logic stays the same.
The infrastructure evolves underneath.
Consumers never notice.

See It In Action

Common Questions

What happens when a producer changes their schema?

Forward-compatible changes (new columns, new datasets) are validated during PR review. If approved, DataSurface automatically updates downstream tables and views. Consumers see new columns as NULL until data arrives. Breaking changes (removing columns, changing types) are blocked—backward compatibility is enforced.
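The compatibility rule above can be sketched as a small classifier. Columns are simplified here to a name-to-type mapping, which is an illustration rather than the real DataSurface schema model:

```python
# Sketch of the schema-evolution rule: additive changes flow through,
# removals and type changes are blocked. Column model is simplified.

def classify_change(old, new):
    removed = set(old) - set(new)
    retyped = {c for c in old if c in new and old[c] != new[c]}
    if removed or retyped:
        return "breaking"   # blocked at PR time
    if set(new) - set(old):
        return "additive"   # flows through; consumers see NULL until data arrives
    return "no-op"

old = {"customer_id": "int", "name": "text"}
assert classify_change(old, {**old, "email": "text"}) == "additive"
assert classify_change(old, {"customer_id": "int"}) == "breaking"
```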

How does CQRS scaling work?

Primary storage ingests data once. High-volume consumers get dedicated Consumer Replica Group databases. DataSurface replicates data from primary to consumer databases in batches, maintaining transactional consistency. Each consumer database can be optimized for its specific workload (different indexes, partitioning, even different database technology).

How long does it take to onboard a new data source?

  • Initial model definition: 1-2 hours (define schema, set credentials)
  • PR validation: automated (seconds)
  • Infrastructure generation: minutes after merge
  • First data available: one ingestion cycle (minutes to hours, depending on source size)

Total: same-day for straightforward sources, versus 3-6 months of traditional pipeline building.

What about DataTransformers (ETL/ELT)?

DataTransformers are first-class citizens. Define transformation code in Git with versioning. DataSurface schedules execution, manages credentials, handles failures. Output becomes a new Datastore that consumers can use. Git versioning ensures reproducibility—you know exactly which code version produced which batch.
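The reproducibility guarantee amounts to stamping every output batch with the code version that produced it. A sketch of that idea (field names and the runner function are hypothetical):

```python
# Illustration of transformer reproducibility: each output batch records
# the git commit that produced it. Field names are hypothetical.

def run_transformer(transform_fn, rows, git_sha, batch_id):
    return {
        "batch_id": batch_id,
        "code_version": git_sha,  # auditable: which code made this batch
        "rows": [transform_fn(r) for r in rows],
    }

out = run_transformer(str.upper, ["a", "b"], git_sha="3f9c2e1", batch_id=42)
assert out["rows"] == ["A", "B"]
assert out["code_version"] == "3f9c2e1"
```

Given a batch ID, you can check out the recorded commit and re-run the transformation to reproduce the output exactly.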

Ready to See It Working?

Schedule a technical demo to see DataSurface handling real data workflows: ingestion, transformation, replication, and schema evolution—all automated.

Schedule Demo See Model Examples