DataTransformers
Transform data with Python or dbt. Test locally in your IDE before deploying.
Transform Data Your Way
DataTransformers are isolated execution environments for custom data transformations. They consume datasets from a Workspace and write results to their own Datastore, with full schema governance and version control.
Choose the right tool for the job:
Python
Complex business logic, machine learning, API integrations, or transformations easier to express in code than SQL (see the sketch below).
dbt
SQL-based transformations with dbt's testing framework, model dependencies, and ecosystem of packages.
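For the Python path, the transformer's core is just code you control. The snippet below is a minimal sketch of the kind of logic a Python DataTransformer might wrap, assuming a simple row-dict interface; the function name and row shapes are illustrative, not part of the DataSurface API, and it mirrors the customer-segmentation rule used in the dbt example later on.
# Hypothetical transformation logic for a Python DataTransformer.
# segment_customers and its row format are illustrative, not DataSurface API.
from datetime import date
from typing import Iterable

def segment_customers(rows: Iterable[dict], as_of: date) -> list[dict]:
    """Assign a segment to each customer based on tenure (mirrors the dbt example)."""
    output = []
    for row in rows:
        tenure_days = (as_of - row["signup_date"]).days
        output.append({
            "customer_id": row["customer_id"],
            "segment": "loyal" if tenure_days > 365 else "new",
        })
    return output

# Example usage:
# segment_customers([{"customer_id": 1, "signup_date": date(2023, 1, 1)}], date.today())
Keeping the logic in a pure function like this also makes it straightforward to unit test outside the platform.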
dbt Integration
DataSurface provides first-class dbt integration. When you define a dbt DataTransformer, DataSurface automatically:
- Clones your dbt repository from GitHub at the specified version tag
- Auto-generates profiles.yml with database connection details
- Auto-generates sources.yml mapping your Workspace inputs to dbt sources
- Handles credentials securely via Kubernetes secrets (never written to files)
- Runs dbt run and ingests results to your Datastore
Configuration Example
DataTransformer(
    "CustomerSegmentation",
    store=Datastore(
        "SegmentedCustomers",
        datasets=[...]
    ),
    code=DBTCodeArtifact(
        VersionedRepository(
            GitHubRepository("myorg/dbt-transforms", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    trigger=CronTrigger("Daily", "0 0 * * *")
)
Writing dbt Models
Reference input sources using the auto-generated workspace_inputs source:
{{ config(materialized='view') }}

{% set target_table = var('output_customers') %}

{% set insert_sql %}
INSERT INTO {{ target_table }} (customer_id, segment)
SELECT
    customer_id,
    CASE
        WHEN tenure_days > 365 THEN 'loyal'
        ELSE 'new'
    END AS segment
FROM {{ source('workspace_inputs', 'customers') }}
{% endset %}

{% do run_query(insert_sql) %}

SELECT 1 AS dummy
This pattern uses explicit INSERT into DataSurface-managed tables, respecting schema governance while giving you full dbt flexibility.
dbt Connectors for Data Ingestion
Beyond transformations, dbt DataTransformers unlock ingestion from any source with a dbt connector. This means you can bring data into DataSurface from SaaS platforms, APIs, and external systems without writing custom ingestion code.
Any dbt source package that can extract data becomes an ingestion pathway into DataSurface. Your dbt model pulls from the external API and writes to a DataSurface-managed Datastore, where it gains full governance, SCD2 history tracking, and CQRS replication.
Extra Credentials for External Connections
dbt connectors need credentials to authenticate with external systems. DataSurface allows DataTransformers to declare extraCredentials that are securely injected at runtime via Kubernetes secrets:
DataTransformer(
    "SalesforceIngestion",
    store=Datastore("SalesforceData", datasets=[...]),
    code=DBTCodeArtifact(
        VersionedRepository(
            GitHubRepository("myorg/salesforce-dbt", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    extraCredentials=[
        Credential("salesforce_api", CredentialType.USER_PASSWORD),
        Credential("salesforce_token", CredentialType.API_KEY)
    ],
    trigger=CronTrigger("Daily", "0 6 * * *")
)
Credentials are injected as environment variables (SALESFORCE_API_USER, SALESFORCE_API_PASSWORD, etc.) that your dbt profiles.yml template can reference. No secrets are ever written to disk.
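The extraCredentials declaration lives on the DataTransformer itself, so if the same environment-variable injection applies when the transformer code is Python (an assumption; the paragraph above documents the dbt case), the values can be read directly from the process environment. A minimal sketch, using only the variable names documented above and an illustrative client call:
import os

# Read credentials injected by DataSurface from Kubernetes secrets.
# Variable names follow the documented pattern for
# Credential("salesforce_api", CredentialType.USER_PASSWORD).
sf_user = os.environ["SALESFORCE_API_USER"]
sf_password = os.environ["SALESFORCE_API_PASSWORD"]

# Illustrative only: hand the values to whatever client your code uses
# to reach the external system; nothing is ever written to disk.
# client = SalesforceClient(username=sf_user, password=sf_password)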
dbt Packages
Use any dbt package for transformations or ingestion:
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.0
Python vs dbt
| Capability | Python | dbt |
|---|---|---|
| Best for | Complex logic, ML, APIs | SQL transformations |
| Learning curve | Steeper | Gentle for SQL users |
| Testing | Custom code tests | dbt tests & assertions |
| Lineage | Implicit in code | Explicit in dbt DAG |
| Ecosystem | Python packages | dbt packages |
| Performance | Depends on your Python code | Native SQL |
Local IDE Testing
Test DataTransformers locally in your IDE against a local database, without deploying to Kubernetes or Airflow. Get instant feedback with standard pytest.
How It Works
- Prepare your transformer with self-describing input/output definitions
- Create a test class inheriting from BaseDTLocalTest
- Inject test data, run the transformer, verify output
Test Example
from tests.test_datatransformer_local import BaseDTLocalTest

class TestMyTransformer(BaseDTLocalTest):
    def setUp(self):
        super().setUp()
        self.setup_from_transformer_module("myproject.transformers.mask_pii")

    def test_masks_email(self):
        # 1. Inject test data
        self.inject_data("customers", [
            {"id": 1, "name": "Alice", "email": "alice@example.com"}
        ])

        # 2. Run transformer
        self.run_dt_job()

        # 3. Verify output
        output = self.get_output_data()
        self.assertEqual(len(output), 1)
        self.assertIn("***", output[0]["email"])  # Verify masking
Key Features
Database Agnostic
Works with any SQLAlchemy-compatible database. Default is local PostgreSQL.
Full IDE Support
Set breakpoints, step through code, inspect variables. Full debugging in PyCharm, VS Code, etc.
Cycle Testing
Inject data, run, verify, inject more, run again. Test incremental and CDC logic locally (see the sketch below).
Schema Management
Automatically creates input/output tables from your Datastore definitions.
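Cycle testing uses the same hooks shown in the test example above (inject_data, run_dt_job, get_output_data). Below is a minimal sketch of a two-cycle test method added to a class like TestMyTransformer; the table name, rows, and the assumption that the output simply accumulates to two rows are illustrative and depend on your transformer's incremental logic.
def test_incremental_load(self):
    # Cycle 1: seed initial data and run the transformer once.
    self.inject_data("customers", [
        {"id": 1, "name": "Alice", "email": "alice@example.com"}
    ])
    self.run_dt_job()
    self.assertEqual(len(self.get_output_data()), 1)

    # Cycle 2: inject more data and run again to exercise incremental/CDC behavior.
    self.inject_data("customers", [
        {"id": 2, "name": "Bob", "email": "bob@example.com"}
    ])
    self.run_dt_job()
    self.assertEqual(len(self.get_output_data()), 2)  # assumes output accumulates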
Environment Setup
# Default: PostgreSQL on localhost
export TEST_POSTGRES_URL=postgresql://localhost/datasurface_test
# Run tests with pytest
pytest tests/test_my_transformer.py -v
Getting Started
- Choose Python or dbt based on your transformation needs
- Define your DataTransformer in your ecosystem model
- Write local tests using the testing framework
- Tag and deploy - DataSurface handles the rest