DataTransformers

Transform data with Python or dbt. Test locally in your IDE before deploying.

Transform Data Your Way

DataTransformers are isolated execution environments for custom data transformations. They consume datasets from a Workspace and write results to their own Datastore, with full schema governance and version control.

Choose the right tool for the job:

Python

Complex business logic, machine learning, API integrations, or transformations that are easier to express in code than in SQL. A configuration sketch follows below.

dbt

SQL-based transformations with dbt's testing framework, model dependencies, and ecosystem of packages.
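
The dbt configuration is shown in full in the next section; a Python-based DataTransformer is declared in the same shape. The sketch below is illustrative only: PythonCodeArtifact is a hypothetical name standing in for whichever code-artifact class your DataSurface version provides for Python transforms, while the other classes match the dbt example later on this page.

# Illustrative sketch only. "PythonCodeArtifact" is a hypothetical stand-in
# for the Python code-artifact class; the surrounding classes mirror the
# dbt configuration example shown later on this page.
DataTransformer(
    "MaskPII",
    store=Datastore("MaskedCustomers", datasets=[...]),
    code=PythonCodeArtifact(  # hypothetical: substitute your Python artifact class
        VersionedRepository(
            GitHubRepository("myorg/python-transforms", "main"),
            EnvRefReleaseSelector("transform_version")
        )
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    trigger=CronTrigger("Daily", "0 0 * * *")
)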

dbt Integration

DataSurface provides first-class dbt integration. When you define a dbt DataTransformer, DataSurface automatically:

  • Clones your dbt repository from GitHub at the specified version tag
  • Auto-generates profiles.yml with database connection details
  • Auto-generates sources.yml mapping your Workspace inputs to dbt sources
  • Handles credentials securely via Kubernetes secrets (never written to files)
  • Runs dbt run and ingests results to your Datastore

Configuration Example

DataTransformer(
    "CustomerSegmentation",
    store=Datastore(                      # output Datastore owned by this transformer
        "SegmentedCustomers",
        datasets=[...]
    ),
    code=DBTCodeArtifact(
        VersionedRepository(              # dbt project cloned from GitHub at the selected release
            GitHubRepository("myorg/dbt-transforms", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    trigger=CronTrigger("Daily", "0 0 * * *")  # run daily at midnight
)

Writing dbt Models

Reference input sources using the auto-generated workspace_inputs source:

{{ config(materialized='view') }}

{# The DataSurface-managed output table is supplied as a dbt var #}
{% set target_table = var('output_customers') %}

{% set insert_sql %}
INSERT INTO {{ target_table }} (customer_id, segment)
SELECT
    customer_id,
    CASE
        WHEN tenure_days > 365 THEN 'loyal'
        ELSE 'new'
    END as segment
FROM {{ source('workspace_inputs', 'customers') }}
{% endset %}

{# Run the INSERT, then return a trivial SELECT so dbt can materialize the view #}
{% do run_query(insert_sql) %}
SELECT 1 as dummy

This pattern uses explicit INSERT into DataSurface-managed tables, respecting schema governance while giving you full dbt flexibility.

dbt Connectors for Data Ingestion

Beyond transformations, dbt DataTransformers unlock ingestion from any source with a dbt connector. This means you can bring data into DataSurface from SaaS platforms, APIs, and external systems without writing custom ingestion code.

  • Salesforce
  • HubSpot
  • Fivetran
  • Google Analytics
  • Stripe
  • Shopify

Any dbt source package that can extract data becomes an ingestion pathway into DataSurface. Your dbt model pulls from the external API and writes to a DataSurface-managed Datastore, where it gains full governance, SCD2 history tracking, and CQRS replication.

Extra Credentials for External Connections

dbt connectors need credentials to authenticate with external systems. DataSurface allows DataTransformers to declare extraCredentials that are securely injected at runtime via Kubernetes secrets:

DataTransformer(
    "SalesforceIngestion",
    store=Datastore("SalesforceData", datasets=[...]),
    code=DBTCodeArtifact(
        VersionedRepository(
            GitHubRepository("myorg/salesforce-dbt", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    extraCredentials=[
        Credential("salesforce_api", CredentialType.USER_PASSWORD),
        Credential("salesforce_token", CredentialType.API_KEY)
    ],
    trigger=CronTrigger("Daily", "0 6 * * *")
)

Credentials are injected as environment variables (SALESFORCE_API_USER, SALESFORCE_API_PASSWORD, etc.) that your dbt profiles.yml template can reference, for example through dbt's env_var() function. No secrets are ever written to disk.
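
To make the naming concrete, here is a minimal, hypothetical sketch of how a Python process (for example a Python DataTransformer or a helper script) could read the injected values directly; dbt projects would normally reference the same variables from profiles.yml via env_var() instead. The variable names follow the example above.

import os

# Hypothetical sketch: read the credential environment variables that
# DataSurface injects at runtime (names follow the example above).
salesforce_user = os.environ["SALESFORCE_API_USER"]
salesforce_password = os.environ["SALESFORCE_API_PASSWORD"]

# The API_KEY credential ("salesforce_token") is injected under its own
# variable; check your deployment for the exact name rather than guessing it.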

dbt Packages

Use any dbt package for transformations or ingestion:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.0

Python vs dbt

Capability      | Python                   | dbt
Best for        | Complex logic, ML, APIs  | SQL transformations
Learning curve  | Steeper                  | Gentle for SQL users
Testing         | Custom code tests        | dbt tests & assertions
Lineage         | Implicit in code         | Explicit in dbt DAG
Ecosystem       | Python packages          | dbt packages
Performance     | Depends on Python        | Native SQL

Local IDE Testing

Test DataTransformers locally in your IDE against a local database, without deploying to Kubernetes or Airflow. Get instant feedback with standard pytest.

How It Works

  1. Prepare your transformer with self-describing input/output definitions
  2. Create a test class inheriting from BaseDTLocalTest
  3. Inject test data, run the transformer, verify output

Test Example

from tests.test_datatransformer_local import BaseDTLocalTest

class TestMyTransformer(BaseDTLocalTest):
    def setUp(self):
        super().setUp()
        self.setup_from_transformer_module("myproject.transformers.mask_pii")

    def test_masks_email(self):
        # 1. Inject test data
        self.inject_data("customers", [
            {"id": 1, "name": "Alice", "email": "alice@example.com"}
        ])

        # 2. Run transformer
        self.run_dt_job()

        # 3. Verify output
        output = self.get_output_data()
        self.assertEqual(len(output), 1)
        self.assertIn("***", output[0]["email"])  # Verify masking

Key Features

Database Agnostic

Works with any SQLAlchemy-compatible database. Default is local PostgreSQL.

Full IDE Support

Set breakpoints, step through code, inspect variables. Full debugging in PyCharm, VS Code, etc.

Cycle Testing

Inject data, run, verify, inject more, run again. Test incremental and CDC logic locally; a sketch of this pattern follows below.

Schema Management

Automatically creates input/output tables from your Datastore definitions.
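
As an illustration of cycle testing, the sketch below adds a test method to the TestMyTransformer class shown earlier, reusing the same helpers (inject_data, run_dt_job, get_output_data) to run the transformer twice against accumulating input. Treat it as a sketch: the expected output counts depend on how your transformer handles repeated runs.

    def test_incremental_cycles(self):
        # Cycle 1: seed one customer and run the transformer once.
        self.inject_data("customers", [
            {"id": 1, "name": "Alice", "email": "alice@example.com"}
        ])
        self.run_dt_job()
        self.assertEqual(len(self.get_output_data()), 1)

        # Cycle 2: inject another record and run again, simulating a later batch.
        self.inject_data("customers", [
            {"id": 2, "name": "Bob", "email": "bob@example.com"}
        ])
        self.run_dt_job()

        # Both customers should now appear in the output (assumes the
        # transformer appends or merges rather than truncating between runs).
        self.assertEqual(len(self.get_output_data()), 2)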

Environment Setup

# Default: PostgreSQL on localhost
export TEST_POSTGRES_URL=postgresql://localhost/datasurface_test

# Run tests with pytest
pytest tests/test_my_transformer.py -v

Getting Started

  1. Choose Python or dbt based on your transformation needs
  2. Define your DataTransformer in your ecosystem model
  3. Write local tests using the testing framework
  4. Tag and deploy - DataSurface handles the rest