DataTransformers
Transform data with Python or dbt. Test locally in your IDE before deploying.
Transform Data Your Way
DataTransformers are isolated execution environments for custom data transformations. They consume datasets from a Workspace and write results to their own Datastore, with full schema governance and version control.
Choose the right tool for the job:
Python
Complex business logic, machine learning, API integrations, or transformations easier to express in code than SQL (see the sketch below).
dbt
SQL-based transformations with dbt's testing framework, model dependencies, and ecosystem of packages.
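For the Python path, the transformer's core is just code you control. The snippet below is a minimal sketch of the kind of logic a Python DataTransformer might wrap, assuming a simple row-dict interface; the function name and row shapes are illustrative, not part of the DataSurface API, and it mirrors the customer-segmentation rule used in the dbt example later on.
# Hypothetical transformation logic for a Python DataTransformer.
# segment_customers and its row format are illustrative, not DataSurface API.
from datetime import date
from typing import Iterable

def segment_customers(rows: Iterable[dict], as_of: date) -> list[dict]:
    """Assign a segment to each customer based on tenure (mirrors the dbt example)."""
    output = []
    for row in rows:
        tenure_days = (as_of - row["signup_date"]).days
        output.append({
            "customer_id": row["customer_id"],
            "segment": "loyal" if tenure_days > 365 else "new",
        })
    return output

# Example usage:
# segment_customers([{"customer_id": 1, "signup_date": date(2023, 1, 1)}], date.today())
Keeping the logic in a pure function like this also makes it straightforward to unit test outside the platform.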
dbt Integration
DataSurface provides first-class dbt integration. When you define a dbt DataTransformer, DataSurface automatically:
- Clones your dbt repository from GitHub at the specified version tag
- Auto-generates profiles.yml with database connection details
- Auto-generates sources.yml mapping your Workspace inputs to dbt sources
- Handles credentials securely via Kubernetes secrets (never written to files)
- Runs dbt run and ingests results to your Datastore
Configuration Example
DataTransformer(
    "CustomerSegmentation",
    store=Datastore(
        "SegmentedCustomers",
        datasets=[...]
    ),
    code=DBTCodeArtifact(
        VersionedRepository(
            GitHubRepository("myorg/dbt-transforms", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    trigger=CronTrigger("Daily", "0 0 * * *")
)
Writing dbt Models
Reference input sources using the auto-generated workspace_inputs source:
{{ config(materialized='view') }}

{% set target_table = var('output_customers') %}

{% set insert_sql %}
INSERT INTO {{ target_table }} (customer_id, segment)
SELECT
    customer_id,
    CASE
        WHEN tenure_days > 365 THEN 'loyal'
        ELSE 'new'
    END AS segment
FROM {{ source('workspace_inputs', 'customers') }}
{% endset %}

{% do run_query(insert_sql) %}

SELECT 1 AS dummy
This pattern uses explicit INSERT into DataSurface-managed tables, respecting schema governance while giving you full dbt flexibility.
dbt Connectors for Data Ingestion
Beyond transformations, dbt DataTransformers unlock ingestion from any source with a dbt connector. This means you can bring data into DataSurface from SaaS platforms, APIs, and external systems without writing custom ingestion code.
Any dbt source package that can extract data becomes an ingestion pathway into DataSurface. Your dbt model pulls from the external API and writes to a DataSurface-managed Datastore, where it gains full governance, SCD2 history tracking, and CQRS replication.
Extra Credentials for External Connections
dbt connectors need credentials to authenticate with external systems. DataSurface allows DataTransformers to declare extraCredentials that are securely injected at runtime via Kubernetes secrets:
DataTransformer(
    "SalesforceIngestion",
    store=Datastore("SalesforceData", datasets=[...]),
    code=DBTCodeArtifact(
        VersionedRepository(
            GitHubRepository("myorg/salesforce-dbt", "main"),
            EnvRefReleaseSelector("dbt_version")
        ),
        imageKey="dbt_v1_10"
    ),
    runAsCredential=Credential("db_writer", CredentialType.USER_PASSWORD),
    extraCredentials=[
        Credential("salesforce_api", CredentialType.USER_PASSWORD),
        Credential("salesforce_token", CredentialType.API_KEY)
    ],
    trigger=CronTrigger("Daily", "0 6 * * *")
)
Credentials are injected as environment variables (SALESFORCE_API_USER, SALESFORCE_API_PASSWORD, etc.) that your dbt profiles.yml template can reference. No secrets are ever written to disk.
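The extraCredentials declaration lives on the DataTransformer itself, so if the same environment-variable injection applies when the transformer code is Python (an assumption; the paragraph above documents the dbt case), the values can be read directly from the process environment. A minimal sketch, using only the variable names documented above and an illustrative client call:
import os

# Read credentials injected by DataSurface from Kubernetes secrets.
# Variable names follow the documented pattern for
# Credential("salesforce_api", CredentialType.USER_PASSWORD).
sf_user = os.environ["SALESFORCE_API_USER"]
sf_password = os.environ["SALESFORCE_API_PASSWORD"]

# Illustrative only: hand the values to whatever client your code uses
# to reach the external system; nothing is ever written to disk.
# client = SalesforceClient(username=sf_user, password=sf_password)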
dbt Packages
Use any dbt package for transformations or ingestion:
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.8.0
Python vs dbt
| Capability | Python | dbt |
|---|---|---|
| Best for | Complex logic, ML, APIs | SQL transformations |
| Learning curve | Steeper | Gentle for SQL users |
| Testing | Custom code tests | dbt tests & assertions |
| Lineage | Implicit in code | Explicit in dbt DAG |
| Ecosystem | Python packages | dbt packages |
| Performance | Depends on your Python code | Native SQL |
Local IDE Testing
Test DataTransformers locally in your IDE against a local database, without deploying to Kubernetes or Airflow. Get instant feedback with standard pytest.
How It Works
- Prepare your transformer with self-describing input/output definitions
- Create a test class inheriting from BaseDTLocalTest
- Inject test data, run the transformer, verify output
Test Example
from tests.test_datatransformer_local import BaseDTLocalTest

class TestMyTransformer(BaseDTLocalTest):
    def setUp(self):
        super().setUp()
        self.setup_from_transformer_module("myproject.transformers.mask_pii")

    def test_masks_email(self):
        # 1. Inject test data
        self.inject_data("customers", [
            {"id": 1, "name": "Alice", "email": "alice@example.com"}
        ])

        # 2. Run transformer
        self.run_dt_job()

        # 3. Verify output
        output = self.get_output_data()
        self.assertEqual(len(output), 1)
        self.assertIn("***", output[0]["email"])  # Verify masking
Key Features
Database Agnostic
Works with any SQLAlchemy-compatible database. Default is local PostgreSQL.
Full IDE Support
Set breakpoints, step through code, inspect variables. Full debugging in PyCharm, VS Code, etc.
Cycle Testing
Inject data, run, verify, inject more, run again. Test incremental and CDC logic locally (see the sketch below).
Schema Management
Automatically creates input/output tables from your Datastore definitions.
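Cycle testing uses the same hooks shown in the test example above (inject_data, run_dt_job, get_output_data). Below is a minimal sketch of a two-cycle test method added to a class like TestMyTransformer; the table name, rows, and the assumption that the output simply accumulates to two rows are illustrative and depend on your transformer's incremental logic.
def test_incremental_load(self):
    # Cycle 1: seed initial data and run the transformer once.
    self.inject_data("customers", [
        {"id": 1, "name": "Alice", "email": "alice@example.com"}
    ])
    self.run_dt_job()
    self.assertEqual(len(self.get_output_data()), 1)

    # Cycle 2: inject more data and run again to exercise incremental/CDC behavior.
    self.inject_data("customers", [
        {"id": 2, "name": "Bob", "email": "bob@example.com"}
    ])
    self.run_dt_job()
    self.assertEqual(len(self.get_output_data()), 2)  # assumes output accumulates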
Environment Setup
# Default: PostgreSQL on localhost
export TEST_POSTGRES_URL=postgresql://localhost/datasurface_test
# Run tests with pytest
pytest tests/test_my_transformer.py -v
Getting Started
- Choose Python or dbt based on your transformation needs
- Define your DataTransformer in your ecosystem model
- Write local tests using the testing framework
- Tag and deploy - DataSurface handles the rest