Change Data Capture

Capture every change from SQL Server and Snowflake without touching source schemas.

Why CDC Matters

Traditional watermark-based ingestion relies on timestamp columns to detect changes. But what happens when a record is deleted? The watermark approach misses it entirely.

Change Data Capture (CDC) solves this by capturing every operation directly from the database transaction log:

Inserts - New records added to the source
Updates - Modified records with before/after values
Deletes - Removed records that watermark ingestion would miss

CDC vs Watermark Ingestion

                    
                        CapabilityWatermarkCDC

                        Captures DeletesNoYes
Requires Source ChangesYes (watermark column)No (just enable CDC)
LatencyDepends on poll intervalNear real-time
State TrackingTimestamp/integerLSN or Stream object

Capability	Watermark	CDC
Captures Deletes	No	Yes
Requires Source Changes	Yes (watermark column)	No (just enable CDC)
Latency	Depends on poll interval	Near real-time
State Tracking	Timestamp/integer	LSN or Stream object

Supported Databases

DataSurface provides native CDC integration for:

SQL Server CDC

Uses native CDC functions and LSN (Log Sequence Number) tracking. Changes captured via sys.fn_cdc_get_all_changes with full operation type mapping.

Snowflake Streams

Uses Snowflake Streams with SHOW_INITIAL_ROWS for automatic initial snapshots. Stream offset managed automatically by Snowflake.

How It Works

Initial Snapshot

When CDC is first enabled, DataSurface captures all existing data before switching to incremental mode. The key insight: capture the position marker before reading any data.

Timeline:

                        T0: Capture LSN/Stream position

                        T1-T100: Read all existing rows (snapshot)

                        Changes during this time have position > T0

                        T101: Snapshot complete, store position

                        T102+: CDC mode - query changes since T0

This approach guarantees no data loss: any changes during the snapshot are captured in the subsequent CDC query, and the merge process handles any duplicates automatically.

Large Table Support

For tables too large for a single batch, DataSurface supports multi-batch snapshots with checkpoint-based resumability:

Bounded batches - Configure maximum rows per batch (default 100M)
Ordered reads - Uses primary key ordering for resumable snapshots
Within-batch checkpoints - Progress saved after each chunk for crash recovery
Cross-batch continuity - Resume point persisted across batches

Schema Evolution Protection

CDC jobs enforce model-first schema evolution with strict validation:

                    
                        ScenarioResultAction Required

                        Source has extra columnsBlockedUpdate model to include new columns
Model expects missing columnsBlockedAdd columns to source first
Exact schema matchProceedsNone

Scenario	Result	Action Required
Source has extra columns	Blocked	Update model to include new columns
Model expects missing columns	Blocked	Add columns to source first
Exact schema match	Proceeds	None

This prevents data corruption by ensuring schema changes are coordinated between producers and consumers.

Batch Traceability

Every committed batch stores its state in the batch_metrics table, providing a built-in audit trail:

When did the snapshot start and complete?
How far did each batch get before committing?
What LSN range did each CDC batch cover?
How many records were inserted, updated, or deleted?

Configuration Example

Enable CDC ingestion with a simple configuration change:

Datastore(
    "CustomerData",
    capture_metadata=SQLCDCIngestion(
        EnvRefDataContainer("customer_db"),
        CronTrigger("Every15Min", "*/15 * * * *"),
        Credential("db_reader", CredentialType.USER_PASSWORD),
        IngestionConsistencyType.MULTI_DATASET
    ),
    datasets=[...]
)

Learn More About CDC Integration