Change Data Capture

Capture every change from SQL Server and Snowflake without touching source schemas.

Why CDC Matters

Traditional watermark-based ingestion relies on timestamp columns to detect changes. But what happens when a record is deleted? The watermark approach misses it entirely.

Change Data Capture (CDC) solves this by capturing every operation directly from the database transaction log:

  • Inserts - New records added to the source
  • Updates - Modified records with before/after values
  • Deletes - Removed records that watermark ingestion would miss

CDC vs Watermark Ingestion

CapabilityWatermarkCDC
Captures DeletesNoYes
Requires Source ChangesYes (watermark column)No (just enable CDC)
LatencyDepends on poll intervalNear real-time
State TrackingTimestamp/integerLSN or Stream object

Supported Databases

DataSurface provides native CDC integration for:

SQL Server CDC

Uses native CDC functions and LSN (Log Sequence Number) tracking. Changes captured via sys.fn_cdc_get_all_changes with full operation type mapping.

Snowflake Streams

Uses Snowflake Streams with SHOW_INITIAL_ROWS for automatic initial snapshots. Stream offset managed automatically by Snowflake.

How It Works

Initial Snapshot

When CDC is first enabled, DataSurface captures all existing data before switching to incremental mode. The key insight: capture the position marker before reading any data.

Timeline:
T0: Capture LSN/Stream position
T1-T100: Read all existing rows (snapshot)
Changes during this time have position > T0
T101: Snapshot complete, store position
T102+: CDC mode - query changes since T0

This approach guarantees no data loss: any changes during the snapshot are captured in the subsequent CDC query, and the merge process handles any duplicates automatically.

Large Table Support

For tables too large for a single batch, DataSurface supports multi-batch snapshots with checkpoint-based resumability:

  • Bounded batches - Configure maximum rows per batch (default 100M)
  • Ordered reads - Uses primary key ordering for resumable snapshots
  • Within-batch checkpoints - Progress saved after each chunk for crash recovery
  • Cross-batch continuity - Resume point persisted across batches

Schema Evolution Protection

CDC jobs enforce model-first schema evolution with strict validation:

ScenarioResultAction Required
Source has extra columnsBlockedUpdate model to include new columns
Model expects missing columnsBlockedAdd columns to source first
Exact schema matchProceedsNone

This prevents data corruption by ensuring schema changes are coordinated between producers and consumers.

Batch Traceability

Every committed batch stores its state in the batch_metrics table, providing a built-in audit trail:

  • When did the snapshot start and complete?
  • How far did each batch get before committing?
  • What LSN range did each CDC batch cover?
  • How many records were inserted, updated, or deleted?

Configuration Example

Enable CDC ingestion with a simple configuration change:

Datastore(
    "CustomerData",
    capture_metadata=SQLCDCIngestion(
        EnvRefDataContainer("customer_db"),
        CronTrigger("Every15Min", "*/15 * * * *"),
        Credential("db_reader", CredentialType.USER_PASSWORD),
        IngestionConsistencyType.MULTI_DATASET
    ),
    datasets=[...]
)
Learn More About CDC Integration