Change Data Capture
Capture every change from SQL Server and Snowflake without touching source schemas.
Why CDC Matters
Traditional watermark-based ingestion relies on timestamp columns to detect changes. But what happens when a record is deleted? The watermark approach misses it entirely.
Change Data Capture (CDC) solves this by capturing every operation directly from the database transaction log:
- Inserts - New records added to the source
- Updates - Modified records with before/after values
- Deletes - Removed records that watermark ingestion would miss
CDC vs Watermark Ingestion
| Capability | Watermark | CDC |
|---|---|---|
| Captures Deletes | No | Yes |
| Requires Source Changes | Yes (watermark column) | No (just enable CDC) |
| Latency | Depends on poll interval | Near real-time |
| State Tracking | Timestamp/integer | LSN or Stream object |
Supported Databases
DataSurface provides native CDC integration for:
SQL Server CDC
Uses native CDC functions and LSN (Log Sequence Number) tracking. Changes captured via sys.fn_cdc_get_all_changes with full operation type mapping.
Snowflake Streams
Uses Snowflake Streams with SHOW_INITIAL_ROWS for automatic initial snapshots. Stream offset managed automatically by Snowflake.
How It Works
Initial Snapshot
When CDC is first enabled, DataSurface captures all existing data before switching to incremental mode. The key insight: capture the position marker before reading any data.
T1-T100: Read all existing rows (snapshot)
Changes during this time have position > T0
T101: Snapshot complete, store position
T102+: CDC mode - query changes since T0
This approach guarantees no data loss: any changes during the snapshot are captured in the subsequent CDC query, and the merge process handles any duplicates automatically.
Large Table Support
For tables too large for a single batch, DataSurface supports multi-batch snapshots with checkpoint-based resumability:
- Bounded batches - Configure maximum rows per batch (default 100M)
- Ordered reads - Uses primary key ordering for resumable snapshots
- Within-batch checkpoints - Progress saved after each chunk for crash recovery
- Cross-batch continuity - Resume point persisted across batches
Schema Evolution Protection
CDC jobs enforce model-first schema evolution with strict validation:
| Scenario | Result | Action Required |
|---|---|---|
| Source has extra columns | Blocked | Update model to include new columns |
| Model expects missing columns | Blocked | Add columns to source first |
| Exact schema match | Proceeds | None |
This prevents data corruption by ensuring schema changes are coordinated between producers and consumers.
Batch Traceability
Every committed batch stores its state in the batch_metrics table, providing a built-in audit trail:
- When did the snapshot start and complete?
- How far did each batch get before committing?
- What LSN range did each CDC batch cover?
- How many records were inserted, updated, or deleted?
Configuration Example
Enable CDC ingestion with a simple configuration change:
Datastore(
"CustomerData",
capture_metadata=SQLCDCIngestion(
EnvRefDataContainer("customer_db"),
CronTrigger("Every15Min", "*/15 * * * *"),
Credential("db_reader", CredentialType.USER_PASSWORD),
IngestionConsistencyType.MULTI_DATASET
),
datasets=[...]
)