# Enterprise Observability
OpenTelemetry integration for Datadog, Prometheus, Grafana, and any OTLP-compatible backend.
## Why Observability Matters
Data pipelines are only as good as your ability to monitor them. When a job fails at 3 AM, you need answers fast:
- Which job failed and why?
- What model version was running?
- How long did it take before failing?
- What was the impact on downstream consumers?
DataSurface integrates with OpenTelemetry (OTLP) to export metrics and traces to your existing observability stack, with full governance context in every metric.
## Supported Backends

Any OTLP-compatible backend works, including Datadog, Prometheus, and Grafana.
## How It Works

Enable telemetry with a simple configuration on your Platform Service Provider:

```python
psp = YellowPlatformServiceProvider(
    name="Production_PSP",
    otlpEnabled=True,
    otlpPort=4318,
    otlpProtocol="http/protobuf"
)
```
DataSurface automatically:
- Injects node IP into pods via Kubernetes Downward API
- Configures OTLP endpoint for node-local telemetry agents
- Uses delta temporality for short-lived job compatibility
- Flushes all telemetry before job exit
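To make the node-local wiring concrete, here is a minimal sketch of how an OTLP endpoint can be derived from a node IP injected via the Downward API. The `NODE_IP` variable name and the helper function are illustrative assumptions, not DataSurface's actual implementation:

```python
import os

# Hypothetical helper: the Kubernetes Downward API can expose the node's
# IP as an env var (e.g. status.hostIP -> NODE_IP); the job then targets
# the agent on its own node rather than a cluster-wide service.
def node_local_otlp_endpoint(port: int = 4318) -> str:
    node_ip = os.environ.get("NODE_IP", "127.0.0.1")  # fallback for local runs
    return f"http://{node_ip}:{port}/v1/metrics"
```

Because the endpoint resolves to the node the pod is scheduled on, telemetry never has to cross nodes to reach an agent.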
## Governance Context in Every Metric
DataSurface doesn't just export generic metrics. Every trace and metric includes governance context for filtering and correlation:
### Deployment Attributes

| Attribute | Description | Example |
|---|---|---|
| `rte_name` | Runtime environment | `prod`, `uat`, `dev` |
| `ecosystem_name` | Ecosystem/model name | `YellowStarter` |
| `model_repo` | Model git repository | `myorg/model@main` |
| `model_version` | Model git tag | `v2.0.7-prod` |
### Job Context

| Attribute | Description | Example |
|---|---|---|
| `job_type` | Type of job | `ingestion`, `transformer`, `cqrs` |
| `datastore` | Datastore name | `CustomerStore` |
| `dataset` | Dataset name | `Customers` |
| `workspace` | Workspace name | `Analytics` |
| `status` | Job outcome | `success`, `failed` |
### DataTransformer Attributes

| Attribute | Description | Example |
|---|---|---|
| `code_artifact_version` | Code artifact git tag | `v1.2.3` |
| `code_artifact_repo` | Code repository | `myorg/dbt@main` |
| `governance_zone` | Governance zone | `Finance` |
| `team` | Team name | `DataEngineering` |
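The attribute groups above ultimately land together on each metric. A minimal sketch of that merge, assuming a hypothetical helper (the function name and call sites are illustrative, not DataSurface's API):

```python
# Illustrative only: attribute names mirror the tables above.
def governance_attributes(deployment, job, extra=None):
    # Later groups win on key collisions, so job-level context can
    # override deployment-level defaults if both set the same key.
    attrs = {**deployment, **job}
    if extra:
        attrs.update(extra)
    return attrs

attrs = governance_attributes(
    deployment={"rte_name": "prod", "ecosystem_name": "YellowStarter",
                "model_repo": "myorg/model@main", "model_version": "v2.0.7-prod"},
    job={"job_type": "ingestion", "datastore": "CustomerStore",
         "dataset": "Customers", "status": "success"},
)
```

A backend can then filter or group on any combination, e.g. all failed ingestion jobs for one model version.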
## Available Metrics

### Job Duration

`datasurface.job.duration_seconds` (histogram) measures execution time for all job types:
- Ingestion jobs (staging, merge phases)
- DataTransformer execution
- CQRS sync jobs
- View reconciliation
- Git cloning, model loading, and validation
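The duration-histogram pattern can be sketched as a timing context manager. The `record` function here stands in for an OpenTelemetry histogram's `record()` call; the manager and its names are illustrative assumptions:

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in for an exported histogram series

def record(metric, value, attributes):
    recorded.append((metric, value, attributes))

@contextmanager
def job_timer(job_type, **attributes):
    # Time the wrapped phase and record the duration with its job
    # context; failures are recorded too, tagged status="failed".
    start = time.monotonic()
    status = "success"
    try:
        yield
    except Exception:
        status = "failed"
        raise
    finally:
        record("datasurface.job.duration_seconds",
               time.monotonic() - start,
               {"job_type": job_type, "status": status, **attributes})

with job_timer("ingestion", datastore="CustomerStore"):
    pass  # the staging/merge work would run here
```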
### Merge Counters

| Metric | Description |
|---|---|
| `datasurface.merge.records_inserted` | Records inserted during merge |
| `datasurface.merge.records_updated` | Records updated during merge |
| `datasurface.merge.records_deleted` | Records deleted during merge |
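A toy merge shows how these three counters relate. The keyed-dict merge below is a simplification (real merges typically work on staged tables and may soft-delete); the logic and counter keys are illustrative:

```python
from collections import Counter

def merge(existing: dict, incoming: dict) -> Counter:
    # Merge incoming rows into existing rows, counting each outcome
    # under the corresponding metric name from the table above.
    counts = Counter()
    for key, row in incoming.items():
        if key not in existing:
            existing[key] = row
            counts["datasurface.merge.records_inserted"] += 1
        elif existing[key] != row:
            existing[key] = row
            counts["datasurface.merge.records_updated"] += 1
    for key in [k for k in existing if k not in incoming]:
        del existing[key]
        counts["datasurface.merge.records_deleted"] += 1
    return counts
```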
### Pipeline Counters

| Metric | Description |
|---|---|
| `datasurface.pipeline.records_processed` | Records processed by transformers |
| `datasurface.cqrs.streams_synced` | CQRS streams synchronized |
| `datasurface.reconcile.views_count` | Views reconciled per container |
## Example Queries

### Datadog

```
# All ingestion jobs by platform
job_type:ingestion | group by platform

# Failed jobs in production
status:failed rte_name:prod

# Job duration by model version
avg:datasurface.job.duration_seconds{job_type:ingestion} by {model_version}
```
### Prometheus

```promql
# Average ingestion duration by platform
avg by (platform) (datasurface_job_duration_seconds{job_type="ingestion"})

# Records inserted per minute
sum(rate(datasurface_merge_records_inserted_total[1m])) by (datastore)
```
## Technical Details

### Delta Temporality

Delta temporality is required by Datadog and well suited to short-lived Kubernetes jobs: each job reports only its own metrics, not cumulative totals.
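The distinction can be illustrated with a minimal counter (a sketch, not the OpenTelemetry SDK): a delta counter reports only the change since the last export, then resets, so a backend sums deltas instead of diffing ever-growing cumulative totals:

```python
class DeltaCounter:
    """Toy counter illustrating delta temporality."""

    def __init__(self):
        self._value = 0

    def add(self, n: int) -> None:
        self._value += n

    def collect(self) -> int:
        # Report only the increment since the previous collect(),
        # then reset -- each short-lived job exports its own delta.
        delta = self._value
        self._value = 0
        return delta
```

A job that inserts 5 records exports `5`; the next job exports only its own increments, and the backend adds them up.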
### Automatic Flush

Telemetry is flushed before job exit, ensuring no data loss even for jobs that complete in seconds.
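The flush-before-exit pattern can be sketched with a buffering exporter whose `flush()` is registered as an exit hook; the class and its method names are illustrative, not DataSurface's exporter:

```python
import atexit

class BufferingExporter:
    """Toy exporter: buffers points and drains the buffer on flush."""

    def __init__(self):
        self.buffer = []
        self.exported = []
        # Ensure even a job that exits seconds after starting drains
        # its buffer before the process dies.
        atexit.register(self.flush)

    def record(self, point):
        self.buffer.append(point)

    def flush(self):
        self.exported.extend(self.buffer)
        self.buffer.clear()
```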
### Node-Local Agents

DataSurface connects to node-local telemetry agents (such as the Datadog Agent DaemonSet), avoiding cross-node network issues.