Enterprise Observability

OpenTelemetry integration for Datadog, Prometheus, Grafana, and any OTLP-compatible backend.

Why Observability Matters

Data pipelines are only as good as your ability to monitor them. When a job fails at 3 AM, you need answers fast:

  • Which job failed and why?
  • What model version was running?
  • How long did it take before failing?
  • What was the impact on downstream consumers?

DataSurface integrates with OpenTelemetry (OTLP) to export metrics and traces to your existing observability stack, with full governance context in every metric.

Supported Backends

  • Datadog
  • Prometheus
  • Grafana
  • OTEL
  • Any OTLP-compatible backend

How It Works

Enable telemetry with a simple configuration on your Platform Service Provider:

psp = YellowPlatformServiceProvider(
    name="Production_PSP",
    otlpEnabled=True,               # turn on OTLP metric/trace export
    otlpPort=4318,                  # standard OTLP/HTTP port
    otlpProtocol="http/protobuf"    # OTLP over HTTP with protobuf payloads
)

DataSurface automatically:

  • Injects node IP into pods via Kubernetes Downward API
  • Configures the OTLP endpoint for node-local telemetry agents (see the sketch after this list)
  • Uses delta temporality for short-lived job compatibility
  • Flushes all telemetry before job exit
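
Under the hood this follows the standard OpenTelemetry pattern of pointing the exporter at an agent on the same node. The sketch below illustrates the idea only; it is not DataSurface's internal code, and the NODE_IP environment variable name is an assumption made for the example.

# Illustrative sketch only: build an OTLP endpoint from the node IP that
# Kubernetes injects via the Downward API. NODE_IP is an assumed variable
# name; DataSurface performs this wiring automatically.
import os

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

node_ip = os.environ.get("NODE_IP", "localhost")    # injected via the Downward API
endpoint = f"http://{node_ip}:4318/v1/metrics"      # OTLP/HTTP metrics path

exporter = OTLPMetricExporter(endpoint=endpoint)
provider = MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])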

Governance Context in Every Metric

DataSurface doesn't just export generic metrics. Every trace and metric includes governance context for filtering and correlation (a sketch of how these attributes can be attached follows the tables below):

Deployment Attributes

Attribute         Description             Example
rte_name          Runtime environment     prod, uat, dev
ecosystem_name    Ecosystem/model name    YellowStarter
model_repo        Model git repository    myorg/model@main
model_version     Model git tag           v2.0.7-prod

Job Context

Attribute   Description       Example
job_type    Type of job       ingestion, transformer, cqrs
datastore   Datastore name    CustomerStore
dataset     Dataset name      Customers
workspace   Workspace name    Analytics
status      Job outcome       success, failed

DataTransformer Attributes

Attribute                Description              Example
code_artifact_version    Code artifact git tag    v1.2.3
code_artifact_repo       Code repository          myorg/dbt@main
governance_zone          Governance zone          Finance
team                     Team name                DataEngineering
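
To make the attribute model concrete, the sketch below shows how this context can be attached with the OpenTelemetry Python SDK. It is illustrative only, not DataSurface's implementation; the split between resource attributes and per-measurement attributes, and the sample values, follow the tables above.

# Illustrative sketch: governance context as OpenTelemetry attributes.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Deployment attributes carried on the telemetry resource
resource = Resource.create({
    "rte_name": "prod",
    "ecosystem_name": "YellowStarter",
    "model_repo": "myorg/model@main",
    "model_version": "v2.0.7-prod",
})

meter = MeterProvider(resource=resource).get_meter("datasurface")

# Job context attached to each measurement
records_inserted = meter.create_counter("datasurface.merge.records_inserted")
records_inserted.add(1250, attributes={    # 1250 is an example value
    "job_type": "ingestion",
    "datastore": "CustomerStore",
    "dataset": "Customers",
    "status": "success",
})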

Available Metrics

Job Duration

datasurface.job.duration_seconds (histogram) - Measures execution time for all job types:

  • Ingestion jobs (staging, merge phases)
  • DataTransformer execution
  • CQRS sync jobs
  • View reconciliation
  • Git cloning, model loading, and validation
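
For reference, recording a duration histogram with the OpenTelemetry SDK looks roughly like the sketch below. DataSurface emits this metric itself; the job function here is a stand-in.

# Illustrative sketch: timing a job phase into a duration histogram.
import time

from opentelemetry import metrics

def run_ingestion_job() -> None:
    time.sleep(0.1)    # stand-in for the real ingestion work

meter = metrics.get_meter("datasurface")
duration = meter.create_histogram("datasurface.job.duration_seconds", unit="s")

start = time.monotonic()
run_ingestion_job()
duration.record(time.monotonic() - start,
                attributes={"job_type": "ingestion", "status": "success"})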

Merge Counters

Metric                                Description
datasurface.merge.records_inserted    Records inserted during merge
datasurface.merge.records_updated     Records updated during merge
datasurface.merge.records_deleted     Records deleted during merge

Pipeline Counters

Metric                                    Description
datasurface.pipeline.records_processed    Records processed by transformers
datasurface.cqrs.streams_synced           CQRS streams synchronized
datasurface.reconcile.views_count         Views reconciled per container

Example Queries

Datadog

# All ingestion jobs by platform
job_type:ingestion | group by platform

# Failed jobs in production
status:failed rte_name:prod

# Job duration by model version
avg:datasurface.job.duration_seconds{job_type:ingestion} by {model_version}

Prometheus

# Average ingestion duration by platform
avg by (platform) (datasurface_job_duration_seconds{job_type="ingestion"})

# Records inserted per minute
sum(rate(datasurface_merge_records_inserted_total[1m])) by (datastore)

Technical Details

Delta Temporality

Required by Datadog and optimal for short-lived Kubernetes jobs. Each job reports only its own metrics, not cumulative totals.
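
With the OpenTelemetry Python exporter, delta temporality is selected roughly as in the sketch below; DataSurface configures this for you, so the snippet is for reference only.

# Illustrative sketch: requesting delta temporality from the OTLP exporter.
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import Counter, Histogram
from opentelemetry.sdk.metrics.export import AggregationTemporality

exporter = OTLPMetricExporter(
    preferred_temporality={
        Counter: AggregationTemporality.DELTA,
        Histogram: AggregationTemporality.DELTA,
    }
)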

Automatic Flush

Telemetry is flushed before job exit, ensuring no data loss even for jobs that complete in seconds.
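
In OpenTelemetry terms the final flush is roughly the pattern below; DataSurface performs it automatically before the process exits.

# Illustrative sketch: flushing buffered telemetry before a short-lived job exits.
from opentelemetry.sdk.metrics import MeterProvider

provider = MeterProvider()                     # the configured provider in a real job
# ... job work and metric recording happen here ...
provider.force_flush(timeout_millis=10_000)    # push any buffered metrics now
provider.shutdown()                            # shut exporters down cleanly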

Node-Local Agents

Connects to node-local telemetry agents (like Datadog Agent DaemonSet), avoiding cross-node network issues.
