Traces

Distributed tracing is the third L1 backend after Prom and Loki. Like them, it follows ADR-014: edges push directly to the backend, and the manager only queries it.

The data plane

text
your apps (OTLP gRPC/HTTP)
        │
        ▼
edge:
  otelcol (subprocess plugin)
    └─ remote OTLP → tempo:4317
                          │
                          ├─ trace storage (S3-compatible blob)
                          │
                          └─ metrics_generator → traces_spanmetrics_*
                                                 │
                                                 ▼
                                                 prometheus:9090

otelcol runs as a plugin under the edge agent (same supervision model as promtail; see Logs). The collector accepts OTLP over both gRPC (:4317) and HTTP (:4318) on the host, batches the spans, and forwards them to the cloud's Tempo over HTTPS.

Apps don't need to know about Ongrid

Configure your application's OTel exporter to point at http://localhost:4318 (or the gRPC equivalent). The local otelcol handles auth + batching + the trip to cloud. Removing Ongrid means removing one env var — your code is untouched.
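As a rough illustration of the single setting involved, here is a minimal Go sketch that resolves the traces URL an OTLP/HTTP exporter would post to. It assumes the env var in question is the standard OpenTelemetry `OTEL_EXPORTER_OTLP_ENDPOINT` (the page does not name it), and `tracesURL` is a hypothetical helper, not part of Ongrid or the OTel SDK.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// defaultEndpoint is the local otelcol HTTP receiver described above.
const defaultEndpoint = "http://localhost:4318"

// tracesURL (hypothetical helper) returns the URL an OTLP/HTTP exporter
// posts spans to. The OTLP exporter spec appends the per-signal path
// /v1/traces to the base endpoint.
func tracesURL(base string) string {
	if base == "" {
		base = defaultEndpoint
	}
	return strings.TrimRight(base, "/") + "/v1/traces"
}

func main() {
	// Assumption: the application is configured through the standard
	// OTEL_EXPORTER_OTLP_ENDPOINT variable.
	fmt.Println(tracesURL(os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")))
}
```

With the variable unset, the sketch falls back to the local collector address, which is also the default the OpenTelemetry spec defines for OTLP/HTTP.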

The spanmetrics generator

The Tempo deployment in the compose file has metrics_generator enabled. Every batch of spans produces up to three derived series, which are written back into the same Prom:

| Series | Source |
| --- | --- |
| traces_spanmetrics_calls_total | one counter per (service, operation, status_code) |
| traces_spanmetrics_latency_bucket | histogram of span duration |
| traces_spanmetrics_size_bucket | optional, off by default |

This is the load-bearing trick that makes trace alerts feel like metric alerts: the alert evaluator queries Prom, the data lives in Tempo.

Alert kinds

trace_latency

json
{
  "kind": "trace_latency",
  "scope_type": "global",
  "conditions_json": {
    "service": "payments-api",
    "operation": "POST /v1/charge",
    "quantile": "p95",
    "window": "5m",
    "threshold_ms": 800
  }
}

Compiles to:

promql
histogram_quantile(
  0.95,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{
    service_name="payments-api", span_name="POST /v1/charge"
  }[5m]))
) * 1000 > 800

operation is optional — drop it for service-wide latency. See compileTraceLatencyRule.

Supported quantiles: p50 / p95 (default) / p99. The string form exists so the UI form picker stays terse; the compiler maps to the float histogram_quantile wants.
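The mapping from conditions to PromQL can be sketched in Go as follows. This is not the project's compileTraceLatencyRule, whose source is not shown here; the struct, field names, and function name are hypothetical, and only the quantile mapping, the p95 default, the optional operation, and the query shape come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// TraceLatencyConditions mirrors the conditions_json fields shown above.
// The Go type itself is an assumption made for this sketch.
type TraceLatencyConditions struct {
	Service     string
	Operation   string // optional; empty means service-wide latency
	Quantile    string // "p50", "p95" or "p99"; empty defaults to p95
	Window      string
	ThresholdMs float64
}

// quantiles maps the terse UI form to the float histogram_quantile expects.
var quantiles = map[string]string{"p50": "0.5", "p95": "0.95", "p99": "0.99"}

// compileTraceLatency (hypothetical) builds the alert expression.
func compileTraceLatency(c TraceLatencyConditions) (string, error) {
	q := c.Quantile
	if q == "" {
		q = "p95"
	}
	f, ok := quantiles[q]
	if !ok {
		return "", fmt.Errorf("unsupported quantile %q", c.Quantile)
	}
	// %q yields a double-quoted, escaped label value.
	matchers := []string{fmt.Sprintf("service_name=%q", c.Service)}
	if c.Operation != "" {
		matchers = append(matchers, fmt.Sprintf("span_name=%q", c.Operation))
	}
	// The histogram is in seconds, so multiply by 1000 to compare in ms.
	return fmt.Sprintf(
		"histogram_quantile(%s, sum by (le) (rate(traces_spanmetrics_latency_bucket{%s}[%s]))) * 1000 > %g",
		f, strings.Join(matchers, ", "), c.Window, c.ThresholdMs), nil
}

func main() {
	expr, err := compileTraceLatency(TraceLatencyConditions{
		Service: "payments-api", Operation: "POST /v1/charge",
		Window: "5m", ThresholdMs: 800,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expr)
}
```

Keeping the quantile as a lookup table rather than parsing the digits means an unsupported value such as p42 is rejected instead of silently compiled.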

trace_error_rate

json
{
  "kind": "trace_error_rate",
  "scope_type": "global",
  "conditions_json": {
    "service": "payments-api",
    "window": "5m",
    "operator": ">",
    "threshold_pct": 1.0
  }
}

Compiles to:

promql
100 * (
  sum by (service_name) (rate(traces_spanmetrics_calls_total{
    service_name="payments-api", status_code="STATUS_CODE_ERROR"
  }[5m]))
  / sum by (service_name) (rate(traces_spanmetrics_calls_total{
    service_name="payments-api"
  }[5m]))
) > 1.0

The STATUS_CODE_ERROR literal is what the spanmetrics generator emits; alternative span status conventions need their own kind.
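A comparable Go sketch for the error-rate kind is shown below. The type, the function name, and the set of accepted operators are assumptions for illustration; only the query shape and the STATUS_CODE_ERROR literal come from the text above.

```go
package main

import "fmt"

// TraceErrorRateConditions mirrors the conditions_json fields shown above.
// The Go type itself is an assumption made for this sketch.
type TraceErrorRateConditions struct {
	Service      string
	Window       string
	Operator     string
	ThresholdPct float64
}

// compileTraceErrorRate (hypothetical) builds the percentage-of-errors
// expression: errored calls divided by all calls, times 100.
func compileTraceErrorRate(c TraceErrorRateConditions) (string, error) {
	// Assumption: only plain comparison operators are accepted.
	switch c.Operator {
	case ">", ">=", "<", "<=":
	default:
		return "", fmt.Errorf("unsupported operator %q", c.Operator)
	}
	svc := fmt.Sprintf("service_name=%q", c.Service)
	return fmt.Sprintf(
		`100 * (sum by (service_name) (rate(traces_spanmetrics_calls_total{%s, status_code="STATUS_CODE_ERROR"}[%s])) / sum by (service_name) (rate(traces_spanmetrics_calls_total{%s}[%s]))) %s %g`,
		svc, c.Window, svc, c.Window, c.Operator, c.ThresholdPct), nil
}

func main() {
	expr, err := compileTraceErrorRate(TraceErrorRateConditions{
		Service: "payments-api", Window: "5m", Operator: ">", ThresholdPct: 1.0,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expr)
}
```

Note that `%g` prints the threshold 1.0 as `1`, which is numerically identical in PromQL.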

Tools

query_traceql

Direct passthrough to Tempo's TraceQL endpoint (query_traceql_basetool.go). Used by the investigator persona when the operator asks for specific trace IDs — e.g. "find a slow trace for payments-api in the last 30 minutes."

text
{ resource.service.name = "payments-api" && duration > 800ms }

The underlying client is internal/pkg/tracequery. The package name is deliberately backend-agnostic, so Jaeger or Zipkin clients can drop in later without renaming the surface.
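For orientation, the request behind such a passthrough might be built roughly as below. This is a sketch, not the tracequery client: it assumes Tempo's HTTP search API (`/api/search` with `q`, `start`, and `end` in Unix seconds), and the `searchURL` helper and the `http://tempo:3200` base address are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// searchURL (hypothetical helper) builds a Tempo search request for a
// TraceQL query over the window [startUnix, endUnix], in Unix seconds.
func searchURL(base, traceQL string, startUnix, endUnix int64) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = "/api/search"
	q := url.Values{}
	q.Set("q", traceQL) // the TraceQL expression, URL-encoded by Encode
	q.Set("start", strconv.FormatInt(startUnix, 10))
	q.Set("end", strconv.FormatInt(endUnix, 10))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// "A slow trace for payments-api in the last 30 minutes."
	end := time.Now()
	start := end.Add(-30 * time.Minute)
	u, err := searchURL(
		"http://tempo:3200", // assumed address of the Tempo query endpoint
		`{ resource.service.name = "payments-api" && duration > 800ms }`,
		start.Unix(), end.Unix(),
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```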

correlate_incident

The composite tool the investigator persona starts with. It pulls Prom, Loki, and Tempo signals around the incident's fire window in a single fan-out call. This reduces 4-5 sequential tool calls to one, which matters when the per-investigation budget is 10 tool calls.

See correlate_incident_basetool.go.
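The fan-out pattern described above can be sketched with goroutines and a WaitGroup. Everything named here (`Window`, `Fetcher`, `Signal`, `correlate`) is hypothetical and the fetchers are stubs; the actual tool in correlate_incident_basetool.go may be structured quite differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Window is the incident's fire window in Unix seconds (hypothetical type).
type Window struct{ Start, End int64 }

// Fetcher queries one backend for the window (hypothetical signature).
type Fetcher func(w Window) (string, error)

// Signal is the outcome for one backend.
type Signal struct {
	Summary string
	Err     error
}

// correlate runs every fetcher concurrently and collects the results,
// so one tool call covers all backends. A failing backend is recorded
// in its Signal rather than aborting the others.
func correlate(w Window, backends map[string]Fetcher) map[string]Signal {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	out := make(map[string]Signal, len(backends))
	for name, fetch := range backends {
		wg.Add(1)
		go func(name string, fetch Fetcher) {
			defer wg.Done()
			summary, err := fetch(w)
			mu.Lock() // the map is shared across goroutines
			out[name] = Signal{Summary: summary, Err: err}
			mu.Unlock()
		}(name, fetch)
	}
	wg.Wait()
	return out
}

func main() {
	// Stub fetchers stand in for the real Prom, Loki, and Tempo clients.
	stub := func(Window) (string, error) { return "stub result", nil }
	results := correlate(Window{Start: 100, End: 400}, map[string]Fetcher{
		"prom": stub, "loki": stub, "tempo": stub,
	})
	for _, name := range []string{"prom", "loki", "tempo"} {
		fmt.Printf("%s: %s (err=%v)\n", name, results[name].Summary, results[name].Err)
	}
}
```

Recording errors per backend keeps partial results usable, which suits a budgeted investigation where a retry would cost another tool call.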

See also