Traces
Distributed tracing is the third L1 backend after Prom and Loki. Like them, it follows ADR-014: edges push direct, manager only queries.
The data plane
your apps (OTLP gRPC/HTTP)
        │
        ▼
edge:
  otelcol (subprocess plugin)
    └─ remote OTLP → tempo:4317
                     │
                     ├─ trace storage (S3-compatible blob)
                     │
                     └─ metrics_generator → traces_spanmetrics_*
                                                  │
                                                  ▼
                                           prometheus:9090

otelcol runs as a plugin under the edge agent (same supervision model as promtail; see Logs). The collector accepts both gRPC (:4317) and HTTP (:4318) on the host, batches, and forwards to the cloud's Tempo over HTTPS.
Apps don't need to know about Ongrid
Configure your application's OTel exporter to point at http://localhost:4318 (or the gRPC equivalent). The local otelcol handles auth + batching + the trip to cloud. Removing Ongrid means removing one env var — your code is untouched.
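As a minimal sketch of that setup, the snippet below builds the environment an app process could be launched with. It assumes the app's OpenTelemetry SDK honours the standard OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL variables; the helper name otlp_env is hypothetical and not part of Ongrid.

```python
import os

# Ports taken from the text above: the local otelcol listens on
# :4317 (gRPC) and :4318 (HTTP).
LOCAL_ENDPOINTS = {
    "http/protobuf": "http://localhost:4318",
    "grpc": "http://localhost:4317",
}


def otlp_env(protocol="http/protobuf", base_env=None):
    """Return a copy of base_env that points the OTel exporter at the local otelcol.

    Hypothetical helper for illustration only.
    """
    if protocol not in LOCAL_ENDPOINTS:
        raise ValueError(f"unsupported OTLP protocol: {protocol}")
    env = dict(os.environ if base_env is None else base_env)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = LOCAL_ENDPOINTS[protocol]
    env["OTEL_EXPORTER_OTLP_PROTOCOL"] = protocol
    return env
```

Whether the protocol variable is needed at all depends on the SDK's default; in many setups the endpoint variable alone is enough.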
The spanmetrics generator
The Tempo deployment in the compose file has metrics_generator enabled. Every batch of spans produces up to three derived series that are written back into the same Prom:
| Series | Source |
|---|---|
| traces_spanmetrics_calls_total | one counter per (service, operation, status_code) |
| traces_spanmetrics_latency_bucket | histogram of span duration |
| traces_spanmetrics_size_bucket | optional, off by default |
This is the load-bearing trick that makes trace alerts feel like metric alerts: the alert evaluator queries Prom, the data lives in Tempo.
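To make the derivation concrete, here is a rough sketch of how a batch of spans could fold into a calls counter and cumulative latency buckets. It is an illustration of the idea, not Tempo's implementation; the bucket bounds, the span dictionary shape, and the function name aggregate are all assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in seconds; real deployments configure their own.
BUCKETS = [0.1, 0.5, 1.0, float("inf")]


def aggregate(spans):
    """Fold spans into a calls counter and cumulative latency buckets."""
    calls = Counter()
    latency = Counter()
    for span in spans:
        key = (span["service"], span["operation"])
        calls[key + (span["status_code"],)] += 1
        # Prometheus histogram buckets are cumulative: a span is counted in
        # every bucket whose upper bound (le) is >= its duration.
        first = bisect_left(BUCKETS, span["duration_s"])
        for le in BUCKETS[first:]:
            latency[key + (le,)] += 1
    return calls, latency


spans = [
    {"service": "payments-api", "operation": "POST /v1/charge",
     "status_code": "STATUS_CODE_OK", "duration_s": 0.3},
    {"service": "payments-api", "operation": "POST /v1/charge",
     "status_code": "STATUS_CODE_ERROR", "duration_s": 0.9},
]
calls, latency = aggregate(spans)
```

Once the spans are reduced to counters like these, alerting needs nothing trace-specific: rates and quantiles over the series are ordinary Prom queries.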
Alert kinds
trace_latency
{
  "kind": "trace_latency",
  "scope_type": "global",
  "conditions_json": {
    "service": "payments-api",
    "operation": "POST /v1/charge",
    "quantile": "p95",
    "window": "5m",
    "threshold_ms": 800
  }
}

Compiles to:
histogram_quantile(
  0.95,
  sum by (le) (rate(traces_spanmetrics_latency_bucket{
    service_name="payments-api", span_name="POST /v1/charge"
  }[5m]))
) * 1000 > 800

operation is optional; drop it for service-wide latency. See compileTraceLatencyRule.
Supported quantiles: p50 / p95 (default) / p99. The string form exists so the UI form picker stays terse; the compiler maps to the float histogram_quantile wants.
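The mapping from conditions_json to PromQL can be sketched as follows. This is an illustrative Python approximation, not the source of compileTraceLatencyRule; the function name compile_trace_latency is hypothetical, and label-value escaping is deliberately left out.

```python
# String quantiles from the UI mapped to the float histogram_quantile expects.
QUANTILES = {"p50": 0.5, "p95": 0.95, "p99": 0.99}


def compile_trace_latency(cond):
    """Build a PromQL expression from a trace_latency conditions_json dict.

    Sketch only: label values are not escaped here.
    """
    q = QUANTILES[cond.get("quantile", "p95")]  # p95 is the documented default
    matchers = [f'service_name="{cond["service"]}"']
    if cond.get("operation"):  # optional: omitted for service-wide latency
        matchers.append(f'span_name="{cond["operation"]}"')
    selector = "traces_spanmetrics_latency_bucket{" + ", ".join(matchers) + "}"
    return (
        f"histogram_quantile({q}, "
        f"sum by (le) (rate({selector}[{cond['window']}]))"
        f") * 1000 > {cond['threshold_ms']}"
    )


expr = compile_trace_latency({
    "service": "payments-api",
    "operation": "POST /v1/charge",
    "quantile": "p95",
    "window": "5m",
    "threshold_ms": 800,
})
```

The multiplication by 1000 converts the histogram's seconds into the milliseconds that threshold_ms is expressed in.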
trace_error_rate
{
  "kind": "trace_error_rate",
  "scope_type": "global",
  "conditions_json": {
    "service": "payments-api",
    "window": "5m",
    "operator": ">",
    "threshold_pct": 1.0
  }
}

Compiles to:
100 * (
  sum by (service_name) (rate(traces_spanmetrics_calls_total{
    service_name="payments-api", status_code="STATUS_CODE_ERROR"
  }[5m]))
  / sum by (service_name) (rate(traces_spanmetrics_calls_total{
    service_name="payments-api"
  }[5m]))
) > 1.0

The STATUS_CODE_ERROR literal is what the spanmetrics generator emits; alternative span status conventions need their own kind.
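A comparable sketch for the error-rate kind is shown below. Again this is an approximation rather than the real compiler: the function name compile_trace_error_rate, the set of accepted operators, and the ">" default are assumptions made for illustration.

```python
# Assumed set of comparison operators; the real compiler may accept others.
ALLOWED_OPERATORS = {">", ">=", "<", "<="}
METRIC = "traces_spanmetrics_calls_total"


def compile_trace_error_rate(cond):
    """Build a PromQL expression from a trace_error_rate conditions_json dict."""
    op = cond.get("operator", ">")  # default is an assumption
    if op not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    svc = f'service_name="{cond["service"]}"'
    window = cond["window"]
    # Numerator: only spans the generator labelled as errors.
    errors = (
        f"sum by (service_name) (rate({METRIC}"
        f'{{{svc}, status_code="STATUS_CODE_ERROR"}}[{window}]))'
    )
    # Denominator: every span for the service.
    total = f"sum by (service_name) (rate({METRIC}{{{svc}}}[{window}]))"
    return f"100 * ({errors} / {total}) {op} {cond['threshold_pct']}"


expr = compile_trace_error_rate({
    "service": "payments-api",
    "window": "5m",
    "operator": ">",
    "threshold_pct": 1.0,
})
```

Both sides aggregate by service_name so the division matches one error series against one total series per service.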
Tools
query_traceql
Direct passthrough to Tempo's TraceQL endpoint (query_traceql_basetool.go). Used by the investigator persona when the operator asks for specific trace IDs — e.g. "find a slow trace for payments-api in the last 30 minutes."
{ resource.service.name = "payments-api" && duration > 800ms }

The underlying client is internal/pkg/tracequery. The package name is deliberately backend-neutral, so Jaeger / Zipkin clients can drop in later without renaming the surface.
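For orientation, a passthrough like this ultimately becomes an HTTP request against Tempo's search API. The sketch below only builds the URL for Tempo's /api/search endpoint; the base URL http://tempo:3200, the function name traceql_search_url, and the default limit are assumptions, and the real tool may shape the request differently.

```python
from urllib.parse import urlencode


def traceql_search_url(base_url, query, start_unix, end_unix, limit=20):
    """Build a Tempo search URL for a TraceQL query over a time range.

    start_unix and end_unix are Unix timestamps in seconds.
    """
    params = urlencode(
        {"q": query, "start": start_unix, "end": end_unix, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/search?{params}"


# "Last 30 minutes" relative to a fixed example timestamp.
end = 1_700_000_000
url = traceql_search_url(
    "http://tempo:3200/",  # assumed address
    '{ resource.service.name = "payments-api" && duration > 800ms }',
    end - 30 * 60,
    end,
)
```

urlencode percent-escapes the braces, quotes, and ampersands in the query, which is easy to get wrong when concatenating strings by hand.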
correlate_incident
The composite tool the investigator persona starts with. Pulls Prom + Loki + Tempo signals around the incident's fire window in a single fan-out call. This reduces 4-5 sequential tool calls to one, which matters when the per-investigation budget is 10 tool calls.
See correlate_incident_basetool.go.
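The fan-out pattern itself can be sketched in a few lines. This is not the code in correlate_incident_basetool.go; the function name correlate, the padding_s parameter, and the stub backends are hypothetical stand-ins for the real Prom, Loki, and Tempo clients.

```python
from concurrent.futures import ThreadPoolExecutor


def correlate(fire_start, fire_end, backends, padding_s=300):
    """Query every backend concurrently over a padded fire window.

    backends maps a name to a callable taking (start, end) Unix seconds.
    """
    window = (fire_start - padding_s, fire_end + padding_s)
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {
            name: pool.submit(fetch, *window) for name, fetch in backends.items()
        }
        # result() re-raises any backend error; a real tool would likely
        # degrade gracefully instead of failing the whole call.
        return {name: future.result() for name, future in futures.items()}


# Stub backends that just echo the window they were asked for.
stubs = {
    "prom": lambda start, end: ("metrics", start, end),
    "loki": lambda start, end: ("logs", start, end),
    "tempo": lambda start, end: ("traces", start, end),
}
signals = correlate(1000, 1600, stubs)
```

From the caller's point of view this is one tool call, so it consumes one unit of the investigation budget regardless of how many backends are queried.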
See also
- Alerts: the rule kinds.
- Monitoring: the Prom backend the trace alerts query.
- Topology: service-to-service edges Tempo's metrics_generator infers automatically.
- Telemetry data plane: ADR-014 in full.