Topology
Topology is the graph that turns "an incident on host edge-prod-04" into "an incident that takes out payments and search and email." Without it the LLM can only reason about the single firing series.
ADR-025
The current topology BaseTools (expand_topology, find_topology_node) landed 2026-05-18. Before that, the LLM had a flat device list and no way to walk relationships; "what depends on X" required the operator to know the answer already.
The data model
Four tables, all in MySQL:
| Table | Holds |
|---|---|
| topology_nodes | One row per node. Has a type (device / service / cluster / ...), a name, and a free-form props_json. |
| topology_node_types | The closed set of allowed node.type values. |
| topology_relations | Directed edge src_node_id -> dst_node_id with a type. |
| topology_relation_types | Closed set of edge types, each tagged with a semantics field (hard_dep / runtime_dep / traffic / annotation / observation). |
The semantics tag is the key idea. An edge tagged hard_dep propagates failure (if src dies, dst is affected); an edge tagged annotation does not. The blast-radius walk uses this to filter.
See internal/manager/biz/topology/.
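As a rough illustration of how the semantics tag can drive the filter, the sketch below classifies an edge by its tag and drops the non-propagating ones. The type and function names (Semantics, Relation, FilterPropagating) and the documented_by relation type are hypothetical; only the five tag values and the split between propagating and non-propagating tags come from this page.

```go
package main

import "fmt"

// Semantics mirrors the semantics field on topology_relation_types.
type Semantics string

const (
	HardDep     Semantics = "hard_dep"
	RuntimeDep  Semantics = "runtime_dep"
	Traffic     Semantics = "traffic"
	Annotation  Semantics = "annotation"
	Observation Semantics = "observation"
)

// PropagatesFailure reports whether the blast-radius walk should follow
// an edge carrying this tag.
func (s Semantics) PropagatesFailure() bool {
	switch s {
	case HardDep, RuntimeDep, Traffic:
		return true
	default:
		return false
	}
}

// Relation is a hypothetical in-memory view of a topology_relations row
// joined with the semantics of its relation type.
type Relation struct {
	SrcNodeID, DstNodeID int64
	Type                 string
	Semantics            Semantics
}

// FilterPropagating keeps only the edges that carry failure.
func FilterPropagating(rels []Relation) []Relation {
	out := make([]Relation, 0, len(rels))
	for _, r := range rels {
		if r.Semantics.PropagatesFailure() {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rels := []Relation{
		{142, 71, "deployed_on", RuntimeDep},
		{142, 300, "documented_by", Annotation}, // hypothetical annotation edge
	}
	for _, r := range FilterPropagating(rels) {
		fmt.Printf("%d -> %d (%s)\n", r.SrcNodeID, r.DstNodeID, r.Type)
	}
}
```

Keeping the decision on the tag rather than on the relation type means a new edge type only needs the right semantics value to be handled correctly by the walk.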
Where the data comes from
Three sources, layered:
- Auto from spans: Tempo's service_graph processor emits service_a -> service_b edges with a routes_to semantics tag. The manager mirrors these into topology_relations on a sync tick.
- Auto from edges: every registered edge becomes a type=device node; the node id is back-linked to the host_devices.node_id column so expand_topology(device_id=X) resolves through.
- Manual: operators add type=service / type=cluster nodes and edges through the SPA's /topology page. Used for nodes you want to address but that aren't directly observed (a managed database, a third-party API).
Tools
expand_topology
Walk outward from a node, return every reachable node plus how it was reached. Default BFS depth 2, cap 5. Default direction both (blast radius is symmetric — what could break this, AND what does this break).
{
"node_id": 142,
"depth": 2,
"only_propagating": true,
"direction": "downstream"
}
Or, when you start from a device id:
{ "device_id": 17, "depth": 3 }
The tool resolves device_id → device.node_id automatically. only_propagating=true (default) walks only hard_dep / runtime_dep / traffic edges; flip to false to include observation / annotation edges (useful for the "show me everything related to X" cases).
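The defaults described above can be sketched as a small argument-normalization step. This is a minimal sketch, not the code in expand_topology_basetool.go: ExpandArgs and ParseExpandArgs are hypothetical names, and it assumes that a depth above the cap is clamped rather than rejected and that an unknown direction is an error. The pointer on only_propagating is there so an omitted field can default to true instead of decoding as false.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	defaultDepth = 2 // default BFS depth
	maxDepth     = 5 // cap
)

// ExpandArgs is a hypothetical decoding target for expand_topology arguments.
type ExpandArgs struct {
	NodeID          int64  `json:"node_id"`
	DeviceID        int64  `json:"device_id"`
	Depth           int    `json:"depth"`
	OnlyPropagating *bool  `json:"only_propagating"` // nil means "not supplied"
	Direction       string `json:"direction"`
}

// ParseExpandArgs decodes the arguments and fills in the documented defaults.
func ParseExpandArgs(raw []byte) (ExpandArgs, error) {
	var a ExpandArgs
	if err := json.Unmarshal(raw, &a); err != nil {
		return a, err
	}
	if a.NodeID == 0 && a.DeviceID == 0 {
		return a, errors.New("expand_topology: node_id or device_id is required")
	}
	if a.Depth <= 0 {
		a.Depth = defaultDepth
	}
	if a.Depth > maxDepth {
		a.Depth = maxDepth // assumption: clamp instead of failing
	}
	if a.OnlyPropagating == nil {
		t := true
		a.OnlyPropagating = &t
	}
	switch a.Direction {
	case "":
		a.Direction = "both"
	case "both", "upstream", "downstream":
		// accepted as-is
	default:
		return a, fmt.Errorf("expand_topology: unknown direction %q", a.Direction)
	}
	return a, nil
}

func main() {
	a, err := ParseExpandArgs([]byte(`{ "device_id": 17, "depth": 3 }`))
	if err != nil {
		panic(err)
	}
	fmt.Println(a.DeviceID, a.Depth, *a.OnlyPropagating, a.Direction)
}
```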
Returned hits carry the path metadata the LLM needs to reason about impact:
{
"center": { "node_id": 142, "node_name": "payments-api", "node_type": "service", "hops": 0, "propagates_failure": false },
"max_hops": 2,
"reachable_count": 7,
"reachable": [
{ "node_id": 71, "node_name": "edge-prod-04", "node_type": "device", "hops": 1,
"relation_type": "deployed_on", "semantics_tag": "runtime_dep", "reached_via": "downstream",
"propagates_failure": true,
"via_node_id": 142, "via_node_name": "payments-api" }
]
}
The flat list (no nested per-neighbor struct) is intentional: it keeps the JSON cheap to embed in the prompt. See expand_topology_basetool.go.
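To make the walk and the flat result shape concrete, here is a minimal in-memory sketch of a depth-limited BFS that emits one flat hit per reached node. It is not the implementation in expand_topology_basetool.go. Edge, Hit and Expand are hypothetical names, the edges are held in a slice rather than read from MySQL, and it assumes that "downstream" means following an edge from src to dst and "upstream" means the reverse.

```go
package main

import "fmt"

// Edge is a hypothetical in-memory topology_relations row.
type Edge struct {
	Src, Dst   int64
	Type       string
	Propagates bool // true for hard_dep / runtime_dep / traffic
}

// Hit is one flat entry of the reachable list.
type Hit struct {
	NodeID       int64
	Hops         int
	RelationType string
	ReachedVia   string // "upstream" or "downstream"
	ViaNodeID    int64
}

// Expand walks outward from center level by level, up to depth hops.
// Each node is reported once, at the first (shortest) hop that reaches it.
func Expand(edges []Edge, center int64, depth int, direction string, onlyPropagating bool) []Hit {
	seen := map[int64]bool{center: true}
	frontier := []int64{center}
	var hits []Hit
	for hop := 1; hop <= depth && len(frontier) > 0; hop++ {
		var next []int64
		for _, cur := range frontier {
			for _, e := range edges {
				if onlyPropagating && !e.Propagates {
					continue
				}
				var nb int64
				var via string
				switch {
				case e.Src == cur && direction != "upstream":
					nb, via = e.Dst, "downstream"
				case e.Dst == cur && direction != "downstream":
					nb, via = e.Src, "upstream"
				default:
					continue
				}
				if seen[nb] {
					continue
				}
				seen[nb] = true
				hits = append(hits, Hit{nb, hop, e.Type, via, cur})
				next = append(next, nb)
			}
		}
		frontier = next
	}
	return hits
}

func main() {
	edges := []Edge{
		{142, 71, "deployed_on", true},
		{500, 142, "routes_to", true},      // hypothetical caller of node 142
		{142, 900, "documented_by", false}, // hypothetical annotation edge
	}
	for _, h := range Expand(edges, 142, 2, "both", true) {
		fmt.Printf("%+v\n", h)
	}
}
```

Recording ViaNodeID on each hit is what lets a flat list still describe the path: a consumer can follow via pointers back to the center without any nesting.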
find_topology_node
The "I have a human-given name, get me a node_id" pre-step. The persona runs this before expand_topology whenever the prompt mentions a service / host by name:
User: "what does loki-write depend on?"
Agent:
→ find_topology_node{ name: "loki-write" }
← { node_id: 219, node_type: "service", name: "loki-write" }
→ expand_topology{ node_id: 219, direction: "upstream" }
← { reachable_count: 4, ... }
Both are registered as ScopeManager BaseTools (no edge_id argument) because the topology DB lives manager-side.
Blast-radius walk in practice
The investigator persona's prompt includes an explicit "after you identify the firing service, call expand_topology to see what else is affected." This is how the report's related_alerts and the "业务影响 / Business impact" section get populated — the agent walks the graph from the firing device upward / downward and cross-checks which other incidents have fired in the same window on those nodes.
The relevant code path:
1. correlate_incident returns metric + log + trace summaries plus the incident's device_id.
2. expand_topology { device_id, direction: both, depth: 2 } returns the reachable set.
3. Pass-2 extraction reads the worker's narrative and pulls pinpointed_target (patient zero) + related_alerts (the cascade).
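The cross-check between the reachable set and other incidents in the same window can be sketched as below. In the flow described here that reasoning is done by the agent, so this is only an illustration of the logic, not code from the manager. Incident, RelatedAlerts, incidentStart and the five-minute window are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Incident is a hypothetical, minimal view of a fired incident.
type Incident struct {
	ID      int64
	NodeID  int64
	FiredAt time.Time
}

// incidentStart is example data: the time the firing incident began.
var incidentStart = time.Date(2026, 5, 18, 12, 0, 0, 0, time.UTC)

// RelatedAlerts returns the incidents that fired on a reachable node
// within window of center, before or after it.
func RelatedAlerts(reachable []int64, incidents []Incident, center time.Time, window time.Duration) []Incident {
	onGraph := make(map[int64]bool, len(reachable))
	for _, id := range reachable {
		onGraph[id] = true
	}
	var out []Incident
	for _, inc := range incidents {
		d := inc.FiredAt.Sub(center)
		if d < 0 {
			d = -d
		}
		if onGraph[inc.NodeID] && d <= window {
			out = append(out, inc)
		}
	}
	return out
}

func main() {
	incidents := []Incident{
		{ID: 1, NodeID: 71, FiredAt: incidentStart.Add(time.Minute)},
		{ID: 2, NodeID: 71, FiredAt: incidentStart.Add(time.Hour)},  // outside the window
		{ID: 3, NodeID: 999, FiredAt: incidentStart},                // not reachable
	}
	for _, inc := range RelatedAlerts([]int64{71, 142}, incidents, incidentStart, 5*time.Minute) {
		fmt.Println("related incident:", inc.ID)
	}
}
```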