Topology

Topology is the graph that turns "an incident on host edge-prod-04" into "an incident that takes out payments and search and email." Without it the LLM can only reason about the single firing series.

ADR-025

The current topology BaseTools (expand_topology, find_topology_node) landed 2026-05-18. Before that, the LLM had a flat device list and no way to walk relationships; "what depends on X" required the operator to know the answer already.

The data model

Four tables, all in MySQL:

| Table | Holds |
| --- | --- |
| topology_nodes | One row per node. Has a type (device / service / cluster / ...), a name, and a free-form props_json. |
| topology_node_types | The closed set of allowed node.type values. |
| topology_relations | Directed edge src_node_id -> dst_node_id with a type. |
| topology_relation_types | Closed set of edge types, each tagged with a semantics field (hard_dep / runtime_dep / traffic / annotation / observation). |

The semantics tag is the key idea. An edge tagged hard_dep propagates failure (if src dies, dst is affected); an edge tagged annotation does not. The blast-radius walk uses this to filter.

See internal/manager/biz/topology/.
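To make the role of the semantics tag concrete, here is a minimal Go sketch of the model. The type and method names (`Semantics`, `Node`, `Relation`, `Propagates`) are assumptions made for illustration and are not taken from the package above; the propagating set (hard_dep, runtime_dep, traffic) follows the only_propagating filter described under expand_topology.

```go
package main

import "fmt"

// Semantics mirrors the semantics field on topology_relation_types.
type Semantics string

const (
	HardDep     Semantics = "hard_dep"
	RuntimeDep  Semantics = "runtime_dep"
	Traffic     Semantics = "traffic"
	Annotation  Semantics = "annotation"
	Observation Semantics = "observation"
)

// Propagates reports whether an edge with this tag carries failure.
func (s Semantics) Propagates() bool {
	switch s {
	case HardDep, RuntimeDep, Traffic:
		return true
	}
	return false
}

// Node is a hypothetical in-memory view of a topology_nodes row.
type Node struct {
	ID        int64
	Type      string // must be one of the topology_node_types values
	Name      string
	PropsJSON string // free-form props_json
}

// Relation is a hypothetical view of a topology_relations row joined
// with the semantics of its topology_relation_types entry.
type Relation struct {
	SrcNodeID, DstNodeID int64
	Type                 string
	Semantics            Semantics
}

func main() {
	r := Relation{SrcNodeID: 142, DstNodeID: 71, Type: "deployed_on", Semantics: RuntimeDep}
	fmt.Println(r.Type, r.Semantics.Propagates())    // deployed_on true
	fmt.Println(Annotation, Annotation.Propagates()) // annotation false
}
```

Keeping the propagation decision on the tag rather than on each relation type means a new relation type only has to pick one of the five tags to take part in the blast-radius filter.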

Where the data comes from

Three sources, layered:

  1. Auto from spans: Tempo's service_graph processor emits service_a -> service_b edges, which are stored with the routes_to relation type. The manager mirrors these into topology_relations on a sync tick.
  2. Auto from edges: every registered edge (the Edge host, not a graph edge) becomes a type=device node; the node id is back-linked to the host_devices.node_id column so expand_topology(device_id=X) resolves through it.
  3. Manual: operators add type=service / type=cluster nodes and edges through the SPA's /topology page. Used for nodes you want to address but that aren't directly observed (a managed database, a third-party API).
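Because the span-derived edges are mirrored on every sync tick, the mirroring step has to be idempotent. The sketch below shows one way that could look, using an in-memory map in place of topology_relations. `relKey`, `ObservedEdge` and `mirror` are hypothetical names, and keying on (src, dst, type) is an assumption rather than a statement about the real schema.

```go
package main

import "fmt"

// relKey identifies a relation. A unique index over the same three
// columns would play this role in SQL (assumption, not from the schema).
type relKey struct {
	Src, Dst int64
	Type     string
}

// ObservedEdge is a hypothetical service-to-service edge read from the
// trace-derived service graph, already resolved to node ids.
type ObservedEdge struct{ Src, Dst int64 }

// mirror upserts observed edges into store and reports how many were new.
// Running it twice with the same input inserts nothing the second time.
func mirror(store map[relKey]bool, observed []ObservedEdge, relType string) int {
	added := 0
	for _, e := range observed {
		k := relKey{Src: e.Src, Dst: e.Dst, Type: relType}
		if !store[k] {
			store[k] = true
			added++
		}
	}
	return added
}

func main() {
	store := map[relKey]bool{}
	obs := []ObservedEdge{{Src: 300, Dst: 142}, {Src: 142, Dst: 219}}
	fmt.Println(mirror(store, obs, "routes_to")) // 2
	fmt.Println(mirror(store, obs, "routes_to")) // 0
}
```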

Tools

expand_topology

Walk outward from a node, return every reachable node plus how it was reached. Default BFS depth 2, cap 5. Default direction both (blast radius is symmetric — what could break this, AND what does this break).

```json
{
  "node_id": 142,
  "depth": 2,
  "only_propagating": true,
  "direction": "downstream"
}
```

Or, when you start from a device id:

```json
{ "device_id": 17, "depth": 3 }
```

The tool resolves device_id to its topology node through host_devices.node_id automatically. only_propagating=true (the default) walks only hard_dep / runtime_dep / traffic edges; set it to false to include observation / annotation edges (useful for the "show me everything related to X" cases).
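The defaults described so far (depth 2 with a cap of 5, direction both, only_propagating true, and the device_id lookup) can be sketched as a small argument-normalization step. This is an illustration only: `ExpandRequest`, `Resolved` and `normalize` are hypothetical names, and the sketch assumes that node_id wins when both ids are given and that an over-large depth is clamped to the cap rather than rejected.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	defaultDepth = 2
	maxDepth     = 5
)

// ExpandRequest is a hypothetical decoded form of the tool arguments.
// Pointers distinguish "omitted" from an explicit zero or false value.
type ExpandRequest struct {
	NodeID          *int64
	DeviceID        *int64
	Depth           *int
	OnlyPropagating *bool
	Direction       string
}

// Resolved holds the arguments after defaults are applied.
type Resolved struct {
	NodeID          int64
	Depth           int
	OnlyPropagating bool
	Direction       string
}

// normalize applies the documented defaults. deviceToNode stands in for
// the device_id -> node_id lookup.
func normalize(req ExpandRequest, deviceToNode map[int64]int64) (Resolved, error) {
	out := Resolved{Depth: defaultDepth, OnlyPropagating: true, Direction: "both"}
	switch {
	case req.NodeID != nil: // assumption: node_id wins over device_id
		out.NodeID = *req.NodeID
	case req.DeviceID != nil:
		id, ok := deviceToNode[*req.DeviceID]
		if !ok {
			return Resolved{}, fmt.Errorf("device %d has no topology node", *req.DeviceID)
		}
		out.NodeID = id
	default:
		return Resolved{}, errors.New("node_id or device_id is required")
	}
	if req.Depth != nil && *req.Depth > 0 {
		out.Depth = *req.Depth
	}
	if out.Depth > maxDepth {
		out.Depth = maxDepth // assumption: clamp rather than reject
	}
	if req.OnlyPropagating != nil {
		out.OnlyPropagating = *req.OnlyPropagating
	}
	switch req.Direction {
	case "": // keep the default
	case "upstream", "downstream", "both":
		out.Direction = req.Direction
	default:
		return Resolved{}, fmt.Errorf("unknown direction %q", req.Direction)
	}
	return out, nil
}

func main() {
	dev, depth := int64(17), 9
	got, err := normalize(ExpandRequest{DeviceID: &dev, Depth: &depth}, map[int64]int64{17: 71})
	fmt.Println(got, err) // {71 5 true both} <nil>
}
```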

Returned hits carry the path metadata the LLM needs to reason about impact:

```json
{
  "center":  { "node_id": 142, "node_name": "payments-api", "node_type": "service", "hops": 0, "propagates_failure": false },
  "max_hops": 2,
  "reachable_count": 7,
  "reachable": [
    { "node_id": 71, "node_name": "edge-prod-04", "node_type": "device", "hops": 1,
      "relation_type": "deployed_on", "semantics_tag": "runtime_dep", "reached_via": "downstream",
      "propagates_failure": true,
      "via_node_id": 142, "via_node_name": "payments-api" }
  ]
}
```

The flat list (no nested per-neighbor struct) is intentional — keeps the JSON cheap to embed in the prompt. See expand_topology_basetool.go.
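As an illustration of how a depth-limited BFS can produce that flat list, here is a self-contained sketch. It is not the code in expand_topology_basetool.go: `Edge`, `Hit` and `expand` are hypothetical names, the sample relation types are illustrative, and the sketch assumes that "downstream" follows src -> dst while "upstream" follows dst -> src.

```go
package main

import "fmt"

// Edge is a hypothetical in-memory relation; Propagates is derived from
// the semantics tag of the relation type.
type Edge struct {
	Src, Dst   int64
	Type       string
	Semantics  string
	Propagates bool
}

// Hit is one flat entry of the reachable list.
type Hit struct {
	NodeID       int64
	Hops         int
	RelationType string
	SemanticsTag string
	ReachedVia   string // "downstream" or "upstream"
	ViaNodeID    int64
}

// expand runs a breadth-first walk from center for at most depth hops.
// Each node is reported once, at the first (shortest) hop that reaches it.
func expand(edges []Edge, center int64, depth int, direction string, onlyPropagating bool) []Hit {
	seen := map[int64]bool{center: true}
	frontier := []int64{center}
	var hits []Hit
	for hop := 1; hop <= depth && len(frontier) > 0; hop++ {
		var next []int64
		for _, cur := range frontier {
			for _, e := range edges {
				if onlyPropagating && !e.Propagates {
					continue
				}
				var to int64
				var via string
				switch {
				case e.Src == cur && direction != "upstream":
					to, via = e.Dst, "downstream"
				case e.Dst == cur && direction != "downstream":
					to, via = e.Src, "upstream"
				default:
					continue
				}
				if seen[to] {
					continue
				}
				seen[to] = true
				hits = append(hits, Hit{NodeID: to, Hops: hop, RelationType: e.Type,
					SemanticsTag: e.Semantics, ReachedVia: via, ViaNodeID: cur})
				next = append(next, to)
			}
		}
		frontier = next
	}
	return hits
}

func main() {
	// Illustrative edges; relation type names and tags are assumptions.
	edges := []Edge{
		{Src: 142, Dst: 71, Type: "deployed_on", Semantics: "runtime_dep", Propagates: true},
		{Src: 300, Dst: 142, Type: "routes_to", Semantics: "traffic", Propagates: true},
		{Src: 142, Dst: 900, Type: "documented_by", Semantics: "annotation", Propagates: false},
	}
	for _, h := range expand(edges, 142, 2, "both", true) {
		fmt.Printf("%+v\n", h)
	}
}
```

The sketch scans every edge for each frontier node to stay short; a real implementation would more likely query adjacent relations per node or keep an adjacency index.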

find_topology_node

The "I have a human-given name, get me a node_id" pre-step. The persona runs this before expand_topology whenever the prompt mentions a service / host by name:

```text
User: "what does loki-write depend on?"
Agent:
  → find_topology_node{ name: "loki-write" }
    ← { node_id: 219, node_type: "service", name: "loki-write" }
  → expand_topology{ node_id: 219, direction: "upstream" }
    ← { reachable_count: 4, ... }
```

Both are registered as ScopeManager BaseTools (no edge_id argument) — the topology DB lives manager-side.
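The name-to-id pre-step can be sketched as follows. The matching rule here (case-insensitive exact match, and refusing to pick when several nodes share a name) is an assumption made for the example, not a description of how find_topology_node actually matches; `NodeRef`, `findNode` and `resolveOne` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeRef is the minimal shape a lookup needs to return.
type NodeRef struct {
	NodeID   int64
	NodeType string
	Name     string
}

// findNode returns every node whose name matches, ignoring case.
func findNode(nodes []NodeRef, name string) ([]NodeRef, error) {
	var out []NodeRef
	for _, n := range nodes {
		if strings.EqualFold(n.Name, name) {
			out = append(out, n)
		}
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no topology node named %q", name)
	}
	return out, nil
}

// resolveOne returns a node id only when the name is unambiguous, so the
// follow-up expand call never starts from a guessed node.
func resolveOne(nodes []NodeRef, name string) (int64, error) {
	matches, err := findNode(nodes, name)
	if err != nil {
		return 0, err
	}
	if len(matches) > 1 {
		return 0, fmt.Errorf("%q matches %d nodes; specify the node type", name, len(matches))
	}
	return matches[0].NodeID, nil
}

func main() {
	nodes := []NodeRef{
		{NodeID: 219, NodeType: "service", Name: "loki-write"},
		{NodeID: 71, NodeType: "device", Name: "edge-prod-04"},
	}
	id, err := resolveOne(nodes, "loki-write")
	fmt.Println(id, err) // 219 <nil>
}
```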

Blast-radius walk in practice

The investigator persona's prompt includes an explicit instruction: "after you identify the firing service, call expand_topology to see what else is affected." This is how the report's related_alerts and the "业务影响 / Business impact" section get populated: the agent walks the graph upstream and downstream from the firing device and cross-checks which other incidents fired on those nodes in the same window.

The relevant code path:

  1. correlate_incident returns metric + log + trace summaries plus the incident's device_id.
  2. expand_topology { device_id, direction: both, depth: 2 } returns the reachable set.
  3. Pass-2 extraction reads the worker's narrative and pulls pinpointed_target (patient zero) + related_alerts (the cascade).

See also

  • RCA — the investigator persona that uses these tools.
  • Skills — how expand_topology is registered in both the BaseTool bag and the skill registry (inventory_bridge).
  • Concepts — the Edge / Device / Node vocabulary.