
Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 7: From Dashboards to a NOC Team: How We Replaced Grafana Alerts with AI Agents

When we started the Convergence platform back in January, the goal was simple: stop guessing about what was happening on our home network and start knowing. Two Cisco switches, a pfSense firewall, a pair of Synology NAS devices, and the ambient noise of a network that — like any network — had opinions about when it wanted to misbehave.

The first six phases got us from zero to a genuinely capable observability stack: SNMP metrics in VictoriaMetrics, syslogs in Loki, pfSense firewall events enriched with threat intelligence, an AI agent that could propose IP blocks and get Discord approval before executing them, and a Grafana dashboard collection that, if you stared at it long enough, would tell you something useful.

That last bit is the problem. Dashboards require someone to stare at them.

Phase 7 is about not having to stare.


The Problem with Alerts

We had Grafana alerts. They fired into Discord. They were, in the classic tradition of monitoring alerts everywhere, simultaneously too noisy and not smart enough.

An alert that says “interface utilization on Te1/1/3 is 14 Mbps” is technically correct and completely useless. 14 Mbps on a 10G port is 0.14% utilization. It is not a problem. But an alerting rule doesn’t know that — it just knows a threshold was crossed, or a query matched, or a condition was true for N seconds.

The other problem is context. An alert that fires doesn’t know what’s plugged into that port. It doesn’t know whether the switch has been reporting normally for the past hour, or whether the SNMP collector has been silently failing and this is the first data point in six hours. It can’t look at the NetFlow data and correlate whether that “high” utilization is a port scan in progress or someone copying a large file to the NAS.

We needed something that could think, not just compare.


NetClaw: The Starting Point

The Convergence platform was already running NetClaw — a specialized AI network engineering agent built on the OpenClaw framework. NetClaw knows Cisco IOS-XE. It can run pyATS show commands. It has over 100 skills covering everything from BGP configuration to EVPN fabric auditing to Nautobot DCIM reconciliation.

The original idea was to use NetClaw as the primary monitoring agent. Let it poll the network, check the metrics, and tell us what was wrong.

The problem with a single-agent approach is the same problem you’d have hiring one person to be simultaneously a network engineer, a security analyst, a storage admin, and a NOC watch officer. They’d be context-switching constantly, bringing a generalist perspective to every domain, and probably missing things.

Real NOCs work in tiers for a reason. You have L1 watch officers who see everything and escalate what they can’t handle. You have L2/L3 specialists who know their domain deeply. You have senior architects who handle the hard stuff. This specialization isn’t bureaucracy — it’s how you avoid both alert fatigue and blind spots.


The NET-OPS Team Architecture

Phase 7 builds a hierarchical team of AI agents on the Claude Agent SDK, wired into the existing Convergence data plane (VictoriaMetrics, Loki, Nautobot, pfSense).

Supervisor
├── NOC Watch Officer    (L1 — sees everything, 5-minute poll)
├── Network Engineer     (L2/L3 — switches, interfaces, utilization)
├── Security Expert      (L2/L3 — pfSense firewall, NetFlow threat hunting)
├── Security Engineer    (L2/L3 — threat intel correlation, C2 detection)
├── NAS Engineer         (L2/L3 — Synology health, RAID, disk temps)
└── Interface Reconciler (L2/L3 — Nautobot DCIM vs SNMP reality)

Each agent runs every 5 minutes. Each has a curated tool set and a system prompt that encodes domain expertise — not just generic instructions, but actual knowledge about the specific equipment and environment.
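
A "curated tool set" is nothing exotic — it's a short list of JSON-schema tool definitions passed to the API on every call. A minimal sketch, with an illustrative tool (the name and schema here are not the project's actual definitions):

tools = [{
    "name": "query_victoriametrics",  # illustrative name, not the project's actual tool
    "description": "Run a PromQL instant query and return the raw JSON result.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]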

The pattern is the same for every agent: a standard agentic loop using the Anthropic Python SDK’s messages.create() API with tool_use/end_turn stop reason handling. Each agent queries its data sources, reasons about what it finds, and either reports a finding or moves on. Findings are collected by the Supervisor and posted to Discord.
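
A minimal sketch of that loop in Python, assuming tool definitions like the one above and a dispatch function execute_tool() that maps tool names to real queries (the model name is illustrative):

import anthropic

client = anthropic.Anthropic()

def run_cycle(system_prompt: str, tools: list, task: str, execute_tool) -> str:
    # Standard agentic loop: call the model, execute any requested tools,
    # feed the results back, and stop when the model returns end_turn.
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # illustrative
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # end_turn: the agent is done; its final text is the finding/report
            return "".join(b.text for b in response.content if b.type == "text")
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": execute_tool(b.name, b.input)}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})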


What Made This Work: Domain Knowledge in System Prompts

The difference between a useful AI agent and a noisy one isn’t the model — it’s the system prompt. You have to encode the specific knowledge that a human expert would bring to the job.

Some examples of what we embedded:

Network Engineer — Cisco WS-C3850-48P hardware:

WS-C3850-48P uplink module (IMPORTANT):
- Gi1/1/1 through Gi1/1/4 are COMBO ports shared with Te1/1/1-Te1/1/4.
- When an SFP+ is inserted and Te1/1/x is active, the corresponding Gi1/1/x is
  automatically disabled by IOS — it will show as down with no traffic.
- Do NOT report Gi1/1/1-4 as errors or missing if Te1/1/1-4 are active.
  This is expected and correct hardware behavior, not a fault.

Without this, the agent was repeatedly flagging four “down” interfaces as potential problems. With it, the agent knows that Gi1/1/1-4 being inactive when Te1/1/1-4 are active is correct. This is exactly the kind of equipment-specific knowledge that separates a useful network engineer from someone reading the symptom list without understanding the platform.

NAS Engineer — Synology RAID semantics:

IMPORTANT — Storage Pool vs Volume semantics:
- "Storage Pool" entries in nas_raid_free_bytes showing 0 free is NORMAL and expected.
  It means all pool capacity is allocated to Volumes — this is correct Synology behavior.
- Only Volume entries (raid_name=~"Volume.*") reflect actual user-available disk space.
  Report storage warnings based on Volume free space only, never on Storage Pool entries.

This one caused real confusion early on. The SNMP data was coming back with Storage Pool entries showing free_bytes = 0, and the agent was interpreting that as “the storage is full.” It isn’t — Synology allocates all pool capacity to volumes, and the volume entries show what’s actually available. The fix required encoding that semantic understanding directly.

Security Expert — NetFlow threat hunting patterns:

Threat hunting patterns to check for in NetFlow:
- Port scans: one src IP hitting many different dst ports in short time window
- Brute force: many flows to the same dst IP on port 22, 3389, or 445
- C2 beaconing: regular small flows from internal IP at fixed intervals (e.g., every 60s)
- Data exfiltration: large outbound flows (flow.io.bytes > 10MB) to external IPs
- Lateral movement: unexpected internal-to-internal flows between unusual source/dest pairs

This turns the NetFlow query from “show me traffic” into “actively look for these specific threat signatures.”
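
The first pattern, for example, reduces to counting distinct destination ports per source. A sketch, assuming flows have already been extracted from Loki and filtered to a short time window (the field names and threshold are illustrative):

from collections import defaultdict

def find_port_scans(flows: list[dict], min_ports: int = 20) -> list[str]:
    # flows: [{"src": "1.2.3.4", "dst_port": 443}, ...] within one window
    ports_by_src: dict[str, set] = defaultdict(set)
    for flow in flows:
        ports_by_src[flow["src"]].add(flow["dst_port"])
    # One source hitting many distinct ports in a short window looks like a scan
    return [src for src, ports in ports_by_src.items() if len(ports) >= min_ports]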


The NetFlow Pipeline

One of the bigger infrastructure additions in Phase 7 was getting NetFlow data from pfSense into Loki so agents could query it.

pfSense was already exporting NetFlow v5 over UDP. The challenge was getting it into the stack without the Loki exporter (which turned out not to be available in the OTEL collector build we were running). The solution was simpler than expected:

pfSense → OTEL netflow receiver (UDP 2055) → file exporter (/data/netflow/netflow.jsonl)
       → Promtail (tail the file) → Loki {job="netflow"}
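
The Promtail side is ordinary file tailing — a minimal sketch, where everything beyond the /data/netflow path and the job="netflow" label is an assumption:

scrape_configs:
  - job_name: netflow
    static_configs:
      - targets: [localhost]
        labels:
          job: netflow
          __path__: /data/netflow/netflow.jsonl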

The OTEL netflow receiver outputs OTLP-formatted JSON records. Each flow is a log line with attributes in the format {"key": "source.address", "value": {"stringValue": "1.2.3.4"}}. The agents know this format and can extract source/destination IPs, ports, byte counts, and flow direction from the raw records.
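
A sketch of extracting fields from that attribute format, assuming the attributes list sits at the top level of each parsed record (the file exporter may nest it under OTLP resource/scope structures):

import json

def get_attr(attributes: list[dict], key: str):
    # OTLP wraps every value, e.g. {"stringValue": "1.2.3.4"} or {"intValue": "443"}
    for attr in attributes:
        if attr["key"] == key:
            return next(iter(attr["value"].values()))
    return None

with open("/data/netflow/netflow.jsonl") as f:
    for line in f:
        flow = json.loads(line)
        src_ip = get_attr(flow["attributes"], "source.address")
        dst_ip = get_attr(flow["attributes"], "destination.address")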

Getting the OTEL receiver config right took some iteration. The netflow receiver only accepts four valid keys: scheme, hostname, port, and workers. Everything else (endpoint, addr, transport, listeners) is invalid and causes a startup crash. After finding this out the hard way, the working config is:

netflow:
  scheme: netflow
  hostname: "0.0.0.0"
  port: 2055
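
The rest of the collector wiring is just a file exporter feeding a logs pipeline — a sketch, assuming the contrib collector's standard file exporter:

exporters:
  file:
    path: /data/netflow/netflow.jsonl

service:
  pipelines:
    logs:
      receivers: [netflow]
      exporters: [file]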

NAS SNMP: The NAT Problem

The Synology NAS devices (192.168.100.22/23) sit on a different subnet from the Docker containers running the OTEL collector (172.18.0.0/16). Traffic goes: container → host eth0 (192.168.3.254) → pfSense → NAS.

The OTEL SNMP receiver's default timeout is tight enough that this NAT path caused failures on the initial collection. The fix was adding timeout: 10s to both NAS SNMP receivers. After that, both devices started reporting correctly.
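
For reference, a sketch of one NAS receiver with the fix in place — the interval, version, and community string are illustrative; timeout: 10s is the actual change:

receivers:
  snmp/nas1:
    collection_interval: 60s            # illustrative
    endpoint: udp://192.168.100.22:161
    version: v2c                        # illustrative
    community: public                   # illustrative
    timeout: 10s                        # the fix: default was too tight for the NAT path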

The other NAS SNMP issue was metric naming. When OTEL exports to Prometheus/VictoriaMetrics, it appends the unit name to the metric name:

  • nas.system.temperature (unit: “Cel”) → nas_system_temperature_celsius
  • nas.disk.status (unit: “1”) → nas_disk_status_ratio
  • nas.raid.free (unit: “By”) → nas_raid_free_bytes

The NAS Engineer’s system prompt, the Grafana dashboard PromQL, and the dashboard panel units all needed to be updated to match these suffixed names.
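
Illustrative PromQL against the suffixed names, folding in the Volume-vs-Pool rule from the NAS Engineer's prompt:

# User-available free space: Volume entries only, never Storage Pools
nas_raid_free_bytes{raid_name=~"Volume.*"}

# Temperature queries use the OTEL-appended unit suffix
nas_system_temperature_celsius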


Nautobot as a Living Source of Truth

The Interface Reconciler is probably the most impactful long-running agent in the team. Every poll cycle, it:

  1. Queries VictoriaMetrics for all interfaces currently reporting SNMP data on both switches
  2. Queries Nautobot GraphQL for the current interface inventory
  3. Diffs the two: creates interfaces in Nautobot that SNMP sees but Nautobot doesn’t know about
  4. Queries pfSense for DHCP leases and ARP table
  5. Correlates MAC addresses to find what’s connected to each port
  6. Writes port descriptions back to Nautobot: Connected: HDHR-10A70D51 (192.168.100.223, 00:18:dd:0a:70:d5)

Nautobot isn’t just being read anymore — it’s being actively maintained by the agent. The Nautobot DCIM inventory now reflects the actual current state of the network, updated continuously rather than manually when someone remembers to update it.

The decision about what to auto-write vs what to report is deliberate:

  • Descriptions and missing interfaces: auto-write (low risk, high value)
  • enabled/status mismatches: report as WARNING, don’t auto-correct (requires human judgment)
  • Stale records: report as WARNING, never auto-delete (might be intentional)

Discord as the Human Interface

The old Discord integration was Grafana alert rules firing webhooks. The new one is agent-driven.

What changed:

  • Grafana alert rules: all cleared. The alerting configuration points every notification to a “Do Nothing” receiver. Dashboards are for visualization, not for paging you.
  • CRITICAL findings: posted to Discord immediately as they’re generated
  • WARNING findings: posted immediately
  • INFO findings: suppressed from Discord, visible in Grafana dashboards
  • Hourly shift reports: summarize all CRITICAL/WARNING findings grouped by agent role, with a direct link to the relevant Grafana dashboard for each finding type

The shift report when everything is healthy reads: “All systems healthy — no issues detected this cycle.” That’s a meaningful signal. You know the agents ran and found nothing, rather than having to wonder whether the alerts would have fired if something were wrong.

Each agent role maps to a specific Grafana dashboard:

Agent                   Dashboard
Network Engineer        Interface Utilization
Security Expert         pfSense Firewall Security
NAS Engineer            NAS Health
Interface Reconciler    Network Device Health

When a finding fires, the Discord embed has a direct link to the dashboard where you can drill in. The agent tells you what happened; the dashboard shows you the context.


NetClaw’s Place in This

NetClaw didn’t go away. It became the L4 escalation target.

The NET-OPS team agents handle continuous monitoring and routine analysis. They don’t push configuration, don’t make changes to network devices, and don’t take actions that could cause outages. When something requires actual intervention — a complex interface configuration, a multi-device topology change, anything requiring pyATS access — that’s what NetClaw is for.

NetClaw’s value in this architecture is its skill library: over 100 skills covering Cisco, Junos, ACI, Meraki, Nautobot, AWS networking, packet capture, BGP, OSPF, and more. When the Interface Reconciler flags a MEDIUM discrepancy that requires a human decision, the next step is to pull up a NetClaw Discord session and ask it to investigate further with full device access.

The nautobot-dcim-reconcile skill in NetClaw is what the Interface Reconciler agent was modeled on. The agent performs the reconciliation automatically and continuously; the skill is the on-demand version you’d invoke when you want a thorough manual audit with full reporting.

NetClaw also has an interesting future: MISSION03 describes BGP peering between NetClaw instances via ngrok, using OPEN-based peer identification (since ngrok breaks source-IP-based matching). Two AI network agents, each managing their own domain, exchanging routes and topology information over BGP. That’s future work — but it’s wired into the platform’s direction.


What’s Better About This Approach

Threshold context is in the agent, not the alert rule. A Grafana alert rule doesn’t know that 14 Mbps on a 10G port isn’t a problem. The agent does, because we told it the thresholds explicitly and said “only report if these are exceeded.”

Findings include context that static alerts can’t carry. When the Network Engineer finds high utilization, it first looks up the port description in Nautobot to tell you what device is connected. The finding says “HomeSwitch01 Gi1/0/47 [HDHomeRun Tuner]: 847 Mbps IN, 84.7% utilization, approaching saturation” — not “interface utilization alert fired.”

INFO is cheap but not spammy. Every agent reports an INFO finding on a clean check. The supervisor collects all of these, but they’re suppressed from Discord. You can query the /api/v1/report/latest endpoint to see the full picture, including the all-clear INFO findings. The Discord channel stays quiet unless something needs attention.
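
A sketch of reading that endpoint — the host, port, and response field names are assumptions; only the path comes from the service:

import requests

report = requests.get("http://localhost:8080/api/v1/report/latest").json()
for finding in report.get("findings", []):  # field names assumed
    print(finding["severity"], finding["role"], finding["summary"])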

The source of truth is actively maintained. Nautobot used to be the thing we manually updated when we remembered. Now the Interface Reconciler is writing to it every 5 minutes, keeping the interface inventory and port descriptions current.

You can ask questions. The Discord bot routes ad-hoc questions to the appropriate agent with full tool access. “What’s plugged into switch port 24?” runs the NAS engineer’s answer_question() method, which queries Nautobot and returns the description. No dashboard digging.


What’s Next

The architecture has a gap we haven’t filled: the agents flag issues but rarely push back changes. The next logical step is closing that loop — agent finds issue, agent proposes fix (via NetClaw), human approves in Discord, NetClaw executes with GAIT audit trail. The automation-agent service from Phase 5 already has this pattern for IP blocking. Generalizing it to switch interface configuration and NAS administration is the natural extension.

NetClaw’s BGP mesh (MISSION03) is worth watching too. When two NetClaw instances can peer over BGP, the combined topology visibility is genuinely interesting. One node sees a route flap; the other correlates it with a security event on its side. That kind of cross-domain, machine-speed correlation is what makes AI network operations different from traditional monitoring — not faster dashboards, but actual reasoning across multiple data sources simultaneously.

The dashboards aren’t going away. Grafana still shows you everything, and the agents link back to it. But the first line of awareness isn’t a screen you stare at anymore. It’s a team that runs while you’re asleep and tells you what matters when you wake up.


The Convergence platform is a home network AI observability project. Phase 7 source code lives in services/net-ops-team/ in the repository.

Ideas or homelab war stories? Find me on X @byrn_baker or LinkedIn @byrnbaker.

Code: https://github.com/byrn-baker/Convergence/tree/feature/netclaw-integration

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →