Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 9: The Project That Ate Itself: Why We Made NetClaw the Main Repo
Three months ago, Convergence was a straightforward idea: build a network observability stack for a home lab. OTEL Collector pulls SNMP from two Cisco 3850 switches and a pfSense firewall. VictoriaMetrics stores the metrics. Loki stores the logs. Grafana draws the pictures. Threat intelligence enriches the firewall events. An automation agent blocks the bad IPs. Nine dashboards. Eleven containers. A clean docker-compose.yml that you could hand to someone and say “this monitors a network.”
Then we added AI agents. Then we added more AI agents. Then we added a bridge between the AI agents and a different AI agent that was already there. Then we added a bridge for the bridge. And somewhere around the third REST proxy sidecar, we stopped and asked the question we should have asked at Phase 7: why are we building this when NetClaw already does it?
This is the story of how a network monitoring project consumed itself, and why we tore it apart and rebuilt it inside the tool it was trying to replace.
How Convergence Grew
The first six phases were solid engineering. Each one solved a real problem:
- Phase 1-3: Telemetry ingestion, storage, visualization, alerting. OTEL → VictoriaMetrics → Grafana → Alertmanager → Discord. Standard observability stack. Nothing controversial.
- Phase 4: Threat intelligence. The top blocked IPs get enriched via AbuseIPDB, GreyNoise, OTX, and IPInfo. Composite scoring (sketched after this list). AI-generated narratives. This is a data pipeline — it fetches, scores, and stores. No LLM in the hot path.
- Phase 5: Automation agent. When a high-confidence threat is detected, propose a pfSense block action, get Discord approval, execute via XML-RPC, record in a GAIT audit trail. Deterministic pipeline with a human gate.
- Phase 6: Ollama provider support. The threat-intel and automation-agent services could now use local or cloud LLMs instead of just Anthropic.
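For flavor, here is roughly what composite scoring amounts to. This is a minimal sketch, not Convergence's actual code: the weights and the normalization are illustrative assumptions, and IPInfo is omitted because it contributes geo/ASN context rather than a score.

```python
# Illustrative composite scoring in the spirit of Phase 4. Weights and
# normalization are assumptions, not Convergence's implementation.

def composite_score(abuseipdb: dict, greynoise: dict, otx: dict) -> float:
    """Blend per-source signals into a 0-100 composite."""
    # AbuseIPDB reports a 0-100 confidence score directly.
    abuse = abuseipdb.get("abuseConfidenceScore", 0)

    # GreyNoise classification: "malicious" counts fully, "unknown" partially.
    gn_class = greynoise.get("classification", "unknown")
    greynoise_score = {"malicious": 100, "unknown": 40, "benign": 0}.get(gn_class, 40)

    # OTX: scale pulse count into 0-100, capped at 10 pulses (arbitrary cap).
    otx_score = min(otx.get("pulse_count", 0), 10) * 10

    # Weighted blend; the weights are placeholders.
    return 0.5 * abuse + 0.3 * greynoise_score + 0.2 * otx_score

print(composite_score(
    {"abuseConfidenceScore": 90},
    {"classification": "malicious"},
    {"pulse_count": 7},
))  # 0.5*90 + 0.3*100 + 0.2*70 = 89.0
```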
Through Phase 6, the architecture was clean. Data pipelines did data pipeline things. The LLM was used for narrative generation and action proposals — specific, bounded tasks. Docker Compose orchestrated everything. You could draw the data flow on a whiteboard.
Phase 7 is where it went sideways.
The Multi-Agent Mistake
The idea was reasonable: build a team of AI agents that continuously monitor the network. A NOC Officer for L1 triage. A Network Engineer for interface analysis. A Security Expert for threat hunting. A NAS Engineer for Synology health. An Interface Reconciler for keeping Nautobot accurate. A Supervisor to coordinate them.
Each agent was a Python module with a system prompt, tool definitions, and an agentic loop. The tools were httpx calls to VictoriaMetrics (PromQL), Loki (LogQL), Nautobot (GraphQL), and pfSense (XML-RPC). The agentic loop called the LLM, executed tools, fed results back, repeated until the LLM said stop.
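Stripped of the tool plumbing, every agent ran the same loop. The sketch below stubs out the LLM call; the real agents wired in a provider SDK and the httpx tools listed above.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class Reply:
    text: str = ""
    tool_calls: list = field(default_factory=list)

def call_llm(messages: list[dict]) -> Reply:
    # Stub standing in for a real provider call (Anthropic, Ollama, ...).
    return Reply(text="done")

def run_agent(system_prompt: str, task: str,
              tools: dict[str, Callable], max_turns: int = 10) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_llm(messages)
        if not reply.tool_calls:           # the model chose to stop
            return reply.text
        for call in reply.tool_calls:      # execute each requested tool
            result = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name,
                             "content": str(result)})
    return "max turns exceeded"
```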
It worked. The agents found real things. The Interface Reconciler was genuinely useful — it kept Nautobot’s port descriptions current by correlating MAC tables with DHCP leases. The Security Expert caught scanner campaigns and submitted block actions.
The problem was everything else.
158 findings in 6 hours. 33 block submissions. The Security Expert flagged a Synology NAS as a threat because it talked to multiple VLANs — which is what a NAS does. It blocked five IPs from the same /24 individually as /32s. The Interface Reconciler wrote port descriptions that the Security Expert then flagged as suspicious. Six agents, six context windows, zero shared state.
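The /32 blunder is worth dwelling on, because the fix is mechanical. Here is a sketch of the aggregation the agents never did; the threshold and the /24 grouping are arbitrary choices for illustration, not anything Convergence shipped.

```python
import ipaddress
from collections import Counter

def propose_blocks(offenders: list[str], threshold: int = 3) -> list[str]:
    """Collapse repeat offenders in the same /24 into one proposal."""
    by_subnet = Counter(
        ipaddress.ip_network(f"{ip}/24", strict=False) for ip in offenders
    )
    proposals, covered = [], set()
    for subnet, count in by_subnet.items():
        if count >= threshold:          # enough hits: block the whole /24
            proposals.append(str(subnet))
            covered.add(subnet)
    for ip in offenders:                # everything else stays a /32
        if ipaddress.ip_network(f"{ip}/24", strict=False) not in covered:
            proposals.append(f"{ip}/32")
    return proposals

print(propose_blocks(["203.0.113.5", "203.0.113.9", "203.0.113.77",
                      "203.0.113.200", "203.0.113.201", "198.51.100.4"]))
# ['203.0.113.0/24', '198.51.100.4/32']
```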
And the whole time, NetClaw was sitting in the same Docker Compose stack with 124 skills, 36 MCP servers, pyATS, Genie parsers, and a CCIE-level SOUL — doing nothing. Because we’d stuffed it into a container as a submodule and never properly set it up.
The Integration Tax
Phase 8 gave us a unified LLM client so the agents could use any provider. Phase 9 was supposed to connect the agents to NetClaw. We built a REST proxy sidecar so the security expert could ask NetClaw to investigate hosts. We built a convergence-mcp server so NetClaw could query the threat-intel and automation-agent APIs.
That’s when the architecture diagram started looking like a Rube Goldberg machine. The security expert finds a suspicious IP. It calls the REST proxy via HTTP. The proxy shells out to openclaw agent via subprocess. The CLI connects to the OpenClaw gateway via WebSocket with Ed25519 device authentication. The gateway dispatches to the LLM. The LLM calls MCP tools. The tools call back to the same VictoriaMetrics and Loki that the security expert was already querying directly.
Four protocol hops to ask a question that NetClaw could answer if it were running the monitoring loop in the first place.
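To make the tax concrete, here is roughly what the first hop looked like as a FastAPI sidecar. The route name and the way the question reaches the CLI are assumptions; all we have said is that the proxy shells out to openclaw agent via subprocess.

```python
# First hop of the Phase 9 chain. Route name and stdin-based invocation
# are assumptions for illustration.
import subprocess
from fastapi import FastAPI

app = FastAPI()

@app.post("/investigate")
def investigate(payload: dict):
    question = payload["question"]
    # Hops 2-4 happen inside this call: the CLI opens a WebSocket to the
    # OpenClaw gateway (Ed25519 device auth), the gateway dispatches to the
    # LLM, and the LLM calls MCP tools against the same VictoriaMetrics and
    # Loki the caller could have queried directly.
    result = subprocess.run(
        ["openclaw", "agent"], input=question,
        capture_output=True, text=True, timeout=300,
    )
    return {"answer": result.stdout}
```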
Phase 10 was the reckoning. We wrote three SKILL.md files, built a 100-line scheduler, and deleted 3,800 lines of Python. The six agents became three NetClaw skills. The unified LLM client we’d carefully built in Phase 8 — provider-agnostic, with fallback chains and credential sanitization — got deleted along with the service that used it. OpenClaw’s gateway already handles provider routing.
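The replacement scheduler is small enough to sketch almost in full. The endpoint path, payload shape, and intervals here are assumptions; what is real is the cron-plus-Discord design and the host.docker.internal:18790 target it reaches NetClaw through.

```python
# The flavor of the Phase 10 scheduler: fire each skill on an interval,
# hand it to NetClaw over REST, post the summary to Discord.
import time
import httpx

NETCLAW = "http://host.docker.internal:18790"
DISCORD_WEBHOOK = "https://discord.com/api/webhooks/..."  # placeholder

SKILLS = {  # skill name -> run interval in seconds (illustrative values)
    "noc-watch": 900,
    "security-monitor": 1800,
    "interface-reconciler": 86400,
}

def run_skill(name: str) -> str:
    resp = httpx.post(f"{NETCLAW}/run", json={"skill": name}, timeout=600)
    resp.raise_for_status()
    return resp.json().get("summary", "")

def main():
    last_run = {name: 0.0 for name in SKILLS}
    while True:
        now = time.time()
        for name, interval in SKILLS.items():
            if now - last_run[name] >= interval:
                summary = run_skill(name)
                httpx.post(DISCORD_WEBHOOK,
                           json={"content": f"[{name}] {summary}"})
                last_run[name] = now
        time.sleep(30)

if __name__ == "__main__":
    main()
```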
That worked. But it exposed the deeper problem.
The Submodule Was Always Wrong
NetClaw was deployed as a Docker container built from a git submodule. The Dockerfile ran install.sh || true — which means it installed what it could and silently failed on the rest. It never ran openclaw onboard, which is the step that registers MCP servers with the gateway. The pyATS testbed pointed to Cisco DevNet sandbox devices at 10.10.20.x, not our actual switches at 192.168.3.2/3.
The result: 36 MCP servers installed on disk, 2 registered. Grafana MCP with its 75 tools — sitting on disk, unregistered. Prometheus MCP, Nautobot MCP, pyATS MCP, nmap MCP, GAIT MCP — all installed, all unused. The skills we wrote referenced tools that existed but weren’t connected.
We kept hitting this wall. Register pfsense-mcp manually via openclaw mcp set. Discover that mcpServers isn’t a valid key in the deployment config (it’s managed by the CLI, not hand-edited). Discover that the submodule’s config/openclaw.json uses an older format than the gateway expects. Discover that credentials in the MCP registration need to be literal values, not ${VAR} references.
Every one of these problems exists because we were fighting NetClaw’s deployment model. NetClaw was designed to be installed on a host, onboarded interactively, and configured through its CLI. We were trying to bake it into a Docker image and configure it through volume mounts and environment variables. The impedance mismatch caused every integration bug we hit.
The Decision
The question wasn’t “should we keep using NetClaw” — we were already using it, badly. The question was: which project is the host and which is the guest?
Option A: Convergence is the main project, NetClaw is a submodule. This is what we had. It meant fighting the install process, manual MCP registration, wrong testbeds, config schema mismatches, and a Dockerfile full of || true.
Option B: NetClaw is the main project, Convergence infrastructure bolts on. NetClaw runs natively on the host, properly installed, properly onboarded. The telemetry stack (OTEL, VictoriaMetrics, Loki, Grafana, Redis) runs in Docker. The data services (threat-intel, automation-agent, scheduler) run in Docker. NetClaw talks to them via localhost ports.
We went with B.
The reasoning is simple: NetClaw is the thing that thinks. The telemetry stack is the thing it thinks about. You don’t make the brain a submodule of the nervous system.
The Branch Strategy
NetClaw is an open-source project at automateyournetwork/netclaw. We forked it to byrn-baker/netclaw. The fork has three branches:
- main — Clean NetClaw, synced with upstream. Only things that benefit everyone go here. Right now that's the pfSense MCP server (8 read-only tools via XML-RPC), the pfSense firewall operations skill, and the Synology NAS monitoring skill. These are generic — anyone running pfSense or Synology can use them. They'll be PR'd upstream.
- upstream/pfsense-mcp — The branch for the upstream PR. Just the MCP server, the two skills, and the gitignore update. Clean diff against main.
- convergence — Our deployment. Everything specific to this network: the telemetry stack configs, the Grafana dashboards, the threat-intel and automation-agent services, the convergence-mcp server, the scheduler, the deployment-specific skills. This branch merges from main to stay current with upstream, but it never goes upstream itself. It's our network, our thresholds, our device IPs.
The separation is clean. If upstream NetClaw adds a new skill or MCP server, we merge main into convergence and get it. If we improve the pfSense MCP server, we cherry-pick it to upstream/pfsense-mcp and PR it. The Convergence infrastructure — the docker-compose.yml, the OTEL config, the dashboards — lives only on the convergence branch because it’s deployment-specific.
What Moved Where
Everything that was in the Convergence repo now lives in the NetClaw fork:
| What | Where in NetClaw | Why |
|---|---|---|
| OTEL, Grafana, Loki, VM configs | config/ (alongside openclaw.json) | Infrastructure configs at the root |
| 9 Grafana dashboards | dashboards/ | Visualization |
| threat-intel service | services/threat-intel/ | Data pipeline, stays as-is |
| automation-agent service | services/automation-agent/ | Execution engine, stays as-is |
| convergence-scheduler | services/convergence-scheduler/ | Cron + Discord |
| docker-compose.yml | Root | Infrastructure orchestration |
| pfsense-mcp | mcp-servers/pfsense-mcp/ | Also on upstream branch |
| convergence-mcp | mcp-servers/convergence-mcp/ | Convergence branch only |
| netclaw-proxy | mcp-servers/netclaw-proxy/ | REST proxy for scheduler |
| Phase docs | docs/convergence-*.md | Prefixed to avoid conflicts |
The docker-compose.yml no longer has a netclaw container. NetClaw runs natively. The scheduler reaches it via host.docker.internal:18790.
What Stays the Same
The telemetry pipeline is untouched. SNMP metrics still flow from the switches through OTEL into VictoriaMetrics. Syslogs still flow from pfSense through OTEL into Loki via Promtail. NetFlow still gets collected. The dashboards still read from the same datasources. The threat-intel service still enriches IPs with the same four APIs. The automation-agent still blocks IPs on pfSense with the same GAIT audit trail.
None of that changed. The data plane is solid. It was always solid. The problem was never the data — it was the AI layer that looked at the data and decided what to do about it.
What’s Next
The convergence branch needs the actual deployment work:
- Install NetClaw properly — install.sh with all 53 steps completing, openclaw onboard running interactively
- Update the pyATS testbed with real device IPs and credentials
- Register the MCP servers that matter: Grafana (75 tools), Prometheus, Nautobot, pyATS, pfsense-mcp, convergence-mcp
- Test each one — verify NetClaw can actually query Grafana, SSH to switches, read pfSense leases (a smoke-test sketch follows this list)
- Write the three Convergence-specific skills (noc-watch, security-monitor, interface-reconciler)
- Start the infrastructure stack from the NetClaw directory
- Start NetClaw natively alongside it
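For the "test each one" step, a smoke test like this, run before onboarding, would have caught most of the Phase 9 surprises. The ports are the upstream defaults, an assumption about our compose mappings.

```python
# Confirm the Docker services answer on localhost before onboarding NetClaw
# against them. Adjust ports to match the compose file's mappings.
import httpx

CHECKS = {
    "VictoriaMetrics": "http://localhost:8428/health",
    "Loki":            "http://localhost:3100/ready",
    "Grafana":         "http://localhost:3000/api/health",
}

def smoke_test() -> bool:
    all_ok = True
    for name, url in CHECKS.items():
        try:
            resp = httpx.get(url, timeout=5)
            healthy = resp.status_code == 200
            status = "ok" if healthy else f"HTTP {resp.status_code}"
        except httpx.HTTPError as exc:
            healthy = False
            status = f"unreachable ({type(exc).__name__})"
        all_ok = all_ok and healthy
        print(f"{name:16} {status}")
    return all_ok

if __name__ == "__main__":
    raise SystemExit(0 if smoke_test() else 1)
```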
The Convergence repo at byrn-baker/Convergence becomes an archive. The README gets updated to point to the NetClaw fork. The git history is preserved — ten phases of learning, including the phases we had to undo.
The Lesson
Build on your tools, don’t build around them.
We had a CCIE-level AI agent with 124 skills, 36 MCP servers, pyATS device access, and a framework designed for exactly this kind of deployment. We treated it as a container to stuff into a compose stack and built a parallel system that reimplemented its capabilities in worse ways. Then we spent three phases building bridges between the two systems. Then we tore down the parallel system and moved into the tool we should have been using from the start.
The Convergence infrastructure — the telemetry stack, the dashboards, the data pipelines — that was always good work. It’s the nervous system. OTEL collects the signals, VictoriaMetrics and Loki store them, Grafana visualizes them, threat-intel enriches them. That work survives intact.
The mistake was building a second brain instead of teaching the first one what to look for.
The Convergence platform lives at byrn-baker/netclaw on the convergence branch. Upstream contributions (pfSense MCP server, pfSense and Synology skills) are on upstream/pfsense-mcp. The original Convergence repo is archived at byrn-baker/Convergence.
