Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 12: Teaching an AI Coworker: Skills, Mistakes, and the 20-Step Demo That Broke Five Different Ways

▶️ Watch the video

I have a demo. One prompt, one conversation, and the agent builds a complete network lab — 20 routers and switches, source of truth, generated configurations, compliance monitoring. It takes about 15 minutes when it works. It took five models and a month of debugging to get it to work.

The demo isn’t the interesting part. The interesting part is what happened every time it didn’t work, and what I changed so it wouldn’t happen again. Because training an AI agent turns out to be exactly like training a new hire — except the new hire occasionally decides to install nginx and build a git server from scratch instead of using GitHub.


The Coworker Framing

NetClaw isn’t an assistant. The SOUL.md file — the personality definition that gets injected into every session — says it explicitly: “You are not an assistant. You are a coworker. You own this network.”

That framing matters because it changes how you think about the training. You don’t write documentation for an assistant. You write onboarding materials for a team member. There are three parts:

Who it is. SOUL.md defines a CCIE-level network engineer with 15 years of experience across enterprise, service provider, and data center environments. It knows OSPF, BGP, MPLS, EVPN. It knows what a flapping interface looks like and what questions to ask when a BGP peer goes down. This isn’t a chatbot that searches the internet — this knowledge is part of who it is.

How it works here. AGENTS.md defines the operating rules. Never guess what a device is doing — go look. Never change a config without capturing the before state. Never skip the change request. Always log what you did and why. These aren’t suggestions — they’re non-negotiable, just like they would be for any engineer on the team. There are 12 of them, and they exist because at some point the agent violated each one.

What it knows about us. USER.md has the human’s name, timezone, and preferences. TOOLS.md has device IPs, Slack channels, and site information. The stuff that makes it useful for your network, not just any network.

This is standard OpenClaw architecture. Every agent gets these files at session start. What makes it interesting is what happens when the coworker starts working and you discover all the things you forgot to tell it.
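
To make that concrete, here is a minimal sketch of the injection idea. The four file names are from this post; everything else (the function, the heading format) is illustrative, not OpenClaw's actual code.

```python
from pathlib import Path

# The four onboarding files this post describes. The assembly logic below
# is a sketch of the idea, not OpenClaw's real injection mechanism.
ONBOARDING_FILES = ["SOUL.md", "AGENTS.md", "USER.md", "TOOLS.md"]

def build_system_prompt(workspace: Path) -> str:
    """Concatenate the onboarding files into one system prompt at session start."""
    sections = []
    for name in ONBOARDING_FILES:
        path = workspace / name
        if path.exists():  # a missing file means less context, not a crash
            sections.append(f"## {name}\n\n{path.read_text()}")
    return "\n\n".join(sections)
```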


Skills: Teaching Someone How to Do the Job

Imagine you’re teaching someone to cook a recipe. If you’re teaching a professional chef, you can say “deglaze the pan and reduce.” If you’re teaching someone who’s never cooked, you need to say “pour a splash of wine in the hot pan, scrape the brown bits with a wooden spoon, and let it simmer until it’s half the volume.”

Same dish. Different instructions. Depends on who you’re teaching.

Skills work the same way. They’re step-by-step procedures written for the AI. Some models are like experienced engineers — you give them the goal and they figure out the steps. Others need every single command spelled out with warnings about what NOT to do.

The skill driving the demo is about 350 lines. It says: clone this repo, run these exact commands, use these specific tools to talk to Nautobot, wait for the user between phases, and whatever you do, don't try to build a local git server; the template repos already exist on GitHub.
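
To give the flavor, here is a condensed, hypothetical excerpt in the shape the post describes. The real SKILL.md runs about 350 lines; the heading and wording below are invented.

```
## Register the golden config template repos

The template repos already exist on GitHub. Verify them with the GitHub
MCP tools, then register each one in Nautobot as a git repository data
source. Wait for the user's confirmation before continuing.

Do NOT build a local git server. Do NOT install nginx or fcgiwrap.
The repos are on GitHub. Use them.
```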

That last instruction? I learned it the hard way.


The Mistakes

Every mistake the agent made became a better instruction. Here they are, in the order I discovered them.

The $10 Git Server

The first time I ran the demo on Claude Sonnet, the agent needed to register golden config template repositories in Nautobot. The templates already existed on GitHub. The GitHub MCP server was available. The skill said “register these repos.”

The agent decided to install nginx and fcgiwrap, create bare git repositories, configure git-http-backend, set up virtual hosts, and serve the templates locally. Twenty-nine shell commands. Thirty minutes. Ten dollars in API costs. All to solve a problem that didn’t exist.

The fix was one line in the skill: “The repos are on GitHub. Use them. Do NOT build a local git server.”

That’s the thing about AI agents — they’re creative problem solvers. Sometimes too creative. When a human engineer doesn’t know where the templates are, they ask. When an AI agent doesn’t know, it builds infrastructure.

The curl Habit (46 Wasted Calls)

The agent had 32 purpose-built Nautobot MCP tools available. Tools that handle authentication, pagination, error handling, and return structured data. The agent used curl instead.

It manually constructed authorization headers. It got 403 errors because it forgot the Token prefix. It parsed JSON with jq. It built URLs by hand and got the API paths wrong. Forty-six exec calls doing what four MCP tool calls would have done.

The fix went into both the skill and the operating rules: “If an MCP tool exists for the job, use it. Never use exec, curl, or shell commands to call APIs that have MCP servers.”

This one keeps coming back. Every new model we test has to learn this lesson. The smaller models especially — they know curl from their training data. They don’t know nautobot_get_devices. The skill has to be explicit enough to override the instinct.
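
The Token prefix gotcha is easy to reproduce. Here is a minimal sketch with Python's requests library (the URL and token are placeholders); this is exactly the kind of detail the MCP tools absorb so the agent never has to get it right:

```python
import requests

NAUTOBOT_URL = "https://nautobot.example.com"  # placeholder
TOKEN = "0123456789abcdef0123456789abcdef"     # placeholder

# Wrong: bare token, the mistake the agent kept making. Nautobot returns 403.
bad_headers = {"Authorization": TOKEN}

# Right: Nautobot expects the "Token " prefix on the Authorization header.
good_headers = {"Authorization": f"Token {TOKEN}"}

resp = requests.get(f"{NAUTOBOT_URL}/api/dcim/devices/",
                    headers=good_headers, timeout=10)
resp.raise_for_status()
print(resp.json()["count"])
```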

The Explorer (37 Wasted Calls)

Before executing a single step, the agent spent ten minutes reading every file in the repository. The README. The Dockerfiles. The docker-compose files. The Ansible configs. The Nautobot plugin configs. All information that was already summarized in the skill.

This is the AI equivalent of a new hire who reads the entire company wiki before asking their manager what to do first. Thorough, but expensive when every file read costs tokens.

The fix: “Do NOT explore the repo structure. The skill has all the information you need. Execute the prescribed commands directly.”

The Impatient Poller ($4 in Token Costs)

The Docker image build takes 3-5 minutes. The agent checked progress every 15 seconds. Each check re-sent the full conversation history — 130,000+ tokens at that point. Eighteen polling iterations. Four dollars in API costs just watching a progress bar.

The fix: “Start the build, tell the user to wait, check once when it should be done. NEVER poll builds.”

This is a token economics problem that humans don’t have. When you run docker build in a terminal, watching the output is free. When an AI agent checks on a build, it re-processes the entire conversation to generate the next check command. The cost scales with conversation length, not build duration.
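
The arithmetic is worth seeing. A back-of-the-envelope sketch using the numbers from this run; the per-token price is my assumption, and the observed $4 implies prompt caching was softening the blow:

```python
# Cost of polling a build from inside a long conversation.
context_tokens = 130_000   # history re-sent with every check (from this run)
polls = 18                 # one check roughly every 15 seconds
price_per_mtok = 3.00      # $/1M input tokens, assumed Sonnet-class rate

cost = polls * context_tokens * price_per_mtok / 1_000_000
print(f"${cost:.2f}")      # ~$7 uncached; caching brought it down to ~$4
```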

The Skipped Sub-Step

The skill originally numbered steps as 3a, 3b, 3c, 3d. Step 3d was “assign device credentials to all 20 devices.” Without credentials, golden config can’t SSH to any device. Backup jobs fail. Compliance has nothing to compare against.

The agent consistently skipped 3d. It would do 3a, 3b, 3c, then jump to Step 4. Every model we tested did this. Something about sub-lettered steps makes LLMs treat them as optional.

The fix: renumber everything as flat sequential steps 1 through 20. No sub-letters. You can’t skip Step 9 when it’s between Step 8 and Step 10.

The Container Restart

The agent needed a GitHub token in Nautobot’s environment for git push authentication. It correctly created the secret in Nautobot’s UI. Then it restarted the entire Nautobot stack mid-demo to pick up the environment variable.

This killed all running jobs, dropped the database connections, and required waiting for health checks to pass again. Three minutes of downtime in a 15-minute demo.

The fix: add the token during initial setup, before Nautobot starts. And a new rule: “Never restart Nautobot containers unless this skill explicitly says to.”

Skipping Connectivity

The agent deployed 20 ContainerLab devices and immediately started pushing configurations via Ansible. Half the devices hadn’t finished booting. IOL images take 2-3 minutes to initialize. The Ansible playbook failed on 10 devices, and the agent spent turns debugging connection timeouts that would have resolved themselves.

The fix: mandatory SSH connectivity test before Ansible runs. The skill now has a Step 14 that tests one device from each role — P router, PE router, spine switch, leaf switch. If any fail, wait and retry. Do NOT proceed until all pass.
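
A minimal sketch of that gate, probing TCP/22 for one device per role. The IPs and role names are made up (the real ones live in TOOLS.md), and a port probe is a cheap stand-in for a full SSH login test:

```python
import socket
import time

# Hypothetical sample: one device per role; the real IPs live in TOOLS.md.
PROBES = {
    "P": "172.20.20.11",
    "PE": "172.20.20.21",
    "spine": "172.20.20.31",
    "leaf": "172.20.20.41",
}

def ssh_reachable(host: str, timeout: float = 3.0) -> bool:
    """TCP/22 probe -- a cheap stand-in for a full SSH login test."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

# IOL images take 2-3 minutes to boot: wait and retry, do NOT start Ansible early.
for _ in range(10):
    if all(ssh_reachable(ip) for ip in PROBES.values()):
        break
    time.sleep(30)
else:
    raise RuntimeError("some devices never answered on TCP/22")
```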

The Arista VRF Landmine

Ansible deploy succeeded on all 10 IOS devices and failed on all 10 Arista switches. The generated configs include vrf forwarding clab-mgmt under Management0. But cEOS startup configs already have the management VRF applied. Ansible’s replace: line mode can’t handle the conflict.

The agent’s response was to try SSH heredocs. Then sshpass. Then a bash script that piped commands through SSH with expect-style interaction. Each attempt more baroque than the last, none of them working, all of them burning turns.

The fix was two parts. First, document the known issue in the skill so the agent doesn't try to solve it from first principles. Second, prescribe the fallback: “Use pyATS. It handles EOS enable mode and config sessions correctly. Do NOT attempt to fix this with sshpass or SSH heredocs.”
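
A sketch of that prescribed fallback. The testbed file, device name, and config path are placeholders; load(), connect(), and configure() are standard pyATS calls:

```python
from pathlib import Path
from genie.testbed import load

# Assumes a pyATS testbed file describing the lab (file and device names
# are placeholders; the skill's actual commands aren't reproduced here).
testbed = load("testbed.yml")
device = testbed.devices["demo-leaf-01"]
device.connect(log_stdout=False)  # pyATS deals with EOS enable mode itself

# Drop the line that collides with the management VRF already present in
# the cEOS startup config, then push the rest through a config session.
intended = Path("intended/demo-leaf-01.cfg").read_text()
cleaned = "\n".join(
    line for line in intended.splitlines()
    if "vrf forwarding clab-mgmt" not in line
)
device.configure(cleaned)
device.disconnect()
```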

The Git Push 403 (Two Bugs, One Symptom)

Golden config jobs generated intended configurations correctly but failed pushing them to GitHub. HTTP 403. The agent correctly identified “auth issue” and then entered a debugging spiral — checking token permissions via the GitHub API (they looked fine), testing raw git push from inside the container, trying single-device targets, exploring fork approaches.

The actual root cause was two separate bugs producing the same symptom:

  1. Nautobot’s git credential helper builds https://<username>:<token>@github.com/.... The GitHub Access secrets group only had the token, not the username. Without a username, Nautobot logs HTTP Username not found for secrets group GitHub Access — but that’s a DEBUG-level message, easy to miss.

  2. The GitHub fine-grained PAT was scoped to specific repositories. The golden config repos — templates, intended, backup — weren’t in the allowed list. Even though the token owner owned the repos, push was denied.

Same 403 error. Two different causes. The agent couldn’t fix either one because both required changes outside its reach — adding an environment variable to the Docker environment and changing the PAT scope in GitHub’s UI.

The fix: Step 2 now adds GITHUB_USERNAME=x-access-token to the credentials file. Step 10 creates both a username and token secret in the GitHub Access secrets group. And the progress notes warn that fine-grained PATs must explicitly include all golden config repos.
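
Why the missing username matters is easier to see in code. A sketch of the URL shape the helper assembles; the repo and token below are placeholders, and x-access-token is the literal username the post's Step 2 supplies:

```python
def github_push_url(username: str | None, token: str, repo: str) -> str:
    """Sketch of the credential URL Nautobot's git helper assembles."""
    if not username:
        # The failure mode from this demo: the username secret is missing,
        # Nautobot notes it only at DEBUG level, and the push returns 403.
        raise RuntimeError("HTTP username missing from GitHub Access secrets group")
    return f"https://{username}:{token}@github.com/{repo}.git"

# Placeholder repo and token; GitHub accepts "x-access-token" as the HTTP user.
print(github_push_url("x-access-token", "github_pat_example",
                      "acme/golden-config-intended"))
```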

The SSH Key Exchange

Nautobot’s celery worker container blocks Arista’s SSH key exchange algorithms by default. Every EOS backup job fails with an SSH handshake error. The agent can’t fix this from inside the conversation because it requires injecting an SSH config into a Docker container.

The fix: Step 13 now includes a command that writes an SSH config to the celery worker container, allowing the older key exchange algorithms that cEOS requires.
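
Roughly what that looks like, sketched as a one-shot command. The container name and algorithm list are my assumptions, not the skill's literal Step 13; check what your cEOS image actually negotiates:

```python
import subprocess

# Illustrative only: container name and algorithms are assumptions. Newer
# OpenSSH clients disable the older key exchange and host key algorithms
# that cEOS still offers by default.
SSH_CONFIG = (
    "Host *\n"
    "  KexAlgorithms +diffie-hellman-group14-sha1\n"
    "  HostKeyAlgorithms +ssh-rsa\n"
    "  PubkeyAcceptedAlgorithms +ssh-rsa\n"
)

subprocess.run(
    ["docker", "exec", "nautobot-celery-worker", "sh", "-c",
     f"mkdir -p /root/.ssh && printf '%s' '{SSH_CONFIG}' > /root/.ssh/config"],
    check=True,
)
```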


The Pattern

Every one of these mistakes follows the same pattern:

  1. The agent encounters something unexpected
  2. It tries to solve it from first principles
  3. The solution is creative but wrong (or expensive)
  4. I figure out the right answer
  5. The right answer goes into the skill

This is exactly how you train a human engineer. The first time they encounter a VRF conflict on cEOS, they’ll try to debug it. The second time, they know the workaround. The difference is that a human remembers across sessions automatically. An AI agent needs the lesson written down in a skill, or it’ll make the same mistake next Tuesday.

The skill file grew from 50 lines to 350 lines over the course of a month. Every line after the first 50 exists because something went wrong. The skill isn't just a procedure — it's accumulated operational wisdom encoded as instructions.


What Good Looks Like

When the skill is right and the model follows it, the demo is remarkable. The agent:

  1. Clones the workshop repo and configures credentials (1 command)
  2. Builds and starts Nautobot (2 commands, waits for health)
  3. Runs Design Builder to populate 20 devices with interfaces, IPs, BGP, OSPF (1 MCP call)
  4. Restarts Nautobot so custom fields register in GraphQL (1 command)
  5. Syncs config contexts from a git data source (2 MCP calls)
  6. Creates secrets groups for device credentials and GitHub auth (10 MCP calls)
  7. Assigns credentials to all 20 devices (20 MCP calls)
  8. Deploys ContainerLab with 20 network nodes (1 command)
  9. Connects Nautobot containers to the lab network (1 command)
  10. Verifies SSH connectivity to all device roles (1 command)
  11. Generates and deploys configurations via Ansible (2 commands)
  12. Falls back to pyATS for Arista devices when Ansible fails (10 MCP calls)
  13. Registers golden config git repos with authentication (6 MCP calls)
  14. Configures golden config settings with all repos linked (2 MCP calls)
  15. Runs intended generation, backup, and compliance (3 MCP calls)

The user goes from nothing to a fully operational network lab with source-of-truth compliance monitoring in one conversation. The agent stops at five mandatory checkpoints to let the user verify progress and start fresh sessions if the context window is getting full.

The best run was DeepSeek V4 Flash: 285 MCP tool calls, zero dollars. The worst was Claude Sonnet: 47 MCP calls but 281 exec calls, $15.35. The difference wasn’t the model’s intelligence — it was how well it followed the skill’s instructions about using MCP tools instead of curl.


Mandatory Stops and Session Management

The demo takes 200-500+ tool calls depending on how many things go wrong. No model’s context window survives that in a single session. The context fills up, the framework compacts the conversation (losing detail), and the agent starts forgetting what it already did.

The worst run had 16 compactions. Each one lost context. By the end, the agent was re-discovering issues it had already diagnosed and fixed earlier in the same session.

The fix: mandatory stops at Steps 4, 11, 14, 17, and 20. At each stop, the agent tells the user what’s done and suggests starting a fresh session. The next session picks up with “continue demo from Step 12” and a summary of completed state. No re-exploration. No re-verification. Trust the checkpoint and move forward.

This is session management as an engineering discipline. You wouldn’t run a 500-line bash script without checkpoints. You shouldn’t run a 500-turn AI conversation without them either.
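
The discipline reduces to a small rule. A sketch with made-up numbers; the checkpoint steps are from the skill, the threshold and window size are not:

```python
CONTEXT_LIMIT = 200_000                  # illustrative context window
CHECKPOINT_STEPS = {4, 11, 14, 17, 20}   # the skill's mandatory stops

def suggest_fresh_session(step: int, tokens_used: int) -> bool:
    # Pause at every checkpoint; insist on a new session once the window is
    # more than ~60% full, well before the framework starts compacting.
    return step in CHECKPOINT_STEPS and tokens_used > 0.6 * CONTEXT_LIMIT
```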


The Presentation

The demo is designed to be presented while the agent works. You type the prompt, the agent starts building, and you talk about how it was trained. By the time you’ve explained skills, tools, and the coworker framing, the agent has built the source of truth and is deploying the network.

The audience sees two things simultaneously: the agent doing real work in real time, and the human explaining the engineering behind it. The agent isn’t a demo prop — it’s doing the actual job. If it hits an error, it handles it (or asks for help). That’s more convincing than any slide deck.

The Q&A writes itself:

“What if it makes a mistake?” — Every action is logged in GAIT. For production, all config changes require an approved change request. Destructive operations always require human confirmation.

“How much does it cost?” — With prompt caching on Anthropic, about $2. On Ollama Cloud with open-source models, free. The cost comes from how many times the AI has to think — batching work and taking breaks between phases keeps it low.

“How is this different from Ansible?” — Ansible does exactly what you tell it. If a playbook fails, it stops. NetClaw is a coworker — if something fails, it reads the error, thinks about what went wrong, and tries a different approach. In this demo, Ansible fails on the Arista switches. A pipeline would stop. The coworker switches to pyATS, strips the conflicting lines, and pushes the configs another way.

That last answer is the whole point. The skill gives the agent the judgment to make good decisions when things go wrong. Ansible can’t do that. A pipeline can’t do that. A coworker can.


What I Learned

Training an AI agent is an iterative process. You don’t write the perfect skill on day one. You write a reasonable skill, run it, watch it fail, figure out why, and update the skill. Then you run it again. The skill gets better every iteration because every failure teaches you something you forgot to say.

The mistakes aren’t bugs — they’re training data. The $10 git server taught me to be explicit about what tools to use. The 46 curl calls taught me to reinforce MCP discipline in both the skill and the operating rules. The skipped sub-step taught me that flat numbering is more reliable than hierarchical numbering for LLMs.

The skill file is the most valuable artifact in the project. Not the MCP servers. Not the personality files. The skill. Because the skill is where operational wisdom lives — the accumulated knowledge of what works, what doesn’t, and why. It’s the difference between a coworker who’s been on the team for a day and one who’s been on the team for a month.

And every time the agent makes a new mistake, the skill gets one line better. Just like any team member.


The demo-lab-setup skill is at workspace/skills/demo-lab-setup/SKILL.md. The demo presentation script is at docs/demo-script.md. Session postmortems from all test runs are in docs/session-postmortem-*.md. The demo prompts that drive each phase are at docs/demo-prompts.md.

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →