Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 6: GAIT Hardening, Hardware Dashboards, and Local LLMs

Three things happened after Part 5b. A bug surfaced that meant every Discord-approved pfSense action was silently missing from the audit trail — GAIT was recording the scheduler’s decision side but nothing from the execution side when a human approved via bot. The pfSense dashboard got a significant overhaul with hardware telemetry (CPU, RAM, disk, interface status) pulled in via SNMP. And Phase 6 shipped: Ollama support as an alternative LLM backend for both services, so the whole stack can run without an Anthropic API key.

Code is on the phase6-ollama-provider branch.


Bug: Discord approvals were invisible in GAIT

This one was embarrassing in retrospect. Part 5b described GAIT as recording every step of every session — input, baseline, prompt, proposed action, decision, execution result, verification, outcome. That’s true for the automated scheduler path and for approvals via the REST API. It was not true for /approve and /approve-all in the Discord bot.

The root cause: both commands passed session=None to execute_and_verify(). There was a comment in the code claiming the GAIT session was re-opened inside the executor — it was not. The REST API endpoint at /api/automation/approve/{session_id} had done this correctly from the start; the Discord bot was added later and the pattern wasn’t carried over.

The fix mirrors exactly what the REST endpoint does — open a {session_id}-approved branch before firing the asyncio task:

# discord_bot.py — cmd_approve (after fix)

# Re-open GAIT session for the execution leg (mirrors /api/automation/approve)
session = None
if trail.initialized:
    try:
        session = trail.open_session(ip, f"{session_id}-approved")
        session.record_turn(
            "approval",
            {
                "approved_at": datetime.now(timezone.utc).isoformat(),
                "approved_via": "discord",
                "approved_by": approver,
                "original_session_id": session_id,
            },
        )
    except Exception as exc:
        logger.error("Could not open GAIT session for Discord approval: %s", exc)

asyncio.create_task(
    execute_and_verify(
        session_id, ip, pf_action,
        pending["baseline"],
        pending["threat_data"],
        pending["proposed_action"],
        session,          # ← was None before the fix
    )
)

For /approve-all, the same pattern applies in the bulk loop, with "approved_via": "discord_bulk" to distinguish single from bulk approvals in the audit records:

# discord_bot.py — cmd_approve_all (after fix)

for sid, pending in to_approve:
    ip = pending["ip"]
    # ...build pf_action...

    session = None
    if trail.initialized:
        try:
            session = trail.open_session(ip, f"{sid}-approved")
            session.record_turn(
                "approval",
                {
                    "approved_at": datetime.now(timezone.utc).isoformat(),
                    "approved_via": "discord_bulk",
                    "approved_by": approver,
                    "original_session_id": sid,
                },
            )
        except Exception as exc:
            logger.error("Could not open GAIT session for bulk Discord approval %s: %s", sid, exc)

    asyncio.create_task(execute_and_verify(sid, ip, pf_action, ..., session))

The result is that every human-approved session now produces the same two-branch structure as a REST-approved session:

  • Scheduler branch (automation-{session_id}): turns 00–04 — input, baseline, prompt, proposed action, decision
  • Approved branch (automation-{session_id}-approved): turns 00–03 — approval metadata, execution result, verification, outcome

After deploying the fix, you can verify Discord approvals are being recorded:

git -C /path/to/audit-repo branch -a | grep approved
# automation-20260302-143021-1-2-3-4-approved
# automation-20260302-145500-5-6-7-8-approved

git -C /path/to/audit-repo show automation-20260302-143021-1-2-3-4-approved:sessions/automation-20260302-143021-1-2-3-4-approved/00_approval.json
# {
#   "approved_at": "2026-03-02T14:31:05+00:00",
#   "approved_via": "discord",
#   "approved_by": "byrn",
#   "original_session_id": "20260302-143021-1-2-3-4"
# }

A second fix bundled with this commit: the documentation now explicitly calls out that PFSENSE_SSH_KEY_PATH must reference a path inside the container, not a path on the Docker host. Only the automation-audit volume is mounted by default. If you configure an SSH key path that exists on your host but not inside the container, Paramiko fails silently (it raises FileNotFoundError which is caught and logged, then the SSH path is skipped), and you may not notice because XML-RPC or REST API succeeds first. The troubleshooting section in PHASE5_AUTOMATION_AGENT.md now lists this as Cause B for “approved IPs not appearing in pfSense alias.”
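
If you want the misconfiguration to surface at startup instead of in a troubleshooting doc, a small existence check before the first SSH attempt makes it loud. A minimal sketch, assuming nothing about the repo's structure (the function and logger names here are illustrative):

# startup_checks.py — sketch: fail loudly on a bad PFSENSE_SSH_KEY_PATH
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)

def check_ssh_key_path() -> None:
    key_path = os.environ.get("PFSENSE_SSH_KEY_PATH", "")
    if key_path and not Path(key_path).is_file():
        # Without this check, Paramiko raises FileNotFoundError later,
        # the exception is caught and logged, and the SSH path is skipped.
        logger.error(
            "PFSENSE_SSH_KEY_PATH=%s does not exist inside the container; "
            "mount the key into the container or fix the path", key_path,
        )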


pfSense dashboard: hardware telemetry

The pfSense Firewall Security dashboard was showing firewall event data but nothing about the appliance itself. If pfSense is running hot (CPU at 90%, filesystem nearly full), that’s operationally important — especially for a homelab box where disk space can sneak up on you with logs and state tables.

The expanded dashboard has three collapsible rows:

  • Hardware — uptime, CPU gauge + timeseries, load average (1/5/15 min), RAM usage %, disk utilization % per partition
  • Firewall — all existing panels preserved (blocked events stat with 24h sparkline, top IPs, top ports, etc.)
  • Network — interface summary table (IF-MIB status + MAC), interface IP mapping table, interface throughput timeseries

Getting the hardware data required adding new SNMP OIDs to the collector. The key additions to the snmp/pfsense-fw01 receiver:

# config/otel-collector/config.yaml — new hardware metrics

metrics:
  # CPU utilization — UCD-SNMP-MIB ssCpu*
  system.cpu.user:
    unit: "%"
    gauge:
      value_type: int
    scalar_oids:
      - oid: "1.3.6.1.4.1.2021.11.50.0"   # ssCpuRawUser

  # Load average — laLoadInt (raw × 100, divide by 100 in Grafana)
  system.load.1min:
    unit: "1"
    gauge:
      value_type: int
    scalar_oids:
      - oid: "1.3.6.1.4.1.2021.10.1.5.1"   # laLoadInt.1

  # Memory — UCD-SNMP-MIB memTotalReal / memAvailReal (kilobytes)
  system.memory.total:
    unit: "kBy"
    gauge:
      value_type: int
    scalar_oids:
      - oid: "1.3.6.1.4.1.2021.4.5.0"      # memTotalReal

  # Storage — hrStorage table (HOST-RESOURCES-MIB)
  system.storage.size:
    unit: "{blocks}"
    gauge:
      value_type: int
    column_oids:
      - oid: "1.3.6.1.2.1.25.2.3.1.5"      # hrStorageSize
        attributes:
          - name: storage.description

  system.storage.used:
    unit: "{blocks}"
    gauge:
      value_type: int
    column_oids:
      - oid: "1.3.6.1.2.1.25.2.3.1.6"      # hrStorageUsed
        attributes:
          - name: storage.description

  # Interface operational status — IF-MIB ifOperStatus (1=up, 2=down)
  interface.operational.status:
    unit: "1"
    gauge:
      value_type: int
    column_oids:
      - oid: "1.3.6.1.2.1.2.2.1.8"
        attributes:
          - name: interface.name

A few things that burned time here:

ifPhysAddress is binary, not a string. The SNMP receiver silently drops metrics when an attribute OID returns binary data it can’t serialize. This means adding interface.mac as an attribute simply causes those metrics to disappear — no error, no warning. The MAC address was removed from the receiver config once I tracked down why interface metrics were suddenly missing. If you need MAC addresses, pull them via a separate mechanism (pfSense diagnostics page, or snmpwalk to inspect the raw value).

{blocks} is the correct unit for hrStorage metrics. Early versions of the dashboard queries used system_storage_size_1, on the assumption that a _1 suffix is appended when a metric's unit is the bare "1". The actual metric name is system_storage_size with a unit label of {blocks}: no suffix is appended for units written in curly-brace annotation.

RAM queries need the actual hrStorageDescr value. The correct filter is storage_description="Real Memory" (the actual string pfSense reports in hrStorageDescr). Using "RAM" or "Physical Memory" returns no data.

Disk utilization explodes in cardinality without a filter. pfSense’s hrStorage table includes every UMA zone (kernel memory allocator regions), which adds 700+ time series to the storage metrics. The disk utilization panel query needs storage_description!~"UMA:.*" to exclude these:

# Grafana — disk utilization % (excluding UMA kernel zones)
(
  system_storage_used{device_name="pfSense-FW01", storage_description!~"UMA:.*"}
  /
  system_storage_size{device_name="pfSense-FW01", storage_description!~"UMA:.*"}
) * 100

After clearing those issues, the new metrics were confirmed in VictoriaMetrics (the grep below only matches the system_-prefixed names, so the interface metrics don’t appear in it):

curl -s 'http://localhost:8428/api/v1/label/__name__/values' \
  | python3 -m json.tool | grep system_
# "system_cpu_load_ratio"
# "system_storage_size"
# "system_storage_used"
# "system_storage_allocation_units_bytes"
# "system_uptime"

Phase 6: Ollama as an alternative LLM backend

Up to this point, both the threat-intel narrative generator and the automation-agent action proposer were hard-wired to the Anthropic API (Claude Haiku). Phase 6 makes the provider configurable at runtime via a single environment variable.

# .env — switch to Ollama
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://192.168.1.50:11434
OLLAMA_MODEL=qwen3.5:9b

No rebuild needed — the provider is read from settings at startup.
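
Under the hood that is just a few settings fields. A minimal sketch of the configuration surface, assuming pydantic-settings (only the three environment variables above are confirmed; the class shape is illustrative):

# config.py — sketch of the provider settings, assuming pydantic-settings
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    llm_provider: str = "anthropic"    # LLM_PROVIDER: "anthropic" or "ollama"
    anthropic_api_key: str = ""        # ANTHROPIC_API_KEY
    ollama_base_url: str = ""          # OLLAMA_BASE_URL
    ollama_model: str = "llama3.2:3b"  # OLLAMA_MODEL (llama3.2:3b is the config default)

settings = Settings()  # values are read from the environment at startup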

The _call_llm() dispatcher

The implementation in both claude_client.py and claude_action.py follows the same pattern: a single async helper that dispatches to the right backend and returns a normalized (text, prompt_tokens, completion_tokens) tuple. Callers are completely provider-agnostic:

# Both services — _call_llm() dispatcher

async def _call_llm(prompt: str, max_tokens: int) -> tuple[str, int, int]:
    """Returns (response_text, prompt_tokens, completion_tokens)."""
    provider = settings.llm_provider

    if provider == "anthropic":
        import anthropic
        client = anthropic.AsyncAnthropic(api_key=settings.anthropic_api_key)
        message = await client.messages.create(
            model=_ANTHROPIC_MODEL,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text, message.usage.input_tokens, message.usage.output_tokens

    if provider == "ollama":
        import httpx
        url = f"{settings.ollama_base_url}/api/chat"
        payload = {
            "model": settings.ollama_model,
            "think": False,        # suppress reasoning chain on thinking models
            "stream": False,
            "options": {"num_predict": max_tokens},
            "messages": [{"role": "user", "content": prompt}],
        }
        async with httpx.AsyncClient(timeout=120.0) as http:
            resp = await http.post(url, json=payload)
            resp.raise_for_status()
            data = resp.json()
        return (
            data["message"]["content"],
            data.get("prompt_eval_count", 0),
            data.get("eval_count", 0),
        )

    raise RuntimeError(f"Unknown llm_provider: {provider!r}")

Token counts map cleanly:

| Field             | Anthropic                   | Ollama native     |
| ----------------- | --------------------------- | ----------------- |
| Prompt tokens     | message.usage.input_tokens  | prompt_eval_count |
| Completion tokens | message.usage.output_tokens | eval_count        |

The propose_action() and generate_narrative() callers add "provider" and "model" to their return dicts, which flow into the GAIT audit records. So you can look at any 03_proposed_action.json and see which backend made that specific decision.
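
As an illustration, a proposal payload might look like this (a hypothetical shape; only the provider and model keys are confirmed additions):

# illustrative shape of a propose_action() result — only "provider" and
# "model" are confirmed keys; the rest is assumed for the example
proposal = {
    "type": "block_ip",
    "provider": "ollama",
    "model": "qwen3.5:9b",
    # ...the rest of the proposed-action payload...
}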

Why the native /api/chat endpoint, not the OpenAI-compat shim

Ollama ships with an OpenAI-compatible endpoint at /v1/chat/completions. The initial implementation used this — it’s less code, and it would have let us use the openai Python library. But it produced a maddening failure mode with Qwen3-family models.

The symptom: propose_action() returned {"type": "no_action", "reason": "empty_response"} on every call. Adding debug logging revealed the model was responding, but content was an empty string. All the tokens were being consumed by something else.

The root cause: Qwen3 and other “thinking” models have an internal reasoning chain. By default they run this chain first, outputting thousands of tokens of internal reasoning, and then output the actual answer. Ollama’s native API has a "think": false parameter that suppresses this — the reasoning chain is skipped entirely and the model’s answer goes straight to content. The OpenAI-compat shim does not forward this parameter to the underlying model runtime. It ignores it silently.

The concrete failure looked like this:

# OpenAI-compat (/v1/chat/completions) — think parameter ignored
finish_reason: length
content:       ""
reasoning:     "Thinking Process:\n\nOkay, I need to analyze this IP..."
               (2000 tokens of reasoning, no answer)

# Native (/api/chat) — think: false honoured
content:       '{\n  "risk_level": "high",\n  "executive_summary": "..."}'
thinking:      ""
prompt_eval_count: 4075
eval_count:        970

The fix is using httpx directly against /api/chat with "think": false. This works correctly for thinking models and is harmless for non-thinking models (Llama 3.x, Mistral) — think is simply ignored if the model doesn’t support it.

The 120-second httpx timeout is necessary. Cold-starting a 9B parameter model on first request can take 30–60 seconds. Subsequent requests are fast once the model is loaded into VRAM.
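
If that first-request latency matters, one option is to pay it at service startup with a tiny warm-up call against the same /api/chat contract used above. A sketch (the helper name is made up):

# warm-up sketch — force the model into VRAM at startup so the first
# real request doesn't absorb the cold-start penalty
import httpx

async def warm_up_ollama() -> None:
    payload = {
        "model": settings.ollama_model,
        "think": False,
        "stream": False,
        "options": {"num_predict": 1},  # one token is enough to trigger a load
        "messages": [{"role": "user", "content": "ping"}],
    }
    async with httpx.AsyncClient(timeout=120.0) as http:
        await http.post(f"{settings.ollama_base_url}/api/chat", json=payload)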

host.docker.internal on Linux

Docker Desktop automatically resolves host.docker.internal to the host machine, but the Docker Engine on Linux does not. To allow the container to reach an Ollama instance running on the host (or anywhere reachable from the host), add extra_hosts to docker-compose.yml:

# docker-compose.yml — added to threat-intel and automation-agent
services:
  threat-intel:
    extra_hosts:
      - "host.docker.internal:host-gateway"
  automation-agent:
    extra_hosts:
      - "host.docker.internal:host-gateway"

If you’re pointing OLLAMA_BASE_URL at a static LAN IP (e.g. http://192.168.1.50:11434) rather than host.docker.internal, this entry is harmless but unnecessary.

Gate logic

Neither service will attempt an LLM call if the required configuration is absent:

LLM_PROVIDER=anthropic  →  skips if ANTHROPIC_API_KEY is empty
LLM_PROVIDER=ollama     →  skips if OLLAMA_BASE_URL is empty

On skip, threat-intel returns {"narrative": {"available": false}} and automation-agent returns {"type": "no_action", "reason": "OLLAMA_BASE_URL not set"}. The rest of the pipeline continues normally — threat scoring, GAIT recording, Discord notifications all still happen. The LLM step degrades gracefully rather than blocking everything.
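
The gate itself is a few lines per provider. Roughly (a sketch; the helper name is illustrative):

# gate sketch — skip the LLM call when the provider's required setting is absent
def llm_configured() -> bool:
    if settings.llm_provider == "anthropic":
        return bool(settings.anthropic_api_key)
    if settings.llm_provider == "ollama":
        return bool(settings.ollama_base_url)
    return False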

Model recommendations

Tested with qwen3.5:9b on an external Ollama host on the LAN. It produced a complete threat narrative (4000+ input tokens, ~970 completion tokens) with valid JSON and correct risk_level, top_threats, and recommended_actions on the first attempt. Some observations:

| Model       | Size    | VRAM  | Notes |
| ----------- | ------- | ----- | ----- |
| qwen3.5:9b  | 9.7B Q4 | ~6 GB | Tested and confirmed. Strong JSON adherence. Requires native API for think: false. |
| llama3.2:3b | 3B      | ~2 GB | Default in config. Good for CPU-only hosts. No thinking mode. |
| llama3.1:8b | 8B      | ~5 GB | Good balance on CPU/GPU. No thinking mode; works with both endpoints. |
| mistral:7b  | 7B      | ~4 GB | Reliable JSON output. No thinking mode. |

The prompts in both services instruct the model to output only valid JSON with no markdown fences. Models with strong instruction-following work best. If you see json_parse_error in the logs, the model wrapped its JSON in backtick fences — the code handles this with a fence-stripper, but a model that consistently wraps JSON despite being told not to will be noisier.
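
The fence-stripper amounts to a few string operations. A sketch of the idea, not the repo's exact code:

# fence-stripper sketch — unwrap JSON a model insisted on fencing
def strip_fences(text: str) -> str:
    text = text.strip()
    if text.startswith("```"):
        first_newline = text.find("\n")
        if first_newline != -1:
            text = text[first_newline + 1:]  # drop the ``` or ```json line
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]        # drop the closing fence
    return text.strip()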


Architecture diagram

All six phases are now represented in a single Excalidraw diagram (docs/images/convergence-architecture.excalidraw) with a rendered PNG alongside it. The diagram organizes the stack into five zones with color-coded data flow arrows:

  • Zone A — Network Infrastructure: Internet/WAN, pfSense, Cisco switches, Nautobot
  • Zone B — Collection & Storage: OTEL Collector, VictoriaMetrics, Loki, Redis
  • Zone C — AI Services: threat-intel (:8001), automation-agent (:8002)
  • Zone D — Integrations: AbuseIPDB, GreyNoise, OTX, IPInfo; Claude API and Ollama
  • Zone E — Outputs: Discord alerts, GAIT audit trail, pfSense block actions

Arrow colors: blue = infrastructure data flow, purple = AI/LLM calls, orange = threat intel API queries, green = notifications and reports, red = block actions reaching back to pfSense.

The README now shows this diagram in place of the old ASCII art.


Bugs and hard-won lessons

GAIT silent gap on Discord approvals. The lesson here mirrors the alert spam bug from Part 5b: any code path that produces an external side effect (pfSense change, Discord notification) needs to be explicitly paired with all its bookkeeping. When the Discord bot was added as a second approval path alongside the REST API, the GAIT recording pattern wasn’t carried over. The fix was three lines of import and a twelve-line GAIT block in each command. The deeper lesson: when you have two code paths that are supposed to behave identically, make them call the same function rather than duplicating logic — the REST endpoint and Discord bot should have shared a single execute_with_audit(pending) helper from the start.
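
A sketch of that shape, with both approval paths funnelling through one function (execute_with_audit and build_pf_action are the hypothetical refactor named above, not code in the repo):

# hypothetical refactor — one helper shared by the REST endpoint and Discord bot
async def execute_with_audit(session_id, pending, approved_via, approved_by):
    session = None
    if trail.initialized:
        session = trail.open_session(pending["ip"], f"{session_id}-approved")
        session.record_turn("approval", {
            "approved_at": datetime.now(timezone.utc).isoformat(),
            "approved_via": approved_via,  # "rest", "discord", or "discord_bulk"
            "approved_by": approved_by,
            "original_session_id": session_id,
        })
    await execute_and_verify(
        session_id, pending["ip"], build_pf_action(pending),  # build_pf_action is also hypothetical
        pending["baseline"], pending["threat_data"],
        pending["proposed_action"], session,
    )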

ifPhysAddress silently drops metrics. The OTEL SNMP receiver serializes attribute values as strings. Binary OIDs (like ifPhysAddress, which is a 6-byte MAC address) fail serialization, and when an attribute fails, the entire metric it’s attached to is dropped — not just the attribute. The failure is silent (no log warning at DEBUG level in the receiver). The debugging approach that eventually worked: comment out attributes one at a time and restart, watching VictoriaMetrics for which metrics reappear. A unit test that queries VictoriaMetrics for each expected metric name after a collector restart would have caught this in 30 seconds.
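
That test is cheap to write against the same label-values endpoint used for verification above. A sketch (the expected-names list is illustrative; extend it per metric):

# test_metrics_present.py — sketch: assert expected metrics exist after a restart
import requests

EXPECTED = ["system_storage_size", "system_storage_used", "system_uptime"]

def test_expected_metrics_present():
    resp = requests.get(
        "http://localhost:8428/api/v1/label/__name__/values", timeout=10
    )
    names = resp.json()["data"]
    missing = [m for m in EXPECTED if m not in names]
    assert not missing, f"metrics missing from VictoriaMetrics: {missing}"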

Thinking models and the OpenAI shim. This is a class of bug that will hit more people as thinking models become mainstream. If you’re using Ollama with any Qwen3 or DeepSeek-R1 variant through the OpenAI-compat endpoint and getting empty responses, the shim is eating your think parameter. Switch to the native /api/chat endpoint and add "think": false. The fix is trivial; finding it cost several hours of staring at token counts.


Wrapping up

The GAIT fix is the most important change in this batch. An audit trail that silently skips human-initiated actions isn’t an audit trail — it’s a record of what the algorithm did, which misses exactly the decisions you most want documented. The fix is simple and the trail is now complete for all three approval paths: auto-approve, REST API, and Discord bot.

The Ollama provider is the most significant architectural shift. The stack can now run entirely on-premises with no outbound API calls if you have a machine with 6+ GB of VRAM to run a reasonably capable model. For a homelab where the goal is to keep data local, that matters.

The hardware dashboard panels are more immediately practical than they might seem. In the week after deploying them I noticed the pfSense appliance’s disk was at 78% — state table logs and pfBlockerNG feeds accumulating. That would have gone unnoticed without a storage utilization panel.

Looking forward: the dynamic baseline work mentioned at the end of Part 5b is still on the list — using VictoriaMetrics’ outlier_iqr_over_time() to replace fixed-threshold alerts with alerts that adapt to each IP’s historical traffic pattern. That’s Phase 7.

Ideas or homelab war stories? Find me on X @byrn_baker or LinkedIn @byrnbaker.

Code: https://github.com/byrn-baker/Convergence

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →