Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 2: Laying the Infrastructure Foundation


Welcome back. If you followed along with Part 1, you already know the high-level vision: observability-first, nautobot as the single source of truth, OpenTelemetry Collector as the central telemetry nervous system, VictoriaMetrics for efficient long-term storage, and Grafana/Loki for human + alerting consumption.

The repo is now live and actually doing work — monitoring real Cisco switches via SNMP and pfSense syslog with log-to-metrics conversion, metadata enrichment from nautobot, and several pre-built dashboards. Since the README already contains a solid quickstart, installation steps, validation commands, and basic architecture overview, I’m not going to repeat that here.

Instead, let’s go deeper into the why behind the current structure and decisions — the kind of reasoning that isn’t always obvious from the README or docker-compose file itself. This post is aimed at people who want to understand (or extend) the system, not just run it.

Why This Folder Structure?

The current layout is deliberately opinionated to support three main goals:

  1. Separation of concerns — config, dashboards, scripts, and documentation live in their own top-level directories so each can grow independently without polluting the root.
  2. Easy provisioning & gitops-friendliness — Grafana, Loki, VictoriaMetrics, and the OTEL Collector all support file-based provisioning. Keeping configs in config/ makes it trivial to mount them read-only and version-control changes.
  3. Future extensibility — especially for AI/automation phases where we’ll need custom processors, evaluation scripts, prompt templates, agent definitions, etc.

Current structure (as of main branch):

.
├── config/                        # All runtime configuration (mounted into containers)
│   ├── otel-collector/
│   │   ├── config.yaml            # Main collector config (includes or imports others)
│   │   └── receivers/
│   │       └── home-lab.yaml      # Per-environment or per-device receiver defs
│   ├── victoriametrics/
│   │   └── prometheus.yml         # scrape_config if using vmagent, or just reference
│   ├── loki/
│   │   └── local-config.yaml      # Loki storage & ingestion rules
│   └── grafana/
│       └── provisioning/          # datasources.yaml, dashboards/ as subfolders
├── dashboards/                    # All .json dashboard definitions (provisioned into Grafana)
│   ├── unified/                   # Core network views
│   └── pfsense-firewall-security.json
├── scripts/                       # Operational & bootstrap scripts
│   ├── nautobot_device_discovery.py  # Core automation piece — pulls nautobot → generates OTEL config snippets
│   └── setup-geoip.sh             # Downloads & prepares MaxMind DB
├── docs/                          # Long-form explanations, architecture decisions, how-tos
│   ├── PROJECT_STATUS.md
│   ├── NAUTOBOT_ENRICHMENT.md
│   ├── FIREWALL-SECURITY-DASHBOARD.md
│   └── quickstart/
├── data/                          # Persistent data directories (GeoIP, future model weights, etc.)
│   └── geoip/
├── convergence/                   # (likely future Python package / application code)
├── .github/workflows/             # CI/CD (validation, linting, maybe image builds later)
├── .env.example                   # Template for secrets & tunable parameters
├── docker-compose.yml             # Single source of truth for local/dev stack
├── Makefile                       # Developer convenience (up, down, validate, logs, etc.)
└── pyproject.toml                 # Project metadata + future tool dependencies (ruff, mypy, etc.)

Rationale

  • config/ is mounted read-only into containers → config changes don’t require rebuilding images.
  • receivers/home-lab.yaml exists as a separate file because we want to be able to regenerate just the receiver section from nautobot without touching the rest of the collector config.
  • dashboards/unified/ groups related dashboards together so it’s easier to maintain a “product” view later (e.g., “core”, “access”, “wan”, “security”).
  • scripts/ keeps Python/shell glue code separate from config — these are the moving parts that talk to nautobot, generate configs, validate the stack, etc.
  • docs/ contains the kind of long explanations that don’t belong in README (e.g. how the log-to-metrics count processor is tuned for pfSense, or how GeoIP enrichment works in OTEL).
  • Root-level .env.example + Makefile make onboarding fast while still allowing power users to override everything.

This is not the final structure — expect new top-level folders like agents/, prompts/, models/, evaluation/ when we hit the AI phases.
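To make the "regenerate just the receivers" idea concrete, here's a minimal sketch of what a script like scripts/nautobot_device_discovery.py might emit. The function names and the shape of the nautobot device record are my assumptions for illustration, not the repo's actual code:

```python
# Hypothetical sketch: turn nautobot device records into an OTel SNMP
# receiver snippet (receivers/home-lab.yaml). Field names are illustrative.

def render_snmp_receiver(device: dict, community_env: str = "SNMP_COMMUNITY") -> str:
    """Render one device-specific SNMP receiver block as YAML text."""
    name = device["name"].lower()
    return (
        f"  snmp/{name}:\n"
        "    collection_interval: 60s\n"
        f"    endpoint: \"udp://{device['primary_ip']}:161\"\n"
        "    version: v2c\n"
        f"    community: ${{env:{community_env}}}\n"  # secret stays in .env
    )

def render_receivers(devices: list[dict]) -> str:
    """Assemble the full receivers: section from all discovered devices."""
    return "receivers:\n" + "".join(render_snmp_receiver(d) for d in devices)

print(render_receivers([{"name": "HomeSwitch01", "primary_ip": "192.168.3.2"}]))
```

Because the output is a standalone file, the script can rewrite it atomically without ever touching the hand-maintained parts of the collector config.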

docker-compose.yml

Here are the most important non-obvious choices in the current compose file:

  1. Single OTEL Collector instance with modular config. Instead of one collector per device type or per site, we run a single collector whose config pulls in device-specific receiver definitions from generated files.

Why?

  • Much lower resource usage than 5–20 tiny collectors
  • Easier TLS / auth management (only one endpoint to secure)
  • Simpler service discovery & health checking
  • But still allows per-device tuning via generated snippets

Trade-off: a collector restart is needed after config regeneration (for now). Later phases can use the filewatch extension or config hot-reload via API (when OTEL supports it better).

  2. VictoriaMetrics
    • -retentionPeriod=365d (or 1y via env) — network baselines need long history
    • Lower memory & disk footprint than Prometheus, with similar query performance
    • Native multi-tenancy support (useful when we add synthetic monitoring or application metrics later)
    • vmagent can be added later for remote_write fan-out or extra scraping
  3. Loki + Promtail / OTEL syslog → count processor. We’re ingesting syslog via the OTEL syslog receiver, then using transform/count processors to turn certain log lines (especially pfSense firewall) into metrics.

  4. Grafana provisioning instead of manual dashboard import. All dashboards are JSON files in dashboards/ and provisioned automatically via grafana/provisioning/dashboards/.

  5. .env + .env.example pattern. Secrets (nautobot token, SNMP community, MaxMind key, Grafana admin password) are never committed. Everything tunable (retention, ports, URLs) goes through .env.

What’s Actually Working Right Now

  • SNMP polling of Cisco interfaces with real ifDescr names (not just ifIndex)
  • nautobot → OTEL metadata enrichment (site, role, model, vendor tags on every metric)
  • pfSense syslog → Loki + log-to-metrics (count blocked connections by source country via GeoIP)
  • Four core dashboards already provisioned and querying real data
  • 1-year retention without crazy disk usage (thanks to VictoriaMetrics compression)

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework under the Cloud Native Computing Foundation (CNCF). It’s designed to standardize how we collect, process, and export telemetry data—metrics, logs, and traces—from applications and infrastructure. Born from the merger of OpenCensus and OpenTracing in 2019, OTel aims to solve the fragmentation in observability: no more proprietary agents for each vendor or tool.

In a nutshell, OTel gives you a single pipeline to ingest “signals” (metrics/logs/traces), enrich them, and ship them to backends like VictoriaMetrics (for metrics) or Loki (for logs). For networks, this is gold: devices spit out SNMP polls, syslog messages, or streaming telemetry, and OTel normalizes it all.

The Core Mechanics

At its heart, OTel revolves around the OTel Collector—a binary (or Docker image) that acts as an agent or gateway. It’s not a database; it’s a processor. You configure it via YAML to define pipelines that chain together four main building blocks:

  1. Receivers: These are the ingress points. They “receive” telemetry from sources, either by pulling (e.g., polling SNMP every 60s) or pushing (e.g., devices sending syslog over UDP). Receivers convert raw data into OTel’s internal format.

Examples:

  • snmp: Polls OIDs like interface counters (1.3.6.1.2.1.2.2.1.10).
  • syslog: Listens on port 514 for RFC 3164/5424 messages, parsing them into structured logs.
  2. Processors: The transformation layer. Processors modify data in flight: filtering noise, adding attributes, or converting formats.
    • attributes: Inserts or updates labels (e.g., adding device.role="core" from external metadata).
    • transform: Uses OTTL (OTel Transformation Language) for complex ops like regex parsing on logs.
    • batch: Groups data for efficient export.
    • memory_limiter: Prevents OOM by throttling.
  3. Exporters: The egress. They send processed data to destinations. For metrics, prometheusremotewrite pushes to Prometheus-compatible stores like VictoriaMetrics via HTTP.
  4. Connectors: Special components that link pipelines (e.g., turn logs into metrics by counting events).

Pipelines tie it all together: You define one or more per signal type (metrics, logs, traces). Data flows sequentially: receiver → processors → exporter.

For example: A metric pipeline might receive SNMP, process with attributes/batch, export to VictoriaMetrics. Multiple pipelines prevent cross-contamination (e.g., one per device).
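The receiver → processors → exporter flow can be modeled as simple function chaining. This is a toy illustration of the data flow only, not the collector's actual Go internals:

```python
# Toy model of an OTel pipeline: each record flows receiver -> processors -> exporter.
from typing import Callable

Signal = dict  # one metric or log record, with an "attributes" label map

def enrich(extra: dict) -> Callable[[Signal], Signal]:
    """A stand-in for an attributes-style processor that adds labels."""
    def process(signal: Signal) -> Signal:
        signal["attributes"].update(extra)
        return signal
    return process

def run_pipeline(received: list[Signal],
                 processors: list[Callable[[Signal], Signal]],
                 exported: list[Signal]) -> None:
    """Push every received signal through each processor in order, then export."""
    for signal in received:
        for proc in processors:
            signal = proc(signal)
        exported.append(signal)

backend: list[Signal] = []  # stand-in for VictoriaMetrics
run_pipeline(
    [{"name": "interface.in.octets", "attributes": {"interface.name": "Gi1/0/1"}}],
    [enrich({"device.site": "House"})],
    backend,
)
```

Keeping separate `backend` lists per pipeline is the same isolation trick the real config uses: one pipeline per device means one device's labels can never leak onto another's metrics.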

Under the hood:

  • Data Model: Metrics are gauges/counters/histograms with timestamps and attributes (labels). Logs are bodies with attributes and timestamps.
  • Transport: Often OTLP (OTel Protocol) over gRPC/HTTP, but receivers handle legacy like SNMP.
  • Extensions: Add-ons like health checks or file watchers (e.g., for config reloads).
  • Performance: Go-based, lightweight; handles millions of data points/sec with proper tuning.

If something goes wrong, OTel exposes its own metrics (at :8888) and health (:13133) for monitoring the monitor.

How I’m Using OpenTelemetry in Convergence

In Convergence, OTel is the telemetry backbone—ingesting from network devices, enriching with nautobot metadata, converting logs to metrics where useful, and exporting to VictoriaMetrics/Loki. It’s deployed via Docker in docker-compose.yml, using the otel/opentelemetry-collector-contrib image. Config lives in config/otel-collector/config.yaml, with device-specific parts generated by scripts/nautobot_device_discovery.py.

Config Breakdown: Receivers, Processors, Exporters, and Pipelines

The main config.yaml includes generated snippets for modularity. Here’s how it’s structured, with real examples from the repo.

Receivers: Ingesting Network Data

We use device-specific receivers to avoid shared state issues. For SNMP (polling Cisco/pfSense every 60s):

receivers:
  snmp/homeswitch01:  # Device-specific
    collection_interval: 60s
    endpoint: "udp://192.168.3.2:161"  # From nautobot
    version: v2c
    community: ${env:SNMP_COMMUNITY}  # Secure via .env
    attributes:
      interface.name:
        oid: "1.3.6.1.2.1.2.2.1.2"  # ifDescr for real names like "GigabitEthernet1/0/1"
        indexed_value_prefix: ""
    metrics:
      system.uptime:
        unit: "s"
        gauge:
          value_type: int
        scalar_oids:
          - oid: "1.3.6.1.2.1.1.3.0"
      interface.in.octets:  # Cumulative bytes in
        unit: "By"
        sum:
          aggregation: cumulative
          monotonic: true
          value_type: int
        column_oids:
          - oid: "1.3.6.1.2.1.2.2.1.10"
            attributes:
              - name: interface.name
      # Similar for out.octets, in.errors, out.errors

This polls OIDs for uptime and per-interface stats. The attributes block fetches human-readable interface names: no more cryptic ifIndex.
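Conceptually, the receiver joins the counter column with the ifDescr column on the shared ifIndex. A toy sketch with mocked walk results (no real SNMP client involved):

```python
# Illustrative: how the receiver's `attributes` block joins per-interface
# counters (keyed by ifIndex) with human-readable ifDescr names.
# The walk results below are mocked; a real poll would come from SNMP.

if_descr = {1: "GigabitEthernet1/0/1", 2: "GigabitEthernet1/0/2"}  # 1.3.6.1.2.1.2.2.1.2
if_in_octets = {1: 1_234_567, 2: 89_012}                           # 1.3.6.1.2.1.2.2.1.10

def label_counters(descr: dict, counters: dict) -> list[dict]:
    """Emit one metric point per table row, labeled with interface.name."""
    return [
        {"metric": "interface.in.octets",
         "value": value,
         # fall back to the raw index if no ifDescr entry exists
         "attributes": {"interface.name": descr.get(index, str(index))}}
        for index, value in counters.items()
    ]

points = label_counters(if_descr, if_in_octets)
```

The result is that every exported data point carries `interface.name="GigabitEthernet1/0/1"` style labels, which is exactly what makes Grafana legends readable.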

For syslog (from pfSense):

receivers:
  syslog/udp:
    udp:
      listen_address: "0.0.0.0:514"
    protocol: rfc3164
    operators:
      - type: regex_parser  # Parse pfSense filterlog
        if: 'attributes.appname == "filterlog"'
        regex: '^(?P<rule_num>\d+),[^,]*,[^,]*,(?P<tracker>\d+),(?P<interface>[^,]+),...'  # Extracts src_ip, dst_ip, action=block/pass, proto_name, etc.
        parse_from: attributes.message
        parse_to: attributes
      - type: add
        if: 'attributes.appname == "filterlog"'
        field: attributes.log_type
        value: "firewall"
  # TCP variant similar

This listens for UDP/TCP syslog, using inline operators (a mini-processor) to parse pfSense’s comma-separated format into structured attributes.
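For intuition, here is a simplified stand-alone version of that regex parse in Python. The capture groups mirror the leading fields of the truncated regex above, and the sample line is invented for illustration:

```python
import re

# Simplified sketch of the regex_parser operator. The real pfSense filterlog
# line has many more positional fields; this captures only the leading ones.
FILTERLOG_RE = re.compile(
    r'^(?P<rule_num>\d+),(?P<sub_rule>[^,]*),(?P<anchor>[^,]*),'
    r'(?P<tracker>\d+),(?P<interface>[^,]+),(?P<reason>[^,]+),'
    r'(?P<action>block|pass),(?P<direction>in|out)'
)

def parse_filterlog(message: str) -> dict:
    """Return named fields as a flat dict, or {} when the line doesn't match."""
    m = FILTERLOG_RE.match(message)
    return m.groupdict() if m else {}

# Invented example line in the filterlog comma-separated style
attrs = parse_filterlog("5,,,1000000103,igb0,match,block,in,4")
```

Pushing this work into receiver operators means every downstream processor sees structured attributes instead of a raw CSV string.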

Processors: Enriching and Transforming

Enrichment is key: we pull metadata from nautobot for context.

Device-specific attributes processor:

processors:
  attributes/homeswitch01:
    actions:
      - key: device.name
        value: "HomeSwitch01"  # From nautobot
        action: insert
      - key: device.ip
        value: "192.168.3.2"
        action: insert
      - key: device.vendor
        value: "Cisco"
        action: insert
      # Also model="WS-C3850-48P", role="home_switch", site="House"

This inserts labels on every metric/log, turning generic data into contextual gold (e.g., query by device_site="House" in Grafana).
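One subtlety worth noting: the insert action only sets a key when it is absent, per the attributes processor's documented semantics. A quick illustration of insert vs. update vs. upsert (the code is my own sketch, not collector internals):

```python
def apply_attribute_action(attrs: dict, key: str, value, action: str) -> dict:
    """Illustrate the attributes processor's insert/update/upsert semantics."""
    if action == "insert" and key not in attrs:
        attrs[key] = value   # only when the key is absent
    elif action == "update" and key in attrs:
        attrs[key] = value   # only when the key already exists
    elif action == "upsert":
        attrs[key] = value   # unconditionally
    return attrs

labels = {"device.name": "already-set"}
apply_attribute_action(labels, "device.name", "HomeSwitch01", "insert")  # no-op
apply_attribute_action(labels, "device.vendor", "Cisco", "insert")       # added
```

This matters when upstream parsing has already set a label: insert will never clobber it, which is why it's the safe default for generated enrichment snippets.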

For Syslog:

processors:
  transform/syslog_enrichment:
    log_statements:
      - context: log
        conditions:
          - attributes["appname"] == "filterlog"
        statements:
          - set(attributes["device.name"], "pfSense-FW01")
          # Adds vendor, role, site similarly

We also use a count connector (as a receiver in a derived pipeline) to turn parsed logs into metrics like firewall_events{action="block", proto_name="TCP"}. Standard processors like batch (for efficiency) and memory_limiter (throttle to 500MiB) are applied globally.
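A toy version of what the count connector does with those parsed logs: group by label set and emit one counter per combination. This is illustrative only; the real connector emits OTel metrics, not a Python Counter:

```python
from collections import Counter

# Toy count connector: one counter per (action, proto_name) pair, analogous to
# firewall_events{action="block", proto_name="TCP"}.
def count_firewall_events(logs: list[dict]) -> Counter:
    """Group parsed firewall log records by their label values and count them."""
    return Counter((log["action"], log["proto_name"]) for log in logs)

logs = [
    {"action": "block", "proto_name": "TCP"},
    {"action": "block", "proto_name": "TCP"},
    {"action": "pass", "proto_name": "UDP"},
]
events = count_firewall_events(logs)
```

Raw logs still flow to Loki untouched; the counts are a cheap, cardinality-bounded summary that VictoriaMetrics can keep for a full year.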

Exporters: Shipping to Backends

exporters:
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
    tls:
      insecure: true  # Dev only

This pushes metrics to VictoriaMetrics in Prometheus format. For logs, we export to Loki or files.

Pipelines: The Data Flow

Pipelines isolate per device/signal:

service:
  pipelines:
    metrics/homeswitch01:  # Per-device for SNMP
      receivers: [snmp/homeswitch01]
      processors: [memory_limiter, attributes/homeswitch01, resource, batch]
      exporters: [prometheusremotewrite, debug]
    # Similar for other devices
    logs/syslog:
      receivers: [syslog/udp, syslog/tcp]
      processors: [memory_limiter, transform/syslog_enrichment, resource, batch]
      exporters: [file/syslog, count/firewall, debug]  # count/firewall derives metrics
    metrics/firewall:  # Derived from logs
      receivers: [count/firewall]
      processors: [memory_limiter, deltatocumulative/firewall, resource, batch]
      exporters: [prometheusremotewrite, debug]

Flow example for SNMP:

  1. Receiver polls OIDs → raw metrics.
  2. Processors enrich with nautobot labels, batch.
  3. Exporter remote-writes to VictoriaMetrics (stored with 1-year retention).

For syslog (pfSense firewall):

  1. Receiver ingests/parses logs → structured logs with src_ip, action, etc.
  2. Processors enrich with nautobot/GeoIP.
  3. count connector → metrics like block rates.
  4. Metrics exported to VictoriaMetrics; raw logs to Loki.

Lessons learned

  • Device-specific pipelines fixed label bleed in shared setups.
  • Metric name changes: OTel appends units (e.g., _bytes_total)—test queries early.
  • SNMP names: Use ifDescr OID for readability.

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →
This post is licensed under CC BY 4.0 by the author.