
Building Convergence – A Journey from Network Observability to AI-Driven Automation Part 3: Data Enrichment and Alerting


Welcome back to the “Building Convergence” series! In Part 1, we outlined the vision: Nautobot as the single source of truth, OpenTelemetry (OTEL) as the telemetry hub, and a suite of open-source tools like VictoriaMetrics, Loki, and Grafana for storage, querying, and visualization.

Part 2 got our hands dirty with the foundational infrastructure—setting up the Docker Compose environment, ingesting SNMP and syslog data via OTEL, enriching logs with GeoIP metadata, and converting key events to metrics for efficient querying.

Now, in Part 3, we’re taking that enriched data and turning it into actionable intelligence. We’ll dive into real-time threat mapping using IP location data, set up spike detection with Loki’s ruler, configure intelligent alerting through Grafana and Alertmanager, and reorganize our dashboards for better usability. This phase bridges the gap from passive observability to proactive monitoring, laying the groundwork for automation in Phase 4.

All code and configs for this phase are in the Convergence repo main branch. If you’re following along, pull the latest changes and spin up the stack with docker compose up -d.

Overview

Phase 3 builds on the geo-enriched firewall logs introduced in Phase 2 to deliver:

  1. LogQL Spike Detection — Loki ruler evaluates LogQL queries on a schedule and fires alerts when attack patterns from specific regions exceed thresholds.
  2. Grafana Geo-Visualization — The existing Geomap panels on the pfSense Security Dashboard now have richer Loki label support for LogQL-based exploration alongside the VictoriaMetrics metric data.
  3. Intelligent Alerting — Provisioned Grafana alert rules cover country-based block spikes, overall firewall volume spikes, and switch uptime/reachability events. Alerts route to Alertmanager, then to webhooks or email.
  4. Dashboard Reorganization and New Panels — Dashboards are split into Network and Security Grafana folders. Two new dashboards added: Threat Analysis (country breakdown, attack trends) and Network Device Health (uptime, error rates, bandwidth).

This implementation is fully operational as of today (updated: Loki ruler path fix, dashboard reorganization, new panels, Discord alerting, alert rule datasource UID fix). Let’s break it down.

Architecture

Here’s the high-level flow, expanding on Phase 2’s pipeline:

pfSense Firewall (filterlog syslog)
        │
        ▼
OpenTelemetry Collector
  ├─ Parse filterlog: extract src_ip, dst_ip, action, interface, proto_name
  ├─ GeoIP enrich: add src_lat, src_lon, src_country, src_city
  ├─ count/firewall connector → firewall_events_total metric (with geo labels)
  └─ file/syslog → /data/syslog/syslog.jsonl (OTLP JSON)
        │                         │
        ▼                         ▼
VictoriaMetrics              Promtail
(firewall_events_total        ├─ Regex extract labels from OTLP JSON:
 with src_lat, src_lon,       │   action, log_type, src_country,
 src_country labels)          │   src_country_code, interface
        │                     └─ Push structured logs to Loki
        │                               │
        ▼                               ▼
Grafana Unified Alerting           Loki Ruler
  ├─ Alert: Country block spike     ├─ Recording rules → pre-agg metrics
  ├─ Alert: Total block spike       └─ Alert rules → Alertmanager
  ├─ Alert: Switch rebooted                  │
  └─ Alert: Switch SNMP down                 ▼
        │                            Alertmanager
        │                            ├─ Route: critical → Discord (1h repeat)
        │                            ├─ Route: team=security → Discord (4h repeat)
        │                            └─ Route: team=network → Discord (6h repeat)
        ▼
Grafana Dashboards
  ├─ Security/ folder
  │   ├─ pfSense Security Dashboard
  │   │   ├─ Geomap: WAN Threats (firewall_events_total + src_lat/lon)
  │   │   ├─ Geomap: Traffic Destinations
  │   │   ├─ Top Blocked IPs table
  │   │   └─ Firewall Actions timeseries
  │   └─ Threat Analysis Dashboard
  │       ├─ Stats: total blocks / attacking countries / block rate
  │       ├─ Top 10 countries (horizontal bar)
  │       ├─ Protocol distribution (donut)
  │       ├─ Attack rate by country (timeseries)
  │       ├─ Blocks by interface (stacked timeseries)
  │       └─ All attacking countries table
  └─ Network/ folder
      ├─ Interface Utilization
      ├─ Interface Errors
      ├─ Network Overview
      ├─ Platform Health
      └─ Network Device Health Dashboard
          ├─ Per-device uptime stats
          ├─ Uptime history (reboot detection)
          ├─ Interface error rates (table + timeseries)
          └─ Total bandwidth per device

This architecture ensures data flows efficiently from ingestion to visualization and alerting, with minimal overhead.


Part A: Real-Time Threat Mapping with IP Data

Building directly on Phase 2’s GeoIP enrichment, we now use the firewall_events_total metric and Loki labels to create interactive geo-visualizations in Grafana.

How the Geomap Works

The firewall_events_total metric carries rich labels from every parsed filterlog entry:

| Label | Description | Example |
|---|---|---|
| action | Firewall decision | block, pass |
| src_ip | Source IP address | 1.2.3.4 |
| src_lat | Source latitude (GeoIP) | 39.9042 |
| src_lon | Source longitude (GeoIP) | 116.4074 |
| src_country | Source country name (GeoIP) | China |
| dst_ip | Destination IP | 192.168.100.1 |
| dst_country | Destination country | United States |
| interface | pfSense interface | igc0.201 |
| proto_name | Protocol | tcp, udp |

In the pfSense Security Dashboard, the Geomap panel for WAN Threats uses this PromQL query in instant mode with format=table:

topk(100,
  sum by (src_ip, src_lat, src_lon, src_country) (
    increase(firewall_events_total{
      action="block",
      direction="in",
      interface="igc0.201",
      src_ip!~"^(10\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.|192\\.168\\.)"
    }[2h])
  )
)

Grafana transforms this into a table with columns like src_lat, src_lon, src_country, src_ip, and Value (block count). The Geomap layer maps these to coordinates, with bubble sizes scaled by block volume. Hovering over a bubble reveals the IP and country details—perfect for spotting clustered attacks.
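The `src_ip!~"..."` matcher in that query excludes the RFC1918 private ranges so the threat map only shows internet sources. If you want to sanity-check the pattern before trusting the panel, the same regex can be exercised in Python (a quick sketch; note that PromQL doubles the backslashes, while a plain regex writes them once):

```python
import re

# Same RFC1918 exclusion used in the Geomap query's src_ip matcher:
# 10.0.0.0/8, 172.16.0.0/12 (172.16-172.31), 192.168.0.0/16
PRIVATE_RE = re.compile(r"^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)")

def is_private(ip: str) -> bool:
    """Return True if the IP falls in an RFC1918 range."""
    return PRIVATE_RE.match(ip) is not None

print(is_private("10.0.0.5"))       # True  (excluded from the map)
print(is_private("172.20.1.1"))     # True
print(is_private("192.168.100.1"))  # True
print(is_private("8.8.8.8"))        # False (public IP, shown on the map)
```

Note that 172.32.x.x is public and correctly passes the filter, since only 172.16–172.31 are private.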

Useful LogQL Queries for Threat Exploration

With Promtail extracting labels like action, src_country, and interface from the OTLP JSON logs, you can drill down in Grafana’s Explore view:

  • View all block events from a specific country:
    {job="syslog", log_type="firewall", action="block", src_country="China"}
    
  • Count blocks per country over the last hour:
    sum by (src_country) (
      count_over_time(
        {job="syslog", log_type="firewall", action="block"}[1h]
      )
    )
    
  • Detect countries with a sudden spike (last 5 min vs last hour):
    (
      sum by (src_country) (count_over_time({job="syslog", action="block"}[5m]))
      /
      sum by (src_country) (count_over_time({job="syslog", action="block"}[1h]) / 12)
    ) > 3
    

This relative spike detector adapts to your baseline traffic, firing when recent activity is 3x the average.

  • View raw events with all parsed fields (JSON log exploration):
    {job="syslog", log_type="firewall", action="block"}
    | json
    | line_format "{{ .src_ip }} → {{ .dst_ip }}:{{ .dst_port }} ({{ .proto_name }}) [{{ .src_country }}]"
    

These queries empower ad-hoc investigations, complementing the dashboard views.
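The arithmetic behind the relative spike detector is simple: the trailing hour splits into twelve 5-minute buckets, so dividing the hourly count by 12 gives a baseline rate to compare against. A minimal sketch with made-up counts (the country codes and numbers are illustrative, not from the live stack):

```python
# Relative spike detection: recent 5-minute count vs the per-5-minute
# average of the trailing hour (1h / 12 buckets of 5m each).
def spike_ratio(blocks_5m: int, blocks_1h: int) -> float:
    """Ratio of the recent rate to the hourly baseline rate."""
    baseline = blocks_1h / 12  # average blocks per 5-minute window
    return blocks_5m / baseline if baseline else float("inf")

# (blocks in last 5m, blocks in last 1h) per country -- illustrative data
counts = {"CN": (90, 240), "US": (10, 130), "RU": (45, 120)}

# Flag countries whose recent rate exceeds 3x their own baseline
spiking = sorted(c for c, (m5, h1) in counts.items() if spike_ratio(m5, h1) > 3)
print(spiking)  # ['CN', 'RU'] -- 90/(240/12)=4.5x and 45/(120/12)=4.5x
```

Because each country is compared against its own baseline, a chatty-but-steady source never fires, while a normally quiet source that suddenly triples does.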


Part B: Loki Ruler — Recording Rules and Alerting

Loki’s ruler component runs LogQL queries periodically (every 1 minute), enabling both pre-aggregation for dashboards and threshold-based alerting.

How the Ruler Works

  1. Recording rules compute and store derived metrics in Loki, making them queryable like native time series. This avoids rescanning raw logs for every dashboard refresh.
  2. Alerting rules evaluate expressions and push firing alerts to Alertmanager.

Recording Rules in Action

Defined in config/loki/rules/fake/firewall_alerts.yaml, these rules create efficient metrics:

| Metric | Description |
|---|---|
| firewall:blocks_by_country:rate5m | Block count per country per 5-minute window |
| firewall:blocks_by_interface:rate5m | Block count per interface per 5-minute window |
| firewall:blocks_total:rate5m | Total block count per 5-minute window |

Tune thresholds by monitoring baselines with queries like sum by (src_country)(count_over_time({job="syslog",action="block"}[5m])) in Explore.
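For orientation, Loki ruler rule files use the Prometheus-compatible groups format. The sketch below shows the general shape of such a file with one recording rule and one alert rule; the threshold and alert name are illustrative, not the repo's exact contents:

```yaml
groups:
  - name: firewall_recording
    interval: 1m
    rules:
      # Recording rule: pre-aggregate block counts per country
      - record: firewall:blocks_by_country:rate5m
        expr: |
          sum by (src_country) (
            count_over_time({job="syslog", log_type="firewall", action="block"}[5m])
          )

  - name: firewall_alerting
    rules:
      # Alert rule: fire when any single country exceeds an illustrative threshold
      - alert: CountryBlockSpike
        expr: |
          sum by (src_country) (
            count_over_time({job="syslog", action="block"}[5m])
          ) > 200
        for: 5m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "High block volume from {{ $labels.src_country }}"
```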

Tenant note: with auth_enabled: false, Loki uses a “fake” tenant, so rules go in config/loki/rules/fake/.

Volume Path — Why Rules Live Under /etc/loki. Docker Compose mounts ./config/loki to /etc/loki (bind mount, read-only) and uses a named volume for /loki. Place rules in the bind mount for reliability. In local-config.yaml:

ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  rule_path: /loki/rules-temp

Misconfiguring this leads to errors like “unable to read rule dir /loki/rules/fake”.
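The corresponding Docker Compose mounts look roughly like this (a sketch of the layout described above, not the repo's exact service definition):

```yaml
services:
  loki:
    image: grafana/loki:latest
    volumes:
      # Bind mount (read-only): config and ruler rule files live here
      - ./config/loki:/etc/loki:ro
      # Named volume: Loki's writable storage (index, chunks, WAL, rule temp dir)
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml

volumes:
  loki-data:
```

With this split, the read-only bind mount holds the rules you edit on the host, while rule_path points into the writable named volume for the ruler's temporary copies.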


Part C: Intelligent Alerting with Grafana

Grafana’s unified alerting handles VictoriaMetrics metrics, integrating seamlessly with Loki alerts via Alertmanager.

Provisioned Alert Rules

From config/grafana/provisioning/alerting/alert_rules.yaml:

Security Alerts (folder: Security Alerts)

| Rule | Query | Threshold | For |
|---|---|---|---|
| High Block Rate From Country (1h) | sum by (src_country)(increase(firewall_events_total{action="block"}[1h])) | > 1000 | 5m |
| Firewall Block Rate Spike (5m) | sum(increase(firewall_events_total{action="block"}[5m])) | > 500 | 3m |

Network Health Alerts (folder: Network Alerts)

| Rule | Query | Condition | For |
|---|---|---|---|
| Network Switch Rebooted | delta(system_uptime{device_role="home_switch"}[5m]) | < -60s | 0m |
| Network Switch Low Uptime | min(system_uptime{device_role="home_switch"}) | < 600s | 2m |
| Network Device SNMP Unreachable | count(system_uptime{device_role="home_switch"}) | no data | 5m |
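Grafana file-provisions these rules with a fairly verbose schema. As an abbreviated illustration of its shape (the uid is made up, and YOUR_VM_DATASOURCE_UID is a placeholder you must replace with the real UID from the datasources API), the block-spike rule looks roughly like this:

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: security-alerts
    folder: Security Alerts
    interval: 1m
    rules:
      - uid: fw-block-spike              # illustrative uid
        title: Firewall Block Rate Spike (5m)
        condition: C
        for: 3m
        labels:
          severity: critical
          team: security
        data:
          - refId: A
            datasourceUid: YOUR_VM_DATASOURCE_UID   # query /api/datasources for the real UID
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: sum(increase(firewall_events_total{action="block"}[5m]))
          - refId: C
            datasourceUid: __expr__                 # Grafana server-side expression
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [500]
```

This is also where the "datasource not found" troubleshooting item below comes from: a stale datasourceUid in this file breaks the rule silently until you re-provision.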

Notification Policies

From config/grafana/provisioning/alerting/notification_policies.yaml:

All alerts → Discord (default, 4h repeat)
  ├─ severity=critical → Discord (1h repeat, group_wait 10s)
  ├─ team=security     → Discord (4h repeat)
  └─ team=network      → Discord (6h repeat)
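In provisioning-file form, that routing tree corresponds roughly to the following sketch (matcher and receiver details abbreviated; the receiver name convergence-discord matches the contact point used in the test call later in this post):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: convergence-discord        # default route: everything → Discord
    repeat_interval: 4h
    routes:
      - receiver: convergence-discord
        object_matchers:
          - ["severity", "=", "critical"]
        group_wait: 10s
        repeat_interval: 1h
      - receiver: convergence-discord
        object_matchers:
          - ["team", "=", "security"]
        repeat_interval: 4h
      - receiver: convergence-discord
        object_matchers:
          - ["team", "=", "network"]
        repeat_interval: 6h
```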

Setting Up Discord Notifications

  1. Create a Discord webhook in your server settings and copy the URL.
  2. Add to .env: DISCORD_WEBHOOK_URL=https://discord.com/api/webhooks/ID/TOKEN
  3. Apply: docker compose up -d --force-recreate grafana
  4. Test via the API:
    
    curl -s -u admin:admin \
      -X POST "http://localhost:3000/api/v1/provisioning/contact-points/convergence-discord/test" \
      -H "Content-Type: application/json" \
      -d '{}'
    

Contact Points

From config/grafana/provisioning/alerting/contact_points.yaml:

| Name | Type | Purpose |
|---|---|---|
| Discord | discord | Primary alert destination — reads ${DISCORD_WEBHOOK_URL} from env |
| Webhook | webhook | Generic HTTP endpoint (Slack, custom scripts, n8n, etc.) |
| Email | email | Email notifications — requires SMTP env vars in docker-compose.yml |
| Do Nothing | webhook | Silently drops alerts — useful for testing routing rules |

Provisioned points show with a lock icon in Grafana. Use the API to verify: curl -s -u admin:admin http://localhost:3000/api/v1/provisioning/contact-points.
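A contact point provisioning file has this general shape (a sketch; the uid is illustrative, and the ${DISCORD_WEBHOOK_URL} reference is expanded from the container environment at load time):

```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: convergence-discord
    receivers:
      - uid: discord-primary            # illustrative uid
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}   # set in .env, passed through docker-compose
```

Keeping the webhook URL in the environment rather than the YAML means the secret never lands in the repo.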

Alertmanager

Access at http://localhost:9093. Configure webhooks or email in config/alertmanager/alertmanager.yml, then restart: docker compose restart alertmanager.

Part D: Dashboard Organization and New Panels

We’ve restructured for scalability, provisioning dashboards into folders.

Folder Structure

| Host path | Grafana folder | Contents |
|---|---|---|
| dashboards/network/ | Network | SNMP monitoring, interface stats, device health |
| dashboards/security/ | Security | Firewall events, geo-visualization, threat analysis |
| dashboards/cisco/ | Vendor - Cisco | (reserved for future Cisco-specific dashboards) |
| dashboards/juniper/ | Vendor - Juniper | (reserved) |
| dashboards/arista/ | Vendor - Arista | (reserved) |

Config in config/grafana/provisioning/dashboards/convergence.yaml. Alert folders are auto-created.
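The provider config follows Grafana's standard dashboard provisioning schema; a sketch of the two main folders (provider names and container paths are illustrative and should match the volumes in your compose file):

```yaml
apiVersion: 1
providers:
  - name: network-dashboards
    folder: Network
    type: file
    disableDeletion: true
    options:
      path: /var/lib/grafana/dashboards/network
  - name: security-dashboards
    folder: Security
    type: file
    disableDeletion: true
    options:
      path: /var/lib/grafana/dashboards/security
```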

Dashboard Inventory

Security Folder

pfSense Firewall Security (pfsense-firewall-security.json)

| Panel | Type | Query / Purpose |
|---|---|---|
| Blocked Events Rate | gauge | sum(rate(firewall_events_total{action="block"}[5m])) |
| WAN Threats — Internet Attacks | geomap | Block events from public IPs, sized and coloured by count |
| Traffic Destinations | geomap | All traffic destinations (red = blocked, green = passed) |
| All Blocked Source IPs | geomap | All block origins across all interfaces |
| Firewall Actions Over Time | timeseries | Pass vs block rate over time |
| Top 100 Blocked Source IPs | table | topk(100, sum by (src_ip, ...)) |
| Blocked Traffic by Protocol | piechart | sum by (proto_name) |
| Traffic by Interface | barchart | sum by (interface) |
| Top Blocked Destination IPs | timeseries | Which internal IPs are being targeted |
| Recent Firewall Events Summary | table | Latest block/pass events |

Threat Analysis (threat-analysis.json) (new)

| Panel | Type | Query / Purpose |
|---|---|---|
| Total Blocks (24h) | stat | sum(increase(firewall_events_total{action="block"}[24h])) — background colour thresholds |
| Attacking Countries (24h) | stat | count(sum by (src_country)(...) > 0) — unique source countries |
| Current Block Rate | stat | sum(rate(...[5m])) * 60 — blocks per minute |
| Top 10 Attacking Countries | barchart (horizontal) | sort_desc(topk(10, sum by (src_country)(...[24h]))) |
| Blocked Traffic by Protocol | piechart (donut) | sum by (proto_name)(...[24h]) |
| Attack Rate by Country Over Time | timeseries | Top 7 countries, 15m rolling rate, blocks/min |
| Blocks by Interface Over Time | timeseries (stacked) | All interfaces; shows which is under load |
| All Attacking Countries | table | Full sortable country breakdown with code and block count |

Network Folder

Existing dashboards: Interface Utilization, Interface Errors, Network Overview, Platform Health.

Network Device Health (device-health.json) (new)

| Panel | Type | Query / Purpose |
|---|---|---|
| HomeSwitch01 Uptime | stat | system_uptime{device_name="HomeSwitch01"} / 100 — green ≥ 1d, orange ≥ 10m, red < 10m |
| HomeSwitch02 Uptime | stat | Same for HomeSwitch02 |
| pfSense-FW01 Uptime | stat | Same for pfSense-FW01 |
| Device Uptime History | timeseries | All devices — a drop to near zero indicates a reboot |
| Interface Error Rates | table | rate(interface_in_errors_total[5m]), sorted descending |
| Interface Error Rate Over Time | timeseries | Only interfaces with active errors (> 0) are shown |
| Total Bandwidth per Device | timeseries | sum by (device_name)(rate(interface_in_octets_bytes_total[5m])) * 8, IN + OUT |
Note on sysUpTime: SNMP returns centiseconds; divide by 100 for seconds. Adjust if your receiver pre-converts.
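The centisecond conversion is easy to get wrong in panel math, so here is the arithmetic spelled out in Python (a sketch of the same /100 the uptime panels apply, plus a human-readable formatter; function names are my own, not from the repo):

```python
def uptime_seconds(centiseconds: int) -> float:
    """SNMP sysUpTime arrives in hundredths of a second; convert to seconds."""
    return centiseconds / 100

def human_uptime(centiseconds: int) -> str:
    """Format a sysUpTime value as days/hours/minutes."""
    secs = int(uptime_seconds(centiseconds))
    days, rem = divmod(secs, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, _ = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m"

# 8,640,000 centiseconds = 86,400 seconds = exactly one day
print(human_uptime(8_640_000))  # 1d 0h 0m
```

Also remember that sysUpTime is a 32-bit counter and wraps after roughly 497 days, so very long-lived devices can appear to "reboot" without actually doing so.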


pfSense WAN Block Logging — Important Note

pfSense CE 2.7+ removed the UI option for logging default blocks. Add an explicit bottom-of-WAN-ruleset block rule with logging enabled:

| Field | Value |
|---|---|
| Action | Block |
| Interface | WAN |
| Direction | in |
| Protocol | Any |
| Source | Any |
| Destination | Any |
| Log | ✓ checked |
| Description | Default deny - logged |

This logs unsolicited inbound traffic without affecting established connections. Verify in System Logs → Firewall.


Deployment Steps

  1. Apply changes: docker compose restart promtail loki alertmanager grafana
  2. Verify Loki labels: In Explore, check {job="syslog"} for labels like action, src_country.
  3. Verify recording rules: curl http://localhost:3100/loki/api/v1/rules and query firewall:blocks_total:rate5m.
  4. Verify Alertmanager: curl http://localhost:9093/-/healthy
  5. Set Discord URL: Update .env, then docker compose up -d --force-recreate grafana
  6. Test alerts: Use the curl command above.

Troubleshooting

  • No labels in Loki: Check OTEL JSON format and Promtail regex. Inspect logs with docker exec.
  • Loki no rules: Verify mounts and tenant path. Use docker exec to list files.
  • Grafana “No data”: Confirm metrics/labels with VictoriaMetrics API.
  • Datasource not found: Update datasourceUid in alert_rules.yaml with actual UIDs (query API), then recreate Grafana.
  • Always-firing network alerts: Ensure device_role matches your data; silence if SNMP isn’t set up.

Wrapping Up and Next Steps

Phase 3 transforms our observability stack into a vigilant system that not only visualizes threats but alerts on them intelligently. We’ve enriched IP data for geo-mapping, leveraged Loki for spike detection, and integrated alerting with Discord for quick response.

Looking ahead to Phase 4: AI Threat Intelligence. We’ll explore creating threat intelligence dashboards with automated IP Enrichment, Composite Scoring & Bad-Actor Flagging, Port Attack Analysis, and Claude Haiku Narratives.

If you have questions or suggestions, hit me up on X @byrn_baker or on LinkedIn. Stay tuned!

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →
This post is licensed under CC BY 4.0 by the author.