Monitoring

The monitoring stack collects metrics from all 5 Swarm nodes, the Proxmox host, dockerrr, and data services. 29 Prometheus alert rules across 8 groups route through Alertmanager directly to Pushover for iPhone push notifications. Grafana dashboards provide visualization.

Architecture Note

Prometheus, Grafana, and Alertmanager run on dockerrr (192.168.1.47, Default VLAN) rather than the Swarm cluster. This is because dockerrr can reach all VLANs (50, 51, and IOT), while VLAN 51 nodes cannot reach VLAN 50 or the Default VLAN due to firewall rules.

Monitoring Stack

Central Services (dockerrr, 192.168.1.47)

Running via Docker Compose at /opt/docker/homelab/docker-compose.yml:

Service URL Port Purpose
Prometheus prometheus.home.jlwaller.com 9090 Metrics collection, storage (30d retention), alerting rules
Grafana grafana.home.jlwaller.com 3000 Dashboards and visualization
Alertmanager alertmanager.home.jlwaller.com 9093 Alert routing and notification

All three are accessible via Traefik with Let's Encrypt TLS (Cloudflare DNS challenge).

Cluster Exporters (Swarm)

The monitoring-exporters stack deploys exporters across the Swarm cluster.

Stack file: /opt/swarm/stacks/monitoring-exporters/exporters-stack.yml

Exporter Mode Port Metrics
cadvisor Global (5/5) 8080 Container CPU, memory, network, disk I/O
node_exporter Global (5/5) 9100 Host CPU, memory, disk, network
promtail Global (5/5) -- Log aggregation (ships to Loki)
postgres_exporter Replicated (1) 9187 PostgreSQL stats, connections, replication
redis_exporter Replicated (1) 9121 Redis memory, commands, keys, clients

Postgres and Redis exporters are constrained to apps-data and connect via the data_net overlay network.
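A minimal sketch of what the postgres_exporter service in exporters-stack.yml might look like, given the constraints above. The image name, published-port style, and hostname-based placement constraint are assumptions; the network, secret, user, and port come from this page:

```yaml
# Sketch only -- not copied from the stack file.
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter   # image name assumed
    networks:
      - data_net                                   # overlay shared with data_postgres
    ports:
      - "9187:9187"                                # scraped by Prometheus on apps-data
    environment:
      DATA_SOURCE_URI: "data_postgres:5432/postgres?sslmode=disable"
      DATA_SOURCE_USER: monitoring                 # pg_monitor role, read-only
      DATA_SOURCE_PASS_FILE: /run/secrets/monitoring_pg_password
    secrets:
      - monitoring_pg_password
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == apps-data             # placement mechanism assumed
```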

Prometheus Targets

All 17 targets are scraped from dockerrr across VLANs:

Job Target VLAN Metrics
prometheus localhost:9090 -- Self-monitoring
swarm-nodes 192.168.50.10:9100 50 apps-gateway host metrics
swarm-nodes 192.168.51.10:9100 51 apps-app1 host metrics
swarm-nodes 192.168.51.15:9100 51 apps-app2 host metrics
swarm-nodes 192.168.51.30:9100 51 apps-data host metrics
swarm-nodes 192.168.51.40:9100 51 apps-dev1 host metrics
swarm-cadvisor 192.168.51.10:8080 51 Container metrics (via app1)
swarm-cadvisor 192.168.51.30:8080 51 Container metrics (via data)
postgres 192.168.51.30:9187 51 PostgreSQL database metrics
redis 192.168.51.30:9121 51 Redis cache metrics
monitoring-server 192.168.51.20:9100 51 apps-monitoring host metrics
monitoring-cadvisor 192.168.51.20:8080 51 Container metrics (via monitoring)
proxmox-host 192.168.1.5:9100 Default Proxmox host metrics (node_exporter)
dockerrr 192.168.1.47:9100 Default dockerrr host metrics (node_exporter)
traefik-homelab 192.168.1.47:8080 Default Homelab Traefik metrics
traefik-swarm 192.168.50.10:8080 50 Swarm Traefik metrics
infra-checks 192.168.1.47:9101 Default Custom checks (Vault seal, Plex, HA)

Config file: /opt/docker/homelab/prometheus/prometheus.yml
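As a sketch, the swarm-nodes job in prometheus.yml likely resembles the following. The targets are copied from the table above; the scrape interval and label names are assumptions:

```yaml
# Sketch of one scrape job -- intervals and labels assumed.
scrape_configs:
  - job_name: swarm-nodes
    scrape_interval: 15s
    static_configs:
      - targets:
          - 192.168.50.10:9100   # apps-gateway (VLAN 50)
          - 192.168.51.10:9100   # apps-app1    (VLAN 51)
          - 192.168.51.15:9100   # apps-app2
          - 192.168.51.30:9100   # apps-data
          - 192.168.51.40:9100   # apps-dev1
```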

Alerting

Alert Pipeline

Prometheus (29 alert rules across 8 groups)
  -> Alertmanager (routing, grouping, silencing)
    -> Pushover (direct push notifications to iPhone)

Migration from Home Assistant

Alerting was migrated from Home Assistant webhooks to direct Pushover integration, removing the HA dependency from the alerting pipeline. Pushover credentials are stored in Vault at secret/pushover.

Alert Rule Groups

Config file: /opt/docker/homelab/prometheus/alerts.yml

Group Rules Description
host_alerts HostDown, HostHighCPU, HostHighMemory (>92%), HostDiskSpaceLow, HostDiskSpaceCritical Host-level alerts for swarm nodes
container_alerts ContainerHighCPU, ContainerHighMemory Container resource alerts
database_alerts PostgresDown, PostgresHighConnections, PostgresDeadlocks PostgreSQL health
cache_alerts RedisDown, RedisHighMemory Redis health
application_alerts TraefikHighErrorRate (5xx), TraefikHighLatency, TraefikTrafficSpike Traefik application-level alerts
proxmox_alerts ProxmoxHostDown, ProxmoxHighCPU, ProxmoxHighMemory, ProxmoxDiskSpace Proxmox host health
dockerrr_alerts DockerrrDown, DockerrrHighCPU, DockerrrHighMemory, DockerrrDiskSpace dockerrr host health
infra_alerts VaultSealed, VaultDown, PlexDown, HomeAssistantDown Infrastructure service checks

HostHighMemory Threshold

The HostHighMemory threshold was raised from 85% to 92% because the swarm VMs previously ran with balloon-constrained memory, which inflated reported usage percentages. Ballooning is now disabled on all swarm VMs.
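A hedged sketch of what the HostHighMemory rule in alerts.yml might look like at the 92% threshold. The PromQL expression, `for` duration, and label names are assumptions; only the rule name, group, and threshold come from this page:

```yaml
# Sketch only -- the actual expr/labels in alerts.yml may differ.
groups:
  - name: host_alerts
    rules:
      - alert: HostHighMemory
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 92
        for: 5m                          # duration assumed
        labels:
          severity: warning
        annotations:
          summary: "Memory above 92% on {{ $labels.instance }}"
```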

Alertmanager Configuration

Config file: /opt/docker/homelab/alertmanager/alertmanager.yml

Receiver Method Behavior
pushover-critical Pushover (priority: emergency) Immediate (0s wait), repeat 15m, breaks Do Not Disturb
pushover-warning Pushover (priority: high) 1m wait, repeat 4h, normal push notification

Alerts are grouped by alertname and node. Resolved notifications are sent when alerts clear.
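The routing behavior above can be sketched as an alertmanager.yml fragment. The severity label and Pushover priority mapping (2 = emergency, 1 = high) are assumptions; grouping keys, waits, and repeat intervals come from the table:

```yaml
# Sketch only -- severity matching and priority values assumed.
route:
  group_by: [alertname, node]
  receiver: pushover-warning
  group_wait: 1m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pushover-critical
      group_wait: 0s                     # immediate
      repeat_interval: 15m
receivers:
  - name: pushover-critical
    pushover_configs:
      - user_key: "<pushover-user-key>"  # from Vault secret/pushover
        token: "<pushover-app-token>"
        priority: "2"                    # emergency: breaks Do Not Disturb
        send_resolved: true
  - name: pushover-warning
    pushover_configs:
      - user_key: "<pushover-user-key>"
        token: "<pushover-app-token>"
        priority: "1"                    # high
        send_resolved: true
```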

Managing Silences

# List active silences
curl -s http://192.168.1.47:9093/api/v2/silences | python3 -m json.tool

# Create a silence (e.g., for maintenance)
curl -X POST http://192.168.1.47:9093/api/v2/silences -H "Content-Type: application/json" -d '{
  "matchers": [{"name": "alertname", "value": "HostDown", "isRegex": false}],
  "startsAt": "2026-01-01T00:00:00Z",
  "endsAt": "2026-01-01T06:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance"
}'

Grafana Dashboards

Grafana uses file-based provisioning. Dashboards are stored in /opt/docker/homelab/grafana/provisioning/dashboards/json/.

Dashboard Source Description
Infrastructure Overview Custom Node status, CPU, memory, disk, network, Postgres connections, Redis memory
Node Exporter Full Community (#1860) Detailed host metrics per node
PostgreSQL Community (#9628) Database connections, queries, cache hit ratio
Redis Community (#11835) Memory usage, commands, keys, clients

Datasource: Prometheus at http://prometheus:9090 (same Docker network)
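Since Grafana is file-provisioned, the datasource is presumably declared in a YAML file as well. A minimal sketch, assuming a path like /opt/docker/homelab/grafana/provisioning/datasources/ (the UID and URL come from this page):

```yaml
# Sketch of the datasource provisioning file -- path and extra fields assumed.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093       # UID referenced by the provisioned dashboards
    url: http://prometheus:9090  # same Docker network as Grafana
    access: proxy
    isDefault: true
```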

Adding a Dashboard

  1. Place the JSON file in /opt/docker/homelab/grafana/provisioning/dashboards/json/
  2. Ensure the datasource UID is PBFA97CFB590B2093
  3. Restart Grafana: docker restart grafana

Exporter Credentials

PostgreSQL Exporter

  • User: monitoring with pg_monitor role (read-only)
  • Password: Docker secret monitoring_pg_password
  • Auth method: DATA_SOURCE_PASS_FILE pointing to /run/secrets/monitoring_pg_password

Redis Exporter

  • Password: Docker secret monitoring_redis_password
  • Format: {"redis://data_redis:6379": "<password>"}
  • Auth method: --redis.password-file=/run/secrets/monitoring_redis_password

Redis Password File Format

The oliver006/redis_exporter image is a scratch/distroless image with no shell. The --redis.password-file flag requires JSON format mapping the Redis URL to the password, not plain text.
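Creating the secret with the required JSON shape might look like this. The $REDIS_PASSWORD variable is a placeholder; the URL key and secret name come from this page:

```shell
# Build the JSON password map and load it as a Docker secret.
# redis_exporter reads this file via --redis.password-file.
printf '{"redis://data_redis:6379": "%s"}' "$REDIS_PASSWORD" \
  | docker secret create monitoring_redis_password -
```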

Log Management

Promtail

Promtail runs as a global service on all 5 nodes, collecting container logs and shipping them to Loki for centralized log aggregation.

Docker Log Rotation

The Docker daemon on each node is configured with the json-file logging driver and rotation limits:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
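Applying this config might look like the following. Note that existing containers keep their old log settings until they are recreated; the path /etc/docker/daemon.json is the standard default:

```shell
# Write the rotation policy and restart the daemon so new containers pick it up.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker
```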

Health Checks

Quick Cluster Status

# All nodes healthy?
docker node ls

# All services running expected replicas?
docker service ls

# Any failed tasks?
docker service ls --format "{{.Name}} {{.Replicas}}" | grep -v "1/1\|2/2\|5/5"

Monitoring Stack Health

# Check all Prometheus targets
curl -s http://192.168.1.47:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels'].get('job',''):>15} | {t['labels'].get('instance',''):>25} | {t['health']}\")"

# Check active alerts
curl -s http://192.168.1.47:9093/api/v2/alerts | python3 -m json.tool

# Grafana health
curl -s http://192.168.1.47:3000/api/health

Disaster Recovery

Node Failure

# Drain failed node
docker node update --availability drain {node}

# Force redistribute services
docker service ls --format "{{.Name}}" | xargs -I {} docker service update --force {}

# When repaired, reactivate
docker node update --availability active {node}

Service Rollback

# Immediate rollback to previous version
docker service rollback {service-name}

# Deploy specific version
docker service update --image registry.apps.jlwaller.com/{image}:{tag} {service-name}

Database Recovery

# Restore from backup
cat backup.sql | docker exec -i $(docker ps -qf name=data_postgres) psql -U postgres -d {database}