Monitoring¶
The monitoring stack collects metrics from all 5 Swarm nodes, the Proxmox host, dockerrr, and data services. 29 Prometheus alert rules across 8 groups route through Alertmanager directly to Pushover for iPhone push notifications. Grafana dashboards provide visualization.
Architecture Note
Prometheus, Grafana, and Alertmanager run on dockerrr (192.168.1.47, Default VLAN) rather than the Swarm cluster. This is because dockerrr can reach all VLANs (50, 51, and IOT), while VLAN 51 nodes cannot reach VLAN 50 or the Default VLAN due to firewall rules.
Monitoring Stack¶
Central Services (dockerrr, 192.168.1.47)¶
Running via Docker Compose at /opt/docker/homelab/docker-compose.yml:
| Service | URL | Port | Purpose |
|---|---|---|---|
| Prometheus | prometheus.home.jlwaller.com | 9090 | Metrics collection, storage (30d retention), alerting rules |
| Grafana | grafana.home.jlwaller.com | 3000 | Dashboards and visualization |
| Alertmanager | alertmanager.home.jlwaller.com | 9093 | Alert routing and notification |
All three are accessible via Traefik with Let's Encrypt TLS (Cloudflare DNS challenge).
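As a sketch of how one of these services is exposed through Traefik (label syntax follows standard Traefik v2 Docker conventions; the router and cert-resolver names here are assumptions, not copied from the actual compose file):

```yaml
services:
  grafana:
    image: grafana/grafana
    labels:
      - "traefik.enable=true"
      # Hostname from the table above; router name "grafana" is an assumption
      - "traefik.http.routers.grafana.rule=Host(`grafana.home.jlwaller.com`)"
      - "traefik.http.routers.grafana.tls=true"
      # Resolver name is hypothetical; it must match the resolver configured
      # for the Let's Encrypt Cloudflare DNS-01 challenge
      - "traefik.http.routers.grafana.tls.certresolver=cloudflare"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
```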
Cluster Exporters (Swarm)¶
The monitoring-exporters stack deploys exporters across the Swarm cluster.
Stack file: /opt/swarm/stacks/monitoring-exporters/exporters-stack.yml
| Exporter | Mode | Port | Metrics |
|---|---|---|---|
| cadvisor | Global (5/5) | -- | Container CPU, memory, network, disk I/O |
| node_exporter | Global (5/5) | 9100 | Host CPU, memory, disk, network |
| promtail | Global (5/5) | -- | Log aggregation (ships to Loki) |
| postgres_exporter | Replicated (1) | 9187 | PostgreSQL stats, connections, replication |
| redis_exporter | Replicated (1) | 9121 | Redis memory, commands, keys, clients |
Postgres and Redis exporters are constrained to apps-data and connect via the data_net overlay network.
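As an illustration, the placement constraint and network attachment for the replicated exporters look roughly like this in the stack file (a sketch, not the real file; the image name and constraint form are assumptions, and the real stack may pin the node via a label instead):

```yaml
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    networks:
      - data_net              # overlay network shared with the data services
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == apps-data   # assumed constraint form

networks:
  data_net:
    external: true
```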
Prometheus Targets¶
All 17 targets are scraped from dockerrr across VLANs:
| Job | Target | VLAN | Metrics |
|---|---|---|---|
| prometheus | localhost:9090 | -- | Self-monitoring |
| swarm-nodes | 192.168.50.10:9100 | 50 | apps-gateway host metrics |
| swarm-nodes | 192.168.51.10:9100 | 51 | apps-app1 host metrics |
| swarm-nodes | 192.168.51.15:9100 | 51 | apps-app2 host metrics |
| swarm-nodes | 192.168.51.30:9100 | 51 | apps-data host metrics |
| swarm-nodes | 192.168.51.40:9100 | 51 | apps-dev1 host metrics |
| swarm-cadvisor | 192.168.51.10:8080 | 51 | Container metrics (via app1) |
| swarm-cadvisor | 192.168.51.30:8080 | 51 | Container metrics (via data) |
| postgres | 192.168.51.30:9187 | 51 | PostgreSQL database metrics |
| redis | 192.168.51.30:9121 | 51 | Redis cache metrics |
| monitoring-server | 192.168.51.20:9100 | 51 | apps-monitoring host metrics |
| monitoring-cadvisor | 192.168.51.20:8080 | 51 | Container metrics (via monitoring) |
| proxmox-host | 192.168.1.5:9100 | Default | Proxmox host metrics (node_exporter) |
| dockerrr | 192.168.1.47:9100 | Default | dockerrr host metrics (node_exporter) |
| traefik-homelab | 192.168.1.47:8080 | Default | Homelab Traefik metrics |
| traefik-swarm | 192.168.50.10:8080 | 50 | Swarm Traefik metrics |
| infra-checks | 192.168.1.47:9101 | Default | Custom checks (Vault seal, Plex, HA) |
Config file: /opt/docker/homelab/prometheus/prometheus.yml
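A representative scrape job from that file might look like the following (a sketch only; the job name and targets match the table above, but scrape intervals and any relabeling are not reproduced here):

```yaml
scrape_configs:
  - job_name: swarm-nodes
    static_configs:
      - targets:
          - 192.168.50.10:9100   # apps-gateway (VLAN 50)
          - 192.168.51.10:9100   # apps-app1
          - 192.168.51.15:9100   # apps-app2
          - 192.168.51.30:9100   # apps-data
          - 192.168.51.40:9100   # apps-dev1
```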
Alerting¶
Alert Pipeline¶
```text
Prometheus (29 alert rules across 8 groups)
  -> Alertmanager (routing, grouping, silencing)
  -> Pushover (direct push notifications to iPhone)
```
Migration from Home Assistant
Alerting was migrated from Home Assistant webhooks to Pushover direct integration. This removes the HA dependency from the alerting pipeline. Pushover credentials are also stored in Vault at secret/pushover.
Alert Rule Groups¶
Config file: /opt/docker/homelab/prometheus/alerts.yml
| Group | Rules | Description |
|---|---|---|
| host_alerts | HostDown, HostHighCPU, HostHighMemory (>92%), HostDiskSpaceLow, HostDiskSpaceCritical | Host-level alerts for swarm nodes |
| container_alerts | ContainerHighCPU, ContainerHighMemory | Container resource alerts |
| database_alerts | PostgresDown, PostgresHighConnections, PostgresDeadlocks | PostgreSQL health |
| cache_alerts | RedisDown, RedisHighMemory | Redis health |
| application_alerts | TraefikHighErrorRate (5xx), TraefikHighLatency, TraefikTrafficSpike | Traefik application-level alerts |
| proxmox_alerts | ProxmoxHostDown, ProxmoxHighCPU, ProxmoxHighMemory, ProxmoxDiskSpace | Proxmox host health |
| dockerrr_alerts | DockerrrDown, DockerrrHighCPU, DockerrrHighMemory, DockerrrDiskSpace | dockerrr host health |
| infra_alerts | VaultSealed, VaultDown, PlexDown, HomeAssistantDown | Infrastructure service checks |
HostHighMemory Threshold
The HostHighMemory threshold was raised from 85% to 92% because swarm VMs previously had balloon-constrained memory, which inflated usage percentages. Ballooning is now disabled on all swarm VMs.
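For reference, the HostHighMemory rule at the 92% threshold would be shaped roughly like this (a sketch: the `for` duration, severity label, and annotation text are assumptions; the metric names are standard node_exporter metrics):

```yaml
groups:
  - name: host_alerts
    rules:
      - alert: HostHighMemory
        # Fires when used memory exceeds 92% of total for a sustained period
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 92
        for: 5m              # assumed duration
        labels:
          severity: warning  # assumed label
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
```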
Alertmanager Configuration¶
Config file: /opt/docker/homelab/alertmanager/alertmanager.yml
| Receiver | Method | Behavior |
|---|---|---|
| pushover-critical | Pushover (priority: emergency) | Immediate (0s wait), repeat 15m, breaks Do Not Disturb |
| pushover-warning | Pushover (priority: high) | 1m wait, repeat 4h, normal push notification |
Alerts are grouped by alertname and node. Resolved notifications are sent when alerts clear.
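Mapping the table above onto alertmanager.yml, the routing section looks roughly like this (a sketch: the severity matchers are assumptions, and the Pushover user key/token from Vault's secret/pushover are omitted):

```yaml
route:
  group_by: [alertname, node]
  receiver: pushover-warning
  routes:
    - matchers: [severity = critical]   # assumed matcher label
      receiver: pushover-critical
      group_wait: 0s                    # immediate
      repeat_interval: 15m
    - matchers: [severity = warning]
      receiver: pushover-warning
      group_wait: 1m
      repeat_interval: 4h

receivers:
  - name: pushover-critical
    pushover_configs:
      # user_key and token (from Vault secret/pushover) omitted here
      - priority: "2"   # Pushover "emergency": bypasses Do Not Disturb
  - name: pushover-warning
    pushover_configs:
      - priority: "1"   # Pushover "high"
```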
Managing Silences¶
```bash
# List active silences
curl -s http://192.168.1.47:9093/api/v2/silences | python3 -m json.tool

# Create a silence (e.g., for maintenance)
curl -X POST http://192.168.1.47:9093/api/v2/silences -H "Content-Type: application/json" -d '{
  "matchers": [{"name": "alertname", "value": "HostDown", "isRegex": false}],
  "startsAt": "2026-01-01T00:00:00Z",
  "endsAt": "2026-01-01T06:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance"
}'
```
Grafana Dashboards¶
Grafana uses file-based provisioning. Dashboards are stored in /opt/docker/homelab/grafana/provisioning/dashboards/json/.
| Dashboard | Source | Description |
|---|---|---|
| Infrastructure Overview | Custom | Node status, CPU, memory, disk, network, Postgres connections, Redis memory |
| Node Exporter Full | Community (#1860) | Detailed host metrics per node |
| PostgreSQL | Community (#9628) | Database connections, queries, cache hit ratio |
| Redis | Community (#11835) | Memory usage, commands, keys, clients |
Datasource: Prometheus at http://prometheus:9090 (same Docker network)
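The datasource itself is file-provisioned as well; a minimal provisioning entry consistent with the UID that dashboards reference would be (a sketch; the datasource name and flags are assumptions):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093   # dashboards must reference this UID
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```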
Adding a Dashboard¶
- Place the JSON file in `/opt/docker/homelab/grafana/provisioning/dashboards/json/`
- Ensure the datasource UID is `PBFA97CFB590B2093`
- Restart Grafana: `docker restart grafana`
Exporter Credentials¶
PostgreSQL Exporter¶
- User: `monitoring` with `pg_monitor` role (read-only)
- Password: Docker secret `monitoring_pg_password`
- Auth method: `DATA_SOURCE_PASS_FILE` pointing to `/run/secrets/monitoring_pg_password`
Redis Exporter¶
- Password: Docker secret `monitoring_redis_password`
- Format: `{"redis://data_redis:6379": "<password>"}`
- Auth method: `--redis.password-file=/run/secrets/monitoring_redis_password`
Redis Password File Format
The oliver006/redis_exporter image is a scratch/distroless image with no shell. The --redis.password-file flag requires JSON format mapping the Redis URL to the password, not plain text.
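A hedged sketch of generating that JSON password file before creating the Docker secret (the password value here is a placeholder for illustration):

```shell
# redis_exporter's --redis.password-file expects JSON mapping the Redis
# URL to the password, not a bare password string.
REDIS_PASS='example-password'   # placeholder; the real value is secret
printf '{"redis://data_redis:6379": "%s"}' "$REDIS_PASS"
# Pipe the same output into the secret on a Swarm manager:
#   printf '...' | docker secret create monitoring_redis_password -
```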
Log Management¶
Promtail¶
Promtail runs as a global service on all 5 nodes, collecting container logs and shipping them to Loki for centralized log aggregation.
Docker Log Rotation¶
The Docker daemon on each node is configured with the json-file logging driver, with size-based log rotation to keep container logs bounded.
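The exact rotation limits are not reproduced here; a typical `/etc/docker/daemon.json` for this setup looks like the following (the `max-size` and `max-file` values are assumptions):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```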
Health Checks¶
Quick Cluster Status¶
```bash
# All nodes healthy?
docker node ls

# All services running expected replicas?
docker service ls

# Any failed tasks?
docker service ls --format "{{.Name}} {{.Replicas}}" | grep -v "1/1\|2/2\|5/5"
```
Monitoring Stack Health¶
```bash
# Check all Prometheus targets
curl -s http://192.168.1.47:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels'].get('job',''):>15} | {t['labels'].get('instance',''):>25} | {t['health']}\")"

# Check active alerts
curl -s http://192.168.1.47:9093/api/v2/alerts | python3 -m json.tool

# Grafana health
curl -s http://192.168.1.47:3000/api/health
```
Disaster Recovery¶
Node Failure¶
```bash
# Drain failed node
docker node update --availability drain {node}

# Force redistribute services
docker service ls --format "{{.Name}}" | xargs -I {} docker service update --force {}

# When repaired, reactivate
docker node update --availability active {node}
```
Service Rollback¶
```bash
# Immediate rollback to previous version
docker service rollback {service-name}

# Deploy specific version
docker service update --image registry.apps.jlwaller.com/{image}:{tag} {service-name}
```