Monitoring¶
The monitoring stack collects metrics from all 5 Swarm nodes, the Proxmox host, dockerrr, and data services. 29 Prometheus alert rules across 8 groups route through Alertmanager directly to Pushover for iPhone push notifications. Grafana dashboards provide visualization.
Architecture Note
Prometheus, Grafana, and Alertmanager run on dockerrr (192.168.1.47, Default VLAN) rather than the Swarm cluster. This is because dockerrr can reach all VLANs (50, 51, and IOT), while VLAN 51 nodes cannot reach VLAN 50 or the Default VLAN due to firewall rules.
Monitoring Stack¶
Central Services (dockerrr, 192.168.1.47)¶
Running via Docker Compose at /opt/docker/homelab/docker-compose.yml:
| Service | URL | Port | Purpose |
|---|---|---|---|
| Prometheus | prometheus.home.jlwaller.com | 9090 | Metrics collection, storage (30d retention), alerting rules |
| Grafana | grafana.home.jlwaller.com | 3000 | Dashboards and visualization |
| Alertmanager | alertmanager.home.jlwaller.com | 9093 | Alert routing and notification |
All three are accessible via Traefik with Let's Encrypt TLS (Cloudflare DNS challenge).
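As a sketch of how one of these services is exposed through Traefik (label syntax follows standard Traefik v2 Docker conventions; the router and cert-resolver names here are assumptions, not copied from the actual compose file):

```yaml
services:
  grafana:
    image: grafana/grafana
    labels:
      - "traefik.enable=true"
      # Hostname from the table above; router name "grafana" is an assumption
      - "traefik.http.routers.grafana.rule=Host(`grafana.home.jlwaller.com`)"
      - "traefik.http.routers.grafana.tls=true"
      # Resolver name is hypothetical; it must match the resolver configured
      # for the Let's Encrypt Cloudflare DNS-01 challenge
      - "traefik.http.routers.grafana.tls.certresolver=cloudflare"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"
```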
Cluster Exporters (Swarm)¶
The monitoring-exporters stack deploys exporters across the Swarm cluster.
Stack file: /opt/swarm/stacks/monitoring-exporters/exporters-stack.yml
| Exporter | Mode | Port | Metrics |
|---|---|---|---|
| cadvisor | Global (5/5) | -- | Container CPU, memory, network, disk I/O |
| node_exporter | Global (5/5) | 9100 | Host CPU, memory, disk, network |
| promtail | Global (5/5) | -- | Log aggregation (ships to Loki) |
| postgres_exporter | Replicated (1) | 9187 | PostgreSQL stats, connections, replication |
| redis_exporter | Replicated (1) | 9121 | Redis memory, commands, keys, clients |
Postgres and Redis exporters are constrained to apps-data and connect via the data_net overlay network.
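As an illustration, the placement constraint and network attachment for the replicated exporters look roughly like this in the stack file (a sketch, not the real file; the image name and constraint form are assumptions, and the real stack may pin the node via a label instead):

```yaml
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    networks:
      - data_net              # overlay network shared with the data services
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.hostname == apps-data   # assumed constraint form

networks:
  data_net:
    external: true
```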
Prometheus Targets¶
All 17 targets are scraped from dockerrr across VLANs:
| Job | Target | VLAN | Metrics |
|---|---|---|---|
| prometheus | localhost:9090 | -- | Self-monitoring |
| swarm-nodes | 192.168.50.10:9100 | 50 | apps-gateway host metrics |
| swarm-nodes | 192.168.51.10:9100 | 51 | apps-app1 host metrics |
| swarm-nodes | 192.168.51.15:9100 | 51 | apps-app2 host metrics |
| swarm-nodes | 192.168.51.30:9100 | 51 | apps-data host metrics |
| swarm-nodes | 192.168.51.40:9100 | 51 | apps-dev1 host metrics |
| swarm-cadvisor | 192.168.51.10:8080 | 51 | Container metrics (via app1) |
| swarm-cadvisor | 192.168.51.30:8080 | 51 | Container metrics (via data) |
| postgres | 192.168.51.30:9187 | 51 | PostgreSQL database metrics |
| redis | 192.168.51.30:9121 | 51 | Redis cache metrics |
| monitoring-server | 192.168.51.20:9100 | 51 | apps-monitoring host metrics |
| monitoring-cadvisor | 192.168.51.20:8080 | 51 | Container metrics (via monitoring) |
| proxmox-host | 192.168.1.5:9100 | Default | Proxmox host metrics (node_exporter) |
| dockerrr | 192.168.1.47:9100 | Default | dockerrr host metrics (node_exporter) |
| traefik-homelab | 192.168.1.47:8080 | Default | Homelab Traefik metrics |
| traefik-swarm | 192.168.50.10:8080 | 50 | Swarm Traefik metrics |
| infra-checks | 192.168.1.47:9101 | Default | Custom checks (Vault seal, Plex, HA) |
Config file: /opt/docker/homelab/prometheus/prometheus.yml
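A representative scrape job from that file might look like the following (a sketch only; the job name and targets match the table above, but scrape intervals and any relabeling are not reproduced here):

```yaml
scrape_configs:
  - job_name: swarm-nodes
    static_configs:
      - targets:
          - 192.168.50.10:9100   # apps-gateway (VLAN 50)
          - 192.168.51.10:9100   # apps-app1
          - 192.168.51.15:9100   # apps-app2
          - 192.168.51.30:9100   # apps-data
          - 192.168.51.40:9100   # apps-dev1
```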
Alerting¶
Alert Pipeline¶
```text
Prometheus (29 alert rules across 8 groups)
  -> Alertmanager (routing, grouping, silencing)
  -> Pushover (direct push notifications to iPhone)
```
Migration from Home Assistant
Alerting was migrated from Home Assistant webhooks to Pushover direct integration. This removes the HA dependency from the alerting pipeline. Pushover credentials are also stored in Vault at secret/pushover.
Alert Rule Groups¶
Config file: /opt/docker/homelab/prometheus/alerts.yml
| Group | Rules | Description |
|---|---|---|
| host_alerts | HostDown, HostHighCPU, HostHighMemory (>92%), HostDiskSpaceLow, HostDiskSpaceCritical | Host-level alerts for swarm nodes |
| container_alerts | ContainerHighCPU, ContainerHighMemory | Container resource alerts |
| database_alerts | PostgresDown, PostgresHighConnections, PostgresDeadlocks | PostgreSQL health |
| cache_alerts | RedisDown, RedisHighMemory | Redis health |
| application_alerts | TraefikHighErrorRate (5xx), TraefikHighLatency, TraefikTrafficSpike | Traefik application-level alerts |
| proxmox_alerts | ProxmoxHostDown, ProxmoxHighCPU, ProxmoxHighMemory, ProxmoxDiskSpace | Proxmox host health |
| dockerrr_alerts | DockerrrDown, DockerrrHighCPU, DockerrrHighMemory, DockerrrDiskSpace | dockerrr host health |
| infra_alerts | VaultSealed, VaultDown, PlexDown, HomeAssistantDown | Infrastructure service checks |
HostHighMemory Threshold
The HostHighMemory threshold was raised from 85% to 92% because swarm VMs previously had balloon-constrained memory, which inflated usage percentages. Ballooning is now disabled on all swarm VMs.
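For reference, the HostHighMemory rule at the 92% threshold would be shaped roughly like this (a sketch: the `for` duration, severity label, and annotation text are assumptions; the metric names are standard node_exporter metrics):

```yaml
groups:
  - name: host_alerts
    rules:
      - alert: HostHighMemory
        # Fires when used memory exceeds 92% of total for a sustained period
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 92
        for: 5m              # assumed duration
        labels:
          severity: warning  # assumed label
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
```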
Alertmanager Configuration¶
Config file: /opt/docker/homelab/alertmanager/alertmanager.yml
| Receiver | Method | Behavior |
|---|---|---|
| pushover-critical | Pushover (priority: emergency) | Immediate (0s wait), repeat 15m, breaks Do Not Disturb |
| pushover-warning | Pushover (priority: high) | 1m wait, repeat 4h, normal push notification |
Alerts are grouped by alertname and node. Resolved notifications are sent when alerts clear.
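Mapping the table above onto alertmanager.yml, the routing section looks roughly like this (a sketch: the severity matchers are assumptions, and the Pushover user key/token from Vault's secret/pushover are omitted):

```yaml
route:
  group_by: [alertname, node]
  receiver: pushover-warning
  routes:
    - matchers: [severity = critical]   # assumed matcher label
      receiver: pushover-critical
      group_wait: 0s                    # immediate
      repeat_interval: 15m
    - matchers: [severity = warning]
      receiver: pushover-warning
      group_wait: 1m
      repeat_interval: 4h

receivers:
  - name: pushover-critical
    pushover_configs:
      # user_key and token (from Vault secret/pushover) omitted here
      - priority: "2"   # Pushover "emergency": bypasses Do Not Disturb
  - name: pushover-warning
    pushover_configs:
      - priority: "1"   # Pushover "high"
```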
Managing Silences¶
```bash
# List active silences
curl -s http://192.168.1.47:9093/api/v2/silences | python3 -m json.tool

# Create a silence (e.g., for maintenance)
curl -X POST http://192.168.1.47:9093/api/v2/silences -H "Content-Type: application/json" -d '{
  "matchers": [{"name": "alertname", "value": "HostDown", "isRegex": false}],
  "startsAt": "2026-01-01T00:00:00Z",
  "endsAt": "2026-01-01T06:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance"
}'
```
Grafana Dashboards¶
Grafana uses file-based provisioning. Dashboards are stored in /opt/docker/homelab/grafana/provisioning/dashboards/json/.
| Dashboard | Source | Description |
|---|---|---|
| Infrastructure Overview | Custom | Node status, CPU, memory, disk, network, Postgres connections, Redis memory |
| Node Exporter Full | Community (#1860) | Detailed host metrics per node |
| PostgreSQL | Community (#9628) | Database connections, queries, cache hit ratio |
| Redis | Community (#11835) | Memory usage, commands, keys, clients |
Datasource: Prometheus at http://prometheus:9090 (same Docker network)
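The datasource itself is file-provisioned as well; a minimal provisioning entry consistent with the UID that dashboards reference would be (a sketch; the datasource name and flags are assumptions):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: PBFA97CFB590B2093   # dashboards must reference this UID
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```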
Adding a Dashboard¶
- Place the JSON file in `/opt/docker/homelab/grafana/provisioning/dashboards/json/`
- Ensure the datasource UID is `PBFA97CFB590B2093`
- Restart Grafana: `docker restart grafana`
Exporter Credentials¶
PostgreSQL Exporter¶
- User: `monitoring` with `pg_monitor` role (read-only)
- Password: Docker secret `monitoring_pg_password`
- Auth method: `DATA_SOURCE_PASS_FILE` pointing to `/run/secrets/monitoring_pg_password`
Redis Exporter¶
- Password: Docker secret `monitoring_redis_password`
- Format: `{"redis://data_redis:6379": "<password>"}`
- Auth method: `--redis.password-file=/run/secrets/monitoring_redis_password`
Redis Password File Format
The oliver006/redis_exporter image is a scratch/distroless image with no shell. The --redis.password-file flag requires JSON format mapping the Redis URL to the password, not plain text.
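A hedged sketch of generating that JSON password file before creating the Docker secret (the password value here is a placeholder for illustration):

```shell
# redis_exporter's --redis.password-file expects JSON mapping the Redis
# URL to the password, not a bare password string.
REDIS_PASS='example-password'   # placeholder; the real value is secret
printf '{"redis://data_redis:6379": "%s"}' "$REDIS_PASS"
# Pipe the same output into the secret on a Swarm manager:
#   printf '...' | docker secret create monitoring_redis_password -
```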
Log Management¶
Promtail¶
Promtail runs as a global service on all 5 nodes, collecting container logs and shipping them to Loki for centralized log aggregation.
Docker Log Rotation¶
The Docker daemon on each node is configured with the json-file logging driver, with size-based log rotation to keep container logs bounded.
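The exact rotation limits are not reproduced here; a typical `/etc/docker/daemon.json` for this setup looks like the following (the `max-size` and `max-file` values are assumptions):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```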
Health Checks¶
Quick Cluster Status¶
```bash
# All nodes healthy?
docker node ls

# All services running expected replicas?
docker service ls

# Any failed tasks?
docker service ls --format "{{.Name}} {{.Replicas}}" | grep -v "1/1\|2/2\|5/5"
```
Monitoring Stack Health¶
```bash
# Check all Prometheus targets
curl -s http://192.168.1.47:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels'].get('job',''):>15} | {t['labels'].get('instance',''):>25} | {t['health']}\")"

# Check active alerts
curl -s http://192.168.1.47:9093/api/v2/alerts | python3 -m json.tool

# Grafana health
curl -s http://192.168.1.47:3000/api/health
```
Disaster Recovery¶
Node Failure¶
```bash
# Drain failed node
docker node update --availability drain {node}

# Force redistribute services
docker service ls --format "{{.Name}}" | xargs -I {} docker service update --force {}

# When repaired, reactivate
docker node update --availability active {node}
```
Service Rollback¶
```bash
# Immediate rollback to previous version
docker service rollback {service-name}

# Deploy specific version
docker service update --image registry.apps.jlwaller.com/{image}:{tag} {service-name}
```