# Infrastructure Monitoring

Comprehensive monitoring solutions for all infrastructure components in Sankofa Phoenix.

## Overview

This directory contains monitoring components including custom Prometheus exporters, Grafana dashboards, and alerting rules for infrastructure monitoring.

## Components

### Exporters (`exporters/`)

Custom Prometheus exporters for:

- Proxmox VE metrics
- TP-Link Omada metrics
- Network switch/router metrics
- Infrastructure health checks

### Dashboards (`dashboards/`)

Grafana dashboards for:

- Infrastructure overview
- Proxmox cluster health
- Network performance
- Omada controller status
- Site-level monitoring

## Exporters

### Proxmox Exporter

The Proxmox exporter (`pve_exporter`) provides metrics for:

- VM status and resource usage
- Node health and performance
- Storage pool utilization
- Network interface statistics
- Cluster status

**Installation:**

```bash
pip install pve_exporter
```

**Configuration:**

```yaml
exporter:
  listen_address: 0.0.0.0:9221
proxmox:
  endpoint: https://pve1.sankofa.nexus:8006
  username: monitoring@pam
  password: ${PROXMOX_PASSWORD}
```

### Omada Exporter

Custom exporter for TP-Link Omada Controller metrics:

- Access point status
- Client device counts
- Network throughput
- Controller health

**See**: `exporters/omada_exporter/` for implementation

### Network Exporter

SNMP-based exporter for network devices:

- Switch port statistics
- Router interface metrics
- VLAN utilization
- Network topology changes

**See**: `exporters/network_exporter/` for implementation

## Dashboards

### Infrastructure Overview

Comprehensive dashboard showing:

- All sites status
- Resource utilization
- Health scores
- Alert summary

**Location**: `dashboards/infrastructure-overview.json`

### Proxmox Cluster

Dashboard for Proxmox clusters:

- Cluster health
- Node performance
- VM resource usage
- Storage utilization

**Location**: `dashboards/proxmox-cluster.json`

### Network Performance

Network performance dashboard:

- Bandwidth utilization
- Latency metrics
- Error rates
- Top talkers

**Location**: `dashboards/network-performance.json`

### Omada Controller

Omada-specific dashboard:

- Controller status
- Access point health
- Client statistics
- Network policies

**Location**: `dashboards/omada-controller.json`

## Installation

### Deploy Exporters

```bash
# Deploy all exporters
kubectl apply -f exporters/manifests/

# Or deploy individually
kubectl apply -f exporters/manifests/proxmox-exporter.yaml
kubectl apply -f exporters/manifests/omada-exporter.yaml
```

### Import Dashboards

```bash
# Import all dashboards to Grafana
./scripts/import-dashboards.sh

# Or import individually
grafana-cli admin import-dashboard dashboards/infrastructure-overview.json
```

## Configuration

### Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets:
          - 'pve-exporter.monitoring.svc.cluster.local:9221'

  - job_name: 'omada'
    static_configs:
      - targets:
          - 'omada-exporter.monitoring.svc.cluster.local:9222'

  - job_name: 'network'
    static_configs:
      - targets:
          - 'network-exporter.monitoring.svc.cluster.local:9223'
```

### Alerting Rules

Alert rules are defined in `exporters/alert-rules/`:

- `proxmox-alerts.yaml`: Proxmox cluster alerts
- `omada-alerts.yaml`: Omada controller alerts
- `network-alerts.yaml`: Network infrastructure alerts

## Metrics

### Proxmox Metrics

- `pve_node_status`: Node status (0=offline, 1=online)
- `pve_vm_status`: VM status
- `pve_storage_used_bytes`: Storage usage
- `pve_network_rx_bytes`: Network receive bytes
- `pve_network_tx_bytes`: Network transmit bytes

### Omada Metrics

- `omada_ap_status`: Access point status
- `omada_clients_total`: Total client count
- `omada_throughput_bytes`: Network throughput
- `omada_controller_status`: Controller health

### Network Metrics

- `network_port_status`: Switch port status
- `network_port_rx_bytes`: Port receive bytes
- `network_port_tx_bytes`: Port transmit bytes
- `network_vlan_utilization`: VLAN utilization
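
To illustrate how the metrics listed above might be consumed outside Prometheus, the following minimal sketch parses a snippet of the Prometheus text exposition format and lists nodes whose `pve_node_status` is 0. The `node` and `storage` labels, the sample values, and the helper names `parse_samples` and `offline_nodes` are assumptions made for illustration; the actual exporter output may use different labels. The parser is deliberately simplified and does not handle escaped quotes or braces inside label values.

```python
import re

# Matches lines such as: pve_node_status{node="pve1"} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # ignore lines whose value is not numeric
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


def offline_nodes(text):
    """Return the sorted `node` labels whose pve_node_status is 0 (offline)."""
    return sorted(
        labels.get("node", "unknown")  # `node` label name is an assumption
        for name, labels, value in parse_samples(text)
        if name == "pve_node_status" and value == 0
    )


# Hypothetical exporter output, not captured from a real deployment.
EXAMPLE = """\
# HELP pve_node_status Node status (0=offline, 1=online)
# TYPE pve_node_status gauge
pve_node_status{node="pve1"} 1
pve_node_status{node="pve2"} 0
pve_storage_used_bytes{node="pve1",storage="local"} 1.5e+10
"""

if __name__ == "__main__":
    print(offline_nodes(EXAMPLE))
```

In practice the same check would normally live in a Prometheus alert rule; a script like this is mainly useful for quickly inspecting the raw `/metrics` output of an exporter.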
## Alerts

### Critical Alerts

- Proxmox cluster node down
- Omada controller unreachable
- Network switch offline
- High resource utilization (>90%)

### Warning Alerts

- High resource utilization (>80%)
- Network latency spikes
- Access point offline
- Storage pool >80% full

## Troubleshooting

### Exporter Issues

```bash
# Check exporter status
kubectl get pods -n monitoring -l app=proxmox-exporter

# View exporter logs
kubectl logs -n monitoring -l app=proxmox-exporter

# Test exporter endpoint
curl http://proxmox-exporter.monitoring.svc.cluster.local:9221/metrics
```

### Dashboard Issues

```bash
# Verify dashboard import
grafana-cli admin ls-dashboard

# Check dashboard data sources
# In Grafana UI: Configuration > Data Sources
```

## Related Documentation

- [Proxmox Management](../proxmox/README.md)
- [Omada Management](../omada/README.md)
- [Network Management](../network/README.md)
- [Infrastructure Management](../README.md)
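
As a small illustration of the resource utilization thresholds listed under Alerts (warning above 80%, critical above 90%), the sketch below maps a utilization percentage to a severity. The function name `classify_utilization` and the `"ok"` label are hypothetical; the real rules are the ones defined in `exporters/alert-rules/`, which this sketch does not reproduce.

```python
def classify_utilization(percent, warning=80.0, critical=90.0):
    """Map a utilization percentage to 'ok', 'warning', or 'critical'.

    Thresholds are strict (> rather than >=), mirroring the '>80%' and
    '>90%' wording in the Alerts section.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"utilization must be within 0-100, got {percent}")
    if percent > critical:
        return "critical"
    if percent > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for sample in (42.0, 85.0, 95.0):
        print(sample, classify_utilization(sample))
```

Keeping the thresholds as parameters makes it easy to try different cut-offs before changing the actual alert rules.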