# Infrastructure Monitoring

Comprehensive monitoring solutions for all infrastructure components in Sankofa Phoenix.

## Overview

This directory contains custom Prometheus exporters, Grafana dashboards, and alerting rules for infrastructure monitoring.

## Components

### Exporters (`exporters/`)

Custom Prometheus exporters for:
- Proxmox VE metrics
- TP-Link Omada metrics
- Network switch/router metrics
- Infrastructure health checks

### Dashboards (`dashboards/`)

Grafana dashboards for:
- Infrastructure overview
- Proxmox cluster health
- Network performance
- Omada controller status
- Site-level monitoring

## Exporters

### Proxmox Exporter

The Proxmox exporter (`pve_exporter`) provides metrics for:
- VM status and resource usage
- Node health and performance
- Storage pool utilization
- Network interface statistics
- Cluster status

**Installation:**
```bash
# The PyPI package is prometheus-pve-exporter; it installs the pve_exporter command
pip install prometheus-pve-exporter
```

**Configuration:**
```yaml
exporter:
  listen_address: 0.0.0.0:9221
proxmox:
  endpoint: https://pve1.sankofa.nexus:8006
  username: monitoring@pam
  password: ${PROXMOX_PASSWORD}
```

### Omada Exporter

Custom exporter for TP-Link Omada Controller metrics:
- Access point status
- Client device counts
- Network throughput
- Controller health

**See**: `exporters/omada_exporter/` for implementation

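For orientation, a custom exporter of this kind ultimately serves its values in the Prometheus text exposition format. The sketch below shows that rendering step using only the Python standard library. It is not the code in `exporters/omada_exporter/`: the function names, the `ap` label, and the 1=online/0=offline encoding are assumptions made for illustration.

```python
def escape_label(value):
    """Escape a label value as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(ap_status, clients_total):
    """Render two Omada-style gauges as Prometheus exposition text.

    ap_status maps an access point name to True (online) or False (offline).
    The "ap" label and the 1/0 encoding are assumptions of this sketch.
    """
    lines = [
        "# HELP omada_ap_status Access point status (1=online, 0=offline)",
        "# TYPE omada_ap_status gauge",
    ]
    # Sort by name so the output is stable between scrapes.
    for name in sorted(ap_status):
        value = 1 if ap_status[name] else 0
        lines.append(f'omada_ap_status{{ap="{escape_label(name)}"}} {value}')
    lines += [
        "# HELP omada_clients_total Total client count",
        "# TYPE omada_clients_total gauge",
        f"omada_clients_total {clients_total}",
    ]
    # The format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample values, not real controller data.
    print(render_metrics({"ap-lobby": True, "ap-roof": False}, 42), end="")
```

A real exporter would fetch these values from the Omada Controller and serve the rendered text over HTTP at `/metrics`; both steps are left out here.
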
### Network Exporter

SNMP-based exporter for network devices:
- Switch port statistics
- Router interface metrics
- VLAN utilization
- Network topology changes

**See**: `exporters/network_exporter/` for implementation

## Dashboards

### Infrastructure Overview

Comprehensive dashboard showing:
- All sites status
- Resource utilization
- Health scores
- Alert summary

**Location**: `dashboards/infrastructure-overview.json`

### Proxmox Cluster

Dashboard for Proxmox clusters:
- Cluster health
- Node performance
- VM resource usage
- Storage utilization

**Location**: `dashboards/proxmox-cluster.json`

### Network Performance

Network performance dashboard:
- Bandwidth utilization
- Latency metrics
- Error rates
- Top talkers

**Location**: `dashboards/network-performance.json`

### Omada Controller

Omada-specific dashboard:
- Controller status
- Access point health
- Client statistics
- Network policies

**Location**: `dashboards/omada-controller.json`

## Installation

### Deploy Exporters

```bash
# Deploy all exporters
kubectl apply -f exporters/manifests/

# Or deploy individually
kubectl apply -f exporters/manifests/proxmox-exporter.yaml
kubectl apply -f exporters/manifests/omada-exporter.yaml
```

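### Import Dashboards

```bash
# Import all dashboards to Grafana
./scripts/import-dashboards.sh

# Or import one dashboard through the Grafana HTTP API
# (GRAFANA_URL and GRAFANA_API_TOKEN are placeholders for your instance)
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboards/infrastructure-overview.json |
  curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```
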
## Configuration

### Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets:
          - 'pve-exporter.monitoring.svc.cluster.local:9221'

  - job_name: 'omada'
    static_configs:
      - targets:
          - 'omada-exporter.monitoring.svc.cluster.local:9222'

  - job_name: 'network'
    static_configs:
      - targets:
          - 'network-exporter.monitoring.svc.cluster.local:9223'
```

### Alerting Rules

Alert rules are defined in `exporters/alert-rules/`:

- `proxmox-alerts.yaml`: Proxmox cluster alerts
- `omada-alerts.yaml`: Omada controller alerts
- `network-alerts.yaml`: Network infrastructure alerts

## Metrics

### Proxmox Metrics

- `pve_node_status`: Node status (0=offline, 1=online)
- `pve_vm_status`: VM status
- `pve_storage_used_bytes`: Storage usage
- `pve_network_rx_bytes`: Network receive bytes
- `pve_network_tx_bytes`: Network transmit bytes

### Omada Metrics

- `omada_ap_status`: Access point status
- `omada_clients_total`: Total client count
- `omada_throughput_bytes`: Network throughput
- `omada_controller_status`: Controller health

### Network Metrics

- `network_port_status`: Switch port status
- `network_port_rx_bytes`: Port receive bytes
- `network_port_tx_bytes`: Port transmit bytes
- `network_vlan_utilization`: VLAN utilization

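When checking an exporter by hand, it can help to read these metrics out of a saved `/metrics` response. The following is a minimal sketch of that, assuming the simple `name{label="value"} number` line form with no timestamps and no commas or escaped quotes inside label values. The helper names and the `node` label are hypothetical; only the metric names come from the lists above.

```python
def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated field.
        head, _, value = line.rpartition(" ")
        name, brace, rest = head.partition("{")
        labels = {}
        if brace:
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def offline_nodes(text):
    """Return node names whose pve_node_status is 0 (offline)."""
    return sorted(
        labels.get("node", "unknown")  # "node" label is an assumption
        for name, labels, value in parse_samples(text)
        if name == "pve_node_status" and value == 0
    )


if __name__ == "__main__":
    # Hypothetical scrape output for illustration.
    sample = (
        "# TYPE pve_node_status gauge\n"
        'pve_node_status{node="pve1"} 1\n'
        'pve_node_status{node="pve2"} 0\n'
    )
    print(offline_nodes(sample))
```

For anything beyond a quick check, a full parser for the exposition format is the safer choice, since this sketch ignores timestamps and escaped label values.
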
## Alerts

### Critical Alerts

- Proxmox cluster node down
- Omada controller unreachable
- Network switch offline
- High resource utilization (>90%)

### Warning Alerts

- High resource utilization (>80%)
- Network latency spikes
- Access point offline
- Storage pool >80% full

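A utilization reading above 90% satisfies both the warning and the critical condition, so the two thresholds need an evaluation order. The sketch below checks the critical threshold first so that each reading maps to exactly one severity. The function and constant names are hypothetical; the actual rules are the ones defined in the alert rule files, and this only illustrates the threshold logic.

```python
WARNING_THRESHOLD = 80.0   # percent, from the warning list above
CRITICAL_THRESHOLD = 90.0  # percent, from the critical list above


def classify_utilization(percent):
    """Map a utilization percentage to "critical", "warning", or "ok".

    Both comparisons are strict, matching the ">90%" and ">80%" wording.
    """
    if percent > CRITICAL_THRESHOLD:
        return "critical"
    if percent > WARNING_THRESHOLD:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for reading in (75.0, 85.0, 95.0):
        print(reading, classify_utilization(reading))
```
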
## Troubleshooting

### Exporter Issues

```bash
# Check exporter status
kubectl get pods -n monitoring -l app=proxmox-exporter

# View exporter logs
kubectl logs -n monitoring -l app=proxmox-exporter

# Test exporter endpoint
curl http://proxmox-exporter.monitoring.svc.cluster.local:9221/metrics
```

### Dashboard Issues

```bash
# Verify dashboard import via the Grafana HTTP API
# (GRAFANA_URL and GRAFANA_API_TOKEN are placeholders for your instance)
curl -sS -H "Authorization: Bearer $GRAFANA_API_TOKEN" "$GRAFANA_URL/api/search?type=dash-db"

# Check dashboard data sources
# In Grafana UI: Configuration > Data Sources
```

## Related Documentation

- [Proxmox Management](../proxmox/README.md)
- [Omada Management](../omada/README.md)
- [Network Management](../network/README.md)
- [Infrastructure Management](../README.md)