# Infrastructure Monitoring

Monitoring for the infrastructure components of Sankofa Phoenix: Proxmox VE clusters, TP-Link Omada controllers, and network switches and routers.

## Overview

This directory contains custom Prometheus exporters, Grafana dashboards, and alerting rules.

## Components

### Exporters (`exporters/`)

Custom Prometheus exporters for:
- Proxmox VE metrics
- TP-Link Omada metrics
- Network switch/router metrics
- Infrastructure health checks

### Dashboards (`dashboards/`)

Grafana dashboards for:
- Infrastructure overview
- Proxmox cluster health
- Network performance
- Omada controller status
- Site-level monitoring

## Exporters

### Proxmox Exporter

The Proxmox exporter (`pve_exporter`) provides metrics for:
- VM status and resource usage
- Node health and performance
- Storage pool utilization
- Network interface statistics
- Cluster status

**Installation:**
```bash
# The PyPI package is prometheus-pve-exporter; it provides the pve_exporter command
pip install prometheus-pve-exporter
```

**Configuration:**
```yaml
exporter:
  listen_address: 0.0.0.0:9221
  proxmox:
    endpoint: https://pve1.sankofa.nexus:8006
    username: monitoring@pam
    password: ${PROXMOX_PASSWORD}
```
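The configuration above references `${PROXMOX_PASSWORD}` instead of a literal password. This README does not say whether the exporter expands that placeholder itself or whether deployment tooling does it, so the following is only a minimal sketch of one way to resolve such placeholders from the environment. The function name `expand_env` is hypothetical.

```python
import os
import re

# Matches ${NAME} placeholders such as ${PROXMOX_PASSWORD}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace ${NAME} placeholders in a config value with environment values.

    Raises KeyError when a referenced variable is not set, so a missing
    secret fails loudly instead of being sent to Proxmox as literal text.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)
```

For example, `expand_env("${PROXMOX_PASSWORD}")` returns the value of the `PROXMOX_PASSWORD` environment variable, and a string without placeholders is returned unchanged.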

### Omada Exporter

Custom exporter for TP-Link Omada Controller metrics:
- Access point status
- Client device counts
- Network throughput
- Controller health

**See**: `exporters/omada_exporter/` for implementation

### Network Exporter

SNMP-based exporter for network devices:
- Switch port statistics
- Router interface metrics
- VLAN utilization
- Network topology changes

**See**: `exporters/network_exporter/` for implementation

## Dashboards

### Infrastructure Overview

Comprehensive dashboard showing:
- All sites status
- Resource utilization
- Health scores
- Alert summary

**Location**: `dashboards/infrastructure-overview.json`

### Proxmox Cluster

Dashboard for Proxmox clusters:
- Cluster health
- Node performance
- VM resource usage
- Storage utilization

**Location**: `dashboards/proxmox-cluster.json`

### Network Performance

Network performance dashboard:
- Bandwidth utilization
- Latency metrics
- Error rates
- Top talkers

**Location**: `dashboards/network-performance.json`

### Omada Controller

Omada-specific dashboard:
- Controller status
- Access point health
- Client statistics
- Network policies

**Location**: `dashboards/omada-controller.json`

## Installation

### Deploy Exporters

```bash
# Deploy all exporters
kubectl apply -f exporters/manifests/

# Or deploy individually
kubectl apply -f exporters/manifests/proxmox-exporter.yaml
kubectl apply -f exporters/manifests/omada-exporter.yaml
```

### Import Dashboards

```bash
# Import all dashboards to Grafana
./scripts/import-dashboards.sh

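# Or import individually via the Grafana HTTP API
# (GRAFANA_URL and GRAFANA_TOKEN are placeholders for your instance and API token)
jq '{dashboard: ., overwrite: true}' dashboards/infrastructure-overview.json |
  curl -X POST "$GRAFANA_URL/api/dashboards/db" -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" --data-binary @-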
```
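Grafana's HTTP API accepts dashboards at `POST /api/dashboards/db`, with the dashboard JSON wrapped in a `dashboard` field. The sketch below shows how a single dashboard file could be imported that way; it is not the content of `scripts/import-dashboards.sh`, and the function names, `grafana_url`, and `token` are assumptions.

```python
import json
import urllib.request


def build_import_payload(dashboard, overwrite=True):
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db."""
    body = dict(dashboard)
    # Clear the numeric id so Grafana matches the dashboard by uid rather than
    # by an id that belonged to the instance it was exported from.
    body["id"] = None
    return {"dashboard": body, "overwrite": overwrite}


def import_dashboard(path, grafana_url, token):
    """Upload one dashboard JSON file and return Grafana's JSON response."""
    with open(path, encoding="utf-8") as handle:
        dashboard = json.load(handle)
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(build_import_payload(dashboard)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A call such as `import_dashboard("dashboards/proxmox-cluster.json", "https://grafana.example", token)` would upload one file; the URL here is a placeholder.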

## Configuration

### Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: 'proxmox'
    static_configs:
      - targets:
        - 'proxmox-exporter.monitoring.svc.cluster.local:9221'
  
  - job_name: 'omada'
    static_configs:
      - targets:
        - 'omada-exporter.monitoring.svc.cluster.local:9222'
  
  - job_name: 'network'
    static_configs:
      - targets:
        - 'network-exporter.monitoring.svc.cluster.local:9223'
```

### Alerting Rules

Alert rules are defined in `exporters/alert-rules/`:

- `proxmox-alerts.yaml`: Proxmox cluster alerts
- `omada-alerts.yaml`: Omada controller alerts
- `network-alerts.yaml`: Network infrastructure alerts

## Metrics

### Proxmox Metrics

- `pve_node_status`: Node status (0=offline, 1=online)
- `pve_vm_status`: VM status
- `pve_storage_used_bytes`: Storage space used, in bytes
- `pve_network_rx_bytes`: Bytes received on a network interface
- `pve_network_tx_bytes`: Bytes transmitted on a network interface
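As a quick way to inspect these metrics outside Prometheus, the following sketch parses the text returned by an exporter's `/metrics` endpoint and lists nodes whose `pve_node_status` is 0. The `node` label name is an assumption, since this README does not document the labels, and the parser is deliberately simplified (it does not unescape label values).

```python
import re

# metric_name{labels} value [timestamp]; the label block is optional.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples from Prometheus exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def offline_nodes(text):
    """List the node label of every pve_node_status sample equal to 0."""
    return [
        labels.get("node", "<unknown>")
        for name, labels, value in parse_samples(text)
        if name == "pve_node_status" and value == 0
    ]
```

Feeding it the output of the `curl .../metrics` command from the Troubleshooting section gives a quick list of offline nodes without opening Grafana.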

### Omada Metrics

- `omada_ap_status`: Access point status
- `omada_clients_total`: Total client count
- `omada_throughput_bytes`: Network throughput
- `omada_controller_status`: Controller health

### Network Metrics

- `network_port_status`: Switch port status
- `network_port_rx_bytes`: Bytes received on a switch port
- `network_port_tx_bytes`: Bytes transmitted on a switch port
- `network_vlan_utilization`: VLAN utilization
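Byte metrics such as `network_port_rx_bytes` are typically cumulative counters, so throughput comes from the difference between two readings. In PromQL this is what `rate()` does; the sketch below shows the same arithmetic by hand, assuming the values are monotonic counters that may wrap (SNMP devices often expose 32-bit octet counters). The function names and the link-speed parameter are illustrative and not part of the exporter.

```python
def counter_rate(previous, current, interval_s, counter_bits=64):
    """Average per-second increase between two counter readings.

    A decrease is treated as a single wrap of a counter_bits-wide counter.
    A device reboot also makes the counter drop and cannot be told apart
    from a wrap here, so treat the first sample after a reset with care.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = current - previous
    if delta < 0:
        delta += 2 ** counter_bits
    return delta / interval_s


def utilization_percent(bytes_per_s, link_speed_bps):
    """Convert a byte rate into a percentage of link capacity in bits per second."""
    return 100.0 * bytes_per_s * 8 / link_speed_bps
```

For example, two readings of 1000 and 4000 bytes taken 30 seconds apart give `counter_rate(1000, 4000, 30)`, which is 100 bytes per second.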

## Alerts

### Critical Alerts

- Proxmox cluster node down
- Omada controller unreachable
- Network switch offline
- High resource utilization (>90%)

### Warning Alerts

- High resource utilization (>80%)
- Network latency spikes
- Access point offline
- Storage pool >80% full
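The utilization thresholds above overlap: a value above 90% also exceeds 80%. A minimal sketch of how the two levels could be resolved, assuming strict "greater than" comparisons and that the higher severity wins; the actual rules live in the alert rule files and may differ.

```python
def classify_utilization(percent, warning=80.0, critical=90.0):
    """Map a utilization percentage to the highest matching severity."""
    if percent > critical:
        return "critical"
    if percent > warning:
        return "warning"
    return "ok"
```

With these defaults, 85% is a warning, 95% is critical, and exactly 80% raises nothing.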

## Troubleshooting

### Exporter Issues

```bash
# Check exporter status
kubectl get pods -n monitoring -l app=proxmox-exporter

# View exporter logs
kubectl logs -n monitoring -l app=proxmox-exporter

# Test exporter endpoint (cluster DNS name; run this from a pod inside the cluster)
curl http://proxmox-exporter.monitoring.svc.cluster.local:9221/metrics
```

### Dashboard Issues

```bash
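# Verify dashboard import via the Grafana HTTP API
# (GRAFANA_URL and GRAFANA_TOKEN are placeholders for your instance and API token)
curl -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/search?type=dash-db"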

# Check dashboard data sources
# In Grafana UI: Connections > Data sources (Configuration > Data Sources in older versions)
```

## Related Documentation

- [Proxmox Management](../proxmox/README.md)
- [Omada Management](../omada/README.md)
- [Network Management](../network/README.md)
- [Infrastructure Management](../README.md)