# Monitoring and Observability Guide
**Last Updated**: 2025-01-09

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
## Overview
Sankofa Phoenix uses the following monitoring stack:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing and notification
## Tenant-Aware Metrics
All metrics carry a `tenant_id` label so that queries, dashboards, and alerts can be scoped to a single tenant.
### Metric Naming Convention
```
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
```
Examples:
- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
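
The convention above can be checked mechanically. The following is a minimal sketch, assuming every segment after `sankofa_` is lowercase snake_case; `format_metric` is a hypothetical helper for illustration, not part of Sankofa Phoenix.

```python
import re

# Assumption: names are "sankofa_" followed by two or more lowercase snake_case segments.
METRIC_NAME = re.compile(r"^sankofa_[a-z][a-z0-9]*(?:_[a-z0-9]+)+$")


def format_metric(name: str, tenant_id: str, **labels: str) -> str:
    """Render a series selector in the convention's form, with tenant_id first."""
    if not METRIC_NAME.match(name):
        raise ValueError(f"metric name does not follow the convention: {name}")
    pairs = [("tenant_id", tenant_id), *sorted(labels.items())]
    body = ",".join(f'{key}="{value}"' for key, value in pairs)
    return f"{name}{{{body}}}"


print(format_metric("sankofa_api_requests_total", "tenant-1", method="POST", status="200"))
# prints: sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
```

Making `tenant_id` a required argument means a series cannot be built without it, which is one way to enforce the tagging rule at the call site.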
## Grafana Dashboards
### 1. System Overview Dashboard
**Location**: `grafana/dashboards/system-overview.json`
**Metrics**:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)
**Panels**:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate
### 2. Tenant Dashboard
**Location**: `grafana/dashboards/tenant-overview.json`
**Metrics**:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity
**Panels**:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant
### 3. Billing Dashboard
**Location**: `grafana/dashboards/billing.json`
**Metrics**:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies
**Panels**:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts
### 4. Proxmox Infrastructure Dashboard
**Location**: `grafana/dashboards/proxmox-infrastructure.json`
**Metrics**:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate
**Panels**:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events
### 5. Security Dashboard
**Location**: `grafana/dashboards/security.json`
**Metrics**:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events
**Panels**:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline
## Prometheus Configuration
### Scrape Configs
```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'
  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```
### Recording Rules
```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])
      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])
      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```
## Alerting Rules
### Critical Alerts
```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
      - alert: BudgetExceeded
        # Aggregate both sides to tenant_id so their label sets match
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```
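
Each rule above pairs an expression with a `for:` duration: the alert stays pending until the expression has been true at every evaluation for that long, and only then fires. The sketch below is a simplified model of that behaviour, not Prometheus's actual implementation. `alert_states` is a hypothetical helper, and the sample values are invented for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> List[str]:
    """Return the alert state after each (timestamp, value) evaluation."""
    states: List[str] = []
    active_since: Optional[float] = None  # when the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # one false evaluation resets the timer
            states.append("inactive")
    return states


# Connection pool utilization evaluated every 30 s, with threshold 0.9 and for: 2m
pool = [(0, 0.95), (30, 0.96), (60, 0.80), (90, 0.95),
        (120, 0.97), (150, 0.98), (180, 0.99), (210, 0.99)]
print(alert_states(pool, threshold=0.9, for_seconds=120))
```

In this run the dip at 60 s resets the timer, so the alert does not fire until 210 s, two minutes after the condition became true again at 90 s.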
### Billing Anomaly Detection
```yaml
groups:
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```
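
The rule above compares the current cost with a linear extrapolation of the last seven days and flags a relative deviation above 50%. The sketch below reproduces that arithmetic with an ordinary least-squares fit. The function names and sample data are assumptions for illustration, and it models the idea behind `predict_linear` rather than Prometheus's exact evaluation.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], now: float, horizon: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and evaluate it at now + horizon."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon)


def is_cost_anomaly(
    current: float,
    samples: Sequence[Tuple[float, float]],
    now: float,
    horizon: float = 3600.0,
    threshold: float = 0.5,
) -> bool:
    """True when current cost exceeds the prediction by more than threshold (relative)."""
    predicted = predict_linear(samples, now, horizon)
    if predicted <= 0:
        return False  # a relative deviation is undefined for non-positive predictions
    return (current - predicted) / predicted > threshold


# Invented history: cost grows by 1 USD per hour for 7 days, starting at 100 USD
history = [(hour * 3600.0, 100.0 + hour) for hour in range(168)]
now = 167 * 3600.0
print(is_cost_anomaly(450.0, history, now))  # cost far above the trend line
print(is_cost_anomaly(268.0, history, now))  # cost on the trend line
```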
## Real-Time Cost Tracking
### Metrics Exposed
- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %
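
The relationship between these metrics can be sketched as follows: per-resource cost samples are summed per tenant and divided by that tenant's budget. This is an illustration with invented sample data; `budget_utilization` is a hypothetical helper and not the exporter's actual code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CostSample = Tuple[str, str, str, float]  # (tenant_id, service, resource_id, usd)


def budget_utilization(costs: Iterable[CostSample], budgets: Dict[str, float]) -> Dict[str, float]:
    """Return budget utilization in percent per tenant."""
    totals: Dict[str, float] = defaultdict(float)
    for tenant_id, _service, _resource_id, usd in costs:
        totals[tenant_id] += usd
    # Tenants without a (non-zero) budget are skipped rather than reported as infinite.
    return {
        tenant_id: total / budgets[tenant_id] * 100
        for tenant_id, total in totals.items()
        if budgets.get(tenant_id)
    }


costs = [
    ("tenant-1", "compute", "vm-101", 30.0),
    ("tenant-1", "storage", "vol-7", 20.0),
    ("tenant-2", "compute", "vm-202", 10.0),
]
print(budget_utilization(costs, {"tenant-1": 100.0, "tenant-2": 40.0}))
```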
### Grafana Query Example
```promql
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)
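# Cost trend: 7-day spend projected from the last hour (rate() is per second)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7
# Budget utilization (%), aggregated to tenant_id so both sides match
sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) * 100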
```
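
Outside Grafana, the same expressions can be run against the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming the Prometheus URL from the Access section is reachable without extra authentication; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

PROMETHEUS_URL = "https://prometheus.sankofa.nexus"  # assumption: reachable from the caller


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': expr})}"


def instant_query(expr: str, base_url: str = PROMETHEUS_URL, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Run an instant query and return the result vector."""
    with urllib.request.urlopen(instant_query_url(expr, base_url), timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return payload["data"]["result"]


def cost_by_tenant(result: List[Dict[str, Any]]) -> Dict[str, float]:
    """Convert an instant vector into {tenant_id: value}."""
    return {
        series["metric"].get("tenant_id", ""): float(series["value"][1])
        for series in result
    }


if __name__ == "__main__":
    print(instant_query_url("sum(sankofa_billing_cost_usd) by (tenant_id)"))
```

Calling `cost_by_tenant(instant_query("sum(sankofa_billing_cost_usd) by (tenant_id)"))` would then return the current cost per tenant as a plain dictionary.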
## Log Aggregation
### Loki Configuration
Logs are collected with tenant context:
```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
```
### Log Labels
- `tenant_id`: Tenant identifier
- `service`: Service name (api, portal, etc.)
- `level`: Log level (info, warn, error)
- `component`: Component name
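
On the application side, one way to make these fields available is to emit JSON log lines that carry them. The sketch below uses Python's standard `logging` module. `TenantJsonFormatter` is a hypothetical class, and how the JSON fields are promoted to Loki labels depends on the collector pipeline, which this guide does not show.

```python
import json
import logging

# Map Python level names to the level values listed above (info, warn, error).
LEVEL_NAMES = {"warning": "warn", "critical": "error"}


class TenantJsonFormatter(logging.Formatter):
    """Format records as JSON with tenant_id, service, level, and component."""

    def __init__(self, service: str, component: str) -> None:
        super().__init__()
        self.service = service
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        level = record.levelname.lower()
        return json.dumps({
            # tenant_id is passed per call via logging's `extra` argument
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "service": self.service,
            "level": LEVEL_NAMES.get(level, level),
            "component": self.component,
            "message": record.getMessage(),
        })


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(TenantJsonFormatter(service="api", component="auth"))
    logger = logging.getLogger("sankofa.api")
    logger.addHandler(handler)
    logger.warning("login failed for %s", "alice", extra={"tenant_id": "tenant-1"})
```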
### Log Queries
```logql
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}
# API errors (set the query time range to the last hour; LogQL has no now())
{service="api", level="error"}
# Authentication failures
{component="auth"} | json | status="failed"
```
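
A log query takes its time window from the request rather than from the expression. The sketch below builds a request for Loki's `query_range` HTTP endpoint covering the last hour. It assumes the Loki host from the push URL above, and the helper names are hypothetical.

```python
import time
import urllib.parse
from typing import Optional

LOKI_URL = "http://loki:3100"  # assumption: same host as the push URL


def logql_selector(**labels: str) -> str:
    """Build a LogQL stream selector from label/value pairs."""
    body = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + body + "}"


def query_range_url(
    query: str,
    window_seconds: int = 3600,
    now_ns: Optional[int] = None,
    limit: int = 100,
    base_url: str = LOKI_URL,
) -> str:
    """Build a query_range URL covering the last `window_seconds` seconds."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - window_seconds * 1_000_000_000
    # Loki accepts start and end as Unix timestamps in nanoseconds.
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


if __name__ == "__main__":
    print(query_range_url(logql_selector(service="api", level="error")))
```

When Loki runs in multi-tenant mode, the request also needs an `X-Scope-OrgID` header carrying the tenant ID.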
## Deployment
### Install Monitoring Stack
```bash
# Add Prometheus Operator Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml
# The dashboards are Grafana JSON, not Kubernetes manifests;
# load them as ConfigMaps as described under "Import Dashboards"
```
### Import Dashboards
```bash
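# Import all dashboards as ConfigMaps.
# The label lets the chart's Grafana sidecar discover them
# (assumes the chart's default sidecar label, grafana_dashboard).
for dashboard in grafana/dashboards/*.json; do
  kubectl create configmap "$(basename "$dashboard" .json)" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml \
    | kubectl label --local -f - grafana_dashboard=1 -o yaml \
    | kubectl apply -f -
done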
```
## Access
- **Grafana**: https://grafana.sankofa.nexus
- **Prometheus**: https://prometheus.sankofa.nexus
- **Alertmanager**: https://alertmanager.sankofa.nexus
Default credentials (change immediately):
- Username: `admin`
- Password: (from secret `monitoring-grafana`)
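
Kubernetes stores Secret values base64-encoded, so the password has to be decoded after it is read. The sketch below shells out to `kubectl` and decodes the result. It assumes the secret lives in the `monitoring` namespace and uses the `admin-password` key that the Grafana Helm chart normally writes; both are assumptions to verify against the actual deployment.

```python
import base64
import json
import subprocess
from typing import Dict


def decode_secret(secret_json: str) -> Dict[str, str]:
    """Decode the base64 `data` map of a Secret as printed by `kubectl get secret -o json`."""
    data = json.loads(secret_json).get("data", {})
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


def grafana_admin_password(secret: str = "monitoring-grafana", namespace: str = "monitoring") -> str:
    """Read the Grafana admin password from the cluster (requires kubectl access)."""
    output = subprocess.run(
        ["kubectl", "get", "secret", secret, "--namespace", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return decode_secret(output)["admin-password"]  # assumed key name
```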
## Best Practices
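1. **Tenant Isolation**: Filter every query, dashboard panel, and alert by `tenant_id`
2. **Retention**: Set retention periods for metrics and logs that match how far back dashboards and audits need to look
3. **Cardinality**: Avoid high-cardinality labels such as request or user IDs; each distinct label value creates a new time series
4. **Alerts**: Define alert rules for critical metrics (error rate, connection pool usage, node availability)
5. **Dashboards**: Create tenant-specific dashboards
6. **Cost Tracking**: Review billing metrics and budget utilization regularly
7. **Anomaly Detection**: Enable the billing anomaly detection rules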
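
The cardinality guideline can be checked before a metric ships by counting distinct values per label. The following is a minimal sketch with invented sample data; the limit of 100 distinct values is an arbitrary assumption, not a Prometheus default.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set


def label_cardinality(series: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count distinct values per label across a collection of label sets."""
    values: Dict[str, Set[str]] = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(distinct) for key, distinct in values.items()}


def high_cardinality_labels(series: Iterable[Mapping[str, str]], limit: int = 100) -> List[str]:
    """Return the labels whose number of distinct values exceeds `limit`."""
    return sorted(key for key, count in label_cardinality(series).items() if count > limit)


# Invented example: a per-request label multiplies the number of series
series = [{"tenant_id": "tenant-1", "request_id": f"req-{i}"} for i in range(500)]
print(high_cardinality_labels(series))
```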
## References
- Dashboard definitions: `grafana/dashboards/`
- Prometheus config: `monitoring/prometheus/`
- Alert rules: `monitoring/alerts/`