Monitoring and Observability Guide
Last Updated: 2025-01-09
This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
Overview
Sankofa Phoenix uses a comprehensive monitoring stack:
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Alertmanager: Alert routing and notification
Tenant-Aware Metrics
All metrics are tagged with tenant IDs for multi-tenant isolation.
Metric Naming Convention
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
Examples:
sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}
sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}
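As a minimal sketch of the convention above, the following Python helper builds a metric name from its three parts and renders a series in the same style as the examples. The function names `metric_name` and `render_series` and the segment pattern are illustrative assumptions, not part of the Sankofa Phoenix codebase.

```python
import re

# Pattern for one name segment; an assumption chosen to match the examples above.
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def metric_name(component: str, metric: str, unit: str) -> str:
    """Build sankofa_<component>_<metric>_<unit> and reject malformed segments."""
    for part in (component, metric, unit):
        if not _SEGMENT.match(part):
            raise ValueError(f"invalid metric name segment: {part!r}")
    return f"sankofa_{component}_{metric}_{unit}"


def render_series(name: str, tenant_id: str, **labels: str) -> str:
    """Render name{tenant_id="...",...} with tenant_id always first.

    Label values are not escaped here; this sketch assumes simple values.
    """
    pairs = [("tenant_id", tenant_id), *labels.items()]
    body = ",".join(f'{key}="{value}"' for key, value in pairs)
    return f"{name}{{{body}}}"


if __name__ == "__main__":
    name = metric_name("api", "requests", "total")
    print(render_series(name, "tenant-1", method="POST", status="200"))
```

Making `tenant_id` a required positional argument is one way to ensure no series is emitted without a tenant label.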
Grafana Dashboards
1. System Overview Dashboard
Location: grafana/dashboards/system-overview.json
Metrics:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)
Panels:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate
2. Tenant Dashboard
Location: grafana/dashboards/tenant-overview.json
Metrics:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity
Panels:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant
3. Billing Dashboard
Location: grafana/dashboards/billing.json
Metrics:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies
Panels:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts
4. Proxmox Infrastructure Dashboard
Location: grafana/dashboards/proxmox-infrastructure.json
Metrics:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate
Panels:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events
5. Security Dashboard
Location: grafana/dashboards/security.json
Metrics:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events
Panels:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline
Prometheus Configuration
Scrape Configs
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'
  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
Recording Rules
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])
      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])
      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
Alerting Rules
Critical Alerts
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
      - alert: BudgetExceeded
        # Cost series carry service and resource_id labels while the budget does not,
        # so both sides are aggregated to tenant_id before dividing.
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
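Each rule's `for` clause means the alert fires only after its expression has stayed true for that long; until then it is pending. The following Python sketch is a simplified model of that behavior, applied to the connection pool threshold. It is an illustration only, not how Prometheus is implemented, and the function name and sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, value) sample.

    Simplified model: the condition must hold at every evaluation for
    for_seconds before the alert moves from pending to firing.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Pool utilization (active / max) evaluated every 30s, with for: 2m
    utilization = [
        (0, 0.85), (30, 0.92), (60, 0.95), (90, 0.93),
        (120, 0.94), (150, 0.96), (180, 0.50),
    ]
    print(alert_states(utilization, threshold=0.9, for_seconds=120))
```

In this run the threshold is first crossed at 30s, so the alert stays pending until 150s and then fires; the drop at 180s resets it.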
Billing Anomaly Detection
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
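To make the arithmetic in this rule concrete, here is a rough Python sketch of what the expression computes: a least-squares line is fitted through the samples in the window, extrapolated 3600 seconds past the latest sample, and the current value is compared against that prediction. The helper names and the sample data are hypothetical; the real evaluation happens inside Prometheus.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def predict_linear(samples: Sequence[Sample], now: float, horizon: float) -> float:
    """Least-squares line through the samples, evaluated at now + horizon."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (now + horizon) + intercept


def relative_deviation(samples: Sequence[Sample], horizon: float = 3600.0) -> float:
    """(current - predicted) / predicted, mirroring the alert expression."""
    now, current = samples[-1]
    predicted = predict_linear(samples, now, horizon)
    return (current - predicted) / predicted


if __name__ == "__main__":
    # Seven days of hourly samples: cost grows by 1 USD per hour from 100 USD.
    steady = [(hour * 3600.0, 100.0 + hour) for hour in range(169)]
    # Same history, but the latest sample jumps to 1000 USD.
    spiky = steady[:-1] + [(steady[-1][0], 1000.0)]
    print("steady exceeds 0.5:", relative_deviation(steady) > 0.5)
    print("spiky exceeds 0.5:", relative_deviation(spiky) > 0.5)
```

For a cost that rises steadily, the current value sits just below the one-hour-ahead prediction, so the deviation is slightly negative. Only a jump that is large relative to the 7-day trend pushes it past 0.5.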
Real-Time Cost Tracking
Metrics Exposed
- sankofa_billing_cost_usd{tenant_id, service, resource_id} - Current cost
- sankofa_billing_cost_rate_usd_per_hour{tenant_id} - Cost rate
- sankofa_billing_budget_usd{tenant_id} - Budget limit
- sankofa_billing_budget_utilization_percent{tenant_id} - Budget usage %
Grafana Query Example
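# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)
# Projected 7-day cost at the current hourly rate (rate() returns a per-second value)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7
# Budget utilization (%) per tenant; both sides are aggregated to tenant_id so their labels match
sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) * 100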
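The same queries can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. The helper names are hypothetical, and it assumes the Prometheus endpoint is reachable without additional authentication, which may not hold in your deployment.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def values_by_tenant(response: dict) -> Dict[str, float]:
    """Map tenant_id -> value from an instant-vector API response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error', 'unknown error')}")
    result = {}
    for series in response["data"]["result"]:
        tenant = series["metric"].get("tenant_id", "")
        _timestamp, raw_value = series["value"]  # the API returns the value as a string
        result[tenant] = float(raw_value)
    return result


def query_prometheus(base_url: str, promql: str) -> Dict[str, float]:
    """Run the query over HTTP (requires network access to Prometheus)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return values_by_tenant(json.load(resp))


if __name__ == "__main__":
    print(build_query_url(
        "https://prometheus.sankofa.nexus",
        "sum(sankofa_billing_cost_usd) by (tenant_id)",
    ))
```

Calling `query_prometheus` with the same arguments would return a mapping from tenant ID to current cost, provided the query is aggregated by `tenant_id`.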
Log Aggregation
Loki Configuration
Logs are collected with tenant context:
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
Log Labels
- tenant_id: Tenant identifier
- service: Service name (api, portal, etc.)
- level: Log level (info, warn, error)
- component: Component name
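On the application side, one way to supply these fields is to emit each log line as a JSON object. The following is a minimal sketch using Python's standard `logging` module. The class name `TenantJsonFormatter` is hypothetical, and how the fields are promoted to Loki labels depends on the log shipper's pipeline, which is not shown here.

```python
import json
import logging


class TenantJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # Map Python's WARNING to the "warn" level name used in this guide.
        level = "warn" if record.levelname == "WARNING" else record.levelname.lower()
        payload = {
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "service": self.service,
            "level": level,
            # Fall back to the logger name when no component is passed.
            "component": getattr(record, "component", record.name),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(TenantJsonFormatter(service="api"))
    logger = logging.getLogger("auth")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Per-request context is passed through the standard `extra` argument.
    logger.error("login failed", extra={"tenant_id": "tenant-1", "component": "auth"})
```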
Log Queries
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}
# API errors in the last hour (the time range comes from the query's start/end, e.g. "Last 1 hour" in the Grafana time picker)
{service="api", level="error"} | json
# Authentication failures
{component="auth"} | json | status="failed"
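Outside Grafana, the time window is passed as `start` and `end` parameters to Loki's `/loki/api/v1/query_range` endpoint rather than written into the query. The sketch below builds such a request with the Python standard library. The function name is hypothetical, and sending the tenant in the `X-Scope-OrgID` header assumes Loki runs in multi-tenant mode.

```python
import time
import urllib.parse
from typing import Dict, Optional, Tuple


def build_loki_request(
    base_url: str,
    logql: str,
    tenant_id: str,
    lookback_seconds: int = 3600,
    now_ns: Optional[int] = None,
    limit: int = 100,
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for a range query covering the last lookback_seconds."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - lookback_seconds * 1_000_000_000
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,  # nanosecond Unix timestamps
        "end": end_ns,
        "limit": limit,
    })
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    headers = {"X-Scope-OrgID": tenant_id}  # tenant header, assuming multi-tenant Loki
    return url, headers


if __name__ == "__main__":
    url, headers = build_loki_request(
        "http://loki:3100",
        '{service="api", level="error"} | json',
        tenant_id="tenant-1",
    )
    print(url)
    print(headers)
```

The returned URL and headers can be passed to any HTTP client, for example `urllib.request.Request(url, headers=headers)`.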
Deployment
Install Monitoring Stack
# Add Prometheus Operator Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml
# Apply custom dashboards
kubectl apply -f grafana/dashboards/
Import Dashboards
# Import all dashboards
for dashboard in grafana/dashboards/*.json; do
  # Quote the paths so file names containing spaces are handled correctly
  kubectl create configmap "$(basename "$dashboard" .json)" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
done
Access
- Grafana: https://grafana.sankofa.nexus
- Prometheus: https://prometheus.sankofa.nexus
- Alertmanager: https://alertmanager.sankofa.nexus
Default credentials (change immediately):
- Username: admin
- Password: (from secret monitoring-grafana)
Best Practices
- Tenant Isolation: Always filter metrics by tenant_id
- Retention: Configure appropriate retention periods
- Cardinality: Avoid high-cardinality labels
- Alerts: Set up alerting for critical metrics
- Dashboards: Create tenant-specific dashboards
- Cost Tracking: Monitor billing metrics closely
- Anomaly Detection: Enable anomaly detection for billing
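The cardinality advice can be made concrete with a quick estimate: the number of time series for one metric is bounded by the product of the distinct values of each label. A small Python sketch follows; the label counts are made-up numbers for illustration only.

```python
from math import prod
from typing import Mapping


def max_series(label_value_counts: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each combination of label values is a separate series, so the bound
    is the product of the number of distinct values per label.
    """
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical counts for sankofa_api_requests_total
    bounded = {"tenant_id": 50, "method": 5, "status": 10}
    print(max_series(bounded))

    # Adding a per-user label multiplies the bound by the number of users
    with_user = {**bounded, "user_id": 10_000}
    print(max_series(with_user))
```

With these counts the bound grows from 2,500 to 25,000,000 series when the per-user label is added, which is why identifiers with unbounded values are better kept in logs than in metric labels.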
References
- Dashboard definitions: grafana/dashboards/
- Prometheus config: monitoring/prometheus/
- Alert rules: monitoring/alerts/