defiQUG 4952ecf453 Update documentation with last updated dates and improve navigation indexes

- Added "Last Updated" date to multiple documentation files for better tracking.
- Enhanced the README with quick navigation indexes for guides, references, and architecture documentation.
- Updated titles in Keycloak deployment and testing guide for consistency.

2025-12-12 19:51:48 -08:00

Monitoring and Observability Guide

Last Updated: 2025-01-09

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

Overview

Sankofa Phoenix uses a comprehensive monitoring stack:

Prometheus: Metrics collection and storage
Grafana: Visualization and dashboards
Loki: Log aggregation
Alertmanager: Alert routing and notification

Tenant-Aware Metrics

Every metric carries a tenant_id label, so dashboards, alerts, and queries can be scoped to a single tenant.

Metric Naming Convention

sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}

Examples:

sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}
sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}
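The convention above can be checked mechanically before a metric is exported. The following is a minimal sketch, not the project's actual exporter code: the regular expression and the `format_sample` helper are assumptions introduced for illustration, and the output follows the Prometheus text exposition format.

```python
import re

# Assumed reading of the convention: sankofa_<component>_<metric>_<unit>,
# lowercase snake_case with at least three parts after the prefix.
# This pattern is illustrative, not an official validator.
NAME_RE = re.compile(r"^sankofa_[a-z0-9]+(_[a-z0-9]+){2,}$")


def _escape(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    """Render one sample as a line of Prometheus text exposition format."""
    if not NAME_RE.match(name):
        raise ValueError(f"metric name does not follow the convention: {name}")
    if "tenant_id" not in labels:
        raise ValueError("every metric must carry a tenant_id label")
    pairs = ",".join(f'{key}="{_escape(val)}"' for key, val in labels.items())
    return f"{name}{{{pairs}}} {value}"


if __name__ == "__main__":
    print(format_sample(
        "sankofa_api_requests_total",
        {"tenant_id": "tenant-1", "method": "POST", "status": "200"},
        1027,
    ))
```

Rejecting samples without tenant_id at the point of emission keeps untagged series out of Prometheus rather than relying on relabeling to repair them later.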

Grafana Dashboards

1. System Overview Dashboard

Location: grafana/dashboards/system-overview.json

Metrics:

API request rate and latency
Database connection pool usage
Keycloak authentication rate
System resource usage (CPU, memory, disk)

Panels:

Request rate (requests/sec)
P95 latency (ms)
Error rate (%)
Active connections
Authentication success rate

2. Tenant Dashboard

Location: grafana/dashboards/tenant-overview.json

Metrics:

Tenant resource usage
Tenant cost tracking
Tenant API usage
Tenant user activity

Panels:

Resource usage by tenant
Cost breakdown by tenant
API calls by tenant
Active users by tenant

3. Billing Dashboard

Location: grafana/dashboards/billing.json

Metrics:

Real-time cost tracking
Cost by service/resource
Budget vs actual spend
Cost forecast
Billing anomalies

Panels:

Current month cost
Cost trend (7d, 30d)
Top resources by cost
Budget utilization
Anomaly detection alerts

4. Proxmox Infrastructure Dashboard

Location: grafana/dashboards/proxmox-infrastructure.json

Metrics:

VM status and health
Node resource usage
Storage utilization
Network throughput
VM creation/deletion rate

Panels:

VM status overview
Node CPU/memory usage
Storage pool usage
Network I/O
VM lifecycle events

5. Security Dashboard

Location: grafana/dashboards/security.json

Metrics:

Authentication events
Failed login attempts
Policy violations
Incident response metrics
Audit log events

Panels:

Authentication success/failure rate
Policy violations by severity
Incident response time
Audit log volume
Security events timeline

Prometheus Configuration

Scrape Configs

scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    # As written, this rule rewrites tenant_id with its own value (a no-op)
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Recording Rules

groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])
      
      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])
      
      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)

Alerting Rules

Critical Alerts

groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      
      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
      
      - alert: BudgetExceeded
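        # Cost carries service and resource_id labels; sum them away so both sides match on tenant_id
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd > 1.0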
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
      
      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"

Billing Anomaly Detection

groups:
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
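        expr: |
          # Compare the current cost with the value that a 7-day linear trend,
          # fitted on data ending 1h ago, predicts for now
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d] offset 1h, 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d] offset 1h, 3600) > 0.5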
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
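The rule above compares the observed cost with a linear extrapolation of recent samples and fires when the relative deviation exceeds 50%. The sketch below reproduces that arithmetic with an ordinary least-squares fit. It is a simplified illustration under stated assumptions, not Prometheus's implementation: `predict_linear` here is a local helper that happens to share the PromQL function's name, and `is_cost_anomaly` and the sample values are hypothetical.

```python
def predict_linear(samples, at):
    """Fit a least-squares line through (timestamp, value) pairs and evaluate it at `at`."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return mean_v + slope * (at - mean_t)


def is_cost_anomaly(history, now, current, threshold=0.5):
    """Return True when `current` exceeds the trend prediction by more than `threshold`."""
    expected = predict_linear(history, now)
    return (current - expected) / expected > threshold


if __name__ == "__main__":
    # Hypothetical hourly cost samples: (unix seconds, USD)
    history = [(0, 100.0), (3600, 110.0), (7200, 120.0)]
    print(is_cost_anomaly(history, now=10800, current=200.0))
```

Because the check is relative, a tenant with a very small expected cost can trip it on a small absolute increase; a minimum-cost guard may be worth adding in practice.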

Real-Time Cost Tracking

Metrics Exposed

sankofa_billing_cost_usd{tenant_id, service, resource_id} - Current cost
sankofa_billing_cost_rate_usd_per_hour{tenant_id} - Cost rate
sankofa_billing_budget_usd{tenant_id} - Budget limit
sankofa_billing_budget_utilization_percent{tenant_id} - Budget usage %

Grafana Query Example

# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

# Projected 7-day cost at the current rate (rate() returns USD per second)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7

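# Budget utilization (%), summing cost over service and resource_id so labels match the budget series
sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd * 100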
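Budget utilization divides a per-tenant total by a per-tenant budget, so the cost series, which also carry service and resource_id labels, have to be summed per tenant first. The following sketch shows that aggregation on plain Python data. The `budget_utilization` helper and the figures are hypothetical and only mirror the arithmetic of the query.

```python
from collections import defaultdict


def budget_utilization(cost_series, budgets):
    """Return budget utilization in percent per tenant.

    cost_series: iterable of (labels, value), where labels holds tenant_id,
                 service, and resource_id.
    budgets:     mapping of tenant_id to budget in USD.
    """
    totals = defaultdict(float)
    for labels, value in cost_series:
        # Drop service and resource_id by summing everything per tenant.
        totals[labels["tenant_id"]] += value
    return {
        tenant: totals[tenant] / budget * 100
        for tenant, budget in budgets.items()
        if tenant in totals and budget > 0
    }


if __name__ == "__main__":
    # Hypothetical samples
    series = [
        ({"tenant_id": "tenant-1", "service": "compute", "resource_id": "vm-101"}, 60.0),
        ({"tenant_id": "tenant-1", "service": "storage", "resource_id": "vol-7"}, 30.0),
        ({"tenant_id": "tenant-2", "service": "compute", "resource_id": "vm-202"}, 50.0),
    ]
    print(budget_utilization(series, {"tenant-1": 100.0, "tenant-2": 40.0}))
```

Tenants without a budget or without any cost samples are omitted, which matches how a PromQL binary operation drops series that have no match on the other side.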

Log Aggregation

Loki Configuration

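Logs are shipped to Loki with tenant context. The snippet below is the client section of the log shipper configuration (for example, Promtail):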

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}

Log Labels

tenant_id: Tenant identifier
service: Service name (api, portal, etc.)
level: Log level (info, warn, error)
component: Component name

Log Queries

# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors (LogQL has no now() function; set the query time range to the last hour instead)
{service="api", level="error"} | json

# Authentication failures
{component="auth"} | json | status="failed"
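In LogQL the time window is not part of the query text. It is passed alongside the query, either through the Grafana time picker or as the start and end parameters of Loki's query_range HTTP endpoint. The sketch below builds such a request URL for a one-hour window. The `loki_query_range_url` helper is hypothetical, the base URL is taken from the push URL shown above, and a multi-tenant Loki would additionally expect a tenant header on the request.

```python
import time
from typing import Optional
from urllib.parse import urlencode


def loki_query_range_url(
    base_url: str,
    logql: str,
    window_seconds: int = 3600,
    now_ns: Optional[int] = None,
    limit: int = 100,
) -> str:
    """Build a query_range URL covering the last `window_seconds` seconds."""
    # Loki accepts start and end as Unix epoch timestamps in nanoseconds.
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - window_seconds * 1_000_000_000
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    print(loki_query_range_url(
        "http://loki:3100",
        '{service="api", level="error"} | json',
    ))
```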

Deployment

Install Monitoring Stack

# Add the prometheus-community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# Dashboards are Grafana JSON, not Kubernetes manifests, so kubectl cannot apply
# them directly; load them as ConfigMaps as described under Import Dashboards

Import Dashboards

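# Import all dashboards as ConfigMaps
for dashboard in grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "$name" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Assumption: the Grafana sidecar watches for the chart-default label grafana_dashboard=1
  kubectl label configmap "$name" grafana_dashboard=1 \
    --namespace=monitoring --overwrite
done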

Access

Grafana: https://grafana.sankofa.nexus
Prometheus: https://prometheus.sankofa.nexus
Alertmanager: https://alertmanager.sankofa.nexus

Default credentials (change immediately):

Username: admin
Password: (from secret monitoring-grafana)

Best Practices

Tenant Isolation: Always filter metrics by tenant_id
Retention: Configure appropriate retention periods
Cardinality: Avoid high-cardinality labels
Alerts: Set up alerting for critical metrics
Dashboards: Create tenant-specific dashboards
Cost Tracking: Monitor billing metrics closely
Anomaly Detection: Enable anomaly detection for billing
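The cardinality advice above can be made concrete with a rough estimate: the number of time series one metric can produce is bounded by the product of the distinct values of each label. The sketch below computes that bound. The `max_series` helper and the label counts are hypothetical.

```python
from math import prod


def max_series(label_value_counts: dict) -> int:
    """Upper bound on time series for one metric, given distinct values per label."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical label counts for an API request counter
    bounded = {"tenant_id": 50, "method": 5, "status": 10}
    print(max_series(bounded))
    # Adding an unbounded label such as a request ID multiplies the bound
    print(max_series({**bounded, "request_id": 100_000}))
```

Labels with a small, fixed set of values (method, status, service) are cheap; per-request or per-user identifiers belong in logs rather than metric labels.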

References

Dashboard definitions: grafana/dashboards/
Prometheus config: monitoring/prometheus/
Alert rules: monitoring/alerts/
