
Incident Response Runbook

Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

Incident Severity Levels

P0 - Critical (Immediate Response)

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Response Time: Immediate (< 5 minutes)
  • Resolution Target: < 1 hour

P1 - High (Urgent Response)

  • Partial service outage affecting multiple users
  • Performance degradation > 50%
  • Authentication failures
  • Response Time: < 15 minutes
  • Resolution Target: < 4 hours

P2 - Medium (Standard Response)

  • Single feature/service degraded
  • Performance degradation 20-50%
  • Non-critical errors
  • Response Time: < 1 hour
  • Resolution Target: < 24 hours

P3 - Low (Normal Response)

  • Minor issues
  • Cosmetic problems
  • Non-blocking errors
  • Response Time: < 4 hours
  • Resolution Target: < 1 week
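For quick reference during triage, the targets above can be encoded in a small helper. This is a sketch for convenience; `severity_targets` is not an existing platform tool:

```shell
# Map a severity level to its response/resolution targets from the
# table above. Unknown levels are reported on stderr.
severity_targets() {
  case "$1" in
    P0) echo "response=5m resolution=1h" ;;
    P1) echo "response=15m resolution=4h" ;;
    P2) echo "response=1h resolution=24h" ;;
    P3) echo "response=4h resolution=1w" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

severity_targets P1
```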

Incident Response Process

1. Detection and Triage

Detection Sources

  • Monitoring Alerts: Prometheus/Alertmanager
  • Error Logs: Loki, application logs
  • User Reports: Support tickets, status page
  • Health Checks: Automated health check failures

Initial Triage Steps

# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
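The HTTP health checks above can be swept in one pass. This is a sketch: `check_endpoint` is a hypothetical helper, and the endpoint list mirrors the commands above.

```shell
# Probe a health endpoint and report UP/DOWN.
# -f: fail on HTTP errors; -s: silent; --max-time bounds each probe.
check_endpoint() {
  if curl -fs --max-time 5 "$1" >/dev/null 2>&1; then
    echo "UP   $1"
  else
    echo "DOWN $1"
  fi
}

# Endpoints from the triage steps above
for url in \
  https://api.sankofa.nexus/health \
  https://portal.sankofa.nexus/api/health \
  https://keycloak.sankofa.nexus/health
do
  check_endpoint "$url"
done
```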

2. Incident Declaration

Create Incident Channel

  • Create dedicated Slack/Teams channel: #incident-YYYY-MM-DD-<name>
  • Invite: On-call engineer, Team lead, Product owner
  • Post initial status
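The channel naming convention above can be generated consistently with a one-liner. A sketch; `incident_channel` is illustrative, and the slug is supplied by the responder:

```shell
# Build the #incident-YYYY-MM-DD-<name> channel name from a short
# description, lowercasing it and replacing spaces with hyphens.
incident_channel() {
  slug=$(echo "$1" | tr 'A-Z ' 'a-z-')
  echo "#incident-$(date +%F)-${slug}"
}

incident_channel "API Outage"
```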

Incident Template

INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
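To post the initial status quickly, the template above can be pre-filled from the command line. A sketch (`incident_status` is a hypothetical helper); fields other than START TIME stay as prompts for the responder:

```shell
# Print a copy of the incident template with the description, severity,
# and UTC start time filled in, ready to paste into the channel.
incident_status() {
  cat <<EOF
INCIDENT: $1
SEVERITY: $2
STATUS: Investigating
START TIME: $(date -u +"%Y-%m-%dT%H:%M:%SZ")
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
EOF
}

incident_status "API latency spike" P1
```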

3. Investigation

Common Investigation Commands

Check Pod Status

kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100

Check Resource Usage

kubectl top nodes
kubectl top pods --all-namespaces

Check Database

# Connection count
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"

Check Logs

# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api --tail=1000 | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api --tail=1000 | grep -i "rate limit"

Check Monitoring

# Access Grafana (use xdg-open or a browser on Linux)
open https://grafana.sankofa.nexus

# Check Prometheus alerts
kubectl get prometheusrules -n monitoring

4. Resolution

Common Resolution Actions

Restart Service

kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal

Scale Up

kubectl scale deployment/api --replicas=5 -n api

Rollback Deployment

# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api

Clear Rate Limits (if needed)

# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api

Database Maintenance

# Vacuum database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"

5. Post-Incident

Incident Report Template

# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates

Common Incidents

API High Latency

Symptoms: API response times > 500ms

Investigation:

# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration

Resolution:

  • Scale API replicas
  • Optimize slow queries
  • Add database indexes
  • Check for N+1 query problems

Database Connection Pool Exhausted

Symptoms: "too many connections" errors

Investigation:

kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

Resolution:

  • Increase connection pool size
  • Kill idle connections
  • Scale database
  • Check for connection leaks
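For the "kill idle connections" step, it helps to build and review the SQL before running it. A sketch (`kill_idle_sql` is illustrative): it only prints the statement, which you can then pipe to psql after inspection.

```shell
# Emit SQL that terminates connections idle longer than N minutes
# (default 10), based on pg_stat_activity.state_change.
kill_idle_sql() {
  minutes="${1:-10}"
  echo "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '${minutes} minutes';"
}

kill_idle_sql 5
```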

Authentication Failures

Symptoms: Users cannot log in

Investigation:

# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api --tail=1000 | grep -i "auth.*fail"

Resolution:

  • Restart Keycloak if needed
  • Check OIDC configuration
  • Verify JWT secret
  • Check network connectivity

Portal Not Loading

Symptoms: Portal returns 500 or blank page

Investigation:

# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health

Resolution:

  • Restart portal deployment
  • Check environment variables
  • Verify Keycloak connectivity
  • Check build errors

Escalation

When to Escalate

  • P0 incident not resolved within 30 minutes
  • P1 incident not resolved within 2 hours
  • Additional expertise is needed
  • Customer impact is severe

Escalation Path

  1. On-call Engineer → Team Lead
  2. Team Lead → Engineering Manager
  3. Engineering Manager → CTO/VP Engineering
  4. CTO → Executive Team
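The path above can be encoded so the next contact is unambiguous under pressure. A sketch; `next_escalation` is illustrative and the role names mirror the list (adjust to your org chart):

```shell
# Given the current role, print who to escalate to next.
next_escalation() {
  case "$1" in
    "On-call Engineer")    echo "Team Lead" ;;
    "Team Lead")           echo "Engineering Manager" ;;
    "Engineering Manager") echo "CTO/VP Engineering" ;;
    "CTO/VP Engineering")  echo "Executive Team" ;;
    *)                     echo "unknown role: $1" >&2; return 1 ;;
  esac
}

next_escalation "Team Lead"
```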

Emergency Contacts

  • On-call: [Phone/Slack]
  • Team Lead: [Phone/Slack]
  • Engineering Manager: [Phone/Slack]
  • CTO: [Phone/Slack]

Communication

Status Page Updates

  • Update status page during incident
  • Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
  • Include: Status, affected services, estimated resolution time
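The update cadence above can be derived from the severity level, for example when scheduling reminder pings in the incident channel. A sketch; `update_interval_minutes` is illustrative:

```shell
# Status-page update interval per the policy above:
# every 30 minutes for P0/P1, hourly for P2/P3.
update_interval_minutes() {
  case "$1" in
    P0|P1) echo 30 ;;
    P2|P3) echo 60 ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

update_interval_minutes P0
```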

Customer Communication

  • For P0/P1: Notify affected customers immediately
  • For P2/P3: Include in next status update
  • Be transparent about impact and resolution timeline