# Incident Response Runbook

## Overview
This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.
## Incident Severity Levels

### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- Response Time: Immediate (< 5 minutes)
- Resolution Target: < 1 hour

### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- Response Time: < 15 minutes
- Resolution Target: < 4 hours

### P2 - Medium (Standard Response)
- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- Response Time: < 1 hour
- Resolution Target: < 24 hours

### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- Response Time: < 4 hours
- Resolution Target: < 1 week
## Incident Response Process

### 1. Detection and Triage

#### Detection Sources
- Monitoring Alerts: Prometheus/Alertmanager
- Error Logs: Loki, application logs
- User Reports: Support tickets, status page
- Health Checks: Automated health check failures
#### Initial Triage Steps

```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
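The endpoint probes above can be run as a single sweep; a minimal sketch (URLs come from this runbook, while the timeout and summary format are assumptions):

```shell
#!/usr/bin/env bash
# One-shot triage sweep over the health endpoints listed above.

# Map a curl exit code to a status word for the summary line.
status_word() {
  if [ "$1" -eq 0 ]; then echo "UP"; else echo "DOWN"; fi
}

# Probe one endpoint and print "<name>: UP|DOWN".
probe() {
  local name="$1" url="$2"
  curl -sf --max-time 5 "$url" > /dev/null 2>&1
  echo "$name: $(status_word $?)"
}

probe "API"      https://api.sankofa.nexus/health
probe "Portal"   https://portal.sankofa.nexus/api/health
probe "Keycloak" https://keycloak.sankofa.nexus/health
```

A `DOWN` line tells you which of the numbered checks above to drill into first.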
### 2. Incident Declaration

#### Create Incident Channel
- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status

#### Incident Template

```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```
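To keep the initial post consistent, the template can be rendered from a small helper; a sketch (the function name and UTC timestamp format are assumptions, not part of this runbook's tooling):

```shell
# Render the incident template from arguments, ready to paste into the
# incident channel.
incident_msg() {
  local desc="$1" sev="$2" status="$3" services="$4" impact="$5"
  cat <<EOF
INCIDENT: $desc
SEVERITY: $sev
STATUS: $status
START TIME: $(date -u +%Y-%m-%dT%H:%M:%SZ)
AFFECTED SERVICES: $services
IMPACT: $impact
EOF
}

incident_msg "API 5xx spike" "P1" "Investigating" "api" "GraphQL mutations failing"
```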
### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**

```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```
**Check Resource Usage**

```bash
kubectl top nodes
kubectl top pods --all-namespaces
```
**Check Database**

```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
**Check Logs**

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```
**Check Monitoring**

```bash
# Access Grafana
open https://grafana.sankofa.nexus

# Check Prometheus alert rules
kubectl get prometheusrules -n monitoring
```
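Dashboards aside, firing alerts can be pulled straight from the Prometheus HTTP API; a sketch, assuming a port-forwardable `prometheus` service in the `monitoring` namespace and `jq` on the operator's machine:

```shell
# List the names of currently firing alerts via the Prometheus HTTP API.
# Service name "prometheus" and port 9090 are assumptions; adjust for
# your install.
firing_alerts() {
  kubectl port-forward -n monitoring svc/prometheus 9090:9090 >/dev/null &
  local pf_pid=$!
  sleep 2  # give the port-forward a moment to establish
  curl -s http://localhost:9090/api/v1/alerts |
    jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname'
  kill "$pf_pid"
}
```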
### 4. Resolution

#### Common Resolution Actions

**Restart Service**

```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```
**Scale Up**

```bash
kubectl scale deployment/api --replicas=5 -n api
```
**Rollback Deployment**

```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```
**Clear Rate Limits (if needed)**

```bash
# Access Redis/rate limit store and clear keys,
# or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api
```
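If the counters live in Redis, the relevant keys can be scanned and deleted instead of restarting; a sketch, assuming a `redis` deployment in the `api` namespace and a `ratelimit:` key prefix (check your rate-limit middleware's actual prefix before running):

```shell
# Delete rate-limit keys matching a prefix. The deployment name and the
# "ratelimit:" prefix are assumptions; verify both first.
clear_rate_limits() {
  local prefix="${1:-ratelimit:}"
  kubectl exec -n api deployment/redis -- \
    redis-cli --scan --pattern "${prefix}*" |
    xargs -r kubectl exec -i -n api deployment/redis -- redis-cli DEL
}
```

`--scan` avoids blocking Redis the way a bare `KEYS` would on a large keyspace.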
**Database Maintenance**

```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```
### 5. Post-Incident

#### Incident Report Template

```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```
## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:

```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```

**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems
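Before adding an index, it can help to confirm the plan with `EXPLAIN ANALYZE`; a minimal sketch (the query passed in is whatever the `pg_stat_activity` check above surfaced; the helper name is illustrative):

```shell
# Profile a suspect query from inside the API pod's psql session.
explain_query() {
  kubectl exec -i -n api deployment/api -- \
    psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) $1"
}

# Example (table and filter are placeholders):
#   explain_query "SELECT * FROM events WHERE tenant_id = 42;"
```

A sequential scan over a large table in the output is the usual signal that an index is missing.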
### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:

```bash
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```

**Resolution**:
- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
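Killing idle (rather than active) connections frees pool slots without interrupting live work; a sketch, with the 10-minute threshold as an assumption:

```shell
# Terminate backends idle for more than 10 minutes; active queries are
# left untouched.
kill_idle_connections() {
  kubectl exec -i -n api deployment/api -- psql "$DATABASE_URL" -c \
    "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
     WHERE state = 'idle' AND now() - state_change > interval '10 minutes';"
}
```

If idle connections reappear quickly, that points at a connection leak in the application rather than undersized pooling.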
### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:

```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:
- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity
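Checking the OIDC configuration usually starts with the realm's discovery document; a sketch, assuming a realm named `sankofa` (substitute your actual realm):

```shell
# Fetch the realm's OIDC discovery document and show the issuer, which
# must match the issuer the API is configured to expect.
oidc_check() {
  local realm="${1:-sankofa}"
  curl -sf "https://keycloak.sankofa.nexus/realms/${realm}/.well-known/openid-configuration" |
    grep -o '"issuer":"[^"]*"'
}
```

An issuer mismatch here explains JWT validation failures even when Keycloak itself is healthy.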
### Portal Not Loading

**Symptoms**: Portal returns 500 or blank page

**Investigation**:

```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:
- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors
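Environment variables on the deployment can be checked without shelling into a pod; a sketch using `kubectl set env --list`:

```shell
# Show the portal deployment's configured environment; values sourced
# from Secrets/ConfigMaps appear as references rather than plaintext.
portal_env() {
  kubectl set env deployment/portal -n portal --list
}
```

Compare the output against the portal's expected configuration (for example, the Keycloak URL) to spot a missing or stale variable.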
## Escalation

### When to Escalate
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe
### Escalation Path
- On-call Engineer → Team Lead
- Team Lead → Engineering Manager
- Engineering Manager → CTO/VP Engineering
- CTO → Executive Team
### Emergency Contacts
- On-call: [Phone/Slack]
- Team Lead: [Phone/Slack]
- Engineering Manager: [Phone/Slack]
- CTO: [Phone/Slack]
## Communication

### Status Page Updates
- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time
### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline