
Incident Response Runbook

Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

Incident Severity Levels

P0 - Critical (Immediate Response)

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Response Time: Immediate (< 5 minutes)
  • Resolution Target: < 1 hour

P1 - High (Urgent Response)

  • Partial service outage affecting multiple users
  • Performance degradation > 50%
  • Authentication failures
  • Response Time: < 15 minutes
  • Resolution Target: < 4 hours

P2 - Medium (Standard Response)

  • Single feature/service degraded
  • Performance degradation 20-50%
  • Non-critical errors
  • Response Time: < 1 hour
  • Resolution Target: < 24 hours

P3 - Low (Normal Response)

  • Minor issues
  • Cosmetic problems
  • Non-blocking errors
  • Response Time: < 4 hours
  • Resolution Target: < 1 week
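For quick reference during triage, the targets above can be encoded in a small helper. This is a sketch for convenience; `severity_targets` is not an existing platform tool:

```shell
# Map a severity level to its response/resolution targets from the
# table above. Unknown levels are reported on stderr.
severity_targets() {
  case "$1" in
    P0) echo "response=5m resolution=1h" ;;
    P1) echo "response=15m resolution=4h" ;;
    P2) echo "response=1h resolution=24h" ;;
    P3) echo "response=4h resolution=1w" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

severity_targets P1
```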

Incident Response Process

1. Detection and Triage

Detection Sources

  • Monitoring Alerts: Prometheus/Alertmanager
  • Error Logs: Loki, application logs
  • User Reports: Support tickets, status page
  • Health Checks: Automated health check failures

Initial Triage Steps

# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
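The HTTP health checks above can be swept in one pass. This is a sketch: `check_endpoint` is a hypothetical helper, and the endpoint list mirrors the commands above.

```shell
# Probe a health endpoint and report UP/DOWN.
# -f: fail on HTTP errors; -s: silent; --max-time bounds each probe.
check_endpoint() {
  if curl -fs --max-time 5 "$1" >/dev/null 2>&1; then
    echo "UP   $1"
  else
    echo "DOWN $1"
  fi
}

# Endpoints from the triage steps above
for url in \
  https://api.sankofa.nexus/health \
  https://portal.sankofa.nexus/api/health \
  https://keycloak.sankofa.nexus/health
do
  check_endpoint "$url"
done
```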

2. Incident Declaration

Create Incident Channel

  • Create dedicated Slack/Teams channel: #incident-YYYY-MM-DD-<name>
  • Invite: On-call engineer, Team lead, Product owner
  • Post initial status
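The channel naming convention above can be generated consistently with a one-liner. A sketch; `incident_channel` is illustrative, and the slug is supplied by the responder:

```shell
# Build the #incident-YYYY-MM-DD-<name> channel name from a short
# description, lowercasing it and replacing spaces with hyphens.
incident_channel() {
  slug=$(echo "$1" | tr 'A-Z ' 'a-z-')
  echo "#incident-$(date +%F)-${slug}"
}

incident_channel "API Outage"
```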

Incident Template

INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
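To post the initial status quickly, the template above can be pre-filled from the command line. A sketch (`incident_status` is a hypothetical helper); fields other than START TIME stay as prompts for the responder:

```shell
# Print a copy of the incident template with the description, severity,
# and UTC start time filled in, ready to paste into the channel.
incident_status() {
  cat <<EOF
INCIDENT: $1
SEVERITY: $2
STATUS: Investigating
START TIME: $(date -u +"%Y-%m-%dT%H:%M:%SZ")
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
EOF
}

incident_status "API latency spike" P1
```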

3. Investigation

Common Investigation Commands

Check Pod Status

kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100

Check Resource Usage

kubectl top nodes
kubectl top pods --all-namespaces

Check Database

# Connection count
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"

Check Logs

# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api --tail=1000 | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api --tail=1000 | grep -i "rate limit"

Check Monitoring

# Access Grafana (use xdg-open or a browser on Linux)
open https://grafana.sankofa.nexus

# Check Prometheus alerts
kubectl get prometheusrules -n monitoring

4. Resolution

Common Resolution Actions

Restart Service

kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal

Scale Up

kubectl scale deployment/api --replicas=5 -n api

Rollback Deployment

# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api

Clear Rate Limits (if needed)

# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api

Database Maintenance

# Vacuum database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"

5. Post-Incident

Incident Report Template

# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates

Common Incidents

API High Latency

Symptoms: API response times > 500ms

Investigation:

# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration

Resolution:

  • Scale API replicas
  • Optimize slow queries
  • Add database indexes
  • Check for N+1 query problems

Database Connection Pool Exhausted

Symptoms: "too many connections" errors

Investigation:

kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

Resolution:

  • Increase connection pool size
  • Kill idle connections
  • Scale database
  • Check for connection leaks
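For the "kill idle connections" step, it helps to build and review the SQL before running it. A sketch (`kill_idle_sql` is illustrative): it only prints the statement, which you can then pipe to psql after inspection.

```shell
# Emit SQL that terminates connections idle longer than N minutes
# (default 10), based on pg_stat_activity.state_change.
kill_idle_sql() {
  minutes="${1:-10}"
  echo "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '${minutes} minutes';"
}

kill_idle_sql 5
```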

Authentication Failures

Symptoms: Users cannot log in

Investigation:

# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api --tail=1000 | grep -i "auth.*fail"

Resolution:

  • Restart Keycloak if needed
  • Check OIDC configuration
  • Verify JWT secret
  • Check network connectivity

Portal Not Loading

Symptoms: Portal returns 500 or blank page

Investigation:

# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health

Resolution:

  • Restart portal deployment
  • Check environment variables
  • Verify Keycloak connectivity
  • Check build errors

Escalation

When to Escalate

  • P0 incident not resolved within 30 minutes
  • P1 incident not resolved within 2 hours
  • Additional expertise is needed
  • Customer impact is severe

Escalation Path

  1. On-call Engineer → Team Lead
  2. Team Lead → Engineering Manager
  3. Engineering Manager → CTO/VP Engineering
  4. CTO → Executive Team
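The path above can be encoded so the next contact is unambiguous under pressure. A sketch; `next_escalation` is illustrative and the role names mirror the list (adjust to your org chart):

```shell
# Given the current role, print who to escalate to next.
next_escalation() {
  case "$1" in
    "On-call Engineer")    echo "Team Lead" ;;
    "Team Lead")           echo "Engineering Manager" ;;
    "Engineering Manager") echo "CTO/VP Engineering" ;;
    "CTO/VP Engineering")  echo "Executive Team" ;;
    *)                     echo "unknown role: $1" >&2; return 1 ;;
  esac
}

next_escalation "Team Lead"
```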

Emergency Contacts

  • On-call: [Phone/Slack]
  • Team Lead: [Phone/Slack]
  • Engineering Manager: [Phone/Slack]
  • CTO: [Phone/Slack]

Communication

Status Page Updates

  • Update status page during incident
  • Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
  • Include: Status, affected services, estimated resolution time
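The update cadence above can be derived from the severity level, for example when scheduling reminder pings in the incident channel. A sketch; `update_interval_minutes` is illustrative:

```shell
# Status-page update interval per the policy above:
# every 30 minutes for P0/P1, hourly for P2/P3.
update_interval_minutes() {
  case "$1" in
    P0|P1) echo 30 ;;
    P2|P3) echo 60 ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

update_interval_minutes P0
```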

Customer Communication

  • For P0/P1: Notify affected customers immediately
  • For P2/P3: Include in next status update
  • Be transparent about impact and resolution timeline