# Rollback Plan

## Overview

This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.

## Rollback Strategy

### GitOps Rollback (Recommended)

All applications are managed via ArgoCD GitOps. Rollbacks should be performed through Git by reverting to a previous commit.

### Manual Rollback

For emergency situations, manual rollbacks can be performed directly in Kubernetes.

## Pre-Rollback Checklist

- [ ] Identify the commit/tag to roll back to
- [ ] Verify the previous version is stable
- [ ] Notify team of rollback
- [ ] Document reason for rollback
- [ ] Check database migration compatibility (if applicable)

## Rollback Procedures

### 1. API Service Rollback

#### GitOps Method

```bash
# 1. Identify the commit to roll back to
git log --oneline api/

# 2. Restore the app manifests from the previous commit or tag.
#    A bare `git checkout <commit>` detaches HEAD and leaves main unchanged,
#    so restore the files into the working tree and commit them instead.
cd gitops/apps/api
git checkout <commit-or-tag> -- .
git commit -m "Roll back api to <commit-or-tag>"
git push origin main

# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```

#### Manual Method

```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api

# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision>

# 3. Roll back to previous revision
kubectl rollout undo deployment/api -n api

# 4. Or roll back to specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision>

# 5. Monitor rollback
kubectl rollout status deployment/api -n api
```

### 2. Portal Rollback

#### GitOps Method

```bash
cd gitops/apps/portal
# Restore the manifests from the previous commit or tag, then commit and push
git checkout <commit-or-tag> -- .
git commit -m "Roll back portal to <commit-or-tag>"
git push origin main
argocd app sync portal
```

#### Manual Method

```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```

### 3. Database Migration Rollback

**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.
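Since not every migration can be reversed, it may help to check ahead of time whether a given migration ships a down script at all. The following is a minimal sketch only: the function name `has_down_migration` and the file layout `<version>_<name>.down.sql` are hypothetical and are not confirmed by this document, so adapt them to however the `api` project actually stores its migrations.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a migration version has a non-empty down script.
# ASSUMPTION: migrations are stored as <dir>/<version>_<name>.down.sql.
has_down_migration() {
  local dir="$1" version="$2" f
  for f in "$dir"/"${version}"_*.down.sql; do
    # -s is true only for an existing, non-empty file; an unmatched glob
    # stays a literal pattern and therefore fails this test.
    if [ -s "$f" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, `has_down_migration ./migrations 20240101 || echo "plan a forward fix"` would flag a version that has no usable down script. A non-empty down script does not prove the migration is safely reversible; it only rules out the obvious case where no rollback was written.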
#### Check Migration Status

```bash
# Connect to database.
# Single quotes make $DATABASE_URL expand inside the pod, not in your local shell.
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL"'

# Check migration history (run at the psql prompt)
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;
```

#### Rollback Migration (if reversible)

```bash
# Run down migration
cd api
npm run db:migrate:down

# Or manually revert SQL (the file path is resolved inside the pod)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -f /path/to/rollback.sql'
```

#### For Non-Reversible Migrations

1. Create new migration to restore previous state
2. Test in staging first
3. Apply during maintenance window
4. Document data loss risks

### 4. Frontend (Public Site) Rollback

#### GitOps Method

```bash
cd gitops/apps/frontend
# Restore the manifests from the previous commit or tag, then commit and push
git checkout <commit-or-tag> -- .
git commit -m "Roll back frontend to <commit-or-tag>"
git push origin main
argocd app sync frontend
```

#### Manual Method

```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```

### 5. Monitoring Stack Rollback

```bash
# Roll back Prometheus Operator
kubectl rollout undo deployment/prometheus-operator -n monitoring

# Roll back Grafana
kubectl rollout undo deployment/grafana -n monitoring

# Roll back Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```

### 6. Keycloak Rollback

```bash
# Roll back Keycloak
kubectl rollout undo deployment/keycloak -n keycloak

# Verify Keycloak health
curl https://keycloak.sankofa.nexus/health
```

## Post-Rollback Verification

### 1. Health Checks

```bash
# API
curl -f https://api.sankofa.nexus/health

# Portal
curl -f https://portal.sankofa.nexus/api/health

# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```

### 2. Functional Testing

```bash
# Run smoke tests
./scripts/smoke-tests.sh

# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```

### 3. Monitoring

- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors

### 4. Database Verification

```bash
# Verify database connectivity
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'

# Check for data integrity issues
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"'
```

## Rollback Scenarios

### Scenario 1: API Breaking Change

**Symptoms**: API returns errors after deployment

**Rollback Steps**:

1. Immediately roll back API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy

### Scenario 2: Database Migration Failure

**Symptoms**: Database errors, application crashes

**Rollback Steps**:

1. Stop application deployments
2. Assess migration state
3. Roll back migration if possible
4. Or restore from backup
5. Redeploy previous application version

### Scenario 3: Portal Build Failure

**Symptoms**: Portal shows blank page or errors

**Rollback Steps**:

1. Roll back portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy

### Scenario 4: Configuration Error

**Symptoms**: Services cannot connect to dependencies

**Rollback Steps**:

1. Revert configuration changes in Git
2. ArgoCD will sync automatically
3. Or manually update ConfigMaps/Secrets
4. Restart affected services

## Rollback Testing

### Staging Rollback Test

```bash
# 1. Deploy new version to staging
argocd app sync api-staging

# 2. Test new version
./scripts/smoke-tests.sh --env=staging

# 3. Simulate rollback
kubectl rollout undo deployment/api -n api-staging

# 4. Verify rollback works
./scripts/smoke-tests.sh --env=staging
```

## Rollback Communication

### Internal Communication

- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook

### External Communication

- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents

## Prevention

### Pre-Deployment

- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed

### Deployment

- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready

## Rollback Decision Matrix

| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |

## Emergency Contacts

- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]
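For on-call tooling, the Rollback Decision Matrix can be encoded as a small lookup so that scripts and chat bots give the same answer as this document. This is an illustrative sketch: the function name `rollback_decision` and the issue keys are hypothetical, while the severities and time limits are copied from the matrix.

```shell
#!/usr/bin/env bash
# Sketch of the Rollback Decision Matrix as a lookup.
# ASSUMPTION: the issue keys below are made-up identifiers for the matrix rows.
rollback_decision() {
  case "$1" in
    outage|data-corruption|security-breach)
      echo "P0: roll back immediately" ;;
    high-error-rate)   # more than 50% of requests failing
      echo "P1: roll back within 15 min" ;;
    perf-degraded)     # performance degraded by more than 50%
      echo "P1: roll back within 30 min" ;;
    feature-broken)
      echo "P2: assess impact" ;;
    minor-bug)
      echo "P3: fix forward" ;;
    *)
      echo "unknown issue: $1" >&2
      return 1 ;;
  esac
}
```

Calling `rollback_decision high-error-rate` prints the P1 guidance, and an unrecognized key returns a non-zero status so callers can fall back to a human decision.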