Sankofa/docs/runbooks/ROLLBACK_PLAN.md
defiQUG 9daf1fd378 Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements
2025-12-12 18:01:35 -08:00


# Rollback Plan
## Overview
This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.
## Rollback Strategy
### GitOps Rollback (Recommended)
All applications are managed via ArgoCD GitOps. Rollbacks should be performed through Git by reverting to a previous commit.
### Manual Rollback
For emergency situations, manual rollbacks can be performed directly in Kubernetes.
## Pre-Rollback Checklist
- [ ] Identify the commit/tag to roll back to
- [ ] Verify the previous version is stable
- [ ] Notify team of rollback
- [ ] Document reason for rollback
- [ ] Check database migration compatibility (if applicable)
## Rollback Procedures
### 1. API Service Rollback
#### GitOps Method
```bash
# 1. Identify the commit to rollback to
git log --oneline api/
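# 2. Restore the app manifests to the previous commit or tag.
#    (A plain `git checkout <hash>` detaches HEAD, so the push would change nothing.)
cd gitops/apps/api
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back api to <previous-commit-hash>"
git push origin main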
# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```
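After pushing the rollback commit, it can help to block until ArgoCD reports the application as synced and healthy rather than assuming the sync succeeded. The following is a minimal sketch, assuming the `argocd` CLI is already logged in and the application is named `api` as above. The `wait_for_app` helper and its 300-second default timeout are illustrative and not part of the platform tooling.

```bash
# Hypothetical helper: trigger a sync, then wait until the app is synced and healthy.
# Usage: wait_for_app <app-name> [timeout-seconds]
wait_for_app() {
  local app="$1"
  local timeout="${2:-300}"   # assumed default; tune per service

  argocd app sync "$app" || return 1
  # --sync and --health make the command block until both conditions hold.
  argocd app wait "$app" --sync --health --timeout "$timeout"
}
```

For example, `wait_for_app api 600` returns non-zero if the application is not healthy within ten minutes, which makes it usable as a gate in a rollback script.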
#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api
# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>
# 3. Roll back to previous revision
kubectl rollout undo deployment/api -n api
# 4. Or roll back to specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>
# 5. Monitor rollback
kubectl rollout status deployment/api -n api
```
### 2. Portal Rollback
#### GitOps Method
```bash
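cd gitops/apps/portal
# Restore the manifests and commit; a plain checkout of the hash would detach HEAD
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back portal to <previous-commit-hash>"
git push origin main
argocd app sync portal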
```
#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```
### 3. Database Migration Rollback
**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.
#### Check Migration Status
```bash
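# Check migration history.
# Single quotes make $DATABASE_URL expand inside the pod, not in your local shell.
# Assumes psql is available in the api image.
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;"'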
```
#### Rollback Migration (if reversible)
```bash
# Run down migration
cd api
npm run db:migrate:down
# Or manually revert SQL (the file path is resolved inside the pod)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -f /path/to/rollback.sql'
```
#### For Non-Reversible Migrations
1. Create new migration to restore previous state
2. Test in staging first
3. Apply during maintenance window
4. Document data loss risks
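
Because a non-reversible migration carries a data loss risk, taking a fresh backup before applying the restoring migration gives a fallback. The following is a minimal sketch, assuming `pg_dump` is installed wherever `DATABASE_URL` is set. The `backup_name` and `backup_database` helpers and the file naming scheme are hypothetical.

```bash
# Hypothetical helper: build a timestamped backup file name for a given label.
# Usage: backup_name <label> [timestamp]
backup_name() {
  local label="$1"
  local stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf 'pre-rollback-%s-%s.dump\n' "$label" "$stamp"
}

# Hypothetical helper: dump the database in custom format and print the file name.
# Usage: backup_database <label>
backup_database() {
  local file
  file="$(backup_name "$1")"
  # Custom format (-Fc equivalent) can be restored selectively with pg_restore.
  pg_dump --dbname="$DATABASE_URL" --format=custom --file="$file" && echo "$file"
}
```

A dump taken this way could later be restored with `pg_restore --dbname="$DATABASE_URL" <file>`, ideally rehearsed in staging first.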
### 4. Frontend (Public Site) Rollback
#### GitOps Method
```bash
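cd gitops/apps/frontend
# Restore the manifests and commit; a plain checkout of the hash would detach HEAD
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back frontend to <previous-commit-hash>"
git push origin main
argocd app sync frontend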
```
#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```
### 5. Monitoring Stack Rollback
```bash
# Roll back the Prometheus Operator
kubectl rollout undo deployment/prometheus-operator -n monitoring
# Rollback Grafana
kubectl rollout undo deployment/grafana -n monitoring
# Rollback Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```
### 6. Keycloak Rollback
```bash
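# Roll back Keycloak
kubectl rollout undo deployment/keycloak -n keycloak
kubectl rollout status deployment/keycloak -n keycloak
# Verify Keycloak health (-f makes curl exit non-zero on an HTTP error)
curl -f https://keycloak.sankofa.nexus/health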
```
## Post-Rollback Verification
### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health
# Portal
curl -f https://portal.sankofa.nexus/api/health
# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```
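Services often need a short time to become ready after a rollback, so a single `curl` can fail even when the rollback succeeded. The following is a minimal sketch of a retrying health check. The `wait_healthy` helper, its default of 10 attempts, and the 5-second delay are assumptions for illustration.

```bash
# Hypothetical helper: poll a health endpoint until it succeeds or attempts run out.
# Usage: wait_healthy <url> [attempts] [delay-seconds]
wait_healthy() {
  local url="$1"
  local attempts="${2:-10}"   # assumed default
  local delay="${3:-5}"       # assumed default
  local i=1

  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors; -sS: quiet but still print errors;
    # --max-time: bound each request so a hung endpoint cannot stall the loop.
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done

  echo "unhealthy: $url" >&2
  return 1
}
```

For example, `wait_healthy https://api.sankofa.nexus/health` retries for up to roughly a minute before reporting failure.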
### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh
# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
-H "Content-Type: application/json" \
-d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```
### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors
### 4. Database Verification
```bash
# Verify database connectivity
# (single quotes make $DATABASE_URL expand inside the pod, not locally)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
# Check for data integrity issues
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"'
```
## Rollback Scenarios
### Scenario 1: API Breaking Change
**Symptoms**: API returns errors after deployment
**Rollback Steps**:
1. Immediately roll back the API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy
### Scenario 2: Database Migration Failure
**Symptoms**: Database errors, application crashes
**Rollback Steps**:
1. Stop application deployments
2. Assess migration state
3. Roll back the migration if it is reversible; otherwise restore from backup
4. Redeploy previous application version
### Scenario 3: Portal Build Failure
**Symptoms**: Portal shows blank page or errors
**Rollback Steps**:
1. Rollback portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy
### Scenario 4: Configuration Error
**Symptoms**: Services cannot connect to dependencies
**Rollback Steps**:
1. Revert configuration changes in Git
2. Let ArgoCD sync automatically, or in an emergency update the ConfigMaps/Secrets manually
3. Restart affected services so they pick up the reverted configuration
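
Pods that read configuration through environment variables do not pick up ConfigMap or Secret changes until they restart. The restart step can be sketched as follows, assuming the affected services run as Deployments. The `restart_and_wait` helper and the 300-second timeout are hypothetical.

```bash
# Hypothetical helper: restart a deployment and wait for the rollout to finish.
# Usage: restart_and_wait <deployment> <namespace>
restart_and_wait() {
  local deploy="$1"
  local ns="$2"

  # Triggers a rolling restart so new pods read the reverted configuration.
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  # Blocks until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s
}
```

For example, `restart_and_wait api api` restarts the API deployment and fails if the new pods are not ready within five minutes.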
## Rollback Testing
### Staging Rollback Test
```bash
# 1. Deploy new version to staging
argocd app sync api-staging
# 2. Test new version
./scripts/smoke-tests.sh --env=staging
# 3. Simulate rollback
kubectl rollout undo deployment/api -n api-staging
# 4. Verify rollback works
./scripts/smoke-tests.sh --env=staging
```
## Rollback Communication
### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook
### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents
## Prevention
### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed
### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready
## Rollback Decision Matrix
| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |
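
If rollback tooling or alert handlers need to apply this matrix automatically, it could be encoded as a small lookup. The following is a minimal sketch: the `rollback_decision` helper is hypothetical, and the P1 entry merges the 15-minute and 30-minute rows into one range.

```bash
# Hypothetical helper: look up the rollback decision for a severity level.
# Usage: rollback_decision <P0|P1|P2|P3>
rollback_decision() {
  case "$1" in
    P0) echo "yes: immediately" ;;
    P1) echo "yes: within 15-30 min" ;;
    P2) echo "maybe: assess impact" ;;
    P3) echo "no: fix forward" ;;
    *)  echo "unknown severity: $1" >&2; return 2 ;;
  esac
}
```

For example, `rollback_decision P1` prints the P1 guidance, and an unrecognized severity returns a non-zero status so callers can fall back to a human decision.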
## Emergency Contacts
- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]