# Rollback Plan

## Overview

This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.

## Rollback Strategy

### GitOps Rollback (Recommended)

All applications are managed via ArgoCD GitOps. Perform rollbacks through Git by restoring the manifests to a previous commit and pushing the result, so that ArgoCD syncs the cluster back to that state.

### Manual Rollback

In an emergency, a rollback can be performed directly in Kubernetes.

## Pre-Rollback Checklist

- [ ] Identify the commit/tag to roll back to
- [ ] Verify the previous version is stable
- [ ] Notify the team of the rollback
- [ ] Document the reason for the rollback
- [ ] Check database migration compatibility (if applicable)

## Rollback Procedures

### 1. API Service Rollback

#### GitOps Method
```bash
# 1. Identify the commit to roll back to
git log --oneline api/

# 2. Restore this directory to the previous commit and commit the result.
#    (A bare `git checkout <hash>` only detaches HEAD, so there would be
#    nothing new to push. `git restore` requires Git 2.23 or later.)
cd gitops/apps/api
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back api to <previous-commit-hash>"
git push origin main

# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```

#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api

# 2. View a specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>

# 3. Roll back to the previous revision
kubectl rollout undo deployment/api -n api

# 4. Or roll back to a specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>

# 5. Monitor the rollback
kubectl rollout status deployment/api -n api
```

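The manual steps above can be wrapped so that the undo and the status check always run together. This is a minimal sketch, not part of the platform's tooling: the function name `rollback_deployment`, the `KUBECTL` override variable, and the 120-second default timeout are all assumptions made for illustration.

```bash
# Hypothetical helper: undo a Deployment rollout, then wait for it to settle.
# KUBECTL is an assumed override (e.g. KUBECTL=echo) for dry-running the commands.
rollback_deployment() {
  local name="${1:-}" namespace="${2:-}" revision="${3:-}" timeout="${4:-120s}"
  local kubectl="${KUBECTL:-kubectl}"

  if [ -z "$name" ] || [ -z "$namespace" ]; then
    echo "usage: rollback_deployment <name> <namespace> [revision] [timeout]" >&2
    return 2
  fi

  if [ -n "$revision" ]; then
    # Roll back to an explicit revision from `kubectl rollout history`.
    $kubectl rollout undo "deployment/$name" -n "$namespace" --to-revision="$revision" || return 1
  else
    # No revision given: roll back to the previous one.
    $kubectl rollout undo "deployment/$name" -n "$namespace" || return 1
  fi

  # Exits non-zero if the rollout does not complete within the timeout.
  $kubectl rollout status "deployment/$name" -n "$namespace" --timeout="$timeout"
}
```

Running `KUBECTL=echo rollback_deployment api api 3` prints the two commands instead of executing them, which is a cheap way to review what would run before touching the cluster.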
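### 2. Portal Rollback

#### GitOps Method
```bash
cd gitops/apps/portal
# Restore this directory to the previous commit and commit the result.
# (A bare `git checkout <hash>` only detaches HEAD and leaves nothing to push.)
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back portal to <previous-commit-hash>"
git push origin main
argocd app sync portal
```
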
#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```

### 3. Database Migration Rollback

**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.

#### Check Migration Status
```bash
# Connect to the database. The single quotes make $DATABASE_URL expand
# inside the pod rather than in your local shell.
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL"'

# Check migration history (one-shot, without an interactive session)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;"'
```

#### Roll Back Migration (if reversible)
```bash
# Run the down migration
cd api
npm run db:migrate:down

# Or manually revert with SQL. The rollback file lives on your machine,
# so it is streamed to psql in the pod over stdin.
kubectl exec -i -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL"' < /path/to/rollback.sql
```

#### For Non-Reversible Migrations
1. Create a new migration that restores the previous state
2. Test it in staging first
3. Apply it during a maintenance window
4. Document data loss risks

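### 4. Frontend (Public Site) Rollback

#### GitOps Method
```bash
cd gitops/apps/frontend
# Restore this directory to the previous commit and commit the result.
# (A bare `git checkout <hash>` only detaches HEAD and leaves nothing to push.)
git restore --source=<previous-commit-hash> --staged --worktree -- .
git commit -m "Roll back frontend to <previous-commit-hash>"
git push origin main
argocd app sync frontend
```
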
#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```

### 5. Monitoring Stack Rollback

```bash
# Roll back the Prometheus Operator
kubectl rollout undo deployment/prometheus-operator -n monitoring

# Roll back Grafana
kubectl rollout undo deployment/grafana -n monitoring

# Roll back Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```

### 6. Keycloak Rollback

```bash
# Roll back Keycloak
kubectl rollout undo deployment/keycloak -n keycloak

# Verify Keycloak health (-f makes curl exit non-zero on HTTP errors)
curl -f https://keycloak.sankofa.nexus/health
```

## Post-Rollback Verification

### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health

# Portal
curl -f https://portal.sankofa.nexus/api/health

# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```

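Right after a rollback, pods may still be starting, so a single `curl -f` can fail even though the service is about to become healthy. The sketch below shows one way to poll instead of checking once. The `retry` function is a hypothetical helper, and the attempt count and delay in the usage comment are illustrative assumptions, not values defined by this runbook.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# Example (assumed values): retry 10 6 curl -fsS https://api.sankofa.nexus/health
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

The function returns the same pass/fail signal as the wrapped command, so it can gate later verification steps in a script.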
### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh

# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```

### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors

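"Normal" is easier to verify when it is tied to a number. One option is to compare the observed error rate against the 50% line used in the Rollback Decision Matrix. The sketch below assumes you have already read a request count and an error count from your dashboards; the function name `error_rate_exceeds` and its argument order are hypothetical.

```bash
# Hypothetical helper: exit 0 when errors/total is strictly above threshold_pct.
# Usage: error_rate_exceeds <errors> <total> <threshold_pct>
error_rate_exceeds() {
  local errors="$1" total="$2" threshold_pct="$3"
  if [ "$total" -le 0 ]; then
    echo "total must be positive" >&2
    return 2
  fi
  # Integer-only comparison:
  # errors/total > threshold/100  <=>  errors*100 > threshold*total
  [ $((errors * 100)) -gt $((threshold_pct * total)) ]
}
```

For example, `error_rate_exceeds 620 1000 50` succeeds (62% is above 50%), while `error_rate_exceeds 500 1000 50` does not, because the comparison is strict.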
### 4. Database Verification
```bash
# Verify database connectivity. The single quotes make $DATABASE_URL
# expand inside the pod rather than in your local shell.
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'

# Sanity-check row counts as a first data-integrity signal
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"'
```

## Rollback Scenarios

### Scenario 1: API Breaking Change

**Symptoms**: API returns errors after deployment

**Rollback Steps**:
1. Immediately roll back the API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy

### Scenario 2: Database Migration Failure

**Symptoms**: Database errors, application crashes

**Rollback Steps**:
1. Stop application deployments
2. Assess migration state
3. Roll back the migration if possible
4. Otherwise, restore from backup
5. Redeploy the previous application version

### Scenario 3: Portal Build Failure

**Symptoms**: Portal shows blank page or errors

**Rollback Steps**:
1. Roll back the portal deployment
2. Verify the portal loads
3. Check build logs
4. Fix build issues
5. Redeploy

### Scenario 4: Configuration Error

**Symptoms**: Services cannot connect to dependencies

**Rollback Steps**:
1. Revert configuration changes in Git
2. ArgoCD will sync automatically
3. Or manually update ConfigMaps/Secrets
4. Restart affected services

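Step 4 matters because pods that read ConfigMaps or Secrets through environment variables keep the old values until they are restarted. A possible sketch of that restart step follows; `restart_deployments` and the `KUBECTL` override are hypothetical names, and the 120-second timeout is an assumed default.

```bash
# Hypothetical helper: restart several Deployments in one namespace, then
# wait for each to become ready again.
# Usage: restart_deployments <namespace> <deployment> [deployment...]
# KUBECTL is an assumed override (e.g. KUBECTL=echo) for dry-running.
restart_deployments() {
  if [ "$#" -lt 2 ]; then
    echo "usage: restart_deployments <namespace> <deployment> [deployment...]" >&2
    return 2
  fi
  local namespace="$1" kubectl="${KUBECTL:-kubectl}" name
  shift

  # Trigger all restarts first so they proceed in parallel.
  for name in "$@"; do
    $kubectl rollout restart "deployment/$name" -n "$namespace" || return 1
  done

  # Then block until every rollout has finished (or timed out).
  for name in "$@"; do
    $kubectl rollout status "deployment/$name" -n "$namespace" --timeout=120s || return 1
  done
}
```

Triggering every restart before waiting keeps the total time close to that of the slowest Deployment rather than the sum of all of them.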
## Rollback Testing

### Staging Rollback Test
```bash
# 1. Deploy the new version to staging
argocd app sync api-staging

# 2. Test the new version
./scripts/smoke-tests.sh --env=staging

# 3. Simulate a rollback
kubectl rollout undo deployment/api -n api-staging

# 4. Verify the rollback works
./scripts/smoke-tests.sh --env=staging
```

## Rollback Communication

### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook

### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents

## Prevention

### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed

### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready

## Rollback Decision Matrix

| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |

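If the matrix is consulted from scripts or chat tooling, it can be encoded as a small lookup. The sketch below is only an illustration: the function name `rollback_decision` and the output strings are hypothetical, and the two P1 rows are collapsed into a single 15-30 minute window.

```bash
# Hypothetical helper: map a severity level from the matrix to an action.
# Usage: rollback_decision <P0|P1|P2|P3>
rollback_decision() {
  case "${1:-}" in
    P0) echo "rollback immediately" ;;
    P1) echo "rollback within 15-30 min" ;;
    P2) echo "assess impact" ;;
    P3) echo "fix forward" ;;
    *)
      echo "unknown severity: ${1:-<none>}" >&2
      return 2
      ;;
  esac
}
```

An unknown severity returns exit code 2 instead of guessing, so a caller can fall back to a human decision.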
## Emergency Contacts

- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]