# Rollback Plan

## Overview

This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.

## Rollback Strategy

### GitOps Rollback (Recommended)

All applications are managed via ArgoCD GitOps, so Git is the source of truth. Perform rollbacks by reverting the offending commit in Git and letting ArgoCD sync the result to the cluster.

### Manual Rollback

In an emergency, roll back directly in Kubernetes with `kubectl`. Revert the change in Git afterwards so that the repository and the cluster agree.
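
If an ArgoCD application has automated sync with self-heal enabled, ArgoCD can overwrite a manual `kubectl` change with the state declared in Git. Pausing automated sync for the duration of the rollback is one way to avoid that. The sketch below assumes the `argocd` CLI is installed and logged in, and that the application uses automated sync (this document does not confirm either). The function names are hypothetical.

```bash
# Hypothetical helpers: pause and resume ArgoCD automated sync around a manual rollback.

pause_autosync() {
  # $1 = ArgoCD application name (for example: api)
  argocd app set "$1" --sync-policy none
}

resume_autosync() {
  # $1 = ArgoCD application name
  argocd app set "$1" --sync-policy automated
}

# Example (not run here):
#   pause_autosync api
#   kubectl rollout undo deployment/api -n api
#   ...revert the faulty change in Git, then:
#   resume_autosync api
```

Resume automated sync only after Git has been reverted; otherwise the next sync redeploys the faulty version.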

## Pre-Rollback Checklist

- [ ] Identify the commit or tag to roll back to
- [ ] Verify that the target version is known to be stable
- [ ] Notify the team of the rollback
- [ ] Document the reason for the rollback
- [ ] Check database migration compatibility (if applicable)

## Rollback Procedures

### 1. API Service Rollback

#### GitOps Method
```bash
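# 1. Identify the commit that introduced the problem
git log --oneline -- gitops/apps/api

# 2. Revert that commit on main. This adds a new commit and preserves history.
#    (Checking out an old hash only detaches HEAD; pushing main would change nothing.)
git checkout main
git pull origin main
git revert --no-edit <bad-commit-hash>
git push origin main

# 3. ArgoCD syncs automatically if automated sync is enabled.
# Otherwise, sync manually:
argocd app sync api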
```

#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api

# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>

# 3. Roll back to the previous revision
kubectl rollout undo deployment/api -n api

# 4. Or roll back to a specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>

# 5. Monitor the rollback
kubectl rollout status deployment/api -n api
```
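
The manual steps above can be wrapped in a small helper so that the rollback and the status check always run together. This is a sketch, assuming `kubectl` is configured for the target cluster. The function name `rollback_deployment` and the 300-second timeout are illustrative choices, not part of the platform's tooling.

```bash
# Hypothetical helper: roll back a Deployment and wait for the rollout to finish.
# Usage: rollback_deployment <deployment> <namespace> [revision]
rollback_deployment() {
  local name="$1" namespace="$2" revision="${3:-}"

  if [ -n "$revision" ]; then
    # Roll back to an explicit revision taken from `kubectl rollout history`
    kubectl rollout undo "deployment/$name" -n "$namespace" --to-revision="$revision" || return 1
  else
    # No revision given: roll back to the previous one
    kubectl rollout undo "deployment/$name" -n "$namespace" || return 1
  fi

  # Block until the rollout completes; exits non-zero if it does not finish in time
  kubectl rollout status "deployment/$name" -n "$namespace" --timeout=300s
}

# Example (not run here):
#   rollback_deployment api api        # previous revision
#   rollback_deployment api api 3      # specific revision
```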

### 2. Portal Rollback

#### GitOps Method
```bash
# Find and revert the offending commit, then sync
git log --oneline -- gitops/apps/portal
git revert --no-edit <bad-commit-hash>
git push origin main
argocd app sync portal
```

#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```

### 3. Database Migration Rollback

**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.

#### Check Migration Status
```bash
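# Check migration history.
# The command runs through `sh -c` so that $DATABASE_URL is expanded
# inside the container rather than in your local shell.
kubectl exec -n api deployment/api -- sh -c \
  'psql "$DATABASE_URL" -c "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;"'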
```

#### Rollback Migration (if reversible)
```bash
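# Run the down migration from the api project directory
cd api
npm run db:migrate:down

# Or apply a hand-written rollback script.
# The script path must exist inside the container; $DATABASE_URL is expanded there.
kubectl exec -n api deployment/api -- sh -c \
  'psql "$DATABASE_URL" -f /path/to/rollback.sql'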
```

#### For Non-Reversible Migrations
1. Create a new forward migration that restores the previous state
2. Test it in staging first
3. Apply it during a maintenance window
4. Document any data-loss risk before applying
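
Because a non-reversible migration cannot simply be undone, taking a dump immediately before applying the restoring migration gives a fallback. This is a minimal sketch, assuming the database is PostgreSQL (consistent with the `psql` commands in this document) and that `pg_dump` can reach it from where the script runs. The function name and file naming scheme are hypothetical.

```bash
# Hypothetical helper: dump the database before applying a restoring migration.
# Usage: backup_database <connection-uri> <output-dir>
backup_database() {
  local uri="$1" dir="$2" file
  file="$dir/pre-migration-$(date -u +%Y%m%dT%H%M%SZ).dump"

  # Custom format is compressed and can be restored selectively with pg_restore
  pg_dump --dbname="$uri" --format=custom --file="$file" || return 1

  # Print the dump path so callers can record it in the incident notes
  echo "$file"
}

# Example (not run here):
#   backup_database "$DATABASE_URL" /var/backups
```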

### 4. Frontend (Public Site) Rollback

#### GitOps Method
```bash
# Find and revert the offending commit, then sync
git log --oneline -- gitops/apps/frontend
git revert --no-edit <bad-commit-hash>
git push origin main
argocd app sync frontend
```

#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```

### 5. Monitoring Stack Rollback

```bash
# Roll back the Prometheus Operator
kubectl rollout undo deployment/prometheus-operator -n monitoring

# Roll back Grafana
kubectl rollout undo deployment/grafana -n monitoring

# Roll back Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```

### 6. Keycloak Rollback

```bash
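# Roll back Keycloak and wait for the rollout to finish
kubectl rollout undo deployment/keycloak -n keycloak
kubectl rollout status deployment/keycloak -n keycloak

# Verify Keycloak health (-f makes curl exit non-zero on HTTP errors)
curl -f https://keycloak.sankofa.nexus/health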
```

## Post-Rollback Verification

### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health

# Portal
curl -f https://portal.sankofa.nexus/api/health

# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```
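
Services can take a short while to become ready after a rollback, so a single `curl` may fail even when the rollback succeeded. The sketch below retries a health endpoint a few times before giving up. The function name, attempt count, and delay are illustrative assumptions.

```bash
# Hypothetical helper: poll a health endpoint until it responds or attempts run out.
# Usage: wait_healthy <url> [attempts] [delay-seconds]
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-6}" i=1

  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -sS: quiet but still print errors,
    # --max-time: bound each request to 5 seconds
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy: $url"
      return 0
    fi
    # Wait between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done

  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}

# Example (not run here):
#   wait_healthy https://api.sankofa.nexus/health
#   wait_healthy https://portal.sankofa.nexus/api/health
#   wait_healthy https://keycloak.sankofa.nexus/health
```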

### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh

# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```

### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors
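
The Prometheus check can also be scripted against the Prometheus HTTP API, which is convenient during an incident. This is a sketch only: the base URL and the metric name in the example are assumptions, since this document does not state how Prometheus is exposed or which metrics the services export.

```bash
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# Usage: prom_query <prometheus-base-url> <promql-expression>
prom_query() {
  # --get with --data-urlencode sends the expression as a URL-encoded query parameter
  curl -fsS --get --data-urlencode "query=$2" "$1/api/v1/query"
}

# Example (not run here; URL and metric name are assumptions):
#   prom_query http://localhost:9090 \
#     'sum(rate(http_requests_total{status=~"5.."}[5m]))'
```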

### 4. Database Verification
```bash
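# Verify database connectivity.
# `sh -c` ensures $DATABASE_URL is expanded inside the container.
kubectl exec -n api deployment/api -- sh -c \
  'psql "$DATABASE_URL" -c "SELECT 1"'

# Sanity check: the users table is readable and the row count looks plausible
kubectl exec -n api deployment/api -- sh -c \
  'psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"'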
```

## Rollback Scenarios

### Scenario 1: API Breaking Change

**Symptoms**: API returns errors after deployment

**Rollback Steps**:
1. Roll back the API deployment immediately
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy

### Scenario 2: Database Migration Failure

**Symptoms**: Database errors, application crashes

**Rollback Steps**:
1. Stop application deployments
2. Assess the migration state
3. Roll back the migration if it is reversible
4. Otherwise, restore from backup
5. Redeploy the previous application version

### Scenario 3: Portal Build Failure

**Symptoms**: Portal shows blank page or errors

**Rollback Steps**:
1. Roll back the portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy

### Scenario 4: Configuration Error

**Symptoms**: Services cannot connect to dependencies

**Rollback Steps**:
1. Revert the configuration change in Git
2. Wait for ArgoCD to sync automatically
3. If the sync cannot wait, update the ConfigMaps/Secrets manually
4. Restart the affected services so they pick up the reverted configuration
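
Pods that read ConfigMap or Secret values as environment variables do not see updated values until they restart, which is why the last step matters. A sketch of the restart step, assuming `kubectl` access to the cluster; the function name is hypothetical and the deployments in the example are taken from the procedures above.

```bash
# Hypothetical helper: restart one or more Deployments in a namespace and wait for each.
# Usage: restart_deployments <namespace> <deployment>...
restart_deployments() {
  local namespace="$1" name
  shift

  for name in "$@"; do
    # Triggers a rolling restart so pods pick up the current ConfigMaps/Secrets
    kubectl rollout restart "deployment/$name" -n "$namespace" || return 1
    # Wait for the restart to complete before moving to the next Deployment
    kubectl rollout status "deployment/$name" -n "$namespace" --timeout=300s || return 1
  done
}

# Example (not run here):
#   restart_deployments api api
#   restart_deployments portal portal
```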

## Rollback Testing

### Staging Rollback Test
```bash
# 1. Deploy new version to staging
argocd app sync api-staging

# 2. Test new version
./scripts/smoke-tests.sh --env=staging

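# 3. Simulate a rollback and wait for it to complete
kubectl rollout undo deployment/api -n api-staging
kubectl rollout status deployment/api -n api-staging

# 4. Verify that the rolled-back version passes the smoke tests
./scripts/smoke-tests.sh --env=staging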
```

## Rollback Communication

### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook

### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents

## Prevention

### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed

### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready

## Rollback Decision Matrix

| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |
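
For incident tooling or chat bots, the matrix can be encoded as a simple lookup by severity. This is an illustrative sketch; the function name is hypothetical, and the P1 entry merges the two P1 rows (15 and 30 minutes) into a single range.

```bash
# Hypothetical helper: map an incident severity to the rollback guidance in the matrix.
# Usage: rollback_decision <P0|P1|P2|P3>
rollback_decision() {
  case "$1" in
    P0) echo "roll back immediately" ;;
    P1) echo "roll back within 15-30 minutes" ;;
    P2) echo "assess impact before deciding" ;;
    P3) echo "fix forward" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example:
#   rollback_decision P1
```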

## Emergency Contacts

- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]