# Sankofa Phoenix - Deployment Execution Plan
**Date**: 2025-01-XX

**Status**: Ready for Execution

---
## Executive Summary
This document provides a step-by-step execution plan for deploying Sankofa and Sankofa Phoenix. Infrastructure is operational and the VM YAML files are ready; the items listed under "Requires Verification" below must still be confirmed before execution begins.

---
## Pre-Execution Checklist
### ✅ Completed
- [x] Proxmox infrastructure operational (2 sites)
- [x] All 21 VM YAML files updated with enhanced template
- [x] Guest agent configuration complete
- [x] OS images available (ubuntu-22.04-cloud.img)
- [x] Network configuration verified
- [x] Documentation comprehensive
- [x] Scripts ready for deployment
### ⚠️ Requires Verification
- [ ] Resource quota check (run `./scripts/check-proxmox-quota.sh`)
- [ ] Kubernetes cluster status
- [ ] Database connectivity
- [ ] Keycloak deployment status
---
## Execution Phases
### Phase 1: Resource Verification (15 minutes)
**Objective**: Verify Proxmox resources are sufficient for deployment
**Steps**:
```bash
cd /home/intlc/projects/Sankofa
# 1. Run resource quota check
./scripts/check-proxmox-quota.sh
# 2. Review output
# Expected: Available resources >= 72 CPU, 140 GiB RAM, 278 GiB disk
# 3. If insufficient, document and plan expansion
```
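The thresholds in step 2 can be turned into a pass/fail check. The following is a minimal sketch, assuming the available CPU, RAM, and disk figures have already been read from the output of `check-proxmox-quota.sh` (its output format is not shown in this plan). `has_capacity` and the `REQ_*` variables are hypothetical names, not part of the repository.

```bash
#!/usr/bin/env bash
# Required totals, copied from the expected values in step 2 above.
REQ_CPU=72
REQ_RAM_GIB=140
REQ_DISK_GIB=278

# has_capacity CPU RAM_GIB DISK_GIB   (hypothetical helper, integers only)
# Prints one line per shortfall and returns 1 if any figure is too low.
has_capacity() {
  local cpu=$1 ram=$2 disk=$3 short=0
  if (( cpu < REQ_CPU )); then
    echo "CPU short by $(( REQ_CPU - cpu ))"; short=1
  fi
  if (( ram < REQ_RAM_GIB )); then
    echo "RAM short by $(( REQ_RAM_GIB - ram )) GiB"; short=1
  fi
  if (( disk < REQ_DISK_GIB )); then
    echo "Disk short by $(( REQ_DISK_GIB - disk )) GiB"; short=1
  fi
  return "$short"
}

# Example with made-up availability figures:
#   has_capacity 96 192 500 && echo "Sufficient for deployment"
```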
**Success Criteria**:
- ✅ Resources sufficient for all 18 VMs
- ✅ Storage pools have adequate space
- ✅ Network connectivity verified
**Rollback**: None required - verification only

---
### Phase 2: Kubernetes Control Plane (30-60 minutes)
**Objective**: Deploy and verify Kubernetes control plane components
**Steps**:
```bash
# 1. Verify Kubernetes cluster
kubectl cluster-info
kubectl get nodes
# 2. Create namespaces
kubectl create namespace sankofa --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace crossplane-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -
# 3. Deploy Crossplane
kubectl apply -f gitops/apps/crossplane/
kubectl wait --for=condition=Ready pod -l app=crossplane -n crossplane-system --timeout=300s
# 4. Deploy Proxmox Provider
kubectl apply -f crossplane-provider-proxmox/config/
kubectl wait --for=condition=Installed provider -l pkg.crossplane.io/name=provider-proxmox --timeout=300s
# 5. Create ProviderConfig
kubectl apply -f crossplane-provider-proxmox/config/provider.yaml
# 6. Verify
kubectl get pods -n crossplane-system
kubectl get providerconfig -A
```
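Steps such as applying the ProviderConfig can fail transiently if the provider has not finished registering its resource types. One way to handle that is a small retry wrapper. This is a sketch only; `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if (( n >= attempts )); then
      echo "Giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$delay"
  done
}

# Example: try the ProviderConfig apply up to 5 times, 10 seconds apart.
#   retry 5 10 kubectl apply -f crossplane-provider-proxmox/config/provider.yaml
```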
**Success Criteria**:
- ✅ Crossplane pods running
- ✅ Proxmox provider installed
- ✅ ProviderConfig ready
**Rollback**:
```bash
kubectl delete -f crossplane-provider-proxmox/config/
kubectl delete -f gitops/apps/crossplane/
```
---
### Phase 3: Database and Identity (30-45 minutes)
**Objective**: Deploy PostgreSQL and Keycloak
**Steps**:
```bash
# 1. Deploy PostgreSQL (if not external)
kubectl apply -f gitops/apps/postgresql/ # If exists
# 2. Run database migrations (in a subshell, so later steps still run from the repo root)
(cd api && npm install && npm run db:migrate)
# 3. Verify migrations
psql -h <db-host> -U postgres -d sankofa -c "\dt" | grep -E "tenants|billing"
# 4. Deploy Keycloak
kubectl apply -f gitops/apps/keycloak/
# 5. Wait for Keycloak ready
kubectl wait --for=condition=Ready pod -l app=keycloak -n sankofa --timeout=600s
# 6. Configure Keycloak clients
kubectl apply -f gitops/apps/keycloak/keycloak-clients.yaml
```
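The `grep -E "tenants|billing"` check in step 3 succeeds if either pattern matches. A stricter variant reports every pattern that matches no table. The sketch below assumes `psql` is run with `-At`, which prints `\dt` rows as `schema|name|type|owner`; `missing_tables` is a hypothetical helper, and the two patterns are taken from the grep above rather than from the actual schema.

```bash
#!/usr/bin/env bash
# missing_tables PATTERN...   (hypothetical helper)
# Reads `psql -At -c '\dt'` rows on stdin and prints each pattern that
# matches no table name. Prints nothing when every pattern is found.
missing_tables() {
  local names pattern
  names=$(cut -d'|' -f2)          # keep only the table-name column
  for pattern in "$@"; do
    grep -Eq -- "$pattern" <<<"$names" || echo "$pattern"
  done
}

# Example (same connection placeholders as step 3):
#   psql -h <db-host> -U postgres -d sankofa -At -c '\dt' \
#     | missing_tables tenants billing
```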
**Success Criteria**:
- ✅ Database migrations complete (26 migrations)
- ✅ Keycloak pods running
- ✅ Keycloak clients configured
**Rollback**:
```bash
kubectl delete -f gitops/apps/keycloak/
# Database rollback: Restore from backup or re-run migrations
```
---
### Phase 4: Application Deployment (30-45 minutes)
**Objective**: Deploy API, Frontend, and Portal
**Steps**:
```bash
# 1. Create secrets (replace the quoted placeholders with real values)
kubectl create secret generic api-secrets -n sankofa \
  --from-literal=DB_PASSWORD='<db-password>' \
  --from-literal=JWT_SECRET='<jwt-secret>' \
  --from-literal=KEYCLOAK_CLIENT_SECRET='<keycloak-secret>' \
  --dry-run=client -o yaml | kubectl apply -f -
# 2. Deploy API
kubectl apply -f gitops/apps/api/
kubectl wait --for=condition=Ready pod -l app=api -n sankofa --timeout=300s
# 3. Deploy Frontend
kubectl apply -f gitops/apps/frontend/
kubectl wait --for=condition=Ready pod -l app=frontend -n sankofa --timeout=300s
# 4. Deploy Portal
kubectl apply -f gitops/apps/portal/
kubectl wait --for=condition=Ready pod -l app=portal -n sankofa --timeout=300s
# 5. Verify health endpoints
curl http://api.sankofa.nexus/health
curl http://frontend.sankofa.nexus
curl http://portal.sankofa.nexus
```
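Plain `curl` exits successfully even when the server answers with an error status, so the checks in step 5 are easy to misread. The sketch below shows one way to probe every endpoint and fail if any of them is unhealthy. `http_ok` and `check_endpoints` are hypothetical helpers; the 10-second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# http_ok URL   (hypothetical helper)
# Succeeds only if the server answers with a status below 400 in time.
http_ok() {
  curl -fsS -o /dev/null --max-time 10 "$1"
}

# check_endpoints URL...   (hypothetical helper)
# Probes every URL, prints OK/FAIL per URL, returns 1 if any probe failed.
check_endpoints() {
  local url failed=0
  for url in "$@"; do
    if http_ok "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example, using the URLs from step 5:
#   check_endpoints http://api.sankofa.nexus/health \
#     http://frontend.sankofa.nexus http://portal.sankofa.nexus
```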
**Success Criteria**:
- ✅ All application pods running
- ✅ Health endpoints responding
- ✅ No critical errors in logs
**Rollback**:
```bash
kubectl rollout undo deployment/api -n sankofa
kubectl rollout undo deployment/frontend -n sankofa
kubectl rollout undo deployment/portal -n sankofa
```
---
### Phase 5: Infrastructure VMs (15-30 minutes)
**Objective**: Deploy Nginx Proxy and Cloudflare Tunnel VMs
**Steps**:
```bash
# 1. Deploy Nginx Proxy VM
kubectl apply -f examples/production/nginx-proxy-vm.yaml
# 2. Deploy Cloudflare Tunnel VM
kubectl apply -f examples/production/cloudflare-tunnel-vm.yaml
# 3. Monitor deployment
watch kubectl get proxmoxvm -A
# 4. Wait for VMs ready (check status)
kubectl wait --for=condition=Ready proxmoxvm nginx-proxy-vm -n default --timeout=600s
kubectl wait --for=condition=Ready proxmoxvm cloudflare-tunnel-vm -n default --timeout=600s
# 5. Verify VM creation in Proxmox
ssh root@192.168.11.10 "qm list | grep -E 'nginx-proxy|cloudflare-tunnel'"
# 6. Check guest agent
ssh root@192.168.11.10 "qm guest exec <vmid> -- cat /etc/os-release"
```
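Step 6 needs a `<vmid>`, which can be looked up from the `qm list` output of step 5. The sketch below assumes the Proxmox VM name equals the name passed in (for example `nginx-proxy-vm`), which this plan does not confirm; `vmid_for` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# vmid_for NAME   (hypothetical helper)
# Reads `qm list` output on stdin (header line, then VMID NAME STATUS ...)
# and prints the VMID of the VM whose NAME column matches exactly.
vmid_for() {
  awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Example, using the host from the steps above:
#   vmid=$(ssh root@192.168.11.10 "qm list" | vmid_for nginx-proxy-vm)
#   ssh root@192.168.11.10 "qm guest exec $vmid -- cat /etc/os-release"
```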
**Success Criteria**:
- ✅ Both VMs created and running
- ✅ Guest agent running
- ✅ VMs accessible via SSH
- ✅ Cloud-init completed
**Rollback**:
```bash
kubectl delete proxmoxvm nginx-proxy-vm -n default
kubectl delete proxmoxvm cloudflare-tunnel-vm -n default
```
---
### Phase 6: Application VMs (30-60 minutes)
**Objective**: Deploy all 16 SMOM-DBIS-138 VMs
**Steps**:
```bash
# 1. Deploy all VMs
kubectl apply -f examples/production/smom-dbis-138/
# 2. Monitor deployment (in separate terminal)
watch kubectl get proxmoxvm -A
# 3. Check controller logs (in separate terminal)
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50 -f
# 4. Wait for all VMs ready (this may take 10-30 minutes)
# Monitor progress and verify each VM reaches Ready state
# 5. Verify VM creation
kubectl get proxmoxvm -A -o wide
# 6. Check guest agent on all VMs
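# (a resource cannot be fetched by name together with -A, so iterate over namespace/name pairs)
kubectl get proxmoxvm -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns vm; do
  echo "Checking $ns/$vm..."
  kubectl get proxmoxvm "$vm" -n "$ns" -o jsonpath='{.status.conditions[*].status}'
  echo
done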
```
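Step 4 leaves the waiting to manual monitoring. One way to see which VMs are still blocking is to filter on the `Ready` condition, which the `kubectl wait` calls in Phase 5 suggest the ProxmoxVM resource exposes. This is a sketch under that assumption; `list_vm_readiness` and `not_ready` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# list_vm_readiness   (hypothetical helper)
# Prints "namespace name ready-status" for every ProxmoxVM.
list_vm_readiness() {
  kubectl get proxmoxvm -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# not_ready   (hypothetical helper)
# Filters list_vm_readiness output, keeping VMs whose Ready status is
# anything other than True (including a missing status).
not_ready() {
  awk 'NF >= 2 && $3 != "True" { print $1 "/" $2 }'
}

# Example: poll every 30 seconds until nothing is left.
#   while pending=$(list_vm_readiness | not_ready); [ -n "$pending" ]; do
#     echo "Still waiting for:"; echo "$pending"; sleep 30
#   done
```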
**VM Deployment Order** (if deploying sequentially):
1. validator-01, validator-02, validator-03, validator-04
2. sentry-01, sentry-02, sentry-03, sentry-04
3. rpc-node-01, rpc-node-02, rpc-node-03, rpc-node-04
4. services, blockscout, monitoring, management
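
If the sequential order above is used, each group can be applied and awaited before the next one starts. The sketch below is illustrative only and makes several assumptions this plan does not confirm: each VM has its own manifest named `<name>.yaml` in `examples/production/smom-dbis-138/`, the resource name equals the name in the list, and the resources live in the `default` namespace. `deploy_in_waves`, `WAVES`, and `VM_DIR` are hypothetical names.

```bash
#!/usr/bin/env bash
# Wave order copied from the list above.
WAVES=(
  "validator-01 validator-02 validator-03 validator-04"
  "sentry-01 sentry-02 sentry-03 sentry-04"
  "rpc-node-01 rpc-node-02 rpc-node-03 rpc-node-04"
  "services blockscout monitoring management"
)
VM_DIR="examples/production/smom-dbis-138"   # assumed one <name>.yaml per VM

# deploy_in_waves   (hypothetical helper)
# Applies one wave, waits for every VM in it, then moves to the next wave.
# Set KUBECTL=echo to preview the commands without running them.
deploy_in_waves() {
  local kubectl=${KUBECTL:-kubectl} wave vm
  for wave in "${WAVES[@]}"; do
    for vm in $wave; do
      "$kubectl" apply -f "$VM_DIR/$vm.yaml" || return 1
    done
    for vm in $wave; do
      "$kubectl" wait --for=condition=Ready proxmoxvm "$vm" \
        -n default --timeout=600s || return 1
    done
  done
}

# Preview: KUBECTL=echo deploy_in_waves
```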
**Success Criteria**:
- ✅ All 16 VMs created
- ✅ All VMs in Running state
- ✅ Guest agent running on all VMs
- ✅ Cloud-init completed successfully
**Rollback**:
```bash
# Delete all VMs
kubectl delete -f examples/production/smom-dbis-138/
```
---
### Phase 7: Monitoring Stack (20-30 minutes)
**Objective**: Deploy monitoring and observability stack
**Steps**:
```bash
# 1. Deploy Prometheus
kubectl apply -f gitops/apps/monitoring/prometheus/
kubectl wait --for=condition=Ready pod -l app=prometheus -n monitoring --timeout=300s
# 2. Deploy Grafana
kubectl apply -f gitops/apps/monitoring/grafana/
kubectl wait --for=condition=Ready pod -l app=grafana -n monitoring --timeout=300s
# 3. Deploy Loki
kubectl apply -f gitops/apps/monitoring/loki/
kubectl wait --for=condition=Ready pod -l app=loki -n monitoring --timeout=300s
# 4. Deploy Alertmanager
kubectl apply -f gitops/apps/monitoring/alertmanager/
# 5. Deploy backup CronJob
kubectl apply -f gitops/apps/monitoring/backup-cronjob.yaml
# 6. Verify
kubectl get pods -n monitoring
curl http://grafana.sankofa.nexus
```
**Success Criteria**:
- ✅ All monitoring pods running
- ✅ Prometheus scraping metrics
- ✅ Grafana accessible
- ✅ Loki ingesting logs
- ✅ Backup CronJob scheduled
**Rollback**:
```bash
kubectl delete -f gitops/apps/monitoring/
```
---
### Phase 8: Network Configuration (30-45 minutes)
**Objective**: Configure Cloudflare Tunnel, Nginx, and DNS
**Steps**:
```bash
# 1. Configure Cloudflare Tunnel
./scripts/configure-cloudflare-tunnel.sh
# Or manually:
# - Create tunnel in Cloudflare dashboard
# - Download credentials JSON
# - Upload to cloudflare-tunnel-vm: /etc/cloudflared/tunnel-credentials.json
# - Update /etc/cloudflared/config.yaml with ingress rules
# - Restart cloudflared service
# 2. Configure Nginx Proxy
./scripts/configure-nginx-proxy.sh
# Or manually:
# - SSH into nginx-proxy-vm
# - Update /etc/nginx/conf.d/*.conf
# - Run certbot for SSL certificates
# - Test: nginx -t
# - Reload: systemctl reload nginx
# 3. Configure DNS
./scripts/setup-dns-records.sh
# Or manually in Cloudflare:
# - Create A/CNAME records
# - Point to Cloudflare Tunnel
# - Enable proxy (orange cloud)
```
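For the manual Cloudflare Tunnel path, the ingress rules in `/etc/cloudflared/config.yaml` follow a fixed shape: a tunnel ID, a credentials file, a list of hostname-to-service rules, and a final catch-all rule. The sketch below generates such a file; `render_tunnel_config` is a hypothetical helper, and the tunnel ID and upstream address in the example are placeholders, since the real values are not given in this plan.

```bash
#!/usr/bin/env bash
# render_tunnel_config TUNNEL_ID UPSTREAM HOSTNAME...   (hypothetical helper)
# Prints a cloudflared config that routes every HOSTNAME to UPSTREAM.
render_tunnel_config() {
  local tunnel_id=$1 upstream=$2 host
  shift 2
  printf 'tunnel: %s\n' "$tunnel_id"
  printf 'credentials-file: /etc/cloudflared/tunnel-credentials.json\n'
  printf 'ingress:\n'
  for host in "$@"; do
    printf '  - hostname: %s\n    service: %s\n' "$host" "$upstream"
  done
  # cloudflared expects the last rule to be a catch-all with no hostname.
  printf '  - service: http_status:404\n'
}

# Example (placeholder tunnel ID and Nginx proxy address):
#   render_tunnel_config <tunnel-id> http://<nginx-proxy-ip>:80 \
#     api.sankofa.nexus portal.sankofa.nexus > config.yaml
```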
**Success Criteria**:
- ✅ Cloudflare Tunnel connected
- ✅ Nginx proxying correctly
- ✅ DNS records created
- ✅ SSL certificates issued
- ✅ Services accessible via public URLs
**Rollback**:
- Revert DNS changes in Cloudflare
- Restore previous Nginx configuration
- Disable Cloudflare Tunnel
---
### Phase 9: Multi-Tenancy Setup (15-20 minutes)
**Objective**: Create system tenant and configure multi-tenancy
**Steps**:
```bash
# 1. Get API endpoint and admin token
API_URL="http://api.sankofa.nexus/graphql"
ADMIN_TOKEN="<get-from-keycloak>"
# 2. Create system tenant
curl -X POST $API_URL \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{
"query": "mutation { createTenant(input: { name: \"system\", tier: SOVEREIGN }) { id name billingAccountId } }"
}'
# 3. Get system tenant ID from response
SYSTEM_TENANT_ID="<from-response>"
# 4. Add admin user to system tenant
curl -X POST $API_URL \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d "{
\"query\": \"mutation { addUserToTenant(tenantId: \\\"$SYSTEM_TENANT_ID\\\", userId: \\\"<admin-user-id>\\\", role: TENANT_OWNER) }\"
}"
# 5. Verify tenant
curl -X POST $API_URL \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-d '{
"query": "query { myTenant { id name status tier } }"
}'
```
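Step 3 copies the tenant ID out of the response by hand. It can also be extracted in the pipeline. The sketch below assumes the API returns the standard GraphQL envelope, so the ID sits at `data.createTenant.id`; `extract_tenant_id` is a hypothetical helper that uses `jq` when it is installed and falls back to a less robust `sed` pattern otherwise.

```bash
#!/usr/bin/env bash
# extract_tenant_id   (hypothetical helper)
# Reads a createTenant response on stdin and prints the tenant ID,
# or nothing if the response carries no ID (for example, an error).
extract_tenant_id() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.data.createTenant.id // empty'
  else
    # Fallback: only handles "id" appearing inside the createTenant object
    # on a single line.
    sed -n 's/.*"createTenant"[[:space:]]*:[[:space:]]*{[^}]*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
  fi
}

# Example, reusing API_URL and ADMIN_TOKEN from step 1:
#   SYSTEM_TENANT_ID=$(curl -sS -X POST "$API_URL" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $ADMIN_TOKEN" \
#     -d '{"query": "mutation { createTenant(input: { name: \"system\", tier: SOVEREIGN }) { id } }"}' \
#     | extract_tenant_id)
```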
**Success Criteria**:
- ✅ System tenant created
- ✅ Admin user assigned
- ✅ Tenant accessible via API
- ✅ RBAC working correctly
**Rollback**:
- Delete tenant via API (if supported)
- Or manually remove from database
---
### Phase 10: Verification and Testing (30-45 minutes)
**Objective**: Verify deployment and run tests
**Steps**:
```bash
# 1. Health checks
curl http://api.sankofa.nexus/health
curl http://frontend.sankofa.nexus
curl http://portal.sankofa.nexus
curl http://keycloak.sankofa.nexus/health
# 2. Check all VMs
kubectl get proxmoxvm -A
# 3. Check all pods
kubectl get pods -A
# 4. Run smoke tests
./scripts/smoke-tests.sh
# 5. Run performance tests (optional)
./scripts/performance-test.sh
# 6. Verify monitoring
curl http://grafana.sankofa.nexus
kubectl get pods -n monitoring
# 7. Check backups
./scripts/verify-backups.sh
```
**Success Criteria**:
- ✅ All health checks passing
- ✅ All VMs running
- ✅ All pods running
- ✅ Smoke tests passing
- ✅ Monitoring operational
- ✅ Backups configured
**Rollback**: N/A - verification only

---
## Execution Timeline
### Estimated Total Time: 4-6.5 hours (phases run sequentially)
| Phase | Duration | Dependencies |
|-------|----------|--------------|
| Phase 1: Resource Verification | 15 min | None |
| Phase 2: Kubernetes Control Plane | 30-60 min | Kubernetes cluster |
| Phase 3: Database and Identity | 30-45 min | Phase 2 |
| Phase 4: Application Deployment | 30-45 min | Phase 3 |
| Phase 5: Infrastructure VMs | 15-30 min | Phase 2, Phase 4 |
| Phase 6: Application VMs | 30-60 min | Phase 5 |
| Phase 7: Monitoring Stack | 20-30 min | Phase 2 |
| Phase 8: Network Configuration | 30-45 min | Phase 5 |
| Phase 9: Multi-Tenancy Setup | 15-20 min | Phase 3, Phase 4 |
| Phase 10: Verification and Testing | 30-45 min | All phases |
---
## Risk Mitigation
### High-Risk Areas
1. **VM Deployment**: May take longer than expected
   - **Mitigation**: Monitor closely, allow extra time
2. **Network Configuration**: DNS propagation delays
   - **Mitigation**: Test with IP addresses first, then DNS
3. **Database Migrations**: Potential data loss
   - **Mitigation**: Backup before migrations, test in staging first
### Rollback Procedures
- Each phase includes rollback steps
- Document any issues encountered
- Keep backups of all configurations
---
## Post-Deployment
### Immediate (First 24 hours)
- [ ] Monitor all services
- [ ] Review logs for errors
- [ ] Verify all VMs accessible
- [ ] Check monitoring dashboards
- [ ] Verify backups running
### Short-term (First week)
- [ ] Performance optimization
- [ ] Security hardening
- [ ] Documentation updates
- [ ] Team training
- [ ] Support procedures
---
## Success Criteria
### Technical
- ✅ All 18 VMs deployed and running
- ✅ All services healthy
- ✅ Guest agent on all VMs
- ✅ Monitoring operational
- ✅ Backups configured
### Functional
- ✅ Portal accessible
- ✅ API responding
- ✅ Multi-tenancy working
- ✅ Resource provisioning functional
---
**Last Updated**: 2025-01-XX

**Status**: Ready for Execution