
# VM Deployment Plan
**Date**: 2025-01-XX
**Status**: Ready for Deployment
**Version**: 2.0
---
## Executive Summary
This document provides a comprehensive deployment plan for all virtual machines in the Sankofa Phoenix infrastructure. The plan includes hardware capabilities, resource allocation, deployment priorities, and step-by-step deployment procedures.
### Key Constraints
- **ML110-01 (Site-1)**: 6 CPU cores, 256 GB RAM
- **R630-01 (Site-2)**: 52 CPU cores (2 CPUs × 26 cores), 768 GB RAM
- **Total VMs to Deploy**: 30 VMs
- **Deployment Method**: Crossplane Proxmox Provider via Kubernetes
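Each VM in this plan is declared as a Kubernetes resource and reconciled by the Crossplane Proxmox provider. As a rough illustration of the shape of such a manifest (the `apiVersion`, group, and spec field names below are assumptions, not the provider's confirmed schema; the files under `examples/production/` are authoritative):

```yaml
# Illustrative sketch only — field names are assumed, not the provider's
# confirmed schema. See examples/production/ for the actual manifests.
apiVersion: proxmox.crossplane.io/v1alpha1
kind: ProxmoxVM
metadata:
  name: nginx-proxy
spec:
  forProvider:
    node: ml110-01        # target Proxmox node
    cores: 2
    memoryMiB: 4096
    disk:
      sizeGiB: 20
      storage: local-lvm  # or ceph-fs for large disks
  providerConfigRef:
    name: default
```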
---
## Hardware Capabilities
### Site-1: ML110-01
**Location**: 192.168.11.10
**Hardware Specifications**:
- **CPU**: Intel Xeon E5-2603 v3 @ 1.60GHz
- **CPU Cores**: 6 cores (6 threads, no hyperthreading)
- **RAM**: 256 GB (251 GiB usable; ~248 GB available for VMs after the host reserve)
- **Storage**:
- local-lvm: 794.3 GB available
- ceph-fs: 384 GB available
- **Network**: vmbr0 (1GbE)
**Resource Allocation Strategy**:
- Reserve 1 core for Proxmox host (5 cores available for VMs)
- Reserve 8 GB RAM for Proxmox host (~248 GB available for VMs)
- Suitable for: Light-to-medium workloads, infrastructure services
### Site-2: R630-01
**Location**: 192.168.11.11
**Hardware Specifications**:
- **CPU**: Intel Xeon E5-2660 v4 @ 2.00GHz (dual socket)
- **CPU Cores**: 52 cores total (2 CPUs × 26 cores each)
- **CPU Threads**: 104 threads (52 cores × 2 with hyperthreading)
- **RAM**: 768 GB (755 GiB usable; ~752 GB available for VMs after the host reserve)
- **Storage**:
- local-lvm: 171.3 GB available
- Ceph OSD: Configured
- **Network**: vmbr0 (10GbE capable)
**Resource Allocation Strategy**:
- Reserve 2 cores for Proxmox host (50 cores available for VMs)
- Reserve 16 GB RAM for Proxmox host (~752 GB available for VMs)
- Suitable for: High-resource workloads, compute-intensive applications, blockchain nodes
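The two reservation strategies above reduce to the same arithmetic. A throwaway sketch (plain bash, figures taken from this section) that prints the per-node headroom used throughout this plan:

```shell
#!/usr/bin/env bash
# Print what remains for VMs after the Proxmox host reserve.
# Args: name, total cores, reserved cores, total RAM (GB), reserved RAM (GB).
headroom() {
  local name=$1 cores=$(( $2 - $3 )) ram=$(( $4 - $5 ))
  echo "$name: $cores cores, $ram GB RAM available for VMs"
}

headroom ML110-01  6 1 256  8   # 5 cores, 248 GB
headroom R630-01  52 2 768 16   # 50 cores, 752 GB
```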
---
## VM Inventory and Resource Requirements
### Summary Statistics
| Category | Count | Total CPU | Total RAM | Total Disk |
|----------|-------|-----------|-----------|------------|
| **Phoenix Infrastructure** | 8 | 30 cores | 132 GiB | 2,350 GiB |
| **Core Infrastructure** | 2 | 4 cores | 8 GiB | 30 GiB |
| **SMOM-DBIS-138 Blockchain** | 16 | 36 cores | 96 GiB | 320 GiB |
| **Test/Example VMs** | 4 | 16 cores | 32 GiB | 200 GiB |
| **TOTAL** | **30** | **86 cores** | **268 GiB** | **2,900 GiB** |
**Note**: These totals exceed available resources on a single node. VMs are distributed across both nodes.
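Summary totals like these drift as individual VM specs change. A small bash sketch that recomputes them from the per-VM figures listed in the phases below, so the table can be re-checked after any edit:

```shell
#!/usr/bin/env bash
# Recompute inventory totals from per-VM specs: add <cpu> <ram_gib> <disk_gib>.
cpu=0 ram=0 disk=0 count=0
add() { cpu=$((cpu + $1)); ram=$((ram + $2)); disk=$((disk + $3)); count=$((count + 1)); }

# Phoenix infrastructure (8 VMs)
add 2 4 50;   add 4 16 500; add 4 16 200; add 4 16 200
add 4 32 200; add 4 16 500; add 4 16 200; add 4 16 500
# Core infrastructure (2 VMs)
add 2 4 20; add 2 4 10
# SMOM-DBIS-138: 4 validators + 12 identical 2-CPU nodes (sentries/RPC/services)
for _ in 1 2 3 4; do add 3 12 20; done
for _ in $(seq 12); do add 2 4 20; done
# Test/example VMs (4 VMs)
add 2 4 50; add 2 4 50; add 4 8 50; add 8 16 50

echo "$count VMs: $cpu cores, $ram GiB RAM, $disk GiB disk"
```

Re-run this against the YAML specs whenever a VM's allocation changes.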
---
## VM Deployment Schedule
### Phase 1: Core Infrastructure (Priority: CRITICAL)
**Deployment Order**: Deploy these first as they support other services.
#### 1.1 Nginx Proxy VM
- **Node**: ml110-01
- **Site**: site-1
- **Resources**: 2 CPU, 4 GiB RAM, 20 GiB disk
- **Purpose**: Reverse proxy and SSL termination
- **Dependencies**: None
- **Deployment File**: `examples/production/nginx-proxy-vm.yaml`
#### 1.2 Cloudflare Tunnel VM
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 2 CPU, 4 GiB RAM, 10 GiB disk
- **Purpose**: Cloudflare Tunnel for secure outbound connectivity
- **Dependencies**: None
- **Deployment File**: `examples/production/cloudflare-tunnel-vm.yaml`
**Phase 1 Resource Usage**:
- **ML110-01**: 2 CPU, 4 GiB RAM, 20 GiB disk
- **R630-01**: 2 CPU, 4 GiB RAM, 10 GiB disk
---
### Phase 2: Phoenix Infrastructure Services (Priority: HIGH)
**Deployment Order**: Deploy in dependency order.
#### 2.1 DNS Primary Server
- **Node**: ml110-01
- **Site**: site-1
- **Resources**: 2 CPU, 4 GiB RAM, 50 GiB disk
- **Purpose**: Primary DNS server (BIND9)
- **Dependencies**: None
- **Deployment File**: `examples/production/phoenix/dns-primary.yaml`
#### 2.2 Git Server
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
- **Purpose**: Git repository hosting (Gitea/GitLab)
- **Dependencies**: DNS (optional)
- **Deployment File**: `examples/production/phoenix/git-server.yaml`
#### 2.3 Email Server
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
- **Purpose**: Email services (Postfix/Dovecot)
- **Dependencies**: DNS (optional)
- **Deployment File**: `examples/production/phoenix/email-server.yaml`
#### 2.4 DevOps Runner
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
- **Purpose**: CI/CD runner (Jenkins/GitLab Runner)
- **Dependencies**: Git Server (optional)
- **Deployment File**: `examples/production/phoenix/devops-runner.yaml`
#### 2.5 Codespaces IDE
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 32 GiB RAM, 200 GiB disk (ceph-fs)
- **Purpose**: Cloud IDE (code-server)
- **Dependencies**: None
- **Deployment File**: `examples/production/phoenix/codespaces-ide.yaml`
#### 2.6 AS4 Gateway
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
- **Purpose**: AS4 messaging gateway
- **Dependencies**: DNS, Email
- **Deployment File**: `examples/production/phoenix/as4-gateway.yaml`
#### 2.7 Business Integration Gateway
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
- **Purpose**: Business integration services
- **Dependencies**: DNS
- **Deployment File**: `examples/production/phoenix/business-integration-gateway.yaml`
#### 2.8 Financial Messaging Gateway
- **Node**: r630-01
- **Site**: site-2
- **Resources**: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
- **Purpose**: Financial messaging services
- **Dependencies**: DNS
- **Deployment File**: `examples/production/phoenix/financial-messaging-gateway.yaml`
**Phase 2 Resource Usage**:
- **ML110-01**: 2 CPU, 4 GiB RAM, 50 GiB disk
- **R630-01**: 28 CPU, 128 GiB RAM, 2,300 GiB disk (using ceph-fs)
---
### Phase 3: SMOM-DBIS-138 Blockchain Infrastructure (Priority: HIGH)
**Deployment Order**: Deploy validators first, then sentries, then RPC nodes, then services.
#### 3.1 Validators (Site-2: r630-01)
- **smom-validator-01**: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-validator-02**: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-validator-03**: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-validator-04**: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
- **Total**: 12 CPU, 48 GiB RAM, 80 GiB disk (using ceph-fs)
- **Deployment Files**: `examples/production/smom-dbis-138/validator-*.yaml`
#### 3.2 Sentries (Distributed)
- **Site-1 (ml110-01)**:
- **smom-sentry-01**: 2 CPU, 4 GiB RAM, 20 GiB disk
- **smom-sentry-02**: 2 CPU, 4 GiB RAM, 20 GiB disk
- **Site-2 (r630-01)**:
- **smom-sentry-03**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-sentry-04**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **Total**: 8 CPU, 16 GiB RAM, 80 GiB disk
- **Deployment Files**: `examples/production/smom-dbis-138/sentry-*.yaml`
#### 3.3 RPC Nodes (Site-2: r630-01)
- **smom-rpc-node-01**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-rpc-node-02**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-rpc-node-03**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-rpc-node-04**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **Total**: 8 CPU, 16 GiB RAM, 80 GiB disk (using ceph-fs)
- **Deployment Files**: `examples/production/smom-dbis-138/rpc-node-*.yaml`
#### 3.4 Services (Site-2: r630-01)
- **smom-management**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-monitoring**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-services**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **smom-blockscout**: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
- **Total**: 8 CPU, 16 GiB RAM, 80 GiB disk (using ceph-fs)
- **Deployment Files**: `examples/production/smom-dbis-138/{management,monitoring,services,blockscout}.yaml`
**Phase 3 Resource Usage**:
- **ML110-01**: 4 CPU (sentries only), 8 GiB RAM, 40 GiB disk
- **R630-01**: 32 CPU, 88 GiB RAM, 280 GiB disk (using ceph-fs)
---
### Phase 4: Test/Example VMs (Priority: LOW)
**Deployment Order**: Deploy after production VMs are stable.
- **vm-100**: ml110-01, 2 CPU, 4 GiB RAM, 50 GiB disk
- **basic-vm**: ml110-01, 2 CPU, 4 GiB RAM, 50 GiB disk
- **medium-vm**: ml110-01, 4 CPU, 8 GiB RAM, 50 GiB disk
- **large-vm**: ml110-01, 8 CPU, 16 GiB RAM, 50 GiB disk
**Phase 4 Resource Usage**:
- **ML110-01**: 16 CPU, 32 GiB RAM, 200 GiB disk
---
## Resource Allocation Analysis
### ML110-01 (Site-1) - Resource Constraints
**Available Resources**:
- CPU: 5 cores (6 - 1 reserved)
- RAM: ~248 GB (256 - 8 reserved)
- Disk: 794.3 GB (local-lvm) + 384 GB (ceph-fs)
**Requested Resources** (Phases 1-2):
- CPU: 4 cores ✅ **Within capacity**
- RAM: 8 GiB ✅ Within capacity
- Disk: 70 GiB ✅ Within capacity
**Requested Resources** (Phases 1-3):
- CPU: 8 cores ⚠️ **Exceeds capacity (5 available); requires modest vCPU overcommit**
- RAM: 16 GiB ✅ Within capacity
- Disk: 110 GiB ✅ Within capacity
**✅ OPTIMIZED**: All recommendations have been implemented:
1. **Moved high-CPU VMs to R630-01**: Git Server, Email Server, DevOps Runner, Codespaces IDE, AS4 Gateway, Business Integration Gateway, Financial Messaging Gateway
2. **Reduced CPU allocations**: DNS Primary reduced to 2 CPU, Sentries reduced to 2 CPU each
3. **Using Ceph storage**: Large-disk VMs now use ceph-fs storage
4. **Prioritized critical services**: Only essential services (Nginx, DNS, Sentries) remain on ML110-01
### R630-01 (Site-2) - Resource Capacity
**Available Resources**:
- CPU: 50 cores (52 - 2 reserved)
- RAM: ~752 GB (768 - 16 reserved)
- Disk: 171.3 GB (local-lvm) + Ceph OSD
**Requested Resources** (All Phases):
- CPU: 62 cores ⚠️ **Exceeds physical capacity (50 available); a 1.24:1 vCPU overcommit**
- RAM: 220 GiB ✅ Within capacity
- Disk: 2,590 GiB ✅ **Using Ceph storage** (no local-lvm constraint)
**✅ OPTIMIZED**: All recommendations have been implemented:
1. **Using Ceph storage**: All large-disk VMs now use ceph-fs storage
2. **Optimized resource allocation**: CPU allocations reduced (validators: 3 cores, others: 2-4 cores)
3. **Moved VMs from ML110-01**: All high-resource VMs moved to R630-01
---
## Revised Deployment Plan
### Optimized Resource Allocation
#### ML110-01 (Site-1) - Light Workloads Only ✅ OPTIMIZED
**Phase 1: Core Infrastructure**
- Nginx Proxy VM: 2 CPU, 4 GiB RAM, 20 GiB disk ✅
**Phase 2: Phoenix Infrastructure (Reduced)**
- DNS Primary: 2 CPU, 4 GiB RAM, 50 GiB disk ✅
**Phase 3: Blockchain (Sentries Only)**
- smom-sentry-01: 2 CPU, 4 GiB RAM, 20 GiB disk ✅
- smom-sentry-02: 2 CPU, 4 GiB RAM, 20 GiB disk ✅
**ML110-01 Total**: 8 CPU cores requested, 5 available ⚠️ **Exceeds capacity, but acceptable with vCPU overcommit for these light, critical services**
**✅ OPTIMIZED**: Only essential services remain on ML110-01.
#### R630-01 (Site-2) - Primary Compute Node ✅ OPTIMIZED
**Phase 1: Core Infrastructure**
- Cloudflare Tunnel VM: 2 CPU, 4 GiB RAM, 10 GiB disk ✅
**Phase 2: Phoenix Infrastructure (Moved)**
- Git Server: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs) ✅
- Email Server: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs) ✅
- DevOps Runner: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs) ✅
- Codespaces IDE: 4 CPU, 32 GiB RAM, 200 GiB disk (ceph-fs) ✅
- AS4 Gateway: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs) ✅
- Business Integration Gateway: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs) ✅
- Financial Messaging Gateway: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs) ✅
**Phase 3: Blockchain Infrastructure**
- Validators (4x): 3 CPU each = 12 CPU, 12 GiB RAM each = 48 GiB RAM, 80 GiB disk (ceph-fs) ✅
- Sentries (2x): 2 CPU each = 4 CPU, 4 GiB RAM each = 8 GiB RAM, 40 GiB disk (ceph-fs) ✅
- RPC Nodes (4x): 2 CPU each = 8 CPU, 4 GiB RAM each = 16 GiB RAM, 80 GiB disk (ceph-fs) ✅
- Services (4x): 2 CPU each = 8 CPU, 4 GiB RAM each = 16 GiB RAM, 80 GiB disk (ceph-fs) ✅
**R630-01 Total**: 62 CPU cores requested, 50 available ⚠️ **Exceeds physical cores (1.24:1 overcommit), acceptable at this utilization**
**✅ OPTIMIZED**: All high-resource VMs moved to R630-01 with optimized CPU allocations and Ceph storage.
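The two per-node totals above can be recomputed from the line items, together with the resulting vCPU:pCPU overcommit ratio (bash sketch; KVM-based hosts such as Proxmox tolerate moderate vCPU overcommit for guests that are not CPU-bound):

```shell
#!/usr/bin/env bash
# Per-node vCPU totals from the revised plan's line items.
ml110_vcpu=$(( 2 + 2 + 2 + 2 ))                  # Nginx, DNS, sentry-01, sentry-02
r630_vcpu=$(( 2 + 7*4 + 4*3 + 2*2 + 4*2 + 4*2 )) # tunnel, 7 Phoenix VMs, validators, sentries, RPC, services

echo "ML110-01: $ml110_vcpu vCPU on 5 host cores"
echo "R630-01:  $r630_vcpu vCPU on 50 host cores"
awk -v a="$ml110_vcpu" -v b="$r630_vcpu" \
  'BEGIN { printf "overcommit ratios: %.2f:1 and %.2f:1\n", a/5, b/50 }'
```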
---
## Deployment Execution Plan
### Step 1: Pre-Deployment Verification
```bash
# 1. Verify Proxmox nodes are accessible
./scripts/check-proxmox-quota-ssh.sh
# 2. Verify images are available
./scripts/verify-image-availability.sh
# 3. Check Crossplane provider is ready
kubectl get providerconfig -n crossplane-system
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```
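The three checks above can be wrapped in a small gate so a deployment never starts against an unverified environment (sketch; the script paths are the ones listed above):

```shell
#!/usr/bin/env bash
# Run each pre-flight check in order; stop at the first failure.
run_checks() {
  local check
  for check in "$@"; do
    echo "running: $check"
    if ! $check; then
      echo "pre-flight check FAILED: $check" >&2
      return 1
    fi
  done
  echo "all pre-flight checks passed"
}

# Example:
# run_checks ./scripts/check-proxmox-quota-ssh.sh ./scripts/verify-image-availability.sh
```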
### Step 2: Deploy Phase 1 - Core Infrastructure
```bash
# Deploy Nginx Proxy (ML110-01)
kubectl apply -f examples/production/nginx-proxy-vm.yaml
# Deploy Cloudflare Tunnel (R630-01)
kubectl apply -f examples/production/cloudflare-tunnel-vm.yaml
# Monitor deployment
kubectl get proxmoxvm -w
```
**Wait for**: Both VMs to be in "Running" state before proceeding.
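A generic poll helper for these "wait for Running" gates (sketch — the jsonpath `{.status.state}` in the example is an assumption about the provider's status field; adjust it to whatever `kubectl describe proxmoxvm` actually reports):

```shell
#!/usr/bin/env bash
# Retry a status command until it prints the desired value or attempts run out.
wait_for_state() {
  local desired=$1 attempts=$2; shift 2
  local i state
  for (( i = 1; i <= attempts; i++ )); do
    state=$("$@" 2>/dev/null) || state=""
    [ "$state" = "$desired" ] && return 0
    sleep "${POLL_INTERVAL:-5}"
  done
  return 1
}

# Example (status field is assumed):
# wait_for_state Running 60 kubectl get proxmoxvm nginx-proxy -o jsonpath='{.status.state}'
```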
### Step 3: Deploy Phase 2 - Phoenix Infrastructure
```bash
# Deploy DNS Primary (ML110-01)
kubectl apply -f examples/production/phoenix/dns-primary.yaml
# Wait for DNS to be ready, then deploy other services
kubectl apply -f examples/production/phoenix/git-server.yaml
kubectl apply -f examples/production/phoenix/email-server.yaml
kubectl apply -f examples/production/phoenix/devops-runner.yaml
kubectl apply -f examples/production/phoenix/codespaces-ide.yaml
kubectl apply -f examples/production/phoenix/as4-gateway.yaml
kubectl apply -f examples/production/phoenix/business-integration-gateway.yaml
kubectl apply -f examples/production/phoenix/financial-messaging-gateway.yaml
```
**Note**: Adjust node assignments and CPU allocations based on resource constraints.
### Step 4: Deploy Phase 3 - Blockchain Infrastructure
```bash
# Deploy validators first
kubectl apply -f examples/production/smom-dbis-138/validator-01.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-02.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-03.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-04.yaml
# Deploy sentries
kubectl apply -f examples/production/smom-dbis-138/sentry-01.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-02.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-03.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-04.yaml
# Deploy RPC nodes
kubectl apply -f examples/production/smom-dbis-138/rpc-node-01.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-02.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-03.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-04.yaml
# Deploy services
kubectl apply -f examples/production/smom-dbis-138/management.yaml
kubectl apply -f examples/production/smom-dbis-138/monitoring.yaml
kubectl apply -f examples/production/smom-dbis-138/services.yaml
kubectl apply -f examples/production/smom-dbis-138/blockscout.yaml
```
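The sixteen applies above encode an ordering rule (validators → sentries → RPC nodes → services). A sketch that derives the ordered file list from that rule, so it can be reviewed first or piped into `kubectl apply`:

```shell
#!/usr/bin/env bash
# Emit the blockchain manifests in the required apply order.
ordered_files() {
  local dir=$1 group svc
  for group in validator sentry rpc-node; do
    printf '%s\n' "$dir/$group"-0{1..4}.yaml
  done
  for svc in management monitoring services blockscout; do
    printf '%s\n' "$dir/$svc.yaml"
  done
}

# Review, then apply in order:
# ordered_files examples/production/smom-dbis-138 | xargs -n1 kubectl apply -f
```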
### Step 5: Deploy Phase 4 - Test VMs (Optional)
```bash
# Deploy test VMs only if resources allow
kubectl apply -f examples/production/vm-100.yaml
kubectl apply -f examples/production/basic-vm.yaml
kubectl apply -f examples/production/medium-vm.yaml
kubectl apply -f examples/production/large-vm.yaml
```
---
## Monitoring and Verification
### Real-Time Monitoring
```bash
# Watch all VM deployments
kubectl get proxmoxvm -A -w
# Check specific VM status
kubectl describe proxmoxvm <vm-name>
# Check controller logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100 -f
```
### Resource Monitoring
```bash
# Check Proxmox node resources
./scripts/check-proxmox-quota-ssh.sh
# Check VM resource usage
kubectl get proxmoxvm -A -o wide
```
### Post-Deployment Verification
```bash
# Verify all VMs are running (only the header line should print)
kubectl get proxmoxvm -A | grep -v Running
# Check VM IP addresses
kubectl get proxmoxvm -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.network.ipAddress}{"\n"}{end}'
# Verify guest agents
./scripts/verify-guest-agent.sh
```
---
## Risk Mitigation
### Resource Overcommitment
**Risk**: Requested resources exceed available capacity.
**Mitigation**:
1. Deploy VMs in batches, monitoring resource usage
2. Reduce CPU allocations where possible
3. Use Ceph storage for large disk requirements
4. Move high-resource VMs to R630-01
5. Consider adding additional Proxmox nodes
### Deployment Failures
**Risk**: VM creation may fail due to resource constraints or configuration errors.
**Mitigation**:
1. Validate all VM configurations before deployment
2. Check Proxmox quotas before each deployment
3. Monitor controller logs for errors
4. Have rollback procedures ready
5. Test deployments on non-critical VMs first
### Network Issues
**Risk**: Network connectivity problems may prevent VM deployment or operation.
**Mitigation**:
1. Verify network bridges exist on all nodes
2. Test network connectivity before deployment
3. Configure proper DNS resolution
4. Verify firewall rules allow required traffic
---
## Deployment Timeline
### Estimated Timeline
- **Phase 1 (Core Infrastructure)**: 30 minutes
- **Phase 2 (Phoenix Infrastructure)**: 2-4 hours
- **Phase 3 (Blockchain Infrastructure)**: 3-6 hours
- **Phase 4 (Test VMs)**: 1 hour (optional)
**Total Estimated Time**: 6-11 hours (excluding verification and troubleshooting)
### Critical Path
1. Core Infrastructure (Nginx, Cloudflare Tunnel) → 30 min
2. DNS Primary → 15 min
3. Git Server, Email Server → 1 hour
4. DevOps Runner, Codespaces IDE → 1 hour
5. Blockchain Validators → 2 hours
6. Blockchain Sentries → 1 hour
7. Blockchain RPC Nodes → 1 hour
8. Blockchain Services → 1 hour
---
## Next Steps
1. **Review and Approve**: Review this plan and approve resource allocations
2. **Update VM Configurations**: Update VM YAML files with optimized resource allocations
3. **Pre-Deployment Checks**: Run all pre-deployment verification scripts
4. **Execute Deployment**: Follow deployment steps in order
5. **Monitor and Verify**: Continuously monitor deployment progress
6. **Post-Deployment**: Verify all services are operational
---
## Related Documentation
- [VM Deployment Checklist](./VM_DEPLOYMENT_CHECKLIST.md) - Step-by-step checklist
- [VM Creation Procedure](./VM_CREATION_PROCEDURE.md) - Detailed creation procedures
- [VM Specifications](./VM_SPECIFICATIONS.md) - Complete VM specifications
- [Deployment Requirements](../deployment/DEPLOYMENT_REQUIREMENTS.md) - Overall deployment requirements
---
**Last Updated**: 2025-01-XX
**Status**: Ready for Review
**Maintainer**: Infrastructure Team
**Version**: 2.0