
VM Deployment Plan

Date: 2025-01-XX
Status: Ready for Deployment
Version: 2.0


Executive Summary

This document provides a comprehensive deployment plan for all virtual machines in the Sankofa Phoenix infrastructure. The plan includes hardware capabilities, resource allocation, deployment priorities, and step-by-step deployment procedures.

Key Constraints

  • ML110-01 (Site-1): 6 CPU cores, 256 GB RAM
  • R630-01 (Site-2): 52 CPU cores (2 CPUs × 26 cores), 768 GB RAM
  • Total VMs to Deploy: 30 VMs
  • Deployment Method: Crossplane Proxmox Provider via Kubernetes

Hardware Capabilities

Site-1: ML110-01

Location: 192.168.11.10
Hardware Specifications:

  • CPU: Intel Xeon E5-2603 v3 @ 1.60GHz
  • CPU Cores: 6 cores (6 threads, no hyperthreading)
  • RAM: 256 GB (251 GiB usable; ~248 GB available for VMs after the 8 GB host reservation)
  • Storage:
    • local-lvm: 794.3 GB available
    • ceph-fs: 384 GB available
  • Network: vmbr0 (1GbE)

Resource Allocation Strategy:

  • Reserve 1 core for Proxmox host (5 cores available for VMs)
  • Reserve 8 GB RAM for Proxmox host (~248 GB available for VMs)
  • Suitable for: Light-to-medium workloads, infrastructure services

Site-2: R630-01

Location: 192.168.11.11
Hardware Specifications:

  • CPU: Intel Xeon E5-2660 v4 @ 2.00GHz (dual socket)
  • CPU Cores: 52 cores total (2 CPUs × 26 cores each)
  • CPU Threads: 104 threads (52 cores × 2 with hyperthreading)
  • RAM: 768 GB (755 GiB usable; ~752 GB available for VMs after the 16 GB host reservation)
  • Storage:
    • local-lvm: 171.3 GB available
    • Ceph OSD: Configured
  • Network: vmbr0 (10GbE capable)

Resource Allocation Strategy:

  • Reserve 2 cores for Proxmox host (50 cores available for VMs)
  • Reserve 16 GB RAM for Proxmox host (~752 GB available for VMs)
  • Suitable for: High-resource workloads, compute-intensive applications, blockchain nodes
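The headroom figures for both nodes follow from subtracting the host reservation from total capacity. A minimal bash sketch, using the numbers from the hardware sections above:

```bash
# Per-node VM headroom = total capacity minus Proxmox host reservation.
ml110_cores=6;    ml110_reserved_cores=1
ml110_ram_gb=256; ml110_reserved_ram_gb=8
r630_cores=52;    r630_reserved_cores=2
r630_ram_gb=768;  r630_reserved_ram_gb=16

ml110_avail_cores=$(( ml110_cores - ml110_reserved_cores ))   # 5
ml110_avail_ram=$(( ml110_ram_gb - ml110_reserved_ram_gb ))   # 248
r630_avail_cores=$(( r630_cores - r630_reserved_cores ))      # 50
r630_avail_ram=$(( r630_ram_gb - r630_reserved_ram_gb ))      # 752

echo "ML110-01: ${ml110_avail_cores} cores, ${ml110_avail_ram} GB for VMs"
echo "R630-01:  ${r630_avail_cores} cores, ${r630_avail_ram} GB for VMs"
```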

VM Inventory and Resource Requirements

Summary Statistics

| Category | Count | Total CPU | Total RAM | Total Disk |
| --- | --- | --- | --- | --- |
| Phoenix Infrastructure | 8 | 30 cores | 132 GiB | 2,350 GiB |
| Core Infrastructure | 2 | 4 cores | 8 GiB | 30 GiB |
| SMOM-DBIS-138 Blockchain | 16 | 36 cores | 96 GiB | 320 GiB |
| Test/Example VMs | 4 | 16 cores | 32 GiB | 200 GiB |
| TOTAL | 30 | 86 cores | 268 GiB | 2,900 GiB |

Note: These totals exceed available resources on a single node. VMs are distributed across both nodes.


VM Deployment Schedule

Phase 1: Core Infrastructure (Priority: CRITICAL)

Deployment Order: Deploy these first as they support other services.

1.1 Nginx Proxy VM

  • Node: ml110-01
  • Site: site-1
  • Resources: 2 CPU, 4 GiB RAM, 20 GiB disk
  • Purpose: Reverse proxy and SSL termination
  • Dependencies: None
  • Deployment File: examples/production/nginx-proxy-vm.yaml
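The referenced deployment files are Crossplane managed resources. A minimal sketch of what such a manifest might contain is below; the apiVersion, kind, and field names are illustrative assumptions, not the provider's verified schema — consult the installed CRDs (e.g. `kubectl explain proxmoxvm`) for the real spec.

```bash
# Write a hypothetical ProxmoxVM manifest for review before applying.
# All field names below are assumptions, not the provider's verified schema.
cat > /tmp/nginx-proxy-vm.yaml <<'EOF'
apiVersion: proxmox.crossplane.io/v1alpha1   # hypothetical group/version
kind: ProxmoxVM
metadata:
  name: nginx-proxy
spec:
  forProvider:
    node: ml110-01        # target Proxmox node
    cores: 2
    memoryMiB: 4096       # 4 GiB
    diskGiB: 20
  providerConfigRef:
    name: default
EOF
echo "wrote $(wc -l < /tmp/nginx-proxy-vm.yaml) lines"
```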

1.2 Cloudflare Tunnel VM

  • Node: r630-01
  • Site: site-2
  • Resources: 2 CPU, 4 GiB RAM, 10 GiB disk
  • Purpose: Cloudflare Tunnel for secure outbound connectivity
  • Dependencies: None
  • Deployment File: examples/production/cloudflare-tunnel-vm.yaml

Phase 1 Resource Usage:

  • ML110-01: 2 CPU, 4 GiB RAM, 20 GiB disk
  • R630-01: 2 CPU, 4 GiB RAM, 10 GiB disk

Phase 2: Phoenix Infrastructure Services (Priority: HIGH)

Deployment Order: Deploy in dependency order.

2.1 DNS Primary Server

  • Node: ml110-01
  • Site: site-1
  • Resources: 2 CPU, 4 GiB RAM, 50 GiB disk
  • Purpose: Primary DNS server (BIND9)
  • Dependencies: None
  • Deployment File: examples/production/phoenix/dns-primary.yaml

2.2 Git Server

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
  • Purpose: Git repository hosting (Gitea/GitLab)
  • Dependencies: DNS (optional)
  • Deployment File: examples/production/phoenix/git-server.yaml

2.3 Email Server

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • Purpose: Email services (Postfix/Dovecot)
  • Dependencies: DNS (optional)
  • Deployment File: examples/production/phoenix/email-server.yaml

2.4 DevOps Runner

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • Purpose: CI/CD runner (Jenkins/GitLab Runner)
  • Dependencies: Git Server (optional)
  • Deployment File: examples/production/phoenix/devops-runner.yaml

2.5 Codespaces IDE

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 32 GiB RAM, 200 GiB disk (ceph-fs)
  • Purpose: Cloud IDE (code-server)
  • Dependencies: None
  • Deployment File: examples/production/phoenix/codespaces-ide.yaml

2.6 AS4 Gateway

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
  • Purpose: AS4 messaging gateway
  • Dependencies: DNS, Email
  • Deployment File: examples/production/phoenix/as4-gateway.yaml

2.7 Business Integration Gateway

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • Purpose: Business integration services
  • Dependencies: DNS
  • Deployment File: examples/production/phoenix/business-integration-gateway.yaml

2.8 Financial Messaging Gateway

  • Node: r630-01
  • Site: site-2
  • Resources: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
  • Purpose: Financial messaging services
  • Dependencies: DNS
  • Deployment File: examples/production/phoenix/financial-messaging-gateway.yaml

Phase 2 Resource Usage:

  • ML110-01: 2 CPU, 4 GiB RAM, 50 GiB disk
  • R630-01: 28 CPU, 128 GiB RAM, 2,300 GiB disk (using ceph-fs)

Phase 3: SMOM-DBIS-138 Blockchain Infrastructure (Priority: HIGH)

Deployment Order: Deploy validators first, then sentries, then RPC nodes, then services.

3.1 Validators (Site-2: r630-01)

  • smom-validator-01: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-validator-02: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-validator-03: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-validator-04: 3 CPU, 12 GiB RAM, 20 GiB disk (ceph-fs)
  • Total: 12 CPU, 48 GiB RAM, 80 GiB disk (using ceph-fs)
  • Deployment Files: examples/production/smom-dbis-138/validator-*.yaml

3.2 Sentries (Distributed)

  • Site-1 (ml110-01):
    • smom-sentry-01: 2 CPU, 4 GiB RAM, 20 GiB disk
    • smom-sentry-02: 2 CPU, 4 GiB RAM, 20 GiB disk
  • Site-2 (r630-01):
    • smom-sentry-03: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
    • smom-sentry-04: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • Total: 8 CPU, 16 GiB RAM, 80 GiB disk
  • Deployment Files: examples/production/smom-dbis-138/sentry-*.yaml

3.3 RPC Nodes (Site-2: r630-01)

  • smom-rpc-node-01: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-rpc-node-02: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-rpc-node-03: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-rpc-node-04: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • Total: 8 CPU, 16 GiB RAM, 80 GiB disk (using ceph-fs)
  • Deployment Files: examples/production/smom-dbis-138/rpc-node-*.yaml

3.4 Services (Site-2: r630-01)

  • smom-management: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-monitoring: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-services: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • smom-blockscout: 2 CPU, 4 GiB RAM, 20 GiB disk (ceph-fs)
  • Total: 8 CPU, 16 GiB RAM, 80 GiB disk (using ceph-fs)
  • Deployment Files: examples/production/smom-dbis-138/{management,monitoring,services,blockscout}.yaml

Phase 3 Resource Usage:

  • ML110-01: 4 CPU (sentries only), 8 GiB RAM, 40 GiB disk
  • R630-01: 32 CPU, 88 GiB RAM, 280 GiB disk (using ceph-fs)

Phase 4: Test/Example VMs (Priority: LOW)

Deployment Order: Deploy after production VMs are stable.

  • vm-100: ml110-01, 2 CPU, 4 GiB RAM, 50 GiB disk
  • basic-vm: ml110-01, 2 CPU, 4 GiB RAM, 50 GiB disk
  • medium-vm: ml110-01, 4 CPU, 8 GiB RAM, 50 GiB disk
  • large-vm: ml110-01, 8 CPU, 16 GiB RAM, 50 GiB disk

Phase 4 Resource Usage:

  • ML110-01: 16 CPU, 32 GiB RAM, 200 GiB disk

Resource Allocation Analysis

ML110-01 (Site-1) - Resource Constraints

Available Resources:

  • CPU: 5 cores (6 - 1 reserved)
  • RAM: ~248 GB (256 - 8 reserved)
  • Disk: 794.3 GB (local-lvm) + 384 GB (ceph-fs)

Requested Resources (Phases 1-2):

  • CPU: 4 cores ✅ Within capacity (5 available)
  • RAM: 8 GiB ✅ Within capacity
  • Disk: 70 GiB ✅ Within capacity

Requested Resources (Phases 1-3):

  • CPU: 8 cores ⚠️ Exceeds physical capacity (5 available); acceptable only with CPU oversubscription
  • RAM: 16 GiB ✅ Within capacity
  • Disk: 110 GiB ✅ Within capacity

OPTIMIZED: All recommendations have been implemented:

  1. Moved high-CPU VMs to R630-01: Git Server, Email Server, DevOps Runner, Codespaces IDE, AS4 Gateway, Business Integration Gateway, Financial Messaging Gateway
  2. Reduced CPU allocations: DNS Primary reduced to 2 CPU, Sentries reduced to 2 CPU each
  3. Using Ceph storage: Large disk VMs now use ceph-fs storage
  4. Prioritized critical services: Only essential services (Nginx, DNS, Sentries) remain on ML110-01

R630-01 (Site-2) - Resource Capacity

Available Resources:

  • CPU: 50 cores (52 - 2 reserved)
  • RAM: ~752 GB (768 - 16 reserved)
  • Disk: 171.3 GB (local-lvm) + Ceph OSD

Requested Resources (All Phases):

  • CPU: 62 cores ⚠️ Exceeds the 50 physical cores available, but well within the 104 hardware threads with modest oversubscription
  • RAM: 220 GiB ✅ Within capacity
  • Disk: 2,590 GiB ✅ Using Ceph storage (no local-lvm constraint)

OPTIMIZED: All recommendations have been implemented:

  1. Using Ceph storage: All large disk VMs now use ceph-fs storage
  2. Optimized resource allocation: CPU allocations reduced (validators: 3 cores, others: 2-4 cores)
  3. Moved VMs from ML110-01: All high-resource VMs moved to R630-01

Revised Deployment Plan

Optimized Resource Allocation

ML110-01 (Site-1) - Light Workloads Only OPTIMIZED

Phase 1: Core Infrastructure

  • Nginx Proxy VM: 2 CPU, 4 GiB RAM, 20 GiB disk

Phase 2: Phoenix Infrastructure (Reduced)

  • DNS Primary: 2 CPU, 4 GiB RAM, 50 GiB disk

Phase 3: Blockchain (Sentries Only)

  • smom-sentry-01: 2 CPU, 4 GiB RAM, 20 GiB disk
  • smom-sentry-02: 2 CPU, 4 GiB RAM, 20 GiB disk

ML110-01 Total: 8 CPU cores requested, 5 available ⚠️ Exceeds physical capacity, but acceptable for these lightweight services with CPU oversubscription

OPTIMIZED: Only essential services remain on ML110-01.

R630-01 (Site-2) - Primary Compute Node OPTIMIZED

Phase 1: Core Infrastructure

  • Cloudflare Tunnel VM: 2 CPU, 4 GiB RAM, 10 GiB disk

Phase 2: Phoenix Infrastructure (Moved)

  • Git Server: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
  • Email Server: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • DevOps Runner: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • Codespaces IDE: 4 CPU, 32 GiB RAM, 200 GiB disk (ceph-fs)
  • AS4 Gateway: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)
  • Business Integration Gateway: 4 CPU, 16 GiB RAM, 200 GiB disk (ceph-fs)
  • Financial Messaging Gateway: 4 CPU, 16 GiB RAM, 500 GiB disk (ceph-fs)

Phase 3: Blockchain Infrastructure

  • Validators (4x): 3 CPU each = 12 CPU, 12 GiB RAM each = 48 GiB RAM, 80 GiB disk (ceph-fs)
  • Sentries (2x): 2 CPU each = 4 CPU, 4 GiB RAM each = 8 GiB RAM, 40 GiB disk (ceph-fs)
  • RPC Nodes (4x): 2 CPU each = 8 CPU, 4 GiB RAM each = 16 GiB RAM, 80 GiB disk (ceph-fs)
  • Services (4x): 2 CPU each = 8 CPU, 4 GiB RAM each = 16 GiB RAM, 80 GiB disk (ceph-fs)

R630-01 Total: 62 CPU cores requested, 50 physical cores available ⚠️ Exceeds physical cores, but stays well under the 104 hardware threads

OPTIMIZED: All high-resource VMs moved to R630-01 with optimized CPU allocations and Ceph storage.
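The per-node totals can be re-derived mechanically by summing the per-VM vCPU allocations listed above. A minimal bash sketch, with the figures transcribed from this plan:

```bash
# Per-VM vCPU allocations transcribed from the revised plan above.
ml110_vms=(2 2 2 2)   # nginx-proxy, dns-primary, sentry-01, sentry-02
# cloudflare(2), 7 Phoenix services(4 each), 4 validators(3 each),
# 2 sentries(2 each), 4 RPC nodes(2 each), 4 service VMs(2 each)
r630_vms=(2 4 4 4 4 4 4 4 3 3 3 3 2 2 2 2 2 2 2 2 2 2)

# Sum a list of integers.
sum() { local t=0 v; for v in "$@"; do t=$(( t + v )); done; echo "$t"; }

ml110_total=$(sum "${ml110_vms[@]}")
r630_total=$(sum "${r630_vms[@]}")
echo "ML110-01: ${ml110_total} vCPU requested (5 physical cores available)"
echo "R630-01:  ${r630_total} vCPU requested (50 physical cores available)"
```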


Deployment Execution Plan

Step 1: Pre-Deployment Verification

```bash
# 1. Verify Proxmox nodes are accessible
./scripts/check-proxmox-quota-ssh.sh

# 2. Verify images are available
./scripts/verify-image-availability.sh

# 3. Check Crossplane provider is ready
kubectl get providerconfig -n crossplane-system
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```
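The three checks above can be wrapped in a single gate that fails fast. A minimal sketch; the check commands here are placeholders — in production they would be the verification scripts and kubectl queries above:

```bash
# Run (name, command) pairs as pre-flight checks; abort on the first failure.
# Note: $cmd is intentionally unquoted so simple multi-word commands split.
run_checks() {
  local name cmd
  while [ "$#" -gt 0 ]; do
    name=$1; cmd=$2; shift 2
    if $cmd; then
      echo "PASS: $name"
    else
      echo "FAIL: $name" >&2
      return 1
    fi
  done
}

# Production use would look like (placeholder, not executed here):
#   run_checks "proxmox quota" ./scripts/check-proxmox-quota-ssh.sh \
#              "images"        ./scripts/verify-image-availability.sh
run_checks "shell works" true "arithmetic works" true && echo "all checks passed"
```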

Step 2: Deploy Phase 1 - Core Infrastructure

```bash
# Deploy Nginx Proxy (ML110-01)
kubectl apply -f examples/production/nginx-proxy-vm.yaml

# Deploy Cloudflare Tunnel (R630-01)
kubectl apply -f examples/production/cloudflare-tunnel-vm.yaml

# Monitor deployment
kubectl get proxmoxvm -w
```

Wait for: Both VMs to be in "Running" state before proceeding.
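The "wait for Running" step can be automated with a timeout-bounded polling loop. A sketch; the kubectl condition shown in the comment assumes a `.status.state` field on the CR, which should be verified against the actual CRD:

```bash
# Poll a condition command until it succeeds or a timeout (seconds) expires.
wait_for() {
  local timeout=$1 interval=$2; shift 2
  local waited=0
  until "$@"; do
    waited=$(( waited + interval ))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical production use (status path is an assumption):
#   wait_for 600 10 sh -c \
#     'kubectl get proxmoxvm nginx-proxy -o jsonpath="{.status.state}" | grep -q Running'
wait_for 5 1 true && echo "condition met"
```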

Step 3: Deploy Phase 2 - Phoenix Infrastructure

```bash
# Deploy DNS Primary (ML110-01)
kubectl apply -f examples/production/phoenix/dns-primary.yaml

# Wait for DNS to be ready, then deploy other services
kubectl apply -f examples/production/phoenix/git-server.yaml
kubectl apply -f examples/production/phoenix/email-server.yaml
kubectl apply -f examples/production/phoenix/devops-runner.yaml
kubectl apply -f examples/production/phoenix/codespaces-ide.yaml
kubectl apply -f examples/production/phoenix/as4-gateway.yaml
kubectl apply -f examples/production/phoenix/business-integration-gateway.yaml
kubectl apply -f examples/production/phoenix/financial-messaging-gateway.yaml
```

Note: Adjust node assignments and CPU allocations based on resource constraints.
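The apply sequence above can be driven from a list, verifying each file exists before applying. A sketch; `APPLY` is overridable so the loop can be exercised without a cluster:

```bash
# Apply a list of manifests in order; report missing files and return nonzero
# if anything failed. Override APPLY (e.g. APPLY="echo would-apply") to dry-run.
APPLY="${APPLY:-kubectl apply -f}"
apply_all() {
  local f failed=0
  for f in "$@"; do
    if [ -f "$f" ]; then
      $APPLY "$f" || failed=1
    else
      echo "missing manifest: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example: APPLY="echo would-apply" apply_all examples/production/phoenix/*.yaml
```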

Step 4: Deploy Phase 3 - Blockchain Infrastructure

```bash
# Deploy validators first
kubectl apply -f examples/production/smom-dbis-138/validator-01.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-02.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-03.yaml
kubectl apply -f examples/production/smom-dbis-138/validator-04.yaml

# Deploy sentries
kubectl apply -f examples/production/smom-dbis-138/sentry-01.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-02.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-03.yaml
kubectl apply -f examples/production/smom-dbis-138/sentry-04.yaml

# Deploy RPC nodes
kubectl apply -f examples/production/smom-dbis-138/rpc-node-01.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-02.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-03.yaml
kubectl apply -f examples/production/smom-dbis-138/rpc-node-04.yaml

# Deploy services
kubectl apply -f examples/production/smom-dbis-138/management.yaml
kubectl apply -f examples/production/smom-dbis-138/monitoring.yaml
kubectl apply -f examples/production/smom-dbis-138/services.yaml
kubectl apply -f examples/production/smom-dbis-138/blockscout.yaml
```
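Because the blockchain manifests follow a numbered naming scheme, the file lists can be generated rather than typed out. A small sketch:

```bash
# Generate the zero-padded manifest paths for a numbered blockchain role.
manifest_paths() {
  local role=$1 count=$2 i
  for i in $(seq 1 "$count"); do
    printf 'examples/production/smom-dbis-138/%s-%02d.yaml\n' "$role" "$i"
  done
}

manifest_paths validator 4   # validator-01.yaml .. validator-04.yaml
manifest_paths sentry 4
manifest_paths rpc-node 4
```

The generated lists can then be fed to an apply loop while preserving the validator → sentry → RPC → services ordering required above.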

Step 5: Deploy Phase 4 - Test VMs (Optional)

```bash
# Deploy test VMs only if resources allow
kubectl apply -f examples/production/vm-100.yaml
kubectl apply -f examples/production/basic-vm.yaml
kubectl apply -f examples/production/medium-vm.yaml
kubectl apply -f examples/production/large-vm.yaml
```

Monitoring and Verification

Real-Time Monitoring

```bash
# Watch all VM deployments
kubectl get proxmoxvm -A -w

# Check specific VM status
kubectl describe proxmoxvm <vm-name>

# Check controller logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100 -f
```

Resource Monitoring

```bash
# Check Proxmox node resources
./scripts/check-proxmox-quota-ssh.sh

# Check VM resource usage
kubectl get proxmoxvm -A -o wide
```

Post-Deployment Verification

```bash
# Verify all VMs are running (--no-headers so the header row doesn't match)
kubectl get proxmoxvm -A --no-headers | grep -v Running

# Check VM IP addresses
kubectl get proxmoxvm -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.network.ipAddress}{"\n"}{end}'

# Verify guest agents
./scripts/verify-guest-agent.sh
```
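The Running-state check can also be made programmatic by counting non-Running data rows in the table output. A sketch against sample data; the column layout (STATUS in the third column) is an assumption about `kubectl get proxmoxvm -A` output:

```bash
# Count rows whose third column is not "Running", skipping the header row.
count_not_running() {
  awk 'NR > 1 && $3 != "Running" { n++ } END { print n + 0 }'
}

# Sample shaped like `kubectl get proxmoxvm -A` output (columns are assumptions):
sample='NAMESPACE   NAME          STATUS    AGE
default     nginx-proxy   Running   5m
default     git-server    Pending   1m'

printf '%s\n' "$sample" | count_not_running   # prints 1
```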

Risk Mitigation

Resource Overcommitment

Risk: Requested resources exceed available capacity.

Mitigation:

  1. Deploy VMs in batches, monitoring resource usage
  2. Reduce CPU allocations where possible
  3. Use Ceph storage for large disk requirements
  4. Move high-resource VMs to R630-01
  5. Consider adding additional Proxmox nodes

Deployment Failures

Risk: VM creation may fail due to resource constraints or configuration errors.

Mitigation:

  1. Validate all VM configurations before deployment
  2. Check Proxmox quotas before each deployment
  3. Monitor controller logs for errors
  4. Have rollback procedures ready
  5. Test deployments on non-critical VMs first
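Step 1 of the list above ("validate all VM configurations before deployment") can be partially automated with a static check for required top-level fields. A minimal sketch; the required keys are assumptions about the manifest shape, not a substitute for `kubectl apply --dry-run` validation:

```bash
# Statically check that a manifest declares the fields we expect to see.
# The key list is illustrative; adjust to the provider's actual schema.
validate_manifest() {
  local f=$1 key missing=0
  for key in 'kind:' 'metadata:' 'spec:'; do
    grep -q "^[[:space:]]*${key}" "$f" \
      || { echo "$f: missing ${key%:}" >&2; missing=1; }
  done
  return "$missing"
}

# Example: for f in examples/production/phoenix/*.yaml; do validate_manifest "$f"; done
```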

Network Issues

Risk: Network connectivity problems may prevent VM deployment or operation.

Mitigation:

  1. Verify network bridges exist on all nodes
  2. Test network connectivity before deployment
  3. Configure proper DNS resolution
  4. Verify firewall rules allow required traffic

Deployment Timeline

Estimated Timeline

  • Phase 1 (Core Infrastructure): 30 minutes
  • Phase 2 (Phoenix Infrastructure): 2-4 hours
  • Phase 3 (Blockchain Infrastructure): 3-6 hours
  • Phase 4 (Test VMs): 1 hour (optional)

Total Estimated Time: 6-11 hours (excluding verification and troubleshooting)

Critical Path

  1. Core Infrastructure (Nginx, Cloudflare Tunnel) → 30 min
  2. DNS Primary → 15 min
  3. Git Server, Email Server → 1 hour
  4. DevOps Runner, Codespaces IDE → 1 hour
  5. Blockchain Validators → 2 hours
  6. Blockchain Sentries → 1 hour
  7. Blockchain RPC Nodes → 1 hour
  8. Blockchain Services → 1 hour

Next Steps

  1. Review and Approve: Review this plan and approve resource allocations
  2. Update VM Configurations: Update VM YAML files with optimized resource allocations
  3. Pre-Deployment Checks: Run all pre-deployment verification scripts
  4. Execute Deployment: Follow deployment steps in order
  5. Monitor and Verify: Continuously monitor deployment progress
  6. Post-Deployment: Verify all services are operational


Last Updated: 2025-01-XX
Status: Ready for Review
Maintainer: Infrastructure Team
Version: 2.0