# Orchestration Deployment Guide - Enterprise-Grade
**Navigation:** [Home](/docs/01-getting-started/README.md) > [Architecture](/docs/01-getting-started/README.md) > Orchestration Deployment Guide
**Sankofa / Phoenix / PanTel · ChainID 138 · Proxmox + Cloudflare Zero Trust + Dual ISP + 6×/28**
**Last Updated:** 2025-01-20
**Document Version:** 1.1
**Status:** 🟢 Active Documentation
---
## Overview
This is the **complete orchestration technical plan** for your environment, using your actual **Spectrum /28 #1** and **placeholders for the other five /28 blocks**, explicitly mapping to your hardware:
- **2× ER605** (edge + HA/failover design)
- **3× ES216G switches**
- **1× ML110 Gen9** (management / seed / bootstrap)
- **4× Dell R630** (compute cluster; 512GB RAM each; 2×600GB boot; 6×250GB SSD)
This guide provides a **buildable blueprint**: network, VLANs, Proxmox cluster, IPAM, CCIP next-phase matrix, Cloudflare Zero Trust, and operational runbooks.
---
## Table of Contents
**Estimated Reading Time:** 45 minutes
**Progress:** Use this TOC to track your reading progress
1. ✅ [Core Principles](#core-principles) - *Foundation concepts*
2. ✅ [Physical Topology & Roles](#physical-topology--roles) - *Hardware layout*
3. ✅ [ISP & Public IP Plan](#isp--public-ip-plan) - *Public IP allocation*
4. ✅ [Layer-2 & VLAN Orchestration](#layer-2--vlan-orchestration) - *VLAN configuration*
5. ✅ [Routing, NAT, and Egress Segmentation](#routing-nat-and-egress-segmentation) - *Network routing*
6. ✅ [Proxmox Cluster Orchestration](#proxmox-cluster-orchestration) - *Proxmox setup*
7. ✅ [Cloudflare Zero Trust Orchestration](#cloudflare-zero-trust-orchestration) - *Cloudflare integration*
8. ✅ [VMID Allocation Registry](#vmid-allocation-registry) - *VMID planning*
9. ✅ [CCIP Fleet Deployment Matrix](#ccip-fleet-deployment-matrix) - *CCIP deployment*
10. ✅ [Deployment Orchestration Workflow](#deployment-orchestration-workflow) - *Deployment process*
11. ✅ [Operational Runbooks](#operational-runbooks) - *Operations guide*
---
## Core Principles
1. **No public IPs on Proxmox hosts or LXCs/VMs** (default)
2. **Inbound access = Cloudflare Zero Trust + cloudflared** (primary)
3. **Public IPs are used for:**
- ER605 WAN addressing
- **Egress NAT pools** (role-based allowlisting)
- **Break-glass** emergency endpoints only
4. **Segmentation by VLAN/VRF**: consensus vs services vs sovereign tenants vs ops
5. **Deterministic VMID registry** with an IPAM plan that matches it
---
## Physical Topology & Roles
> **Reference:** For complete hardware role assignments, physical topology, and detailed specifications, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#1-physical-topology--hardware-roles)**.
> **Hardware Inventory:** For complete physical hardware inventory including IP addresses, credentials, hostnames, and detailed specifications, see **[PHYSICAL_HARDWARE_INVENTORY.md](PHYSICAL_HARDWARE_INVENTORY.md)** ⭐⭐⭐.
**Summary:**
- **2× ER605** (edge + HA/failover design)
- **3× ES216G switches** (core, compute, mgmt)
- **1× ML110 Gen9** (management / seed / bootstrap) - IP: 192.168.11.10
- **4× Dell R630** (compute cluster; 512GB RAM each; 2×600GB boot; 6×250GB SSD)
---
## ISP & Public IP Plan
> **Reference:** For complete public IP block plan, usage policy, and NAT pool assignments, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#2-isp--public-ip-plan-6--28)**.
**Summary:**
- **Block #1** (76.53.10.32/28): Router WAN + break-glass VIPs ✅ Configured
- **Blocks #2-6**: Placeholders for CCIP Commit, Execute, RMN, Service, and Sovereign tenant egress NAT pools
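
As a minimal sketch of what a /28 provides, the snippet below breaks Block #1 into network, gateway, usable range, and broadcast using Python's standard `ipaddress` module. The helper name `describe_block` is hypothetical, and treating the first usable address as the gateway is an assumption (it happens to match the Block #1 gateway listed in Phase 0).

```python
import ipaddress

def describe_block(cidr: str) -> dict:
    """Break a public block into network, gateway, usable range, and broadcast."""
    net = ipaddress.ip_network(cidr, strict=True)
    hosts = list(net.hosts())  # usable addresses only (no network/broadcast)
    return {
        "network": str(net.network_address),
        # Assumption: the ISP gateway is the first usable address.
        "gateway": str(hosts[0]),
        "usable": f"{hosts[0]}-{hosts[-1]}",
        "usable_count": len(hosts),
        "broadcast": str(net.broadcast_address),
    }

block1 = describe_block("76.53.10.32/28")
for field, value in block1.items():
    print(f"{field}: {value}")
```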
---
## Layer-2 & VLAN Orchestration
> **Reference:** For complete VLAN orchestration plan, subnet allocations, and switching configuration, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#3-layer-2--vlan-orchestration-plan)**.
**Summary:**
- **19 VLANs** defined with complete subnet plan
- **VLAN 11**: MGMT-LAN (192.168.11.0/24) - Current flat LAN
- **VLANs 110-203**: Service-specific VLANs (10.x.0.0/24 or /20 or /22)
- **Migration path**: From flat LAN to VLANs while maintaining compatibility
---
## Routing, NAT, and Egress Segmentation
> **Reference:** For complete routing configuration, NAT policies, and egress segmentation details, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#4-routing-nat-and-egress-segmentation-er605)**.
**Summary:**
- **Inbound NAT**: Default none (Cloudflare Tunnel primary)
- **Outbound NAT**: Role-based pools using /28 blocks #2-6
- **Egress Segmentation**: CCIP Commit → Block #2, Execute → Block #3, RMN → Block #4, Services → Block #5, Sovereign → Block #6
---
## Proxmox Cluster Orchestration
> **Reference:** For complete Proxmox cluster orchestration, networking, and storage details, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#5-proxmox-cluster-orchestration)**.
**Summary:**
- **Node Layout**: ml110 (mgmt) + r630-01..04 (compute)
- **Networking**: VLAN-aware bridge `vmbr0` with native VLAN 11
- **Storage**: ZFS recommended for R630 data SSDs
---
## Cloudflare Zero Trust Orchestration
> **Reference:** For complete Cloudflare Zero Trust orchestration, cloudflared gateway pattern, and tunnel configuration, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#6-cloudflare-zero-trust-orchestration)**.
**Summary:**
- **2 cloudflared LXCs** for redundancy (ML110 + R630)
- **Tunnels for**: Blockscout, FireFly, Gitea, internal admin dashboards
- **Proxmox UI**: LAN-only (publish via Cloudflare Access if needed)
For detailed Cloudflare configuration guides, see:
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)**
- **[../04-configuration/cloudflare/CLOUDFLARE_DNS_TO_CONTAINERS.md](../04-configuration/cloudflare/CLOUDFLARE_DNS_TO_CONTAINERS.md)**
---
## VMID Allocation Registry
> **Reference:** For complete VMID allocation registry with detailed breakdowns, see **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)**.
**Summary:**
- **Total Allocated**: 11,000 VMIDs allocated within the 1000-13999 range
- **Besu Network**: 4,000 VMIDs (1000-4999)
- **CCIP**: 200 VMIDs (5400-5599)
- **Sovereign Cloud Band**: 4,000 VMIDs (10000-13999)
See also **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#7-complete-vmid-and-network-allocation-table)** for VMID-to-VLAN mapping.
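
A small lookup can make the registry deterministic in tooling. The sketch below covers only the three bands listed in this summary; the names `VMID_BANDS` and `vmid_band` are hypothetical, and a `None` result only means the VMID is outside these three bands, not that it is unallocated in VMID_ALLOCATION_FINAL.md.

```python
from typing import Optional

# Inclusive bands from the summary above, expressed as half-open Python ranges.
VMID_BANDS = {
    "besu": range(1000, 5000),         # 1000-4999
    "ccip": range(5400, 5600),         # 5400-5599
    "sovereign": range(10000, 14000),  # 10000-13999
}

def vmid_band(vmid: int) -> Optional[str]:
    """Return the band name for a VMID, or None if it is not in a listed band."""
    for name, band in VMID_BANDS.items():
        if vmid in band:
            return name
    return None

print(vmid_band(5410))   # ccip
print(vmid_band(13999))  # sovereign
print(vmid_band(5000))   # None
```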
---
## CCIP Fleet Deployment Matrix
### Lane A — Minimum Production Fleet
**Total new CCIP nodes:** 41 (or 43 if you add 2 monitoring nodes)
### VMIDs + Hostnames
| Group | Count | VMIDs | Hostname Pattern |
|-------|------:|------:|------------------|
| Ops/Admin | 2 | 5400-5401 | `ccip-ops-01..02` |
| Monitoring (optional) | 2 | 5402-5403 | `ccip-mon-01..02` |
| Commit Oracles | 16 | 5410-5425 | `ccip-commit-01..16` |
| Execute Oracles | 16 | 5440-5455 | `ccip-exec-01..16` |
| RMN | 7 | 5470-5476 | `ccip-rmn-01..07` |
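
The table can be expanded into explicit VMID/hostname pairs, which is handy for feeding provisioning scripts. This is a sketch only: `FLEET` and `expand_fleet` are hypothetical names, and the starting VMID and count for each group are read from the table above.

```python
# (hostname prefix, first VMID, node count), taken from the table above.
FLEET = [
    ("ccip-ops", 5400, 2),
    ("ccip-mon", 5402, 2),      # optional monitoring nodes
    ("ccip-commit", 5410, 16),
    ("ccip-exec", 5440, 16),
    ("ccip-rmn", 5470, 7),
]

def expand_fleet(fleet=FLEET, include_monitoring=True):
    """Return a list of (vmid, hostname) pairs for every node in the fleet."""
    nodes = []
    for prefix, first_vmid, count in fleet:
        if prefix == "ccip-mon" and not include_monitoring:
            continue
        for i in range(count):
            nodes.append((first_vmid + i, f"{prefix}-{i + 1:02d}"))
    return nodes

nodes = expand_fleet()
print(len(nodes))  # 43
print(nodes[0], nodes[-1])
```

Calling `expand_fleet(include_monitoring=False)` yields the 41-node minimum fleet.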
### Private IP Assignments (VLAN-based)
Once VLANs are active, assign:
| Role | VLAN | Subnet |
|------|-----:|--------|
| Ops/Admin | 130 | 10.130.0.0/24 |
| Commit | 132 | 10.132.0.0/24 |
| Execute | 133 | 10.133.0.0/24 |
| RMN | 134 | 10.134.0.0/24 |
> **Interim Plan:** While still on the flat LAN, use 192.168.11.170-212 (cleared 2026-02-01). Migrate to VLANs when ready.
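
A quick arithmetic check confirms that the interim range is large enough for the full fleet, including the optional monitoring nodes. The helper `ip_range` is a hypothetical name; the addresses and node counts come from this section.

```python
import ipaddress

def ip_range(first: str, last: str):
    """Return every address from first to last, inclusive."""
    start = int(ipaddress.ip_address(first))
    end = int(ipaddress.ip_address(last))
    return [ipaddress.ip_address(n) for n in range(start, end + 1)]

interim = ip_range("192.168.11.170", "192.168.11.212")
fleet_size = 2 + 2 + 16 + 16 + 7  # ops + optional monitoring + commit + execute + RMN
mgmt_lan = ipaddress.ip_network("192.168.11.0/24")

print(len(interim), fleet_size)                  # 43 43
print(all(addr in mgmt_lan for addr in interim))  # True
```

The range holds exactly 43 addresses, so there is no spare capacity on the flat LAN if the monitoring nodes are deployed.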
### Egress NAT Mapping (Public blocks placeholder)
- Commit VLAN (10.132.0.0/24) → **Block #2** `<PUBLIC_BLOCK_2>/28`
- Execute VLAN (10.133.0.0/24) → **Block #3** `<PUBLIC_BLOCK_3>/28`
- RMN VLAN (10.134.0.0/24) → **Block #4** `<PUBLIC_BLOCK_4>/28`
See **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** for complete specification.
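
The mapping above can be expressed as a lookup from a node's private address to its egress pool. This is an illustrative sketch: `EGRESS_POOLS` and `egress_pool_for` are hypothetical names, and the pool values stay as the placeholder strings until the real blocks are known.

```python
import ipaddress
from typing import Optional

# Source subnet -> egress pool placeholder, as listed in this section.
EGRESS_POOLS = {
    ipaddress.ip_network("10.132.0.0/24"): "<PUBLIC_BLOCK_2>/28",  # Commit
    ipaddress.ip_network("10.133.0.0/24"): "<PUBLIC_BLOCK_3>/28",  # Execute
    ipaddress.ip_network("10.134.0.0/24"): "<PUBLIC_BLOCK_4>/28",  # RMN
}

def egress_pool_for(private_ip: str) -> Optional[str]:
    """Return the egress pool placeholder for a private IP, or None if unmapped."""
    addr = ipaddress.ip_address(private_ip)
    for subnet, pool in EGRESS_POOLS.items():
        if addr in subnet:
            return pool
    return None  # e.g. Ops/Admin (VLAN 130) has no pool listed in this section

print(egress_pool_for("10.132.0.17"))  # <PUBLIC_BLOCK_2>/28
print(egress_pool_for("10.130.0.5"))   # None
```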
---
## Deployment Orchestration Workflow
### Deployment Workflow Diagram
```mermaid
flowchart TD
Start[Start Deployment] --> Phase0[Phase 0: Validate Foundation]
Phase0 --> Check1{Foundation Valid?}
Check1 -->|No| Fix1[Fix Issues]
Fix1 --> Phase0
Check1 -->|Yes| Phase1[Phase 1: Enable VLANs]
Phase1 --> Verify1{VLANs Working?}
Verify1 -->|No| FixVLAN[Fix VLAN Config]
FixVLAN --> Phase1
Verify1 -->|Yes| Phase2[Phase 2: Deploy Observability]
Phase2 --> Verify2{Monitoring Active?}
Verify2 -->|No| FixMonitor[Fix Monitoring]
FixMonitor --> Phase2
Verify2 -->|Yes| Phase3[Phase 3: Deploy CCIP Fleet]
Phase3 --> Verify3{CCIP Nodes Running?}
Verify3 -->|No| FixCCIP[Fix CCIP Config]
FixCCIP --> Phase3
Verify3 -->|Yes| Phase4[Phase 4: Deploy Sovereign Tenants]
Phase4 --> Verify4{Tenants Operational?}
Verify4 -->|No| FixTenants[Fix Tenant Config]
FixTenants --> Phase4
Verify4 -->|Yes| Complete[Deployment Complete]
```
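
The diagram's gate-and-retry pattern (deploy, verify, fix, repeat) can be sketched as a small driver loop. Everything here is hypothetical scaffolding: `run_phases`, the stub callables, and the `max_attempts` cap are assumptions, since the diagram itself does not limit retries.

```python
from typing import Callable, List, Tuple

# (phase name, deploy step, verification check, fix step)
Phase = Tuple[str, Callable[[], None], Callable[[], bool], Callable[[], None]]

def run_phases(phases: List[Phase], max_attempts: int = 3) -> List[str]:
    """Run each phase in order; only advance once its verification passes."""
    log = []
    for name, deploy, verify, fix in phases:
        for attempt in range(1, max_attempts + 1):
            deploy()
            if verify():
                log.append(f"{name}: ok (attempt {attempt})")
                break
            fix()  # mirrors the "Fix ..." boxes that loop back to the phase
        else:
            raise RuntimeError(f"{name}: still failing after {max_attempts} attempts")
    log.append("Deployment complete")
    return log

# Stub phases: Phase 1 fails its first check, then passes after the fix.
state = {"vlans_ok": False}
demo = [
    ("Phase 0: Validate Foundation", lambda: None, lambda: True, lambda: None),
    ("Phase 1: Enable VLANs", lambda: None,
     lambda: state["vlans_ok"], lambda: state.update(vlans_ok=True)),
]
result = run_phases(demo)
print("\n".join(result))
```

In practice each callable would wrap real checks (for example, a reachability test or an API query); the stubs here only demonstrate the control flow.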
### Phase 0 — Validate Foundation
1. ✅ Confirm ER605-A WAN1 static: **76.53.10.34/28**, GW **76.53.10.33**
2. ⏳ Confirm WAN2 on ER605-A (ISP #2) failover
3. ⏳ Confirm ES216G trunks and native VLAN 11 mgmt access is stable
4. ⏳ Confirm Proxmox mgmt reachable only from trusted admin endpoints
### Phase 1 — VLAN Enablement
1. ⏳ Configure ES216G trunk ports
2. ⏳ Enable VLAN-aware bridge `vmbr0` on Proxmox nodes
3. ⏳ Create VLAN interfaces on ER605 for routing + DHCP (where appropriate)
4. ⏳ Move services one domain at a time (start with monitoring)
### Phase 2 — Observability First
1. ⏳ Deploy monitoring stack (Prometheus/Grafana/Loki/Alertmanager)
2. ⏳ Publish Grafana via Cloudflare Access (not public IPs)
3. ⏳ Set alerts for node health, disk, latency, chain metrics
### Phase 3 — CCIP Fleet (Lane A)
1. ⏳ Deploy CCIP Ops/Admin
2. ⏳ Deploy 16 commit nodes (VLAN 132)
3. ⏳ Deploy 16 execute nodes (VLAN 133)
4. ⏳ Deploy 7 RMN nodes (VLAN 134)
5. ⏳ Apply ER605 outbound NAT pools per VLAN using /28 blocks #2-4 placeholders
6. ⏳ Verify node egress identity by role (allowlisting ready)
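
Step 6 amounts to checking that the public address a node is seen from falls inside the pool assigned to its role. The sketch below shows only that comparison. The pools are stand-ins from the RFC 5737 documentation range, not the real blocks, and `ROLE_POOLS` and `egress_matches_role` are hypothetical names. How the observed address is collected is outside this sketch.

```python
import ipaddress

# Stand-in pools from the RFC 5737 documentation range; NOT the real /28 blocks.
ROLE_POOLS = {
    "commit": ipaddress.ip_network("203.0.113.0/28"),
    "execute": ipaddress.ip_network("203.0.113.16/28"),
    "rmn": ipaddress.ip_network("203.0.113.32/28"),
}

def egress_matches_role(role: str, observed_ip: str) -> bool:
    """True if the observed public IP belongs to the pool assigned to the role."""
    return ipaddress.ip_address(observed_ip) in ROLE_POOLS[role]

print(egress_matches_role("commit", "203.0.113.5"))   # True
print(egress_matches_role("execute", "203.0.113.5"))  # False
```

A `False` result for a running node suggests its traffic is leaving through the wrong NAT pool, which would break role-based allowlisting.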
### Phase 4 — Sovereign Tenant Rollout
1. ⏳ Stand up Phoenix Sovereign Cloud Band VLANs 200-203
2. ⏳ Apply Block #6 egress NAT
3. ⏳ Enforce tenant isolation (ACLs, deny east-west)
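
To illustrate the scale of a deny east-west policy, the sketch below enumerates one deny rule per ordered pair of tenant VLANs. It assumes one tenant per VLAN across VLANs 200-203, and the rule dictionary is a hypothetical, vendor-neutral shape rather than ER605 ACL syntax.

```python
from itertools import permutations

TENANT_VLANS = [200, 201, 202, 203]  # assumption: one tenant per VLAN

def east_west_denies(vlans=TENANT_VLANS):
    """Return one deny rule for every ordered (source, destination) VLAN pair."""
    return [
        {"src_vlan": src, "dst_vlan": dst, "action": "deny"}
        for src, dst in permutations(vlans, 2)
    ]

rules = east_west_denies()
print(len(rules))  # 12
print(rules[0])
```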
---
## Operational Runbooks
### Network Operations
- **[../04-configuration/ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration guide
- **[../06-besu/BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** - Besu allowlist management
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare Zero Trust setup
### Deployment Operations
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP fleet deployment
- **[DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Pre-deployment validation
### Troubleshooting
- **[../09-troubleshooting/TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[../09-troubleshooting/QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
---
## Deliverables
### Completed ✅
- ✅ Authoritative VLAN and subnet plan
- ✅ Public block usage model (with placeholders for 5 blocks)
- ✅ Proxmox cluster topology plan
- ✅ CCIP fleet deployment matrix
- ✅ Stepwise orchestration workflow
### Pending ⏳
- ⏳ Exact NAT/VIP rules (requires public blocks #2-6)
- ⏳ ER605-B role decision (standby edge vs dedicated sovereign edge)
- ⏳ VLAN migration execution
- ⏳ CCIP fleet deployment
---
## Next Steps
### To Finalize Placeholders
Paste the other five /28 blocks in the same format as Block #1:
- Network / Gateway / Usable / Broadcast
And specify:
- ER605-B usage: **standby edge** OR **dedicated sovereign edge**
Then we can produce:
- **Exact NAT pool assignment sheet** per role
- **Break-glass VIP table**
- **Complete ER605 configuration**
---
## Related Documentation
### Prerequisites
- **[../01-getting-started/PREREQUISITES.md](/docs/01-getting-started/PREREQUISITES.md)** - System requirements and prerequisites
- **[../03-deployment/DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Pre-deployment validation checklist
### Architecture
- **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md)** ⭐⭐⭐ - Complete network architecture (authoritative reference)
- **[PHYSICAL_HARDWARE_INVENTORY.md](PHYSICAL_HARDWARE_INVENTORY.md)** ⭐⭐⭐ - Physical hardware inventory and specifications
- **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)** ⭐⭐⭐ - VMID allocation registry
- **[DOMAIN_STRUCTURE.md](DOMAIN_STRUCTURE.md)** ⭐⭐ - Domain structure and DNS assignments
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment specification
### Configuration
- **[../04-configuration/ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare Zero Trust setup
### Operations
- **[../03-deployment/OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md](../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Deployment status
- **[../09-troubleshooting/TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide
### Best Practices
- **[../10-best-practices/RECOMMENDATIONS_AND_SUGGESTIONS.md](../10-best-practices/RECOMMENDATIONS_AND_SUGGESTIONS.md)** - Comprehensive recommendations
- **[../10-best-practices/IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md)** - Implementation checklist
### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index
---
**Document Status:** Complete (v1.1)
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2025-01-20
---
## Change Log
### Version 1.1 (2025-01-20)
- Removed duplicate network architecture content
- Added references to NETWORK_ARCHITECTURE.md
- Added deployment workflow Mermaid diagram
- Added ASCII art process flow
- Added breadcrumb navigation
- Added status indicators
### Version 1.0 (2024-12-15)
- Initial version
- Complete deployment orchestration guide