# Orchestration Deployment Guide - Enterprise-Grade
**Sankofa / Phoenix / PanTel · ChainID 138 · Proxmox + Cloudflare Zero Trust + Dual ISP + 6×/28**
**Last Updated:** 2025-01-20
**Document Version:** 1.1
**Status:** 🟢 Active Documentation
---
## Overview
This is the **complete orchestration technical plan** for your environment, using your actual **Spectrum /28 #1** and **placeholders for the other five /28 blocks**, explicitly mapping to your hardware:
- **2× ER605** (edge + HA/failover design)
- **3× ES216G switches**
- **1× ML110 Gen9** (management / seed / bootstrap)
- **4× Dell R630** (compute cluster; 512GB RAM each; 2×600GB boot; 6×250GB SSD)
This guide provides a **buildable blueprint**: network, VLANs, Proxmox cluster, IPAM, CCIP next-phase matrix, Cloudflare Zero Trust, and operational runbooks.
---
## Table of Contents
**Estimated Reading Time:** 45 minutes
**Progress:** Use this TOC to track your reading progress
1. ✅ [Core Principles](#core-principles) - *Foundation concepts*
2. ✅ [Physical Topology & Roles](#physical-topology--roles) - *Hardware layout*
3. ✅ [ISP & Public IP Plan](#isp--public-ip-plan) - *Public IP allocation*
4. ✅ [Layer-2 & VLAN Orchestration](#layer-2--vlan-orchestration) - *VLAN configuration*
5. ✅ [Routing, NAT, and Egress Segmentation](#routing-nat-and-egress-segmentation) - *Network routing*
6. ✅ [Proxmox Cluster Orchestration](#proxmox-cluster-orchestration) - *Proxmox setup*
7. ✅ [Cloudflare Zero Trust Orchestration](#cloudflare-zero-trust-orchestration) - *Cloudflare integration*
8. ✅ [VMID Allocation Registry](#vmid-allocation-registry) - *VMID planning*
9. ✅ [CCIP Fleet Deployment Matrix](#ccip-fleet-deployment-matrix) - *CCIP deployment*
10. ✅ [Deployment Orchestration Workflow](#deployment-orchestration-workflow) - *Deployment process*
11. ✅ [Operational Runbooks](#operational-runbooks) - *Operations guide*
---
## Core Principles
1. **No public IPs on Proxmox hosts or LXCs/VMs** (default)
2. **Inbound access = Cloudflare Zero Trust + cloudflared** (primary)
3. **Public IPs are used for:**
- ER605 WAN addressing
- **Egress NAT pools** (role-based allowlisting)
- **Break-glass** emergency endpoints only
4. **Segmentation by VLAN/VRF**: consensus vs services vs sovereign tenants vs ops
5. **Deterministic VMID registry** + IPAM that matches
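Principle 1 can be checked mechanically against an inventory export. The following is a minimal sketch, assuming a simple name-to-address mapping; the inventory entries (including the `ccip-commit-01` address and the `misconfigured-guest` row) are hypothetical and only illustrate the check.

```python
import ipaddress

# RFC 1918 private ranges; hosts and guests are expected to sit inside one of these.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(addr: str) -> bool:
    """Return True if addr falls inside an RFC 1918 private range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def public_ip_violations(inventory: dict) -> list:
    """Return the sorted names of entries that carry a non-private address."""
    return sorted(name for name, addr in inventory.items() if not is_rfc1918(addr))


# Hypothetical inventory; only the ml110 address comes from this guide.
inventory = {
    "ml110": "192.168.11.10",
    "ccip-commit-01": "10.132.0.11",
    "misconfigured-guest": "76.53.10.40",
}

print(public_ip_violations(inventory))  # ['misconfigured-guest']
```

An empty result means no host or guest in the export violates the default; break-glass endpoints would need an explicit exception list.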
---
## Physical Topology & Roles
> **Reference:** For complete hardware role assignments, physical topology, and detailed specifications, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#1-physical-topology--hardware-roles)**.
> **Hardware Inventory:** For complete physical hardware inventory including IP addresses, credentials, hostnames, and detailed specifications, see **[PHYSICAL_HARDWARE_INVENTORY.md](PHYSICAL_HARDWARE_INVENTORY.md)** ⭐⭐⭐.
**Summary:**
- **2× ER605** (edge + HA/failover design)
- **3× ES216G switches** (core, compute, mgmt)
- **1× ML110 Gen9** (management / seed / bootstrap) - IP: 192.168.11.10
- **4× Dell R630** (compute cluster; 512GB RAM each; 2×600GB boot; 6×250GB SSD)
---
## ISP & Public IP Plan
> **Reference:** For complete public IP block plan, usage policy, and NAT pool assignments, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#2-isp--public-ip-plan-6--28)**.
**Summary:**
- **Block #1** (76.53.10.32/28): Router WAN + break-glass VIPs ✅ Configured
- **Blocks #2-6**: Placeholders for CCIP Commit, Execute, RMN, Service, and Sovereign tenant egress NAT pools
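As a quick sanity check on Block #1, the standard-library `ipaddress` module can derive the network, usable range, and broadcast address from the CIDR. This is a sketch only; treating the first usable address as the gateway is an assumption that happens to match the 76.53.10.33 gateway listed in Phase 0, and `describe_block` is a hypothetical helper name.

```python
import ipaddress


def describe_block(cidr: str) -> dict:
    """Break a public block into network / gateway / usable / broadcast."""
    net = ipaddress.ip_network(cidr, strict=True)
    hosts = list(net.hosts())
    return {
        "network": str(net.network_address),
        # Assumption: the ISP gateway is the first usable address.
        "gateway": str(hosts[0]),
        # Usable range includes the gateway address itself.
        "usable": f"{hosts[0]}-{hosts[-1]}",
        "usable_count": len(hosts),
        "broadcast": str(net.broadcast_address),
    }


block1 = describe_block("76.53.10.32/28")
for field, value in block1.items():
    print(f"{field:>12}: {value}")
```

The same function can be run on Blocks #2-6 once their CIDRs are known.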
---
## Layer-2 & VLAN Orchestration
> **Reference:** For complete VLAN orchestration plan, subnet allocations, and switching configuration, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#3-layer-2--vlan-orchestration-plan)**.
**Summary:**
- **19 VLANs** defined with complete subnet plan
- **VLAN 11**: MGMT-LAN (192.168.11.0/24) - Current flat LAN
- **VLANs 110-203**: Service-specific VLANs (10.x.0.0/24 or /20 or /22)
- **Migration path**: From flat LAN to VLANs while maintaining compatibility
---
## Routing, NAT, and Egress Segmentation
> **Reference:** For complete routing configuration, NAT policies, and egress segmentation details, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#4-routing-nat-and-egress-segmentation-er605)**.
**Summary:**
- **Inbound NAT**: Default none (Cloudflare Tunnel primary)
- **Outbound NAT**: Role-based pools using /28 blocks #2-6
- **Egress Segmentation**: CCIP Commit → Block #2, Execute → Block #3, RMN → Block #4, Services → Block #5, Sovereign → Block #6
---
## Proxmox Cluster Orchestration
> **Reference:** For complete Proxmox cluster orchestration, networking, and storage details, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#5-proxmox-cluster-orchestration)**.
**Summary:**
- **Node Layout**: ml110 (mgmt) + r630-01..04 (compute)
- **Networking**: VLAN-aware bridge `vmbr0` with native VLAN 11
- **Storage**: ZFS recommended for R630 data SSDs
---
## Cloudflare Zero Trust Orchestration
> **Reference:** For complete Cloudflare Zero Trust orchestration, cloudflared gateway pattern, and tunnel configuration, see **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#6-cloudflare-zero-trust-orchestration)**.
**Summary:**
- **2 cloudflared LXCs** for redundancy (ML110 + R630)
- **Tunnels for**: Blockscout, FireFly, Gitea, internal admin dashboards
- **Proxmox UI**: LAN-only (publish via Cloudflare Access if needed)
For detailed Cloudflare configuration guides, see:
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)**
- **[../04-configuration/cloudflare/CLOUDFLARE_DNS_TO_CONTAINERS.md](../04-configuration/cloudflare/CLOUDFLARE_DNS_TO_CONTAINERS.md)**
---
## VMID Allocation Registry
> **Reference:** For complete VMID allocation registry with detailed breakdowns, see **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)**.
**Summary:**
- **Total Allocated**: 11,000 VMIDs within the range 1000-13999
- **Besu Network**: 4,000 VMIDs (1000-4999)
- **CCIP**: 200 VMIDs (5400-5599)
- **Sovereign Cloud Band**: 4,000 VMIDs (10000-13999)
See also **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md#7-complete-vmid-and-network-allocation-table)** for VMID-to-VLAN mapping.
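A deterministic registry lends itself to a simple band lookup. The sketch below covers only the three bands named in this summary; the full registry in VMID_ALLOCATION_FINAL.md contains more, and the names `VMID_BANDS` and `vmid_band` are illustrative.

```python
from typing import Optional

# Only the bands listed in the summary above; range() upper bounds are exclusive.
VMID_BANDS = {
    "besu": range(1000, 5000),         # 1000-4999
    "ccip": range(5400, 5600),         # 5400-5599
    "sovereign": range(10000, 14000),  # 10000-13999
}


def vmid_band(vmid: int) -> Optional[str]:
    """Return the band a VMID belongs to, or None if it is outside these bands."""
    for name, band in VMID_BANDS.items():
        if vmid in band:
            return name
    return None


for vmid in (1000, 5410, 13999, 5000):
    print(vmid, vmid_band(vmid))
```

A `None` result here only means the VMID is outside the three summarized bands, not that it is unallocated.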
---
## CCIP Fleet Deployment Matrix
### Lane A — Minimum Production Fleet
**Total new CCIP nodes:** 41 (or 43 if you add 2 monitoring nodes)
### VMIDs + Hostnames
| Group | Count | VMIDs | Hostname Pattern |
|-------|------:|------:|------------------|
| Ops/Admin | 2 | 5400-5401 | `ccip-ops-01..02` |
| Monitoring (optional) | 2 | 5402-5403 | `ccip-mon-01..02` |
| Commit Oracles | 16 | 5410-5425 | `ccip-commit-01..16` |
| Execute Oracles | 16 | 5440-5455 | `ccip-exec-01..16` |
| RMN | 7 | 5470-5476 | `ccip-rmn-01..07` |
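The table can be expanded into an explicit VMID-to-hostname registry. This sketch assumes the VMID column lists inclusive, contiguous ranges (for example 5400 through 5401 for Ops/Admin) and that hostnames are numbered in VMID order; `FLEET` and `expand_fleet` are illustrative names.

```python
# (hostname prefix, first VMID, node count), taken from the table above.
FLEET = [
    ("ccip-ops", 5400, 2),
    ("ccip-mon", 5402, 2),      # optional monitoring nodes
    ("ccip-commit", 5410, 16),
    ("ccip-exec", 5440, 16),
    ("ccip-rmn", 5470, 7),
]


def expand_fleet(fleet):
    """Map every VMID to its hostname, e.g. 5410 -> 'ccip-commit-01'."""
    return {
        first + i: f"{prefix}-{i + 1:02d}"
        for prefix, first, count in fleet
        for i in range(count)
    }


registry = expand_fleet(FLEET)
print(len(registry))   # 43 (41 without the two monitoring nodes)
print(registry[5425])  # ccip-commit-16
print(registry[5476])  # ccip-rmn-07
```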
### Private IP Assignments (VLAN-based)
Once VLANs are active, assign:
| Role | VLAN | Subnet |
|------|-----:|--------|
| Ops/Admin | 130 | 10.130.0.0/24 |
| Commit | 132 | 10.132.0.0/24 |
| Execute | 133 | 10.133.0.0/24 |
| RMN | 134 | 10.134.0.0/24 |
> **Interim Plan:** While still on the flat LAN, use 192.168.11.170-212 (cleared 2026-02-01). Migrate to VLANs when ready.
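The interim range holds exactly as many addresses as the full Lane A fleet including the optional monitoring nodes. A small sketch to confirm the sizing and to derive the target subnets, assuming the `10.<vlan>.0.0/24` pattern shown in the table; the helper names are illustrative.

```python
import ipaddress


def interim_pool(first: str = "192.168.11.170", last: str = "192.168.11.212") -> list:
    """List every address in the interim flat-LAN range, inclusive."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    return [str(start + i) for i in range(int(end) - int(start) + 1)]


def vlan_subnet(vlan: int) -> ipaddress.IPv4Network:
    """Target subnet for a CCIP VLAN, following the 10.<vlan>.0.0/24 pattern."""
    return ipaddress.ip_network(f"10.{vlan}.0.0/24")


pool = interim_pool()
fleet_size = 2 + 2 + 16 + 16 + 7  # ops + monitoring + commit + execute + RMN

print(len(pool), fleet_size)      # 43 43
print(vlan_subnet(132))           # 10.132.0.0/24
```

Which node takes which interim address is not specified in this guide and would need to be recorded in IPAM.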
### Egress NAT Mapping (Public blocks placeholder)
- Commit VLAN (10.132.0.0/24) → **Block #2** `<PUBLIC_BLOCK_2>/28`
- Execute VLAN (10.133.0.0/24) → **Block #3** `<PUBLIC_BLOCK_3>/28`
- RMN VLAN (10.134.0.0/24) → **Block #4** `<PUBLIC_BLOCK_4>/28`
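The mapping above can be kept as data and queried by source address, which is useful when generating or reviewing NAT rules. This is a sketch: the block values remain the placeholders from this guide, and `egress_block_for` is a hypothetical helper.

```python
import ipaddress
from typing import Optional

# Source VLAN subnet -> public egress block (placeholders until blocks are assigned).
EGRESS_MAP = {
    ipaddress.ip_network("10.132.0.0/24"): "<PUBLIC_BLOCK_2>/28",  # Commit
    ipaddress.ip_network("10.133.0.0/24"): "<PUBLIC_BLOCK_3>/28",  # Execute
    ipaddress.ip_network("10.134.0.0/24"): "<PUBLIC_BLOCK_4>/28",  # RMN
}


def egress_block_for(source_ip: str) -> Optional[str]:
    """Return the egress block a source address should be NATed to, if any."""
    ip = ipaddress.ip_address(source_ip)
    for subnet, block in EGRESS_MAP.items():
        if ip in subnet:
            return block
    return None


print(egress_block_for("10.133.0.25"))  # <PUBLIC_BLOCK_3>/28
print(egress_block_for("10.130.0.5"))   # None (Ops/Admin has no pool in this list)
```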
See **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** for complete specification.
---
## Deployment Orchestration Workflow
### Deployment Workflow Diagram
```mermaid
flowchart TD
Start[Start Deployment] --> Phase0[Phase 0: Validate Foundation]
Phase0 --> Check1{Foundation Valid?}
Check1 -->|No| Fix1[Fix Issues]
Fix1 --> Phase0
Check1 -->|Yes| Phase1[Phase 1: Enable VLANs]
Phase1 --> Verify1{VLANs Working?}
Verify1 -->|No| FixVLAN[Fix VLAN Config]
FixVLAN --> Phase1
Verify1 -->|Yes| Phase2[Phase 2: Deploy Observability]
Phase2 --> Verify2{Monitoring Active?}
Verify2 -->|No| FixMonitor[Fix Monitoring]
FixMonitor --> Phase2
Verify2 -->|Yes| Phase3[Phase 3: Deploy CCIP Fleet]
Phase3 --> Verify3{CCIP Nodes Running?}
Verify3 -->|No| FixCCIP[Fix CCIP Config]
FixCCIP --> Phase3
Verify3 -->|Yes| Phase4[Phase 4: Deploy Sovereign Tenants]
Phase4 --> Verify4{Tenants Operational?}
Verify4 -->|No| FixTenants[Fix Tenant Config]
FixTenants --> Phase4
Verify4 -->|Yes| Complete[Deployment Complete]
```
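The verify-and-fix loop in the diagram can be sketched as a small driver. Everything here is illustrative: the `verify` and `fix` callables stand in for real checks, and the `max_attempts` cap is an added safeguard that the diagram itself does not define.

```python
from typing import Callable, List, Tuple

Phase = Tuple[str, Callable[[], bool], Callable[[], None]]


def run_phases(phases: List[Phase], max_attempts: int = 3) -> List[str]:
    """Run phases in order; a phase must verify before the next one starts."""
    log = []
    for name, verify, fix in phases:
        for attempt in range(1, max_attempts + 1):
            if verify():
                log.append(f"{name}: ok (attempt {attempt})")
                break
            fix()  # apply a remediation, then re-verify on the next attempt
        else:
            raise RuntimeError(f"{name}: still failing after {max_attempts} attempts")
    return log


# Stand-in checks: Phase 1 fails once, is fixed, then passes.
state = {"vlans": False}
phases = [
    ("Phase 0: Validate Foundation", lambda: True, lambda: None),
    ("Phase 1: Enable VLANs", lambda: state["vlans"], lambda: state.update(vlans=True)),
]

for line in run_phases(phases):
    print(line)
```

Raising on repeated failure stops later phases from running on a broken foundation, which mirrors the loop-back arrows in the diagram.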
### Phase 0 — Validate Foundation
1. ✅ Confirm ER605-A WAN1 static: **76.53.10.34/28**, GW **76.53.10.33**
2. ⏳ Confirm WAN2 on ER605-A (ISP #2) failover
3. ⏳ Confirm ES216G trunks and native VLAN 11 mgmt access is stable
4. ⏳ Confirm Proxmox mgmt reachable only from trusted admin endpoints
### Phase 1 — VLAN Enablement
1. ⏳ Configure ES216G trunk ports
2. ⏳ Enable VLAN-aware bridge `vmbr0` on Proxmox nodes
3. ⏳ Create VLAN interfaces on ER605 for routing + DHCP (where appropriate)
4. ⏳ Move services one domain at a time (start with monitoring)
### Phase 2 — Observability First
1. ⏳ Deploy monitoring stack (Prometheus/Grafana/Loki/Alertmanager)
2. ⏳ Publish Grafana via Cloudflare Access (not public IPs)
3. ⏳ Set alerts for node health, disk, latency, chain metrics
### Phase 3 — CCIP Fleet (Lane A)
1. ⏳ Deploy CCIP Ops/Admin
2. ⏳ Deploy 16 commit nodes (VLAN 132)
3. ⏳ Deploy 16 execute nodes (VLAN 133)
4. ⏳ Deploy 7 RMN nodes (VLAN 134)
5. ⏳ Apply ER605 outbound NAT pools per VLAN using /28 blocks #2-4 placeholders
6. ⏳ Verify node egress identity by role (allowlisting ready)
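Step 6 amounts to checking that the public address a node egresses from falls inside the block assigned to its role. A minimal sketch follows; because Blocks #2-4 are still placeholders, it uses RFC 5737 documentation addresses as stand-ins, and it assumes the observed egress address has already been collected by some other means.

```python
import ipaddress

# Stand-in blocks from the RFC 5737 documentation range, NOT the real allocations.
EXPECTED_BLOCKS = {
    "commit": "203.0.113.0/28",
    "execute": "203.0.113.16/28",
    "rmn": "203.0.113.32/28",
}


def egress_matches_role(role: str, observed_ip: str) -> bool:
    """True if the observed public egress address is inside the role's block."""
    block = ipaddress.ip_network(EXPECTED_BLOCKS[role])
    return ipaddress.ip_address(observed_ip) in block


# Hypothetical observations: (role, address seen by an external service).
observations = [("commit", "203.0.113.5"), ("execute", "203.0.113.5")]
for role, ip in observations:
    status = "ok" if egress_matches_role(role, ip) else "WRONG POOL"
    print(f"{role:8} {ip:15} {status}")
```

A node that reports the wrong pool is egressing through another role's NAT rule, so upstream allowlists keyed on the role's block would reject it.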
### Phase 4 — Sovereign Tenant Rollout
1. ⏳ Stand up Phoenix Sovereign Cloud Band VLANs 200-203
2. ⏳ Apply Block #6 egress NAT
3. ⏳ Enforce tenant isolation (ACLs, deny east-west)
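"Deny east-west" means every tenant VLAN is blocked from every other tenant VLAN in both directions. The sketch below enumerates the required rule pairs, assuming the tenant band is VLANs 200 through 203; the rule dictionaries use a hypothetical, vendor-neutral format and are not ER605 syntax.

```python
from itertools import permutations

# Assumption: sovereign tenant VLANs are 200, 201, 202 and 203.
TENANT_VLANS = range(200, 204)


def east_west_deny_rules(vlans) -> list:
    """One deny rule per ordered (source, destination) pair of distinct VLANs."""
    return [
        {"action": "deny", "src_vlan": src, "dst_vlan": dst}
        for src, dst in permutations(vlans, 2)
    ]


rules = east_west_deny_rules(TENANT_VLANS)
print(len(rules))  # 12 rules for 4 tenant VLANs
print(rules[0])    # {'action': 'deny', 'src_vlan': 200, 'dst_vlan': 201}
```

The rule count grows as n × (n − 1), so a default-deny policy with explicit allows may be easier to maintain as tenants are added.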
---
## Operational Runbooks
### Network Operations
- **[../04-configuration/ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration guide
- **[../06-besu/BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** - Besu allowlist management
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare Zero Trust setup
### Deployment Operations
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP fleet deployment
- **[DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Pre-deployment validation
### Troubleshooting
- **[../09-troubleshooting/TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[../09-troubleshooting/QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
---
## Deliverables
### Completed ✅
- ✅ Authoritative VLAN and subnet plan
- ✅ Public block usage model (with placeholders for 5 blocks)
- ✅ Proxmox cluster topology plan
- ✅ CCIP fleet deployment matrix
- ✅ Stepwise orchestration workflow
### Pending ⏳
- ⏳ Exact NAT/VIP rules (requires public blocks #2-6)
- ⏳ ER605-B role decision (standby edge vs dedicated sovereign edge)
- ⏳ VLAN migration execution
- ⏳ CCIP fleet deployment
---
## Next Steps
### To Finalize Placeholders
Paste the other five /28 blocks in the same format as Block #1:
- Network / Gateway / Usable / Broadcast
And specify:
- ER605-B usage: **standby edge** OR **dedicated sovereign edge**
Then we can produce:
- **Exact NAT pool assignment sheet** per role
- **Break-glass VIP table**
- **Complete ER605 configuration**
---
## Related Documentation
### Prerequisites
- **[../01-getting-started/PREREQUISITES.md](/docs/01-getting-started/PREREQUISITES.md)** - System requirements and prerequisites
- **[../03-deployment/DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Pre-deployment validation checklist
### Architecture
- **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md)** ⭐⭐⭐ - Complete network architecture (authoritative reference)
- **[PHYSICAL_HARDWARE_INVENTORY.md](PHYSICAL_HARDWARE_INVENTORY.md)** ⭐⭐⭐ - Physical hardware inventory and specifications
- **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)** ⭐⭐⭐ - VMID allocation registry
- **[DOMAIN_STRUCTURE.md](DOMAIN_STRUCTURE.md)** ⭐⭐ - Domain structure and DNS assignments
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment specification
### Configuration
- **[../04-configuration/ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare Zero Trust setup
### Operations
- **[../03-deployment/OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md](../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Deployment status
- **[../09-troubleshooting/TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide
### Best Practices
- **[../10-best-practices/RECOMMENDATIONS_AND_SUGGESTIONS.md](../10-best-practices/RECOMMENDATIONS_AND_SUGGESTIONS.md)** - Comprehensive recommendations
- **[../10-best-practices/IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md)** - Implementation checklist
### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index
---
**Document Status:** Complete (v1.1)
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2025-01-20
---
## Change Log
### Version 1.1 (2025-01-20)
- Removed duplicate network architecture content
- Added references to NETWORK_ARCHITECTURE.md
- Added deployment workflow Mermaid diagram
- Added ASCII art process flow
- Added breadcrumb navigation
- Added status indicators
### Version 1.0 (2024-12-15)
- Initial version
- Complete deployment orchestration guide