# Complete Ecosystem Improvement Plan

**Date**: 2026-01-05  
**Status**: 📋 **COMPREHENSIVE PLAN**  
**Scope**: Complete infrastructure ecosystem optimization

---

## Executive Summary

This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:

1. **Workload Distribution** - ml110 is overloaded (34 containers) while the R630 servers are underutilized
2. **IP Conflict Resolution** - the 192.168.11.14 conflict needs investigation
3. **Network Architecture** - VLAN migration and routing improvements
4. **Cloudflare/DNS** - Tunnel configuration, DNS cleanup, and routing fixes
5. **Storage Optimization** - Enable and optimize storage on R630 servers
6. **Service Migration** - Redistribute workloads for better performance
7. **Monitoring & Documentation** - Complete infrastructure visibility

**Current State**: ⚠️ **Suboptimal** - ml110 carries most of the workload (34 of 48 containers) on the least powerful hardware

**Target State**: ✅ **Optimized** - Balanced workload distribution across all servers

---

## Phase 1: Critical Issues Resolution (Week 1-2)

### 1.1 IP Conflict Investigation & Resolution

**Issue**: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox is Debian-based, so the device answering at that address is unlikely to be the Proxmox host expected there

**Actions**:
- [ ] Get MAC address of device using 192.168.11.14
- [ ] Identify device type from MAC vendor database
- [ ] Check physical r630-04 server status (power, console/iDRAC)
- [ ] Verify r630-04 actual IP address and Proxmox installation
- [ ] Check for orphaned VMs on all Proxmox hosts
- [ ] Resolve IP conflict (reassign IP or remove conflicting device)
- [ ] Update documentation with correct IP assignments

**Deliverable**: Resolved IP conflict, identified actual r630-04 status

**Priority**: 🔴 **CRITICAL**

---

### 1.2 Cloudflare Tunnel Configuration Fix

**Issue**: Tunnel `rpc-http-pub.d-bis.org` is DOWN and HTTP endpoints are routed incorrectly

**Actions**:
- [ ] Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx
  - `explorer.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-pub.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-prv.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-admin.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api-2.d-bis.org` → `http://192.168.11.21:80`
  - `mim4u.org` → `http://192.168.11.21:80`
  - `www.mim4u.org` → `http://192.168.11.21:80`
- [ ] Keep WebSocket endpoints routing directly to RPC nodes
- [ ] Verify tunnel health after changes
- [ ] Test all endpoints

**Deliverable**: All tunnels healthy, routing through central Nginx

**Priority**: 🔴 **CRITICAL**

---

### 1.3 DNS Records Cleanup & Migration

**Issues**:
- Missing CNAME records for RPC and DBIS services
- Duplicate A records
- Inconsistent proxy status

**Actions**:
- [ ] Create missing CNAME records:
  - `rpc-http-pub.d-bis.org` → `.cfargotunnel.com`
  - `rpc-ws-pub.d-bis.org` → `.cfargotunnel.com`
  - `rpc-http-prv.d-bis.org` → `.cfargotunnel.com`
  - `rpc-ws-prv.d-bis.org` → `.cfargotunnel.com`
  - `dbis-admin.d-bis.org` → `.cfargotunnel.com`
  - `dbis-api.d-bis.org` → `.cfargotunnel.com`
  - `dbis-api-2.d-bis.org` → `.cfargotunnel.com`
  - `mim4u.org` → `.cfargotunnel.com`
  - `www.mim4u.org` → `.cfargotunnel.com`
- [ ] Remove duplicate A records:
  - `besu.d-bis.org` (keep one IP)
  - `blockscout.d-bis.org` (keep one IP)
  - `explorer.d-bis.org` (keep one IP)
  - `d-bis.org` (keep 20.215.32.15)
- [ ] Enable proxy (orange cloud) for all public services
- [ ] Standardize TTL settings

**Deliverable**: Clean DNS configuration, all services accessible via tunnels

**Priority**: 🔴 **CRITICAL**

---

## Phase 2: Storage & Infrastructure Optimization (Week 2-3)

### 2.1 Storage Activation on R630 Servers

**Issue**: Storage pools disabled on r630-01 and r630-02

**Actions**:
- [ ] **r630-01**: Enable local-lvm and thin1 storage pools
- [ ] **r630-02**: Verify and enable thin storage pools
- [ ] Verify storage is accessible and working
- [ ] Test VM creation on both hosts
- [ ] Document storage configuration

**Deliverable**: All storage pools active and ready for VM deployment

**Priority**: 🔴 **HIGH** (blocks workload migration)

---

### 2.2 Cluster Configuration Verification

**Actions**:
- [ ] Verify cluster recognizes all hostnames correctly
- [ ] Update any remaining references to old hostnames (pve, pve2)
- [ ] Verify quorum is maintained
- [ ] Test cluster operations (migration, HA)
- [ ] Document cluster configuration

**Deliverable**: Cluster fully operational with correct hostnames

**Priority**: 🟡 **MEDIUM**

---

## Phase 3: Workload Redistribution (Week 3-5)

### 3.1 Workload Analysis & Migration Plan

**Current State**:
- **ml110**: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
- **r630-01**: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
- **r630-02**: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz

**Target Distribution**:

| Server | Current | Target | Migration |
|--------|---------|--------|-----------|
| **ml110** | 34 containers | 10-15 containers | Keep lightweight/management |
| **r630-01** | 3 containers | 15-20 containers | Add medium workload VMs |
| **r630-02** | 11 containers | 15-20 containers | Add heavy workload VMs |

**Migration Strategy**:

#### Keep on ml110 (Management/Infrastructure):
- VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
- Lightweight management services

#### Migrate to r630-01 (Medium Workload):
- Besu Validators (1000-1004): 40GB RAM, 20 cores total
- DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
- Application Services (7800-7811): ~30GB RAM

#### Migrate to r630-02 (Heavy Workload):
- Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
- Besu Sentries (1500-1503): 16GB RAM, 8 cores total
- Blockscout (5000): Database-intensive
- Firefly (6200-6201): Web3 gateway services

**Actions**:
- [ ] Create detailed migration plan with downtime windows
- [ ] Backup all containers before migration
- [ ] Test migration process with one container first
- [ ] Migrate containers in batches (by service type)
- [ ] Verify services after migration
- [ ] Update documentation with new locations

**Deliverable**: Balanced workload distribution across all servers

**Priority**: 🔴 **HIGH** (improves performance significantly)

---

## Phase 4: Network Architecture Improvements (Week 4-6)

### 4.1 VLAN Migration Planning

**Current**: Flat LAN (192.168.11.0/24)

**Target**: VLAN-based segmentation (16+ VLANs)

**Actions**:
- [ ] Review VLAN plan from NETWORK_ARCHITECTURE.md
- [ ] Configure ES216G switches for VLAN trunking
- [ ] Enable VLAN-aware bridge on Proxmox hosts
- [ ] Create VLAN interfaces on ER605 router
- [ ] Migrate services to appropriate VLANs
- [ ] Test inter-VLAN routing
- [ ] Update firewall rules

**Key VLANs**:
- VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
- VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
- VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
- VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
- VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
- VLAN 130-134: CCIP networks
- VLAN 200-203: Sovereign tenants

**Deliverable**: VLAN-based network segmentation implemented

**Priority**: 🟡 **MEDIUM** (improves security and organization)

---

## Phase 5: Service Optimization (Week 5-7)

### 5.1 Nginx Architecture Review

**Current**: Multiple Nginx instances
- Central Nginx (VMID 105): Nginx Proxy Manager
- Blockscout Nginx (VMID 5000): Local Nginx
- MIM Nginx (VMID 7810): Local Nginx
- RPC Nginx (VMIDs 2500-2502): SSL termination

**Actions**:
- [ ] Document purpose of each Nginx instance
- [ ] Verify all routing is correct
- [ ] Consider consolidation opportunities
- [ ] Standardize SSL certificate management
- [ ] Optimize Nginx configurations

**Deliverable**: Documented and optimized Nginx architecture

**Priority**: 🟢 **LOW**

---

## Phase 6: Documentation & Automation (Week 6-8)

### 6.1 Infrastructure Documentation

**Actions**:
- [ ] Create complete infrastructure map
- [ ] Document all IP assignments
- [ ] Document all service locations
- [ ] Create network topology diagrams
- [ ] Document all configurations
- [ ] Create runbooks for common operations

**Deliverable**: Complete infrastructure documentation

**Priority**: 🟡 **MEDIUM**

---

## Success Metrics

### Performance Improvements

| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| ml110 CPU Usage | High (75% memory) | <50% | 33% reduction |
| ml110 Memory Usage | 75% | <50% | 33% reduction |
| r630-01 Utilization | 1% | 40-60% | Better resource use |
| r630-02 Utilization | 2% | 40-60% | Better resource use |
| Average Response Time | Baseline | -20% | Faster responses |

### Availability Improvements

| Metric | Current | Target |
|--------|---------|--------|
| Cloudflare Tunnel Uptime | 40-60% | >99% |
| Service Availability | Variable | >99.5% |
| DNS Resolution | Some issues | 100% |

---

## Timeline Summary

| Phase | Duration | Key Deliverables |
|-------|----------|------------------|
| **Phase 1** | Weeks 1-2 | Critical issues resolved |
| **Phase 2** | Weeks 2-3 | Storage optimized, infrastructure ready |
| **Phase 3** | Weeks 3-5 | Workload redistributed |
| **Phase 4** | Weeks 4-6 | Network architecture improved |
| **Phase 5** | Weeks 5-7 | Services optimized |
| **Phase 6** | Weeks 6-8 | Documentation complete |

**Total Timeline**: 8 weeks (with some phases overlapping)

---

## Next Steps

### Immediate (This Week)

1. **Start IP Conflict Investigation**
   - Get MAC address of 192.168.11.14
   - Check physical r630-04 status
   - Identify what's using the IP

2. **Fix Cloudflare Tunnel**
   - Update tunnel routing configuration
   - Test all endpoints

3. **Clean Up DNS**
   - Remove duplicate records
   - Create missing CNAME records

---

**Last Updated**: 2026-01-05  
**Status**: 📋 **PLAN READY FOR EXECUTION**
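
As a quick sanity check on the Section 3.1 migration strategy, the per-group RAM and core figures can be tallied per target host. The sketch below is illustrative only: the `Group` class, the `PLAN` list, and the `tally` function are hypothetical names, approximate figures (such as ~40GB) are treated as exact, and groups the plan leaves unsized (Blockscout, Firefly, and the core count for Application Services) are reported rather than estimated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Group:
    """One service group from the Section 3.1 migration strategy."""
    name: str
    target: str
    ram_gb: Optional[int]  # None where the plan gives no RAM figure
    cores: Optional[int]   # None where the plan gives no core figure


# Figures copied from Section 3.1; "~" values are treated as exact here.
PLAN: List[Group] = [
    Group("Besu Validators (1000-1004)", "r630-01", 40, 20),
    Group("DBIS Core Services (10100-10151)", "r630-01", 40, 20),
    Group("Application Services (7800-7811)", "r630-01", 30, None),
    Group("Besu RPC Nodes (2500-2502)", "r630-02", 48, 12),
    Group("Besu Sentries (1500-1503)", "r630-02", 16, 8),
    Group("Blockscout (5000)", "r630-02", None, None),
    Group("Firefly (6200-6201)", "r630-02", None, None),
]


def tally(plan: List[Group]) -> Dict[str, dict]:
    """Sum the known RAM and core figures per target host.

    Groups with a missing RAM or core figure are listed under "unsized"
    so the gap stays visible instead of silently counting as zero.
    """
    totals: Dict[str, dict] = {}
    for group in plan:
        host = totals.setdefault(
            group.target, {"ram_gb": 0, "cores": 0, "unsized": []}
        )
        host["ram_gb"] += group.ram_gb or 0
        host["cores"] += group.cores or 0
        if group.ram_gb is None or group.cores is None:
            host["unsized"].append(group.name)
    return totals


if __name__ == "__main__":
    for host_name, summary in sorted(tally(PLAN).items()):
        print(host_name, summary)
```

With the figures listed in the plan, this gives 110GB RAM and 40 cores for r630-01, and 64GB RAM and 20 cores for r630-02, before the unsized groups are accounted for. Comparing those totals against each host's core count (32 and 56 respectively) and available memory is worth doing before the migration batches are scheduled.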