Complete Ecosystem Improvement Plan
Date: 2026-01-05
Status: 📋 COMPREHENSIVE PLAN
Scope: Complete infrastructure ecosystem optimization
Executive Summary
This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:
- Workload Distribution - ml110 is overloaded (34 containers) while R630 servers are underutilized
- IP Conflict Resolution - 192.168.11.14 conflict needs investigation
- Network Architecture - VLAN migration and routing improvements
- Cloudflare/DNS - Tunnel configuration, DNS cleanup, and routing fixes
- Storage Optimization - Enable and optimize storage on R630 servers
- Service Migration - Redistribute workloads for better performance
- Monitoring & Documentation - Complete infrastructure visibility
Current State: ⚠️ Suboptimal - ml110 runs most of the workload (34 of 48 containers) on the least powerful hardware
Target State: ✅ Optimized - Balanced workload distribution across all servers
Phase 1: Critical Issues Resolution (Week 1-2)
1.1 IP Conflict Investigation & Resolution
Issue: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox is Debian-based, so the device answering is unlikely to be the Proxmox host expected at that address
Actions:
- Get MAC address of device using 192.168.11.14
- Identify device type from MAC vendor database
- Check physical r630-04 server status (power, console/iDRAC)
- Verify r630-04 actual IP address and Proxmox installation
- Check for orphaned VMs on all Proxmox hosts
- Resolve IP conflict (reassign IP or remove conflicting device)
- Update documentation with correct IP assignments
Deliverable: Resolved IP conflict, identified actual r630-04 status
Priority: 🔴 CRITICAL
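The first two actions (get the MAC address, then identify the vendor) can be sketched as below. This is a minimal illustration, not a finished tool: it assumes the neighbor table has been captured as text from `ip neigh show` on a host in the same LAN, and the interface name `vmbr0` and the MAC addresses in the sample are made up. The vendor lookup itself is left to whichever OUI database is at hand.

```python
import re
from typing import Optional

# Matches lines printed by `ip neigh show`, for example:
# "192.168.11.14 dev vmbr0 lladdr aa:bb:cc:dd:ee:ff REACHABLE"
NEIGH_LINE = re.compile(
    r"^(?P<ip>\S+)\s+dev\s+(?P<dev>\S+)\s+lladdr\s+(?P<mac>[0-9a-fA-F:]{17})\b"
)


def find_mac(neigh_output: str, ip: str) -> Optional[str]:
    """Return the MAC address recorded for `ip`, or None if there is none."""
    for line in neigh_output.splitlines():
        match = NEIGH_LINE.match(line.strip())
        if match and match.group("ip") == ip:
            return match.group("mac").lower()
    return None


def oui_prefix(mac: str) -> str:
    """Return the first three octets (the vendor OUI) in upper case."""
    return mac.upper()[:8]


# Hypothetical sample output; interface name and MACs are illustrative only.
SAMPLE = """\
192.168.11.1 dev vmbr0 lladdr 00:11:22:33:44:55 STALE
192.168.11.14 dev vmbr0 lladdr AA:BB:CC:DD:EE:FF REACHABLE
192.168.11.99 dev vmbr0 FAILED
"""

mac = find_mac(SAMPLE, "192.168.11.14")
if mac is not None:
    print(f"192.168.11.14 is at {mac} (OUI {oui_prefix(mac)})")
```

Entries without an `lladdr` field (such as `FAILED` neighbors) are skipped, so a `None` result means the address has not answered ARP from that host recently.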
1.2 Cloudflare Tunnel Configuration Fix
Issue: The tunnel for rpc-http-pub.d-bis.org is DOWN and its routing is misconfigured
Actions:
- Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx:
  - explorer.d-bis.org → http://192.168.11.21:80
  - rpc-http-pub.d-bis.org → http://192.168.11.21:80
  - rpc-http-prv.d-bis.org → http://192.168.11.21:80
  - dbis-admin.d-bis.org → http://192.168.11.21:80
  - dbis-api.d-bis.org → http://192.168.11.21:80
  - dbis-api-2.d-bis.org → http://192.168.11.21:80
  - mim4u.org → http://192.168.11.21:80
  - www.mim4u.org → http://192.168.11.21:80
- Keep WebSocket endpoints routing directly to RPC nodes
- Verify tunnel health after changes
- Test all endpoints
Deliverable: All tunnels healthy, routing through central Nginx
Priority: 🔴 CRITICAL
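As an illustration of the routing change, the sketch below builds the `ingress` section of a cloudflared configuration from the hostnames listed above. It assumes the tunnel is managed through a local `config.yml` rather than the Cloudflare dashboard, which this plan does not state. The WebSocket hostnames are left out because their backend addresses are not given here. The helper names `build_ingress` and `to_yaml` are hypothetical.

```python
from typing import Dict, List

CENTRAL_NGINX = "http://192.168.11.21:80"

# HTTP hostnames that should route through the central Nginx.
HTTP_HOSTNAMES = [
    "explorer.d-bis.org",
    "rpc-http-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]


def build_ingress(hostnames: List[str], service: str) -> List[Dict[str, str]]:
    """Return ingress rules: one per hostname, then the required catch-all."""
    rules = [{"hostname": name, "service": service} for name in hostnames]
    # cloudflared expects the final rule to match everything.
    rules.append({"service": "http_status:404"})
    return rules


def to_yaml(rules: List[Dict[str, str]]) -> str:
    """Render the rules as the `ingress:` block of a config.yml."""
    lines = ["ingress:"]
    for rule in rules:
        if "hostname" in rule:
            lines.append(f"  - hostname: {rule['hostname']}")
            lines.append(f"    service: {rule['service']}")
        else:
            lines.append(f"  - service: {rule['service']}")
    return "\n".join(lines)


ingress = build_ingress(HTTP_HOSTNAMES, CENTRAL_NGINX)
print(to_yaml(ingress))
```

Generating the block from one list keeps the eight hostnames pointed at the same target and makes a missing catch-all rule harder to overlook.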
1.3 DNS Records Cleanup & Migration
Issues:
- Missing CNAME records for RPC and DBIS services
- Duplicate A records
- Inconsistent proxy status
Actions:
- Create missing CNAME records:
  - rpc-http-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
  - rpc-ws-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
  - rpc-http-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
  - rpc-ws-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
  - dbis-admin.d-bis.org → <tunnel-id>.cfargotunnel.com
  - dbis-api.d-bis.org → <tunnel-id>.cfargotunnel.com
  - dbis-api-2.d-bis.org → <tunnel-id>.cfargotunnel.com
  - mim4u.org → <tunnel-id>.cfargotunnel.com
  - www.mim4u.org → <tunnel-id>.cfargotunnel.com
- Remove duplicate A records:
  - besu.d-bis.org (keep one IP)
  - blockscout.d-bis.org (keep one IP)
  - explorer.d-bis.org (keep one IP)
  - d-bis.org (keep 20.215.32.15)
- Enable proxy (orange cloud) for all public services
- Standardize TTL settings
Deliverable: Clean DNS configuration, all services accessible via tunnels
Priority: 🔴 CRITICAL
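Finding the duplicate A records can be sketched as a small check over an exported record list. The record shape (`name`, `type`, `content`) and every IP in the sample except 20.215.32.15 are assumptions for illustration; the real zone export may look different.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def duplicate_a_records(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each name that has more than one distinct A record to its IPs."""
    ips_by_name = defaultdict(set)
    for record in records:
        if record["type"] == "A":
            ips_by_name[record["name"]].add(record["content"])
    return {
        name: sorted(ips)
        for name, ips in ips_by_name.items()
        if len(ips) > 1
    }


# Hypothetical export; the 203.0.113.x addresses are documentation-range
# placeholders, not the real values in the zone.
SAMPLE_RECORDS = [
    {"name": "d-bis.org", "type": "A", "content": "20.215.32.15"},
    {"name": "d-bis.org", "type": "A", "content": "203.0.113.10"},
    {"name": "besu.d-bis.org", "type": "A", "content": "203.0.113.20"},
    {"name": "besu.d-bis.org", "type": "A", "content": "203.0.113.21"},
    {"name": "rpc-ws-pub.d-bis.org", "type": "CNAME", "content": "<tunnel-id>.cfargotunnel.com"},
]

for name, ips in duplicate_a_records(SAMPLE_RECORDS).items():
    print(f"{name}: {', '.join(ips)}")
```

The function only reports duplicates. Deciding which IP to keep remains a manual step, as listed in the actions above.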
Phase 2: Storage & Infrastructure Optimization (Week 2-3)
2.1 Storage Activation on R630 Servers
Issue: Storage pools disabled on r630-01 and r630-02
Actions:
- r630-01: Enable local-lvm and thin1 storage pools
- r630-02: Verify and enable thin storage pools
- Verify storage is accessible and working
- Test VM creation on both hosts
- Document storage configuration
Deliverable: All storage pools active and ready for VM deployment
Priority: 🔴 HIGH (blocks workload migration)
2.2 Cluster Configuration Verification
Actions:
- Verify cluster recognizes all hostnames correctly
- Update any remaining references to old hostnames (pve, pve2)
- Verify quorum is maintained
- Test cluster operations (migration, HA)
- Document cluster configuration
Deliverable: Cluster fully operational with correct hostnames
Priority: 🟡 MEDIUM
Phase 3: Workload Redistribution (Week 3-5)
3.1 Workload Analysis & Migration Plan
Current State:
- ml110: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
- r630-01: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
- r630-02: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz
Target Distribution:
| Server | Current | Target | Migration |
|---|---|---|---|
| ml110 | 34 containers | 10-15 containers | Keep lightweight/management |
| r630-01 | 3 containers | 15-20 containers | Add medium workload VMs |
| r630-02 | 11 containers | 15-20 containers | Add heavy workload VMs |
Migration Strategy:
Keep on ml110 (Management/Infrastructure):
- VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
- Lightweight management services
Migrate to r630-01 (Medium Workload):
- Besu Validators (1000-1004): 40GB RAM, 20 cores total
- DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
- Application Services (7800-7811): ~30GB RAM
Migrate to r630-02 (Heavy Workload):
- Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
- Besu Sentries (1500-1503): 16GB RAM, 8 cores total
- Blockscout (5000): Database-intensive
- Firefly (6200-6201): Web3 gateway services
Actions:
- Create detailed migration plan with downtime windows
- Backup all containers before migration
- Test migration process with one container first
- Migrate containers in batches (by service type)
- Verify services after migration
- Update documentation with new locations
Deliverable: Balanced workload distribution across all servers
Priority: 🔴 HIGH (improves performance significantly)
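Before scheduling batches, it can help to total the RAM each target host would take on. The sketch below sums only the figures stated in the migration strategy above. Blockscout and Firefly are omitted because no RAM figure is given for them, and the approximate values (~40GB, ~30GB) are treated as exact for the arithmetic.

```python
from typing import Dict

# RAM figures in GB, copied from the migration strategy.
PLANNED_MOVES: Dict[str, Dict[str, int]] = {
    "r630-01": {
        "Besu Validators (1000-1004)": 40,
        "DBIS Core Services (10100-10151)": 40,  # listed as ~40GB
        "Application Services (7800-7811)": 30,  # listed as ~30GB
    },
    "r630-02": {
        "Besu RPC Nodes (2500-2502)": 48,
        "Besu Sentries (1500-1503)": 16,
        # Blockscout (5000) and Firefly (6200-6201): no RAM figure stated.
    },
}


def planned_ram_gb(moves: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the stated RAM per target host."""
    return {host: sum(groups.values()) for host, groups in moves.items()}


for host, total in planned_ram_gb(PLANNED_MOVES).items():
    print(f"{host}: {total} GB planned")
```

The totals come to 110 GB for r630-01 and 64 GB for r630-02. These are the figures as listed, which may be allocations rather than measured usage, so they are best read as an upper-bound estimate when checking against each host's free memory.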
Phase 4: Network Architecture Improvements (Week 4-6)
4.1 VLAN Migration Planning
Current: Flat LAN (192.168.11.0/24)
Target: VLAN-based segmentation (16+ VLANs)
Actions:
- Review VLAN plan from NETWORK_ARCHITECTURE.md
- Configure ES216G switches for VLAN trunking
- Enable VLAN-aware bridge on Proxmox hosts
- Create VLAN interfaces on ER605 router
- Migrate services to appropriate VLANs
- Test inter-VLAN routing
- Update firewall rules
Key VLANs:
- VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
- VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
- VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
- VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
- VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
- VLAN 130-134: CCIP networks
- VLAN 200-203: Sovereign tenants
Deliverable: VLAN-based network segmentation implemented
Priority: 🟡 MEDIUM (improves security and organization)
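A quick sanity check on the addressing plan is to confirm that the VLAN subnets do not overlap and to see which VLAN an existing address would fall into. The sketch below uses Python's standard `ipaddress` module and covers only the VLANs whose subnets are listed above; the CCIP and sovereign tenant ranges are not given here and are left out.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Optional, Tuple

# VLAN ID -> (name, subnet), as listed under Key VLANs.
VLANS: Dict[int, Tuple[str, str]] = {
    11: ("MGMT-LAN", "192.168.11.0/24"),
    110: ("BESU-VAL", "10.110.0.0/24"),
    111: ("BESU-SEN", "10.111.0.0/24"),
    112: ("BESU-RPC", "10.112.0.0/24"),
    120: ("BLOCKSCOUT", "10.120.0.0/24"),
}


def overlapping_vlans(vlans: Dict[int, Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return every pair of VLAN IDs whose subnets overlap."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def vlan_for_ip(vlans: Dict[int, Tuple[str, str]], ip: str) -> Optional[int]:
    """Return the VLAN ID whose subnet contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for vid, (_, cidr) in vlans.items():
        if addr in ipaddress.ip_network(cidr):
            return vid
    return None


print("Overlaps:", overlapping_vlans(VLANS))
print("192.168.11.21 is in VLAN", vlan_for_ip(VLANS, "192.168.11.21"))
```

The same lookup can be run over the documented IP assignments to list which services still sit in the legacy 192.168.11.0/24 range after migration.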
Phase 5: Service Optimization (Week 5-7)
5.1 Nginx Architecture Review
Current: Multiple Nginx instances
- Central Nginx (VMID 105): Nginx Proxy Manager
- Blockscout Nginx (VMID 5000): Local Nginx
- MIM Nginx (VMID 7810): Local Nginx
- RPC Nginx (VMIDs 2500-2502): SSL termination
Actions:
- Document purpose of each Nginx instance
- Verify all routing is correct
- Consider consolidation opportunities
- Standardize SSL certificate management
- Optimize Nginx configurations
Deliverable: Documented and optimized Nginx architecture
Priority: 🟢 LOW
Phase 6: Documentation & Automation (Week 6-8)
6.1 Infrastructure Documentation
Actions:
- Create complete infrastructure map
- Document all IP assignments
- Document all service locations
- Create network topology diagrams
- Document all configurations
- Create runbooks for common operations
Deliverable: Complete infrastructure documentation
Priority: 🟡 MEDIUM
Success Metrics
Performance Improvements
| Metric | Current | Target | Improvement |
|---|---|---|---|
| ml110 CPU Usage | High (not quantified) | <50% | To be measured against baseline |
| ml110 Memory Usage | 75% | <50% | 33% reduction |
| r630-01 Utilization | 1% | 40-60% | Better resource use |
| r630-02 Utilization | 2% | 40-60% | Better resource use |
| Average Response Time | Baseline | -20% | Faster responses |
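The "33% reduction" figure for memory is a relative reduction, not a drop in percentage points. A minimal sketch of the arithmetic, using the 75% current value and the 50% target boundary from the table:

```python
def relative_reduction(current: float, target: float) -> float:
    """Reduction from current to target, as a percentage of current."""
    return (current - target) / current * 100


def point_reduction(current: float, target: float) -> float:
    """Reduction from current to target, in percentage points."""
    return current - target


current_pct, target_pct = 75.0, 50.0
print(f"Relative: {relative_reduction(current_pct, target_pct):.1f}%")
print(f"Points:   {point_reduction(current_pct, target_pct):.0f}")
```

Going from 75% to 50% is a drop of 25 percentage points, which is one third of the starting value.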
Availability Improvements
| Metric | Current | Target |
|---|---|---|
| Cloudflare Tunnel Uptime | 40-60% | >99% |
| Service Availability | Variable | >99.5% |
| DNS Resolution | Some issues | 100% |
Timeline Summary
| Phase | Duration | Key Deliverables |
|---|---|---|
| Phase 1 | Weeks 1-2 | Critical issues resolved |
| Phase 2 | Weeks 2-3 | Storage optimized, infrastructure ready |
| Phase 3 | Weeks 3-5 | Workload redistributed |
| Phase 4 | Weeks 4-6 | Network architecture improved |
| Phase 5 | Weeks 5-7 | Services optimized |
| Phase 6 | Weeks 6-8 | Documentation complete |
Total Timeline: 8 weeks (with some phases overlapping)
Next Steps
Immediate (This Week)
- Start IP Conflict Investigation
  - Get MAC address of 192.168.11.14
  - Check physical r630-04 status
  - Identify what's using the IP
- Fix Cloudflare Tunnel
  - Update tunnel routing configuration
  - Test all endpoints
- Clean Up DNS
  - Remove duplicate records
  - Create missing CNAME records
Last Updated: 2026-01-05
Status: 📋 PLAN READY FOR EXECUTION