Complete Ecosystem Improvement Plan

Date: 2026-01-05
Status: 📋 COMPREHENSIVE PLAN
Scope: Complete infrastructure ecosystem optimization


Executive Summary

This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:

  1. Workload Distribution - ml110 is overloaded (34 containers) while R630 servers are underutilized
  2. IP Conflict Resolution - 192.168.11.14 conflict needs investigation
  3. Network Architecture - VLAN migration and routing improvements
  4. Cloudflare/DNS - Tunnel configuration, DNS cleanup, and routing fixes
  5. Storage Optimization - Enable and optimize storage on R630 servers
  6. Service Migration - Redistribute workloads for better performance
  7. Monitoring & Documentation - Complete infrastructure visibility

Current State: ⚠️ Suboptimal - ml110 runs 34 of 48 containers (about 71% of the workload) on the least powerful hardware
Target State: Optimized - Balanced workload distribution across all servers


Phase 1: Critical Issues Resolution (Week 1-2)

1.1 IP Conflict Investigation & Resolution

Issue: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox is Debian-based, so the device answering on that address is probably not the expected r630-04 Proxmox host

Actions:

  • Get MAC address of device using 192.168.11.14
  • Identify device type from MAC vendor database
  • Check physical r630-04 server status (power, console/iDRAC)
  • Verify r630-04 actual IP address and Proxmox installation
  • Check for orphaned VMs on all Proxmox hosts
  • Resolve IP conflict (reassign IP or remove conflicting device)
  • Update documentation with correct IP assignments

Deliverable: Resolved IP conflict, identified actual r630-04 status

Priority: 🔴 CRITICAL
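
The banner and MAC vendor checks above can be partly scripted. The following is a minimal sketch, assuming Python 3 on a machine in the same LAN; the function names are illustrative and not part of any existing tooling. The MAC address itself still has to be read from the neighbor table (for example with `ip neigh`) before its vendor prefix can be looked up.

```python
import socket


def read_ssh_banner(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Return the SSH identification string the server sends on connect."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return sock.recv(256).decode("ascii", errors="replace").strip()


def guess_os_from_banner(banner: str) -> str:
    """Rough OS guess from the distro tag that packaged OpenSSH builds often append."""
    lowered = banner.lower()
    for distro in ("ubuntu", "debian"):
        if distro in lowered:
            return distro
    return "unknown"


def oui_prefix(mac: str) -> str:
    """Normalize a MAC address and return its first three octets (the OUI)."""
    hex_digits = "".join(c for c in mac if c in "0123456789abcdefABCDEF").upper()
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(hex_digits[i:i + 2] for i in (0, 2, 4))
```

Calling `guess_os_from_banner(read_ssh_banner("192.168.11.14"))` would confirm the Ubuntu observation, and the value returned by `oui_prefix` is the key to search for in a MAC vendor database. A banner can be customized or suppressed, so treat the result as a hint rather than proof.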


1.2 Cloudflare Tunnel Configuration Fix

Issue: The tunnel for rpc-http-pub.d-bis.org is DOWN, and its HTTP traffic is routed incorrectly

Actions:

  • Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx
    • explorer.d-bis.org → http://192.168.11.21:80
    • rpc-http-pub.d-bis.org → http://192.168.11.21:80
    • rpc-http-prv.d-bis.org → http://192.168.11.21:80
    • dbis-admin.d-bis.org → http://192.168.11.21:80
    • dbis-api.d-bis.org → http://192.168.11.21:80
    • dbis-api-2.d-bis.org → http://192.168.11.21:80
    • mim4u.org → http://192.168.11.21:80
    • www.mim4u.org → http://192.168.11.21:80
  • Keep WebSocket endpoints routing directly to RPC nodes
  • Verify tunnel health after changes
  • Test all endpoints

Deliverable: All tunnels healthy, routing through central Nginx

Priority: 🔴 CRITICAL
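
The HTTP routing listed above could be generated rather than typed by hand. The sketch below assumes the tunnel is locally managed through a cloudflared `config.yml` with an `ingress` section; if the tunnel is managed from the Cloudflare dashboard, the same mapping would be entered there instead. The helper names are illustrative. WebSocket hostnames are left out because this plan does not list their target addresses.

```python
NGINX_ORIGIN = "http://192.168.11.21:80"  # central Nginx, as stated in the plan

HTTP_HOSTNAMES = [
    "explorer.d-bis.org",
    "rpc-http-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]


def build_ingress(hostnames, origin):
    """Build ingress rules: one per hostname, plus the required catch-all."""
    rules = [{"hostname": name, "service": origin} for name in hostnames]
    # cloudflared expects the final rule to match everything.
    rules.append({"service": "http_status:404"})
    return rules


def render_yaml(rules):
    """Render the rules as the `ingress:` block of a config.yml (no YAML library needed)."""
    lines = ["ingress:"]
    for rule in rules:
        items = list(rule.items())
        first_key, first_value = items[0]
        lines.append(f"  - {first_key}: {first_value}")
        for key, value in items[1:]:
            lines.append(f"    {key}: {value}")
    return "\n".join(lines)
```

Printing `render_yaml(build_ingress(HTTP_HOSTNAMES, NGINX_ORIGIN))` yields a block that can be reviewed and merged into the existing tunnel configuration, keeping the hostname list in one place.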


1.3 DNS Records Cleanup & Migration

Issues:

  • Missing CNAME records for RPC and DBIS services
  • Duplicate A records
  • Inconsistent proxy status

Actions:

  • Create missing CNAME records:
    • rpc-http-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
    • rpc-ws-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
    • rpc-http-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
    • rpc-ws-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
    • dbis-admin.d-bis.org → <tunnel-id>.cfargotunnel.com
    • dbis-api.d-bis.org → <tunnel-id>.cfargotunnel.com
    • dbis-api-2.d-bis.org → <tunnel-id>.cfargotunnel.com
    • mim4u.org → <tunnel-id>.cfargotunnel.com
    • www.mim4u.org → <tunnel-id>.cfargotunnel.com
  • Remove duplicate A records:
    • besu.d-bis.org (keep one IP)
    • blockscout.d-bis.org (keep one IP)
    • explorer.d-bis.org (keep one IP)
    • d-bis.org (keep 20.215.32.15)
  • Enable proxy (orange cloud) for all public services
  • Standardize TTL settings

Deliverable: Clean DNS configuration, all services accessible via tunnels

Priority: 🔴 CRITICAL
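
Before changing records, it may help to audit an export of the zone. The sketch below assumes the records are available as a list of dictionaries with `type`, `name`, and `content` keys (a hypothetical export shape, chosen to resemble common DNS API responses). It reports names with more than one A record and the CNAME names from the list above that do not exist yet.

```python
from collections import defaultdict

# CNAME names the plan wants to exist, each pointing at <tunnel-id>.cfargotunnel.com.
WANTED_CNAMES = [
    "rpc-http-pub.d-bis.org",
    "rpc-ws-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "rpc-ws-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]


def find_duplicate_a_records(records):
    """Map each name that has more than one distinct A record to its sorted IPs."""
    ips_by_name = defaultdict(set)
    for record in records:
        if record["type"] == "A":
            ips_by_name[record["name"]].add(record["content"])
    return {name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1}


def missing_cnames(records, wanted_names):
    """Return the wanted names that have no CNAME record yet, in the given order."""
    existing = {record["name"] for record in records if record["type"] == "CNAME"}
    return [name for name in wanted_names if name not in existing]
```

The output of `find_duplicate_a_records` is only a worklist: which IP to keep for each name remains a manual decision, except for d-bis.org, where the plan already names 20.215.32.15.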


Phase 2: Storage & Infrastructure Optimization (Week 2-3)

2.1 Storage Activation on R630 Servers

Issue: Storage pools disabled on r630-01 and r630-02

Actions:

  • r630-01: Enable local-lvm and thin1 storage pools
  • r630-02: Verify and enable thin storage pools
  • Verify storage is accessible and working
  • Test VM creation on both hosts
  • Document storage configuration

Deliverable: All storage pools active and ready for VM deployment

Priority: 🔴 HIGH (blocks workload migration)
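
A minimal sketch of the enable step for r630-01 follows. It assumes the pools already exist in the Proxmox storage configuration and are merely flagged as disabled, in which case `pvesm set <storage> --disable 0` clears the flag. If a pool is instead missing or restricted to other nodes, a different change is needed. The pool names for r630-02 are not given in this plan, so they are omitted. The script defaults to a dry run that only prints the commands.

```python
import subprocess

# Pool names taken from the plan; r630-02 pools are not named there.
POOLS_TO_ENABLE = {"r630-01": ["local-lvm", "thin1"]}


def enable_commands(pools):
    """Build one `pvesm set ... --disable 0` command per storage pool."""
    return [["pvesm", "set", pool, "--disable", "0"] for pool in pools]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Running `run_all(enable_commands(POOLS_TO_ENABLE["r630-01"]))` on a Proxmox host prints the two commands for review. Afterwards, `pvesm status` should list the pools as active.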


2.2 Cluster Configuration Verification

Actions:

  • Verify cluster recognizes all hostnames correctly
  • Update any remaining references to old hostnames (pve, pve2)
  • Verify quorum is maintained
  • Test cluster operations (migration, HA)
  • Document cluster configuration

Deliverable: Cluster fully operational with correct hostnames

Priority: 🟡 MEDIUM
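
The quorum check can be automated by reading the output of `pvecm status`. The sketch below assumes that output contains `Key: value` lines, including a `Quorate:` line, and parses it generically so that small layout differences between versions do not matter. The function names are illustrative.

```python
import subprocess


def cluster_status_text() -> str:
    """Run `pvecm status` on a Proxmox node and return its text output."""
    result = subprocess.run(
        ["pvecm", "status"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_status(text: str) -> dict:
    """Collect every `Key: value` line into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_quorate(text: str) -> bool:
    """True when the status output reports `Quorate: Yes`."""
    return parse_status(text).get("Quorate", "").lower() == "yes"
```

On a cluster node, `is_quorate(cluster_status_text())` gives a single boolean that can gate migration or HA tests.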


Phase 3: Workload Redistribution (Week 3-5)

3.1 Workload Analysis & Migration Plan

Current State:

  • ml110: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
  • r630-01: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
  • r630-02: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz

Target Distribution:

| Server | Current | Target | Migration |
|---|---|---|---|
| ml110 | 34 containers | 10-15 containers | Keep lightweight/management |
| r630-01 | 3 containers | 15-20 containers | Add medium workload VMs |
| r630-02 | 11 containers | 15-20 containers | Add heavy workload VMs |

Migration Strategy:

Keep on ml110 (Management/Infrastructure):

  • VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
  • Lightweight management services

Migrate to r630-01 (Medium Workload):

  • Besu Validators (1000-1004): 40GB RAM, 20 cores total
  • DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
  • Application Services (7800-7811): ~30GB RAM

Migrate to r630-02 (Heavy Workload):

  • Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
  • Besu Sentries (1500-1503): 16GB RAM, 8 cores total
  • Blockscout (5000): Database-intensive
  • Firefly (6200-6201): Web3 gateway services

Actions:

  • Create detailed migration plan with downtime windows
  • Backup all containers before migration
  • Test migration process with one container first
  • Migrate containers in batches (by service type)
  • Verify services after migration
  • Update documentation with new locations

Deliverable: Balanced workload distribution across all servers

Priority: 🔴 HIGH (improves performance significantly)
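
The batch migration can be sketched as a small command generator. It assumes the VMID ranges in the strategy above are contiguous and inclusive, that the workloads are LXC containers, and that restart-mode migration (`pct migrate <vmid> <target> --restart`) is acceptable within the planned downtime windows. It also totals the RAM figures the plan states, to check them against each target host's free memory. Only groups with explicit RAM figures or simple ranges are included.

```python
# (batch name, target node, VMIDs). Ranges are assumed contiguous and inclusive.
BATCHES = [
    ("besu-validators", "r630-01", range(1000, 1005)),
    ("besu-rpc", "r630-02", range(2500, 2503)),
    ("besu-sentries", "r630-02", range(1500, 1504)),
]

# RAM in GB as stated in the plan. Blockscout and Firefly have no figure there.
KNOWN_RAM_GB = {
    "r630-01": [40, 40, 30],  # validators, DBIS core (approx.), applications (approx.)
    "r630-02": [48, 16],      # RPC nodes, sentries
}


def migrate_commands(target, vmids):
    """Build one restart-mode container migration command per VMID."""
    return [f"pct migrate {vmid} {target} --restart" for vmid in vmids]


def planned_ram_gb(target):
    """Sum the stated RAM that would move to the given target node."""
    return sum(KNOWN_RAM_GB[target])
```

With these figures, about 110 GB moves to r630-01 and at least 64 GB to r630-02. Generating the commands per batch makes it straightforward to run one container first, as the actions above require, before running the rest of a batch.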


Phase 4: Network Architecture Improvements (Week 4-6)

4.1 VLAN Migration Planning

Current: Flat LAN (192.168.11.0/24)
Target: VLAN-based segmentation (16+ VLANs)

Actions:

  • Review VLAN plan from NETWORK_ARCHITECTURE.md
  • Configure ES216G switches for VLAN trunking
  • Enable VLAN-aware bridge on Proxmox hosts
  • Create VLAN interfaces on ER605 router
  • Migrate services to appropriate VLANs
  • Test inter-VLAN routing
  • Update firewall rules

Key VLANs:

  • VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
  • VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
  • VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
  • VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
  • VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
  • VLAN 130-134: CCIP networks
  • VLAN 200-203: Sovereign tenants

Deliverable: VLAN-based network segmentation implemented

Priority: 🟡 MEDIUM (improves security and organization)
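
Before configuring switches and routers, the VLAN subnets can be checked for overlaps with Python's standard `ipaddress` module. The sketch below covers only the VLANs whose subnets are listed above. Treating the first host address as the gateway is a common convention and an assumption here, not something this plan specifies.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (name, subnet), copied from the key VLAN list.
VLANS = {
    11: ("MGMT-LAN", "192.168.11.0/24"),
    110: ("BESU-VAL", "10.110.0.0/24"),
    111: ("BESU-SEN", "10.111.0.0/24"),
    112: ("BESU-RPC", "10.112.0.0/24"),
    120: ("BLOCKSCOUT", "10.120.0.0/24"),
}


def overlapping_pairs(vlans):
    """Return pairs of VLAN IDs whose subnets overlap (empty when the plan is clean)."""
    networks = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


def first_host(cidr):
    """First usable address of a subnet (assumed gateway convention)."""
    return str(next(iter(ipaddress.ip_network(cidr).hosts())))
```

An empty result from `overlapping_pairs(VLANS)` means the listed subnets can be routed independently. The CCIP and tenant VLANs can be added to the dictionary once their subnets are defined.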


Phase 5: Service Optimization (Week 5-7)

5.1 Nginx Architecture Review

Current: Multiple Nginx instances

  • Central Nginx (VMID 105): Nginx Proxy Manager
  • Blockscout Nginx (VMID 5000): Local Nginx
  • MIM Nginx (VMID 7810): Local Nginx
  • RPC Nginx (VMIDs 2500-2502): SSL termination

Actions:

  • Document purpose of each Nginx instance
  • Verify all routing is correct
  • Consider consolidation opportunities
  • Standardize SSL certificate management
  • Optimize Nginx configurations

Deliverable: Documented and optimized Nginx architecture

Priority: 🟢 LOW


Phase 6: Documentation & Automation (Week 6-8)

6.1 Infrastructure Documentation

Actions:

  • Create complete infrastructure map
  • Document all IP assignments
  • Document all service locations
  • Create network topology diagrams
  • Document all configurations
  • Create runbooks for common operations

Deliverable: Complete infrastructure documentation

Priority: 🟡 MEDIUM


Success Metrics

Performance Improvements

| Metric | Current | Target | Improvement |
|---|---|---|---|
| ml110 CPU Usage | High (memory at 75%) | <50% | 33% reduction |
| ml110 Memory Usage | 75% | <50% | 33% reduction (relative) |
| r630-01 Utilization | 1% | 40-60% | Better resource use |
| r630-02 Utilization | 2% | 40-60% | Better resource use |
| Average Response Time | Baseline | -20% | Faster responses |
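
The 33% figure for ml110 memory is a relative reduction, not a drop in percentage points. A short worked check, using only the 75% and 50% values from the table (function names are illustrative):

```python
def relative_reduction(current: float, target: float) -> float:
    """Reduction as a percentage of the current value."""
    return (current - target) / current * 100


def point_reduction(current: float, target: float) -> float:
    """Reduction in percentage points."""
    return current - target
```

Going from 75% to 50% is a drop of 25 percentage points, which is 25 / 75, or about 33%, of the current usage. Because the target is "<50%", both numbers are lower bounds.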

Availability Improvements

| Metric | Current | Target |
|---|---|---|
| Cloudflare Tunnel Uptime | 40-60% | >99% |
| Service Availability | Variable | >99.5% |
| DNS Resolution | Some issues | 100% |

Timeline Summary

| Phase | Duration | Key Deliverables |
|---|---|---|
| Phase 1 | Weeks 1-2 | Critical issues resolved |
| Phase 2 | Weeks 2-3 | Storage optimized, infrastructure ready |
| Phase 3 | Weeks 3-5 | Workload redistributed |
| Phase 4 | Weeks 4-6 | Network architecture improved |
| Phase 5 | Weeks 5-7 | Services optimized |
| Phase 6 | Weeks 6-8 | Documentation complete |

Total Timeline: 8 weeks (with some phases overlapping)


Next Steps

Immediate (This Week)

  1. Start IP Conflict Investigation

    • Get MAC address of 192.168.11.14
    • Check physical r630-04 status
    • Identify what's using the IP
  2. Fix Cloudflare Tunnel

    • Update tunnel routing configuration
    • Test all endpoints
  3. Clean Up DNS

    • Remove duplicate records
    • Create missing CNAME records

Last Updated: 2026-01-05
Status: 📋 PLAN READY FOR EXECUTION