# Complete Ecosystem Improvement Plan
**Date**: 2026-01-05
**Status**: 📋 **COMPREHENSIVE PLAN**
**Scope**: Complete infrastructure ecosystem optimization
---
## Executive Summary
This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:
1. **Workload Distribution** - ml110 is overloaded (34 containers) while R630 servers are underutilized
2. **IP Conflict Resolution** - 192.168.11.14 conflict needs investigation
3. **Network Architecture** - VLAN migration and routing improvements
4. **Cloudflare/DNS** - Tunnel configuration, DNS cleanup, and routing fixes
5. **Storage Optimization** - Enable and optimize storage on R630 servers
6. **Service Migration** - Redistribute workloads for better performance
7. **Monitoring & Documentation** - Complete infrastructure visibility
**Current State**: ⚠️ **Suboptimal** - ml110 runs 34 of 48 containers (about 71%) on the least powerful hardware
**Target State**: ✅ **Optimized** - Balanced workload distribution across all servers
---
## Phase 1: Critical Issues Resolution (Week 1-2)
### 1.1 IP Conflict Investigation & Resolution
**Issue**: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox VE is Debian-based, so whatever answers on that address is probably not the r630-04 Proxmox host expected there
**Actions**:
- [ ] Get MAC address of device using 192.168.11.14
- [ ] Identify device type from MAC vendor database
- [ ] Check physical r630-04 server status (power, console/iDRAC)
- [ ] Verify r630-04 actual IP address and Proxmox installation
- [ ] Check for orphaned VMs on all Proxmox hosts
- [ ] Resolve IP conflict (reassign IP or remove conflicting device)
- [ ] Update documentation with correct IP assignments
**Deliverable**: Resolved IP conflict, identified actual r630-04 status
**Priority**: 🔴 **CRITICAL**
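
The first two actions above can be sketched in Python. This is a minimal sketch, assuming the neighbour table was captured with `ip neigh show` on a host in the same LAN; the sample output, the `vmbr0` interface name, and the MAC addresses (taken from the documentation range) are hypothetical, and the OUI prefix still has to be looked up in a vendor database by hand.

```python
import re

# Hypothetical capture of `ip neigh show`; MACs are from the documentation range.
SAMPLE_NEIGH = """\
192.168.11.1 dev vmbr0 lladdr 00:00:5e:00:53:01 REACHABLE
192.168.11.14 dev vmbr0 lladdr 00:00:5e:00:53:af STALE
192.168.11.21 dev vmbr0 lladdr 00:00:5e:00:53:21 REACHABLE
192.168.11.99 dev vmbr0  FAILED
"""

# One neighbour entry: IP, interface, link-layer address, state.
NEIGH_LINE = re.compile(
    r"^(?P<ip>\S+)\s+dev\s+(?P<dev>\S+)\s+lladdr\s+"
    r"(?P<mac>[0-9a-fA-F:]{17})\s+(?P<state>\S+)"
)


def mac_for_ip(neigh_output, ip):
    """Return the lower-cased MAC recorded for `ip`, or None if unresolved."""
    for line in neigh_output.splitlines():
        match = NEIGH_LINE.match(line.strip())
        if match and match.group("ip") == ip:
            return match.group("mac").lower()
    return None


def oui_prefix(mac):
    """Return the first three octets (the vendor OUI) as six hex digits."""
    return mac.replace(":", "").upper()[:6]


mac = mac_for_ip(SAMPLE_NEIGH, "192.168.11.14")
print(mac, oui_prefix(mac))
```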
---
### 1.2 Cloudflare Tunnel Configuration Fix
**Issue**: Tunnel `rpc-http-pub.d-bis.org` is DOWN and its routing is misconfigured
**Actions**:
- [ ] Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx
  - `explorer.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-pub.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-prv.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-admin.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api-2.d-bis.org` → `http://192.168.11.21:80`
  - `mim4u.org` → `http://192.168.11.21:80`
  - `www.mim4u.org` → `http://192.168.11.21:80`
- [ ] Keep WebSocket endpoints routing directly to RPC nodes
- [ ] Verify tunnel health after changes
- [ ] Test all endpoints
**Deliverable**: All tunnels healthy, routing through central Nginx
**Priority**: 🔴 **CRITICAL**
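
The HTTP routing above could be expressed as `cloudflared` ingress rules. The sketch below is a hedged illustration, assuming the tunnel is locally managed through a `config.yml` file (a dashboard-managed tunnel would be edited in the Cloudflare UI instead). The helper names are hypothetical. WebSocket hostnames are left out because this plan does not list their RPC node addresses.

```python
CENTRAL_NGINX = "http://192.168.11.21:80"

# HTTP hostnames that the plan routes through the central Nginx.
HTTP_HOSTNAMES = [
    "explorer.d-bis.org",
    "rpc-http-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]


def build_ingress(hostnames, service):
    """Build ingress rules; cloudflared expects a final catch-all rule."""
    rules = [{"hostname": host, "service": service} for host in hostnames]
    rules.append({"service": "http_status:404"})
    return rules


def render_yaml(rules):
    """Render the rules as the `ingress:` section of a config.yml."""
    lines = ["ingress:"]
    for rule in rules:
        if "hostname" in rule:
            lines.append(f"  - hostname: {rule['hostname']}")
            lines.append(f"    service: {rule['service']}")
        else:
            lines.append(f"  - service: {rule['service']}")
    return "\n".join(lines)


ingress = build_ingress(HTTP_HOSTNAMES, CENTRAL_NGINX)
print(render_yaml(ingress))
```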
---
### 1.3 DNS Records Cleanup & Migration
**Issues**:
- Missing CNAME records for RPC and DBIS services
- Duplicate A records
- Inconsistent proxy status
**Actions**:
- [ ] Create missing CNAME records:
  - `rpc-http-pub.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-ws-pub.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-http-prv.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-ws-prv.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-admin.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-api.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-api-2.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `mim4u.org` → `<tunnel-id>.cfargotunnel.com`
  - `www.mim4u.org` → `<tunnel-id>.cfargotunnel.com`
- [ ] Remove duplicate A records:
  - `besu.d-bis.org` (keep one IP)
  - `blockscout.d-bis.org` (keep one IP)
  - `explorer.d-bis.org` (keep one IP)
  - `d-bis.org` (keep 20.215.32.15)
- [ ] Enable proxy (orange cloud) for all public services
- [ ] Standardize TTL settings
**Deliverable**: Clean DNS configuration, all services accessible via tunnels
**Priority**: 🔴 **CRITICAL**
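
A small audit script can find both problems before any record is changed. This is a sketch only: the `records` list stands in for a zone export and is hypothetical (its extra IPs come from the documentation range, apart from 20.215.32.15, which the plan names), and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical zone export as (type, name, value) tuples.
records = [
    ("A", "d-bis.org", "20.215.32.15"),
    ("A", "d-bis.org", "192.0.2.10"),
    ("A", "explorer.d-bis.org", "192.0.2.20"),
    ("A", "explorer.d-bis.org", "192.0.2.21"),
    ("CNAME", "rpc-ws-pub.d-bis.org", "<tunnel-id>.cfargotunnel.com"),
]

# CNAME records the plan says must exist.
REQUIRED_CNAMES = [
    "rpc-http-pub.d-bis.org",
    "rpc-ws-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "rpc-ws-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]


def duplicate_a_records(zone):
    """Map each name with more than one A record to its list of IPs."""
    by_name = defaultdict(list)
    for rtype, name, value in zone:
        if rtype == "A":
            by_name[name].append(value)
    return {name: ips for name, ips in by_name.items() if len(ips) > 1}


def missing_cnames(zone, required):
    """Return required hostnames that have no CNAME record yet."""
    present = {name for rtype, name, _ in zone if rtype == "CNAME"}
    return [name for name in required if name not in present]


print(duplicate_a_records(records))
print(missing_cnames(records, REQUIRED_CNAMES))
```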
---
## Phase 2: Storage & Infrastructure Optimization (Week 2-3)
### 2.1 Storage Activation on R630 Servers
**Issue**: Storage pools disabled on r630-01 and r630-02
**Actions**:
- [ ] **r630-01**: Enable local-lvm and thin1 storage pools
- [ ] **r630-02**: Verify and enable thin storage pools
- [ ] Verify storage is accessible and working
- [ ] Test VM creation on both hosts
- [ ] Document storage configuration
**Deliverable**: All storage pools active and ready for VM deployment
**Priority**: 🔴 **HIGH** (blocks workload migration)
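
One way to find which pools need enabling is to parse `pvesm status` on each host. The sketch below assumes the usual column layout (name, type, status, then sizes) and that a disabled pool is re-enabled with `pvesm set <storage> --disable 0`; check both against `man pvesm` on the installed Proxmox version. The sample output and its sizes are hypothetical.

```python
# Hypothetical `pvesm status` capture from r630-01.
SAMPLE_PVESM = """\
Name             Type     Status           Total            Used       Available        %
local             dir     active        98497780        12345678        81234567   12.53%
local-lvm     lvmthin   disabled               0               0               0      N/A
thin1         lvmthin   disabled               0               0               0      N/A
"""


def pool_status(pvesm_output):
    """Map each storage name to its status column, skipping the header row."""
    pools = {}
    for line in pvesm_output.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 3:
            pools[parts[0]] = parts[2]
    return pools


def enable_commands(pools):
    """Build one enable command per disabled pool (assumed CLI syntax)."""
    return [
        f"pvesm set {name} --disable 0"
        for name, status in pools.items()
        if status == "disabled"
    ]


for command in enable_commands(pool_status(SAMPLE_PVESM)):
    print(command)
```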
---
### 2.2 Cluster Configuration Verification
**Actions**:
- [ ] Verify cluster recognizes all hostnames correctly
- [ ] Update any remaining references to old hostnames (pve, pve2)
- [ ] Verify quorum is maintained
- [ ] Test cluster operations (migration, HA)
- [ ] Document cluster configuration
**Deliverable**: Cluster fully operational with correct hostnames
**Priority**: 🟡 **MEDIUM**
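
The quorum check can be scripted against `pvecm status`. This is a minimal sketch that assumes the output contains a `Quorate:` line, as current Proxmox releases print; the sample text and vote counts are hypothetical.

```python
# Hypothetical excerpt of `pvecm status` output for a three-node cluster.
SAMPLE_PVECM = """\
Quorum information
------------------
Nodes:            3
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Total votes:      3
Quorum:           2
"""


def parse_status(text):
    """Collect `key: value` lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_quorate(text):
    """True when the status output reports `Quorate: Yes`."""
    return parse_status(text).get("Quorate", "").lower() == "yes"


print(is_quorate(SAMPLE_PVECM))
```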
---
## Phase 3: Workload Redistribution (Week 3-5)
### 3.1 Workload Analysis & Migration Plan
**Current State**:
- **ml110**: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
- **r630-01**: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
- **r630-02**: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz
**Target Distribution**:
| Server | Current | Target | Migration |
|--------|---------|--------|-----------|
| **ml110** | 34 containers | 10-15 containers | Keep lightweight/management |
| **r630-01** | 3 containers | 15-20 containers | Add medium-workload containers |
| **r630-02** | 11 containers | 15-20 containers | Add heavy-workload containers |
**Migration Strategy**:
#### Keep on ml110 (Management/Infrastructure):
- VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
- Lightweight management services
#### Migrate to r630-01 (Medium Workload):
- Besu Validators (1000-1004): 40GB RAM, 20 cores total
- DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
- Application Services (7800-7811): ~30GB RAM
#### Migrate to r630-02 (Heavy Workload):
- Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
- Besu Sentries (1500-1503): 16GB RAM, 8 cores total
- Blockscout (5000): Database-intensive
- Firefly (6200-6201): Web3 gateway services
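
Summing the RAM figures listed above gives a rough per-host budget for the migration. The sketch uses only numbers from this plan; the DBIS and application figures are the plan's own approximations, and Blockscout and Firefly are excluded because no RAM figure is given for them.

```python
# RAM (GB) per service group, as listed in the migration strategy.
PLANNED_RAM_GB = {
    "r630-01": {
        "Besu Validators (1000-1004)": 40,
        "DBIS Core Services (10100-10151)": 40,  # approximate in the plan
        "Application Services (7800-7811)": 30,  # approximate in the plan
    },
    "r630-02": {
        "Besu RPC Nodes (2500-2502)": 48,
        "Besu Sentries (1500-1503)": 16,
        # Blockscout and Firefly: no RAM figure in the plan, so not counted.
    },
}

# Total incoming RAM per target host.
totals = {host: sum(groups.values()) for host, groups in PLANNED_RAM_GB.items()}

for host, total in totals.items():
    print(f"{host}: {total} GB planned")
```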
**Actions**:
- [ ] Create detailed migration plan with downtime windows
- [ ] Backup all containers before migration
- [ ] Test migration process with one container first
- [ ] Migrate containers in batches (by service type)
- [ ] Verify services after migration
- [ ] Update documentation with new locations
**Deliverable**: Balanced workload distribution across all servers
**Priority**: 🔴 **HIGH** (improves performance significantly)
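
The pilot-then-batches approach in the actions above might be scripted as follows. This sketch assumes each VMID range is contiguous and fully populated, that containers are moved with `pct migrate <vmid> <target> --restart` (confirm with `pct help migrate`), and that backups have already been taken. It only prints commands and runs nothing.

```python
# Service groups with explicit VMID ranges and targets from this plan.
BATCHES = [
    ("Besu Validators", range(1000, 1005), "r630-01"),
    ("Besu Sentries", range(1500, 1504), "r630-02"),
    ("Besu RPC Nodes", range(2500, 2503), "r630-02"),
]


def migration_commands(vmids, target, restart=True):
    """Build one `pct migrate` command per container (assumed CLI syntax)."""
    flag = " --restart" if restart else ""
    return [f"pct migrate {vmid} {target}{flag}" for vmid in vmids]


def build_plan(batches):
    """Return ordered (label, commands) steps: one pilot, then each batch."""
    first_name, first_vmids, first_target = batches[0]
    first_vmids = list(first_vmids)
    steps = [
        (f"pilot: {first_name}", migration_commands(first_vmids[:1], first_target)),
        (first_name, migration_commands(first_vmids[1:], first_target)),
    ]
    for name, vmids, target in batches[1:]:
        steps.append((name, migration_commands(vmids, target)))
    return steps


for label, commands in build_plan(BATCHES):
    print(f"# {label}")
    for command in commands:
        print(command)
```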
---
## Phase 4: Network Architecture Improvements (Week 4-6)
### 4.1 VLAN Migration Planning
**Current**: Flat LAN (192.168.11.0/24)
**Target**: VLAN-based segmentation (16+ VLANs)
**Actions**:
- [ ] Review VLAN plan from NETWORK_ARCHITECTURE.md
- [ ] Configure ES216G switches for VLAN trunking
- [ ] Enable VLAN-aware bridge on Proxmox hosts
- [ ] Create VLAN interfaces on ER605 router
- [ ] Migrate services to appropriate VLANs
- [ ] Test inter-VLAN routing
- [ ] Update firewall rules
**Key VLANs**:
- VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
- VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
- VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
- VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
- VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
- VLAN 130-134: CCIP networks
- VLAN 200-203: Sovereign tenants
**Deliverable**: VLAN-based network segmentation implemented
**Priority**: 🟡 **MEDIUM** (improves security and organization)
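
Before configuring switches, the subnet plan can be sanity-checked with Python's standard `ipaddress` module. The sketch covers only the VLANs whose subnets are listed above; the CCIP and sovereign tenant ranges are omitted because their subnets are not given here. The function names are illustrative.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (name, subnet), as listed under Key VLANs.
VLANS = {
    11: ("MGMT-LAN", "192.168.11.0/24"),
    110: ("BESU-VAL", "10.110.0.0/24"),
    111: ("BESU-SEN", "10.111.0.0/24"),
    112: ("BESU-RPC", "10.112.0.0/24"),
    120: ("BLOCKSCOUT", "10.120.0.0/24"),
}


def overlapping_vlans(vlans):
    """Return pairs of VLAN IDs whose subnets overlap (empty when clean)."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def vlan_for_ip(vlans, ip):
    """Return the VLAN ID whose subnet contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for vid, (_, cidr) in vlans.items():
        if addr in ipaddress.ip_network(cidr):
            return vid
    return None


print(overlapping_vlans(VLANS))
print(vlan_for_ip(VLANS, "192.168.11.21"))
```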
---
## Phase 5: Service Optimization (Week 5-7)
### 5.1 Nginx Architecture Review
**Current**: Multiple Nginx instances
- Central Nginx (VMID 105): Nginx Proxy Manager
- Blockscout Nginx (VMID 5000): Local Nginx
- MIM Nginx (VMID 7810): Local Nginx
- RPC Nginx (VMIDs 2500-2502): SSL termination
**Actions**:
- [ ] Document purpose of each Nginx instance
- [ ] Verify all routing is correct
- [ ] Consider consolidation opportunities
- [ ] Standardize SSL certificate management
- [ ] Optimize Nginx configurations
**Deliverable**: Documented and optimized Nginx architecture
**Priority**: 🟢 **LOW**
---
## Phase 6: Documentation & Automation (Week 6-8)
### 6.1 Infrastructure Documentation
**Actions**:
- [ ] Create complete infrastructure map
- [ ] Document all IP assignments
- [ ] Document all service locations
- [ ] Create network topology diagrams
- [ ] Document all configurations
- [ ] Create runbooks for common operations
**Deliverable**: Complete infrastructure documentation
**Priority**: 🟡 **MEDIUM**
---
## Success Metrics
### Performance Improvements
| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| ml110 CPU Usage | High (not quantified) | <50% | Lower sustained load |
| ml110 Memory Usage | 75% | <50% | At least 25 points (33% relative reduction) |
| r630-01 Memory Utilization | 1% | 40-60% | Better resource use |
| r630-02 Memory Utilization | 2% | 40-60% | Better resource use |
| Average Response Time | Baseline | -20% | Faster responses |
### Availability Improvements
| Metric | Current | Target |
|--------|---------|--------|
| Cloudflare Tunnel Uptime | 40-60% | >99% |
| Service Availability | Variable | >99.5% |
| DNS Resolution | Some issues | 100% |
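
The percentages in these tables follow from simple arithmetic, sketched below with the figures from this plan: the "33% reduction" is the relative drop from 75% to 50% memory usage, and the container counts give ml110's current share of the workload. The function name is illustrative.

```python
def reduction(current_pct, target_pct):
    """Return (percentage-point drop, relative drop in percent)."""
    points = current_pct - target_pct
    relative = points / current_pct * 100
    return points, relative


# ml110 memory: 75% today, target below 50%.
points, relative = reduction(75, 50)
print(f"{points} points, {relative:.1f}% relative")

# Container counts today: ml110, r630-01, r630-02.
containers = {"ml110": 34, "r630-01": 3, "r630-02": 11}
ml110_share = containers["ml110"] / sum(containers.values()) * 100
print(f"ml110 share: {ml110_share:.1f}%")
```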
---
## Timeline Summary
| Phase | Duration | Key Deliverables |
|-------|----------|------------------|
| **Phase 1** | Weeks 1-2 | Critical issues resolved |
| **Phase 2** | Weeks 2-3 | Storage optimized, infrastructure ready |
| **Phase 3** | Weeks 3-5 | Workload redistributed |
| **Phase 4** | Weeks 4-6 | Network architecture improved |
| **Phase 5** | Weeks 5-7 | Services optimized |
| **Phase 6** | Weeks 6-8 | Documentation complete |
**Total Timeline**: 8 weeks (with some phases overlapping)
---
## Next Steps
### Immediate (This Week)
1. **Start IP Conflict Investigation**
   - Get MAC address of 192.168.11.14
   - Check physical r630-04 status
   - Identify what's using the IP
2. **Fix Cloudflare Tunnel**
   - Update tunnel routing configuration
   - Test all endpoints
3. **Clean Up DNS**
   - Remove duplicate records
   - Create missing CNAME records
---
**Last Updated**: 2026-01-05
**Status**: 📋 **PLAN READY FOR EXECUTION**