# Complete Ecosystem Improvement Plan

**Date**: 2026-01-05
**Status**: 📋 **COMPREHENSIVE PLAN**
**Scope**: Complete infrastructure ecosystem optimization

---

## Executive Summary

This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:

1. **Workload Distribution** - ml110 is overloaded (34 containers) while the R630 servers are underutilized
2. **IP Conflict Resolution** - the 192.168.11.14 conflict needs investigation
3. **Network Architecture** - VLAN migration and routing improvements
4. **Cloudflare/DNS** - Tunnel configuration, DNS cleanup, and routing fixes
5. **Storage Optimization** - Enable and optimize storage on R630 servers
6. **Service Migration** - Redistribute workloads for better performance
7. **Monitoring & Documentation** - Complete infrastructure visibility

**Current State**: ⚠️ **Suboptimal** - ml110, the least powerful host, runs 34 of the 48 containers (about 71%)
**Target State**: ✅ **Optimized** - Balanced workload distribution across all servers

---

## Phase 1: Critical Issues Resolution (Week 1-2)

### 1.1 IP Conflict Investigation & Resolution

**Issue**: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox is Debian-based, so the host answering at this address is unlikely to be a Proxmox node

**Actions**:
- [ ] Get MAC address of device using 192.168.11.14
- [ ] Identify device type from MAC vendor database
- [ ] Check physical r630-04 server status (power, console/iDRAC)
- [ ] Verify r630-04 actual IP address and Proxmox installation
- [ ] Check for orphaned VMs on all Proxmox hosts
- [ ] Resolve IP conflict (reassign IP or remove conflicting device)
- [ ] Update documentation with correct IP assignments

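
The first two actions can be sketched in Python. This is a minimal illustration, assuming the neighbour table has been captured as text from `ip neigh show` on a host in the same LAN; the function names and the sample MAC addresses are hypothetical. It extracts the MAC for the conflicting IP, derives the OUI prefix to look up in a vendor database, and checks the locally administered bit, which (when set) hints at a virtual NIC rather than vendor hardware.

```python
import re

# Matches neighbour-table lines such as:
#   192.168.11.14 dev vmbr0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
NEIGH_RE = re.compile(
    r"^(?P<ip>\S+)\s+dev\s+(?P<dev>\S+)\s+lladdr\s+"
    r"(?P<mac>[0-9A-Fa-f:]{17})\s+(?P<state>\S+)"
)


def find_mac(neigh_output: str, ip: str):
    """Return (mac, oui) for `ip`, or None if no MAC is listed for it."""
    for line in neigh_output.splitlines():
        m = NEIGH_RE.match(line.strip())
        if m and m.group("ip") == ip:
            mac = m.group("mac").lower()
            return mac, mac[:8]  # first three octets form the vendor OUI
    return None


def is_locally_administered(mac: str) -> bool:
    """True if the locally administered bit (0x02 of the first octet) is set."""
    return bool(int(mac[:2], 16) & 0x02)


# Illustrative sample only; the MAC addresses are made up.
sample = """\
192.168.11.1 dev vmbr0 lladdr 00:11:22:33:44:55 REACHABLE
192.168.11.14 dev vmbr0 lladdr de:ad:be:ef:00:14 STALE
192.168.11.99 dev vmbr0 FAILED
"""

found = find_mac(sample, "192.168.11.14")
print(found)
if found:
    print("locally administered:", is_locally_administered(found[0]))
```

A locally administered address would point the investigation toward an orphaned VM or container, while a globally unique OUI can be matched against a vendor list to identify physical hardware.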

**Deliverable**: Resolved IP conflict, identified actual r630-04 status

**Priority**: 🔴 **CRITICAL**

---

### 1.2 Cloudflare Tunnel Configuration Fix

**Issue**: The tunnel for `rpc-http-pub.d-bis.org` is DOWN and routes traffic incorrectly

**Actions**:
- [ ] Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx
  - `explorer.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-pub.d-bis.org` → `http://192.168.11.21:80`
  - `rpc-http-prv.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-admin.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api.d-bis.org` → `http://192.168.11.21:80`
  - `dbis-api-2.d-bis.org` → `http://192.168.11.21:80`
  - `mim4u.org` → `http://192.168.11.21:80`
  - `www.mim4u.org` → `http://192.168.11.21:80`
- [ ] Keep WebSocket endpoints routing directly to RPC nodes
- [ ] Verify tunnel health after changes
- [ ] Test all endpoints

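
The HTTP routing above maps onto the `ingress` section of a locally managed `cloudflared` configuration file. The sketch below renders that section from the hostname list; it assumes the tunnel is configured through a config file rather than the Cloudflare dashboard, and the function name is hypothetical. WebSocket hostnames are deliberately left out, since their RPC node origins are not listed here; they would need their own rules ahead of the catch-all.

```python
HTTP_HOSTNAMES = [
    "explorer.d-bis.org",
    "rpc-http-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]
NGINX_ORIGIN = "http://192.168.11.21:80"  # central Nginx from the plan


def render_ingress(hostnames, origin):
    """Render a cloudflared `ingress:` block sending every hostname to one origin."""
    lines = ["ingress:"]
    for host in hostnames:
        lines.append(f"  - hostname: {host}")
        lines.append(f"    service: {origin}")
    # cloudflared expects the last rule to be a catch-all without a hostname.
    lines.append("  - service: http_status:404")
    return "\n".join(lines) + "\n"


print(render_ingress(HTTP_HOSTNAMES, NGINX_ORIGIN))
```

Generating the block from one list keeps the eight hostnames and the shared origin in a single place, which makes it harder for one endpoint to drift to a different target.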

**Deliverable**: All tunnels healthy, routing through central Nginx

**Priority**: 🔴 **CRITICAL**

---

### 1.3 DNS Records Cleanup & Migration

**Issues**:
- Missing CNAME records for RPC and DBIS services
- Duplicate A records
- Inconsistent proxy status

**Actions**:
- [ ] Create missing CNAME records:
  - `rpc-http-pub.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-ws-pub.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-http-prv.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `rpc-ws-prv.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-admin.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-api.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `dbis-api-2.d-bis.org` → `<tunnel-id>.cfargotunnel.com`
  - `mim4u.org` → `<tunnel-id>.cfargotunnel.com`
  - `www.mim4u.org` → `<tunnel-id>.cfargotunnel.com`
- [ ] Remove duplicate A records:
  - `besu.d-bis.org` (keep one IP)
  - `blockscout.d-bis.org` (keep one IP)
  - `explorer.d-bis.org` (keep one IP)
  - `d-bis.org` (keep 20.215.32.15)
- [ ] Enable proxy (orange cloud) for all public services
- [ ] Standardize TTL settings

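
Finding the duplicate A records can be automated once the zone has been exported. The following is a minimal sketch that assumes the records are available as `(name, type, content)` tuples; the function name is hypothetical, and every sample record except `20.215.32.15` is invented for illustration (the extra IPs come from documentation address ranges).

```python
from collections import defaultdict


def duplicate_a_records(records):
    """Map each name that has more than one distinct A record to its sorted IPs."""
    by_name = defaultdict(set)
    for name, rtype, content in records:
        if rtype.upper() == "A":
            # Normalise case and any trailing dot so equal names compare equal.
            by_name[name.lower().rstrip(".")].add(content)
    return {name: sorted(ips) for name, ips in by_name.items() if len(ips) > 1}


# Illustrative zone export; only 20.215.32.15 comes from the plan.
records = [
    ("d-bis.org", "A", "20.215.32.15"),
    ("d-bis.org.", "A", "203.0.113.10"),
    ("besu.d-bis.org", "A", "198.51.100.7"),
    ("rpc-http-pub.d-bis.org", "CNAME", "<tunnel-id>.cfargotunnel.com"),
]

print(duplicate_a_records(records))
```

Running the same check again after the cleanup should return an empty dictionary, which gives a simple pass/fail signal for this action item.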

**Deliverable**: Clean DNS configuration, all services accessible via tunnels

**Priority**: 🔴 **CRITICAL**

---

## Phase 2: Storage & Infrastructure Optimization (Week 2-3)

### 2.1 Storage Activation on R630 Servers

**Issue**: Storage pools disabled on r630-01 and r630-02

**Actions**:
- [ ] **r630-01**: Enable local-lvm and thin1 storage pools
- [ ] **r630-02**: Verify and enable thin storage pools
- [ ] Verify storage is accessible and working
- [ ] Test VM creation on both hosts
- [ ] Document storage configuration

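
Verifying that the pools are active can be scripted against the text output of `pvesm status`. This sketch assumes the usual column layout (`Name`, `Type`, `Status`, then size columns) and that the output has already been captured as a string; the function name is hypothetical and the sample figures are invented.

```python
def pools_needing_attention(pvesm_status: str):
    """Return (name, type, status) for every storage that is not 'active'."""
    rows = []
    for line in pvesm_status.splitlines()[1:]:  # first line is the header
        cols = line.split()
        if len(cols) < 3:
            continue  # skip blank or malformed lines
        name, storage_type, status = cols[:3]
        if status != "active":
            rows.append((name, storage_type, status))
    return rows


# Illustrative output only; sizes are made up.
sample = """\
Name             Type     Status           Total            Used       Available        %
local             dir     active        98497780        10241024        83230072   10.40%
local-lvm     lvmthin   disabled               0               0               0      N/A
thin1         lvmthin   disabled               0               0               0      N/A
"""

for name, storage_type, status in pools_needing_attention(sample):
    print(f"{name} ({storage_type}) is {status}")
```

An empty result on both r630 hosts would indicate that the storage prerequisite for the workload migration is met.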

**Deliverable**: All storage pools active and ready for VM deployment

**Priority**: 🔴 **HIGH** (blocks workload migration)

---

### 2.2 Cluster Configuration Verification

**Actions**:
- [ ] Verify cluster recognizes all hostnames correctly
- [ ] Update any remaining references to old hostnames (pve, pve2)
- [ ] Verify quorum is maintained
- [ ] Test cluster operations (migration, HA)
- [ ] Document cluster configuration

**Deliverable**: Cluster fully operational with correct hostnames

**Priority**: 🟡 **MEDIUM**

---

## Phase 3: Workload Redistribution (Week 3-5)

### 3.1 Workload Analysis & Migration Plan

**Current State**:
- **ml110**: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
- **r630-01**: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
- **r630-02**: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz

**Target Distribution**:

| Server | Current | Target | Migration |
|--------|---------|--------|-----------|
| **ml110** | 34 containers | 10-15 containers | Keep lightweight/management |
| **r630-01** | 3 containers | 15-20 containers | Add medium workload VMs |
| **r630-02** | 11 containers | 15-20 containers | Add heavy workload VMs |

**Migration Strategy**:

#### Keep on ml110 (Management/Infrastructure):
- VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
- Lightweight management services

#### Migrate to r630-01 (Medium Workload):
- Besu Validators (1000-1004): 40GB RAM, 20 cores total
- DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
- Application Services (7800-7811): ~30GB RAM

#### Migrate to r630-02 (Heavy Workload):
- Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
- Besu Sentries (1500-1503): 16GB RAM, 8 cores total
- Blockscout (5000): Database-intensive
- Firefly (6200-6201): Web3 gateway services

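
As a quick sanity check, the stated figures can be totalled per target host. The sketch below uses only the numbers given in the migration strategy; `None` marks a figure the plan does not state, approximate values (`~`) are taken at face value, and the structure and function names are hypothetical.

```python
# (service group, RAM in GB, cores); None = not stated in the plan.
PLAN = {
    "r630-01": [
        ("Besu Validators", 40, 20),
        ("DBIS Core Services", 40, 20),      # approximate in the plan
        ("Application Services", 30, None),  # approximate RAM, cores not stated
    ],
    "r630-02": [
        ("Besu RPC Nodes", 48, 12),
        ("Besu Sentries", 16, 8),
        ("Blockscout", None, None),
        ("Firefly", None, None),
    ],
}


def totals(groups):
    """Sum the stated RAM and cores; list groups with any missing figure."""
    ram = sum(r for _, r, _ in groups if r is not None)
    cores = sum(c for _, _, c in groups if c is not None)
    unknown = [name for name, r, c in groups if r is None or c is None]
    return ram, cores, unknown


for host, groups in PLAN.items():
    ram, cores, unknown = totals(groups)
    print(f"{host}: at least {ram} GB RAM, {cores} cores; unspecified: {unknown}")
```

These are lower bounds because several groups have no figures. Two things stand out from the arithmetic: the stated cores for r630-01 already total 40 against the 32 cores listed for that host, so the plan appears to rely on CPU overcommit there, and the combined 174 GB exceeds the 94GB currently used on ml110, which suggests the figures are allocations rather than measured usage.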

**Actions**:
- [ ] Create detailed migration plan with downtime windows
- [ ] Backup all containers before migration
- [ ] Test migration process with one container first
- [ ] Migrate containers in batches (by service type)
- [ ] Verify services after migration
- [ ] Update documentation with new locations

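
The ordering in these actions (back up, migrate one pilot container, then proceed in batches by service type) can be expressed as a dry-run generator. This is a sketch under several assumptions: the VMID ranges are treated as contiguous and inclusive, the pilot container (1503) is an arbitrary choice, only three service groups are shown, and the commands are printed rather than executed. The `vzdump` and `pct migrate` invocations use basic options and would need adjusting to the actual backup storage and migration mode.

```python
# (service type, VMIDs, target host); ranges assumed contiguous and inclusive.
BATCHES = [
    ("Besu Validators", range(1000, 1005), "r630-01"),
    ("Besu Sentries", range(1500, 1504), "r630-02"),
    ("Besu RPC Nodes", range(2500, 2503), "r630-02"),
]


def plan_steps(batches, pilot_vmid):
    """Return (vmid, target) pairs with the pilot first, then batch order."""
    steps = [(vmid, target) for _, vmids, target in batches for vmid in vmids]
    if pilot_vmid not in {vmid for vmid, _ in steps}:
        raise ValueError(f"pilot VMID {pilot_vmid} is not part of any batch")
    # Stable sort: the pilot moves to the front, everything else keeps its order.
    steps.sort(key=lambda step: step[0] != pilot_vmid)
    return steps


def render(steps):
    """Render dry-run commands: a backup followed by a restart migration."""
    lines = []
    for vmid, target in steps:
        lines.append(f"vzdump {vmid} --mode snapshot")
        lines.append(f"pct migrate {vmid} {target} --restart 1")
    return lines


for command in render(plan_steps(BATCHES, pilot_vmid=1503)):
    print(command)
```

Printing the commands first gives a reviewable artifact for the migration plan and its downtime windows before anything is run on the cluster.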

**Deliverable**: Balanced workload distribution across all servers

**Priority**: 🔴 **HIGH** (improves performance significantly)

---

## Phase 4: Network Architecture Improvements (Week 4-6)

### 4.1 VLAN Migration Planning

**Current**: Flat LAN (192.168.11.0/24)
**Target**: VLAN-based segmentation (16+ VLANs)

**Actions**:
- [ ] Review VLAN plan from NETWORK_ARCHITECTURE.md
- [ ] Configure ES216G switches for VLAN trunking
- [ ] Enable VLAN-aware bridge on Proxmox hosts
- [ ] Create VLAN interfaces on ER605 router
- [ ] Migrate services to appropriate VLANs
- [ ] Test inter-VLAN routing
- [ ] Update firewall rules

**Key VLANs**:
- VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
- VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
- VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
- VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
- VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
- VLAN 130-134: CCIP networks
- VLAN 200-203: Sovereign tenants

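
Before the subnets are configured on the switches and router, the addressing plan can be checked for overlaps with Python's standard `ipaddress` module. The sketch covers only the VLANs whose subnets are listed above (the CCIP and tenant ranges have no subnets stated); the function names are hypothetical.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (name, subnet), taken from the Key VLANs list.
VLANS = {
    11: ("MGMT-LAN", "192.168.11.0/24"),
    110: ("BESU-VAL", "10.110.0.0/24"),
    111: ("BESU-SEN", "10.111.0.0/24"),
    112: ("BESU-RPC", "10.112.0.0/24"),
    120: ("BLOCKSCOUT", "10.120.0.0/24"),
}


def overlapping_vlans(vlans):
    """Return pairs of VLAN IDs whose subnets overlap."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def vlan_for(ip, vlans):
    """Return (vlan_id, name) for the VLAN containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for vid, (name, cidr) in vlans.items():
        if addr in ipaddress.ip_network(cidr):
            return vid, name
    return None


print("overlaps:", overlapping_vlans(VLANS))
print("192.168.11.21 ->", vlan_for("192.168.11.21", VLANS))
```

The same table could later drive a check that every migrated service's new address falls inside its intended VLAN.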

**Deliverable**: VLAN-based network segmentation implemented

**Priority**: 🟡 **MEDIUM** (improves security and organization)

---

## Phase 5: Service Optimization (Week 5-7)

### 5.1 Nginx Architecture Review

**Current**: Multiple Nginx instances
- Central Nginx (VMID 105): Nginx Proxy Manager
- Blockscout Nginx (VMID 5000): Local Nginx
- MIM Nginx (VMID 7810): Local Nginx
- RPC Nginx (VMIDs 2500-2502): SSL termination

**Actions**:
- [ ] Document purpose of each Nginx instance
- [ ] Verify all routing is correct
- [ ] Consider consolidation opportunities
- [ ] Standardize SSL certificate management
- [ ] Optimize Nginx configurations

**Deliverable**: Documented and optimized Nginx architecture

**Priority**: 🟢 **LOW**

---

## Phase 6: Documentation & Automation (Week 6-8)

### 6.1 Infrastructure Documentation

**Actions**:
- [ ] Create complete infrastructure map
- [ ] Document all IP assignments
- [ ] Document all service locations
- [ ] Create network topology diagrams
- [ ] Document all configurations
- [ ] Create runbooks for common operations

**Deliverable**: Complete infrastructure documentation

**Priority**: 🟡 **MEDIUM**

---

## Success Metrics

### Performance Improvements

| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| ml110 CPU Usage | High (75% memory) | <50% | 33% reduction |
| ml110 Memory Usage | 75% | <50% | 33% reduction |
| r630-01 Utilization | 1% | 40-60% | Better resource use |
| r630-02 Utilization | 2% | 40-60% | Better resource use |
| Average Response Time | Baseline | -20% | Faster responses |

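
The "33% reduction" in the table is a relative figure: moving from 75% to 50% is a drop of 25 percentage points, which is one third of the starting value. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def relative_reduction(current: float, target: float) -> float:
    """Fraction by which `target` is lower than `current`."""
    return (current - target) / current


# ml110 memory usage: 75% today, 50% as the upper bound of the target.
print(f"{relative_reduction(75, 50):.0%}")  # prints "33%"
```

Because the target is stated as below 50%, 33% is the minimum relative reduction the plan calls for.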

### Availability Improvements

| Metric | Current | Target |
|--------|---------|--------|
| Cloudflare Tunnel Uptime | 40-60% | >99% |
| Service Availability | Variable | >99.5% |
| DNS Resolution | Some issues | 100% |

---

## Timeline Summary

| Phase | Duration | Key Deliverables |
|-------|----------|------------------|
| **Phase 1** | Weeks 1-2 | Critical issues resolved |
| **Phase 2** | Weeks 2-3 | Storage optimized, infrastructure ready |
| **Phase 3** | Weeks 3-5 | Workload redistributed |
| **Phase 4** | Weeks 4-6 | Network architecture improved |
| **Phase 5** | Weeks 5-7 | Services optimized |
| **Phase 6** | Weeks 6-8 | Documentation complete |

**Total Timeline**: 8 weeks (with some phases overlapping)

---

## Next Steps

### Immediate (This Week)

1. **Start IP Conflict Investigation**
   - Get MAC address of 192.168.11.14
   - Check physical r630-04 status
   - Identify what's using the IP

2. **Fix Cloudflare Tunnel**
   - Update tunnel routing configuration
   - Test all endpoints

3. **Clean Up DNS**
   - Remove duplicate records
   - Create missing CNAME records

---

**Last Updated**: 2026-01-05
**Status**: 📋 **PLAN READY FOR EXECUTION**