Complete Ecosystem Improvement Plan

Date: 2026-01-05
Status: 📋 COMPREHENSIVE PLAN
Scope: Complete infrastructure ecosystem optimization

Executive Summary

This document provides a comprehensive plan to optimize the entire infrastructure ecosystem, addressing:

Workload Distribution - ml110 is overloaded (34 containers) while R630 servers are underutilized
IP Conflict Resolution - 192.168.11.14 conflict needs investigation
Network Architecture - VLAN migration and routing improvements
Cloudflare/DNS - Tunnel configuration, DNS cleanup, and routing fixes
Storage Optimization - Enable and optimize storage on R630 servers
Service Migration - Redistribute workloads for better performance
Monitoring & Documentation - Complete infrastructure visibility

Current State: ⚠️ Suboptimal - ml110 carries most of the workload (34 of 48 containers) on the least powerful hardware
Target State: ✅ Optimized - Balanced workload distribution across all servers

Phase 1: Critical Issues Resolution (Week 1-2)

1.1 IP Conflict Investigation & Resolution

Issue: 192.168.11.14 responds with an Ubuntu SSH banner, but Proxmox is Debian-based, so the host answering on this address is probably not the expected Proxmox node (r630-04)

Actions:

Get MAC address of device using 192.168.11.14
Identify device type from MAC vendor database
Check physical r630-04 server status (power, console/iDRAC)
Verify r630-04 actual IP address and Proxmox installation
Check for orphaned VMs on all Proxmox hosts
Resolve IP conflict (reassign IP or remove conflicting device)
Update documentation with correct IP assignments

Deliverable: Resolved IP conflict, identified actual r630-04 status

Priority: 🔴 CRITICAL
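
The banner check behind this issue can be repeated from any machine on the LAN. The following is a minimal Python sketch, not an established project script: the function names are hypothetical, and it assumes the device still answers on TCP port 22. It reads the SSH identification string and guesses the distribution from it.

```python
import socket


def read_ssh_banner(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Return the identification string an SSH server sends on connect."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return sock.recv(256).decode("ascii", errors="replace").strip()


def guess_distro(banner: str) -> str:
    """Guess the distribution from the comment part of an OpenSSH banner."""
    lowered = banner.lower()
    if "ubuntu" in lowered:
        return "ubuntu"
    if "debian" in lowered:
        return "debian"
    return "unknown"


def is_plausibly_proxmox(banner: str) -> bool:
    # Proxmox VE is built on Debian, so a Debian banner is consistent with it
    # (not proof of it), while an Ubuntu banner points to some other device.
    return guess_distro(banner) == "debian"


if __name__ == "__main__":
    banner = read_ssh_banner("192.168.11.14")
    print(banner, "->", guess_distro(banner))
```

For the MAC address step, running `ip neigh show 192.168.11.14` on a host in the same subnet after a ping should list the link-layer address, which can then be checked against a vendor database.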

1.2 Cloudflare Tunnel Configuration Fix

Issue: The tunnel for rpc-http-pub.d-bis.org is DOWN, and HTTP endpoints are not routed through the central Nginx

Actions:

Update Cloudflare tunnel configuration to route HTTP endpoints to central Nginx
- explorer.d-bis.org → http://192.168.11.21:80
- rpc-http-pub.d-bis.org → http://192.168.11.21:80
- rpc-http-prv.d-bis.org → http://192.168.11.21:80
- dbis-admin.d-bis.org → http://192.168.11.21:80
- dbis-api.d-bis.org → http://192.168.11.21:80
- dbis-api-2.d-bis.org → http://192.168.11.21:80
- mim4u.org → http://192.168.11.21:80
- www.mim4u.org → http://192.168.11.21:80
Keep WebSocket endpoints routing directly to RPC nodes
Verify tunnel health after changes
Test all endpoints

Deliverable: All tunnels healthy, routing through central Nginx

Priority: 🔴 CRITICAL
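
If the tunnel is managed locally through a cloudflared config.yml (an assumption; a dashboard-managed tunnel is configured in the Cloudflare UI instead), the HTTP routing listed above could be generated rather than typed by hand. A minimal Python sketch with a hypothetical helper name; the WebSocket hostnames are left out because their RPC node targets are not listed in this plan.

```python
# Hostnames and target taken from the action list above.
HTTP_HOSTNAMES = [
    "explorer.d-bis.org",
    "rpc-http-pub.d-bis.org",
    "rpc-http-prv.d-bis.org",
    "dbis-admin.d-bis.org",
    "dbis-api.d-bis.org",
    "dbis-api-2.d-bis.org",
    "mim4u.org",
    "www.mim4u.org",
]
CENTRAL_NGINX = "http://192.168.11.21:80"


def render_ingress(hostnames, service):
    """Render the `ingress:` section of a cloudflared config as YAML text."""
    lines = ["ingress:"]
    for name in hostnames:
        lines.append(f"  - hostname: {name}")
        lines.append(f"    service: {service}")
    # cloudflared expects the last rule to be a catch-all without a hostname.
    lines.append("  - service: http_status:404")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ingress(HTTP_HOSTNAMES, CENTRAL_NGINX), end="")
```

The generated section would still need the tunnel ID and credentials entries that belong to the existing config, and it can be checked with `cloudflared tunnel ingress validate` before the service is restarted.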

1.3 DNS Records Cleanup & Migration

Issues:

Missing CNAME records for RPC and DBIS services
Duplicate A records
Inconsistent proxy status

Actions:

Create missing CNAME records:
- rpc-http-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
- rpc-ws-pub.d-bis.org → <tunnel-id>.cfargotunnel.com
- rpc-http-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
- rpc-ws-prv.d-bis.org → <tunnel-id>.cfargotunnel.com
- dbis-admin.d-bis.org → <tunnel-id>.cfargotunnel.com
- dbis-api.d-bis.org → <tunnel-id>.cfargotunnel.com
- dbis-api-2.d-bis.org → <tunnel-id>.cfargotunnel.com
- mim4u.org → <tunnel-id>.cfargotunnel.com
- www.mim4u.org → <tunnel-id>.cfargotunnel.com
Remove duplicate A records:
- besu.d-bis.org (keep one IP)
- blockscout.d-bis.org (keep one IP)
- explorer.d-bis.org (keep one IP)
- d-bis.org (keep 20.215.32.15)
Enable proxy (orange cloud) for all public services
Standardize TTL settings

Deliverable: Clean DNS configuration, all services accessible via tunnels

Priority: 🔴 CRITICAL
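
The three issues above (missing CNAMEs, duplicate A records, inconsistent proxy status) can be checked mechanically against an export of the zone. A minimal Python sketch, assuming the records are available as dictionaries with `type`, `name`, `content`, and `proxied` keys (the field names mirror Cloudflare's DNS record objects; the function name and the sample data are hypothetical).

```python
from collections import defaultdict

# Hostnames that should be CNAMEs to the tunnel, from the list above.
REQUIRED_CNAMES = [
    "rpc-http-pub.d-bis.org", "rpc-ws-pub.d-bis.org",
    "rpc-http-prv.d-bis.org", "rpc-ws-prv.d-bis.org",
    "dbis-admin.d-bis.org", "dbis-api.d-bis.org", "dbis-api-2.d-bis.org",
    "mim4u.org", "www.mim4u.org",
]


def audit_records(records, required_cnames):
    """Report duplicate A records, missing CNAMEs, and unproxied names."""
    a_targets = defaultdict(set)
    cnames = set()
    for rec in records:
        if rec["type"] == "A":
            a_targets[rec["name"]].add(rec["content"])
        elif rec["type"] == "CNAME":
            cnames.add(rec["name"])
    return {
        "duplicate_a": {n: sorted(ips) for n, ips in a_targets.items() if len(ips) > 1},
        "missing_cname": [n for n in required_cnames if n not in cnames],
        "unproxied": sorted({r["name"] for r in records if not r.get("proxied", False)}),
    }


if __name__ == "__main__":
    # Illustrative data only; 192.0.2.10 is a documentation placeholder address.
    sample = [
        {"type": "A", "name": "d-bis.org", "content": "20.215.32.15", "proxied": True},
        {"type": "A", "name": "d-bis.org", "content": "192.0.2.10", "proxied": False},
    ]
    print(audit_records(sample, REQUIRED_CNAMES))
```

Running the audit before and after the cleanup gives a simple pass/fail check for this deliverable: all three report entries should come back empty.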

Phase 2: Storage & Infrastructure Optimization (Week 2-3)

2.1 Storage Activation on R630 Servers

Issue: Storage pools disabled on r630-01 and r630-02

Actions:

r630-01: Enable local-lvm and thin1 storage pools
r630-02: Verify and enable thin storage pools
Verify storage is accessible and working
Test VM creation on both hosts
Document storage configuration

Deliverable: All storage pools active and ready for VM deployment

Priority: 🔴 HIGH (blocks workload migration)
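
The enable step can be prepared as a reviewable command list before anything is run. A minimal Python sketch, assuming the pools are defined but flagged as disabled in the Proxmox storage configuration, in which case `pvesm set <storage> --disable 0` re-enables them (worth confirming against the installed Proxmox version). The pool names for r630-02 are not given in this plan, so only a status check is emitted for that host; the helper name is hypothetical.

```python
# Pool names from the action list above; r630-02 pools are not named there.
DISABLED_POOLS = {
    "r630-01": ["local-lvm", "thin1"],
    "r630-02": [],
}


def enable_commands(pools_by_host):
    """Return (host, argv) pairs: enable each pool, then list storage status."""
    commands = []
    for host, pools in pools_by_host.items():
        for pool in pools:
            commands.append((host, ["pvesm", "set", pool, "--disable", "0"]))
        # `pvesm status` shows whether each storage is active afterwards.
        commands.append((host, ["pvesm", "status"]))
    return commands


if __name__ == "__main__":
    for host, argv in enable_commands(DISABLED_POOLS):
        print(f"{host}: {' '.join(argv)}")
```

The sketch only prints the commands; each one is meant to be run on the named host after review.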

2.2 Cluster Configuration Verification

Actions:

Verify cluster recognizes all hostnames correctly
Update any remaining references to old hostnames (pve, pve2)
Verify quorum is maintained
Test cluster operations (migration, HA)
Document cluster configuration

Deliverable: Cluster fully operational with correct hostnames

Priority: 🟡 MEDIUM

Phase 3: Workload Redistribution (Week 3-5)

3.1 Workload Analysis & Migration Plan

Current State:

ml110: 34 containers, 94GB RAM used, 75% memory usage, 6 cores @ 1.60GHz
r630-01: 3 containers, 6.4GB RAM used, 1% memory usage, 32 cores @ 2.40GHz
r630-02: 11 containers, 4.4GB RAM used, 2% memory usage, 56 cores @ 2.00GHz
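
To put the imbalance in numbers, the figures above can be converted into each host's share of containers and of CPU cores. A minimal Python sketch; it compares raw core counts only and ignores clock speed and RAM, which is a simplification.

```python
# Container and core counts from the Current State list above.
CONTAINERS = {"ml110": 34, "r630-01": 3, "r630-02": 11}
CORES = {"ml110": 6, "r630-01": 32, "r630-02": 56}


def percent_share(values):
    """Return each host's share of the total, rounded to whole percent."""
    total = sum(values.values())
    return {host: round(100 * v / total) for host, v in values.items()}


if __name__ == "__main__":
    containers = percent_share(CONTAINERS)
    cores = percent_share(CORES)
    for host in CONTAINERS:
        print(f"{host}: {containers[host]}% of containers, {cores[host]}% of cores")
```

By these figures, ml110 runs about 71% of the containers with about 6% of the cores.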

Target Distribution:

Server	Current	Target	Migration
ml110	34 containers	10-15 containers	Keep lightweight/management
r630-01	3 containers	15-20 containers	Add medium workload VMs
r630-02	11 containers	15-20 containers	Add heavy workload VMs

Migration Strategy:

Keep on ml110 (Management/Infrastructure):

VMID 100-105, 130: Infrastructure services (mail, datacenter, cloudflared, omada, gitea, nginx)
Lightweight management services

Migrate to r630-01 (Medium Workload):

Besu Validators (1000-1004): 40GB RAM, 20 cores total
DBIS Core Services (10100-10151): ~40GB RAM, ~20 cores
Application Services (7800-7811): ~30GB RAM

Migrate to r630-02 (Heavy Workload):

Besu RPC Nodes (2500-2502): 48GB RAM, 12 cores total
Besu Sentries (1500-1503): 16GB RAM, 8 cores total
Blockscout (5000): Database-intensive
Firefly (6200-6201): Web3 gateway services

Actions:

Create detailed migration plan with downtime windows
Backup all containers before migration
Test migration process with one container first
Migrate containers in batches (by service type)
Verify services after migration
Update documentation with new locations

Deliverable: Balanced workload distribution across all servers

Priority: 🔴 HIGH (improves performance significantly)
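
The backup-then-migrate sequence can be laid out per batch before any downtime window is booked. A minimal Python sketch under stated assumptions: every VMID in the listed ranges exists and is an LXC container, the batch order is arbitrary, and the DBIS Core (10100-10151) and Application (7800-7811) ranges are left out because the plan gives only their bounds. It uses `vzdump <vmid>` for the backup and `pct migrate <vmid> <target> --restart` for the move; the helper name is hypothetical.

```python
# (service, target host, VMIDs) taken from the migration strategy above.
BATCHES = [
    ("besu-validators", "r630-01", range(1000, 1005)),
    ("besu-sentries", "r630-02", range(1500, 1504)),
    ("besu-rpc", "r630-02", range(2500, 2503)),
    ("blockscout", "r630-02", range(5000, 5001)),
    ("firefly", "r630-02", range(6200, 6202)),
]


def migration_steps(batches):
    """Return (service, argv) pairs: back up each container, then migrate it."""
    steps = []
    for service, target, vmids in batches:
        for vmid in vmids:
            steps.append((service, ["vzdump", str(vmid)]))
            # Restart mode stops the container, moves it, and starts it again.
            steps.append((service, ["pct", "migrate", str(vmid), target, "--restart"]))
    return steps


if __name__ == "__main__":
    for service, argv in migration_steps(BATCHES):
        print(f"[{service}] {' '.join(argv)}")
```

The first two steps of the first batch double as the single-container trial run called for in the actions; the remaining steps would follow only after that container is verified on its new host.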

Phase 4: Network Architecture Improvements (Week 4-6)

4.1 VLAN Migration Planning

Current: Flat LAN (192.168.11.0/24)
Target: VLAN-based segmentation (16+ VLANs)

Actions:

Review VLAN plan from NETWORK_ARCHITECTURE.md
Configure ES216G switches for VLAN trunking
Enable VLAN-aware bridge on Proxmox hosts
Create VLAN interfaces on ER605 router
Migrate services to appropriate VLANs
Test inter-VLAN routing
Update firewall rules

Key VLANs:

VLAN 11: MGMT-LAN (192.168.11.0/24) - Legacy compatibility
VLAN 110: BESU-VAL (10.110.0.0/24) - Validators
VLAN 111: BESU-SEN (10.111.0.0/24) - Sentries
VLAN 112: BESU-RPC (10.112.0.0/24) - RPC nodes
VLAN 120: BLOCKSCOUT (10.120.0.0/24) - Explorer
VLAN 130-134: CCIP networks
VLAN 200-203: Sovereign tenants

Deliverable: VLAN-based network segmentation implemented

Priority: 🟡 MEDIUM (improves security and organization)
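
Before switches and routers are reconfigured, the VLAN plan itself can be sanity-checked for overlapping subnets and out-of-range IDs. A minimal Python sketch using the standard `ipaddress` module; it covers only the VLANs whose subnets are listed above (the CCIP and sovereign tenant ranges have no subnets in this plan), and the function names are hypothetical.

```python
import ipaddress

# VLAN ID -> (name, subnet), from the Key VLANs list above.
VLAN_PLAN = {
    11: ("MGMT-LAN", "192.168.11.0/24"),
    110: ("BESU-VAL", "10.110.0.0/24"),
    111: ("BESU-SEN", "10.111.0.0/24"),
    112: ("BESU-RPC", "10.112.0.0/24"),
    120: ("BLOCKSCOUT", "10.120.0.0/24"),
}


def find_overlaps(plan):
    """Return pairs of VLAN IDs whose subnets overlap."""
    nets = [(vid, ipaddress.ip_network(cidr)) for vid, (_, cidr) in sorted(plan.items())]
    return [
        (a, b)
        for i, (a, net_a) in enumerate(nets)
        for b, net_b in nets[i + 1:]
        if net_a.overlaps(net_b)
    ]


def invalid_ids(plan):
    """Return VLAN IDs outside the usable 802.1Q range 1-4094."""
    return [vid for vid in plan if not 1 <= vid <= 4094]


if __name__ == "__main__":
    print("overlaps:", find_overlaps(VLAN_PLAN))
    print("invalid IDs:", invalid_ids(VLAN_PLAN))
```

Both checks should return empty lists for the plan as written; rerunning them whenever a VLAN is added keeps the addressing scheme consistent.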

Phase 5: Service Optimization (Week 5-7)

5.1 Nginx Architecture Review

Current: Multiple Nginx instances

Central Nginx (VMID 105): Nginx Proxy Manager
Blockscout Nginx (VMID 5000): Local Nginx
MIM Nginx (VMID 7810): Local Nginx
RPC Nginx (VMIDs 2500-2502): SSL termination

Actions:

Document purpose of each Nginx instance
Verify all routing is correct
Consider consolidation opportunities
Standardize SSL certificate management
Optimize Nginx configurations

Deliverable: Documented and optimized Nginx architecture

Priority: 🟢 LOW

Phase 6: Documentation & Automation (Week 6-8)

6.1 Infrastructure Documentation

Actions:

Create complete infrastructure map
Document all IP assignments
Document all service locations
Create network topology diagrams
Document all configurations
Create runbooks for common operations

Deliverable: Complete infrastructure documentation

Priority: 🟡 MEDIUM

Success Metrics

Performance Improvements

Metric	Current	Target	Improvement
ml110 CPU Usage	High (CPU not quantified; memory at 75%)	<50%	Quantify once a CPU baseline is measured
ml110 Memory Usage	75%	<50%	At least 25 percentage points (33% relative reduction)
r630-01 Utilization	1% (memory)	40-60%	Better resource use
r630-02 Utilization	2% (memory)	40-60%	Better resource use
Average Response Time	Baseline	-20%	Faster responses
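
The 33% figure for ml110 is a relative reduction, not a drop in percentage points. A minimal Python sketch of the arithmetic, using the 75% current and 50% target values from the table; the function name is hypothetical.

```python
def relative_reduction(current_pct: float, target_pct: float) -> float:
    """Return the reduction as a percentage of the current value."""
    return 100 * (current_pct - target_pct) / current_pct


if __name__ == "__main__":
    points = 75 - 50
    relative = relative_reduction(75, 50)
    print(f"{points} percentage points, {relative:.1f}% relative reduction")
```

Going from 75% to 50% is a drop of 25 percentage points, which is one third of the starting value.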

Availability Improvements

Metric	Current	Target
Cloudflare Tunnel Uptime	40-60%	>99%
Service Availability	Variable	>99.5%
DNS Resolution	Some issues	100%
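
The availability targets are easier to plan around when expressed as a downtime budget. A minimal Python sketch, assuming a 30-day measurement window (the window length is an assumption, not something this plan specifies); the function name is hypothetical.

```python
def allowed_downtime_hours(availability_pct: float, period_days: int = 30) -> float:
    """Return the downtime, in hours, that an availability target permits."""
    return period_days * 24 * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, target in [("Cloudflare tunnel", 99.0), ("Services", 99.5)]:
        hours = allowed_downtime_hours(target)
        print(f"{label}: {target}% allows about {hours:.1f} h of downtime per 30 days")
```

Over 30 days, 99% allows roughly 7.2 hours of downtime and 99.5% roughly 3.6 hours, compared with hundreds of hours at the current 40-60% tunnel uptime.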

Timeline Summary

Phase	Duration	Key Deliverables
Phase 1	Weeks 1-2	Critical issues resolved
Phase 2	Weeks 2-3	Storage optimized, infrastructure ready
Phase 3	Weeks 3-5	Workload redistributed
Phase 4	Weeks 4-6	Network architecture improved
Phase 5	Weeks 5-7	Services optimized
Phase 6	Weeks 6-8	Documentation complete

Total Timeline: 8 weeks (with some phases overlapping)

Next Steps

Immediate (This Week)

Start IP Conflict Investigation
- Get MAC address of 192.168.11.14
- Check physical r630-04 status
- Identify what's using the IP
Fix Cloudflare Tunnel
- Update tunnel routing configuration
- Test all endpoints
Clean Up DNS
- Remove duplicate records
- Create missing CNAME records

Last Updated: 2026-01-05
Status: 📋 PLAN READY FOR EXECUTION
