# Verification Scripts and Documentation - Gaps and TODOs

**Last Updated:** 2026-03-02
**Document Version:** 1.0
**Status:** Active Documentation

---

**Date**: 2026-01-20
**Status**: Gap Analysis Complete
**Purpose**: Identify all placeholders, missing components, and incomplete implementations

**Documentation note (2026-03-02):** Runbook placeholders (e.g. `your-token`, `your-password`) are intentional examples. In production, use values from `.env` only; do not commit secrets. [INGRESS_VERIFICATION_RUNBOOK.md](INGRESS_VERIFICATION_RUNBOOK.md) has been updated with a production note in Prerequisites. Other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep example placeholders; operators should source values from `.env` when running commands.

---

## Critical Missing Components

### 1. Missing Script: `scripts/verify/backup-npmplus.sh`

**Status**: ✅ **CREATED** (`scripts/verify/backup-npmplus.sh`)

**Referenced in**:
- `docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md` (lines 39, 150, 437, 480)

**Required Functionality**:
- Automated backup of the NPMplus database (`/data/database.sqlite`)
- Export of proxy hosts via the API
- Export of certificates via the API
- Certificate file backup from disk
- Compression and timestamping
- Configurable backup destination

**Action Required**: None - the script has been created; keep it aligned with the backup procedures documented in `NPMPLUS_BACKUP_RESTORE.md`.

---

## Placeholders and TBD Values

### 2. Nginx Config Paths - TBD Values

**Location**: `scripts/verify/verify-backend-vms.sh`

**Status**: ✅ **RESOLVED** - Paths set in `scripts/verify/verify-backend-vms.sh`:
- VMID 10130: `/etc/nginx/sites-available/dbis-frontend`
- VMID 2400: `/etc/nginx/sites-available/thirdweb-rpc`

**Required Actions** (if paths differ on actual VMs):

1. **VMID 10130 (dbis-frontend)**:
   - Determine the actual nginx config path
   - Common locations: `/etc/nginx/sites-available/dbis-frontend` or `/etc/nginx/sites-available/dbis-admin`
   - Update the script with the actual path
   - Verify the config exists and is enabled
2. **VMID 2400 (thirdweb-rpc-1)**:
   - Determine the actual nginx config path
   - Common locations: `/etc/nginx/sites-available/thirdweb-rpc` or `/etc/nginx/sites-available/rpc`
   - Update the script with the actual path
   - Verify the config exists and is enabled

**Impact**: The script will skip nginx config verification for these VMs until resolved.

---

### 3. Sankofa Cutover Plan - Target Placeholders

**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`

**Placeholders to Replace** (once Sankofa services are deployed):
- `` (appears 10 times)
- `` (appears 10 times)
- `⚠️ TBD` values in table (lines 60-64)

**Domain-Specific Targets Needed**:

| Domain | Current (Wrong) | Target (TBD) |
|--------|----------------|--------------|
| `sankofa.nexus` | 192.168.11.140:80 | `:` |
| `www.sankofa.nexus` | 192.168.11.140:80 | `:` |
| `phoenix.sankofa.nexus` | 192.168.11.140:80 | `:` |
| `www.phoenix.sankofa.nexus` | 192.168.11.140:80 | `:` |
| `the-order.sankofa.nexus` | 192.168.11.140:80 | `:` |

**Action Required**: Update placeholders with actual Sankofa service IPs and ports once deployed.

---

## Documentation Placeholders

### 4. Generic Placeholders in Runbooks

**Location**: Multiple files

**Replacements Needed**:

#### `INGRESS_VERIFICATION_RUNBOOK.md`:
- Line 23: `CLOUDFLARE_API_TOKEN="your-token"` → should reference the `.env` file
- Line 25: `CLOUDFLARE_EMAIL="your-email"` → should reference the `.env` file
- Line 26: `CLOUDFLARE_API_KEY="your-key"` → should reference the `.env` file
- Line 31: `NPM_PASSWORD="your-password"` → should reference the `.env` file
- Lines 91, 101, 213: similar placeholders in examples

**Note**: These are intentional examples, but they should be clearly marked as such and reference `.env` file usage.
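As a sketch of the pattern these notes recommend: load credentials from `.env` at the top of a script instead of hardcoding placeholders. The `load_env` helper and the throwaway demo file are illustrative; only the variable name `NPM_PASSWORD` comes from the runbooks.

```shell
#!/usr/bin/env bash
# Illustrative sketch: source credentials from .env instead of
# hardcoding "your-password"-style placeholders in scripts.
set -euo pipefail

load_env() {
    local env_file="$1"
    [ -f "$env_file" ] || { echo "missing $env_file" >&2; return 1; }
    set -a              # auto-export every variable the file defines
    # shellcheck disable=SC1090
    . "$env_file"
    set +a
}

# Demo with a throwaway .env so the sketch runs anywhere;
# a real script would call something like: load_env "$(dirname "$0")/../.env"
demo_dir=$(mktemp -d)
printf 'NPM_PASSWORD=example-only\n' > "$demo_dir/.env"
load_env "$demo_dir/.env"
echo "NPM_PASSWORD is set (${#NPM_PASSWORD} chars)"   # never print the value itself
rm -rf "$demo_dir"
```

The `set -a` / `set +a` pair exports everything the file defines without echoing it, so secrets stay out of script text and shell history.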
#### `NPMPLUS_BACKUP_RESTORE.md`:
- Line 84: `NPM_PASSWORD="your-password"` → example placeholder (acceptable)
- Line 304: `NPM_PASSWORD="your-password"` → example placeholder (acceptable)

#### `SANKOFA_CUTOVER_PLAN.md`:
- Line 125: `NPM_PASSWORD="your-password"` → example placeholder (acceptable)
- Line 178: `NPM_PASSWORD="your-password"` → example placeholder (acceptable)

**Status (2026-03-02):** Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and that production should source values from `.env`.

---

### 5. Source of Truth JSON - Verifier Field

**Location**: `docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json` (line 5)

**Current**: `"verifier": "operator-name"`

**Expected**: Should be set dynamically by the script using `$USER` or the actual operator name.

**Status**: ✅ **HANDLED** - The `generate-source-of-truth.sh` script uses `env.USER // "unknown"`, which is correct. The example JSON file is just a template.

**Action Required**: None - the script implementation is correct.

---

## Implementation Gaps
### 6. Source of Truth Generation - File Path Dependencies

**Location**: `scripts/verify/generate-source-of-truth.sh`

**Potential Issues**:
- The script expects specific output file names from the verification scripts
- If the verification scripts don't run first, the JSON will be empty or contain defaults
- No validation that source files exist before parsing

**Expected File Dependencies**:

```bash
$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json
```

**Action Required**:
- Add file existence checks before parsing
- Provide clear error messages if dependencies are missing
- Add an option to generate a partial source-of-truth if some verifications haven't run

---

### 7. Backend VM Verification - Service-Specific Checks

**Location**: `scripts/verify/verify-backend-vms.sh`

**Gaps Identified**:

1. **Besu RPC VMs (2101, 2201)**:
   - The script checks for RPC endpoints but doesn't run Besu-specific health checks
   - Should test actual RPC calls (e.g., `eth_chainId`), not just HTTP status
   - WebSocket port (8546) verification is minimal

2. **Node.js API VMs (10150, 10151)**:
   - Only checks that port 3000 is listening
   - Doesn't verify that an API health endpoint exists
   - Should test an actual API endpoint (e.g., `/health` or `/api/health`)

3. **Blockscout VM (5000)**:
   - Checks nginx on port 80 and Blockscout on port 4000
   - Should verify the Blockscout API is responding (e.g., `/api/health`)

**Action Required**:
- Add service-specific health check functions
- Implement actual RPC/API endpoint testing beyond port checks
- Document the expected health check endpoints per service type

---
### 8. End-to-End Routing - WebSocket Testing

**Location**: `scripts/verify/verify-end-to-end-routing.sh`

**Current Implementation**:
- Basic WebSocket connectivity check using a TCP connection test
- A manual `wscat` test is recommended but not automated
- No actual WebSocket handshake or message exchange verification

**Gap**:
- WebSocket tests are minimal (just a TCP connection)
- No verification that the WebSocket protocol upgrade works correctly
- No test of actual RPC WebSocket messages

**Action Required**:
- Add an automated WebSocket handshake test (if `wscat` is available)
- Or document clearly that WebSocket testing requires manual verification
- Consider adding an automated WebSocket test script if `wscat` or `websocat` is installed

---

## Configuration Gaps

### 9. Environment Variable Documentation

**Missing**: A comprehensive `.env.example` file listing all required variables

**Required Variables** (from scripts):

```bash
# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=

# Public IP
PUBLIC_IP=76.53.10.36

# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233

# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11
```

**Action Required**: Create a `.env.example` file in the project root with all required variables.

---
### 10. Script Dependencies Documentation

**Missing**: A list of required system dependencies

**Required Tools** (used across scripts):
- `bash` (4.0+)
- `curl` (for API calls)
- `jq` (for JSON parsing)
- `dig` (for DNS resolution)
- `openssl` (for SSL certificate inspection)
- `ssh` (for remote execution)
- `ss` (for port checking)
- `systemctl` (for service status)
- `sqlite3` (for database backup)

**Optional Tools**:
- `wscat` or `websocat` (for WebSocket testing)

**Action Required**:
- Add a dependencies section to `INGRESS_VERIFICATION_RUNBOOK.md`
- Create `scripts/verify/README.md` with installation instructions
- Add a dependency check function to `run-full-verification.sh`

---

## Data Completeness Gaps

### 11. Source of Truth JSON - Hardcoded Values

**Location**: `scripts/verify/generate-source-of-truth.sh` (lines 169-177)

**Current**: NPMplus container info is hardcoded:

```json
"container": {
  "vmid": 10233,
  "host": "r630-01",
  "host_ip": "192.168.11.11",
  "internal_ips": {
    "eth0": "192.168.11.166",
    "eth1": "192.168.11.167"
  },
  "management_ui": "https://192.168.11.166:81",
  "status": "running"
}
```

**Gap**: The status should be determined dynamically from verification results.

**Action Required**:
- Make the container status dynamic based on `export-npmplus-config.sh` results
- Verify the IP addresses are correct (especially `eth1`)
- Document whether `eth1` is actually used or is a placeholder

---

### 12. DNS Verification - Zone ID Lookup

**Location**: `scripts/verify/export-cloudflare-dns-records.sh`

**Current**: Attempts to fetch zone IDs if they are not provided in `.env`, but falls back to an empty string.

**Potential Issue**: If the zone ID lookup fails and `.env` doesn't have zone IDs, the script will fail silently or skip zones.

**Action Required**:
- Add validation that zone IDs are set (either from `.env` or from the API lookup)
- Fail clearly if a zone ID cannot be determined
- Provide a helpful error message with instructions

---

## Documentation Completeness
### 13. Missing Troubleshooting Sections

**Location**: `docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md`

**Current**: A basic troubleshooting section exists (lines 427-468) but could be expanded.

**Missing Topics**:
- What to do if verification scripts fail partially
- How to interpret "unknown" status vs "needs-fix" status
- How to manually verify items that scripts can't automate
- Common Cloudflare API errors and solutions
- Common NPMplus API authentication issues
- SSH connection failures to Proxmox hosts

**Action Required**: Expand the troubleshooting section with more scenarios.

---

### 14. Missing Rollback Procedures

**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`

**Current**: Basic rollback steps exist (lines 330-342) but could be more detailed.

**Missing**:
- An automated rollback script reference
- Exact commands to restore the previous NPMplus configuration
- How to verify the rollback was successful
- Recovery time expectations

**Action Required**:
- Create `scripts/verify/rollback-sankofa-routing.sh` (optional but recommended)
- Or expand the manual rollback steps with exact API calls

---

## Priority Summary

### 🔴 Critical (Must Fix Before Production Use)

1. ✅ **Create `scripts/verify/backup-npmplus.sh`** - Referenced but missing
2. ✅ **Resolve TBD nginx config paths** (VMID 10130, 2400) - Blocks verification
3. ✅ **Add file dependency validation** in `generate-source-of-truth.sh`

### 🟡 Important (Should Fix Soon)

4. **Add `.env.example` file** with all required variables
5. **Add dependency checks** to verification scripts
6. **Expand service-specific health checks** for Besu, Node.js, Blockscout
7. **Document WebSocket testing limitations** or automate it

### 🟢 Nice to Have (Can Wait)

8. **Expand troubleshooting section** with more scenarios
9. **Create rollback script** for Sankofa cutover
10. **Add dependency installation guide** to runbook
11. **Make container status dynamic** in source-of-truth generation

---

## Notes

- **Placeholders in examples**: Most "your-password" and "your-token" placeholders in documentation are intentional examples and acceptable, but they should clearly reference `.env` file usage.
- **Sankofa placeholders**: `` and `` are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
- **TBD config paths**: These need to be discovered by running verification and inspecting the actual VMs.

---

## Additional Items Completed

### 15. NPMplus High Availability (HA) Setup Guide ✅ ADDED

**Status**: ✅ **DOCUMENTATION COMPLETE** - Implementation pending

**Location**: `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`

**What Was Added**:
- Complete HA architecture guide (Active-Passive with Keepalived)
- Step-by-step implementation instructions (6 phases)
- Helper scripts: `sync-certificates.sh`, `monitor-ha-status.sh`
- Testing and validation procedures
- Troubleshooting guide
- Rollback plan
- Future upgrade path to Active-Active

**Scripts Created**:
- `scripts/npmplus/sync-certificates.sh` - Synchronize certificates from primary to secondary
- `scripts/npmplus/monitor-ha-status.sh` - Monitor HA status and send alerts

**Impact**: Eliminates the single point of failure for NPMplus and enables automatic failover.
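The certificate-sync idea behind `sync-certificates.sh` can be sketched as below. This is a local stand-in: plain `cp` between temp directories instead of `rsync` over SSH between the primary and secondary containers, and `sync_and_verify` is a hypothetical helper; only the mirror-then-verify-counts logic is the point.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the sync-certificates.sh idea: mirror the
# primary's certificate tree to the secondary, then verify that the
# file counts match. A real run would use rsync over SSH between the
# containers; here both sides are local temp directories so the
# sketch runs anywhere.
set -euo pipefail

sync_and_verify() {
    local src="$1" dst="$2" n_src n_dst
    mkdir -p "$dst"
    cp -a "$src/." "$dst/"          # local stand-in for: rsync -a src/ host:dst/
    n_src=$(find "$src" -type f | wc -l | tr -d ' ')
    n_dst=$(find "$dst" -type f | wc -l | tr -d ' ')
    if [ "$n_src" -ne "$n_dst" ]; then
        echo "MISMATCH: $n_src files on primary, $n_dst on secondary" >&2
        return 1
    fi
    echo "OK: $n_src certificate files in sync"
}

# Demo with throwaway directories standing in for the cert trees.
primary=$(mktemp -d)
secondary=$(mktemp -d)
touch "$primary/fullchain.pem" "$primary/privkey.pem"
sync_and_verify "$primary" "$secondary"   # → OK: 2 certificate files in sync
rm -rf "$primary" "$secondary"
```

Counting files on both sides gives the cron job a cheap post-sync check that maps directly onto Task 2.2's "verify certificate count matches" item.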
---

## NPMplus HA Implementation Tasks

### Phase 1: Prepare Secondary NPMplus Instance

#### Task 1.1: Create Secondary NPMplus Container

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Download Alpine 3.22 template on r630-02
- [ ] Create container VMID 10234 with:
  - Hostname: `npmplus-secondary`
  - IP: `192.168.11.167/24`
  - Memory: 1024 MB
  - Cores: 2
  - Disk: 5 GB
  - Features: nesting=1, unprivileged=1
- [ ] Start the container and verify it's running
- [ ] Document container creation in the deployment log

**Commands**:

```bash
# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"

pveam download local alpine-3.22-default_20241208_amd64.tar.xz

pct create $CTID \
  local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
  --hostname $HOSTNAME \
  --memory 1024 \
  --cores 2 \
  --rootfs local-lvm:5 \
  --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
  --unprivileged 1 \
  --features nesting=1

pct start $CTID
```

---

#### Task 1.2: Install NPMplus on Secondary Instance

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] SSH to r630-02 and enter the container
- [ ] Install dependencies: `tzdata`, `gawk`, `yq`, `docker`, `docker-compose`, `curl`, `bash`, `rsync`
- [ ] Start and enable the Docker service
- [ ] Download the NPMplus compose.yaml from GitHub
- [ ] Configure timezone: `America/New_York`
- [ ] Configure ACME email: `nsatoshi2007@hotmail.com`
- [ ] Start the NPMplus container (but don't configure it yet - will sync first)
- [ ] Wait for NPMplus to be healthy
- [ ] Retrieve the admin password and document it

**Commands**:

```bash
ssh root@192.168.11.12
pct exec 10234 -- ash

apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync
rc-service docker start
rc-update add docker default
sleep 5

cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"
yq -i "
  .services.npmplus.environment |= (map(select(. != \"TZ=*\" and . != \"ACME_EMAIL=*\")) + [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

docker compose up -d
```

---

#### Task 1.3: Configure Secondary Container Network

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes

**Actions Required**:
- [ ] Verify static IP assignment: `192.168.11.167`
- [ ] Verify gateway: `192.168.11.1`
- [ ] Test network connectivity to the primary host
- [ ] Test network connectivity to the backend VMs
- [ ] Document the network configuration

**Commands**:

```bash
pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166
```

---

### Phase 2: Set Up Certificate Synchronization

#### Task 2.1: Create Certificate Sync Script

**Status**: ✅ **COMPLETED**
**Location**: `scripts/npmplus/sync-certificates.sh`
**Note**: Script already created, needs testing

**Actions Required**:
- [ ] Test the certificate sync script manually
- [ ] Verify certificates sync correctly
- [ ] Verify the script handles errors gracefully
- [ ] Document certificate paths for both primary and secondary

---

#### Task 2.2: Set Up Automated Certificate Sync

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Add a cron job on the primary Proxmox host (r630-01)
- [ ] Configure it to run every 5 minutes
- [ ] Set up log rotation for `/var/log/npmplus-cert-sync.log`
- [ ] Test cron job execution
- [ ] Monitor logs for successful syncs
- [ ] Verify the certificate count matches between primary and secondary

**Commands**:

```bash
# On r630-01
crontab -e
# Add:
# */5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1

# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
```

---

### Phase 3: Set Up Keepalived for Virtual IP
#### Task 3.1: Install Keepalived on Proxmox Hosts

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes

**Actions Required**:
- [ ] Install Keepalived on r630-01 (primary)
- [ ] Install Keepalived on r630-02 (secondary)
- [ ] Verify the Keepalived installation
- [ ] Check firewall rules for VRRP (multicast 224.0.0.0/8)

**Commands**:

```bash
# On both hosts
apt update
apt install -y keepalived

# Verify installation
keepalived --version
```

---

#### Task 3.2: Configure Keepalived on Primary Host (r630-01)

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes

**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with the MASTER configuration
- [ ] Set virtual_router_id: 51
- [ ] Set priority: 110
- [ ] Configure auth_pass (use a secure password)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference the health check script path
- [ ] Reference the notification script path
- [ ] Verify the configuration syntax
- [ ] Document the Keepalived configuration

**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)

---

#### Task 3.3: Configure Keepalived on Secondary Host (r630-02)

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes

**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with the BACKUP configuration
- [ ] Set virtual_router_id: 51 (must match the primary)
- [ ] Set priority: 100 (lower than the primary)
- [ ] Configure auth_pass (must match the primary)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference the health check script path
- [ ] Reference the notification script path
- [ ] Verify the configuration syntax
- [ ] Document the Keepalived configuration

**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)

---

#### Task 3.4: Create Health Check Script

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Create `/usr/local/bin/check-npmplus-health.sh` on both hosts
- [ ] The script should:
  - Detect the hostname to determine which VMID to check
  - Check if the container is running
  - Check if the NPMplus Docker container is healthy
  - Check if the NPMplus web interface responds (port 81)
  - Return exit code 0 if healthy, 1 if unhealthy
- [ ] Make the script executable: `chmod +x`
- [ ] Test the script manually on both hosts
- [ ] Verify the script detects failures correctly

**File**: `/usr/local/bin/check-npmplus-health.sh`
**Details**: See the HA guide for the full script content

---

#### Task 3.5: Create Keepalived Notification Script

**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Create `/usr/local/bin/keepalived-notify.sh` on both hosts
- [ ] The script should handle the states: master, backup, fault
- [ ] Log state changes to `/var/log/keepalived-notify.log`
- [ ] Optional: Send alerts (email, webhook) on the fault state
- [ ] Make the script executable: `chmod +x`
- [ ] Test the script with each state manually

**File**: `/usr/local/bin/keepalived-notify.sh`
**Details**: See the HA guide for the full script content

---

#### Task 3.6: Start and Enable Keepalived

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Enable the Keepalived service on both hosts
- [ ] Start Keepalived on both hosts
- [ ] Verify Keepalived is running
- [ ] Verify the primary host owns the VIP (192.168.11.166)
- [ ] Verify the secondary host is in the BACKUP state
- [ ] Monitor Keepalived logs for any errors
- [ ] Document the VIP ownership verification

**Commands**:

```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived

# Check VIP ownership (should be on the primary)
ip addr show vmbr0 | grep 192.168.11.166

# Check logs
journalctl -u keepalived -f
```

---

### Phase 4: Sync Configuration to Secondary

#### Task 4.1: Export Primary Configuration

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Create the export script: `scripts/npmplus/export-primary-config.sh`
- [ ] Export the NPMplus SQLite database to a SQL dump
- [ ] Export proxy hosts via the API (JSON)
- [ ] Export certificates via the API (JSON)
- [ ] Create a timestamped backup directory
- [ ] Verify all exports completed successfully
- [ ] Document the backup location and contents

**Script to Create**: `scripts/npmplus/export-primary-config.sh`
**Details**: See the HA guide for the full script content

---

#### Task 4.2: Import Configuration to Secondary

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Create the import script: `scripts/npmplus/import-secondary-config.sh`
- [ ] Stop the NPMplus container on the secondary (if running)
- [ ] Copy the database SQL dump to the secondary
- [ ] Import the database dump into the secondary NPMplus
- [ ] Restart the NPMplus container on the secondary
- [ ] Wait for NPMplus to be healthy
- [ ] Verify the proxy hosts are configured
- [ ] Verify the certificates are accessible
- [ ] Document any manual configuration steps needed

**Script to Create**: `scripts/npmplus/import-secondary-config.sh`
**Details**: See the HA guide for the full script content
**Note**: Some configuration may need manual replication via the API or UI.
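Tasks 1.2 and 4.2 both include a "wait for NPMplus to be healthy" step. That can be sketched as a generic retry loop like the one below; the probe command is passed in as arguments so the sketch runs anywhere. The `curl` probe shown in the comment is an assumption, not a documented NPMplus endpoint.

```shell
#!/usr/bin/env bash
# Illustrative sketch: retry a health probe until it succeeds or a
# deadline passes, for the "wait for NPMplus to be healthy" steps.
set -euo pipefail

wait_healthy() {
    local timeout="$1"; shift    # seconds to keep trying; the rest is the probe command
    local waited=0
    until "$@"; do
        waited=$((waited + 2))
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up after ${timeout}s" >&2
            return 1
        fi
        sleep 2
    done
    echo "healthy after ~${waited}s"
}

# Demo with a probe that succeeds immediately; a real call might look like:
#   wait_healthy 120 curl -skf https://192.168.11.167:81/ -o /dev/null
wait_healthy 30 true             # → healthy after ~0s
```

Returning a nonzero exit code on timeout lets the import script abort before verifying proxy hosts against an instance that never came up.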
---

### Phase 5: Set Up Ongoing Configuration Sync

#### Task 5.1: Create Configuration Sync Script

**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Create the sync script: `scripts/npmplus/sync-config.sh`
- [ ] Authenticate to the NPMplus API (primary)
- [ ] Export the proxy hosts configuration
- [ ] Implement API-based sync or document the manual sync process
- [ ] Add the script to automation (if automated sync is possible)
- [ ] Document manual sync procedures for configuration changes

**Script to Create**: `scripts/npmplus/sync-config.sh`
**Note**: Fully automated sync requires a shared database or complex API sync. For now, manual sync may be required.

---

### Phase 6: Testing and Validation

#### Task 6.1: Test Virtual IP Failover

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Verify the primary owns the VIP before the test
- [ ] Simulate a primary failure (stop Keepalived or the NPMplus container)
- [ ] Verify the VIP moves to the secondary within 5-10 seconds
- [ ] Test connectivity to the VIP from an external source
- [ ] Restore the primary and verify failback
- [ ] Document the failover time (should be < 10 seconds)
- [ ] Test multiple failover scenarios
- [ ] Document the test results

**Test Scenarios**:
1. Stop Keepalived on the primary
2. Stop the NPMplus container on the primary
3. Stop the entire Proxmox host (if possible in a test environment)
4. Network partition (if possible in a test environment)

---

#### Task 6.2: Test Certificate Access

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Verify certificates exist on the secondary (after sync)
- [ ] Test the SSL endpoint from external: `curl -vI https://explorer.d-bis.org`
- [ ] Verify the certificate is valid and trusted
- [ ] Test multiple domains with SSL
- [ ] Verify the certificate expiration dates match
- [ ] Test certificate auto-renewal on the secondary (when the primary renews)
- [ ] Document the certificate test results

**Commands**:

```bash
# Verify certificates on the secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoints
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org
```

---

#### Task 6.3: Test Proxy Host Functionality

**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Test each domain from external after failover
- [ ] Verify HTTP to HTTPS redirects work
- [ ] Verify WebSocket connections work (for RPC endpoints)
- [ ] Verify API endpoints respond correctly
- [ ] Test all 19+ domains
- [ ] Document any domains that don't work correctly
- [ ] Test with the secondary as the active instance
- [ ] Test failback to the primary

**Test Domains**:
- All d-bis.org domains (9 domains)
- All mim4u.org domains (4 domains)
- All sankofa.nexus domains (5 domains)
- defi-oracle.io domain (1 domain)

---

### Monitoring and Maintenance

#### Task 7.1: Set Up HA Status Monitoring

**Status**: ✅ **COMPLETED** (script created, needs deployment)
**Priority**: 🟡 **Important**
**Location**: `scripts/npmplus/monitor-ha-status.sh`

**Actions Required**:
- [ ] Add a cron job for HA status monitoring (every 5 minutes)
- [ ] Configure log rotation for `/var/log/npmplus-ha-monitor.log`
- [ ] Test the monitoring script manually
- [ ] Optional: Integrate with an alerting system (email, webhook)
- [ ] Document alert thresholds and escalation procedures
- [ ] Test alert generation

**Commands**:

```bash
# On the primary Proxmox host
crontab -e
# Add:
# */5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```

---

#### Task 7.2: Document Manual Failover Procedures

**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Document the step-by-step manual failover procedure
- [ ] Document how to force failover to the secondary
- [ ] Document how to force failback to the primary
- [ ] Document troubleshooting steps for common issues
- [ ] Create a runbook for the operations team
- [ ] Test the manual failover procedures
- [ ] Review and approve the documentation

**Location**: Add to the `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md` troubleshooting section

---

#### Task 7.3: Test All Failover Scenarios

**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 2 hours

**Actions Required**:
- [ ] Test automatic failover (primary failure)
- [ ] Test automatic failback (primary recovery)
- [ ] Test manual failover (force to secondary)
- [ ] Test manual failback (force to primary)
- [ ] Test partial failure (Keepalived down but NPMplus up)
- [ ] Test network partition scenarios
- [ ] Test during high traffic (if possible)
- [ ] Document all test results
- [ ] Identify and fix any issues found

---

## HA Implementation Summary

### Total Estimated Time

- **Phase 1**: 1.5 hours (container creation and NPMplus installation)
- **Phase 2**: 30 minutes (certificate sync setup)
- **Phase 3**: 2 hours (Keepalived configuration and scripts)
- **Phase 4**: 1.5 hours (configuration export/import)
- **Phase 5**: 45 minutes (ongoing sync setup)
- **Phase 6**: 2 hours (testing and validation)
- **Monitoring**: 1 hour (monitoring setup and documentation)

**Total**: ~9 hours of implementation time

### Prerequisites Checklist

- [ ] Secondary Proxmox host available (r630-02 or ml110)
- [ ] Network connectivity between hosts verified
- [ ] Sufficient resources on the secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
- [ ] SSH access configured between hosts (key-based auth recommended)
- [ ] Maintenance window scheduled
- [ ] Backup of the primary NPMplus completed
- [ ] Team notified of the maintenance window

### Risk Mitigation

- [ ] Rollback plan documented and tested
- [ ] Primary NPMplus backup verified before changes
- [ ] Test environment available (if possible)
- [ ] Monitoring in place before production deployment
- [ ] Emergency contact list available

---

**Last Updated**: 2026-01-20
**Next Review**: After addressing critical items