Verification Scripts and Documentation - Gaps and TODOs

Last Updated: 2026-03-02
Document Version: 1.0
Status: Active Documentation


Date: 2026-01-20
Status: Gap Analysis Complete
Purpose: Identify all placeholders, missing components, and incomplete implementations

Documentation note (2026-03-02): Runbook placeholders (e.g. your-token, your-password) are intentional examples. In production, use values sourced from .env only; never commit secrets. INGRESS_VERIFICATION_RUNBOOK.md has been updated with a production note in its Prerequisites section. The other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep their example placeholders; operators should source real values from .env when running commands.


Critical Missing Components

1. Missing Script: scripts/verify/backup-npmplus.sh

Status: CREATED (scripts/verify/backup-npmplus.sh)
Referenced in:

  • docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md (lines 39, 150, 437, 480)

Required Functionality:

  • Automated backup of NPMplus database (/data/database.sqlite)
  • Export of proxy hosts via API
  • Export of certificates via API
  • Certificate file backup from disk
  • Compression and timestamping
  • Configurable backup destination

Action Required: None - the script exists; verify it covers all backup procedures documented in NPMPLUS_BACKUP_RESTORE.md and keep the two in sync.
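
The required functionality above can be sketched roughly as follows. This is a hedged outline, not the committed implementation: the docker container name (npmplus), the API routes (borrowed from upstream Nginx Proxy Manager conventions), and the NPM_TOKEN / BACKUP_DEST variables are assumptions to confirm against the real script.

```shell
#!/usr/bin/env bash
# Sketch of scripts/verify/backup-npmplus.sh -- assumptions: docker container
# is named "npmplus"; NPM_URL and NPM_TOKEN come from .env.
set -u

backup_name() {
    # Timestamped archive name, e.g. npmplus-backup-20260302-113700.tar.gz
    printf 'npmplus-backup-%s.tar.gz' "$(date +%Y%m%d-%H%M%S)"
}

backup_npmplus() {
    local dest="${BACKUP_DEST:-/var/backups/npmplus}"
    local work; work="$(mktemp -d)"
    mkdir -p "$dest"

    # 1. Consistent SQLite snapshot of /data/database.sqlite
    docker exec npmplus sqlite3 /data/database.sqlite ".backup /data/database.sqlite.bak"
    docker cp npmplus:/data/database.sqlite.bak "$work/database.sqlite"

    # 2-3. Export proxy hosts and certificates via the API
    curl -ks -H "Authorization: Bearer $NPM_TOKEN" \
        "$NPM_URL/api/nginx/proxy-hosts" > "$work/proxy_hosts.json"
    curl -ks -H "Authorization: Bearer $NPM_TOKEN" \
        "$NPM_URL/api/nginx/certificates" > "$work/certificates.json"

    # 4. Certificate files from disk
    docker cp npmplus:/data/tls "$work/tls"

    # 5-6. Compress with timestamp into the configurable destination
    tar -czf "$dest/$(backup_name)" -C "$work" .
    rm -rf "$work"
}
# Run: BACKUP_DEST=/var/backups/npmplus backup_npmplus
```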


Placeholders and TBD Values

2. Nginx Config Paths - TBD Values

Location: scripts/verify/verify-backend-vms.sh

Status: RESOLVED - Paths set in scripts/verify/verify-backend-vms.sh:

  • VMID 10130: /etc/nginx/sites-available/dbis-frontend
  • VMID 2400: /etc/nginx/sites-available/thirdweb-rpc

Required Actions (if paths differ on actual VMs):

  1. VMID 10130 (dbis-frontend):

    • Determine actual nginx config path
    • Common locations: /etc/nginx/sites-available/dbis-frontend or /etc/nginx/sites-available/dbis-admin
    • Update script with actual path
    • Verify config exists and is enabled
  2. VMID 2400 (thirdweb-rpc-1):

    • Determine actual nginx config path
    • Common locations: /etc/nginx/sites-available/thirdweb-rpc or /etc/nginx/sites-available/rpc
    • Update script with actual path
    • Verify config exists and is enabled

Impact: If these paths are wrong on the actual VMs, the script will skip nginx config verification for them until the paths are corrected.
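
Discovering the actual path can be automated with a small helper like the one below. This is a hypothetical addition to verify-backend-vms.sh, not existing code; the optional root prefix exists so the function can be exercised locally (on a real VM, call it with an empty root or run it over ssh).

```shell
# Pick the first nginx config candidate that actually exists under $root.
find_nginx_config() {
    local root="$1"; shift
    local name
    for name in "$@"; do
        if [ -f "$root/etc/nginx/sites-available/$name" ]; then
            printf '/etc/nginx/sites-available/%s\n' "$name"
            return 0
        fi
    done
    echo "no candidate nginx config found" >&2
    return 1
}
# On the VM itself: find_nginx_config "" dbis-frontend dbis-admin
```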


3. Sankofa Cutover Plan - Target Placeholders

Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md

Placeholders to Replace (once Sankofa services are deployed):

  • <TARGET_IP> (appears 10 times)
  • <TARGET_PORT> (appears 10 times)
  • ⚠️ TBD values in table (lines 60-64)

Domain-Specific Targets Needed:

| Domain | Current (Wrong) | Target (TBD) |
|---|---|---|
| sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| the-order.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |

Action Required: Update placeholders with actual Sankofa service IPs and ports once deployed.


Documentation Placeholders

4. Generic Placeholders in Runbooks

Location: Multiple files

Replacements Needed:

INGRESS_VERIFICATION_RUNBOOK.md:

  • Line 23: CLOUDFLARE_API_TOKEN="your-token" → Should reference .env file
  • Line 25: CLOUDFLARE_EMAIL="your-email" → Should reference .env file
  • Line 26: CLOUDFLARE_API_KEY="your-key" → Should reference .env file
  • Line 31: NPM_PASSWORD="your-password" → Should reference .env file
  • Lines 91, 101, 213: Similar placeholders in examples

Note: These are intentional examples, but should be clearly marked as such and reference .env file usage.

NPMPLUS_BACKUP_RESTORE.md:

  • Line 84: NPM_PASSWORD="your-password" → Example placeholder (acceptable)
  • Line 304: NPM_PASSWORD="your-password" → Example placeholder (acceptable)

SANKOFA_CUTOVER_PLAN.md:

  • Line 125: NPM_PASSWORD="your-password" → Example placeholder (acceptable)
  • Line 178: NPM_PASSWORD="your-password" → Example placeholder (acceptable)

Status (2026-03-02): Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and production should source from .env.


5. Source of Truth JSON - Verifier Field

Location: docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json (line 5)

Current: "verifier": "operator-name"

Expected: Should be dynamically set by script using $USER or actual operator name.

Status: HANDLED - The generate-source-of-truth.sh script uses env.USER // "unknown" which is correct. The example JSON file is just a template.

Action Required: None - script implementation is correct.


Implementation Gaps

6. Source of Truth Generation - File Path Dependencies

Location: scripts/verify/generate-source-of-truth.sh

Potential Issues:

  • Script expects specific output file names from verification scripts
  • If verification scripts don't run first, JSON will be empty or have defaults
  • No validation that source files exist before parsing

Expected File Dependencies:

$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json

Action Required:

  • Add file existence checks before parsing
  • Provide clear error messages if dependencies are missing
  • Add option to generate partial source-of-truth if some verifications haven't run
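
The file-existence gate described above could be sketched as a small bash function (compgen handles the glob patterns); this is a suggestion, not the current script's code.

```shell
# Refuse to build the source-of-truth JSON when expected evidence is missing.
require_evidence() {
    local missing=0 pattern
    for pattern in "$@"; do
        if ! compgen -G "$pattern" > /dev/null; then
            echo "ERROR: no evidence matching '$pattern' -- run the corresponding verification script first" >&2
            missing=1
        fi
    done
    return "$missing"
}
# Example gate before parsing:
# require_evidence "$EVIDENCE_DIR"/dns-verification-*/all_dns_records.json \
#                  "$EVIDENCE_DIR"/npmplus-verification-*/proxy_hosts.json || exit 1
```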

7. Backend VM Verification - Service-Specific Checks

Location: scripts/verify/verify-backend-vms.sh

Gaps Identified:

  1. Besu RPC VMs (2101, 2201):

    • Script checks for RPC endpoints but doesn't verify Besu-specific health checks
    • Should test actual RPC calls (e.g., eth_chainId) not just HTTP status
    • WebSocket port (8546) verification is minimal
  2. Node.js API VMs (10150, 10151):

    • Only checks port 3000 is listening
    • Doesn't verify API health endpoint exists
    • Should test actual API endpoint (e.g., /health or /api/health)
  3. Blockscout VM (5000):

    • Checks nginx on port 80 and Blockscout on port 4000
    • Should verify Blockscout API is responding (e.g., /api/health)

Action Required:

  • Add service-specific health check functions
  • Implement actual RPC/API endpoint testing beyond port checks
  • Document expected health check endpoints per service type
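
The service-specific checks could take roughly this shape. The /health path and the JSON-RPC response handling are assumptions to confirm per service; the point is to issue a real request rather than only testing the port.

```shell
# True if stdin is a JSON-RPC response carrying a hex result (e.g. from eth_chainId).
rpc_has_result() {
    grep -q '"result"[[:space:]]*:[[:space:]]*"0x'
}

# Besu probe: POST eth_chainId and inspect the result, not just the HTTP status.
check_besu_rpc() {
    local host="$1" port="${2:-8545}"
    curl -s --max-time 5 -X POST "http://$host:$port" \
        -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
        | rpc_has_result
}

# Node.js API probe: expect HTTP 200 from a health endpoint.
check_api_health() {
    local host="$1" port="${2:-3000}" path="${3:-/health}"
    [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$host:$port$path")" = "200" ]
}
```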

8. End-to-End Routing - WebSocket Testing

Location: scripts/verify/verify-end-to-end-routing.sh

Current Implementation:

  • Basic WebSocket connectivity check using TCP connection test
  • Manual wscat test recommended but not automated
  • No actual WebSocket handshake or message exchange verification

Gap:

  • WebSocket tests are minimal (just TCP connection)
  • No verification that WebSocket protocol upgrade works correctly
  • No test of actual RPC WebSocket messages

Action Required:

  • Add automated WebSocket handshake test (if wscat is available)
  • Or add clear documentation that WebSocket testing requires manual verification
  • Consider adding automated WebSocket test script if wscat or websocat is installed
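
If wscat/websocat are unavailable, the protocol-upgrade step alone can be probed with plain curl, as sketched below. This verifies only the HTTP 101 switch, not message exchange; the Sec-WebSocket-Key is a fixed sample nonce, which is acceptable for a handshake test.

```shell
is_switching_protocols() { [ "$1" = "101" ]; }

check_ws_upgrade() {
    local url="$1" code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
        -H 'Sec-WebSocket-Version: 13' \
        -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
        "$url")" || true
    is_switching_protocols "$code"
}
# e.g. check_ws_upgrade http://192.168.11.166:8546/ && echo "WS upgrade OK"
```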

Configuration Gaps

9. Environment Variable Documentation

Missing: Comprehensive .env.example file listing all required variables

Required Variables (from scripts):

# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=

# Public IP
PUBLIC_IP=76.53.10.36

# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233

# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11

Action Required: Create .env.example file in project root with all required variables.


10. Script Dependencies Documentation

Missing: List of required system dependencies

Required Tools (used across scripts):

  • bash (4.0+)
  • curl (for API calls)
  • jq (for JSON parsing)
  • dig (for DNS resolution)
  • openssl (for SSL certificate inspection)
  • ssh (for remote execution)
  • ss (for port checking)
  • systemctl (for service status)
  • sqlite3 (for database backup)

Optional Tools:

  • wscat or websocat (for WebSocket testing)

Action Required:

  • Add dependencies section to INGRESS_VERIFICATION_RUNBOOK.md
  • Create scripts/verify/README.md with installation instructions
  • Add dependency check function to run-full-verification.sh
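
The dependency check for run-full-verification.sh could be a small preflight function like this sketch (the Debian package hint in the error message is a suggestion):

```shell
# Fail fast if any required CLI tool is missing (bash).
require_tools() {
    local missing=() t
    for t in "$@"; do
        command -v "$t" > /dev/null 2>&1 || missing+=("$t")
    done
    if [ "${#missing[@]}" -gt 0 ]; then
        echo "ERROR: missing required tools: ${missing[*]}" >&2
        echo "Install them first (Debian: apt install curl jq dnsutils openssl iproute2 sqlite3)" >&2
        return 1
    fi
}
# Preflight with the tools listed above:
# require_tools bash curl jq dig openssl ssh ss systemctl sqlite3 || exit 1
```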

Data Completeness Gaps

11. Source of Truth JSON - Hardcoded Values

Location: scripts/verify/generate-source-of-truth.sh (lines 169-177)

Current: NPMplus container info is hardcoded:

"container": {
    "vmid": 10233,
    "host": "r630-01",
    "host_ip": "192.168.11.11",
    "internal_ips": {
        "eth0": "192.168.11.166",
        "eth1": "192.168.11.167"
    },
    "management_ui": "https://192.168.11.166:81",
    "status": "running"
}

Gap: Status should be dynamically determined from verification results.

Action Required:

  • Make container status dynamic based on export-npmplus-config.sh results
  • Verify IP addresses are correct (especially eth1)
  • Document if eth1 is actually used or is a placeholder
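
Making the status dynamic could look like the sketch below: query the Proxmox host for the container state and fall back to "unknown" on any failure (the ssh-as-root access pattern matches the commands used elsewhere in this document).

```shell
# Extract the state from `pct status <vmid>` output ("status: running").
parse_pct_status() {
    awk '/^status:/ { print $2; found = 1 } END { if (!found) print "unknown" }'
}

container_status() {
    # Query the Proxmox host over SSH; any failure yields "unknown".
    ssh -o ConnectTimeout=5 "root@$1" "pct status $2" 2>/dev/null | parse_pct_status
}
# e.g. STATUS="$(container_status 192.168.11.11 10233)"
```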

12. DNS Verification - Zone ID Lookup

Location: scripts/verify/export-cloudflare-dns-records.sh

Current: Attempts to fetch zone IDs if not provided in .env, but has fallback to empty string.

Potential Issue: If zone ID lookup fails and .env doesn't have zone IDs, script will fail silently or skip zones.

Action Required:

  • Add validation that zone IDs are set (either from .env or from API lookup)
  • Fail clearly if zone ID cannot be determined
  • Provide helpful error message with instructions
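
The zone ID validation could be as simple as the sketch below (bash indirect expansion); it fails loudly with a hint instead of silently skipping the zone.

```shell
# Fail with instructions when a zone ID variable is empty or unset.
require_zone_id() {
    local var="$1" value="${!1:-}"
    if [ -z "$value" ]; then
        echo "ERROR: $var is empty." >&2
        echo "Set it in .env, or check that CLOUDFLARE_API_TOKEN can list zones via the API." >&2
        return 1
    fi
}
# require_zone_id CLOUDFLARE_ZONE_ID_D_BIS_ORG || exit 1
```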

Documentation Completeness

13. Missing Troubleshooting Sections

Location: docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md

Current: Basic troubleshooting section exists (lines 427-468) but could be expanded.

Missing Topics:

  • What to do if verification scripts fail partially
  • How to interpret "unknown" status vs "needs-fix" status
  • How to manually verify items that scripts can't automate
  • Common Cloudflare API errors and solutions
  • Common NPMplus API authentication issues
  • SSH connection failures to Proxmox hosts

Action Required: Expand troubleshooting section with more scenarios.


14. Missing Rollback Procedures

Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md

Current: Basic rollback steps exist (lines 330-342) but could be more detailed.

Missing:

  • Automated rollback script reference
  • Exact commands to restore previous NPMplus configuration
  • How to verify rollback was successful
  • Recovery time expectations

Action Required:

  • Create scripts/verify/rollback-sankofa-routing.sh (optional but recommended)
  • Or expand manual rollback steps with exact API calls

Priority Summary

🔴 Critical (Must Fix Before Production Use)

  1. Create scripts/verify/backup-npmplus.sh - Referenced but missing
  2. Resolve TBD nginx config paths (VMID 10130, 2400) - Blocks verification
  3. Add file dependency validation in generate-source-of-truth.sh

🟡 Important (Should Fix Soon)

  1. Add .env.example file with all required variables
  2. Add dependency checks to verification scripts
  3. Expand service-specific health checks for Besu, Node.js, Blockscout
  4. Document WebSocket testing limitations or automate it

🟢 Nice to Have (Can Wait)

  1. Expand troubleshooting section with more scenarios
  2. Create rollback script for Sankofa cutover
  3. Add dependency installation guide to runbook
  4. Make container status dynamic in source-of-truth generation

Notes

  • Placeholders in examples: Most "your-password", "your-token" placeholders in documentation are intentional examples and acceptable, but should clearly reference .env file usage.
  • Sankofa placeholders: <TARGET_IP> and <TARGET_PORT> are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
  • TBD config paths: These need to be discovered by running verification and inspecting actual VMs.


Additional Items Completed

15. NPMplus High Availability (HA) Setup Guide ADDED

Status: DOCUMENTATION COMPLETE - Implementation pending
Location: docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md

What Was Added:

  • Complete HA architecture guide (Active-Passive with Keepalived)
  • Step-by-step implementation instructions (6 phases)
  • Helper scripts: sync-certificates.sh, monitor-ha-status.sh
  • Testing and validation procedures
  • Troubleshooting guide
  • Rollback plan
  • Future upgrade path to Active-Active

Scripts Created:

  • scripts/npmplus/sync-certificates.sh - Synchronize certificates from primary to secondary
  • scripts/npmplus/monitor-ha-status.sh - Monitor HA status and send alerts

Impact: Eliminates single point of failure for NPMplus, enables automatic failover.


NPMplus HA Implementation Tasks

Phase 1: Prepare Secondary NPMplus Instance

Task 1.1: Create Secondary NPMplus Container

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Download Alpine 3.22 template on r630-02
  • Create container VMID 10234 with:
    • Hostname: npmplus-secondary
    • IP: 192.168.11.167/24
    • Memory: 1024 MB
    • Cores: 2
    • Disk: 5 GB
    • Features: nesting=1, unprivileged=1
  • Start container and verify it's running
  • Document container creation in deployment log

Commands:

# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"

pveam download local alpine-3.22-default_20241208_amd64.tar.xz

pct create $CTID \
    local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
    --hostname $HOSTNAME \
    --memory 1024 \
    --cores 2 \
    --rootfs local-lvm:5 \
    --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
    --unprivileged 1 \
    --features nesting=1

pct start $CTID

Task 1.2: Install NPMplus on Secondary Instance

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • SSH to r630-02 and enter container
  • Install dependencies: tzdata, gawk, yq, docker, docker-compose, curl, bash, rsync
  • Start and enable Docker service
  • Download NPMplus compose.yaml from GitHub
  • Configure timezone: America/New_York
  • Configure ACME email: nsatoshi2007@hotmail.com
  • Start NPMplus container (but don't configure yet - will sync first)
  • Wait for NPMplus to be healthy
  • Retrieve admin password and document it

Commands:

ssh root@192.168.11.12
pct exec 10234 -- ash

apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

rc-service docker start
rc-update add docker default
sleep 5

cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

# Replace any existing TZ/ACME_EMAIL entries. Note: yq's != is a plain string
# comparison (globs like "TZ=*" never match); use a regex test instead.
yq -i "
  .services.npmplus.environment |=
    (map(select(test(\"^(TZ|ACME_EMAIL)=\") | not)) +
    [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

docker compose up -d

Task 1.3: Configure Secondary Container Network

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes

Actions Required:

  • Verify static IP assignment: 192.168.11.167
  • Verify gateway: 192.168.11.1
  • Test network connectivity to primary host
  • Test network connectivity to backend VMs
  • Document network configuration

Commands:

pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166

Phase 2: Set Up Certificate Synchronization

Task 2.1: Create Certificate Sync Script

Status: COMPLETED
Location: scripts/npmplus/sync-certificates.sh
Note: Script already created, needs testing

Actions Required:

  • Test certificate sync script manually
  • Verify certificates sync correctly
  • Verify script handles errors gracefully
  • Document certificate paths for both primary and secondary

Task 2.2: Set Up Automated Certificate Sync

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes

Actions Required:

  • Add cron job on primary Proxmox host (r630-01)
  • Configure to run every 5 minutes
  • Set up log rotation for /var/log/npmplus-cert-sync.log
  • Test cron job execution
  • Monitor logs for successful syncs
  • Verify certificate count matches between primary and secondary

Commands:

# On r630-01
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1

# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
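
For the log-rotation item above, a minimal /etc/logrotate.d/npmplus-cert-sync fragment could look like this (the weekly cadence and retention count are suggestions, not requirements):

```
/var/log/npmplus-cert-sync.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```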

Phase 3: Set Up Keepalived for Virtual IP

Task 3.1: Install Keepalived on Proxmox Hosts

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes

Actions Required:

  • Install Keepalived on r630-01 (primary)
  • Install Keepalived on r630-02 (secondary)
  • Verify Keepalived installation
  • Check firewall rules allow VRRP (IP protocol 112, multicast 224.0.0.18)

Commands:

# On both hosts
apt update
apt install -y keepalived

# Verify installation
keepalived --version

Task 3.2: Configure Keepalived on Primary Host (r630-01)

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes

Actions Required:

  • Create /etc/keepalived/keepalived.conf with MASTER configuration
  • Set virtual_router_id: 51
  • Set priority: 110
  • Configure auth_pass (use secure password)
  • Configure virtual_ipaddress: 192.168.11.166/24
  • Reference health check script path
  • Reference notification script path
  • Verify configuration syntax
  • Document Keepalived configuration

Files to Create:

  • /etc/keepalived/keepalived.conf (see HA guide for full config)
  • /usr/local/bin/check-npmplus-health.sh (Task 3.4)
  • /usr/local/bin/keepalived-notify.sh (Task 3.5)
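
A sketch of what the MASTER-side keepalived.conf could look like, using the values from this task (interface vmbr0 matches the VIP check commands later in this document; auth_pass and the tracking weights are placeholders to adjust):

```
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
    weight -30
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
```

With weight -30, a failed health check drops the effective priority to 80, below the secondary's 100, which triggers failover; the BACKUP config (Task 3.3) differs only in state and priority.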

Task 3.3: Configure Keepalived on Secondary Host (r630-02)

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes

Actions Required:

  • Create /etc/keepalived/keepalived.conf with BACKUP configuration
  • Set virtual_router_id: 51 (must match primary)
  • Set priority: 100 (lower than primary)
  • Configure auth_pass (must match primary)
  • Configure virtual_ipaddress: 192.168.11.166/24
  • Reference health check script path
  • Reference notification script path
  • Verify configuration syntax
  • Document Keepalived configuration

Files to Create:

  • /etc/keepalived/keepalived.conf (see HA guide for full config)
  • /usr/local/bin/check-npmplus-health.sh (Task 3.4)
  • /usr/local/bin/keepalived-notify.sh (Task 3.5)

Task 3.4: Create Health Check Script

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Create /usr/local/bin/check-npmplus-health.sh on both hosts
  • Script should:
    • Detect hostname to determine which VMID to check
    • Check if container is running
    • Check if NPMplus Docker container is healthy
    • Check if NPMplus web interface responds (port 81)
    • Return exit code 0 if healthy, 1 if unhealthy
  • Make script executable: chmod +x
  • Test script manually on both hosts
  • Verify script detects failures correctly

File: /usr/local/bin/check-npmplus-health.sh
Details: See HA guide for full script content
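
A minimal sketch of the health check logic (the HA guide has the authoritative version); the VMID-per-host mapping and the "npmplus" docker container name are assumptions:

```shell
#!/usr/bin/env sh
# Sketch of /usr/local/bin/check-npmplus-health.sh

npmplus_vmid_for_host() {
    case "$1" in
        r630-01) echo 10233 ;;
        r630-02) echo 10234 ;;
        *) return 1 ;;
    esac
}

check_health() {
    vmid="$(npmplus_vmid_for_host "$(hostname)")" || return 1
    # Container running?
    pct status "$vmid" | grep -q 'status: running' || return 1
    # NPMplus docker container healthy?
    pct exec "$vmid" -- docker inspect -f '{{.State.Health.Status}}' npmplus 2>/dev/null \
        | grep -q healthy || return 1
    # Web interface answering on port 81?
    pct exec "$vmid" -- curl -ksf --max-time 3 https://127.0.0.1:81/ > /dev/null || return 1
}
# The installed script should end with:  check_health   (exit 0 = healthy, nonzero = trigger failover)
```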


Task 3.5: Create Keepalived Notification Script

Status: PENDING
Priority: 🟡 Important
Estimated Time: 15 minutes

Actions Required:

  • Create /usr/local/bin/keepalived-notify.sh on both hosts
  • Script should handle states: master, backup, fault
  • Log state changes to /var/log/keepalived-notify.log
  • Optional: Send alerts (email, webhook) on fault state
  • Make script executable: chmod +x
  • Test script with each state manually

File: /usr/local/bin/keepalived-notify.sh
Details: See HA guide for full script content
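
A sketch of the notification script (the HA guide has the full version); Keepalived invokes the notify script with the type, instance name, and new state as arguments:

```shell
#!/usr/bin/env sh
# Sketch of /usr/local/bin/keepalived-notify.sh

log_transition() {
    # $1 = type (INSTANCE/GROUP), $2 = instance name, $3 = MASTER|BACKUP|FAULT
    printf '%s %s %s -> %s\n' "$(date -Is)" "$1" "$2" "$3"
}

notify() {
    log_transition "$@" >> /var/log/keepalived-notify.log
    case "$3" in
        FAULT) : "optional: send alert (email, webhook) here" ;;
    esac
}

# The installed script should end with:  notify "$@"
```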


Task 3.6: Start and Enable Keepalived

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes

Actions Required:

  • Enable Keepalived service on both hosts
  • Start Keepalived on both hosts
  • Verify Keepalived is running
  • Verify primary host owns VIP (192.168.11.166)
  • Verify secondary host is in BACKUP state
  • Monitor Keepalived logs for any errors
  • Document VIP ownership verification

Commands:

# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived

# Check VIP ownership (should be on primary)
ip addr show vmbr0 | grep 192.168.11.166

# Check logs
journalctl -u keepalived -f

Phase 4: Sync Configuration to Secondary

Task 4.1: Export Primary Configuration

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Create export script: scripts/npmplus/export-primary-config.sh
  • Export NPMplus SQLite database to SQL dump
  • Export proxy hosts via API (JSON)
  • Export certificates via API (JSON)
  • Create timestamped backup directory
  • Verify all exports completed successfully
  • Document backup location and contents

Script to Create: scripts/npmplus/export-primary-config.sh
Details: See HA guide for full script content


Task 4.2: Import Configuration to Secondary

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • Create import script: scripts/npmplus/import-secondary-config.sh
  • Stop NPMplus container on secondary (if running)
  • Copy database SQL dump to secondary
  • Import database dump into secondary NPMplus
  • Restart NPMplus container on secondary
  • Wait for NPMplus to be healthy
  • Verify proxy hosts are configured
  • Verify certificates are accessible
  • Document any manual configuration steps needed

Script to Create: scripts/npmplus/import-secondary-config.sh
Details: See HA guide for full script content

Note: Some configuration may need manual replication via API or UI.


Phase 5: Set Up Ongoing Configuration Sync

Task 5.1: Create Configuration Sync Script

Status: PENDING
Priority: 🟡 Important
Estimated Time: 45 minutes

Actions Required:

  • Create sync script: scripts/npmplus/sync-config.sh
  • Authenticate to NPMplus API (primary)
  • Export proxy hosts configuration
  • Implement API-based sync or document manual sync process
  • Add script to automation (if automated sync is possible)
  • Document manual sync procedures for configuration changes

Script to Create: scripts/npmplus/sync-config.sh
Note: Full automated sync requires shared database or complex API sync. For now, manual sync may be required.


Phase 6: Testing and Validation

Task 6.1: Test Virtual IP Failover

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Verify primary owns VIP before test
  • Simulate primary failure (stop Keepalived or NPMplus container)
  • Verify VIP moves to secondary within 5-10 seconds
  • Test connectivity to VIP from external source
  • Restore primary and verify failback
  • Document failover time (should be < 10 seconds)
  • Test multiple failover scenarios
  • Document test results

Test Scenarios:

  1. Stop Keepalived on primary
  2. Stop NPMplus container on primary
  3. Stop entire Proxmox host (if possible in test environment)
  4. Network partition (if possible in test environment)

Task 6.2: Test Certificate Access

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Verify certificates exist on secondary (after sync)
  • Test SSL endpoint from external: curl -vI https://explorer.d-bis.org
  • Verify certificate is valid and trusted
  • Test multiple domains with SSL
  • Verify certificate expiration dates match
  • Test certificate auto-renewal on secondary (when primary renews)
  • Document certificate test results

Commands:

# Verify certificates on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org

Task 6.3: Test Proxy Host Functionality

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • Test each domain from external after failover
  • Verify HTTP to HTTPS redirects work
  • Verify WebSocket connections work (for RPC endpoints)
  • Verify API endpoints respond correctly
  • Test all 19+ domains
  • Document any domains that don't work correctly
  • Test with secondary as active instance
  • Test failback to primary

Test Domains:

  • All d-bis.org domains (9 domains)
  • All mim4u.org domains (4 domains)
  • All sankofa.nexus domains (5 domains)
  • defi-oracle.io domain (1 domain)

Monitoring and Maintenance

Task 7.1: Set Up HA Status Monitoring

Status: COMPLETED (script created, needs deployment)
Priority: 🟡 Important
Location: scripts/npmplus/monitor-ha-status.sh

Actions Required:

  • Add cron job for HA status monitoring (every 5 minutes)
  • Configure log rotation for /var/log/npmplus-ha-monitor.log
  • Test monitoring script manually
  • Optional: Integrate with alerting system (email, webhook)
  • Document alert thresholds and escalation procedures
  • Test alert generation

Commands:

# On primary Proxmox host
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1

Task 7.2: Document Manual Failover Procedures

Status: PENDING
Priority: 🟡 Important
Estimated Time: 30 minutes

Actions Required:

  • Document step-by-step manual failover procedure
  • Document how to force failover to secondary
  • Document how to force failback to primary
  • Document troubleshooting steps for common issues
  • Create runbook for operations team
  • Test manual failover procedures
  • Review and approve documentation

Location: Add to docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md troubleshooting section


Task 7.3: Test All Failover Scenarios

Status: PENDING
Priority: 🟡 Important
Estimated Time: 2 hours

Actions Required:

  • Test automatic failover (primary failure)
  • Test automatic failback (primary recovery)
  • Test manual failover (force to secondary)
  • Test manual failback (force to primary)
  • Test partial failure (Keepalived down but NPMplus up)
  • Test network partition scenarios
  • Test during high traffic (if possible)
  • Document all test results
  • Identify and fix any issues found

HA Implementation Summary

Total Estimated Time

  • Phase 1: 1.5 hours (container creation and NPMplus installation)
  • Phase 2: 30 minutes (certificate sync setup)
  • Phase 3: 2 hours (Keepalived configuration and scripts)
  • Phase 4: 1.5 hours (configuration export/import)
  • Phase 5: 45 minutes (ongoing sync setup)
  • Phase 6: 2 hours (testing and validation)
  • Monitoring: 1 hour (monitoring setup and documentation)

Total: ~9 hours of implementation time

Prerequisites Checklist

  • Secondary Proxmox host available (r630-02 or ml110)
  • Network connectivity between hosts verified
  • Sufficient resources on secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
  • SSH access configured between hosts (key-based auth recommended)
  • Maintenance window scheduled
  • Backup of primary NPMplus completed
  • Team notified of maintenance window

Risk Mitigation

  • Rollback plan documented and tested
  • Primary NPMplus backup verified before changes
  • Test environment available (if possible)
  • Monitoring in place before production deployment
  • Emergency contact list available

Last Updated: 2026-01-20
Next Review: After addressing critical items