Verification Scripts and Documentation - Gaps and TODOs
Last Updated: 2026-03-02
Document Version: 1.0
Created: 2026-01-20
Status: Active Documentation - Gap Analysis Complete
Purpose: Identify all placeholders, missing components, and incomplete implementations
Documentation note (2026-03-02): Runbook placeholders (e.g. your-token, your-password) are intentional examples. In production, use values from .env only; do not commit secrets. INGRESS_VERIFICATION_RUNBOOK.md updated with a production note in Prerequisites. Other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep example placeholders; operators should source from .env when running commands.
Critical Missing Components
1. Missing Script: scripts/verify/backup-npmplus.sh
Status: ✅ CREATED (scripts/verify/backup-npmplus.sh)
Referenced in:
docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md (lines 39, 150, 437, 480)
Required Functionality:
- Automated backup of the NPMplus database (/data/database.sqlite)
- Export of proxy hosts via API
- Export of certificates via API
- Certificate file backup from disk
- Compression and timestamping
- Configurable backup destination
Action Required: None remaining - the script implements all backup procedures documented in NPMPLUS_BACKUP_RESTORE.md.
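The required functionality can be sketched as a single function. This is a hedged illustration, not the shipped script: the function and variable names (backup_npmplus, BACKUP_DEST, NPM_TOKEN) are illustrative, and the API paths assume the NPMplus endpoints described in NPMPLUS_BACKUP_RESTORE.md.

```bash
#!/usr/bin/env bash
# Sketch of what scripts/verify/backup-npmplus.sh does, per the list above.
# Assumptions: runs inside the NPMplus container; NPM_URL and NPM_TOKEN come from .env.
set -euo pipefail

backup_npmplus() {
  local dest="${BACKUP_DEST:-/var/backups/npmplus}"
  local stamp dir
  stamp="$(date +%Y%m%d-%H%M%S)"
  dir="$dest/npmplus-backup-$stamp"
  mkdir -p "$dir"

  # 1. SQLite database - consistent copy via .backup, not a raw cp
  sqlite3 /data/database.sqlite ".backup '$dir/database.sqlite'"

  # 2. Proxy hosts and certificates via the NPMplus API
  curl -ksS -H "Authorization: Bearer $NPM_TOKEN" \
    "$NPM_URL/api/nginx/proxy-hosts" -o "$dir/proxy_hosts.json"
  curl -ksS -H "Authorization: Bearer $NPM_TOKEN" \
    "$NPM_URL/api/nginx/certificates" -o "$dir/certificates.json"

  # 3. Certificate files from disk, then compress the timestamped directory
  tar -czf "$dir/certbot-live.tar.gz" -C /data tls/certbot/live 2>/dev/null || true
  tar -czf "$dest/npmplus-backup-$stamp.tar.gz" -C "$dest" "npmplus-backup-$stamp"
  echo "$dest/npmplus-backup-$stamp.tar.gz"
}

# backup_npmplus   # run with NPM_URL/NPM_TOKEN exported from .env
```

The function prints the path of the finished archive, which makes it easy to chain into a verification step.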
Placeholders and TBD Values
2. Nginx Config Paths - TBD Values
Location: scripts/verify/verify-backend-vms.sh
Status: ✅ RESOLVED - Paths set in scripts/verify/verify-backend-vms.sh:
- VMID 10130: /etc/nginx/sites-available/dbis-frontend
- VMID 2400: /etc/nginx/sites-available/thirdweb-rpc
Required Actions (if paths differ on actual VMs):
- VMID 10130 (dbis-frontend):
  - Determine actual nginx config path
  - Common locations: /etc/nginx/sites-available/dbis-frontend or /etc/nginx/sites-available/dbis-admin
  - Update script with actual path
  - Verify config exists and is enabled
- VMID 2400 (thirdweb-rpc-1):
  - Determine actual nginx config path
  - Common locations: /etc/nginx/sites-available/thirdweb-rpc or /etc/nginx/sites-available/rpc
  - Update script with actual path
  - Verify config exists and is enabled
Impact: Until resolved, the script skips nginx config verification for these VMs.
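The "config exists and is enabled" check can be done with a small helper. This is a hedged sketch assuming the Debian-style sites-available/sites-enabled layout used above; NGINX_ROOT is an illustrative override added for testability.

```bash
# check_nginx_site: verify a site config exists, is enabled, and that nginx
# accepts the overall configuration. Run this on the target VM.
check_nginx_site() {
  # $1: site name, e.g. dbis-frontend
  local root="${NGINX_ROOT:-/etc/nginx}"
  local avail="$root/sites-available/$1" enabled="$root/sites-enabled/$1"
  [ -f "$avail" ]   || { echo "MISSING: $avail" >&2; return 1; }
  [ -e "$enabled" ] || { echo "NOT ENABLED: $enabled" >&2; return 1; }
  nginx -t >/dev/null 2>&1 || { echo "nginx -t failed (bad config syntax)" >&2; return 1; }
  echo "OK: $1"
}

# Usage on VMID 10130: check_nginx_site dbis-frontend
```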
3. Sankofa Cutover Plan - Target Placeholders
Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md
Placeholders to Replace (once Sankofa services are deployed):
- <TARGET_IP> (appears 10 times)
- <TARGET_PORT> (appears 10 times)
- ⚠️ TBD values in table (lines 60-64)
Domain-Specific Targets Needed:
| Domain | Current (Wrong) | Target (TBD) |
|---|---|---|
| sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| the-order.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
Action Required: Update placeholders with actual Sankofa service IPs and ports once deployed.
Documentation Placeholders
4. Generic Placeholders in Runbooks
Location: Multiple files
Replacements Needed:
INGRESS_VERIFICATION_RUNBOOK.md:
- Line 23: CLOUDFLARE_API_TOKEN="your-token" → should reference .env file
- Line 25: CLOUDFLARE_EMAIL="your-email" → should reference .env file
- Line 26: CLOUDFLARE_API_KEY="your-key" → should reference .env file
- Line 31: NPM_PASSWORD="your-password" → should reference .env file
- Lines 91, 101, 213: similar placeholders in examples
Note: These are intentional examples, but should be clearly marked as such and reference .env file usage.
NPMPLUS_BACKUP_RESTORE.md:
- Line 84: NPM_PASSWORD="your-password" → example placeholder (acceptable)
- Line 304: NPM_PASSWORD="your-password" → example placeholder (acceptable)
SANKOFA_CUTOVER_PLAN.md:
- Line 125: NPM_PASSWORD="your-password" → example placeholder (acceptable)
- Line 178: NPM_PASSWORD="your-password" → example placeholder (acceptable)
Status (2026-03-02): Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and production should source from .env.
5. Source of Truth JSON - Verifier Field
Location: docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json (line 5)
Current: "verifier": "operator-name"
Expected: Should be dynamically set by script using $USER or actual operator name.
Status: ✅ HANDLED - The generate-source-of-truth.sh script uses env.USER // "unknown" which is correct. The example JSON file is just a template.
Action Required: None - script implementation is correct.
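The fallback behaviour of the jq expression named above can be exercised directly; wrapping it in an object here is illustrative (the script's surrounding template is larger).

```bash
# env.USER reads the environment inside jq; // "unknown" supplies the fallback
# when USER is unset, which is exactly why the template value is harmless.
jq -n '{verifier: (env.USER // "unknown")}'
```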
Implementation Gaps
6. Source of Truth Generation - File Path Dependencies
Location: scripts/verify/generate-source-of-truth.sh
Potential Issues:
- Script expects specific output file names from verification scripts
- If verification scripts don't run first, JSON will be empty or have defaults
- No validation that source files exist before parsing
Expected File Dependencies:
- $EVIDENCE_DIR/dns-verification-*/all_dns_records.json
- $EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
- $EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
- $EVIDENCE_DIR/npmplus-verification-*/certificates.json
- $EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
- $EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json
Action Required:
- Add file existence checks before parsing
- Provide clear error messages if dependencies are missing
- Add option to generate partial source-of-truth if some verifications haven't run
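A guard along the lines requested could look like this. It is a hypothetical sketch (require_evidence is not an existing function in the script): resolve the expected evidence glob, and fail with a clear message instead of silently emitting defaults.

```bash
# require_evidence: resolve one expected evidence file under $EVIDENCE_DIR,
# failing loudly if the corresponding verification script has not run yet.
require_evidence() {
  # $1: glob relative to $EVIDENCE_DIR, e.g. 'dns-verification-*/all_dns_records.json'
  local matches
  matches=( "$EVIDENCE_DIR"/$1 )   # glob expands here (intentionally unquoted)
  if [ ! -f "${matches[0]}" ]; then
    echo "MISSING evidence: \$EVIDENCE_DIR/$1 - run the corresponding verification script first" >&2
    return 1
  fi
  printf '%s\n' "${matches[0]}"
}

# Usage: dns_json="$(require_evidence 'dns-verification-*/all_dns_records.json')" || exit 1
```

A partial-generation mode would call this per source and substitute nulls, rather than exit, when a file is missing.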
7. Backend VM Verification - Service-Specific Checks
Location: scripts/verify/verify-backend-vms.sh
Gaps Identified:
- Besu RPC VMs (2101, 2201):
  - Script checks for RPC endpoints but doesn't verify Besu-specific health checks
  - Should test actual RPC calls (e.g., eth_chainId), not just HTTP status
  - WebSocket port (8546) verification is minimal
- Node.js API VMs (10150, 10151):
  - Only checks that port 3000 is listening
  - Doesn't verify an API health endpoint exists
  - Should test an actual API endpoint (e.g., /health or /api/health)
- Blockscout VM (5000):
  - Checks nginx on port 80 and Blockscout on port 4000
  - Should verify the Blockscout API is responding (e.g., /api/health)
Action Required:
- Add service-specific health check functions
- Implement actual RPC/API endpoint testing beyond port checks
- Document expected health check endpoints per service type
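For the Besu VMs, a service-specific check might look like the following sketch; the function name is illustrative, and sed is used instead of jq only to keep it dependency-light.

```bash
# check_besu_rpc: confirm the node actually answers eth_chainId over JSON-RPC,
# not merely that the HTTP port is open.
check_besu_rpc() {
  # $1: host, $2: HTTP RPC port (8545)
  local chain_id
  chain_id=$(curl -fsS --max-time 5 -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
    "http://$1:$2" | sed -n 's/.*"result" *: *"\([^"]*\)".*/\1/p')
  if [ -z "$chain_id" ]; then
    echo "FAIL: no eth_chainId reply from $1:$2" >&2
    return 1
  fi
  echo "OK: $1:$2 chainId=$chain_id"
}

# Usage: check_besu_rpc <vm-ip> 8545
```

The same shape works for the Node.js and Blockscout checks by swapping the request for a GET against the documented health endpoint.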
8. End-to-End Routing - WebSocket Testing
Location: scripts/verify/verify-end-to-end-routing.sh
Current Implementation:
- Basic WebSocket connectivity check using a TCP connection test
- Manual wscat test recommended but not automated
- No actual WebSocket handshake or message exchange verification
Gap:
- WebSocket tests are minimal (just a TCP connection)
- No verification that the WebSocket protocol upgrade works correctly
- No test of actual RPC WebSocket messages
Action Required:
- Add an automated WebSocket handshake test (if wscat is available)
- Or document clearly that WebSocket testing requires manual verification
- Consider adding an automated WebSocket test script if wscat or websocat is installed
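One automated option, sketched under the assumption that websocat is installed: exercise the real upgrade and one RPC exchange, and fall back to recommending manual wscat testing otherwise.

```bash
# check_ws_rpc: send eth_chainId over a real WebSocket connection; a reply
# containing "result" proves the protocol upgrade and RPC path both work.
check_ws_rpc() {
  # $1: host, $2: WebSocket port (8546)
  command -v websocat >/dev/null 2>&1 || {
    echo "SKIP: websocat not installed - verify ws://$1:$2 manually with wscat" >&2
    return 2
  }
  local reply
  reply=$(printf '%s\n' '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
    | websocat "ws://$1:$2" 2>/dev/null)
  case "$reply" in
    *'"result"'*) echo "OK: ws://$1:$2"; return 0 ;;
  esac
  echo "FAIL: ws://$1:$2 did not return an RPC result" >&2
  return 1
}
```

Returning 2 for "skipped" keeps the skip distinguishable from a genuine failure in summary reports.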
Configuration Gaps
9. Environment Variable Documentation
Missing: Comprehensive .env.example file listing all required variables
Required Variables (from scripts):
```bash
# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=

# Public IP
PUBLIC_IP=76.53.10.36

# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233

# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11
```
Action Required: Create .env.example file in project root with all required variables.
10. Script Dependencies Documentation
Missing: List of required system dependencies
Required Tools (used across scripts):
- bash (4.0+)
- curl (for API calls)
- jq (for JSON parsing)
- dig (for DNS resolution)
- openssl (for SSL certificate inspection)
- ssh (for remote execution)
- ss (for port checking)
- systemctl (for service status)
- sqlite3 (for database backup)
Optional Tools:
- wscat or websocat (for WebSocket testing)
Action Required:
- Add dependencies section to INGRESS_VERIFICATION_RUNBOOK.md
- Create scripts/verify/README.md with installation instructions
- Add dependency check function to run-full-verification.sh
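A dependency check suitable for the top of run-full-verification.sh can be as small as this; the function name is illustrative and the tool list mirrors the table above.

```bash
# check_dependencies: report every missing tool (not just the first) and
# return non-zero if any are absent, so the caller can abort early.
check_dependencies() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# check_dependencies bash curl jq dig openssl ssh ss systemctl sqlite3 || exit 1
```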
Data Completeness Gaps
11. Source of Truth JSON - Hardcoded Values
Location: scripts/verify/generate-source-of-truth.sh (lines 169-177)
Current: NPMplus container info is hardcoded:
```json
"container": {
  "vmid": 10233,
  "host": "r630-01",
  "host_ip": "192.168.11.11",
  "internal_ips": {
    "eth0": "192.168.11.166",
    "eth1": "192.168.11.167"
  },
  "management_ui": "https://192.168.11.166:81",
  "status": "running"
}
```
Gap: Status should be dynamically determined from verification results.
Action Required:
- Make container status dynamic based on export-npmplus-config.sh results
- Verify IP addresses are correct (especially eth1)
- Document whether eth1 is actually used or is a placeholder
12. DNS Verification - Zone ID Lookup
Location: scripts/verify/export-cloudflare-dns-records.sh
Current: Attempts to fetch zone IDs if not provided in .env, but has fallback to empty string.
Potential Issue: If zone ID lookup fails and .env doesn't have zone IDs, script will fail silently or skip zones.
Action Required:
- Add validation that zone IDs are set (either from .env or from API lookup)
- Fail clearly if a zone ID cannot be determined
- Provide a helpful error message with instructions
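That validation could take roughly this shape: prefer the .env value, fall back to the Cloudflare zones API, and fail loudly rather than silently skipping the zone. The function name is illustrative.

```bash
# require_zone_id: return a zone ID from .env or the Cloudflare API, or fail
# with an actionable message instead of an empty-string fallback.
require_zone_id() {
  # $1: zone name, $2: zone ID from .env (may be empty)
  local zone_id="$2"
  if [ -z "$zone_id" ]; then
    zone_id=$(curl -fsS -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN:-}" \
      "https://api.cloudflare.com/client/v4/zones?name=$1" \
      | jq -r '.result[0].id // empty')
  fi
  if [ -z "$zone_id" ]; then
    echo "ERROR: zone ID for $1 not found; set it in .env or check CLOUDFLARE_API_TOKEN" >&2
    return 1
  fi
  printf '%s\n' "$zone_id"
}

# Usage: zid="$(require_zone_id d-bis.org "${CLOUDFLARE_ZONE_ID_D_BIS_ORG:-}")" || exit 1
```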
Documentation Completeness
13. Missing Troubleshooting Sections
Location: docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md
Current: Basic troubleshooting section exists (lines 427-468) but could be expanded.
Missing Topics:
- What to do if verification scripts fail partially
- How to interpret "unknown" status vs "needs-fix" status
- How to manually verify items that scripts can't automate
- Common Cloudflare API errors and solutions
- Common NPMplus API authentication issues
- SSH connection failures to Proxmox hosts
Action Required: Expand troubleshooting section with more scenarios.
14. Missing Rollback Procedures
Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md
Current: Basic rollback steps exist (lines 330-342) but could be more detailed.
Missing:
- Automated rollback script reference
- Exact commands to restore previous NPMplus configuration
- How to verify rollback was successful
- Recovery time expectations
Action Required:
- Create scripts/verify/rollback-sankofa-routing.sh (optional but recommended)
- Or expand manual rollback steps with exact API calls
Priority Summary
🔴 Critical (Must Fix Before Production Use)
- ✅ Create scripts/verify/backup-npmplus.sh (was referenced but missing)
- ✅ Resolve TBD nginx config paths (VMID 10130, 2400) (was blocking verification)
- ✅ Add file dependency validation in generate-source-of-truth.sh
🟡 Important (Should Fix Soon)
- Add .env.example file with all required variables
- Add dependency checks to verification scripts
- Expand service-specific health checks for Besu, Node.js, Blockscout
- Document WebSocket testing limitations or automate it
🟢 Nice to Have (Can Wait)
- Expand troubleshooting section with more scenarios
- Create rollback script for Sankofa cutover
- Add dependency installation guide to runbook
- Make container status dynamic in source-of-truth generation
Notes
- Placeholders in examples: Most "your-password", "your-token" placeholders in documentation are intentional examples and acceptable, but should clearly reference .env file usage.
- Sankofa placeholders: <TARGET_IP> and <TARGET_PORT> are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
- TBD config paths: These need to be discovered by running verification and inspecting actual VMs.
Additional Items Completed
15. NPMplus High Availability (HA) Setup Guide ✅ ADDED
Status: ✅ DOCUMENTATION COMPLETE - Implementation pending
Location: docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md
What Was Added:
- Complete HA architecture guide (Active-Passive with Keepalived)
- Step-by-step implementation instructions (6 phases)
- Helper scripts: sync-certificates.sh, monitor-ha-status.sh
- Testing and validation procedures
- Troubleshooting guide
- Rollback plan
- Future upgrade path to Active-Active
Scripts Created:
- scripts/npmplus/sync-certificates.sh - Synchronize certificates from primary to secondary
- scripts/npmplus/monitor-ha-status.sh - Monitor HA status and send alerts
Impact: Eliminates single point of failure for NPMplus, enables automatic failover.
NPMplus HA Implementation Tasks
Phase 1: Prepare Secondary NPMplus Instance
Task 1.1: Create Secondary NPMplus Container
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes
Actions Required:
- Download Alpine 3.22 template on r630-02
- Create container VMID 10234 with:
  - Hostname: npmplus-secondary
  - IP: 192.168.11.167/24
  - Memory: 1024 MB
  - Cores: 2
  - Disk: 5 GB
  - Features: nesting=1, unprivileged=1
- Start container and verify it's running
- Document container creation in deployment log
Commands:
```bash
# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"
pveam download local alpine-3.22-default_20241208_amd64.tar.xz
pct create $CTID \
  local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
  --hostname $HOSTNAME \
  --memory 1024 \
  --cores 2 \
  --rootfs local-lvm:5 \
  --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
  --unprivileged 1 \
  --features nesting=1
pct start $CTID
```
Task 1.2: Install NPMplus on Secondary Instance
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes
Actions Required:
- SSH to r630-02 and enter the container
- Install dependencies: tzdata, gawk, yq, docker, docker-compose, curl, bash, rsync
- Start and enable the Docker service
- Download the NPMplus compose.yaml from GitHub
- Configure timezone: America/New_York
- Configure ACME email: nsatoshi2007@hotmail.com
- Start the NPMplus container (but don't configure it yet - will sync first)
- Wait for NPMplus to be healthy
- Retrieve the admin password and document it
Commands:
```bash
ssh root@192.168.11.12
pct exec 10234 -- ash
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync
rc-service docker start
rc-update add docker default
sleep 5
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"
yq -i "
  .services.npmplus.environment |=
  (map(select(. != \"TZ=*\" and . != \"ACME_EMAIL=*\")) +
  [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml
docker compose up -d
```
Task 1.3: Configure Secondary Container Network
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes
Actions Required:
- Verify static IP assignment: 192.168.11.167
- Verify gateway: 192.168.11.1
- Test network connectivity to primary host
- Test network connectivity to backend VMs
- Document network configuration
Commands:
```bash
pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166
```
Phase 2: Set Up Certificate Synchronization
Task 2.1: Create Certificate Sync Script
Status: ✅ COMPLETED
Location: scripts/npmplus/sync-certificates.sh
Note: Script already created, needs testing
Actions Required:
- Test certificate sync script manually
- Verify certificates sync correctly
- Verify script handles errors gracefully
- Document certificate paths for both primary and secondary
Task 2.2: Set Up Automated Certificate Sync
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes
Actions Required:
- Add cron job on primary Proxmox host (r630-01)
- Configure it to run every 5 minutes
- Set up log rotation for /var/log/npmplus-cert-sync.log
- Test cron job execution
- Monitor logs for successful syncs
- Verify certificate count matches between primary and secondary
Commands:
```bash
# On r630-01
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1

# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
```
Phase 3: Set Up Keepalived for Virtual IP
Task 3.1: Install Keepalived on Proxmox Hosts
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes
Actions Required:
- Install Keepalived on r630-01 (primary)
- Install Keepalived on r630-02 (secondary)
- Verify Keepalived installation
- Check firewall rules for VRRP (multicast 224.0.0.0/8)
Commands:
```bash
# On both hosts
apt update
apt install -y keepalived

# Verify installation
keepalived --version
```
Task 3.2: Configure Keepalived on Primary Host (r630-01)
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes
Actions Required:
- Create /etc/keepalived/keepalived.conf with MASTER configuration
- Set virtual_router_id: 51
- Set priority: 110
- Configure auth_pass (use a secure password)
- Configure virtual_ipaddress: 192.168.11.166/24
- Reference the health check script path
- Reference the notification script path
- Verify configuration syntax
- Document the Keepalived configuration
Files to Create:
- /etc/keepalived/keepalived.conf (see HA guide for full config)
- /usr/local/bin/check-npmplus-health.sh (Task 3.4)
- /usr/local/bin/keepalived-notify.sh (Task 3.5)
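A MASTER configuration consistent with the parameters above can be staged with a heredoc. This is a hedged sketch, not the HA guide's authoritative config: CHANGE_ME is a placeholder, the interface vmbr0 matches the VIP check used in Task 3.6, and the output file defaults to a local staging name so nothing under /etc is touched until you copy it.

```bash
# Stage a keepalived.conf for the primary (r630-01); copy the result to
# /etc/keepalived/keepalived.conf after review.
conf="${KEEPALIVED_CONF:-keepalived.conf.master}"
cat > "$conf" <<'EOF'
vrrp_script check_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
}

vrrp_instance NPMPLUS_VIP {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        check_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
EOF
echo "staged: $conf"
```

The secondary (Task 3.3) differs only in `state BACKUP` and `priority 100`; virtual_router_id and auth_pass must match.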
Task 3.3: Configure Keepalived on Secondary Host (r630-02)
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes
Actions Required:
- Create /etc/keepalived/keepalived.conf with BACKUP configuration
- Set virtual_router_id: 51 (must match primary)
- Set priority: 100 (lower than primary)
- Configure auth_pass (must match primary)
- Configure virtual_ipaddress: 192.168.11.166/24
- Reference the health check script path
- Reference the notification script path
- Verify configuration syntax
- Document the Keepalived configuration
Files to Create:
- /etc/keepalived/keepalived.conf (see HA guide for full config)
- /usr/local/bin/check-npmplus-health.sh (Task 3.4)
- /usr/local/bin/keepalived-notify.sh (Task 3.5)
Task 3.4: Create Health Check Script
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes
Actions Required:
- Create /usr/local/bin/check-npmplus-health.sh on both hosts
- Script should:
  - Detect hostname to determine which VMID to check
  - Check if the container is running
  - Check if the NPMplus Docker container is healthy
  - Check if the NPMplus web interface responds (port 81)
  - Return exit code 0 if healthy, 1 if unhealthy
- Make script executable: chmod +x
- Test script manually on both hosts
- Verify script detects failures correctly
File: /usr/local/bin/check-npmplus-health.sh
Details: See HA guide for full script content
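The checklist above maps onto a script of roughly this shape. It is a sketch under two assumptions: the VMIDs follow this plan (10233 primary, 10234 secondary), and the Docker container is named "npmplus" - confirm both against the HA guide before deploying.

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/check-npmplus-health.sh: exit 0 = healthy,
# non-zero = Keepalived's track_script triggers failover.
check_npmplus_health() {
  local vmid
  case "$(hostname)" in
    r630-01) vmid=10233 ;;
    r630-02) vmid=10234 ;;
    *) return 1 ;;
  esac
  # 1. LXC container running?
  pct status "$vmid" 2>/dev/null | grep -q running || return 1
  # 2. NPMplus Docker container healthy? ("npmplus" name is an assumption)
  pct exec "$vmid" -- docker inspect --format '{{.State.Health.Status}}' npmplus 2>/dev/null \
    | grep -q healthy || return 1
  # 3. Web UI answering on port 81?
  pct exec "$vmid" -- curl -kfsS -o /dev/null --max-time 5 https://localhost:81 || return 1
  return 0
}

# check_npmplus_health   # invoked by Keepalived's vrrp_script block
```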
Task 3.5: Create Keepalived Notification Script
Status: ⏳ PENDING
Priority: 🟡 Important
Estimated Time: 15 minutes
Actions Required:
- Create /usr/local/bin/keepalived-notify.sh on both hosts
- Script should handle states: master, backup, fault
- Log state changes to /var/log/keepalived-notify.log
- Optional: Send alerts (email, webhook) on fault state
- Make script executable: chmod +x
- Test script with each state manually
File: /usr/local/bin/keepalived-notify.sh
Details: See HA guide for full script content
Task 3.6: Start and Enable Keepalived
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes
Actions Required:
- Enable Keepalived service on both hosts
- Start Keepalived on both hosts
- Verify Keepalived is running
- Verify primary host owns VIP (192.168.11.166)
- Verify secondary host is in BACKUP state
- Monitor Keepalived logs for any errors
- Document VIP ownership verification
Commands:
```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived

# Check VIP ownership (should be on primary)
ip addr show vmbr0 | grep 192.168.11.166

# Check logs
journalctl -u keepalived -f
```
Phase 4: Sync Configuration to Secondary
Task 4.1: Export Primary Configuration
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes
Actions Required:
- Create export script: scripts/npmplus/export-primary-config.sh
- Export NPMplus SQLite database to SQL dump
- Export proxy hosts via API (JSON)
- Export certificates via API (JSON)
- Create timestamped backup directory
- Verify all exports completed successfully
- Document backup location and contents
Script to Create: scripts/npmplus/export-primary-config.sh
Details: See HA guide for full script content
Task 4.2: Import Configuration to Secondary
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes
Actions Required:
- Create import script: scripts/npmplus/import-secondary-config.sh
- Stop NPMplus container on secondary (if running)
- Copy database SQL dump to secondary
- Import database dump into secondary NPMplus
- Restart NPMplus container on secondary
- Wait for NPMplus to be healthy
- Verify proxy hosts are configured
- Verify certificates are accessible
- Document any manual configuration steps needed
Script to Create: scripts/npmplus/import-secondary-config.sh
Details: See HA guide for full script content
Note: Some configuration may need manual replication via API or UI.
Phase 5: Set Up Ongoing Configuration Sync
Task 5.1: Create Configuration Sync Script
Status: ⏳ PENDING
Priority: 🟡 Important
Estimated Time: 45 minutes
Actions Required:
- Create sync script: scripts/npmplus/sync-config.sh
- Authenticate to NPMplus API (primary)
- Export proxy hosts configuration
- Implement API-based sync or document manual sync process
- Add script to automation (if automated sync is possible)
- Document manual sync procedures for configuration changes
Script to Create: scripts/npmplus/sync-config.sh
Note: Full automated sync requires shared database or complex API sync. For now, manual sync may be required.
Phase 6: Testing and Validation
Task 6.1: Test Virtual IP Failover
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes
Actions Required:
- Verify primary owns VIP before test
- Simulate primary failure (stop Keepalived or NPMplus container)
- Verify VIP moves to secondary within 5-10 seconds
- Test connectivity to VIP from external source
- Restore primary and verify failback
- Document failover time (should be < 10 seconds)
- Test multiple failover scenarios
- Document test results
Test Scenarios:
- Stop Keepalived on primary
- Stop NPMplus container on primary
- Stop entire Proxmox host (if possible in test environment)
- Network partition (if possible in test environment)
Task 6.2: Test Certificate Access
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes
Actions Required:
- Verify certificates exist on secondary (after sync)
- Test SSL endpoint from external: curl -vI https://explorer.d-bis.org
- Verify certificate is valid and trusted
- Test multiple domains with SSL
- Verify certificate expiration dates match
- Test certificate auto-renewal on secondary (when primary renews)
- Document certificate test results
Commands:
```bash
# Verify certificates on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org
```
Task 6.3: Test Proxy Host Functionality
Status: ⏳ PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes
Actions Required:
- Test each domain from external after failover
- Verify HTTP to HTTPS redirects work
- Verify WebSocket connections work (for RPC endpoints)
- Verify API endpoints respond correctly
- Test all 19+ domains
- Document any domains that don't work correctly
- Test with secondary as active instance
- Test failback to primary
Test Domains:
- All d-bis.org domains (9 domains)
- All mim4u.org domains (4 domains)
- All sankofa.nexus domains (5 domains)
- defi-oracle.io domain (1 domain)
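A per-domain check could automate the list above. This is a sketch: the function name is illustrative, and it assumes each domain should answer over HTTPS and redirect plain HTTP to HTTPS, as described in the actions for this task.

```bash
# check_domain: report the HTTPS status code and warn if the HTTP->HTTPS
# redirect is missing; run once per domain, before and after failover.
check_domain() {
  # $1: fully qualified domain name
  local code redirect
  code=$(curl -ksS -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/") || return 1
  redirect=$(curl -sS -o /dev/null -w '%{redirect_url}' --max-time 10 "http://$1/") || return 1
  case "$redirect" in
    https://*) ;;
    *) echo "WARN: $1 does not redirect HTTP to HTTPS" >&2 ;;
  esac
  echo "$1 https_status=$code"
}

# for d in explorer.d-bis.org mim4u.org sankofa.nexus; do check_domain "$d"; done
```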
Monitoring and Maintenance
Task 7.1: Set Up HA Status Monitoring
Status: ✅ COMPLETED (script created, needs deployment)
Priority: 🟡 Important
Location: scripts/npmplus/monitor-ha-status.sh
Actions Required:
- Add cron job for HA status monitoring (every 5 minutes)
- Configure log rotation for /var/log/npmplus-ha-monitor.log
- Test monitoring script manually
- Optional: Integrate with alerting system (email, webhook)
- Document alert thresholds and escalation procedures
- Test alert generation
Commands:
```bash
# On primary Proxmox host
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```
Task 7.2: Document Manual Failover Procedures
Status: ⏳ PENDING
Priority: 🟡 Important
Estimated Time: 30 minutes
Actions Required:
- Document step-by-step manual failover procedure
- Document how to force failover to secondary
- Document how to force failback to primary
- Document troubleshooting steps for common issues
- Create runbook for operations team
- Test manual failover procedures
- Review and approve documentation
Location: Add to docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md troubleshooting section
Task 7.3: Test All Failover Scenarios
Status: ⏳ PENDING
Priority: 🟡 Important
Estimated Time: 2 hours
Actions Required:
- Test automatic failover (primary failure)
- Test automatic failback (primary recovery)
- Test manual failover (force to secondary)
- Test manual failback (force to primary)
- Test partial failure (Keepalived down but NPMplus up)
- Test network partition scenarios
- Test during high traffic (if possible)
- Document all test results
- Identify and fix any issues found
HA Implementation Summary
Total Estimated Time
- Phase 1: 1.5 hours (container creation and NPMplus installation)
- Phase 2: 30 minutes (certificate sync setup)
- Phase 3: 2 hours (Keepalived configuration and scripts)
- Phase 4: 1.5 hours (configuration export/import)
- Phase 5: 45 minutes (ongoing sync setup)
- Phase 6: 2 hours (testing and validation)
- Monitoring: 1 hour (monitoring setup and documentation)
Total: ~9 hours of implementation time
Prerequisites Checklist
- Secondary Proxmox host available (r630-02 or ml110)
- Network connectivity between hosts verified
- Sufficient resources on secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
- SSH access configured between hosts (key-based auth recommended)
- Maintenance window scheduled
- Backup of primary NPMplus completed
- Team notified of maintenance window
Risk Mitigation
- Rollback plan documented and tested
- Primary NPMplus backup verified before changes
- Test environment available (if possible)
- Monitoring in place before production deployment
- Emergency contact list available
Last Updated: 2026-01-20
Next Review: After addressing critical items