Verification Scripts and Documentation - Gaps and TODOs

Last Updated: 2026-03-02
Document Version: 1.0
Status: Active Documentation


Date: 2026-01-20
Status: Gap Analysis Complete
Purpose: Identify all placeholders, missing components, and incomplete implementations

Documentation note (2026-03-02): Runbook placeholders (e.g. your-token, your-password) are intentional examples. In production, use values sourced from .env only; never commit secrets. INGRESS_VERIFICATION_RUNBOOK.md has been updated with a production note in its Prerequisites section. The other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep their example placeholders; operators should source real values from .env when running commands.


Critical Missing Components

1. Missing Script: scripts/verify/backup-npmplus.sh

Status: CREATED (scripts/verify/backup-npmplus.sh)
Referenced in:

  • docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md (lines 39, 150, 437, 480)

Required Functionality:

  • Automated backup of NPMplus database (/data/database.sqlite)
  • Export of proxy hosts via API
  • Export of certificates via API
  • Certificate file backup from disk
  • Compression and timestamping
  • Configurable backup destination

Action Required: None - the script exists; verify it covers all backup procedures documented in NPMPLUS_BACKUP_RESTORE.md and keep the two in sync.
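
The required functionality above can be sketched roughly as follows. This is a hedged outline, not the committed implementation: the docker container name (npmplus), the API routes (borrowed from upstream Nginx Proxy Manager conventions), and the NPM_TOKEN / BACKUP_DEST variables are assumptions to confirm against the real script.

```shell
#!/usr/bin/env bash
# Sketch of scripts/verify/backup-npmplus.sh -- assumptions: docker container
# is named "npmplus"; NPM_URL and NPM_TOKEN come from .env.
set -u

backup_name() {
    # Timestamped archive name, e.g. npmplus-backup-20260302-113700.tar.gz
    printf 'npmplus-backup-%s.tar.gz' "$(date +%Y%m%d-%H%M%S)"
}

backup_npmplus() {
    local dest="${BACKUP_DEST:-/var/backups/npmplus}"
    local work; work="$(mktemp -d)"
    mkdir -p "$dest"

    # 1. Consistent SQLite snapshot of /data/database.sqlite
    docker exec npmplus sqlite3 /data/database.sqlite ".backup /data/database.sqlite.bak"
    docker cp npmplus:/data/database.sqlite.bak "$work/database.sqlite"

    # 2-3. Export proxy hosts and certificates via the API
    curl -ks -H "Authorization: Bearer $NPM_TOKEN" \
        "$NPM_URL/api/nginx/proxy-hosts" > "$work/proxy_hosts.json"
    curl -ks -H "Authorization: Bearer $NPM_TOKEN" \
        "$NPM_URL/api/nginx/certificates" > "$work/certificates.json"

    # 4. Certificate files from disk
    docker cp npmplus:/data/tls "$work/tls"

    # 5-6. Compress with timestamp into the configurable destination
    tar -czf "$dest/$(backup_name)" -C "$work" .
    rm -rf "$work"
}
# Run: BACKUP_DEST=/var/backups/npmplus backup_npmplus
```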


Placeholders and TBD Values

2. Nginx Config Paths - TBD Values

Location: scripts/verify/verify-backend-vms.sh

Status: RESOLVED - Paths set in scripts/verify/verify-backend-vms.sh:

  • VMID 10130: /etc/nginx/sites-available/dbis-frontend
  • VMID 2400: /etc/nginx/sites-available/thirdweb-rpc

Required Actions (if paths differ on actual VMs):

  1. VMID 10130 (dbis-frontend):

    • Determine actual nginx config path
    • Common locations: /etc/nginx/sites-available/dbis-frontend or /etc/nginx/sites-available/dbis-admin
    • Update script with actual path
    • Verify config exists and is enabled
  2. VMID 2400 (thirdweb-rpc-1):

    • Determine actual nginx config path
    • Common locations: /etc/nginx/sites-available/thirdweb-rpc or /etc/nginx/sites-available/rpc
    • Update script with actual path
    • Verify config exists and is enabled

Impact: If these paths are wrong on the actual VMs, the script will skip nginx config verification for them until the paths are corrected.
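
Discovering the actual path can be automated with a small helper like the one below. This is a hypothetical addition to verify-backend-vms.sh, not existing code; the optional root prefix exists so the function can be exercised locally (on a real VM, call it with an empty root or run it over ssh).

```shell
# Pick the first nginx config candidate that actually exists under $root.
find_nginx_config() {
    local root="$1"; shift
    local name
    for name in "$@"; do
        if [ -f "$root/etc/nginx/sites-available/$name" ]; then
            printf '/etc/nginx/sites-available/%s\n' "$name"
            return 0
        fi
    done
    echo "no candidate nginx config found" >&2
    return 1
}
# On the VM itself: find_nginx_config "" dbis-frontend dbis-admin
```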


3. Sankofa Cutover Plan - Target Placeholders

Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md

Placeholders to Replace (once Sankofa services are deployed):

  • <TARGET_IP> (appears 10 times)
  • <TARGET_PORT> (appears 10 times)
  • ⚠️ TBD values in table (lines 60-64)

Domain-Specific Targets Needed:

| Domain | Current (Wrong) | Target (TBD) |
|---|---|---|
| sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| www.phoenix.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |
| the-order.sankofa.nexus | 192.168.11.140:80 | <TARGET_IP>:<TARGET_PORT> |

Action Required: Update placeholders with actual Sankofa service IPs and ports once deployed.


Documentation Placeholders

4. Generic Placeholders in Runbooks

Location: Multiple files

Replacements Needed:

INGRESS_VERIFICATION_RUNBOOK.md:

  • Line 23: CLOUDFLARE_API_TOKEN="your-token" → Should reference .env file
  • Line 25: CLOUDFLARE_EMAIL="your-email" → Should reference .env file
  • Line 26: CLOUDFLARE_API_KEY="your-key" → Should reference .env file
  • Line 31: NPM_PASSWORD="your-password" → Should reference .env file
  • Lines 91, 101, 213: Similar placeholders in examples

Note: These are intentional examples, but should be clearly marked as such and reference .env file usage.

NPMPLUS_BACKUP_RESTORE.md:

  • Line 84: NPM_PASSWORD="your-password" → Example placeholder (acceptable)
  • Line 304: NPM_PASSWORD="your-password" → Example placeholder (acceptable)

SANKOFA_CUTOVER_PLAN.md:

  • Line 125: NPM_PASSWORD="your-password" → Example placeholder (acceptable)
  • Line 178: NPM_PASSWORD="your-password" → Example placeholder (acceptable)

Status (2026-03-02): Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and production should source from .env.


5. Source of Truth JSON - Verifier Field

Location: docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json (line 5)

Current: "verifier": "operator-name"

Expected: Should be dynamically set by script using $USER or actual operator name.

Status: HANDLED - The generate-source-of-truth.sh script uses env.USER // "unknown" which is correct. The example JSON file is just a template.

Action Required: None - script implementation is correct.


Implementation Gaps

6. Source of Truth Generation - File Path Dependencies

Location: scripts/verify/generate-source-of-truth.sh

Potential Issues:

  • Script expects specific output file names from verification scripts
  • If verification scripts don't run first, JSON will be empty or have defaults
  • No validation that source files exist before parsing

Expected File Dependencies:

$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json

Action Required:

  • Add file existence checks before parsing
  • Provide clear error messages if dependencies are missing
  • Add option to generate partial source-of-truth if some verifications haven't run
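
The file-existence gate described above could be sketched as a small bash function (compgen handles the glob patterns); this is a suggestion, not the current script's code.

```shell
# Refuse to build the source-of-truth JSON when expected evidence is missing.
require_evidence() {
    local missing=0 pattern
    for pattern in "$@"; do
        if ! compgen -G "$pattern" > /dev/null; then
            echo "ERROR: no evidence matching '$pattern' -- run the corresponding verification script first" >&2
            missing=1
        fi
    done
    return "$missing"
}
# Example gate before parsing:
# require_evidence "$EVIDENCE_DIR"/dns-verification-*/all_dns_records.json \
#                  "$EVIDENCE_DIR"/npmplus-verification-*/proxy_hosts.json || exit 1
```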

7. Backend VM Verification - Service-Specific Checks

Location: scripts/verify/verify-backend-vms.sh

Gaps Identified:

  1. Besu RPC VMs (2101, 2201):

    • Script checks for RPC endpoints but doesn't verify Besu-specific health checks
    • Should test actual RPC calls (e.g., eth_chainId) not just HTTP status
    • WebSocket port (8546) verification is minimal
  2. Node.js API VMs (10150, 10151):

    • Only checks port 3000 is listening
    • Doesn't verify API health endpoint exists
    • Should test actual API endpoint (e.g., /health or /api/health)
  3. Blockscout VM (5000):

    • Checks nginx on port 80 and Blockscout on port 4000
    • Should verify Blockscout API is responding (e.g., /api/health)

Action Required:

  • Add service-specific health check functions
  • Implement actual RPC/API endpoint testing beyond port checks
  • Document expected health check endpoints per service type
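
The service-specific checks could take roughly this shape. The /health path and the JSON-RPC response handling are assumptions to confirm per service; the point is to issue a real request rather than only testing the port.

```shell
# True if stdin is a JSON-RPC response carrying a hex result (e.g. from eth_chainId).
rpc_has_result() {
    grep -q '"result"[[:space:]]*:[[:space:]]*"0x'
}

# Besu probe: POST eth_chainId and inspect the result, not just the HTTP status.
check_besu_rpc() {
    local host="$1" port="${2:-8545}"
    curl -s --max-time 5 -X POST "http://$host:$port" \
        -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
        | rpc_has_result
}

# Node.js API probe: expect HTTP 200 from a health endpoint.
check_api_health() {
    local host="$1" port="${2:-3000}" path="${3:-/health}"
    [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$host:$port$path")" = "200" ]
}
```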

8. End-to-End Routing - WebSocket Testing

Location: scripts/verify/verify-end-to-end-routing.sh

Current Implementation:

  • Basic WebSocket connectivity check using TCP connection test
  • Manual wscat test recommended but not automated
  • No actual WebSocket handshake or message exchange verification

Gap:

  • WebSocket tests are minimal (just TCP connection)
  • No verification that WebSocket protocol upgrade works correctly
  • No test of actual RPC WebSocket messages

Action Required:

  • Add automated WebSocket handshake test (if wscat is available)
  • Or add clear documentation that WebSocket testing requires manual verification
  • Consider adding automated WebSocket test script if wscat or websocat is installed
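
If wscat/websocat are unavailable, the protocol-upgrade step alone can be probed with plain curl, as sketched below. This verifies only the HTTP 101 switch, not message exchange; the Sec-WebSocket-Key is a fixed sample nonce, which is acceptable for a handshake test.

```shell
is_switching_protocols() { [ "$1" = "101" ]; }

check_ws_upgrade() {
    local url="$1" code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
        -H 'Sec-WebSocket-Version: 13' \
        -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
        "$url")" || true
    is_switching_protocols "$code"
}
# e.g. check_ws_upgrade http://192.168.11.166:8546/ && echo "WS upgrade OK"
```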

Configuration Gaps

9. Environment Variable Documentation

Missing: Comprehensive .env.example file listing all required variables

Required Variables (from scripts):

# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=

# Public IP
PUBLIC_IP=76.53.10.36

# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233

# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11

Action Required: Create .env.example file in project root with all required variables.


10. Script Dependencies Documentation

Missing: List of required system dependencies

Required Tools (used across scripts):

  • bash (4.0+)
  • curl (for API calls)
  • jq (for JSON parsing)
  • dig (for DNS resolution)
  • openssl (for SSL certificate inspection)
  • ssh (for remote execution)
  • ss (for port checking)
  • systemctl (for service status)
  • sqlite3 (for database backup)

Optional Tools:

  • wscat or websocat (for WebSocket testing)

Action Required:

  • Add dependencies section to INGRESS_VERIFICATION_RUNBOOK.md
  • Create scripts/verify/README.md with installation instructions
  • Add dependency check function to run-full-verification.sh
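
The dependency check for run-full-verification.sh could be a small preflight function like this sketch (the Debian package hint in the error message is a suggestion):

```shell
# Fail fast if any required CLI tool is missing (bash).
require_tools() {
    local missing=() t
    for t in "$@"; do
        command -v "$t" > /dev/null 2>&1 || missing+=("$t")
    done
    if [ "${#missing[@]}" -gt 0 ]; then
        echo "ERROR: missing required tools: ${missing[*]}" >&2
        echo "Install them first (Debian: apt install curl jq dnsutils openssl iproute2 sqlite3)" >&2
        return 1
    fi
}
# Preflight with the tools listed above:
# require_tools bash curl jq dig openssl ssh ss systemctl sqlite3 || exit 1
```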

Data Completeness Gaps

11. Source of Truth JSON - Hardcoded Values

Location: scripts/verify/generate-source-of-truth.sh (lines 169-177)

Current: NPMplus container info is hardcoded:

"container": {
    "vmid": 10233,
    "host": "r630-01",
    "host_ip": "192.168.11.11",
    "internal_ips": {
        "eth0": "192.168.11.166",
        "eth1": "192.168.11.167"
    },
    "management_ui": "https://192.168.11.166:81",
    "status": "running"
}

Gap: Status should be dynamically determined from verification results.

Action Required:

  • Make container status dynamic based on export-npmplus-config.sh results
  • Verify IP addresses are correct (especially eth1)
  • Document if eth1 is actually used or is a placeholder
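
Making the status dynamic could look like the sketch below: query the Proxmox host for the container state and fall back to "unknown" on any failure (the ssh-as-root access pattern matches the commands used elsewhere in this document).

```shell
# Extract the state from `pct status <vmid>` output ("status: running").
parse_pct_status() {
    awk '/^status:/ { print $2; found = 1 } END { if (!found) print "unknown" }'
}

container_status() {
    # Query the Proxmox host over SSH; any failure yields "unknown".
    ssh -o ConnectTimeout=5 "root@$1" "pct status $2" 2>/dev/null | parse_pct_status
}
# e.g. STATUS="$(container_status 192.168.11.11 10233)"
```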

12. DNS Verification - Zone ID Lookup

Location: scripts/verify/export-cloudflare-dns-records.sh

Current: Attempts to fetch zone IDs if not provided in .env, but has fallback to empty string.

Potential Issue: If zone ID lookup fails and .env doesn't have zone IDs, script will fail silently or skip zones.

Action Required:

  • Add validation that zone IDs are set (either from .env or from API lookup)
  • Fail clearly if zone ID cannot be determined
  • Provide helpful error message with instructions
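
The zone ID validation could be as simple as the sketch below (bash indirect expansion); it fails loudly with a hint instead of silently skipping the zone.

```shell
# Fail with instructions when a zone ID variable is empty or unset.
require_zone_id() {
    local var="$1" value="${!1:-}"
    if [ -z "$value" ]; then
        echo "ERROR: $var is empty." >&2
        echo "Set it in .env, or check that CLOUDFLARE_API_TOKEN can list zones via the API." >&2
        return 1
    fi
}
# require_zone_id CLOUDFLARE_ZONE_ID_D_BIS_ORG || exit 1
```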

Documentation Completeness

13. Missing Troubleshooting Sections

Location: docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md

Current: Basic troubleshooting section exists (lines 427-468) but could be expanded.

Missing Topics:

  • What to do if verification scripts fail partially
  • How to interpret "unknown" status vs "needs-fix" status
  • How to manually verify items that scripts can't automate
  • Common Cloudflare API errors and solutions
  • Common NPMplus API authentication issues
  • SSH connection failures to Proxmox hosts

Action Required: Expand troubleshooting section with more scenarios.


14. Missing Rollback Procedures

Location: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md

Current: Basic rollback steps exist (lines 330-342) but could be more detailed.

Missing:

  • Automated rollback script reference
  • Exact commands to restore previous NPMplus configuration
  • How to verify rollback was successful
  • Recovery time expectations

Action Required:

  • Create scripts/verify/rollback-sankofa-routing.sh (optional but recommended)
  • Or expand manual rollback steps with exact API calls

Priority Summary

🔴 Critical (Must Fix Before Production Use)

  1. Create scripts/verify/backup-npmplus.sh - Referenced but missing
  2. Resolve TBD nginx config paths (VMID 10130, 2400) - Blocks verification
  3. Add file dependency validation in generate-source-of-truth.sh

🟡 Important (Should Fix Soon)

  1. Add .env.example file with all required variables
  2. Add dependency checks to verification scripts
  3. Expand service-specific health checks for Besu, Node.js, Blockscout
  4. Document WebSocket testing limitations or automate it

🟢 Nice to Have (Can Wait)

  1. Expand troubleshooting section with more scenarios
  2. Create rollback script for Sankofa cutover
  3. Add dependency installation guide to runbook
  4. Make container status dynamic in source-of-truth generation

Notes

  • Placeholders in examples: Most "your-password", "your-token" placeholders in documentation are intentional examples and acceptable, but should clearly reference .env file usage.
  • Sankofa placeholders: <TARGET_IP> and <TARGET_PORT> are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
  • TBD config paths: These need to be discovered by running verification and inspecting actual VMs.


Additional Items Completed

15. NPMplus High Availability (HA) Setup Guide ADDED

Status: DOCUMENTATION COMPLETE - Implementation pending
Location: docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md

What Was Added:

  • Complete HA architecture guide (Active-Passive with Keepalived)
  • Step-by-step implementation instructions (6 phases)
  • Helper scripts: sync-certificates.sh, monitor-ha-status.sh
  • Testing and validation procedures
  • Troubleshooting guide
  • Rollback plan
  • Future upgrade path to Active-Active

Scripts Created:

  • scripts/npmplus/sync-certificates.sh - Synchronize certificates from primary to secondary
  • scripts/npmplus/monitor-ha-status.sh - Monitor HA status and send alerts

Impact: Eliminates single point of failure for NPMplus, enables automatic failover.


NPMplus HA Implementation Tasks

Phase 1: Prepare Secondary NPMplus Instance

Task 1.1: Create Secondary NPMplus Container

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Download Alpine 3.22 template on r630-02
  • Create container VMID 10234 with:
    • Hostname: npmplus-secondary
    • IP: 192.168.11.167/24
    • Memory: 1024 MB
    • Cores: 2
    • Disk: 5 GB
    • Features: nesting=1, unprivileged=1
  • Start container and verify it's running
  • Document container creation in deployment log

Commands:

# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"

pveam download local alpine-3.22-default_20241208_amd64.tar.xz

pct create $CTID \
    local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
    --hostname $HOSTNAME \
    --memory 1024 \
    --cores 2 \
    --rootfs local-lvm:5 \
    --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
    --unprivileged 1 \
    --features nesting=1

pct start $CTID

Task 1.2: Install NPMplus on Secondary Instance

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • SSH to r630-02 and enter container
  • Install dependencies: tzdata, gawk, yq, docker, docker-compose, curl, bash, rsync
  • Start and enable Docker service
  • Download NPMplus compose.yaml from GitHub
  • Configure timezone: America/New_York
  • Configure ACME email: nsatoshi2007@hotmail.com
  • Start NPMplus container (but don't configure yet - will sync first)
  • Wait for NPMplus to be healthy
  • Retrieve admin password and document it

Commands:

ssh root@192.168.11.12
pct exec 10234 -- ash

apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

rc-service docker start
rc-update add docker default
sleep 5

cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

# Replace any existing TZ/ACME_EMAIL entries. Note: yq's != is a plain string
# comparison (globs like "TZ=*" never match); use a regex test instead.
yq -i "
  .services.npmplus.environment |=
    (map(select(test(\"^(TZ|ACME_EMAIL)=\") | not)) +
    [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

docker compose up -d

Task 1.3: Configure Secondary Container Network

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes

Actions Required:

  • Verify static IP assignment: 192.168.11.167
  • Verify gateway: 192.168.11.1
  • Test network connectivity to primary host
  • Test network connectivity to backend VMs
  • Document network configuration

Commands:

pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166

Phase 2: Set Up Certificate Synchronization

Task 2.1: Create Certificate Sync Script

Status: COMPLETED
Location: scripts/npmplus/sync-certificates.sh
Note: Script already created, needs testing

Actions Required:

  • Test certificate sync script manually
  • Verify certificates sync correctly
  • Verify script handles errors gracefully
  • Document certificate paths for both primary and secondary

Task 2.2: Set Up Automated Certificate Sync

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes

Actions Required:

  • Add cron job on primary Proxmox host (r630-01)
  • Configure to run every 5 minutes
  • Set up log rotation for /var/log/npmplus-cert-sync.log
  • Test cron job execution
  • Monitor logs for successful syncs
  • Verify certificate count matches between primary and secondary

Commands:

# On r630-01
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1

# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
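
For the log-rotation item above, a minimal /etc/logrotate.d/npmplus-cert-sync fragment could look like this (the weekly cadence and retention count are suggestions, not requirements):

```
/var/log/npmplus-cert-sync.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```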

Phase 3: Set Up Keepalived for Virtual IP

Task 3.1: Install Keepalived on Proxmox Hosts

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 10 minutes

Actions Required:

  • Install Keepalived on r630-01 (primary)
  • Install Keepalived on r630-02 (secondary)
  • Verify Keepalived installation
  • Check firewall rules allow VRRP (IP protocol 112, multicast 224.0.0.18)

Commands:

# On both hosts
apt update
apt install -y keepalived

# Verify installation
keepalived --version

Task 3.2: Configure Keepalived on Primary Host (r630-01)

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes

Actions Required:

  • Create /etc/keepalived/keepalived.conf with MASTER configuration
  • Set virtual_router_id: 51
  • Set priority: 110
  • Configure auth_pass (use secure password)
  • Configure virtual_ipaddress: 192.168.11.166/24
  • Reference health check script path
  • Reference notification script path
  • Verify configuration syntax
  • Document Keepalived configuration

Files to Create:

  • /etc/keepalived/keepalived.conf (see HA guide for full config)
  • /usr/local/bin/check-npmplus-health.sh (Task 3.4)
  • /usr/local/bin/keepalived-notify.sh (Task 3.5)
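
A sketch of what the MASTER-side keepalived.conf could look like, using the values from this task (interface vmbr0 matches the VIP check commands later in this document; auth_pass and the tracking weights are placeholders to adjust):

```
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
    weight -30
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
```

With weight -30, a failed health check drops the effective priority to 80, below the secondary's 100, which triggers failover; the BACKUP config (Task 3.3) differs only in state and priority.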

Task 3.3: Configure Keepalived on Secondary Host (r630-02)

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 20 minutes

Actions Required:

  • Create /etc/keepalived/keepalived.conf with BACKUP configuration
  • Set virtual_router_id: 51 (must match primary)
  • Set priority: 100 (lower than primary)
  • Configure auth_pass (must match primary)
  • Configure virtual_ipaddress: 192.168.11.166/24
  • Reference health check script path
  • Reference notification script path
  • Verify configuration syntax
  • Document Keepalived configuration

Files to Create:

  • /etc/keepalived/keepalived.conf (see HA guide for full config)
  • /usr/local/bin/check-npmplus-health.sh (Task 3.4)
  • /usr/local/bin/keepalived-notify.sh (Task 3.5)

Task 3.4: Create Health Check Script

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Create /usr/local/bin/check-npmplus-health.sh on both hosts
  • Script should:
    • Detect hostname to determine which VMID to check
    • Check if container is running
    • Check if NPMplus Docker container is healthy
    • Check if NPMplus web interface responds (port 81)
    • Return exit code 0 if healthy, 1 if unhealthy
  • Make script executable: chmod +x
  • Test script manually on both hosts
  • Verify script detects failures correctly

File: /usr/local/bin/check-npmplus-health.sh
Details: See HA guide for full script content
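
A minimal sketch of the health check logic (the HA guide has the authoritative version); the VMID-per-host mapping and the "npmplus" docker container name are assumptions:

```shell
#!/usr/bin/env sh
# Sketch of /usr/local/bin/check-npmplus-health.sh

npmplus_vmid_for_host() {
    case "$1" in
        r630-01) echo 10233 ;;
        r630-02) echo 10234 ;;
        *) return 1 ;;
    esac
}

check_health() {
    vmid="$(npmplus_vmid_for_host "$(hostname)")" || return 1
    # Container running?
    pct status "$vmid" | grep -q 'status: running' || return 1
    # NPMplus docker container healthy?
    pct exec "$vmid" -- docker inspect -f '{{.State.Health.Status}}' npmplus 2>/dev/null \
        | grep -q healthy || return 1
    # Web interface answering on port 81?
    pct exec "$vmid" -- curl -ksf --max-time 3 https://127.0.0.1:81/ > /dev/null || return 1
}
# The installed script should end with:  check_health   (exit 0 = healthy, nonzero = trigger failover)
```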


Task 3.5: Create Keepalived Notification Script

Status: PENDING
Priority: 🟡 Important
Estimated Time: 15 minutes

Actions Required:

  • Create /usr/local/bin/keepalived-notify.sh on both hosts
  • Script should handle states: master, backup, fault
  • Log state changes to /var/log/keepalived-notify.log
  • Optional: Send alerts (email, webhook) on fault state
  • Make script executable: chmod +x
  • Test script with each state manually

File: /usr/local/bin/keepalived-notify.sh
Details: See HA guide for full script content
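
A sketch of the notification script (the HA guide has the full version); Keepalived invokes the notify script with the type, instance name, and new state as arguments:

```shell
#!/usr/bin/env sh
# Sketch of /usr/local/bin/keepalived-notify.sh

log_transition() {
    # $1 = type (INSTANCE/GROUP), $2 = instance name, $3 = MASTER|BACKUP|FAULT
    printf '%s %s %s -> %s\n' "$(date -Is)" "$1" "$2" "$3"
}

notify() {
    log_transition "$@" >> /var/log/keepalived-notify.log
    case "$3" in
        FAULT) : "optional: send alert (email, webhook) here" ;;
    esac
}

# The installed script should end with:  notify "$@"
```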


Task 3.6: Start and Enable Keepalived

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 15 minutes

Actions Required:

  • Enable Keepalived service on both hosts
  • Start Keepalived on both hosts
  • Verify Keepalived is running
  • Verify primary host owns VIP (192.168.11.166)
  • Verify secondary host is in BACKUP state
  • Monitor Keepalived logs for any errors
  • Document VIP ownership verification

Commands:

# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived

# Check VIP ownership (should be on primary)
ip addr show vmbr0 | grep 192.168.11.166

# Check logs
journalctl -u keepalived -f

Phase 4: Sync Configuration to Secondary

Task 4.1: Export Primary Configuration

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Create export script: scripts/npmplus/export-primary-config.sh
  • Export NPMplus SQLite database to SQL dump
  • Export proxy hosts via API (JSON)
  • Export certificates via API (JSON)
  • Create timestamped backup directory
  • Verify all exports completed successfully
  • Document backup location and contents

Script to Create: scripts/npmplus/export-primary-config.sh
Details: See HA guide for full script content


Task 4.2: Import Configuration to Secondary

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • Create import script: scripts/npmplus/import-secondary-config.sh
  • Stop NPMplus container on secondary (if running)
  • Copy database SQL dump to secondary
  • Import database dump into secondary NPMplus
  • Restart NPMplus container on secondary
  • Wait for NPMplus to be healthy
  • Verify proxy hosts are configured
  • Verify certificates are accessible
  • Document any manual configuration steps needed

Script to Create: scripts/npmplus/import-secondary-config.sh
Details: See HA guide for full script content

Note: Some configuration may need manual replication via API or UI.


Phase 5: Set Up Ongoing Configuration Sync

Task 5.1: Create Configuration Sync Script

Status: PENDING
Priority: 🟡 Important
Estimated Time: 45 minutes

Actions Required:

  • Create sync script: scripts/npmplus/sync-config.sh
  • Authenticate to NPMplus API (primary)
  • Export proxy hosts configuration
  • Implement API-based sync or document manual sync process
  • Add script to automation (if automated sync is possible)
  • Document manual sync procedures for configuration changes

Script to Create: scripts/npmplus/sync-config.sh
Note: Full automated sync requires shared database or complex API sync. For now, manual sync may be required.


Phase 6: Testing and Validation

Task 6.1: Test Virtual IP Failover

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Verify primary owns VIP before test
  • Simulate primary failure (stop Keepalived or NPMplus container)
  • Verify VIP moves to secondary within 5-10 seconds
  • Test connectivity to VIP from external source
  • Restore primary and verify failback
  • Document failover time (should be < 10 seconds)
  • Test multiple failover scenarios
  • Document test results

Test Scenarios:

  1. Stop Keepalived on primary
  2. Stop NPMplus container on primary
  3. Stop entire Proxmox host (if possible in test environment)
  4. Network partition (if possible in test environment)

Task 6.2: Test Certificate Access

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 30 minutes

Actions Required:

  • Verify certificates exist on secondary (after sync)
  • Test SSL endpoint from external: curl -vI https://explorer.d-bis.org
  • Verify certificate is valid and trusted
  • Test multiple domains with SSL
  • Verify certificate expiration dates match
  • Test certificate auto-renewal on secondary (when primary renews)
  • Document certificate test results

Commands:

# Verify certificates on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org

Task 6.3: Test Proxy Host Functionality

Status: PENDING
Priority: 🔴 Critical
Estimated Time: 45 minutes

Actions Required:

  • Test each domain from external after failover
  • Verify HTTP to HTTPS redirects work
  • Verify WebSocket connections work (for RPC endpoints)
  • Verify API endpoints respond correctly
  • Test all 19+ domains
  • Document any domains that don't work correctly
  • Test with secondary as active instance
  • Test failback to primary

Test Domains:

  • All d-bis.org domains (9 domains)
  • All mim4u.org domains (4 domains)
  • All sankofa.nexus domains (5 domains)
  • defi-oracle.io domain (1 domain)

Monitoring and Maintenance

Task 7.1: Set Up HA Status Monitoring

Status: COMPLETED (script created, needs deployment)
Priority: 🟡 Important
Location: scripts/npmplus/monitor-ha-status.sh

Actions Required:

  • Add cron job for HA status monitoring (every 5 minutes)
  • Configure log rotation for /var/log/npmplus-ha-monitor.log
  • Test monitoring script manually
  • Optional: Integrate with alerting system (email, webhook)
  • Document alert thresholds and escalation procedures
  • Test alert generation

Commands:

# On primary Proxmox host
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1

Task 7.2: Document Manual Failover Procedures

Status: PENDING
Priority: 🟡 Important
Estimated Time: 30 minutes

Actions Required:

  • Document step-by-step manual failover procedure
  • Document how to force failover to secondary
  • Document how to force failback to primary
  • Document troubleshooting steps for common issues
  • Create runbook for operations team
  • Test manual failover procedures
  • Review and approve documentation

Location: Add to docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md troubleshooting section


Task 7.3: Test All Failover Scenarios

Status: PENDING
Priority: 🟡 Important
Estimated Time: 2 hours

Actions Required:

  • Test automatic failover (primary failure)
  • Test automatic failback (primary recovery)
  • Test manual failover (force to secondary)
  • Test manual failback (force to primary)
  • Test partial failure (Keepalived down but NPMplus up)
  • Test network partition scenarios
  • Test during high traffic (if possible)
  • Document all test results
  • Identify and fix any issues found

HA Implementation Summary

Total Estimated Time

  • Phase 1: 1.5 hours (container creation and NPMplus installation)
  • Phase 2: 30 minutes (certificate sync setup)
  • Phase 3: 2 hours (Keepalived configuration and scripts)
  • Phase 4: 1.5 hours (configuration export/import)
  • Phase 5: 45 minutes (ongoing sync setup)
  • Phase 6: 2 hours (testing and validation)
  • Monitoring: 1 hour (monitoring setup and documentation)

Total: ~9 hours of implementation time

Prerequisites Checklist

  • Secondary Proxmox host available (r630-02 or ml110)
  • Network connectivity between hosts verified
  • Sufficient resources on secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
  • SSH access configured between hosts (key-based auth recommended)
  • Maintenance window scheduled
  • Backup of primary NPMplus completed
  • Team notified of maintenance window

Risk Mitigation

  • Rollback plan documented and tested
  • Primary NPMplus backup verified before changes
  • Test environment available (if possible)
  • Monitoring in place before production deployment
  • Emergency contact list available

Last Updated: 2026-01-20
Next Review: After addressing critical items