NPMplus High Availability (HA) Setup Guide

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation


Date: 2026-01-20
Status: Complete HA Architecture Guide
Purpose: Comprehensive guide for deploying High Availability NPMplus architecture


Overview

This guide provides step-by-step instructions for deploying a highly available NPMplus setup to eliminate the single point of failure in the ingress architecture.

Current Architecture

  • Single NPMplus Instance: VMID 10233 on r630-01 (192.168.11.166)
  • Single Point of Failure: All 19+ domains depend on one container
  • No Redundancy: Container failure = complete ingress outage

Target HA Architecture

  • Multiple NPMplus Instances: Primary + Secondary (optionally Tertiary)
  • Shared Storage: Database and certificates synchronized
  • Load Balancer: Distributes traffic across instances
  • Automatic Failover: Health checks and automatic routing

HA Architecture Options

Option 1: Active-Passive with Keepalived (Recommended)

Architecture:

Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Standby
    ↓
Backend VMs

Pros:

  • Simple configuration
  • No changes to existing DNS/port forwarding
  • Automatic failover
  • Single active instance (easier certificate management)

Cons:

  • Secondary instance idle (no load distribution)
  • Requires shared storage for certificates

Option 2: Active-Active with HAProxy Load Balancer

Architecture:

Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
HAProxy (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Active
    ↓
Backend VMs

Pros:

  • Load distribution across instances
  • Better resource utilization
  • Automatic failover
  • Can handle more traffic

Cons:

  • More complex configuration
  • Requires shared storage for database and certificates
  • Need to handle SSL termination at HAProxy or NPMplus

Option 3: Active-Active with Shared Database (Advanced)

Architecture:

Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233)
    └─ Secondary NPMplus (VMID 10234)
    ↓ (Shared Resources)
    ├─ PostgreSQL/MariaDB Database (Shared)
    ├─ NFS/GlusterFS for Certificates (Shared)
    └─ Shared Configuration Storage
    ↓
Backend VMs

Pros:

  • True active-active (both instances serving traffic)
  • Shared database ensures configuration sync
  • Shared certificate storage

Cons:

  • Most complex to implement
  • Requires external database
  • Requires shared file storage (NFS/GlusterFS)
  • NPMplus uses SQLite (would need migration)

For the initial HA implementation, Option 1 (Active-Passive with Keepalived) is recommended because:

  1. Minimal changes to existing architecture
  2. Reuses existing NPMplus configuration
  3. Easier to implement and test
  4. Can be upgraded to active-active later

This guide focuses on Option 1, with notes on how to upgrade to Option 2 later.


Prerequisites

Infrastructure Requirements

  • Primary Proxmox Host: r630-01 (192.168.11.11) - Existing NPMplus
  • Secondary Proxmox Host: r630-02 (192.168.11.12) or ml110 (192.168.11.10) - For secondary NPMplus
  • Shared Storage: NFS or rsync-based synchronization for certificates
  • Network: Both hosts on same VLAN (192.168.11.0/24)

Software Requirements

  • Keepalived (for virtual IP)
  • rsync or NFS (for certificate synchronization)
  • Monitoring tools (for health checks)

Current NPMplus Details

  • VMID: 10233
  • Host: r630-01 (192.168.11.11)
  • Container IP: 192.168.11.166 (eth0)
  • Management Port: 81
  • Database: /data/database.sqlite
  • Certificates: /data/tls/certbot/live/

Step-by-Step Implementation

Phase 1: Prepare Secondary NPMplus Instance

Step 1.1: Create Secondary NPMplus Container

Target: VMID 10234 on r630-02 (192.168.11.12)

# On Proxmox host (r630-02)
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.168"
BRIDGE="vmbr0"

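# Refresh the template index, then download the Alpine template
# (confirm the exact file name with: pveam available --section system)
pveam update
pveam download local alpine-3.22-default_20241208_amd64.tar.xz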

# Create container
pct create $CTID \
    local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
    --hostname $HOSTNAME \
    --memory 1024 \
    --cores 2 \
    --rootfs local-lvm:5 \
    --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
    --unprivileged 1 \
    --features nesting=1

# Start container
pct start $CTID

# Wait for container to be ready
sleep 10

Step 1.2: Install NPMplus on Secondary Instance

# SSH to Proxmox host
ssh root@192.168.11.12

# Enter container
pct exec 10234 -- ash

# Install dependencies
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

# Start Docker
rc-service docker start
rc-update add docker default

# Wait for Docker
sleep 5

# Fetch NPMplus compose file
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

# Update compose file with timezone and email
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

yq -i "
  .services.npmplus.environment |=
    (map(select(. != \"TZ=*\" and . != \"ACME_EMAIL=*\")) +
    [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

# Start NPMplus so it initializes its data directory (the primary's configuration is imported in Phase 4)
docker compose up -d

Step 1.3: Configure Secondary Container Network

# Secondary container should have the static IP assigned in Step 1.1
# VMID 10234: 192.168.11.168 (eth0)

# Verify IP
pct exec 10234 -- ip addr show eth0
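The IP check can also be scripted so that it fails loudly instead of relying on visual inspection. This is a sketch under assumptions: `has_ipv4` is a hypothetical helper, and it expects the one-line-per-address output of `ip -o -4 addr show` on standard input.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the `ip -o -4 addr show` output on stdin
# contains the expected IPv4 address.
has_ipv4() {
    local expected="$1"
    # -F matches a fixed string; the trailing slash stops 192.168.11.16
    # from matching 192.168.11.168/24
    grep -qF "inet ${expected}/"
}
```

Example use from the Proxmox host: `pct exec 10234 -- ip -o -4 addr show eth0 | has_ipv4 192.168.11.168 && echo "IP OK"`.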

Phase 2: Set Up Certificate Synchronization

Step 2.1: Create Certificate Sync Script

Location: scripts/npmplus/sync-certificates.sh

#!/bin/bash
# Synchronize NPMplus certificates from primary to secondary

set -euo pipefail

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
CERT_PATH="/data/tls/certbot/live"

# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

log_info "Starting certificate synchronization..."

# Sync certificates from primary to secondary.
# rsync cannot copy between two remote hosts, so stream a tar archive from the
# primary container into the secondary container over SSH instead.
# Note: unlike rsync --delete, this does not remove certificates deleted on the primary.
CERTBOT_DIR="/var/lib/docker/volumes/npmplus_data/_data/tls/certbot"
ssh root@$PRIMARY_HOST "pct exec $PRIMARY_VMID -- tar -C $CERTBOT_DIR -cf - live" \
    | ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- tar -C $CERTBOT_DIR -xf -"

log_info "Certificate synchronization complete"

Make executable:

chmod +x scripts/npmplus/sync-certificates.sh

Step 2.2: Set Up Automated Certificate Sync

Cron Job (runs every 5 minutes):

# On primary Proxmox host (r630-01)
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1
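Because the job runs every 5 minutes, a sync that hangs (for example on an unreachable host) could overlap with the next run. One way to guard against that is a simple lock, sketched below. `run_exclusive` and the lock path in the usage note are hypothetical names, not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command only if no other run holds the lock.
# Usage: run_exclusive LOCK_DIR COMMAND [ARGS...]
run_exclusive() {
    local lock_dir="$1"
    shift
    # mkdir is atomic: it fails if the directory already exists
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "Another run holds $lock_dir, skipping" >&2
        return 75
    fi
    "$@"
    local status=$?
    # Release the lock and pass the command's exit status through
    rmdir "$lock_dir"
    return "$status"
}
```

Inside the sync script this could wrap the transfer step, for example `run_exclusive /var/run/npmplus-cert-sync.lock do_sync`, where `do_sync` stands for whatever function performs the copy. A lock left behind by a killed run has to be removed by hand, which is the trade-off of this simple approach.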

Phase 3: Set Up Keepalived for Virtual IP

Step 3.1: Install Keepalived on Proxmox Hosts

# On both primary and secondary Proxmox hosts
apt update
apt install -y keepalived

Step 3.2: Configure Keepalived on Primary Host (r630-01)

File: /etc/keepalived/keepalived.conf

vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # A failing check lowers priority by 20 (110 -> 90), below the backup's 100
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass npmplus_ha_2024
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}

Step 3.3: Configure Keepalived on Secondary Host (r630-02)

File: /etc/keepalived/keepalived.conf

vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # A failing check lowers priority by 20 (100 -> 80)
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state BACKUP
    interface vmbr0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass npmplus_ha_2024
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}
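The interaction between `priority` and the `vrrp_script` weight decides which node holds the VIP: with a negative weight, keepalived lowers a node's priority by that amount while its check is failing. The arithmetic can be sketched as follows. `effective_priority` is a hypothetical helper for illustration only; it models the calculation and does not query keepalived.

```shell
#!/bin/bash
# Hypothetical model of keepalived's priority adjustment for a negative weight.
# Usage: effective_priority BASE_PRIORITY WEIGHT CHECK_OK(yes|no)
effective_priority() {
    local base="$1" weight="$2" check_ok="$3"
    if [ "$check_ok" = "yes" ]; then
        # Healthy: the configured priority applies unchanged
        echo "$base"
    else
        # Failing: the (negative) weight is added to the base priority
        echo $(( base + weight ))
    fi
}
```

With base priorities of 110 and 100, a weight of -10 lowers a failing primary to exactly 100, a tie that is settled by VRRP's address comparison rather than by priority. A weight of -20 lowers it to 90, so the healthy secondary at 100 wins outright.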

Step 3.4: Create Health Check Script

File: /usr/local/bin/check-npmplus-health.sh (on both hosts)

#!/bin/bash
# Check NPMplus health and return 0 if healthy, 1 if unhealthy

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

HOSTNAME=$(hostname)
if [ "$HOSTNAME" = "r630-01" ]; then
    VMID=$PRIMARY_VMID
elif [ "$HOSTNAME" = "r630-02" ]; then
    VMID=$SECONDARY_VMID
else
    exit 1
fi

# Check if container is running
if ! pct status $VMID 2>/dev/null | grep -q "running"; then
    exit 1
fi

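# Check that the NPMplus Docker container is up and not reporting "unhealthy"
# (a plain match on "healthy" would also match "unhealthy")
STATUS=$(pct exec "$VMID" -- docker ps --filter "name=npmplus" --format "{{.Status}}")
if ! echo "$STATUS" | grep -q "^Up" || echo "$STATUS" | grep -q "unhealthy"; then
    exit 1
fi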

# Check if NPMplus web interface responds
if ! pct exec $VMID -- curl -s -k -f -o /dev/null --max-time 5 https://localhost:81 >/dev/null 2>&1; then
    exit 1
fi

# All checks passed
exit 0

Make executable:

chmod +x /usr/local/bin/check-npmplus-health.sh
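The Docker status match is the easiest part of the health check to get subtly wrong, because `docker ps` reports strings such as `Up 3 hours (healthy)`, `Up 2 minutes (unhealthy)` and `Exited (1) 5 minutes ago`. The sketch below isolates that decision in a function that can be exercised without a running container. `status_is_healthy` is a hypothetical name, not part of the original script.

```shell
#!/bin/bash
# Hypothetical classifier for a `docker ps --format "{{.Status}}"` string.
# Returns 0 for a running container that is not flagged unhealthy.
status_is_healthy() {
    case "$1" in
        *unhealthy*) return 1 ;;  # running, but failing its Docker health check
        Up*)         return 0 ;;  # running (healthy, starting, or no health check)
        *)           return 1 ;;  # exited, created, empty, etc.
    esac
}
```

For example, `status_is_healthy "Up 2 minutes (unhealthy)"` fails while `status_is_healthy "Up 3 hours (healthy)"` succeeds.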

Step 3.5: Create Notification Script

File: /usr/local/bin/keepalived-notify.sh (on both hosts)

#!/bin/bash
# Handle Keepalived state changes

STATE=$1
LOGFILE="/var/log/keepalived-notify.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

case "$STATE" in
    "master")
        echo "[$TIMESTAMP] Transitioned to MASTER - This node now owns VIP 192.168.11.166" >> "$LOGFILE"
        # Optionally: Start services, send alerts, etc.
        ;;
    "backup")
        echo "[$TIMESTAMP] Transitioned to BACKUP - Standby mode" >> "$LOGFILE"
        ;;
    "fault")
        echo "[$TIMESTAMP] Transitioned to FAULT - Health check failed" >> "$LOGFILE"
        # Optionally: Send critical alerts
        ;;
esac

Make executable:

chmod +x /usr/local/bin/keepalived-notify.sh

Step 3.6: Start Keepalived

# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived
ip addr show vmbr0 | grep 192.168.11.166

Phase 4: Sync Configuration to Secondary

Step 4.1: Export Primary Configuration

Script: scripts/npmplus/export-primary-config.sh

#!/bin/bash
# Export primary NPMplus configuration

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
BACKUP_DIR="/tmp/npmplus-config-backup-$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Export database
ssh root@$PRIMARY_HOST "pct exec $PRIMARY_VMID -- docker exec npmplus sqlite3 /data/database.sqlite '.dump'" > "$BACKUP_DIR/database.sql"

# Export proxy hosts via API (if available)
NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
NPM_PASSWORD="your-password"  # Update from .env

TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
    -H "Content-Type: application/json" \
    -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
    -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/proxy_hosts.json"

curl -s -k -X GET "$NPM_URL/api/nginx/certificates" \
    -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/certificates.json"

echo "Configuration exported to $BACKUP_DIR"

Step 4.2: Import Configuration to Secondary

Script: scripts/npmplus/import-secondary-config.sh

#!/bin/bash
# Import configuration to secondary NPMplus

SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
BACKUP_DIR="$1"  # Path to backup directory from Step 4.1

if [ -z "$BACKUP_DIR" ] || [ ! -d "$BACKUP_DIR" ]; then
    echo "Usage: $0 <backup-directory>"
    exit 1
fi

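# Build a fresh database from the dump while the container is still running
# (docker exec only works on a running container). The dump is streamed over
# SSH, so no intermediate copy is needed. database.import.sqlite is a
# temporary file name chosen for this step.
DATA_DIR="/var/lib/docker/volumes/npmplus_data/_data"
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker exec -i npmplus sqlite3 /data/database.import.sqlite" < "$BACKUP_DIR/database.sql"

# Stop NPMplus, swap the imported database into place, and start it again
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker stop npmplus"
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- mv $DATA_DIR/database.import.sqlite $DATA_DIR/database.sqlite"
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker start npmplus"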

# Wait for NPMplus to be ready
sleep 10

echo "Configuration imported to secondary NPMplus"

Phase 5: Set Up Configuration Sync (Ongoing)

Step 5.1: Create Configuration Sync Script

Script: scripts/npmplus/sync-config.sh

#!/bin/bash
# Sync NPMplus configuration from primary to secondary

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
NPM_PASSWORD="${NPM_PASSWORD:-}"  # From .env

if [ -z "$NPM_PASSWORD" ]; then
    echo "ERROR: NPM_PASSWORD not set"
    exit 1
fi

# Authenticate
TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
    -H "Content-Type: application/json" \
    -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then
    echo "ERROR: Authentication failed"
    exit 1
fi

# Export from primary
curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
    -H "Authorization: Bearer $TOKEN" > /tmp/proxy_hosts_primary.json

# Get secondary URL (will be different when not active)
SECONDARY_URL="https://192.168.11.168:81"

# For now, manual sync is required
# In future: implement API-based sync or shared database
echo "Manual configuration sync required"
echo "Export from: $NPM_URL"
echo "Import to: $SECONDARY_URL"

Note: Fully automated configuration sync requires one of the following:

  • Shared database (PostgreSQL/MariaDB migration)
  • API-based sync script (more complex)

Until one of these is in place, configuration changes made on the primary must be replicated to the secondary manually.
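While sync remains manual, a drift check can at least show when the two instances disagree. The sketch below compares two plain-text lists of domain names, one per line. `config_drift` is a hypothetical helper, and producing the lists is left to the operator.

```shell
#!/bin/bash
# Hypothetical drift check between two one-domain-per-line lists.
# Usage: config_drift PRIMARY_LIST SECONDARY_LIST
config_drift() {
    local primary="$1" secondary="$2"
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fvxf "$secondary" "$primary" | sed 's/^/missing-on-secondary: /'
    grep -Fvxf "$primary" "$secondary" | sed 's/^/missing-on-primary: /'
    return 0
}
```

The lists could be produced from the exported JSON, for example with `jq -r '.[].domain_names[]' proxy_hosts.json`, assuming the API response carries a `domain_names` array per proxy host as in upstream Nginx Proxy Manager. Empty output from `config_drift` means both lists match.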


Phase 6: Testing and Validation

Step 6.1: Test Virtual IP Failover

# On primary host
ip addr show vmbr0 | grep 192.168.11.166
# Should show: 192.168.11.166

# Simulate primary failure
systemctl stop keepalived

# Wait 5-10 seconds
sleep 10

# Check secondary host
ssh root@192.168.11.12 "ip addr show vmbr0 | grep 192.168.11.166"
# Should now show: 192.168.11.166 (VIP moved to secondary)

# Test connectivity
curl -k https://192.168.11.166:81
# Should connect to secondary NPMplus

# Restore primary
systemctl start keepalived

# Wait for failback
sleep 10
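The fixed `sleep 10` pauses assume failover completes within that window. Polling until the VIP answers gives a pass or fail result and a rough failover time. This is a sketch: `wait_until` is a hypothetical helper, and the 30-second limit in the usage note is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical polling helper: retry a command once per second until it
# succeeds or the timeout expires. Prints the seconds waited on success.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() {
    local timeout_s="$1"
    shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$elapsed"
}
```

Example use after stopping keepalived on the primary: `wait_until 30 curl -sk -o /dev/null --max-time 2 https://192.168.11.166:81` prints the number of seconds until the management interface answered on the VIP, or returns non-zero if it never did.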

Step 6.2: Test Certificate Access

# Verify certificates exist on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
# Should show valid certificate

Step 6.3: Test Proxy Host Functionality

# Test each domain from external
for domain in explorer.d-bis.org mim4u.org rpc-http-pub.d-bis.org; do
    echo "Testing $domain..."
    curl -I "https://$domain" 2>&1 | grep -E "HTTP|Server"
done

Monitoring and Maintenance

Health Monitoring

Script: scripts/npmplus/monitor-ha-status.sh

#!/bin/bash
# Monitor HA status and send alerts if needed

VIP="192.168.11.166"
PRIMARY_HOST="192.168.11.11"
SECONDARY_HOST="192.168.11.12"

# Check who owns VIP (grep -q keeps the matched line out of the result)
if ssh root@$PRIMARY_HOST "ip addr show vmbr0 | grep -q 'inet $VIP/'"; then
    VIP_OWNER="$PRIMARY_HOST"
elif ssh root@$SECONDARY_HOST "ip addr show vmbr0 | grep -q 'inet $VIP/'"; then
    VIP_OWNER="$SECONDARY_HOST"
else
    VIP_OWNER="UNKNOWN"
fi

echo "VIP $VIP owner: $VIP_OWNER"

# Check Keepalived status on both hosts
PRIMARY_STATUS=$(ssh root@$PRIMARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")
SECONDARY_STATUS=$(ssh root@$SECONDARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")

echo "Primary Keepalived: $PRIMARY_STATUS"
echo "Secondary Keepalived: $SECONDARY_STATUS"

# Alert if both are down
if [ "$PRIMARY_STATUS" != "active" ] && [ "$SECONDARY_STATUS" != "active" ]; then
    echo "ALERT: Both Keepalived instances are down!"
    # Send alert (email, webhook, etc.)
fi

Cron Job:

*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1

Upgrading to Active-Active (Future)

To upgrade from Active-Passive to Active-Active:

Option A: HAProxy Load Balancer

  1. Deploy HAProxy on dedicated VM/container (VMID 10235)
  2. Configure HAProxy to balance between both NPMplus instances
  3. Update UDM Pro port forwarding to point to HAProxy IP
  4. Configure shared storage for certificates
  5. Implement shared database (PostgreSQL migration)

Option B: DNS Round-Robin

  1. Assign multiple IPs to NPMplus instances
  2. Configure DNS round-robin (not recommended for SSL termination)

Troubleshooting

Issue: VIP not moving to secondary

Symptoms: Primary fails but secondary doesn't take over

Check:

# Check Keepalived logs
journalctl -u keepalived -n 50

# Check health check script
/usr/local/bin/check-npmplus-health.sh
echo $?  # Should return 0 if healthy

# Check firewall (VRRP is IP protocol 112 and uses multicast address 224.0.0.18)
iptables -L -n | grep -E "vrrp|224\.0\.0\.18"

Solution: Ensure VRRP traffic (IP protocol 112, multicast address 224.0.0.18) is allowed between the hosts.


Issue: Certificates out of sync

Symptoms: Secondary shows certificate errors

Solution:

# Manually sync certificates
bash scripts/npmplus/sync-certificates.sh

# Verify sync (the path is inside the secondary container, so go through pct exec)
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

Issue: Configuration mismatch

Symptoms: Proxy hosts work on primary but not secondary

Solution:

# Export from primary
bash scripts/npmplus/export-primary-config.sh

# Import to secondary (pick the most recent export; the script accepts one directory)
bash scripts/npmplus/import-secondary-config.sh "$(ls -dt /tmp/npmplus-config-backup-* | head -n 1)"

Rollback Plan

If HA setup causes issues:

  1. Disable Keepalived on Secondary:

    ssh root@192.168.11.12 "systemctl stop keepalived && systemctl disable keepalived"
    
  2. Ensure Primary Owns VIP:

    systemctl restart keepalived
    ip addr show vmbr0 | grep 192.168.11.166
    
  3. Stop Secondary NPMplus (optional):

    ssh root@192.168.11.12 "pct stop 10234"
    
  4. Remove Secondary Container (if not needed):

    ssh root@192.168.11.12 "pct destroy 10234"
    

Cost and Resource Impact

Additional Resources Required

  • Secondary NPMplus Container: ~1 GB RAM, 5 GB disk, 2 CPU cores
  • Keepalived: Minimal overhead (< 10 MB RAM)
  • Network: VRRP multicast traffic (minimal)
  • Storage: Certificate sync storage (same as primary)

Maintenance Overhead

  • Certificate Sync: Automated (every 5 minutes)
  • Configuration Sync: Manual (when changes made)
  • Monitoring: Automated (every 5 minutes)

Next Steps

  1. Review and Approve HA Architecture
  2. Schedule Maintenance Window (if required)
  3. Create Secondary NPMplus Instance (Phase 1)
  4. Set Up Certificate Sync (Phase 2)
  5. Configure Keepalived (Phase 3)
  6. Sync Configuration (Phases 4 and 5)
  7. Test Failover (Phase 6)
  8. Enable Monitoring (Monitoring section)

Last Updated: 2026-01-20
Status: Ready for Implementation
Estimated Implementation Time: 4-6 hours