# NPMplus High Availability (HA) Setup Guide

**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation

---

**Date**: 2026-01-20
**Status**: Complete HA Architecture Guide
**Purpose**: Comprehensive guide for deploying High Availability NPMplus architecture

---

## Overview

This guide provides step-by-step instructions for deploying a highly available NPMplus setup to eliminate the single point of failure in the ingress architecture.

### Current Architecture
- **Single NPMplus Instance**: VMID 10233 on r630-01 (192.168.11.166)
- **Single Point of Failure**: All 19+ domains depend on one container
- **No Redundancy**: Container failure = complete ingress outage

### Target HA Architecture
- **Multiple NPMplus Instances**: Primary + Secondary (optionally Tertiary)
- **Shared Storage**: Database and certificates synchronized
- **Load Balancer**: Distributes traffic across instances
- **Automatic Failover**: Health checks and automatic routing

---

## HA Architecture Options

### Option 1: Active-Passive with Keepalived (Recommended for Start)

**Architecture**:
```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Standby
    ↓
Backend VMs
```

**Pros**:
- Simple configuration
- No changes to existing DNS/port forwarding
- Automatic failover
- Single active instance (easier certificate management)

**Cons**:
- Secondary instance idle (no load distribution)
- Requires shared storage for certificates

---

### Option 2: Active-Active with HAProxy Load Balancer

**Architecture**:
```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
HAProxy (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Active
    ↓
Backend VMs
```

**Pros**:
- Load distribution across instances
- Better resource utilization
- Automatic failover
- Can handle more traffic

**Cons**:
- More complex configuration
- Requires shared storage for database and certificates
- Need to handle SSL termination at HAProxy or NPMplus

---

### Option 3: Active-Active with Shared Database (Advanced)

**Architecture**:
```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233)
    └─ Secondary NPMplus (VMID 10234)
    ↓ (Shared Resources)
    ├─ PostgreSQL/MariaDB Database (Shared)
    ├─ NFS/GlusterFS for Certificates (Shared)
    └─ Shared Configuration Storage
    ↓
Backend VMs
```

**Pros**:
- True active-active (both instances serving traffic)
- Shared database ensures configuration sync
- Shared certificate storage

**Cons**:
- Most complex to implement
- Requires external database
- Requires shared file storage (NFS/GlusterFS)
- NPMplus uses SQLite (would need migration)

---

## Recommended Approach: Active-Passive with Keepalived

For the initial HA implementation, **Option 1 (Active-Passive with Keepalived)** is recommended because:
1. Minimal changes to existing architecture
2. Reuses existing NPMplus configuration
3. Easier to implement and test
4. Can be upgraded to active-active later

This guide focuses on **Option 1**, with notes on how to upgrade to **Option 2** later.

---

## Prerequisites

### Infrastructure Requirements
- **Primary Proxmox Host**: r630-01 (192.168.11.11) - Existing NPMplus
- **Secondary Proxmox Host**: r630-02 (192.168.11.12) or ml110 (192.168.11.10) - For secondary NPMplus
- **Shared Storage**: NFS or rsync-based synchronization for certificates
- **Network**: Both hosts on same VLAN (192.168.11.0/24)

### Software Requirements
- Keepalived (for virtual IP)
- rsync or NFS (for certificate synchronization)
- Monitoring tools (for health checks)

### Current NPMplus Details
- **VMID**: 10233
- **Host**: r630-01 (192.168.11.11)
- **Container IP**: 192.168.11.166 (eth0)
- **Management Port**: 81
- **Database**: `/data/database.sqlite`
- **Certificates**: `/data/tls/certbot/live/`

---

## Step-by-Step Implementation

### Phase 1: Prepare Secondary NPMplus Instance

#### Step 1.1: Create Secondary NPMplus Container

**Target**: VMID 10234 on r630-02 (192.168.11.12)

```bash
# On Proxmox host (r630-02)
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.168"
BRIDGE="vmbr0"

# Download Alpine template
# (list the exact template names with: pveam available --section system)
pveam download local alpine-3.22-default_20241208_amd64.tar.xz

# Create container
pct create $CTID \
  local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
  --hostname $HOSTNAME \
  --memory 1024 \
  --cores 2 \
  --rootfs local-lvm:5 \
  --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
  --unprivileged 1 \
  --features nesting=1

# Start container
pct start $CTID

# Wait for container to be ready
sleep 10
```

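
A fixed `sleep 10` can be too short on a busy host or needlessly long on a fast one. As a minimal sketch, a polling helper can retry a readiness command until it succeeds or a timeout expires. The name `wait_for` and the example probe are hypothetical, not part of Proxmox or NPMplus.

```bash
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the
# timeout is reached. Returns 0 on success, 1 on timeout.
wait_for() {
    local timeout_s=$1
    shift
    local elapsed=0
    until "$@" >/dev/null 2>&1; do
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Hypothetical usage on the Proxmox host, in place of the fixed sleep:
#   wait_for 60 pct exec "$CTID" -- ping -c 1 -W 1 192.168.11.1
```
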
#### Step 1.2: Install NPMplus on Secondary Instance

```bash
# SSH to Proxmox host
ssh root@192.168.11.12

# Enter container
pct exec 10234 -- ash

# Install dependencies
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

# Start Docker
rc-service docker start
rc-update add docker default

# Wait for Docker
sleep 5

# Fetch NPMplus compose file
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

# Update compose file with timezone and email
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

# Drop any existing TZ=/ACME_EMAIL= entries, then append the new values.
# test() matches by regex; comparing against "TZ=*" would be a literal
# string comparison and would never remove the old entries.
yq -i "
  .services.npmplus.environment |=
    (map(select(test(\"^(TZ|ACME_EMAIL)=\") | not)) +
     [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

# Start NPMplus. Do not configure proxy hosts on this instance yet;
# the primary's configuration is imported in Phase 4.
docker compose up -d
```

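
The environment edit in this step is meant to replace any existing `TZ=` and `ACME_EMAIL=` entries rather than duplicate them. The same intent can be sketched in plain shell on a list of `KEY=value` lines, which makes the expected result easy to check. `merge_env` and the sample entries are hypothetical and only illustrate the filtering logic.

```bash
# merge_env NEW_TZ NEW_EMAIL
# Reads KEY=value lines on stdin, drops existing TZ= and ACME_EMAIL=
# entries, then appends the new values.
merge_env() {
    local new_tz=$1
    local new_email=$2
    # grep -v exits non-zero when nothing is left; that is not an error here
    grep -v -E '^(TZ|ACME_EMAIL)=' || true
    printf 'TZ=%s\nACME_EMAIL=%s\n' "$new_tz" "$new_email"
}

# Sample input with one stale entry and one unrelated (made-up) entry
printf 'TZ=UTC\nOTHER=1\n' | merge_env "America/New_York" "admin@example.org"
```
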
#### Step 1.3: Configure Secondary Container Network

```bash
# Secondary container should have a static IP
# VMID 10234: 192.168.11.168 (eth0), as set in Step 1.1

# Verify IP
pct exec 10234 -- ip addr show eth0
```

---

### Phase 2: Set Up Certificate Synchronization

#### Step 2.1: Create Certificate Sync Script

**Location**: `scripts/npmplus/sync-certificates.sh`

```bash
#!/bin/bash
# Synchronize NPMplus certificates from primary to secondary

set -euo pipefail

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
CERT_PATH="/data/tls/certbot/live"

# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

log_info "Starting certificate synchronization..."

# Sync certificates from primary to secondary.
# rsync cannot copy between two remote hosts, so it runs on the primary
# host and pushes to the secondary (root on the primary needs SSH key
# access to the secondary). --copy-links copies the files that any
# symlinks under live/ point to instead of the links themselves.
ssh -o StrictHostKeyChecking=no root@$PRIMARY_HOST \
  "rsync -avz --delete --copy-links \
    -e 'ssh -o StrictHostKeyChecking=no' \
    '/var/lib/vz/containers/$PRIMARY_VMID/var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/' \
    'root@$SECONDARY_HOST:/var/lib/vz/containers/$SECONDARY_VMID/var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/'"

log_info "Certificate synchronization complete"
```

**Make executable**:
```bash
chmod +x scripts/npmplus/sync-certificates.sh
```

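
To confirm that a sync run actually left both sides identical, one option is to compare a single fingerprint of each certificate tree. The sketch below is an assumption about how such a check could look, not part of the sync script; `dir_fingerprint` is a hypothetical helper and relies on `sha256sum` being installed.

```bash
# dir_fingerprint DIR
# Prints one SHA-256 digest covering the relative path and content of
# every regular file under DIR (symlinks are followed).
dir_fingerprint() {
    (
        cd "$1" || exit 1
        find -L . -type f -exec sha256sum {} + \
            | LC_ALL=C sort -k 2 \
            | sha256sum \
            | cut -d ' ' -f 1
    )
}

# Hypothetical usage: run against the live/ directory on each side and
# compare the two digests; a mismatch means the trees differ.
#   dir_fingerprint /path/to/tls/certbot/live
```
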
#### Step 2.2: Set Up Automated Certificate Sync

**Cron Job** (runs every 5 minutes):
```bash
# On primary Proxmox host (r630-01)
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1
```

---

### Phase 3: Set Up Keepalived for Virtual IP

#### Step 3.1: Install Keepalived on Proxmox Hosts

```bash
# On both primary and secondary Proxmox hosts
apt update
apt install -y keepalived
```

#### Step 3.2: Configure Keepalived on Primary Host (r630-01)

**File**: `/etc/keepalived/keepalived.conf`

```bash
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # The penalty must exceed the 10-point gap between the two nodes
    # (110 vs 100); with -10 a failed primary only ties with the secondary.
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass npmplus_ha_2024
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}
```

#### Step 3.3: Configure Keepalived on Secondary Host (r630-02)

**File**: `/etc/keepalived/keepalived.conf`

```bash
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # Keep the same penalty as on the primary (larger than the 10-point gap)
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state BACKUP
    interface vmbr0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass npmplus_ha_2024
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        chk_npmplus
    }
    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}
```

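
Keepalived lowers a node's priority by a negative `weight` while its tracked script is failing, and the standby only takes over once its own priority is higher than the active node's. The sketch below is a simplified model of that arithmetic, not Keepalived's actual code, and checks two candidate weights against the 110/100 priorities used above. `effective_priority` is a hypothetical helper.

```bash
# effective_priority BASE WEIGHT CHECK_RESULT
# CHECK_RESULT is "ok" or "fail". A negative weight is applied to the
# base priority only while the tracked script is failing.
effective_priority() {
    local base=$1
    local weight=$2
    local result=$3
    if [ "$result" = "fail" ] && [ "$weight" -lt 0 ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

# Primary's health check fails, secondary stays healthy
for w in -10 -20; do
    primary=$(effective_priority 110 "$w" fail)
    secondary=$(effective_priority 100 "$w" ok)
    if [ "$secondary" -gt "$primary" ]; then
        echo "weight $w: secondary ($secondary) outranks failed primary ($primary)"
    else
        echo "weight $w: no takeover, secondary ($secondary) vs primary ($primary)"
    fi
done
```

Under this model the penalty has to be larger than the priority gap between the two nodes for a failed health check alone to move the VIP.
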
#### Step 3.4: Create Health Check Script

**File**: `/usr/local/bin/check-npmplus-health.sh` (on both hosts)

```bash
#!/bin/bash
# Check NPMplus health and return 0 if healthy, 1 if unhealthy

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

# Use the short hostname so an FQDN does not break the comparison
HOSTNAME=$(hostname -s)
if [ "$HOSTNAME" = "r630-01" ]; then
  VMID=$PRIMARY_VMID
elif [ "$HOSTNAME" = "r630-02" ]; then
  VMID=$SECONDARY_VMID
else
  exit 1
fi

# Check if container is running
if ! pct status $VMID 2>/dev/null | grep -q "running"; then
  exit 1
fi

# Check that the NPMplus container is up and not reporting "unhealthy".
# A plain grep for "healthy" would also match "unhealthy".
STATUS=$(pct exec $VMID -- docker ps --filter "name=npmplus" --format "{{.Status}}" 2>/dev/null)
case "$STATUS" in
  *unhealthy*) exit 1 ;;
  Up*) ;;
  *) exit 1 ;;
esac

# Check if NPMplus web interface responds
if ! pct exec $VMID -- curl -s -k -f -o /dev/null --max-time 5 https://localhost:81 >/dev/null 2>&1; then
  exit 1
fi

# All checks passed
exit 0
```

**Make executable**:
```bash
chmod +x /usr/local/bin/check-npmplus-health.sh
```

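
The `{{.Status}}` text from `docker ps` starts with "Up" for running containers and, when a health check is defined, ends with "(healthy)" or "(unhealthy)". Because "unhealthy" contains "healthy", substring matching needs care. A small classifier, sketched here with the hypothetical name `container_state`, makes the three cases explicit.

```bash
# container_state STATUS_STRING
# Maps docker's status text to "healthy", "unhealthy", or "down".
# Containers without a health check report plain "Up ..." and are
# treated as healthy in this sketch.
container_state() {
    case "$1" in
        Up*"(unhealthy)"*) echo "unhealthy" ;;
        Up*) echo "healthy" ;;
        *) echo "down" ;;
    esac
}

container_state "Up 2 hours (healthy)"
container_state "Up 5 minutes (unhealthy)"
container_state ""
```
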
#### Step 3.5: Create Notification Script

**File**: `/usr/local/bin/keepalived-notify.sh` (on both hosts)

```bash
#!/bin/bash
# Handle Keepalived state changes

STATE=$1
LOGFILE="/var/log/keepalived-notify.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

case "$STATE" in
  "master")
    echo "[$TIMESTAMP] Transitioned to MASTER - This node now owns VIP 192.168.11.166" >> "$LOGFILE"
    # Optionally: Start services, send alerts, etc.
    ;;
  "backup")
    echo "[$TIMESTAMP] Transitioned to BACKUP - Standby mode" >> "$LOGFILE"
    ;;
  "fault")
    echo "[$TIMESTAMP] Transitioned to FAULT - Health check failed" >> "$LOGFILE"
    # Optionally: Send critical alerts
    ;;
esac
```

**Make executable**:
```bash
chmod +x /usr/local/bin/keepalived-notify.sh
```

#### Step 3.6: Start Keepalived

```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived
ip addr show vmbr0 | grep 192.168.11.166
```

---

### Phase 4: Sync Configuration to Secondary

#### Step 4.1: Export Primary Configuration

**Script**: `scripts/npmplus/export-primary-config.sh`

```bash
#!/bin/bash
# Export primary NPMplus configuration

set -euo pipefail

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
BACKUP_DIR="/tmp/npmplus-config-backup-$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Export database
ssh root@$PRIMARY_HOST "pct exec $PRIMARY_VMID -- docker exec npmplus sqlite3 /data/database.sqlite '.dump'" > "$BACKUP_DIR/database.sql"

# Export proxy hosts via API (if available)
NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
# Read the password from the environment (.env) instead of hardcoding it
NPM_PASSWORD="${NPM_PASSWORD:?Set NPM_PASSWORD before running}"

TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
  -H "Content-Type: application/json" \
  -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

# Stop before writing empty exports if authentication failed
if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then
  echo "ERROR: Authentication failed" >&2
  exit 1
fi

curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
  -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/proxy_hosts.json"

curl -s -k -X GET "$NPM_URL/api/nginx/certificates" \
  -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/certificates.json"

echo "Configuration exported to $BACKUP_DIR"
```

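
Before importing, it is worth checking that the SQL dump was written completely, since an SSH or container error can leave an empty or truncated file. The sketch below relies on the fact that `sqlite3 .dump` output wraps its statements in `BEGIN TRANSACTION;` and a final `COMMIT;`. `dump_looks_complete` is a hypothetical helper name.

```bash
# dump_looks_complete FILE
# Succeeds only if FILE is non-empty, opens a transaction, and its last
# non-blank line is "COMMIT;".
dump_looks_complete() {
    [ -s "$1" ] || return 1
    grep -q '^BEGIN TRANSACTION;' "$1" || return 1
    local last_line
    last_line=$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)
    [ "$last_line" = "COMMIT;" ]
}

# Hypothetical usage after the export:
#   dump_looks_complete "$BACKUP_DIR/database.sql" || echo "dump incomplete" >&2
```
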
#### Step 4.2: Import Configuration to Secondary

**Script**: `scripts/npmplus/import-secondary-config.sh`

```bash
#!/bin/bash
# Import configuration to secondary NPMplus

set -euo pipefail

SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
BACKUP_DIR="${1:-}"  # Path to backup directory from Step 4.1
# NPMplus data volume as seen from inside the LXC (same path as in Step 6.2)
DATA_DIR="/var/lib/docker/volumes/npmplus_data/_data"

if [ -z "$BACKUP_DIR" ] || [ ! -d "$BACKUP_DIR" ]; then
  echo "Usage: $0 <backup-directory>"
  exit 1
fi

# Copy database backup to the Proxmox host, then into the container
scp "$BACKUP_DIR/database.sql" root@$SECONDARY_HOST:/tmp/database.sql
ssh root@$SECONDARY_HOST "pct push $SECONDARY_VMID /tmp/database.sql /tmp/database.sql"

# Stop NPMplus so the database file is not in use during the import
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker stop npmplus"

# Import database. `docker exec` cannot run in a stopped container, so
# sqlite3 runs in the LXC itself and loads the dump into a fresh file.
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- sh -c '
  apk add --no-cache sqlite >/dev/null
  [ -f $DATA_DIR/database.sqlite ] && mv $DATA_DIR/database.sqlite $DATA_DIR/database.sqlite.bak
  sqlite3 $DATA_DIR/database.sqlite < /tmp/database.sql
'"

# Restart NPMplus
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker start npmplus"

# Wait for NPMplus to be ready
sleep 10

echo "Configuration imported to secondary NPMplus"
```

---

### Phase 5: Set Up Configuration Sync (Ongoing)

#### Step 5.1: Create Configuration Sync Script

**Script**: `scripts/npmplus/sync-config.sh`

```bash
#!/bin/bash
# Sync NPMplus configuration from primary to secondary

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
NPM_PASSWORD="${NPM_PASSWORD:-}"  # From .env

if [ -z "$NPM_PASSWORD" ]; then
  echo "ERROR: NPM_PASSWORD not set"
  exit 1
fi

# Authenticate
TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
  -H "Content-Type: application/json" \
  -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then
  echo "ERROR: Authentication failed"
  exit 1
fi

# Export from primary
curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
  -H "Authorization: Bearer $TOKEN" > /tmp/proxy_hosts_primary.json

# Get secondary URL (will be different when not active)
SECONDARY_URL="https://192.168.11.168:81"

# For now, manual sync is required
# In future: implement API-based sync or shared database
echo "Manual configuration sync required"
echo "Export from: $NPM_URL"
echo "Import to: $SECONDARY_URL"
```

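
While configuration is replicated by hand, a drift check can show which proxy-host domains exist on only one instance. The sketch below compares two plain-text lists with one domain per line. Producing those lists is left out; one way, assuming the proxy-hosts export is a JSON array of objects with a `domain_names` field, would be `jq -r '.[].domain_names[]'`. `config_drift` is a hypothetical helper.

```bash
# config_drift PRIMARY_LIST SECONDARY_LIST
# Each file holds one domain per line. Prints the domains found on only
# one side, prefixed with that side; prints nothing when both match.
config_drift() {
    local tmp_a tmp_b
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b" | sed 's/^/primary-only: /'
    LC_ALL=C comm -13 "$tmp_a" "$tmp_b" | sed 's/^/secondary-only: /'
    rm -f "$tmp_a" "$tmp_b"
}

# Hypothetical usage:
#   config_drift /tmp/domains_primary.txt /tmp/domains_secondary.txt
```
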
**Note**: Keeping configuration in sync requires one of:
- Shared database (PostgreSQL/MariaDB migration)
- API-based sync script (more complex)
- Manual sync process for configuration changes

**For now**: Configuration changes must be manually replicated to the secondary.

---

### Phase 6: Testing and Validation

#### Step 6.1: Test Virtual IP Failover

```bash
# On primary host
ip addr show vmbr0 | grep 192.168.11.166
# Should show: 192.168.11.166

# Simulate primary failure
systemctl stop keepalived

# Wait 5-10 seconds
sleep 10

# Check secondary host
ssh root@192.168.11.12 "ip addr show vmbr0 | grep 192.168.11.166"
# Should now show: 192.168.11.166 (VIP moved to secondary)

# Test connectivity
curl -k https://192.168.11.166:81
# Should connect to secondary NPMplus

# Restore primary
systemctl start keepalived

# Wait for failback
sleep 10
```

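
How long a failover takes depends on the Keepalived timers. As a rough sketch based on the VRRPv2 formulas in RFC 3768, a backup declares the master dead after `3 * advert_int + skew`, where the skew is `(256 - priority) / 256` seconds, and a failing `vrrp_script` is marked down after about `interval * fall` seconds. The helper names are hypothetical, and real timings also include scheduling and network delay.

```bash
# master_down_interval ADVERT_INT PRIORITY
# Seconds a backup waits without advertisements before taking over.
master_down_interval() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 }'
}

# health_detect_time INTERVAL FALL
# Approximate seconds before a failing tracked script is marked down.
health_detect_time() {
    echo $(( $1 * $2 ))
}

master_down_interval 1 100   # → 3.609 (advert_int 1, backup priority 100)
health_detect_time 5 2       # → 10 (interval 5, fall 2)
```

When Keepalived is stopped cleanly, as in the test above, the master announces priority 0 on shutdown, so the backup should take over after roughly the skew time alone; the full interval applies when the primary host disappears without notice.
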
#### Step 6.2: Test Certificate Access

```bash
# Verify certificates exist on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
# Should show valid certificate
```

#### Step 6.3: Test Proxy Host Functionality

```bash
# Test each domain from external
for domain in explorer.d-bis.org mim4u.org rpc-http-pub.d-bis.org; do
  echo "Testing $domain..."
  curl -I "https://$domain" 2>&1 | grep -E "HTTP|Server"
done
```

---

## Monitoring and Maintenance

### Health Monitoring

**Script**: `scripts/npmplus/monitor-ha-status.sh`

```bash
#!/bin/bash
# Monitor HA status and send alerts if needed

VIP="192.168.11.166"
PRIMARY_HOST="192.168.11.11"
SECONDARY_HOST="192.168.11.12"

# Check who owns VIP.
# An if/elif chain avoids the precedence trap of "a && b || c && d",
# which would report both hosts when the primary holds the VIP.
if ssh root@$PRIMARY_HOST "ip -4 addr show vmbr0" 2>/dev/null | grep -qF "inet $VIP/"; then
  VIP_OWNER="$PRIMARY_HOST"
elif ssh root@$SECONDARY_HOST "ip -4 addr show vmbr0" 2>/dev/null | grep -qF "inet $VIP/"; then
  VIP_OWNER="$SECONDARY_HOST"
else
  VIP_OWNER="UNKNOWN"
fi

echo "VIP $VIP owner: $VIP_OWNER"

# Check Keepalived status on both hosts
PRIMARY_STATUS=$(ssh root@$PRIMARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")
SECONDARY_STATUS=$(ssh root@$SECONDARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")

echo "Primary Keepalived: $PRIMARY_STATUS"
echo "Secondary Keepalived: $SECONDARY_STATUS"

# Alert if both are down
if [ "$PRIMARY_STATUS" != "active" ] && [ "$SECONDARY_STATUS" != "active" ]; then
  echo "ALERT: Both Keepalived instances are down!"
  # Send alert (email, webhook, etc.)
fi
```

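
Besides "primary" and "secondary", two other outcomes are worth telling apart when checking VIP ownership: nobody holds the address, or both nodes do (split brain). The sketch below classifies the captured output of `ip -4 -o addr show vmbr0` from each host. `has_vip` and `vip_owner` are hypothetical helpers, and the sample lines are made up for illustration.

```bash
# has_vip VIP IP_ADDR_OUTPUT
# Succeeds if the output lists VIP. Matching "inet VIP/" avoids partial
# hits such as 192.168.11.16 matching 192.168.11.166.
has_vip() {
    printf '%s\n' "$2" | grep -q -F "inet $1/"
}

# vip_owner VIP PRIMARY_OUTPUT SECONDARY_OUTPUT
# Prints PRIMARY, SECONDARY, BOTH (split brain), or NONE.
vip_owner() {
    local on_primary=no on_secondary=no
    has_vip "$1" "$2" && on_primary=yes
    has_vip "$1" "$3" && on_secondary=yes
    case "$on_primary/$on_secondary" in
        yes/yes) echo "BOTH" ;;
        yes/no)  echo "PRIMARY" ;;
        no/yes)  echo "SECONDARY" ;;
        *)       echo "NONE" ;;
    esac
}

sample_a='5: vmbr0    inet 192.168.11.11/24 scope global vmbr0
5: vmbr0    inet 192.168.11.166/24 scope global secondary vmbr0'
sample_b='5: vmbr0    inet 192.168.11.12/24 scope global vmbr0'
vip_owner 192.168.11.166 "$sample_a" "$sample_b"
```
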
**Cron Job**:
```bash
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```

---

## Upgrading to Active-Active (Future)

To upgrade from Active-Passive to Active-Active:

### Option A: HAProxy Load Balancer

1. Deploy HAProxy on dedicated VM/container (VMID 10235)
2. Configure HAProxy to balance between both NPMplus instances
3. Update UDM Pro port forwarding to point to HAProxy IP
4. Configure shared storage for certificates
5. Implement shared database (PostgreSQL migration)

### Option B: DNS Round-Robin

1. Assign multiple IPs to NPMplus instances
2. Configure DNS round-robin (not recommended for SSL termination)

---

## Troubleshooting

### Issue: VIP not moving to secondary

**Symptoms**: Primary fails but secondary doesn't take over

**Check**:
```bash
# Check Keepalived logs
journalctl -u keepalived -n 50

# Check health check script
/usr/local/bin/check-npmplus-health.sh
echo $?  # Should return 0 if healthy

# Check firewall (VRRP is IP protocol 112, multicast group 224.0.0.18)
iptables -S | grep -Ei 'vrrp|224\.0\.0\.18'
```

**Solution**: Ensure VRRP advertisements (IP protocol 112, multicast address 224.0.0.18) are allowed between hosts.

---

### Issue: Certificates out of sync

**Symptoms**: Secondary shows certificate errors

**Solution**:
```bash
# Manually sync certificates
bash scripts/npmplus/sync-certificates.sh

# Verify sync (the Docker volume lives inside the secondary container)
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"
```

---

### Issue: Configuration mismatch

**Symptoms**: Proxy hosts work on primary but not secondary

**Solution**:
```bash
# Export from primary
bash scripts/npmplus/export-primary-config.sh

# Import to secondary, using the most recent export directory
bash scripts/npmplus/import-secondary-config.sh "$(ls -d /tmp/npmplus-config-backup-* | tail -n 1)"
```

---

## Rollback Plan

If HA setup causes issues:

1. **Disable Keepalived on Secondary**:
   ```bash
   # Both commands must run on the secondary host
   ssh root@192.168.11.12 "systemctl stop keepalived"
   ssh root@192.168.11.12 "systemctl disable keepalived"
   ```

2. **Ensure Primary Owns VIP**:
   ```bash
   systemctl restart keepalived
   ip addr show vmbr0 | grep 192.168.11.166
   ```

3. **Stop Secondary NPMplus** (optional):
   ```bash
   ssh root@192.168.11.12 "pct stop 10234"
   ```

4. **Remove Secondary Container** (if not needed):
   ```bash
   ssh root@192.168.11.12 "pct destroy 10234"
   ```

---

## Cost and Resource Impact

### Additional Resources Required
- **Secondary NPMplus Container**: ~1 GB RAM, 5 GB disk, 2 CPU cores
- **Keepalived**: Minimal overhead (< 10 MB RAM)
- **Network**: VRRP multicast traffic (minimal)
- **Storage**: Certificate sync storage (same as primary)

### Maintenance Overhead
- **Certificate Sync**: Automated (every 5 minutes)
- **Configuration Sync**: Manual (when changes made)
- **Monitoring**: Automated (every 5 minutes)

---

## Next Steps

1. **Review and Approve HA Architecture**
2. **Schedule Maintenance Window** (if required)
3. **Create Secondary NPMplus Instance** (Phase 1)
4. **Set Up Certificate Sync** (Phase 2)
5. **Configure Keepalived** (Phase 3)
6. **Sync Configuration** (Phase 4)
7. **Test Failover** (Phase 6)
8. **Enable Monitoring** (Monitoring section)

---

## References

- **Keepalived Documentation**: https://www.keepalived.org/manpage.html
- **NPMplus GitHub**: https://github.com/ZoeyVid/NPMplus
- **VRRP Protocol**: RFC 3768 (VRRPv2); RFC 5798 (VRRPv3)
- **Current Architecture**: `docs/04-configuration/DNS_NPMPLUS_VM_COMPREHENSIVE_ARCHITECTURE.md`

---

**Last Updated**: 2026-01-20
**Status**: Ready for Implementation
**Estimated Implementation Time**: 4-6 hours