# Proxmox load balancing runbook

Purpose: reduce load on the busiest node (r630-01) by migrating selected LXC containers to r630-02; moving containers to another host also frees space on r630-01. Note: ml110 is being repurposed to OPNsense/pfSense (WAN aggregator); migrate workloads off ml110 to r630-01/r630-02 before the repurpose — see ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.

Before you start: if you are considering adding a third or fourth R630 to the cluster, first see PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md — including whether you already have r630-03/r630-04 (powered off) to bring online.

Current imbalance (typical):

| Node | IP | LXC count | Load (1/5/15) | Notes |
|---|---|---|---|---|
| r630-01 | 192.168.11.11 | 58 | 56 / 81 / 92 | Heavily loaded |
| r630-02 | 192.168.11.12 | 23 | ~4 / 4 / 4 | Light |
| ml110 | 192.168.11.10 | 18 | ~7 / 7 / 9 | Repurposing to OPNsense/pfSense — migrate workloads off to r630-01/r630-02 |

Ways to balance:

  1. Cross-host migration (r630-01 → r630-02) — Moves workload off r630-01. IP stays the same if the container uses a static IP; only the Proxmox host changes. (ml110 is no longer a migration target; migrate containers off ml110 first.)
  2. Same-host storage migration (r630-01 data → thin1) — Frees space on the data pool and can improve I/O; does not reduce CPU/load by much. See MIGRATION_PLAN_R630_01_DATA.md.

## 1. Check cluster (live migrate vs backup/restore)

If all nodes are in the same Proxmox cluster, you can try live migration (faster, less downtime):

```bash
ssh root@192.168.11.11 "pvecm status"
ssh root@192.168.11.12 "pvecm status"
```

- If both show the same cluster name and list each other: use `pct migrate <VMID> <target_node> --restart` from any cluster node (run on r630-01 or from a host that SSHs to r630-01).
- If the nodes are not in a cluster (or migration fails due to storage): use backup → copy → restore with the script below.
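The decision above can be scripted. A minimal sketch — `cluster_name` and `same_cluster` are illustrative helpers of mine, not scripts from this repo; it assumes passwordless root SSH as used elsewhere in this runbook:

```shell
#!/bin/sh
# Decide live-migrate vs backup/restore from two "pvecm status" outputs.

# Print the cluster name from a "pvecm status" dump on stdin.
cluster_name() {
  awk '/^Name:/ {print $2; exit}'
}

# Succeed only when both names are non-empty and identical.
same_cluster() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage sketch:
#   n1=$(ssh root@192.168.11.11 "pvecm status" | cluster_name)
#   n2=$(ssh root@192.168.11.12 "pvecm status" | cluster_name)
#   if same_cluster "$n1" "$n2"; then echo "use pct migrate"; else echo "use backup/restore"; fi
```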

## 2. Cross-host migration (r630-01 → r630-02)

Script (backup/restore; works without shared storage):

```bash
cd /path/to/proxmox

# One container (replace VMID and target storage)
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> [target_storage] [--destroy-source]

# Examples
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --dry-run
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --destroy-source
```

Target storage on r630-02: check with `ssh root@192.168.11.12 "pvesm status"`. Common options: `thin1`, `thin2`, `thin5`, `thin6`.

If cluster works (live migrate):

```bash
ssh root@192.168.11.11 "pct migrate <VMID> r630-02 --storage thin1 --restart"
# Live migration moves the CT, so no source cleanup is needed; pct destroy
# <VMID> --purge 1 applies only to the backup/restore path, where a source
# copy remains.
```

## 3. Good candidates to move (r630-01 → r630-02)

Containers that are safe to move (no critical chain/consensus role; IPs can stay static) and whose migration meaningfully reduces load. Prefer moving several smaller containers rather than one critical RPC.

| VMID | Name / role | Notes |
|---|---|---|
| 3500 | oracle-publisher-1 | Oracle publisher |
| 3501 | ccip-monitor-1 | CCIP monitor |
| 7804 | gov-portals-dev | Gov portals (already migrated in the past; verify current host) |
| 8640 | vault-phoenix-1 | Vault (if not critical path) |
| 8642 | vault-phoenix-3 | Vault |
| 10232 | CT10232 | Small service |
| 10235 | npmplus-alltra-hybx | NPMplus instance (has its own NPM; update UDM port forward if needed) |
| 10236 | npmplus-fourth | NPMplus instance |
| 10030–10092 | order-* (identity, intake, finance, etc.) | Order stack; move as a group if desired |
| 10200–10210 | order-prometheus, grafana, opensearch, haproxy | Monitoring/HA; move with order-* or after |
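Moving several small candidates is mostly repetition of the section 2 command, so a wrapper loop can help. A sketch — `build_migrate_cmd` is a hypothetical helper (the script path and flags come from section 2); always dry-run a batch before committing:

```shell
#!/bin/sh
# Build the migrate command line for one VMID.
build_migrate_cmd() {
  # $1 = VMID, $2 = target storage, $3 = flag such as --dry-run
  printf '%s' "./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh $1 $2 $3"
}

# Usage sketch: dry-run a batch of small candidates from the table above.
#   for vmid in 3500 3501 10232; do
#     sh -c "$(build_migrate_cmd "$vmid" thin1 --dry-run)"
#   done
```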

Do not move (keep on r630-01 for now):

- 10233 — npmplus (main NPMplus; 76.53.10.36 → .167)
- 2101 — besu-rpc-core-1 (core RPC for deploy/admin)
- 2500–2505 — RPC alltra/hybx (critical RPCs)
- 1000–1002, 1500–1502 — validators and sentries (consensus)
- 10130, 10150, 10151 — dbis-frontend, dbis-api (core apps; move only with a plan)
- 100, 101, 102, 103, 104, 105 — mail, datacenter, cloudflared, omada, gitea (infra)

## 4. Migrating workloads off ml110 (before OPNsense/pfSense repurpose)

ml110 (192.168.11.10) is being repurposed to OPNsense/pfSense (WAN aggregator between 610 cable modems and UDM Pros). All containers/VMs on ml110 must be migrated to r630-01 or r630-02 before the repurpose.

- If clustered: `ssh root@192.168.11.10 "pct migrate <VMID> r630-01 --storage <storage> --restart"` (or the same command targeting r630-02).
- If not clustered: back up on ml110, copy to r630-01 or r630-02, and restore there (see MIGRATE_CT_R630_01_TO_R630_02.md and adapt it for source=ml110, target=r630-01 or r630-02).
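Enumerating what is still on ml110 can be driven from `pct list` output. A sketch — `list_vmids` is an illustrative helper, not a repo script; the loop only prints the commands so you can review them before running anything:

```shell
#!/bin/sh
# Print the VMID column of a "pct list" dump (skipping the header row).
list_vmids() {
  awk 'NR > 1 {print $1}'
}

# Usage sketch: print (not run) a migrate command per container on ml110.
#   ssh root@192.168.11.10 "pct list" | list_vmids | while read -r vmid; do
#     echo "ssh root@192.168.11.10 'pct migrate $vmid r630-02 --restart'"
#   done
```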

After all workloads are off ml110, remove ml110 from the cluster (or reinstall the node with OPNsense/pfSense). See ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.


## 5. After migration

- IP: containers keep the same IP if they use a static IP in the CT config; no change is needed for NPM/DNS entries that point by IP.
- Docs: update any runbooks or configs that assume "VMID X is on r630-01" (e.g. config/ip-addresses.conf comments, backup scripts).
- Verify: re-run `bash scripts/check-all-proxmox-hosts.sh` and confirm load and container counts.
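For a quick ad hoc check of the container-count part, counts can also be pulled straight from `pct list`. A sketch — `ct_count` is an illustrative helper, not a repo script:

```shell
#!/bin/sh
# Count containers in a "pct list" dump (rows minus the header).
ct_count() {
  awk 'NR > 1 {n++} END {print n + 0}'
}

# Usage sketch: compare counts against the imbalance table above.
#   ssh root@192.168.11.11 "pct list" | ct_count   # expect fewer than 58
#   ssh root@192.168.11.12 "pct list" | ct_count   # expect more than 23
```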

## 6. Quick reference

| Goal | Command / doc |
|---|---|
| Check current load | `bash scripts/check-all-proxmox-hosts.sh` |
| Migrate one CT (r630-01 → r630-02) | `./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> thin1 [--destroy-source]` |
| Same-host (data → thin1) | MIGRATION_PLAN_R630_01_DATA.md, `migrate-ct-r630-01-data-to-thin1.sh` |
| Full migration doc | MIGRATE_CT_R630_01_TO_R630_02.md |