# Proxmox load balancing runbook
Purpose: Reduce load on the busiest node (r630-01) by migrating selected LXC containers to r630-02; moving containers to another host also frees storage on r630-01. Note: ml110 is being repurposed to OPNsense/pfSense (WAN aggregator); migrate workloads off ml110 to r630-01/r630-02 before the repurpose — see ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.
Before you start: If you are considering first adding a third or fourth R630 to the cluster, see PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md — including whether you already have r630-03/r630-04 (powered off) to bring online.
Current imbalance (typical):
| Node | IP | LXC count | Load (1/5/15) | Notes |
|---|---|---|---|---|
| r630-01 | 192.168.11.11 | 58 | 56 / 81 / 92 | Heavily loaded |
| r630-02 | 192.168.11.12 | 23 | ~4 / 4 / 4 | Light |
| ml110 | 192.168.11.10 | 18 | ~7 / 7 / 9 | Repurposing to OPNsense/pfSense — migrate workloads off to r630-01/r630-02 |
Ways to balance:
- Cross-host migration (r630-01 → r630-02) — Moves workload off r630-01. IP stays the same if the container uses a static IP; only the Proxmox host changes. (ml110 is no longer a migration target; migrate containers off ml110 first.)
- Same-host storage migration (r630-01 data → thin1) — Frees space on the datapool and can improve I/O; does not significantly reduce CPU load. See MIGRATION_PLAN_R630_01_DATA.md.
## 1. Check cluster (live migrate vs backup/restore)
If all nodes are in the same Proxmox cluster, you can try live migration (faster, less downtime):
    ssh root@192.168.11.11 "pvecm status"
    ssh root@192.168.11.12 "pvecm status"
- If both show the same cluster name and list each other: use `pct migrate <VMID> <target_node> --restart` from any cluster node (run on r630-01 or from a host that SSHs to r630-01).
- If the nodes are not in a cluster (or migration fails due to storage): use backup → copy → restore with the script below.
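The decision above can be sketched as a small helper. The cluster name comes from the `Name:` line of `pvecm status`; fetching it over SSH (shown in the comments) assumes key-based root access, so this is a sketch rather than a drop-in tool:

```shell
# choose_method prints "live" when both nodes report the same cluster name,
# else "backup-restore" (covers the no-cluster case, where the name is empty).
choose_method() {
  # $1 = source cluster name, $2 = target cluster name (either may be empty)
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "live"
  else
    echo "backup-restore"
  fi
}

# In practice, fetch the names first, e.g.:
#   src=$(ssh root@192.168.11.11 "pvecm status" | awk '/^Name:/ {print $2}')
#   dst=$(ssh root@192.168.11.12 "pvecm status" | awk '/^Name:/ {print $2}')
choose_method "mycluster" "mycluster"   # prints: live
```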
## 2. Cross-host migration (r630-01 → r630-02)
Script (backup/restore; works without shared storage):
    cd /path/to/proxmox
    # One container (replace VMID and target storage)
    ./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> [target_storage] [--destroy-source]
    # Examples
    ./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --dry-run
    ./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --destroy-source
Target storage on r630-02: check with `ssh root@192.168.11.12 "pvesm status"`. Common options: thin1, thin2, thin5, thin6.
If cluster works (live migrate):
    ssh root@192.168.11.11 "pct migrate <VMID> r630-02 --storage thin1 --restart"
    # Then remove the source CT if desired: pct destroy <VMID> --purge 1
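To queue several containers at once, a small wrapper can print one migration command per VMID for review (or piping to `sh`) before anything runs. This is a sketch: the VMIDs and `thin1` storage are examples, and `--destroy-source` is taken from the script usage above:

```shell
# plan_batch prints one migrate command per VMID; nothing is executed here,
# so the plan can be inspected first.
plan_batch() {
  storage=$1; shift
  for vmid in "$@"; do
    echo "./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh $vmid $storage --destroy-source"
  done
}

# Example: three of the smaller candidates from the table below.
plan_batch thin1 3500 3501 10232
```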
## 3. Good candidates to move (r630-01 → r630-02)
Containers that reduce load and are safe to move (no critical chain/consensus role; IPs can stay static). Prefer moving several smaller containers rather than one critical RPC.
| VMID | Name / role | Notes |
|---|---|---|
| 3500 | oracle-publisher-1 | Oracle publisher |
| 3501 | ccip-monitor-1 | CCIP monitor |
| 7804 | gov-portals-dev | Gov portals (already migrated in past; verify current host) |
| 8640 | vault-phoenix-1 | Vault (if not critical path) |
| 8642 | vault-phoenix-3 | Vault |
| 10232 | CT10232 | Small service |
| 10235 | npmplus-alltra-hybx | NPMplus instance (has its own NPM; update UDM port forward if needed) |
| 10236 | npmplus-fourth | NPMplus instance |
| 10030–10092 | order-* (identity, intake, finance, etc.) | Order stack; move as a group if desired |
| 10200–10210 | order-prometheus, grafana, opensearch, haproxy | Monitoring/HA; move with order-* or after |
Do not move (keep on r630-01 for now):
- 10233 — npmplus (main NPMplus; 76.53.10.36 → .167)
- 2101 — besu-rpc-core-1 (core RPC for deploy/admin)
- 2500–2505 — RPC alltra/hybx (critical RPCs)
- 1000–1002, 1500–1502 — validators and sentries (consensus)
- 10130, 10150, 10151 — dbis-frontend, dbis-api (core apps; move only with a plan)
- 100, 101, 102, 103, 104, 105 — mail, datacenter, cloudflared, omada, gitea (infra)
## 4. Migrating workloads off ml110 (before OPNsense/pfSense repurpose)
ml110 (192.168.11.10) is being repurposed to OPNsense/pfSense (WAN aggregator between 6–10 cable modems and UDM Pros). All containers/VMs on ml110 must be migrated to r630-01 or r630-02 before the repurpose.
- If in a cluster: `ssh root@192.168.11.10 "pct migrate <VMID> r630-01 --storage <storage> --restart"` (or target `r630-02` instead of `r630-01`).
- If not in a cluster: back up on ml110, copy the backup to r630-01 or r630-02, and restore there (see MIGRATE_CT_R630_01_TO_R630_02.md and adapt for source=ml110, target=r630-01 or r630-02).
After all workloads are off ml110, remove ml110 from the cluster (or reinstall the node with OPNsense/pfSense). See ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.
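One way to sketch the evacuation is to generate the `pct migrate` commands up front, alternating targets to spread load across both R630s. The VMIDs below are placeholders (list the real ones with `pct list` on ml110), and `--storage` flags are left out since they depend on each target's pools:

```shell
# evacuate_plan prints one "pct migrate" command per VMID, alternating the
# target between r630-02 and r630-01; review the output before running it.
evacuate_plan() {
  i=0
  for vmid in "$@"; do
    if [ $((i % 2)) -eq 0 ]; then target=r630-02; else target=r630-01; fi
    echo "pct migrate $vmid $target --restart"
    i=$((i + 1))
  done
}

# Placeholder VMIDs; replace with the real list from ml110.
evacuate_plan 4100 4101 4102
```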
## 5. After migration
- IP: Containers keep the same IP if they use static IP in the CT config; no change needed for NPM/DNS if they point by IP.
- Docs: Update any runbooks or configs that assume "VMID X is on r630-01" (e.g. comments in `config/ip-addresses.conf`, backup scripts).
- Verify: Re-run `bash scripts/check-all-proxmox-hosts.sh` and confirm load and container counts.
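For a quick sanity check independent of that script, the 1-minute load can be read from `/proc/loadavg` on each node. This sketch only parses a loadavg line; the SSH loop in the comment assumes key-based root access:

```shell
# load_1m extracts the 1-minute load from a /proc/loadavg line:
#   "<1m> <5m> <15m> <running/total> <last-pid>"
load_1m() {
  echo "$1" | cut -d' ' -f1
}

# e.g.:
#   for host in 192.168.11.11 192.168.11.12; do
#     echo "$host: $(load_1m "$(ssh root@$host cat /proc/loadavg)")"
#   done
load_1m "0.42 0.50 0.61 1/234 5678"   # prints: 0.42
```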
## 6. Quick reference
| Goal | Command / doc |
|---|---|
| Check current load | bash scripts/check-all-proxmox-hosts.sh |
| Migrate one CT (r630-01 → r630-02) | ./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> thin1 [--destroy-source] |
| Same-host (data → thin1) | MIGRATION_PLAN_R630_01_DATA.md, migrate-ct-r630-01-data-to-thin1.sh |
| Full migration doc | MIGRATE_CT_R630_01_TO_R630_02.md |