docs/04-configuration/PROXMOX_LOAD_BALANCING_RUNBOOK.md

# Proxmox load balancing runbook

**Purpose:** Reduce load on the busiest node (r630-01) by migrating selected LXC containers to r630-02. Moving a container to another host also frees the storage it used on r630-01. **Note:** ml110 is being repurposed as an OPNsense/pfSense WAN aggregator; migrate workloads *off* ml110 to r630-01/r630-02 before the repurpose. See [ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md](../11-references/ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md).

**Before you start:** If you are considering adding a **third or fourth R630** to the cluster first, see [PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md](PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md) — including whether you already have r630-03/r630-04 (powered off) to bring online.

**Current imbalance (typical):**

| Node     | IP            | LXC count | Load (1/5/15)   | Notes        |
|----------|---------------|-----------|------------------|--------------|
| r630-01  | 192.168.11.11 | 58        | 56 / 81 / 92     | Heavily loaded |
| r630-02  | 192.168.11.12 | 23        | ~4 / 4 / 4        | Light        |
| ml110    | 192.168.11.10 | 18        | ~7 / 7 / 9        | **Repurposing to OPNsense/pfSense** — migrate workloads off to r630-01/r630-02 |
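
The figures in this table can be re-collected per node with standard commands. The following is a minimal sketch, assuming root SSH access to each host; the helper names `format_load` and `count_cts` are hypothetical and are not part of the repo scripts.

```bash
# Hypothetical helpers (not part of this repo's scripts).

# Format the first three fields of /proc/loadavg as "1m / 5m / 15m".
format_load() {
  awk '{ printf "%s / %s / %s\n", $1, $2, $3 }'
}

# Count containers in `pct list` output: rows whose first field is a numeric VMID.
count_cts() {
  awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Usage (run from a host with SSH access to the nodes):
#   ssh root@192.168.11.11 "cat /proc/loadavg" | format_load
#   ssh root@192.168.11.11 "pct list" | count_cts
```

Both helpers read from stdin, so they can be checked against saved command output without touching a node.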

**Ways to balance:**

1. **Cross-host migration (r630-01 → r630-02)** — Moves workload off r630-01. IP stays the same if the container uses a static IP; only the Proxmox host changes. (ml110 is no longer a migration target; migrate containers *off* ml110 first.)
2. **Same-host storage migration (r630-01 data → thin1)** — Frees space on the `data` pool and can improve I/O; does not reduce CPU/load by much. See [MIGRATION_PLAN_R630_01_DATA.md](MIGRATION_PLAN_R630_01_DATA.md).
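
Whether a container keeps its IP after a cross-host move depends on its network configuration. A minimal sketch for checking this, assuming the `net0` line printed by `pct config <VMID>` carries either `ip=<address>/<prefix>` or `ip=dhcp`; the function name `has_static_ip` is hypothetical.

```bash
# Hypothetical helper: succeeds if `pct config` output (on stdin) contains
# a netN line with a static IPv4 address such as ip=192.168.11.50/24.
has_static_ip() {
  grep -E '^net[0-9]+:' |
    grep -Eq '(^|[ ,])ip=[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/[0-9]+'
}

# Usage (VMID 3501 is one of the candidates listed later in this runbook):
#   ssh root@192.168.11.11 "pct config 3501" | has_static_ip \
#     && echo "static IP: address survives the move" \
#     || echo "no static IPv4 found: check DHCP reservations before moving"
```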

---

## 1. Check cluster (live migrate vs backup/restore)

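If all nodes are in the **same Proxmox cluster**, you can migrate with `pct migrate` (faster and with less downtime than backup/restore). For a running LXC container this is a restart migration rather than a true live migration: the container is shut down, moved, and started again on the target. Check cluster membership first: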

```bash
ssh root@192.168.11.11 "pvecm status"
ssh root@192.168.11.12 "pvecm status"
```

- If both show the **same cluster name** and list each other: run `pct migrate <VMID> <target_node> --restart` on the node that currently hosts the container (here r630-01, directly or over SSH).
- If nodes are **not** in a cluster (or migrate fails due to storage): use **backup → copy → restore** with the script below.
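
The comparison above can be scripted. This is a minimal sketch, assuming `pvecm status` prints a `Name:` line in its "Cluster information" section; `cluster_name` and `same_cluster` are hypothetical helper names.

```bash
# Hypothetical helpers (not part of this repo's scripts).

# Extract the cluster name from `pvecm status` output on stdin.
cluster_name() {
  awk -F': *' '/^Name:/ { print $2; exit }'
}

# Succeeds only if both hosts report the same, non-empty cluster name.
same_cluster() {
  local a b
  a=$(ssh "root@$1" "pvecm status" 2>/dev/null | cluster_name)
  b=$(ssh "root@$2" "pvecm status" 2>/dev/null | cluster_name)
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
#   same_cluster 192.168.11.11 192.168.11.12 \
#     && echo "same cluster: pct migrate is an option" \
#     || echo "not clustered: use backup/restore"
```

A matching name is necessary but not sufficient; still confirm in the `pvecm status` membership list that each node sees the other.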

---

## 2. Cross-host migration (r630-01 → r630-02)

**Script (backup/restore; works without shared storage):**

```bash
cd /path/to/proxmox

# One container (replace VMID and target storage)
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> [target_storage] [--dry-run] [--destroy-source]

# Examples
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --dry-run
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --destroy-source
```

**Target storage on r630-02:** Check with `ssh root@192.168.11.12 "pvesm status"`. Common: `thin1`, `thin2`, `thin5`, `thin6`.
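
To compare free space across the candidate pools, the `pvesm status` output can be filtered. A minimal sketch, assuming the usual column order (Name, Type, Status, Total, Used, Available, %) with sizes in KiB; `active_storages` is a hypothetical helper name.

```bash
# Hypothetical helper: print "<name> <available GiB>" for each active storage,
# reading `pvesm status` output from stdin.
# Assumes columns: Name Type Status Total Used Available % (sizes in KiB).
active_storages() {
  awk 'NR > 1 && $3 == "active" { printf "%s %.1f\n", $1, $6 / 1048576 }'
}

# Usage:
#   ssh root@192.168.11.12 "pvesm status" | active_storages
```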

**If cluster works (live migrate):**

```bash
ssh root@192.168.11.11 "pct migrate <VMID> r630-02 --target-storage thin1 --restart"
# pct migrate moves the CT (config and volumes), so no source copy remains on r630-01 to destroy.
```

---

## 3. Good candidates to move (r630-01 → r630-02)

Containers that **reduce load** and are **safe to move** (no critical chain/consensus; IP can stay static). Prefer moving several smaller ones rather than one critical RPC.

| VMID   | Name / role           | Notes |
|--------|------------------------|-------|
| 3500   | oracle-publisher-1    | Oracle publisher |
| 3501   | ccip-monitor-1        | CCIP monitor |
| 7804   | gov-portals-dev       | Gov portals (migrated previously; verify current host before moving) |
| 8640   | vault-phoenix-1       | Vault (if not critical path) |
| 8642   | vault-phoenix-3       | Vault |
| 10232  | CT10232               | Small service |
| 10235  | npmplus-alltra-hybx   | NPMplus instance (has its own NPM; update UDM port forward if needed) |
| 10236  | npmplus-fourth        | NPMplus instance |
| 10030–10092 | order-* (identity, intake, finance, etc.) | Order stack; move as a group if desired |
| 10200–10210 | order-prometheus, grafana, opensearch, haproxy | Monitoring/HA; move with order-* or after |
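
Moving several small containers is easier to review as a printed plan first. A minimal sketch that only prints one dry-run command per VMID, using the migration script path from section 2; `plan_migrations` is a hypothetical helper name.

```bash
# Hypothetical helper: print (do not run) one dry-run command per VMID.
# $1 = target storage on r630-02, remaining args = VMIDs.
plan_migrations() {
  local storage=$1 vmid
  shift
  for vmid in "$@"; do
    printf './scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh %s %s --dry-run\n' \
      "$vmid" "$storage"
  done
}

# Usage: review the plan, then run the printed lines one at a time.
#   plan_migrations thin1 3500 3501
```

Running the moves one at a time, and checking load between them, keeps any single failure small.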

**Do not move (keep on r630-01 for now):**

- **10233** — npmplus (main NPMplus; 76.53.10.36 → .167)
- **2101** — besu-rpc-core-1 (core RPC for deploy/admin)
- **2500–2505** — RPC alltra/hybx (critical RPCs)
- **1000–1002, 1500–1502** — validators and sentries (consensus)
- **10130, 10150, 10151** — dbis-frontend, dbis-api (core apps; move only with a plan)
- **100, 101, 102, 103, 104, 105** — mail, datacenter, cloudflared, omada, gitea (infra)
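
A guard in front of any batch move helps avoid touching these by accident. A minimal sketch that encodes the list above as of this writing; `is_protected` and `in_range` are hypothetical helper names, and the list must be kept in sync with this section by hand.

```bash
# Hypothetical helpers (not part of this repo's scripts).

# Succeeds if $1 lies in the inclusive range [$2, $3].
in_range() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Succeeds (exit 0) if the VMID is on the do-not-move list.
# Returns 2 for input that is not a numeric VMID.
is_protected() {
  local id=$1
  case "$id" in
    ''|*[!0-9]*) return 2 ;;
  esac
  case "$id" in
    10233|2101|10130|10150|10151) return 0 ;;
  esac
  in_range "$id" 2500 2505 || in_range "$id" 1000 1002 ||
    in_range "$id" 1500 1502 || in_range "$id" 100 105
}

# Usage:
#   is_protected 2101 && echo "skip 2101: keep on r630-01"
```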

---

## 4. Migrating workloads *off* ml110 (before OPNsense/pfSense repurpose)

ml110 (192.168.11.10) is being **repurposed to OPNsense/pfSense** (WAN aggregator between 6–10 cable modems and UDM Pros). All containers/VMs on ml110 must be **migrated to r630-01 or r630-02** before the repurpose.

- **If cluster:** `ssh root@192.168.11.10 "pct migrate <VMID> r630-01 --target-storage <storage> --restart"`, or use `r630-02` as the target node.
- **If no cluster:** Use backup on ml110, copy to r630-01 or r630-02, restore there (see [MIGRATE_CT_R630_01_TO_R630_02.md](../03-deployment/MIGRATE_CT_R630_01_TO_R630_02.md) and adapt for source=ml110, target=r630-01 or r630-02).

After all workloads are off ml110, remove ml110 from the cluster (if it is a member) before reinstalling the node with OPNsense/pfSense. See [ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md](../11-references/ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md).
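
Before removing or reinstalling ml110, it is worth confirming that no guests remain on it. A minimal sketch, assuming `pct list` and `qm list` print one row per guest with the numeric VMID as the first field; `list_guest_ids` is a hypothetical helper name.

```bash
# Hypothetical helper: print the VMID of every guest row found on stdin.
# Header lines and messages are ignored because their first field is not numeric.
list_guest_ids() {
  awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Usage:
#   left=$(ssh root@192.168.11.10 "pct list; qm list" | list_guest_ids)
#   if [ -z "$left" ]; then
#     echo "ml110 is empty"
#   else
#     echo "still on ml110:" $left
#   fi
```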

---

## 5. After migration

- **IP:** Containers keep the same IP if the CT config sets a static IP; NPM and DNS entries that point to that IP need no change.
- **Docs:** Update any runbooks or configs that assume “VMID X is on r630-01” (e.g. `config/ip-addresses.conf` comments, backup scripts).
- **Verify:** Re-run `bash scripts/check-all-proxmox-hosts.sh` and confirm load and container counts.
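
A per-container check on the target node complements the cluster-wide script. A minimal sketch, assuming `pct status <VMID>` prints a single line such as `status: running`; `ct_is_running` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds if `pct status` output on stdin
# is exactly "status: running".
ct_is_running() {
  grep -qx 'status: running'
}

# Usage (3501 moved to r630-02 in the examples above):
#   ssh root@192.168.11.12 "pct status 3501" | ct_is_running \
#     && echo "3501 is running on r630-02" \
#     || echo "3501 is not running on r630-02"
```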

---

## 6. Quick reference

| Goal | Command / doc |
|------|----------------|
| Check current load | `bash scripts/check-all-proxmox-hosts.sh` |
| Migrate one CT (r630-01 → r630-02) | `./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> thin1 [--destroy-source]` |
| Same-host (data → thin1) | [MIGRATION_PLAN_R630_01_DATA.md](MIGRATION_PLAN_R630_01_DATA.md), `migrate-ct-r630-01-data-to-thin1.sh` |
| Full migration doc | [MIGRATE_CT_R630_01_TO_R630_02.md](../03-deployment/MIGRATE_CT_R630_01_TO_R630_02.md) |