Proxmox full HA cluster — current state and roadmap
Last Updated: 2026-01-31
Document Version: 1.0
Status: Active documentation; cluster present, full HA not implemented
Short answer
Yes — for production, this Proxmox setup should ideally be a full HA cluster. Right now it is a cluster (shared config, quorum, live view) but not Proxmox HA. When you power down one R630 (e.g. for DIMM reseat), everything on that node stops and stays stopped until the node is back up; nothing is automatically restarted on another node.
Current state vs full HA
| Aspect | Current | Full HA |
|---|---|---|
| Cluster | Yes (3 nodes: ml110, r630-01, r630-02) | Same |
| Quorum | Yes (3 nodes) | Same |
| Storage | Local only (each node has its own disks) | Shared (Ceph or NFS) so any node can run any VM/container |
| VM/container placement | Pinned to one node; disk lives on that node | Disk on shared storage; can run on any node |
| Node failure / maintenance | All workloads on that node go down until the node returns | HA manager restarts those workloads on another node |
| Manual migration | Required to move a VM/container to another host | Optional; HA handles failover |
So today: cluster = shared management and quorum, but no automatic failover and no shared storage.
Ref: PROXMOX_CLUSTER_ARCHITECTURE.md — “HA Mode: Active/Standby (manual)”, “No shared storage”, “Manual VM migration required”, “No automatic failover”.
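The current state is easy to confirm from any node with the stock Proxmox CLI tools (read-only commands; output should reflect the three nodes in the table above):

```shell
# Cluster membership and quorum (expect 3 nodes, quorate).
pvecm status
pvecm nodes

# HA stack: with no HA resources configured yet, this reports an idle manager.
ha-manager status

# Storage: today each node lists only its own local storages.
pvesm status
```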
What full Proxmox HA would give you
- When a node is powered down (e.g. for a DIMM reseat) or crashes, the Proxmox HA manager would:
  - Detect that the node is gone (or in maintenance).
  - Start the HA-managed VMs/containers on another node that has access to the same shared storage.
- Planned maintenance (e.g. reseating DIMM B2) would mean: put the node in maintenance → HA migrates/restarts its resources on other nodes → power the server down → no “all VMs on this host are gone until I power it back on”.
So yes — it should be full HA if you want automatic failover and no single-node dependency during maintenance or failures.
What’s required for full HA
- Shared storage, so every node can see the same VM/container disks:
  - Ceph (recommended by Proxmox): replicated and distributed; needs at least three nodes and a fast network.
  - NFS: simpler (e.g. a NAS or a dedicated NFS server); a single point of failure unless the NFS side is also HA.
  - Other: ZFS over iSCSI, etc., depending on your hardware.
- Proxmox HA stack:
  - HA Manager enabled in the cluster (Datacenter → HA).
  - Quorum: you already have 3 nodes, so quorum is satisfied (use a QDevice if you ever go down to 2 nodes).
- HA resources:
  - Add each VM/container you want to fail over as an HA resource (start/stop order, group, etc.).
  - Those guests’ disks must be on shared storage, not local-only.
- Network:
  - The same VLANs/connectivity on all nodes, so that when a VM/container starts on another node it keeps the same IPs and reachability (e.g. the same bridge/VLAN config everywhere, as you already have).
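On the CLI, the HA-resource step looks roughly like this (a sketch: the group name, node priorities, and guest IDs 100 / 10233 are examples, with 10233 assumed to be a container):

```shell
# Optional: an HA group that prefers the two R630s over ml110.
ha-manager groupadd prefer-r630 --nodes "r630-01:2,r630-02:2,ml110:1"

# Register guests as HA resources: vm:<id> for QEMU VMs, ct:<id> for containers.
ha-manager add vm:100 --state started --group prefer-r630
ha-manager add ct:10233 --state started --group prefer-r630

# Verify what HA is managing.
ha-manager status
ha-manager config
```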
Practical path (high level)
1. Design shared storage
   - Decide: Ceph (multi-node) vs NFS (simpler).
   - Size it for existing VM/container disks plus growth.
2. Introduce shared storage to the cluster
   - Add the storage in Proxmox (e.g. a Ceph pool or an NFS mount) so all three nodes see it.
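Adding the storage cluster-wide is a single command per backend; for example an NFS export (storage name, server address, and export path below are placeholders), with a Ceph pool as the alternative:

```shell
# NFS: added at datacenter level, so all nodes see it by default.
pvesm add nfs shared-nfs \
  --server 192.0.2.10 \
  --export /export/proxmox \
  --content images,rootdir

# Ceph alternative (after pveceph install/init and OSDs exist on each node):
# pveceph pool create vmpool --add_storages

# Confirm the storage is active on every node.
pvesm status
```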
3. Migrate critical guests to shared storage
   - Create new VMs/containers on shared storage; optionally migrate existing ones (e.g. NPMplus 10233, RPC, Blockscout, etc.) from local to shared.
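Moving an existing guest’s disks to shared storage can be sketched as follows (assumptions: VM 100 has its disk on scsi0, 10233 is a container, and the shared storage is named shared-nfs):

```shell
# VM disks can move while the VM is running.
qm disk move 100 scsi0 shared-nfs --delete 1

# Container volumes require the container to be stopped first.
pct shutdown 10233
pct move-volume 10233 rootfs shared-nfs --delete 1
pct start 10233
```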
4. Enable HA and add HA resources
   - Enable HA in the cluster.
   - Add the critical VMs/containers as HA resources (with groups/order if needed).
5. Test
   - Put one node in maintenance or power it off; confirm HA restarts the resources on another node and services stay up.
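The test step maps to the built-in maintenance mode (available in Proxmox VE 7.3 and later; the node name is an example):

```shell
# Drain the node: HA moves its managed resources to the other nodes.
ha-manager crm-command node-maintenance enable r630-01

# Watch the resources come up elsewhere, then power the node down and back on.
ha-manager status

# Once the node is back, return it to service.
ha-manager crm-command node-maintenance disable r630-01
```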
How many R630s, and how much RAM per node?
Number of Dell PowerEdge R630s
| Setup | Minimum R630s | Notes |
|---|---|---|
| Proxmox HA + Ceph (hyper-converged) | 3 | Proxmox and Ceph both need at least 3 nodes: quorum (majority) and Ceph replication (3 replicas). With 2 nodes, one failure = no quorum. |
| Recommended for Ceph | 4 | With 4 nodes, Ceph can recover to fully healthy after one node failure; with 3 it stays degraded until the node returns. |
| Proxmox HA with NFS (no Ceph) | 2 + qdevice | Possible with 2 R630s + NFS + qdevice; 3 nodes is simpler and more robust. |
Answer: At least 3 R630s for full HA with Ceph. 4 R630s is better for Ceph recovery. (Your setup: ml110 + 2 R630s; adding a third R630 gives 3 Proxmox nodes for HA + Ceph.)
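The minimum-node figures in the table follow from majority math: quorum is floor(n/2) + 1 votes, and the cluster tolerates n minus quorum node failures. A quick sketch:

```shell
# Quorum and tolerated failures for small cluster sizes.
for n in 2 3 4; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "nodes=$n quorum=$quorum tolerated_failures=$tolerated"
done
# 2 nodes tolerate 0 failures; 3 and 4 nodes each tolerate 1.
```

This is why 2 R630s alone cannot survive a failure without a QDevice, and why a fourth node helps Ceph recovery rather than quorum.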
RAM per R630
| Role | Minimum per node | Recommended |
|---|---|---|
| Proxmox + HA only (NFS, no Ceph) | 32 GB | 64–128 GB |
| Proxmox + Ceph (hyper-converged) | 64 GB | 128–256 GB |
| Ceph OSD | — | ≥ 8 GiB per OSD (Proxmox/Ceph recommendation) |
- Minimum: 64 GB per R630 for Ceph + a few VMs (Ceph recovery uses extra RAM).
- Recommended: 128–256 GB per R630 for production (VMs + Ceph headroom).
- Migration: The 503 GB R630 (r630-01) is the source to migrate workload from; target is 128–256 GB per server. See MIGRATE_503GB_R630_TO_128_256GB_SERVERS.md.
Summary (R630s): 3 or 4 R630s, at least 64 GB RAM per node, 128–256 GB recommended for production HA + Ceph.
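The 64 GB floor can be sanity-checked with a back-of-the-envelope budget (all figures are assumptions for illustration: 6 OSDs per R630 at the recommended 8 GiB each, plus roughly 8 GiB for Proxmox and the remaining Ceph daemons):

```shell
# Rough per-node RAM budget for a hyper-converged R630, in GiB.
osds=6; ram_per_osd=8; base_overhead=8; node_ram=128
ceph_ram=$(( osds * ram_per_osd + base_overhead ))  # RAM consumed by Ceph + base
guest_ram=$(( node_ram - ceph_ram ))                # what remains for VMs/containers
echo "ceph+base=${ceph_ram} GiB guests=${guest_ram} GiB"
```

With node_ram=64 the same budget leaves only 8 GiB for guests, which is why 64 GB is a bare minimum and 128–256 GB is the production recommendation.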
Summary
- Should this Proxmox be a full HA cluster? Yes, for production and to avoid “losing” those VMs (in the sense of them being down) whenever a single node is powered off.
- Current: Cluster only; no shared storage; no Proxmox HA; manual migration and manual restart after maintenance.
- Target: Full HA = shared storage + HA manager + HA resources so that when you power down an R630 (e.g. for DIMM B2 reseat), critical VMs/containers are restarted on another node automatically.
See also: PROXMOX_CLUSTER_ARCHITECTURE.md (current cluster and “Future Enhancements”), NPMPLUS_HA_SETUP_GUIDE.md (NPMplus-level HA with Keepalived). For 13× R630 + DoD/MIL-spec (full HA, Ceph, fencing, RAM/drives, STIG hardening), see R630_13_NODE_DOD_HA_MASTER_PLAN.md.