
Proxmox full HA cluster — current state and roadmap

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active documentation; cluster present, full HA not implemented


Short answer

Yes — for production, this Proxmox setup should ideally be a full HA cluster. Right now it is a cluster (shared config, quorum, live view) but not Proxmox HA. When you power down one R630 (e.g. for DIMM reseat), everything on that node stops and stays stopped until the node is back up; nothing is automatically restarted on another node.


Current state vs full HA

| Aspect | Current | Full HA |
|---|---|---|
| Cluster | Yes (3 nodes: ml110, r630-01, r630-02) | Same |
| Quorum | Yes (3 nodes) | Same |
| Storage | Local only (each node has its own disks) | Shared (Ceph or NFS) so any node can run any VM/container |
| VM/container placement | Pinned to one node; disk lives on that node | Disk on shared storage; can run on any node |
| Node failure / maintenance | All workloads on that node go down until the node returns | HA manager restarts those workloads on another node |
| Manual migration | Required to move a VM/container to another host | Optional; HA handles failover |

So today: cluster = shared management and quorum, but no automatic failover and no shared storage.

Ref: PROXMOX_CLUSTER_ARCHITECTURE.md — “HA Mode: Active/Standby (manual)”, “No shared storage”, “Manual VM migration required”, “No automatic failover”.
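The gap is easy to confirm from any cluster node. A quick check (run as root on a Proxmox node; node names are the ones from the table above):

```shell
# Confirm cluster membership and quorum (should list ml110, r630-01, r630-02):
pvecm status

# Confirm no HA resources are configured yet: with nothing defined, this
# typically prints only the quorum and master lines, no "service" entries.
ha-manager status
```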


What full Proxmox HA would give you

  • When a node is powered down (e.g. DIMM reseat) or crashes, the Proxmox HA manager would:
    • Detect that the node is gone (or in maintenance).
    • Start the HA-managed VMs/containers on another node that has access to the same (shared) storage.
  • Planned maintenance (e.g. reseat DIMM B2) would mean: put node in maintenance → HA migrates/restarts resources on other nodes → you power down the server → no “all VMs on this host are gone until I power it back on”.

So yes — it should be full HA if you want automatic failover and no single-node dependency during maintenance or failures.
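The planned-maintenance flow described above maps to a single pair of HA commands (available since Proxmox VE 7.3; the node name is an example). Note it only takes effect once guests are registered as HA resources on shared storage:

```shell
# Put the node into maintenance: HA migrates/restarts its HA resources elsewhere.
ha-manager crm-command node-maintenance enable r630-01

# ...power down, reseat DIMM B2, power back up...

# Return the node to service; HA can move resources back per group rules.
ha-manager crm-command node-maintenance disable r630-01
```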


What's required for full HA

  1. Shared storage
    So every node can see the same VM/container disks:

    • Ceph (recommended by Proxmox): replicated, distributed; needs multiple nodes and network.
    • NFS: simpler (e.g. NAS or dedicated NFS server); single point of failure unless the NFS side is also HA.
    • Other: ZFS over iSCSI, etc., depending on your hardware.
  2. Proxmox HA stack

    • HA Manager enabled in the cluster (Datacenter → Cluster → HA).
    • Quorum: you already have 3 nodes, so quorum is satisfied (or use qdevice if you ever go to 2 nodes).
  3. HA resources

    • For each VM/container you want to fail over: add it as an HA resource (start/stop order, group, etc.).
    • Those guests' disks must be on shared storage, not local-only.
  4. Network

    • Same VLANs / connectivity so that when a VM/container starts on another node, it keeps the same IPs and reachability (e.g. same bridge/VLAN config on all nodes, as you already have).
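A minimal sketch of requirements 1 and 3 using the NFS option (the server address, storage ID, and resource options are illustrative assumptions; 10233 is the NPMplus guest mentioned later, registered here as a container; use vm:<id> for VMs):

```shell
# 1. Shared storage: add an NFS export so all three nodes see the same disks.
pvesm add nfs shared-nfs --server 192.168.1.50 --export /srv/proxmox \
    --content images,rootdir

# 3. HA resources: register a guest for automatic failover
#    (its disk must already live on shared-nfs, not on local storage).
ha-manager add ct:10233 --state started --max_restart 2 --max_relocate 1

# Inspect the result:
ha-manager config
```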

Practical path (high level)

  1. Design shared storage

    • Decide: Ceph (multi-node) vs NFS (simpler).
    • Size it for existing + growth of VM/container disks.
  2. Introduce shared storage to the cluster

    • Add the storage in Proxmox (e.g. Ceph pool or NFS mount) so all three nodes see it.
  3. Migrate critical guests to shared storage

    • New VMs/containers on shared storage; optionally migrate existing ones (e.g. NPMplus 10233, RPC, Blockscout, etc.) from local to shared.
  4. Enable HA and add HA resources

    • Enable HA in the cluster.
    • Add the critical VMs/containers as HA resources (with groups/order if needed).
  5. Test

    • Put one node in maintenance or power it off; confirm HA restarts the resources on another node and services stay up.
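Step 5 can be run as a simple drill (node name is an example):

```shell
# Drain one node and watch HA move its resources:
ha-manager crm-command node-maintenance enable r630-02
watch -n 5 'ha-manager status'   # resources should report "started" on surviving nodes

# The cluster must stay quorate with 2 of 3 votes while the node is down:
pvecm status

# After powering the node back on, end maintenance:
ha-manager crm-command node-maintenance disable r630-02
```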

How many R630s, and how much RAM per node?

Number of Dell PowerEdge R630s

| Setup | Minimum R630s | Notes |
|---|---|---|
| Proxmox HA + Ceph (hyper-converged) | 3 | Proxmox and Ceph both need at least 3 nodes: quorum (majority) and Ceph replication (3 replicas). With 2 nodes, one failure = no quorum. |
| Recommended for Ceph | 4 | With 4 nodes, Ceph can recover to fully healthy after one node failure; with 3 it stays degraded until the node returns. |
| Proxmox HA with NFS (no Ceph) | 2 + qdevice | Possible with 2 R630s + NFS + qdevice; 3 nodes is simpler and more robust. |

Answer: At least 3 R630s for full HA with Ceph. 4 R630s is better for Ceph recovery. (Your setup: ml110 + 2 R630s; adding a third R630 gives 3 Proxmox nodes for HA + Ceph.)
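If the Ceph route is chosen, the hyper-converged bootstrap on three nodes follows roughly this shape (the Ceph network and disk device are assumptions; repeat the monitor/OSD steps on each node):

```shell
# On each node: install the Ceph packages (Proxmox VE 8.x syntax).
pveceph install --repository no-subscription

# Once, on the first node: initialize Ceph with a dedicated network.
pveceph init --network 10.10.10.0/24

# On each node: a monitor, and at least one OSD on an empty disk.
pveceph mon create
pveceph osd create /dev/sdb

# Once: a replicated pool, registered as Proxmox storage for VM/CT disks.
pveceph pool create vm-pool --add_storages
```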

RAM per R630

| Role | Minimum per node | Recommended |
|---|---|---|
| Proxmox + HA only (NFS, no Ceph) | 32 GB | 64-128 GB |
| Proxmox + Ceph (hyper-converged) | 64 GB | 128-256 GB |
| Ceph OSD | ≥ 8 GiB per OSD (Proxmox/Ceph recommendation) | |
  • Minimum: 64 GB per R630 for Ceph + a few VMs (Ceph recovery uses extra RAM).
  • Recommended: 128-256 GB per R630 for production (VMs + Ceph headroom).
  • Migration: The 503 GB R630 (r630-01) is the source to migrate workload from; target is 128-256 GB per server. See MIGRATE_503GB_R630_TO_128_256GB_SERVERS.md.

Summary (R630s): 3 or 4 R630s, at least 64 GB RAM per node, 128-256 GB recommended for production HA + Ceph.
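A back-of-envelope check of why 64 GB is tight for the Ceph case (the per-component figures below are illustrative assumptions, not measurements from this cluster):

```shell
# Illustrative per-node budget: 4 OSDs at the ~8 GiB/OSD recommendation,
# plus OS/Proxmox overhead and a modest VM allocation.
OSDS_PER_NODE=4
GIB_PER_OSD=8
OS_OVERHEAD_GIB=8
VM_RAM_GIB=40

CEPH_GIB=$((OSDS_PER_NODE * GIB_PER_OSD))
TOTAL_GIB=$((CEPH_GIB + OS_OVERHEAD_GIB + VM_RAM_GIB))
echo "Ceph OSDs: ${CEPH_GIB} GiB, total: ${TOTAL_GIB} GiB"
# Already over the 64 GB minimum before Ceph recovery spikes,
# which is why 128+ GB is the production recommendation.
```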


Summary

  • Should this Proxmox be a full HA cluster? Yes, for production and to avoid “losing” those VMs (in the sense of them being down) whenever a single node is powered off.
  • Current: Cluster only; no shared storage; no Proxmox HA; manual migration and manual restart after maintenance.
  • Target: Full HA = shared storage + HA manager + HA resources so that when you power down an R630 (e.g. for DIMM B2 reseat), critical VMs/containers are restarted on another node automatically.

See also: PROXMOX_CLUSTER_ARCHITECTURE.md (current cluster and “Future Enhancements”), NPMPLUS_HA_SETUP_GUIDE.md (NPMplus-level HA with Keepalived). For 13× R630 + DoD/MIL-spec (full HA, Ceph, fencing, RAM/drives, STIG hardening), see R630_13_NODE_DOD_HA_MASTER_PLAN.md.