
Proxmox full HA cluster — current state and roadmap

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active documentation; cluster present, full HA not implemented


Short answer

Yes — for production, this Proxmox setup should ideally be a full HA cluster. Right now it is a cluster (shared config, quorum, live view) but not Proxmox HA. When you power down one R630 (e.g. for DIMM reseat), everything on that node stops and stays stopped until the node is back up; nothing is automatically restarted on another node.


Current state vs full HA

| Aspect | Current | Full HA |
|---|---|---|
| Cluster | Yes (3 nodes: ml110, r630-01, r630-02) | Same |
| Quorum | Yes (3 nodes) | Same |
| Storage | Local only (each node has its own disks) | Shared (Ceph or NFS) so any node can run any VM/container |
| VM/container placement | Pinned to one node; disk lives on that node | Disk on shared storage; can run on any node |
| Node failure / maintenance | All workloads on that node go down until the node returns | HA manager restarts those workloads on another node |
| Manual migration | Required to move a VM/container to another host | Optional; HA handles failover |

So today: cluster = shared management and quorum, but no automatic failover and no shared storage.

Ref: PROXMOX_CLUSTER_ARCHITECTURE.md — “HA Mode: Active/Standby (manual)”, “No shared storage”, “Manual VM migration required”, “No automatic failover”.
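The gap is easy to confirm from any cluster node. A quick check (run as root on a Proxmox node; node names are the ones from the table above):

```shell
# Confirm cluster membership and quorum (should list ml110, r630-01, r630-02):
pvecm status

# Confirm no HA resources are configured yet: with nothing defined, this
# typically prints only the quorum and master lines, no "service" entries.
ha-manager status
```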


What full Proxmox HA would give you

  • When a node is powered down (e.g. DIMM reseat) or crashes, the Proxmox HA manager would:
    • Detect that the node is gone (or in maintenance).
    • Start the HA-managed VMs/containers on another node that has access to the same (shared) storage.
  • Planned maintenance (e.g. reseat DIMM B2) would mean: put node in maintenance → HA migrates/restarts resources on other nodes → you power down the server → no “all VMs on this host are gone until I power it back on”.

So yes — it should be full HA if you want automatic failover and no single-node dependency during maintenance or failures.
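The planned-maintenance flow described above maps to a single pair of HA commands (available since Proxmox VE 7.3; the node name is an example). Note it only takes effect once guests are registered as HA resources on shared storage:

```shell
# Put the node into maintenance: HA migrates/restarts its HA resources elsewhere.
ha-manager crm-command node-maintenance enable r630-01

# ...power down, reseat DIMM B2, power back up...

# Return the node to service; HA can move resources back per group rules.
ha-manager crm-command node-maintenance disable r630-01
```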


What's required for full HA

  1. Shared storage
    So every node can see the same VM/container disks:

    • Ceph (recommended by Proxmox): replicated, distributed; needs multiple nodes and network.
    • NFS: simpler (e.g. NAS or dedicated NFS server); single point of failure unless the NFS side is also HA.
    • Other: ZFS over iSCSI, etc., depending on your hardware.
  2. Proxmox HA stack

    • HA Manager enabled in the cluster (Datacenter → Cluster → HA).
    • Quorum: you already have 3 nodes, so quorum is satisfied (or use qdevice if you ever go to 2 nodes).
  3. HA resources

    • For each VM/container you want to fail over: add it as an HA resource (start/stop order, group, etc.).
    • Those guests' disks must be on shared storage, not local-only.
  4. Network

    • Same VLANs / connectivity so that when a VM/container starts on another node, it keeps the same IPs and reachability (e.g. same bridge/VLAN config on all nodes, as you already have).
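A minimal sketch of requirements 1 and 3 using the NFS option (the server address, storage ID, and resource options are illustrative assumptions; 10233 is the NPMplus guest mentioned later, registered here as a container; use vm:<id> for VMs):

```shell
# 1. Shared storage: add an NFS export so all three nodes see the same disks.
pvesm add nfs shared-nfs --server 192.168.1.50 --export /srv/proxmox \
    --content images,rootdir

# 3. HA resources: register a guest for automatic failover
#    (its disk must already live on shared-nfs, not on local storage).
ha-manager add ct:10233 --state started --max_restart 2 --max_relocate 1

# Inspect the result:
ha-manager config
```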

Practical path (high level)

  1. Design shared storage

    • Decide: Ceph (multi-node) vs NFS (simpler).
    • Size it for existing + growth of VM/container disks.
  2. Introduce shared storage to the cluster

    • Add the storage in Proxmox (e.g. Ceph pool or NFS mount) so all three nodes see it.
  3. Migrate critical guests to shared storage

    • New VMs/containers on shared storage; optionally migrate existing ones (e.g. NPMplus 10233, RPC, Blockscout, etc.) from local to shared.
  4. Enable HA and add HA resources

    • Enable HA in the cluster.
    • Add the critical VMs/containers as HA resources (with groups/order if needed).
  5. Test

    • Put one node in maintenance or power it off; confirm HA restarts the resources on another node and services stay up.
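Step 5 can be run as a simple drill (node name is an example):

```shell
# Drain one node and watch HA move its resources:
ha-manager crm-command node-maintenance enable r630-02
watch -n 5 'ha-manager status'   # resources should report "started" on surviving nodes

# The cluster must stay quorate with 2 of 3 votes while the node is down:
pvecm status

# After powering the node back on, end maintenance:
ha-manager crm-command node-maintenance disable r630-02
```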

How many R630s, and how much RAM per node?

Number of Dell PowerEdge R630s

| Setup | Minimum R630s | Notes |
|---|---|---|
| Proxmox HA + Ceph (hyper-converged) | 3 | Proxmox and Ceph both need at least 3 nodes: quorum (majority) and Ceph replication (3 replicas). With 2 nodes, one failure = no quorum. |
| Recommended for Ceph | 4 | With 4 nodes, Ceph can recover to fully healthy after one node failure; with 3 it stays degraded until the node returns. |
| Proxmox HA with NFS (no Ceph) | 2 + qdevice | Possible with 2 R630s + NFS + qdevice; 3 nodes is simpler and more robust. |

Answer: At least 3 R630s for full HA with Ceph. 4 R630s is better for Ceph recovery. (Your setup: ml110 + 2 R630s; adding a third R630 gives 3 Proxmox nodes for HA + Ceph.)
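If the Ceph route is chosen, the hyper-converged bootstrap on three nodes follows roughly this shape (the Ceph network and disk device are assumptions; repeat the monitor/OSD steps on each node):

```shell
# On each node: install the Ceph packages (Proxmox VE 8.x syntax).
pveceph install --repository no-subscription

# Once, on the first node: initialize Ceph with a dedicated network.
pveceph init --network 10.10.10.0/24

# On each node: a monitor, and at least one OSD on an empty disk.
pveceph mon create
pveceph osd create /dev/sdb

# Once: a replicated pool, registered as Proxmox storage for VM/CT disks.
pveceph pool create vm-pool --add_storages
```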

RAM per R630

| Role | Minimum per node | Recommended |
|---|---|---|
| Proxmox + HA only (NFS, no Ceph) | 32 GB | 64-128 GB |
| Proxmox + Ceph (hyper-converged) | 64 GB | 128-256 GB |
| Ceph OSD | ≥ 8 GiB per OSD (Proxmox/Ceph recommendation) | |
  • Minimum: 64 GB per R630 for Ceph + a few VMs (Ceph recovery uses extra RAM).
  • Recommended: 128-256 GB per R630 for production (VMs + Ceph headroom).
  • Migration: The 503 GB R630 (r630-01) is the source to migrate workload from; target is 128-256 GB per server. See MIGRATE_503GB_R630_TO_128_256GB_SERVERS.md.

Summary (R630s): 3 or 4 R630s, at least 64 GB RAM per node, 128-256 GB recommended for production HA + Ceph.
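A back-of-envelope check of why 64 GB is tight for the Ceph case (the per-component figures below are illustrative assumptions, not measurements from this cluster):

```shell
# Illustrative per-node budget: 4 OSDs at the ~8 GiB/OSD recommendation,
# plus OS/Proxmox overhead and a modest VM allocation.
OSDS_PER_NODE=4
GIB_PER_OSD=8
OS_OVERHEAD_GIB=8
VM_RAM_GIB=40

CEPH_GIB=$((OSDS_PER_NODE * GIB_PER_OSD))
TOTAL_GIB=$((CEPH_GIB + OS_OVERHEAD_GIB + VM_RAM_GIB))
echo "Ceph OSDs: ${CEPH_GIB} GiB, total: ${TOTAL_GIB} GiB"
# Already over the 64 GB minimum before Ceph recovery spikes,
# which is why 128+ GB is the production recommendation.
```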


Summary

  • Should this Proxmox be a full HA cluster? Yes, for production and to avoid “losing” those VMs (in the sense of them being down) whenever a single node is powered off.
  • Current: Cluster only; no shared storage; no Proxmox HA; manual migration and manual restart after maintenance.
  • Target: Full HA = shared storage + HA manager + HA resources so that when you power down an R630 (e.g. for DIMM B2 reseat), critical VMs/containers are restarted on another node automatically.

See also: PROXMOX_CLUSTER_ARCHITECTURE.md (current cluster and “Future Enhancements”), NPMPLUS_HA_SETUP_GUIDE.md (NPMplus-level HA with Keepalived). For 13× R630 + DoD/MIL-spec (full HA, Ceph, fencing, RAM/drives, STIG hardening), see R630_13_NODE_DOD_HA_MASTER_PLAN.md.