# 13× R630 Proxmox Cluster — DoD/MIL-Spec HA Master Plan
**Last Updated:** 2026-03-02
**Document Version:** 1.0
**Status:** Active — Master plan for 13-node HA, RAM/storage, and DoD/MIL compliance
---
## 1. Executive Summary
This document defines the target architecture for a **13-node Dell PowerEdge R630** Proxmox cluster with:
- **Full HA and failover** (shared storage, HA manager, fencing, automatic recovery).
- **DoD/MIL-spec alignment** (STIG-style hardening, audit, encryption, change control, documentation).
- **RAM and drive specifications** for each R630 to support Ceph, VMs/containers, and growth.
**Scope:** All 13 R630s as Proxmox cluster nodes; optional separate management node (e.g. ml110) or integration of management on a subset of R630s. Design assumes **hyper-converged** (Proxmox + Ceph on same nodes) for shared storage and true HA.
**Extended inventory:** The same site includes 3× Dell R750 servers, 2× Dell Precision 7920 workstations, and 2× UniFi Dream Machine Pro (gateways). See [HARDWARE_INVENTORY_MASTER.md](../11-references/HARDWARE_INVENTORY_MASTER.md), [13_NODE_NETWORK_AND_CABLING_CHECKLIST.md](../11-references/13_NODE_NETWORK_AND_CABLING_CHECKLIST.md), and [13_NODE_AND_ASSETS_BRING_ONLINE_CHECKLIST.md](../11-references/13_NODE_AND_ASSETS_BRING_ONLINE_CHECKLIST.md) for network topology, cabling, and bring-online order.
---
## 2. Cluster Design — 13 Nodes
### 2.1 Node roles and quorum
| Item | Requirement |
|------|-------------|
| **Total nodes** | 13 × R630 |
| **Quorum** | Majority = 7. With 13 nodes, up to 6 can be down and the cluster still has quorum. |
| **Fencing** | Required for HA: failed node must be fenced (power off/reboot) so Ceph and HA manager can safely restart resources elsewhere. |
| **Qdevice** | Optional: add a quorum device (e.g. small VM or appliance) so quorum survives more node failures; not required with 13 nodes but improves resilience. |
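A minimal sketch of verifying quorum and, optionally, adding an external QDevice; the qnetd host IP is an example placeholder:
```bash
# Check cluster membership and quorum (expected votes = 13, quorum = 7)
pvecm status

# Optional: external quorum device (corosync-qnetd running on a separate
# host outside the 13-node cluster, e.g. a small VM or appliance)
apt install corosync-qdevice          # on every cluster node
pvecm qdevice setup 192.168.11.50     # IP of the qnetd host (example)
```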
### 2.2 Recommended node layout
| Role | Node count | Purpose |
|------|------------|---------|
| **Proxmox + Ceph MON/MGR/OSD** | 13 | Every R630 runs Proxmox and participates in Ceph (MON, MGR, OSD) for shared storage. |
| **Ceph OSD** | 13 | Each node contributes disk as Ceph OSD; replication (e.g. size=3, min_size=2) across nodes. |
| **Proxmox HA** | 13 | HA manager can restart VMs/containers on any node; VM disks on Ceph. |
| **Optional dedicated** | 0 | No dedicated “monitor-only” nodes required; MON/MGR run on all or a subset (e.g. 3-5 MONs). |
### 2.3 Network and addressing
- **Management:** One subnet (e.g. 192.168.11.0/24) for Proxmox API, SSH, Ceph public/cluster.
- **Ceph:** Separate VLAN or subnet for Ceph cluster network (recommended for DoD: isolate storage traffic).
- **VLANs:** Same VLAN-aware bridge (e.g. vmbr0) on all nodes so VMs/containers keep IPs when failed over.
- **IP plan for 13 R630s:** Reserve 13 consecutive IPs (e.g. 192.168.11.11-192.168.11.23 for r630-01 … r630-13). Document in `config/ip-addresses.conf` and DNS.
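A minimal `/etc/network/interfaces` sketch for one node (r630-01), assuming two 10G ports (`eno1`/`eno2`) in an LACP bond and a dedicated Ceph VLAN; interface names, VLAN ID, and IPs are placeholders to adapt per node:
```
# LACP bond across both 10G switches
auto bond0
iface bond0 inet manual
    bond-slaves eno1 eno2
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4

# VLAN-aware bridge for management and guest traffic
auto vmbr0
iface vmbr0 inet static
    address 192.168.11.11/24
    gateway 192.168.11.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

# Dedicated VLAN for the Ceph cluster network (example VLAN 20)
auto vmbr0.20
iface vmbr0.20 inet static
    address 10.10.20.11/24
```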
### 2.4 Switching (10G backbone)
**Inventory:** 2 × UniFi XG 10G 16-port switches (see [HARDWARE_INVENTORY_MASTER.md](../11-references/HARDWARE_INVENTORY_MASTER.md)).
- Use for **Ceph cluster network** and inter-node traffic; connect all 13 R630s via 10G for storage and replication.
- **Redundancy:** Two switches allow dual-attach per node (e.g. one link per switch or LACP) for HA.
- **Management:** Can stay on existing 1G LAN or use 10G for management if NICs support it.
---
## 3. RAM Specifications — R630
### 3.1 R630 memory capabilities (reference)
| Spec | Value |
|------|--------|
| **DIMM slots** | 24 (12 per socket in 2-socket) |
| **Max RAM** | Up to 1.5 TB (with compatible LRDIMMs) |
| **Typical configs** | 32 GB, 64 GB, 128 GB, 256 GB, 384 GB, 512 GB (depending on DIMM size and count) |
| **ECC** | Required for DoD/MIL; R630 supports ECC RDIMM/LRDIMM |
### 3.2 Recommended RAM per node (DoD HA + Ceph)
| Tier | RAM per node | Use case |
|------|----------------|---------|
| **Minimum** | 128 GB | Ceph OSD + a few VMs; acceptable for lab or light production. |
| **Recommended** | 256 GB | Production: Ceph (OSD + MON/MGR) + many VMs/containers; headroom for failover and recovery. |
| **High** | 384-512 GB | Heavy workloads, large Ceph OSD count per node, or when consolidating from existing 503 GB nodes. |
**Ceph guidance:** Proxmox/Ceph recommend **≥ 8 GiB of RAM per OSD**. With 6-8 OSDs per node (see storage), that is **48-64 GiB** for Ceph plus Proxmox and guest overhead → **128 GB minimum**, **256 GB recommended**.
**DoD/MIL note:** Prefer **256 GB per node** for 13-node production so that (1) multiple node failures still leave enough capacity for HA migrations and (2) Ceph recovery and rebalancing do not cause OOM or instability.
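If the per-OSD memory budget needs tuning to match the node RAM plan, a sketch (the 8 GiB value mirrors the guidance above; adjust per deployment):
```bash
# Show the current per-OSD memory target (Ceph default is ~4 GiB)
ceph config get osd osd_memory_target

# Raise it cluster-wide to 8 GiB per OSD (value in bytes)
ceph config set osd osd_memory_target 8589934592
```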
### 3.3 RAM placement (if mixing sizes)
If not all nodes have the same RAM:
- Put **largest RAM** in nodes that run the most VMs or Ceph MON/MGR.
- Ensure **at least 128 GB** on every node that runs Ceph OSDs.
- Document exact DIMM layout per node (slot, size, speed) for change control and troubleshooting.
---
## 4. Drive Specifications — R630
### 4.1 R630 drive options (reference)
- **Internal bays:** Typically 8 × 2.5" SATA/SAS (or 10-bay with optional kit); some configs support NVMe (e.g. 4 × NVMe via PCIe).
- **Boot:** 2 drives in mirror (ZFS mirror or hardware RAID1) for Proxmox OS — **redundant, DoD-compliant**.
- **Data:** Remaining drives for Ceph OSD and/or local LVM (if hybrid).
### 4.2 Recommended drive layout per R630 (full Ceph)
| Purpose | Drives | Type | Size (example) | Configuration |
|---------|--------|------|----------------|---------------|
| **Boot (OS)** | 2 | SSD | 240-480 GB each | ZFS mirror (preferred) or HW RAID1; Proxmox root only. |
| **Ceph OSD** | 4-6 | SSD (or NVMe) | 480 GB-1 TB each | One OSD per drive; no RAID (Ceph provides replication). |
**Example per node:** 2 × 480 GB boot (ZFS mirror) + 6 × 960 GB SSD = 6 Ceph OSDs per node.
**Cluster total:** 13 × 6 = 78 OSDs; with replication 3×, usable capacity ≈ (78 × 0.9 TB) / 3 ≈ **~23 TB** (before bluestore overhead; adjust for actual sizes).
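A minimal sketch of creating one OSD per data drive on a node, assuming the six data SSDs appear as `/dev/sdc`-`/dev/sdh` (device names vary per node; confirm with `lsblk` first):
```bash
# One BlueStore OSD per data drive; Ceph handles replication, so no RAID
for dev in /dev/sd{c..h}; do
    pveceph osd create "$dev"
done

# Verify the OSDs came up and are "in"
ceph osd tree
```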
### 4.3 DoD/MIL storage requirements
- **Encryption:** At-rest encryption for sensitive data. Options: Ceph encryption (e.g. dm-crypt for OSD), or encrypted VMs (LUKS inside guest). Document which layers are encrypted and key management.
- **Integrity:** ZFS for boot (checksum, scrub). Ceph provides replication and recovery; use **bluestore** with checksums.
- **Sanitization:** Follow DoD 5220.22-M or NIST SP 800-88 for decommissioning/destruction of drives.
- **Spare:** Maintain spare drives and document replacement and wipe procedures.
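If at-rest encryption at the OSD layer is required, `pveceph` can create dm-crypt-backed OSDs; a sketch (the device name is a placeholder):
```bash
# Create an encrypted (dm-crypt/LUKS) BlueStore OSD; document key escrow
# and rotation per your key management policy
pveceph osd create /dev/sdc --encrypted 1
```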
### 4.4 Sizing for your workload
- **Current (from docs):** ~50+ VMIDs, mix of Besu, Blockscout, DBIS, NPMplus, etc.; growth ~20-50 GB/month.
- **Target:** Size the Ceph pool so that **used + 2 years of growth** stays < 75% of usable capacity. Example: 15-20 TB usable → ~5-7 TB used now + growth headroom.
---
## 5. Full HA and Failover Architecture
### 5.1 Components
| Component | Role |
|-----------|------|
| **Proxmox cluster** | 13 nodes; same cluster name; corosync for quorum. |
| **Ceph** | Shared storage: MON (3-5 nodes), MGR (2+), OSD on all 13. Replication size=3, min_size=2. |
| **Proxmox HA** | HA manager enabled; VMs/containers on Ceph added as HA resources; start/stop order and groups as needed. |
| **Fencing (STONITH)** | Mandatory: when a node is declared lost, it must be fenced (powered off or rebooted) so Ceph and the HA manager can safely reassign its resources. Proxmox VE HA uses watchdog-based self-fencing (hardware watchdog or softdog); IPMI/iDRAC power control can supplement it for out-of-band recovery. |
| **Network** | Redundant links where possible; same VLAN/bridge config on all nodes so failover does not change VM IPs. |
### 5.2 Ceph design (summary)
- **Pools:** At least one pool for VM/container disks (e.g. `ceph-vm`); optionally separate pool for backups or bulk data.
- **Replication:** size=3, min_size=2; tolerate 2 node failures without data loss (with 13 nodes).
- **Network:** Separate cluster network (e.g. 10.x or dedicated VLAN) for Ceph backend traffic; public for client (Proxmox) access.
- **MON/MGR:** 3 or 5 MONs (odd); 2 MGRs minimum. Spread across nodes for availability.
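A minimal sketch of creating the replicated VM pool and registering it as Proxmox storage (the pool name `ceph-vm` follows the example above):
```bash
# Replicated pool for VM/CT disks, exposed as a Proxmox RBD storage
pveceph pool create ceph-vm --size 3 --min_size 2 --add_storages

# Confirm pool settings and overall cluster health
ceph osd pool ls detail
ceph -s
```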
### 5.3 HA resource and failover behavior
- **HA resources:** Add each critical VM/CT as HA resource; define groups (e.g. “database first, then app”) and restart order.
- **Failure:** Node down → fencing → Ceph marks OSDs out → HA manager restarts VMs on other nodes using Ceph disks.
- **Maintenance:** Put node in maintenance → migrate VMs off (or let HA relocate) → fence not triggered; perform RAM/drive work.
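A minimal sketch of defining an HA group and adding guests as HA resources (VMIDs, CTIDs, and the group name are placeholders):
```bash
# Restrict a set of database guests to preferred nodes (optional)
ha-manager groupadd db-tier --nodes "r630-01,r630-02,r630-03"

# Add a VM and a container as HA resources
ha-manager add vm:101 --group db-tier --state started --max_restart 2
ha-manager add ct:210 --state started

# Check HA manager and resource state
ha-manager status
```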
### 5.4 What “full HA” gives you (DoD-relevant)
- **No single point of failure:** Storage replicated; compute can run on any node.
- **Automatic failover:** No manual migration for HA-managed guests.
- **Controlled maintenance:** Node can be taken down without losing services; documented procedures for patching and hardware changes.
---
## 6. DoD/MIL-Spec Compliance Framework
### 6.1 Alignment with DISA STIG / DoD requirements
DoD/MIL compliance typically implies the following (summary only; map each area to your exact ATO/contract requirements):
| Area | Requirement | Implementation |
|------|-------------|----------------|
| **Hardening** | DISA STIG or equivalent for OS and applications | Apply STIG/CIS to Debian (Proxmox host) and guests; document exceptions. |
| **Authentication** | Strong auth, no default passwords, MFA where required | SSH key-only on Proxmox; no password SSH; RBAC in Proxmox; MFA for critical UIs if required. |
| **Access control** | Least privilege, RBAC, audit | Proxmox roles and permissions; separate admin vs operator; audit logs. |
| **Encryption** | TLS in transit; encryption at rest for sensitive data | TLS 1.2+ for API and Ceph; at-rest encryption (Ceph or LUKS) as required. |
| **Audit and logging** | Centralized, tamper-resistant, retention | rsyslog/syslog-ng to central log host; retention per policy; integrity (e.g. signed/hash). |
| **Change control** | Documented changes, rollback capability | Change tickets; config in Git; backups before changes; runbooks. |
| **Backup and recovery** | Regular backups, tested restore | Proxmox backups to separate storage; Ceph snapshots; DR runbook and tests. |
| **Physical and environmental** | Physical security, power, cooling | Out of scope for this doc; document in facility plan. |
### 6.2 Hardening checklist (Proxmox + Debian)
Use this as an operational checklist; align with your STIG version.
**Proxmox hosts (Debian base):**
- [ ] **SSH:** Key-only auth; PasswordAuthentication no; PermitRootLogin prohibit-password or key-only; strong ciphers/KexAlgorithms.
- [ ] **Firewall:** Restrict Proxmox API (8006) and SSH to management VLAN/CIDR; default deny.
- [ ] **Services:** Disable unnecessary services; only Proxmox, Ceph, corosync, and required dependencies.
- [ ] **Session timeout:** User session timeout (e.g. 900 s) in shell profile and/or Proxmox UI.
- [ ] **TLS:** TLS 1.2+ only; strong ciphers for pveproxy and Ceph.
- [ ] **Updates:** Security updates applied on a defined schedule; test in non-prod first.
- [ ] **FIPS:** If required by contract, use FIPS-validated crypto (kernel/openssl); document and test.
- [ ] **File permissions:** Sensitive files (keys, tokens) mode 600/400; no world-writable.
- [ ] **Audit:** auditd or equivalent for critical files and commands; logs to central host.
**Ceph:**
- [ ] **Auth:** Cephx enabled; key management per DoD key management policy.
- [ ] **Network:** Cluster network isolated; no Ceph ports exposed to user VLANs.
- [ ] **Encryption:** At-rest encryption for OSD if required; key escrow and rotation documented.
**Guests (VMs/containers):**
- [ ] **Per-guest hardening:** STIG/CIS per OS (e.g. Ubuntu, RHEL); documented baseline.
- [ ] **Secrets:** No secrets in configs in Git; use Vault or Proxmox secrets where applicable.
**Existing automation (this repo):** Use `scripts/security/run-security-on-proxmox-hosts.sh` (SSH key-only + firewall 8006), `scripts/security/setup-ssh-key-auth.sh`, and `scripts/security/firewall-proxmox-8006.sh`. Extend the host list (in the scripts or via env) to cover all 13 R630 IPs, validate with `--dry-run`, then run with `--apply`.
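For the SSH items in the host checklist above, a minimal sketch of a hardening drop-in (cipher/MAC/KEX lists are examples to align with your STIG version; Proxmox VE 8 is Debian 12 based, where `sshd_config.d` includes are enabled by default):
```bash
# Key-only SSH with constrained algorithms (adjust lists to your baseline)
cat >/etc/ssh/sshd_config.d/99-hardening.conf <<'EOF'
PasswordAuthentication no
PermitRootLogin prohibit-password
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
EOF
systemctl reload ssh
```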
### 6.3 Audit and documentation
- **Configuration baseline:** All Proxmox and Ceph configs in version control; changes via PR/ticket.
- **Runbooks:** Install, upgrade, add node, remove node, replace drive, fence test, backup/restore, disaster recovery.
- **Evidence:** Run STIG/CIS scans (e.g. OpenSCAP, Nessus) and retain reports for assessors.
- **Change log:** Document every change (who, when, why, ticket); link to runbook.
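A sketch of generating scan evidence with OpenSCAP; the package names, datastream path, and profile depend on which SCAP content you install (shown values are placeholders):
```bash
# Install the scanner and SCAP Security Guide content (names vary by release)
apt install openscap-scanner ssg-debian

# Evaluate a profile and keep machine-readable results + an HTML report
oscap xccdf eval \
  --profile standard \
  --results "/var/log/scap/pve-$(hostname)-$(date +%F).xml" \
  --report  "/var/log/scap/pve-$(hostname)-$(date +%F).html" \
  /usr/share/xml/scap/ssg/content/ssg-debian12-ds.xml
```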
---
## 7. Phased Implementation
### Phase 1 — Prepare (no downtime)
1. **IP and DNS:** Assign and document 13 IPs for R630s; update `config/ip-addresses.conf` and DNS.
2. **RAM:** Upgrade all 13 R630s to at least 128 GB (256 GB recommended); document DIMM layout.
3. **Drives:** Install boot mirror (2 × SSD) and data drives (4-6 SSDs per node) on each R630; configure ZFS mirror for boot.
4. **Proxmox install:** Install Proxmox VE on all 13; same version; join to one cluster; configure VLAN-aware bridge and management IPs.
5. **Hardening:** Apply SSH key-only, firewall, and STIG/CIS checklist to all nodes; document exceptions.
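A minimal sketch of forming the cluster in step 4 (cluster name and IPs are examples):
```bash
# On the first node (r630-01)
pvecm create r630-cluster

# On each remaining node (r630-02 ... r630-13), join via the first node's IP
pvecm add 192.168.11.11

# Verify all 13 nodes are members and the cluster is quorate
pvecm status
pvecm nodes
```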
### Phase 2 — Ceph
1. **Ceph install:** Install Ceph on all 13 nodes (Proxmox Ceph integration); create MON (3 or 5), MGR (2), OSD (all nodes).
2. **Pools:** Create replication pool (size=3, min_size=2) for VM disks; add as Proxmox storage.
3. **Network:** Configure Ceph public and cluster networks; validate connectivity and latency.
4. **Tests:** Fill and drain; kill OSD/node and verify recovery; document procedures.
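A minimal sketch for steps 1 and 3, initializing Ceph with separate public and cluster networks (subnets are placeholders matching the Ceph VLAN example in §2.3):
```bash
# On every node: install Ceph packages via the Proxmox integration
pveceph install

# On the first node: set Ceph public and cluster networks
pveceph init --network 192.168.11.0/24 --cluster-network 10.10.20.0/24

# On each node chosen for MON/MGR duty
pveceph mon create
pveceph mgr create

# Validate cluster health before creating OSDs and pools
ceph -s
```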
### Phase 3 — HA and fencing
1. **Fencing:** Verify watchdog-based self-fencing on each node (enable a hardware watchdog where available, softdog otherwise); set up IPMI/iDRAC power control for out-of-band recovery; test fencing by isolating or powering off a node (see the sketch after this list).
2. **HA manager:** Enable HA in cluster; add critical VMs/containers as HA resources; set groups and order.
3. **Failover tests:** Power off one node; verify fencing and HA restart on another node; repeat for 2-node failure if desired.
4. **Runbooks:** Document failover test results and operational procedures.
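A sketch of the failover test in step 3, using iDRAC/IPMI to hard-power-off a node out-of-band (the iDRAC address and credentials are placeholders):
```bash
# Baseline: all HA resources started, cluster quorate
ha-manager status
pvecm status

# Hard power-off one node out-of-band to simulate failure
ipmitool -I lanplus -H 192.168.11.111 -U root -P '<idrac-password>' chassis power off

# Watch fencing and recovery: the node is fenced, its OSDs go out,
# and HA guests restart on surviving nodes
watch -n 5 'ha-manager status; ceph -s'
```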
### Phase 4 — Migrate workload
1. **Migrate disks:** Move VM/container disks from local storage to Ceph (live migration or backup/restore).
2. **Decommission local-only:** Once all HA resources are on Ceph, remove or repurpose local LVM for non-HA or cache.
3. **Monitoring and alerting:** Integrate with central monitoring; alerts for quorum loss, Ceph health, fence events, HA failures.
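A sketch of moving guest disks to Ceph in step 1 (VMID/CTID, disk names, and the `ceph-vm` storage follow earlier examples; `qm disk move` is the newer form, older releases use `qm move_disk`):
```bash
# Live-move a VM disk from local storage to the Ceph pool, removing the source
qm disk move 101 scsi0 ceph-vm --delete 1

# Move a container volume (may require the container to be stopped)
pct move-volume 210 rootfs ceph-vm --delete 1
```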
### Phase 5 — DoD/MIL continuous compliance
1. **Scans:** Schedule STIG/CIS scans; remediate and document exceptions.
2. **Backup and DR:** Automate backups; test restore quarterly; update DR runbook.
3. **Change control:** All changes via ticket + runbook; config in Git; periodic review of permissions and audit logs.
---
## 8. References and Related Docs
| Document | Purpose |
|----------|---------|
| [PROXMOX_HA_CLUSTER_ROADMAP.md](./PROXMOX_HA_CLUSTER_ROADMAP.md) | Current HA roadmap (3-node); extend to 13-node. |
| [PROXMOX_CLUSTER_ARCHITECTURE.md](./PROXMOX_CLUSTER_ARCHITECTURE.md) | Cluster and storage overview. |
| [PHYSICAL_DRIVES_AND_CONFIG.md](../04-configuration/PHYSICAL_DRIVES_AND_CONFIG.md) | Current drive layout (existing 2 R630s + ml110). |
| Proxmox Ceph documentation | [Ceph in Proxmox](https://pve.proxmox.com/pve-docs/chapter-pveceph.html). |
| Proxmox HA | [High Availability](https://pve.proxmox.com/pve-docs/chapter-ha-manager.html). |
| DISA STIG | [DISA STIGs](https://public.cyber.mil/stigs/); Debian/Ubuntu and application STIGs. |
| CIS Benchmarks | [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks); Debian, Proxmox if available. |
---
## 9. Summary Table
| Item | Specification |
|------|----------------|
| **Nodes** | 13 × Dell PowerEdge R630 |
| **Quorum** | Majority 7; up to 6 nodes can fail |
| **RAM per node** | Minimum 128 GB; **recommended 256 GB** (DoD production) |
| **Boot** | 2 × SSD (e.g. 240-480 GB) ZFS mirror per node |
| **Data (Ceph)** | 4-6 × SSD (e.g. 480 GB-1 TB) per node, one OSD per drive |
| **Shared storage** | Ceph replicated (size=3, min_size=2) |
| **HA** | Proxmox HA manager; fencing (STONITH) required |
| **Hardening** | STIG/CIS alignment; SSH key-only; firewall; TLS; audit; change control |
| **Encryption** | TLS in transit; at-rest per policy (Ceph or LUKS) |
---
**Owner:** Architecture / Infrastructure
**Review:** Quarterly or when adding nodes / changing compliance scope
**Change control:** Update version and “Last Updated” when changing this plan; link change ticket.