diff --git a/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md b/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md index cf3e34d..9a847f0 100644 --- a/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md +++ b/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md @@ -5,8 +5,9 @@ ### Recent operator maintenance (2026-03-28) -- **Fleet checks (same day, follow-up):** Ran `collect-storage-growth-data.sh --append`, `storage-monitor.sh check`, `proxmox-host-io-optimize-pass.sh` (swappiness/sysstat; host `fstrim` N/A on LVM root). **Load:** ml110 load dominated by **Besu (Java)** and **cloudflared**; r630-01 load improved after earlier spike (still many CTs). **ZFS:** r630-01 / r630-02 `rpool` ONLINE; last scrub **2026-03-08**, 0 errors. **`/proc/mdstat` (r630-01):** RAID devices present and active (no resync observed during check). -- **CT 7811 (r630-02, thin4):** Root was **100%** full (**~44 GiB** in `/var/log/syslog` + rotated `syslog.1`). **Remediation:** truncated `syslog` / `syslog.1` and restarted `rsyslog`; root **~6%** after fix. Ensure **logrotate** for `syslog` is effective inside the guest (or lower rsyslog verbosity). +- **Fleet checks (same day, follow-up):** Ran `collect-storage-growth-data.sh --append`, `storage-monitor.sh check`, `proxmox-host-io-optimize-pass.sh` (swappiness/sysstat; host `fstrim` N/A on LVM root). **Load:** ml110 load dominated by **Besu (Java)** and **cloudflared**; r630-01 load improved after earlier spike (still many CTs). **r630-01 `data` thin:** after guest `fstrim` fleet, **pvesm** used% dropped slightly (e.g. **~71.6% → ~70.2%** on 2026-03-28 — reclaim varies by CT). **ZFS:** r630-01 / r630-02 `rpool` ONLINE; last scrub **2026-03-08**, 0 errors. **`/proc/mdstat` (r630-01):** RAID devices present and active (no resync observed during check). +- **ml110 `cloudflared`:** If the process still runs as `tunnel run --token …` on the **command line**, the token is visible in **`ps`** and process listings. Prefer **`credentials-file` + YAML** (`scripts/cloudflare-tunnels/systemd/cloudflared-ml110.service`) or **`EnvironmentFile`** with tight permissions; **rotate the tunnel token** in Cloudflare if it was ever logged. +- **CT 7811 (r630-02, thin4):** Root was **100%** full (**~44 GiB** in `/var/log/syslog` + rotated `syslog.1`). **Remediation:** truncated `syslog` / `syslog.1` and restarted `rsyslog`; root **~6%** after fix. **`/etc/logrotate.d/rsyslog`** updated to **`daily` + `maxsize 200M`** (was weekly-only) and **`su root syslog`** so rotation runs before another multi-GB spike. If logs still grow fast, reduce app/rsyslog **facility** noise at the source. - **CT 10100 (r630-01, thin1):** Root **WARN** (**~88–90%** on 8 GiB); growth mostly **`/var/lib/postgresql` (~5 GiB)**. **Remediation:** `pct resize 10100 rootfs +4G` + `resize2fs`; root **~57%** after. **Note:** Proxmox warned thin **overcommit** vs VG — monitor `pvesm` / `lvs` and avoid excessive concurrent disk expansions without pool growth. - **`storage-monitor.sh`:** Fixed **`set -e` abort** on unreachable optional nodes and **pipe-subshell** so `ALERTS+=` runs in the main shell (alerts and summaries work). - **r630-01 `pve/data` (local-lvm):** Thin pool extended (+80 GiB data, +512 MiB metadata earlier); **LVM thin auto-extend** enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); **dmeventd** must stay active. @@ -65,7 +66,7 @@ Fill and refresh from real data. **Est. monthly growth** and **Growth factor** s | Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded | |-----------|----------------|--------------|----------|---------------|---------------------|-----------|----------------------| -| **r630-01** | data (LVM thin) | **~72%** (pvesm 2026-03-28) | ~360G pool | Thin provisioned | VMs + compaction | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool | +| **r630-01** | data (LVM thin) | **~70%** (pvesm after fstrim pass, 2026-03-28) | ~360G pool | Thin provisioned | VMs + compaction | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool | | **r630-01** | thin1 | **~48%** | ~256G pool | CT root disks on thin1 | Same | 80 / 95 | Same; watch overcommit vs `vgs` | | **r630-02** | thin1–thin6 (`thin1-r630-02` …) | **~1–27%** per pool (2026-03-28) | ~226G each | Mixed CTs | Same | 80 / 95 | **VG free ~0.12 GiB per thin VG** — expand disk/PV before growing LVs | | **ml110** | data / local-lvm | **~15%** | ~1.7T thin | Besu CTs | High | 80 / 95 | Same |