Storage Growth and Health — Predictable Growth Table & Proactive Monitoring

Last updated: 2026-02-15
Purpose: Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.


1. Real-time data collection

Script: scripts/monitoring/collect-storage-growth-data.sh

Run from project root (LAN, SSH key-based access to Proxmox hosts):

```shell
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```
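The `--csv` mode presumably flattens `pvesm status` output into per-storage rows. The sketch below shows the kind of transform involved; the function name, CSV columns, and the assumption that the 7th `pvesm` field is used% are illustrative, not the script's actual output format:

```shell
# Hypothetical helper: reduce `pvesm status` text to storage,type,used_pct rows.
# Column layout (7th field = used %) should be verified against your Proxmox version.
pvesm_to_csv() {
    awk 'NR > 1 { gsub(/%/, "", $7); print $1 "," $2 "," $7 }'
}

# Example against captured sample output:
printf '%s\n' \
  'Name  Type     Status  Total      Used       Available  %' \
  'local dir      active  98559220   12345678   81122334   12.53%' \
  'data  lvmthin  active  900000000  666000000  234000000  74.00%' \
  | pvesm_to_csv
# prints:
# local,dir,12.53
# data,lvmthin,74.00
```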

Collected data (granularity):

| Layer | What is collected |
| --- | --- |
| Host | `pvesm status` (each storage: type, used%, total, used, avail); `lvs` (thin pool data_percent, metadata_percent); `vgs` (VG free); `df -h /` |
| VM/CT | For every running container: `df -h /`, `df -h /data`, `df -h /var/log`; `du -sh /data/besu`, `du -sh /var/log` |

Output: Snapshot file logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt. Use --append to grow logs/storage-growth/history.csv for trend analysis.

Cron (proactive)

Use the scheduler script from the project root (installs a cron entry that runs every 6 hours; uses $PROJECT_ROOT):

```shell
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
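For reference, the installed entry presumably looks like the following; the schedule minute, log redirection, and the assumption that the installer substitutes $PROJECT_ROOT at install time should be confirmed with `--show`:

```shell
# Assumed shape of the crontab line installed by --install; verify with --show.
0 */6 * * * cd "$PROJECT_ROOT" && ./scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1
```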

Retention: Run scripts/monitoring/prune-storage-snapshots.sh weekly (e.g. keep the last 30 days of snapshot files). Options: --days 14 or --dry-run to preview. See STORAGE_GROWTH_AUTOMATION_TASKS.md for the full automation list.
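A minimal sketch of the retention step, assuming prune-storage-snapshots.sh boils down to an mtime-based `find` (the function name and file pattern here are illustrative, not the script's internals):

```shell
# Hypothetical core of the prune step: delete snapshot files older than N days.
# A --dry-run mode would presumably print the matches without -delete.
prune_snapshots() {
    dir=$1
    days=$2
    find "$dir" -name 'snapshot_*.txt' -type f -mtime +"$days" -print -delete
}

# Usage: prune_snapshots logs/storage-growth 30
```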


2. Predictable growth table (template)

Fill in and refresh this table from real data. Update "Est. monthly growth" and "Growth factor" from history.csv or from observed rates.

| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r630-01 | data (LVM thin) | e.g. 74% | pool size | Thin-provisioned VMs + compaction | | 80% warn, 95% crit | fstrim CTs, migrate VMs, expand pool |
| r630-01 | local-lvm | % | | | | 80 / 95 | Same |
| r630-02 | thin1 / data | % | | | | 80 / 95 | Same |
| ml110 | thin1 | % | | | | 80 / 95 | Same |
| 2101 | / (root) | % | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| 2101 | /data/besu | (du) | same as / | RocksDB + compaction | ~15% block growth | | Resync or expand disk |
| 2500–2505 | /, /data/besu | % | | Besu | Same | 85 / 95 | Same as 2101 |
| 2400 | /, /data/besu | % | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| 10130, 10150, 10151 | / | % | | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| 5000 (Blockscout) | /, DB volume | % | | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| 10233, 10234 (NPMplus) | / | % | | Logs, certs | Low | 85 / 95 | Logrotate |

Growth factor short reference:

  • Besu (/data/besu): Blockchain growth + RocksDB compaction spikes. Largest and least predictable.
  • Logs (/var/log): Depends on log level and rotation. Typically low if rotation is enabled.
  • Postgres/DB: Grows with chain indexer and app data.
  • Thin pool: Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.

3. Factors affecting health (detailed)

Use this list to match real-time data to causes and actions.

| Factor | Where it matters | Typical size / rate | Mitigation |
| --- | --- | --- | --- |
| LVM thin pool data% | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| LVM thin metadata% | Same | High metadata% can cause issues | Expand metadata LV or reduce snapshots |
| RocksDB (Besu) | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| Journal / systemd logs | /var/log in every CT | Can grow if not rotated | logrotate, journalctl --vacuum-time=7d |
| Nginx / app logs | /var/log, /var/www | Depends on traffic | logrotate, log level |
| Postgres / DB | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| Backups (Proxmox) | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| Root filesystem read-only | Any CT | Triggered by I/O errors or ENOSPC | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| Temp/cache | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
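For the Temp/cache factor, pointing the JVM's temp directory at /data keeps compaction temp files where there is headroom. A hedged sketch: `java.io.tmpdir` is the standard JVM property, but the unit name and the environment variable Besu's launcher reads (BESU_OPTS here) are assumptions about your service file:

```shell
# Hypothetical systemd drop-in for a Besu CT/VM. Verify the unit name and
# which env var (BESU_OPTS / JAVA_OPTS) your Besu start script honors.
mkdir -p /data/besu/tmp
mkdir -p /etc/systemd/system/besu.service.d
cat > /etc/systemd/system/besu.service.d/tmpdir.conf <<'EOF'
[Service]
Environment="BESU_OPTS=-Djava.io.tmpdir=/data/besu/tmp"
EOF
systemctl daemon-reload && systemctl restart besu
```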

4. Thresholds and proactive playbook

| Level | Host (thin / pvesm) | VM (/, /data) | Action |
| --- | --- | --- | --- |
| OK | < 80% | < 85% | Continue regular collection and trending |
| Warn | 80–95% | 85–95% | Run collect-storage-growth-data.sh, identify top consumers; plan migration or cleanup |
| Critical | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
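The level mapping can be sketched as a small shell helper (host thresholds shown; handling of exactly 95% is a judgment call, and the function name is illustrative):

```shell
# Sketch: classify a host storage used% per the thresholds table.
# 80/95 are the host thresholds; swap in 85/95 for VM filesystems.
storage_level() {
    pct=$1
    if [ "$pct" -gt 95 ]; then
        echo "Critical"
    elif [ "$pct" -ge 80 ]; then
        echo "Warn"
    else
        echo "OK"
    fi
}

storage_level 74   # prints: OK
```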

Proactive checks (recommended):

  1. Daily or every 6h: Run collect-storage-growth-data.sh --append and inspect latest snapshot under logs/storage-growth/.
  2. Weekly: Review logs/storage-growth/history.csv for rising trends; update the Predictable growth table with current numbers and est. monthly growth.
  3. When adding VMs or chain usage: Re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.

5. Matching real-time data to the table

  • Host storage %: From script output “pvesm status” and “LVM thin pools (data%)”. Map to row “Host / VM” = host name, “Storage / path” = storage or LV name.
  • VM /, /data, /var/log: From “VM/CT on <host>” and “VMID <id>” in the same snapshot. Map to row “Host / VM” = VMID.
  • Growth over time: Use history.csv (with --append runs). Compute delta of used% or used size between two timestamps to get rate; extrapolate to “Est. monthly growth” and “Action when exceeded”.
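The delta-and-extrapolate step can be done with a short awk helper; this sketch assumes two samples expressed as (used GiB, epoch seconds), which may not match the real history.csv column layout:

```shell
# Sketch: extrapolate monthly growth from two samples.
# Args: used1 epoch1 used2 epoch2 (used in GiB, timestamps in epoch seconds).
monthly_growth_gib() {
    awk -v u1="$1" -v t1="$2" -v u2="$3" -v t2="$4" \
        'BEGIN { days = (t2 - t1) / 86400; printf "%.1f\n", (u2 - u1) / days * 30 }'
}

# 120 GiB at t=0, 130 GiB ten days later:
monthly_growth_gib 120 0 130 864000   # prints: 30.0
```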

6. Related automation and runbooks

  • Host-level alerts: scripts/storage-monitor.sh (WARN 80%, CRIT 90%). Schedule: scripts/maintenance/schedule-storage-monitor-cron.sh --install (daily 07:00).
  • In-CT disk check: scripts/maintenance/check-disk-all-vmids.sh (root /). Run daily via daily-weekly-checks.sh (cron 08:00).
  • Retention: scripts/monitoring/prune-storage-snapshots.sh (snapshots), scripts/monitoring/prune-storage-history.sh (history.csv). Both run weekly when using schedule-storage-growth-cron.sh --install.
  • Weekly remediation: daily-weekly-checks.sh weekly runs fstrim in all running CTs and journal vacuum in key CTs; see STORAGE_GROWTH_AUTOMATION_TASKS.md.
  • Logrotate audit: LOGROTATE_AUDIT_RUNBOOK.md (high-log VMIDs).
  • Making RPC VMIDs writable after full/read-only: scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh; see 502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md.
  • Thin pool full / migration: MIGRATE_CT_R630_01_TO_R630_02.md, R630-02_STORAGE_REVIEW.md.