# Storage Growth and Health — Predictable Growth Table & Proactive Monitoring

Last updated: 2026-02-15

Purpose: real-time data collection and a predictable growth table so we can stay ahead of disk-space issues on hosts and VMs.
## 1. Real-time data collection

Script: `scripts/monitoring/collect-storage-growth-data.sh`

Run from the project root (LAN, SSH key-based access to the Proxmox hosts):
```shell
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```
Collected data (granularity):
| Layer | What is collected |
|---|---|
| Host | `pvesm status` (each storage: type, used%, total, used, avail), `lvs` (thin pool data_percent, metadata_percent), `vgs` (VG free), `df -h /` |
| VM/CT | For every running container: `df -h /`, `df -h /data`, `df -h /var/log`; `du -sh /data/besu`, `du -sh /var/log` |
Output: snapshot file `logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt`. Use `--append` to grow `logs/storage-growth/history.csv` for trend analysis.
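For quick inspection, a small helper along these lines finds the newest snapshot file (the directory layout comes from this doc; the helper itself is illustrative and not one of the project scripts):

```shell
#!/bin/sh
# Print the path of the newest snapshot file in a snapshot directory
# (newest first via `ls -1t`; the pattern matches this doc's naming scheme).
latest_snapshot() {
  ls -1t "$1"/snapshot_*.txt 2>/dev/null | head -n 1
}
```

For example, `tail -n 40 "$(latest_snapshot logs/storage-growth)"` shows the end of the most recent collection run.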
### Cron (proactive)

Use the scheduler script from the project root (installs a cron job every 6 hours; uses `$PROJECT_ROOT`):
```shell
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
Retention: run `scripts/monitoring/prune-storage-snapshots.sh` weekly (e.g. keep the last 30 days of snapshot files). Options: `--days 14`, or `--dry-run` to preview. See STORAGE_GROWTH_AUTOMATION_TASKS.md for the full automation list.
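The retention step is roughly a `find -mtime` sweep; a minimal sketch (the real script's flags are `--days` and `--dry-run`, but the semantics below are assumed, not taken from its source):

```shell
#!/bin/sh
# prune_snapshots DIR DAYS [dry]: delete snapshot files older than DAYS.
# With a third argument, only print what would be deleted (dry run).
prune_snapshots() {
  dir=$1; days=$2; dry=${3:-}
  if [ -n "$dry" ]; then
    find "$dir" -name 'snapshot_*.txt' -mtime +"$days" -print
  else
    find "$dir" -name 'snapshot_*.txt' -mtime +"$days" -delete
  fi
}
```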
## 2. Predictable growth table (template)

Fill in and refresh from real data. "Est. monthly growth" and "Growth factor" should be updated from `history.csv` or from observed rates.
| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
|---|---|---|---|---|---|---|---|
| r630-01 | data (LVM thin) | e.g. 74% | pool size | Thin provisioned | VMs + compaction | 80% warn, 95% crit | fstrim CTs, migrate VMs, expand pool |
| r630-01 | local-lvm | % | — | — | — | 80 / 95 | Same |
| r630-02 | thin1 / data | % | — | — | — | 80 / 95 | Same |
| ml110 | thin1 | % | — | — | — | 80 / 95 | Same |
| 2101 | / (root) | % | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| 2101 | /data/besu | du | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
| 2500–2505 | /, /data/besu | % | — | Besu | Same | 85 / 95 | Same as 2101 |
| 2400 | /, /data/besu | % | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| 10130, 10150, 10151 | / | % | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| 5000 (Blockscout) | /, DB volume | % | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| 10233, 10234 (NPMplus) | / | % | — | Logs, certs | Low | 85 / 95 | Logrotate |
Growth factor short reference:
- Besu (/data/besu): blockchain growth plus RocksDB compaction spikes. Largest and least predictable.
- Logs (/var/log): Depends on log level and rotation. Typically low if rotation is enabled.
- Postgres/DB: Grows with chain indexer and app data.
- Thin pool: Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.
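To turn these growth factors into table numbers, a quick "months until threshold" estimate can be computed from the current used%, the threshold%, and the estimated monthly growth in percentage points (illustrative arithmetic, not one of the project scripts):

```shell
#!/bin/sh
# months_until USED THRESHOLD MONTHLY: whole months until USED% reaches
# THRESHOLD%, assuming linear growth of MONTHLY percentage points/month.
months_until() {
  awk -v u="$1" -v t="$2" -v m="$3" 'BEGIN {
    if (m <= 0) { print "n/a"; exit }   # no growth estimate -> no ETA
    d = (t - u) / m
    if (d < 0) d = 0                    # already past the threshold
    print int(d)
  }'
}

months_until 74 80 2   # thin pool at 74%, warn at 80%, ~2 pp/month; prints 3
```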
## 3. Factors affecting health (detailed)

Use this list to match real-time data to causes and actions.
| Factor | Where it matters | Typical size / rate | Mitigation |
|---|---|---|---|
| LVM thin pool data% | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| LVM thin metadata% | Same | High metadata% can cause issues | Expand metadata LV or reduce snapshots |
| RocksDB (Besu) | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| Journal / systemd logs | /var/log in every CT | Can grow if not rotated | logrotate, journalctl --vacuum-time=7d |
| Nginx / app logs | /var/log, /var/www | Depends on traffic | logrotate, log level |
| Postgres / DB | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| Backups (proxmox) | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| Root filesystem read-only | Any CT when I/O or ENOSPC | — | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| Temp/cache | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
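Several mitigations in this table boil down to "find the biggest consumers under a path". A helper like the following works with standard GNU `du`/`sort` flags (the function name is ours):

```shell
#!/bin/sh
# top_consumers DIR [N]: list the N largest entries one level under DIR,
# staying on one filesystem (-x) and sorting human-readable sizes (sort -rh).
top_consumers() {
  du -x -d 1 -h "$1" 2>/dev/null | sort -rh | head -n "${2:-10}"
}
```

Typical uses: `top_consumers /var/log` inside a CT with growing logs, or `top_consumers /data/besu` on an RPC node before deciding between cleanup and expansion.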
## 4. Thresholds and proactive playbook
| Level | Host (thin / pvesm) | VM (/, /data) | Action |
|---|---|---|---|
| OK | < 80% | < 85% | Continue regular collection and trending |
| Warn | 80–95% | 85–95% | Run collect-storage-growth-data.sh, identify top consumers; plan migration or cleanup |
| Critical | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
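The threshold table maps directly onto a small classifier; a sketch using the VM thresholds above (`df -P` is POSIX, so this runs inside any CT):

```shell
#!/bin/sh
# storage_level PCT: classify a used% against the VM thresholds above
# (OK < 85, Warn 85-95, Critical > 95).
storage_level() {
  if   [ "$1" -gt 95 ]; then echo Critical
  elif [ "$1" -ge 85 ]; then echo Warn
  else                       echo OK
  fi
}

# used% of / as a bare integer, parsed from portable `df -P` output
used=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
storage_level "$used"
```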
Proactive checks (recommended):

- Daily or every 6h: run `collect-storage-growth-data.sh --append` and inspect the latest snapshot under `logs/storage-growth/`.
- Weekly: review `logs/storage-growth/history.csv` for rising trends; update the predictable growth table with current numbers and est. monthly growth.
- When adding VMs or chain usage: re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.
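For reference, the installed schedule looks something like the crontab fragment below. This is illustrative only; the exact lines come from `schedule-storage-growth-cron.sh --show`, and the paths assume `$PROJECT_ROOT` is expanded at install time.

```
# Every 6 hours: collect a snapshot and append to history.csv (assumed form)
0 */6 * * * cd $PROJECT_ROOT && ./scripts/monitoring/collect-storage-growth-data.sh --append
# Weekly: prune old snapshot files (assumed form)
0 3 * * 0 cd $PROJECT_ROOT && ./scripts/monitoring/prune-storage-snapshots.sh --days 30
```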
## 5. Matching real-time data to the table

- Host storage %: from the script output sections "pvesm status" and "LVM thin pools (data%)". Map to the row where "Host / VM" is the host name and "Storage / path" is the storage or LV name.
- VM /, /data, /var/log: from the "VM/CT on <host>" and "VMID <id>" sections of the same snapshot. Map to the row where "Host / VM" is the VMID.
- Growth over time: use `history.csv` (built up by `--append` runs). Compute the delta of used% or used size between two timestamps to get a rate, then extrapolate it to fill in "Est. monthly growth" and review "Action when exceeded".
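The delta computation can be sketched with awk. Note that the column layout used here (epoch timestamp, host, storage, used%) is hypothetical; the real layout comes from the script's `--csv` output:

```shell
#!/bin/sh
# monthly_rate CSV HOST STORAGE: percentage-point change per 30 days for one
# host/storage pair, from the first vs. last matching row.
# Assumed columns: epoch_ts,host,storage,used_pct (hypothetical layout).
monthly_rate() {
  awk -F, -v h="$2" -v s="$3" '
    $2 == h && $3 == s {
      if (!n++) { t0 = $1; u0 = $4 }    # first matching row
      t1 = $1; u1 = $4                  # last matching row
    }
    END {
      if (n < 2 || t1 == t0) { print "n/a"; exit }
      days = (t1 - t0) / 86400
      printf "%.2f\n", (u1 - u0) / days * 30
    }' "$1"
}
```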
## 6. Related

- Host-level alerts: `scripts/storage-monitor.sh` (WARN 80%, CRIT 90%). Schedule: `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00).
- In-CT disk check: `scripts/maintenance/check-disk-all-vmids.sh` (root /). Run daily via `daily-weekly-checks.sh` (cron 08:00).
- Retention: `scripts/monitoring/prune-storage-snapshots.sh` (snapshots), `scripts/monitoring/prune-storage-history.sh` (history.csv). Both run weekly when using `schedule-storage-growth-cron.sh --install`.
- Weekly remediation: `daily-weekly-checks.sh weekly` runs fstrim in all running CTs and journal vacuum in key CTs; see STORAGE_GROWTH_AUTOMATION_TASKS.md.
- Logrotate audit: LOGROTATE_AUDIT_RUNBOOK.md (high-log VMIDs).
- Making RPC VMIDs writable after full/read-only: `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`; see 502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md.
- Thin pool full / migration: MIGRATE_CT_R630_01_TO_R630_02.md, R630-02_STORAGE_REVIEW.md.