Storage Growth and Health — Predictable Growth Table & Proactive Monitoring

Last updated: 2026-02-15
Purpose: Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.


1. Real-time data collection

Script: scripts/monitoring/collect-storage-growth-data.sh

Run from project root (LAN, SSH key-based access to Proxmox hosts):

```shell
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```
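The `--csv` mode presumably flattens `pvesm status` output into per-storage rows. The sketch below shows the kind of transform involved; the function name, CSV columns, and the assumption that the 7th `pvesm` field is used% are illustrative, not the script's actual output format:

```shell
# Hypothetical helper: reduce `pvesm status` text to storage,type,used_pct rows.
# Column layout (7th field = used %) should be verified against your Proxmox version.
pvesm_to_csv() {
    awk 'NR > 1 { gsub(/%/, "", $7); print $1 "," $2 "," $7 }'
}

# Example against captured sample output:
printf '%s\n' \
  'Name  Type     Status  Total      Used       Available  %' \
  'local dir      active  98559220   12345678   81122334   12.53%' \
  'data  lvmthin  active  900000000  666000000  234000000  74.00%' \
  | pvesm_to_csv
# prints:
# local,dir,12.53
# data,lvmthin,74.00
```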

Collected data (granularity):

| Layer | What is collected |
| --- | --- |
| Host | `pvesm status` (each storage: type, used%, total, used, avail); `lvs` (thin pool data_percent, metadata_percent); `vgs` (VG free); `df -h /` |
| VM/CT | For every running container: `df -h /`, `df -h /data`, `df -h /var/log`; `du -sh /data/besu`, `du -sh /var/log` |

Output: Snapshot file logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt. Use --append to grow logs/storage-growth/history.csv for trend analysis.

Cron (proactive)

Use the scheduler script from the project root (installs a cron entry that runs every 6 hours; uses $PROJECT_ROOT):

```shell
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
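For reference, the installed entry presumably looks like the following; the schedule minute, log redirection, and the assumption that the installer substitutes $PROJECT_ROOT at install time should be confirmed with `--show`:

```shell
# Assumed shape of the crontab line installed by --install; verify with --show.
0 */6 * * * cd "$PROJECT_ROOT" && ./scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1
```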

Retention: Run scripts/monitoring/prune-storage-snapshots.sh weekly (e.g. keep the last 30 days of snapshot files). Options: --days 14 or --dry-run to preview. See STORAGE_GROWTH_AUTOMATION_TASKS.md for the full automation list.
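A minimal sketch of the retention step, assuming prune-storage-snapshots.sh boils down to an mtime-based `find` (the function name and file pattern here are illustrative, not the script's internals):

```shell
# Hypothetical core of the prune step: delete snapshot files older than N days.
# A --dry-run mode would presumably print the matches without -delete.
prune_snapshots() {
    dir=$1
    days=$2
    find "$dir" -name 'snapshot_*.txt' -type f -mtime +"$days" -print -delete
}

# Usage: prune_snapshots logs/storage-growth 30
```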


2. Predictable growth table (template)

Fill in and refresh this table from real data. Update "Est. monthly growth" and "Growth factor" from history.csv or from observed rates.

| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r630-01 | data (LVM thin) | e.g. 74% | pool size | Thin-provisioned VMs + compaction | | 80% warn, 95% crit | fstrim CTs, migrate VMs, expand pool |
| r630-01 | local-lvm | % | | | | 80 / 95 | Same |
| r630-02 | thin1 / data | % | | | | 80 / 95 | Same |
| ml110 | thin1 | % | | | | 80 / 95 | Same |
| 2101 | / (root) | % | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| 2101 | /data/besu | (du) | same as / | RocksDB + compaction | ~15% block growth | | Resync or expand disk |
| 2500–2505 | /, /data/besu | % | | Besu | Same | 85 / 95 | Same as 2101 |
| 2400 | /, /data/besu | % | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| 10130, 10150, 10151 | / | % | | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| 5000 (Blockscout) | /, DB volume | % | | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| 10233, 10234 (NPMplus) | / | % | | Logs, certs | Low | 85 / 95 | Logrotate |

Growth factor short reference:

  • Besu (/data/besu): Blockchain growth + RocksDB compaction spikes. Largest and least predictable.
  • Logs (/var/log): Depends on log level and rotation. Typically low if rotation is enabled.
  • Postgres/DB: Grows with chain indexer and app data.
  • Thin pool: Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.

3. Factors affecting health (detailed)

Use this list to match real-time data to causes and actions.

| Factor | Where it matters | Typical size / rate | Mitigation |
| --- | --- | --- | --- |
| LVM thin pool data% | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| LVM thin metadata% | Same | High metadata% can cause issues | Expand metadata LV or reduce snapshots |
| RocksDB (Besu) | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| Journal / systemd logs | /var/log in every CT | Can grow if not rotated | logrotate, journalctl --vacuum-time=7d |
| Nginx / app logs | /var/log, /var/www | Depends on traffic | logrotate, log level |
| Postgres / DB | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| Backups (Proxmox) | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| Root filesystem read-only | Any CT | Triggered by I/O errors or ENOSPC | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| Temp/cache | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
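For the Temp/cache factor, pointing the JVM's temp directory at /data keeps compaction temp files where there is headroom. A hedged sketch: `java.io.tmpdir` is the standard JVM property, but the unit name and the environment variable Besu's launcher reads (BESU_OPTS here) are assumptions about your service file:

```shell
# Hypothetical systemd drop-in for a Besu CT/VM. Verify the unit name and
# which env var (BESU_OPTS / JAVA_OPTS) your Besu start script honors.
mkdir -p /data/besu/tmp
mkdir -p /etc/systemd/system/besu.service.d
cat > /etc/systemd/system/besu.service.d/tmpdir.conf <<'EOF'
[Service]
Environment="BESU_OPTS=-Djava.io.tmpdir=/data/besu/tmp"
EOF
systemctl daemon-reload && systemctl restart besu
```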

4. Thresholds and proactive playbook

| Level | Host (thin / pvesm) | VM (/, /data) | Action |
| --- | --- | --- | --- |
| OK | < 80% | < 85% | Continue regular collection and trending |
| Warn | 80–95% | 85–95% | Run collect-storage-growth-data.sh, identify top consumers; plan migration or cleanup |
| Critical | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
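The level mapping can be sketched as a small shell helper (host thresholds shown; handling of exactly 95% is a judgment call, and the function name is illustrative):

```shell
# Sketch: classify a host storage used% per the thresholds table.
# 80/95 are the host thresholds; swap in 85/95 for VM filesystems.
storage_level() {
    pct=$1
    if [ "$pct" -gt 95 ]; then
        echo "Critical"
    elif [ "$pct" -ge 80 ]; then
        echo "Warn"
    else
        echo "OK"
    fi
}

storage_level 74   # prints: OK
```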

Proactive checks (recommended):

  1. Daily or every 6h: Run collect-storage-growth-data.sh --append and inspect latest snapshot under logs/storage-growth/.
  2. Weekly: Review logs/storage-growth/history.csv for rising trends; update the Predictable growth table with current numbers and est. monthly growth.
  3. When adding VMs or chain usage: Re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.

5. Matching real-time data to the table

  • Host storage %: From script output “pvesm status” and “LVM thin pools (data%)”. Map to row “Host / VM” = host name, “Storage / path” = storage or LV name.
  • VM /, /data, /var/log: From “VM/CT on <host>” and “VMID <id>” in the same snapshot. Map to row “Host / VM” = VMID.
  • Growth over time: Use history.csv (with --append runs). Compute delta of used% or used size between two timestamps to get rate; extrapolate to “Est. monthly growth” and “Action when exceeded”.
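The delta-and-extrapolate step can be done with a short awk helper; this sketch assumes two samples expressed as (used GiB, epoch seconds), which may not match the real history.csv column layout:

```shell
# Sketch: extrapolate monthly growth from two samples.
# Args: used1 epoch1 used2 epoch2 (used in GiB, timestamps in epoch seconds).
monthly_growth_gib() {
    awk -v u1="$1" -v t1="$2" -v u2="$3" -v t2="$4" \
        'BEGIN { days = (t2 - t1) / 86400; printf "%.1f\n", (u2 - u1) / days * 30 }'
}

# 120 GiB at t=0, 130 GiB ten days later:
monthly_growth_gib 120 0 130 864000   # prints: 30.0
```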

6. Related automation and runbooks

  • Host-level alerts: scripts/storage-monitor.sh (WARN 80%, CRIT 90%). Schedule: scripts/maintenance/schedule-storage-monitor-cron.sh --install (daily 07:00).
  • In-CT disk check: scripts/maintenance/check-disk-all-vmids.sh (root /). Run daily via daily-weekly-checks.sh (cron 08:00).
  • Retention: scripts/monitoring/prune-storage-snapshots.sh (snapshots), scripts/monitoring/prune-storage-history.sh (history.csv). Both run weekly when using schedule-storage-growth-cron.sh --install.
  • Weekly remediation: daily-weekly-checks.sh weekly runs fstrim in all running CTs and journal vacuum in key CTs; see STORAGE_GROWTH_AUTOMATION_TASKS.md.
  • Logrotate audit: LOGROTATE_AUDIT_RUNBOOK.md (high-log VMIDs).
  • Making RPC VMIDs writable after full/read-only: scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh; see 502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md.
  • Thin pool full / migration: MIGRATE_CT_R630_01_TO_R630_02.md, R630-02_STORAGE_REVIEW.md.