Storage Growth & Health — Automation Tasks, Fixes, and Migrations

Last updated: 2026-02-15
Purpose: List all tasks to automate proactive storage monitoring, plus required fixes and migrations.

1. Tasks to automate

| # | Task | Description | How |
|---|------|-------------|-----|
| A1 | Storage snapshot + history append | Run `collect-storage-growth-data.sh --append` on a schedule so `history.csv` grows for trend analysis. | Cron every 6 hours (or daily). Use `scripts/maintenance/schedule-storage-growth-cron.sh --install`. |
| A2 | Snapshot retention | Prune old snapshot files under `logs/storage-growth/` so the directory does not grow without bound. | Done. Script: `scripts/monitoring/prune-storage-snapshots.sh` (keeps 30 days by default; `--days N`, `--dry-run`). Schedule weekly or run manually. |
| A3 | History CSV retention | Cap the size of `history.csv` (keep the last 10k rows or ~90 days). | Done. Script: `scripts/monitoring/prune-storage-history.sh` (default: a row cap approximating 90 days; `--max-rows N`, `--days N`, `--dry-run`). Runs weekly via the prune line installed by `schedule-storage-growth-cron.sh`. |
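
The row-based cap in A3 can be sketched as follows. This is a minimal illustration, not the contents of `prune-storage-history.sh`: the function name `prune_history_rows` is hypothetical, and it assumes that line 1 of `history.csv` is a header and that rows are appended oldest first.

```shell
# Hypothetical helper: keep the header plus the newest MAX_ROWS data rows.
# Assumes line 1 of the CSV is a header and rows are appended oldest-first.
# The body runs in a subshell, so its variables do not leak to the caller.
prune_history_rows() (
    file=$1
    max_rows=$2
    tmp="$file.tmp.$$"
    # Header first, then only the last max_rows data lines.
    head -n 1 "$file" > "$tmp" || exit 1
    tail -n +2 "$file" | tail -n "$max_rows" >> "$tmp"
    # Swap the file in only after the pruned copy is fully written.
    mv "$tmp" "$file"
)
```

For example, `prune_history_rows logs/storage-growth/history.csv 10000` would keep the header and the newest 10,000 rows, assuming the file lives under `logs/storage-growth/`.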

| # | Task | Description | How |
|---|------|-------------|-----|
| A4 | Thin pool / pvesm check (all hosts) | Fail or warn when any host’s thin pool or pvesm storage is ≥ 95% (critical) or ≥ 80% (warn). | Done. In `daily-weekly-checks.sh` weekly (F3/M2). |
| A5 | In-CT disk check in cron | Run `check-disk-all-vmids.sh` on a schedule and log or alert on WARN/CRIT. | Done. Called from `daily-weekly-checks.sh` daily (cron 08:00). |
| A6 | Integrate with existing storage-monitor.sh | `storage-monitor.sh` already has WARN 80%, CRIT 90% and optional `ALERT_EMAIL` / `ALERT_WEBHOOK`. | Done. `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00). |
| A7 | Metric file for alerting | Write a metric file (e.g. `logs/storage-growth/last_run.metric`) with max thin pool % and timestamp so an external monitor can alert. | Done. Weekly run writes `STORAGE_METRIC_FILE` (storage_max_pct, storage_metric_timestamp). |
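
The threshold checks (A4 to A6) and the metric file (A7) follow a common pattern, sketched below. Both function names are hypothetical, and the `key=value` file layout is an assumption: this document names the keys `storage_max_pct` and `storage_metric_timestamp` but does not show the actual file format. Thresholds are passed as arguments because the scripts listed here use different pairs (for example 80/90 in `storage-monitor.sh`).

```shell
# Hypothetical helper: map a usage percentage to OK / WARN / CRIT.
classify_usage() (
    pct=${1%.*}          # drop any fractional part, e.g. 87.4 -> 87
    warn=$2
    crit=$3
    if [ "$pct" -ge "$crit" ]; then
        echo CRIT
    elif [ "$pct" -ge "$warn" ]; then
        echo WARN
    else
        echo OK
    fi
)

# Hypothetical helper: write the two keys named in A7 as key=value lines.
# The key=value layout and the epoch-seconds timestamp are assumptions.
write_storage_metric() (
    metric_file=$1
    max_pct=$2
    {
        echo "storage_max_pct=$max_pct"
        echo "storage_metric_timestamp=$(date +%s)"
    } > "$metric_file"
)
```

An external monitor could then read the file and alert when the value is high or the timestamp is stale.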

| # | Task | Description | How |
|---|------|-------------|-----|
| A8 | Weekly fstrim in CTs | Run `fstrim` inside running CTs on hosts with thin pools to reclaim space. | Done. `scripts/maintenance/fstrim-all-running-ct.sh`; run from `daily-weekly-checks.sh` weekly. |
| A9 | Logrotate audit | Ensure high-log VMIDs (10130, 10150, 10151, 5000, 10233, 10234, 2400) have logrotate or equivalent. | Done. Runbook: `docs/04-configuration/LOGROTATE_AUDIT_RUNBOOK.md`. |
| A10 | Journal vacuum | Run `journalctl --vacuum-time=7d` in key CTs on a schedule. | Done. `scripts/maintenance/journal-vacuum-key-ct.sh`; run from `daily-weekly-checks.sh` weekly. |
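
A8 and A10 both run one command inside a set of containers and support `--dry-run`. A minimal sketch of that pattern follows, assuming it runs on a Proxmox host where `pct exec` is available. The function name `run_in_cts` and the `DRY_RUN` variable are hypothetical, and the real scripts may select containers differently.

```shell
# Hypothetical helper: run a command inside each listed container via
# Proxmox's `pct exec`, or only print it when DRY_RUN=1.
run_in_cts() (
    cmd=$1
    shift
    for vmid in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "pct exec $vmid -- sh -c '$cmd'"
        else
            # One failing CT should not stop the rest of the sweep.
            pct exec "$vmid" -- sh -c "$cmd" \
                || echo "WARN: '$cmd' failed in CT $vmid" >&2
        fi
    done
)
```

Usage, with VMIDs borrowed from the A9 list purely as examples: `DRY_RUN=1 run_in_cts 'journalctl --vacuum-time=7d' 10130 10150` prints the two `pct exec` commands without running them.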

| # | Fix | Location | Detail |
|---|-----|----------|--------|
| F1 | Implement or remove `--json` | `scripts/monitoring/collect-storage-growth-data.sh` | Done. `--json` outputs a JSON object with `timestamp` and `csv_rows` (array of CSV line strings). |
| F2 | CSV quoting for detail column | `scripts/monitoring/collect-storage-growth-data.sh` | Done. The detail field is quoted via `csv_quote()` when it contains commas or quotes. |
| F3 | Thin pool check on all three hosts | `scripts/maintenance/daily-weekly-checks.sh` | Done. [138a] now runs thin pool/storage check on r630-02, r630-01, and ml110 (WARN ≥85%, FAIL ≥95%/100%). |
| F4 | `PROJECT_ROOT` in cron | `schedule-daily-weekly-cron.sh` / new storage cron | Cron lines use `$PROJECT_ROOT`; the crontab is installed by the user who runs the script, so the path is correct. `schedule-storage-growth-cron.sh` should use the same pattern (`cd $PROJECT_ROOT && ...`). |
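
The quoting rule described in F2 can be sketched as below. This follows the usual CSV convention (wrap the field in double quotes and double any embedded quotes); the body of the real `csv_quote()` is not shown in this document, so treat this as an assumed implementation. Embedded newlines are not handled.

```shell
# Sketch of CSV field quoting; the real csv_quote() may differ.
csv_quote() (
    field=$1
    case $field in
        *,*|*\"*)
            # Double embedded quotes, then wrap the field in quotes.
            escaped=$(printf '%s' "$field" | sed 's/"/""/g')
            printf '"%s"\n' "$escaped"
            ;;
        *)
            # Nothing special in the field: emit it unchanged.
            printf '%s\n' "$field"
            ;;
    esac
)
```

Leaving plain fields unquoted keeps existing rows in `history.csv` byte-for-byte comparable with new ones.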

| # | Migration | Description |
|---|-----------|-------------|
| M1 | Add schedule-storage-growth-cron.sh | Done. Script: `scripts/maintenance/schedule-storage-growth-cron.sh` (same style as `schedule-daily-weekly-cron.sh`): `--show`, `--install`, `--remove`. Cron runs `collect-storage-growth-data.sh --append` every 6 hours. |
| M2 | Extend weekly checks to all-host thin pool | Done. Implemented with F3 in `daily-weekly-checks.sh`: `check_thin_pool_one_host` for r630-02, r630-01, ml110. |
| M3 | Doc and index updates | Done. `STORAGE_GROWTH_AND_HEALTH.md` references `schedule-storage-growth-cron.sh` and the prune script; MASTER_INDEX and OPERATIONAL_RUNBOOKS list the storage growth cron. |
| M4 | Optional: CI job | Add a GitHub Actions (or Gitea) workflow that runs `collect-storage-growth-data.sh --csv` (or a dry run that only checks script syntax / host reachability) so config changes don’t break the script. Optional because the script requires LAN/SSH to hosts. |
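
The cron entry from M1, using the `cd $PROJECT_ROOT && ...` pattern, might look like the sketch below. The exact line that `schedule-storage-growth-cron.sh --install` writes is not shown in this document; both function names and the `cron.log` path are assumptions.

```shell
# Hypothetical helper: print a crontab line that collects every 6 hours.
# The log file path is an assumption.
storage_growth_cron_line() (
    project_root=$1
    printf '0 */6 * * * cd "%s" && bash scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1\n' "$project_root"
)

# Hypothetical helper: read an existing crontab on stdin, drop any earlier
# entry for the same script, and append the new line, so that repeated
# installs do not create duplicates.
merge_cron_line() (
    new_line=$1
    # grep -v exits non-zero when nothing is left; that is not an error here.
    grep -v 'collect-storage-growth-data.sh' || true
    printf '%s\n' "$new_line"
)
```

An install step could then be `crontab -l 2>/dev/null | merge_cron_line "$(storage_growth_cron_line "$PROJECT_ROOT")" | crontab -`, run as the user who owns the checkout.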

Suggested implementation order:

1. F2 (CSV quoting) and F1 (`--json`) in `collect-storage-growth-data.sh`.
2. M1: add `schedule-storage-growth-cron.sh`; M3: update docs.
3. F3 and M2: extend `daily-weekly-checks.sh` to check the thin pool on all three hosts.
4. A1: install the storage growth cron (via M1).
5. A2: add `prune-storage-snapshots.sh` and schedule it weekly (or in the same cron wrapper).
6. A4/A7: optionally have the weekly check write a metric file; wire A5 (`check-disk-all-vmids.sh`) into the daily run if desired.
7. A8–A10: as needed (fstrim, logrotate audit, journal vacuum).
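
For F1, the `--json` shape (an object with `timestamp` and `csv_rows`) can be produced from CSV lines roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the timestamp format is not specified in this document, and only backslashes and double quotes are escaped (control characters are not handled).

```shell
# Hypothetical helper: escape backslashes and double quotes for JSON.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: read CSV lines on stdin and print one JSON object
# with the two keys named in F1.
csv_rows_to_json() (
    timestamp=$1
    printf '{"timestamp":"%s","csv_rows":[' "$(json_escape "$timestamp")"
    sep=''
    while IFS= read -r row; do
        # Each CSV line becomes one JSON string in the array.
        printf '%s"%s"' "$sep" "$(json_escape "$row")"
        sep=','
    done
    printf ']}\n'
)
```

Keeping each row as an opaque string means the JSON consumer still needs a CSV parser, but the collector does not need to know the column layout.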

| Script | Purpose |
|--------|---------|
| `scripts/monitoring/collect-storage-growth-data.sh` | Collect host + VM storage; output snapshot + optional growth table; `--append` for `history.csv`. |
| `scripts/maintenance/schedule-storage-growth-cron.sh` | Install/show/remove cron for storage collection (every 6h). |
| `scripts/monitoring/prune-storage-snapshots.sh` | Prune `snapshot_*.txt` older than N days (default 30); `--days N`, `--dry-run`. |
| `scripts/monitoring/prune-storage-history.sh` | Prune `history.csv` to last N rows (default ~90d); `--days N`, `--max-rows N`, `--dry-run`. |
| `scripts/maintenance/daily-weekly-checks.sh` | Daily: explorer, RPC, indexer lag, in-CT disk (A5). Weekly: config API, thin pool, fstrim (A8), journal vacuum (A10), storage metric (A7). |
| `scripts/maintenance/check-disk-all-vmids.sh` | Runs `df /` inside every running CT; WARN 85%, CRIT 95%. |
| `scripts/maintenance/schedule-storage-monitor-cron.sh` | Install/show/remove cron for `storage-monitor.sh` (daily 07:00). |
| `scripts/maintenance/fstrim-all-running-ct.sh` | Runs `fstrim -v /` in all running CTs; `--dry-run`. |
| `scripts/maintenance/journal-vacuum-key-ct.sh` | Runs `journalctl --vacuum-time=7d` in key CTs; `--dry-run`. |
| `scripts/storage-monitor.sh` | Host pvesm + VG; alerts at 80%/90%; optional email/webhook. |
| `docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md` | Growth table template, factors, thresholds, how to use data. |
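
For the optional CI job (M4), the syntax-only dry run could be sketched as below. `bash -n` parses a script without executing it, so no LAN or SSH access is needed. The function name `check_script_syntax` is hypothetical, and which scripts to pass is left to the caller.

```shell
# Hypothetical helper: parse each script with `bash -n` (no execution),
# report the result per file, and return non-zero if any script fails.
check_script_syntax() (
    status=0
    for script in "$@"; do
        if bash -n "$script" 2>/dev/null; then
            echo "OK   $script"
        else
            echo "FAIL $script"
            status=1
        fi
    done
    exit "$status"
)
```

In a workflow step this might be called as `check_script_syntax scripts/monitoring/*.sh scripts/maintenance/*.sh`, failing the job on the first run that reports `FAIL`.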