Storage Growth & Health — Automation Tasks, Fixes, and Migrations
Last updated: 2026-02-15
Purpose: List all tasks to automate proactive storage monitoring, plus required fixes and migrations.
1. Tasks to automate
1.1 Scheduled data collection
| # | Task | Description | How |
|---|---|---|---|
| A1 | Storage snapshot + history append | Run collect-storage-growth-data.sh --append on a schedule so history.csv grows for trend analysis. | Cron every 6 hours (or daily). Use scripts/maintenance/schedule-storage-growth-cron.sh --install. |
| A2 | Snapshot retention | Prune old snapshot files under logs/storage-growth/ so the directory does not grow unbounded. | Done. Script: scripts/monitoring/prune-storage-snapshots.sh (default keep 30 days; --days N, --dry-run). Schedule weekly or run manually. |
| A3 | History CSV retention | Cap history.csv size (keep last 10k rows or ~90 days). | Done. Script: scripts/monitoring/prune-storage-history.sh (default 90 days proxy; --max-rows N, --days N, --dry-run). Run weekly via schedule-storage-growth-cron (prune line). |
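
As a rough illustration of the row cap described in A3, the sketch below keeps a header line plus the newest N data rows. It is not the code in prune-storage-history.sh: the function name prune_history_rows is hypothetical, and it assumes that the first line of history.csv is a header and that rows are appended in chronological order.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of row-based pruning for history.csv (A3).
# Assumptions: line 1 is a CSV header; newer rows are appended at the end.

prune_history_rows() {
  local file="$1" max_rows="${2:-10000}" tmp
  [ -f "$file" ] || return 0                     # nothing to prune yet
  tmp="$(mktemp)" || return 1
  {
    head -n 1 "$file"                            # keep the header line
    tail -n +2 "$file" | tail -n "$max_rows"     # keep the newest data rows
  } > "$tmp"
  cat "$tmp" > "$file"                           # rewrite in place, keeping file permissions
  rm -f "$tmp"
}

# Example: prune_history_rows logs/storage-growth/history.csv 10000
```

Writing to a temporary file first avoids truncating history.csv while it is still being read.
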
1.2 Threshold checks and alerting
| # | Task | Description | How |
|---|---|---|---|
| A4 | Thin pool / pvesm check (all hosts) | Fail or warn when any host’s thin pool or pvesm storage is ≥ 95% (critical) or ≥ 80% (warn). | Done. In daily-weekly-checks.sh weekly (F3/M2). |
| A5 | In-CT disk check in cron | Run check-disk-all-vmids.sh on a schedule and log or alert on WARN/CRIT. | Done. Called from daily-weekly-checks.sh daily (cron 08:00). |
| A6 | Integrate with existing storage-monitor.sh | storage-monitor.sh already has WARN 80%, CRIT 90% and optional ALERT_EMAIL / ALERT_WEBHOOK. | Done. scripts/maintenance/schedule-storage-monitor-cron.sh --install (daily 07:00). |
| A7 | Metric file for alerting | Write a metric file (e.g. logs/storage-growth/last_run.metric) with max thin pool % and timestamp so an external monitor can alert. | Done. Weekly run writes STORAGE_METRIC_FILE (storage_max_pct, storage_metric_timestamp). |
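
The threshold check (A4) and metric file (A7) could look roughly like the sketch below. The helper names classify_usage and write_storage_metric are hypothetical, and the "name value" line format of the metric file is an assumption; only the field names storage_max_pct and storage_metric_timestamp come from this document. Thresholds are parameters because the warn level is listed as 80% in A4 and 85% in F3.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for A4/A7; not taken from daily-weekly-checks.sh.

# Print OK, WARN, or CRIT for a usage percentage.
classify_usage() {
  local pct="$1" warn="${2:-80}" crit="${3:-95}"
  pct="${pct%%.*}"            # drop any fractional part, e.g. 81.4 -> 81
  pct="${pct:-0}"             # treat ".5" as 0
  if [ "$pct" -ge "$crit" ]; then
    echo CRIT
  elif [ "$pct" -ge "$warn" ]; then
    echo WARN
  else
    echo OK
  fi
}

# Write the metric file. Assumed format: one "name value" pair per line,
# with the timestamp as seconds since the Unix epoch.
write_storage_metric() {
  local file="$1" max_pct="$2"
  mkdir -p "$(dirname "$file")"
  printf 'storage_max_pct %s\nstorage_metric_timestamp %s\n' \
    "$max_pct" "$(date +%s)" > "$file"
}

# Example: write_storage_metric logs/storage-growth/last_run.metric 87
```

An external monitor can then alert both on the percentage and on a stale timestamp, which catches the case where the weekly run itself stops.
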
1.3 Proactive remediation (optional)
| # | Task | Description | How |
|---|---|---|---|
| A8 | Weekly fstrim in CTs | Run fstrim inside running CTs on hosts with thin pools to reclaim space. | Done. scripts/maintenance/fstrim-all-running-ct.sh; run from daily-weekly-checks.sh weekly. |
| A9 | Logrotate audit | Ensure high-log VMIDs (10130, 10150, 10151, 5000, 10233, 10234, 2400) have logrotate or equivalent. | Done. Runbook: docs/04-configuration/LOGROTATE_AUDIT_RUNBOOK.md. |
| A10 | Journal vacuum | Run journalctl --vacuum-time=7d in key CTs on a schedule. | Done. scripts/maintenance/journal-vacuum-key-ct.sh; run from daily-weekly-checks.sh weekly. |
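
A minimal sketch of the fstrim loop from A8, assuming it runs directly on a Proxmox host where pct is available. The function names are hypothetical and this is not the content of fstrim-all-running-ct.sh; it only shows one way to select running CTs from pct list output and honor a --dry-run flag.

```shell
#!/usr/bin/env bash
# Hypothetical sketch for A8; assumes the Proxmox `pct` CLI is on PATH.

# Read `pct list` output on stdin and print the VMIDs of running CTs.
# Assumes the first line is the column header and column 2 is the status.
running_ct_ids() {
  awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Run fstrim in every running CT; pass --dry-run to only print the commands.
fstrim_running_cts() {
  local dry_run="${1:-}" vmid
  for vmid in $(pct list | running_ct_ids); do
    if [ "$dry_run" = "--dry-run" ]; then
      echo "would run: pct exec $vmid -- fstrim -v /"
    else
      pct exec "$vmid" -- fstrim -v / || echo "fstrim failed in CT $vmid" >&2
    fi
  done
}
```

The same loop shape would fit the journal vacuum in A10 by swapping the command for journalctl --vacuum-time=7d and restricting the VMID list to the key CTs.
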
2. Fixes required
| # | Fix | Location | Detail |
|---|---|---|---|
| F1 | Implement or remove --json | scripts/monitoring/collect-storage-growth-data.sh | Done. --json outputs a JSON object with timestamp and csv_rows (array of CSV line strings). |
| F2 | CSV quoting for detail column | scripts/monitoring/collect-storage-growth-data.sh | Done. Detail field is quoted when it contains commas or quotes via csv_quote(). |
| F3 | Thin pool check on all three hosts | scripts/maintenance/daily-weekly-checks.sh | Done. [138a] now runs thin pool/storage check on r630-02, r630-01, and ml110 (WARN ≥85%, FAIL ≥95%/100%). |
| F4 | PROJECT_ROOT in cron | schedule-daily-weekly-cron.sh / new storage cron | Cron lines use $PROJECT_ROOT; crontab is installed by the user who runs the script, so path is correct. For schedule-storage-growth-cron.sh use same pattern (cd $PROJECT_ROOT && ...). |
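
The quoting rule in F2 can be sketched as below. The function name csv_quote() comes from this document, but the body is an assumption rather than the script's actual code: it wraps a field in double quotes when it contains a comma or a double quote, and doubles embedded quotes in the usual CSV style.

```shell
#!/usr/bin/env bash
# Sketch of a csv_quote helper (F2); the body is assumed, not copied from
# collect-storage-growth-data.sh.

csv_quote() {
  local field="$1" q='"'
  case "$field" in
    *,*|*\"*)
      field="${field//$q/$q$q}"      # double any embedded quotes
      printf '"%s"' "$field"
      ;;
    *)
      printf '%s' "$field"           # plain fields pass through unchanged
      ;;
  esac
}

# Example: printf '%s,%s\n' "$(csv_quote "$vmid")" "$(csv_quote "$detail")"
```

Fields containing newlines are not handled here; if the detail column can span lines, the case pattern would need one more alternative.
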
3. Migrations
| # | Migration | Description |
|---|---|---|
| M1 | Add schedule-storage-growth-cron.sh | Done. Script: scripts/maintenance/schedule-storage-growth-cron.sh (same style as schedule-daily-weekly-cron.sh): --show, --install, --remove. Cron runs collect-storage-growth-data.sh --append every 6 hours. |
| M2 | Extend weekly checks to all-host thin pool | Done. Implemented with F3 in daily-weekly-checks.sh: check_thin_pool_one_host for r630-02, r630-01, ml110. |
| M3 | Doc and index updates | Done. STORAGE_GROWTH_AND_HEALTH.md references schedule-storage-growth-cron.sh and prune script; MASTER_INDEX and OPERATIONAL_RUNBOOKS list storage growth cron. |
| M4 | Optional: CI job | Add a GitHub Actions (or Gitea) workflow that runs collect-storage-growth-data.sh --csv (or a dry run that only checks script syntax / host reachability) so config changes don’t break the script. Optional because the script requires LAN/SSH to hosts. |
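
For M1 and F4, an idempotent --install step might be built along these lines. Both function names are hypothetical, invoking the collector through bash is an assumption, and the real schedule-storage-growth-cron.sh may do this differently; the schedule (every 6 hours) and the cd $PROJECT_ROOT && ... pattern are taken from this document.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the cron line and an idempotent install (M1, F4).

# Print the cron line for a given project root.
storage_growth_cron_line() {
  local root="$1"
  printf '0 */6 * * * cd "%s" && bash scripts/monitoring/collect-storage-growth-data.sh --append\n' "$root"
}

# Read an existing crontab on stdin, drop any earlier line for the collector,
# and append a fresh one, so repeated installs never duplicate the entry.
merge_crontab() {
  local root="$1"
  grep -vF 'collect-storage-growth-data.sh' || true   # grep exits 1 when nothing is left
  storage_growth_cron_line "$root"
}

# Usage (writes the current user's crontab):
#   crontab -l 2>/dev/null | merge_crontab "$PROJECT_ROOT" | crontab -
```

Because the line is rebuilt from the project root each time, moving the checkout and re-running the install replaces the stale path rather than leaving two entries.
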
4. Implementation order
- F2 (CSV quoting) and F1 (--json) in collect-storage-growth-data.sh.
- M1 Add schedule-storage-growth-cron.sh and M3 update docs.
- F3 and M2 Extend daily-weekly-checks.sh to check thin pool on all three hosts.
- A1 Install storage growth cron (via M1).
- A2 Add prune-storage-snapshots.sh and schedule weekly (or in same cron wrapper).
- A4/A7 Optionally have weekly check write a metric file; wire A5 (check-disk-all-vmids) into daily if desired.
- A8–A10 As needed (fstrim, logrotate audit, journal vacuum).
5. Quick reference
| Script | Purpose |
|---|---|
| scripts/monitoring/collect-storage-growth-data.sh | Collect host + VM storage; output snapshot + optional growth table; --append for history.csv. |
| scripts/maintenance/schedule-storage-growth-cron.sh | Install/show/remove cron for storage collection (every 6h). |
| scripts/monitoring/prune-storage-snapshots.sh | Prune snapshot_*.txt older than N days (default 30); --days N, --dry-run. |
| scripts/monitoring/prune-storage-history.sh | Prune history.csv to last N rows (default ~90d); --days N, --max-rows N, --dry-run. |
| scripts/maintenance/daily-weekly-checks.sh | Daily: explorer, RPC, indexer lag, in-CT disk (A5). Weekly: config API, thin pool, fstrim (A8), journal vacuum (A10), storage metric (A7). |
| scripts/maintenance/check-disk-all-vmids.sh | In-CT df / for all running CTs; WARN 85%, CRIT 95%. |
| scripts/maintenance/schedule-storage-monitor-cron.sh | Install/show/remove cron for storage-monitor.sh (daily 07:00). |
| scripts/maintenance/fstrim-all-running-ct.sh | fstrim -v / in all running CTs; --dry-run. |
| scripts/maintenance/journal-vacuum-key-ct.sh | journalctl --vacuum-time=7d in key CTs; --dry-run. |
| scripts/storage-monitor.sh | Host pvesm + VG; alerts at 80%/90%; optional email/webhook. |
| docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md | Growth table template, factors, thresholds, how to use data. |