proxmox/scripts/maintenance/README.md
defiQUG bea1903ac9
Sync all local changes: docs, config, scripts, submodule refs, verification evidence
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-21 15:46:06 -08:00

# Maintenance Scripts
**health-check-rpc-2101.sh** — Health check for Besu RPC on VMID 2101: container status, besu-rpc service, port 8545, eth_chainId, eth_blockNumber. Run from project root (LAN). See docs/09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md.
**fix-core-rpc-2101.sh** — One-command fix for Core RPC 2101: start CT if stopped, restart Besu, verify RPC. Options: `--dry-run`, `--restart-only`. If Besu fails with JNA/NoClassDefFoundError, run fix-rpc-2101-jna-reinstall.sh first.
**fix-rpc-2101-jna-reinstall.sh** — Reinstall Besu in CT 2101 to fix JNA/NoClassDefFoundError; then re-run fix-core-rpc-2101.sh. Use `--dry-run` to print steps only.
**check-disk-all-vmids.sh** — Check root disk usage in all running containers on ml110, r630-01, r630-02. Use `--csv` for tab-separated output. For prevention and audits.
**run-all-maintenance-via-proxmox-ssh.sh** — Run all maintenance/fix scripts that use SSH to Proxmox VE (r630-01, ml110, r630-02). **Runs make-rpc-vmids-writable-via-ssh.sh first** (so 2101, 2500-2505 are writable), then resolve-and-fix-all, fix-rpc-2101-jna-reinstall, install-besu-permanent-on-missing-nodes, address-all-remaining-502s; optional E2E with `--e2e`. Use `--no-npm` to skip NPM proxy update, `--dry-run` to print steps only, `--verbose` to show all step output (no stderr hidden). Step 2 (2101 fix) has optional timeout: `STEP2_TIMEOUT=900` (default) or `STEP2_TIMEOUT=0` to disable. Run from project root (LAN).
**make-rpc-vmids-writable-via-ssh.sh** — SSHs to r630-01 and for each VMID 2101, 2500-2505: stops the CT, runs `e2fsck -f -y` on the rootfs LV, starts the CT. Use before fix-rpc-2101 or install-besu-permanent when CTs are read-only. `--dry-run` to print only. Run from project root (LAN).
**make-validator-vmids-writable-via-ssh.sh** — SSHs to r630-01 (1000, 1001, 1002) and ml110 (1003, 1004); stops each validator CT, runs `e2fsck -f -y` on rootfs, starts the CT. Fixes "Read-only file system" / JNA crash loop on validators. Then run `fix-all-validators-and-txpool.sh`. See docs/08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.
**Sentries 1500-1502 (r630-01)** — If deploy-besu-node-lists or set-all-besu-max-peers-32 reports Skip/fail or "Read-only file system" for 1500-1502, they have the same read-only root issue. On the host: `pct stop 1500; e2fsck -f -y /dev/pve/vm-1500-disk-0; pct start 1500` (repeat for 1501, 1502). Then re-run deploy and max-peers/restart.
**address-all-remaining-502s.sh** — One flow to address remaining E2E 502s: runs `fix-all-502s-comprehensive.sh`, then (if `NPM_PASSWORD` set) NPMplus proxy update, then RPC diagnostics (`diagnose-rpc-502s.sh`), optionally `fix-all-besu-nodes.sh` and E2E. Use `--no-npm`, `--run-besu-fix`, `--e2e`, `--dry-run` (print steps only). Run from LAN.
**diagnose-rpc-502s.sh** — Collects for VMIDs 2101 and 2500-2505: `ss -tlnp` and `journalctl -u besu-rpc` / `besu`. Pipe to a file or use from `address-all-remaining-502s.sh`.
**fix-all-502s-comprehensive.sh** — Starts/serves backends for 10130, 10150/10151, 2101, 2500-2505, Cacti (Python stubs if needed). Use `--dry-run` to print actions without SSH. Does not update NPMplus; use `update-npmplus-proxy-hosts-api.sh` from LAN for that.
**daily-weekly-checks.sh** — Daily (explorer, indexer lag, RPC) and weekly (config API, thin pool, log reminder).
**schedule-daily-weekly-cron.sh** — Install cron: daily 08:00, weekly Sun 09:00.
**check-and-fix-explorer-lag.sh** — Checks RPC vs Blockscout block; if lag > threshold (default 500), runs `fix-explorer-indexer-lag.sh` (restart Blockscout).
**schedule-explorer-lag-cron.sh** — Install cron for lag check-and-fix: every 6 hours (0, 6, 12, 18). Log: `logs/explorer-lag-fix.log`. Use `--show` to print the line, `--install` to add to crontab, `--remove` to remove.
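
The manual fix for sentries 1500-1502 described above repeats the same three commands per container. A minimal sketch of that loop, meant to be run on the r630-01 host, could look like the following. `DRY_RUN` and the `run` / `fix_readonly_ct` helpers are hypothetical names introduced for this example (they are not part of the scripts in this directory), and the LV path follows the `/dev/pve/vm-<VMID>-disk-0` pattern from the command quoted above; verify it on the host before running with `DRY_RUN=0`.

```bash
#!/usr/bin/env bash
# Sketch: repeat the read-only-rootfs fix for sentry CTs 1500-1502 on r630-01.
# DRY_RUN is a hypothetical switch for this example; it defaults to 1 (print only).
set -u

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Stop the CT, check its rootfs LV, and start it again.
# Like the one-liner above, each step runs even if the previous one failed.
fix_readonly_ct() {
  local vmid="$1"
  run pct stop "$vmid"
  run e2fsck -f -y "/dev/pve/vm-${vmid}-disk-0"
  run pct start "$vmid"
}

for vmid in 1500 1501 1502; do
  fix_readonly_ct "$vmid"
done
```

By default this only prints the nine commands; set `DRY_RUN=0` to execute them, then re-run deploy and max-peers/restart as noted above.
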
## Optional: Alerting on failures
The daily/weekly script writes a **metric file** on each run. The path comes from `MAINTENANCE_METRIC_FILE`; if that variable is unset, the default is `logs/maintenance-checks.metric`:
```
maintenance_checks_failed 0
maintenance_checks_timestamp 1739123456
```
- **Use in cron:** After the check, send an alert if `maintenance_checks_failed` > 0.
- **Example wrapper (email on failure):**
```bash
# Run the daily checks from the project root; abort if the directory is missing.
cd /path/to/proxmox || exit 1
bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
# Read the failure count from the metric file and mail an alert if it is above zero.
FAILED=$(awk '/^maintenance_checks_failed/ {print $2}' logs/maintenance-checks.metric 2>/dev/null)
[ -n "$FAILED" ] && [ "$FAILED" -gt 0 ] && echo "Maintenance checks failed: $FAILED" | mail -s "Explorer/maintenance alert" ops@example.com
```
- **Slack:** Use a small script that reads the metric file and posts to a webhook when `maintenance_checks_failed` > 0.
- **Prometheus/Grafana:** Scrape the metric file or run a node_exporter textfile collector on `logs/maintenance-checks.metric`.
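
For the Slack option, a small script along these lines might be enough. It is a sketch, not something shipped in this directory: `SLACK_WEBHOOK_URL` is an assumed environment variable holding a Slack incoming-webhook URL, and the function names are hypothetical. When the variable is unset, the script only prints what it would have posted.

```bash
#!/usr/bin/env bash
# Sketch: post to a Slack incoming webhook when the metric file reports failures.
# SLACK_WEBHOOK_URL is an assumed environment variable, not defined by the maintenance scripts.
set -u

METRIC_FILE="${MAINTENANCE_METRIC_FILE:-logs/maintenance-checks.metric}"

# Print the value of maintenance_checks_failed, or nothing if the file or line is missing.
read_failed_count() {
  awk '$1 == "maintenance_checks_failed" {print $2; exit}' "$1" 2>/dev/null
}

# Build the JSON body for the webhook; the count is numeric, so no escaping is needed.
build_payload() {
  printf '{"text":"Maintenance checks failed: %s"}' "$1"
}

# Post only when the count is a positive integer.
alert_if_failed() {
  local failed
  failed="$(read_failed_count "$1")"
  case "$failed" in
    ''|*[!0-9]*) return 0 ;;   # missing or non-numeric: nothing to report
  esac
  [ "$failed" -gt 0 ] || return 0
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL not set; would post: $(build_payload "$failed")"
    return 0
  fi
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(build_payload "$failed")" "$SLACK_WEBHOOK_URL"
}

alert_if_failed "$METRIC_FILE"
```

Run it from the project root after `daily-weekly-checks.sh`, for example from the same cron entry.
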
To disable the metric file, set `MAINTENANCE_METRIC_FILE=` (empty) before running the script.
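
For the Prometheus option, the two-line metric file is already in a form the node_exporter textfile collector can read, so publishing it can be as simple as copying it into the collector directory with a `.prom` extension. The sketch below assumes a `TEXTFILE_DIR` variable (hypothetical, with an example default path) that matches the directory passed to node_exporter via `--collector.textfile.directory`; `publish_metric` and the target file name `maintenance_checks.prom` are likewise illustrative.

```bash
#!/usr/bin/env bash
# Sketch: publish the metric file to a node_exporter textfile-collector directory.
# TEXTFILE_DIR is an assumed path; it must match node_exporter's --collector.textfile.directory.
set -u

METRIC_FILE="${MAINTENANCE_METRIC_FILE:-logs/maintenance-checks.metric}"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

# Copy src into dir as maintenance_checks.prom via a temp file and rename,
# so node_exporter never reads a half-written file.
publish_metric() {
  local src="$1" dir="$2" tmp
  [ -r "$src" ] || { echo "metric file not readable: $src" >&2; return 1; }
  [ -d "$dir" ] || { echo "textfile directory missing: $dir" >&2; return 1; }
  tmp="$(mktemp "$dir/.maintenance_checks.XXXXXX")" || return 1
  cp "$src" "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$dir/maintenance_checks.prom"
}

# Only publish when the collector directory exists on this machine.
if [ -d "$TEXTFILE_DIR" ]; then
  publish_metric "$METRIC_FILE" "$TEXTFILE_DIR"
fi
```

The temp-file-then-rename step is a design choice rather than a requirement: it keeps a scrape from seeing a partially written file.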