proxmox/scripts/maintenance/README.md
defiQUG bea1903ac9
Sync all local changes: docs, config, scripts, submodule refs, verification evidence
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-21 15:46:06 -08:00


Maintenance Scripts

health-check-rpc-2101.sh — Health check for Besu RPC on VMID 2101: container status, besu-rpc service, port 8545, eth_chainId, eth_blockNumber. Run from project root (LAN). See docs/09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md.
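The two JSON-RPC calls named above can be reproduced by hand. This is a minimal sketch, not the script's actual implementation: `RPC_URL` and the helper names are hypothetical, and the address of VMID 2101 is not given in this README, so substitute your own.

```shell
# Hypothetical endpoint; replace with the address of the Besu RPC node (port 8545).
RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"

# Build a JSON-RPC 2.0 request body for a method that takes no parameters.
rpc_payload() {
  printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":1}\n' "$1"
}

# Convert a hex quantity such as 0x1a (the form eth_blockNumber returns) to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Send one request and print the raw JSON response (10 second limit).
rpc_call() {
  curl -s -m 10 -H 'Content-Type: application/json' \
    -d "$(rpc_payload "$1")" "$RPC_URL"
}

# Example (requires network access to the node):
#   rpc_call eth_chainId
#   rpc_call eth_blockNumber
```

A block number that does not increase between two calls a few seconds apart is a hint that block production has stalled.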

fix-core-rpc-2101.sh — One-command fix for Core RPC 2101: start CT if stopped, restart Besu, verify RPC. Options: --dry-run, --restart-only. If Besu fails with JNA/NoClassDefFoundError, run fix-rpc-2101-jna-reinstall.sh first.

fix-rpc-2101-jna-reinstall.sh — Reinstall Besu in CT 2101 to fix JNA/NoClassDefFoundError; then re-run fix-core-rpc-2101.sh. Use --dry-run to print steps only.

check-disk-all-vmids.sh — Check root disk usage in all running containers on ml110, r630-01, r630-02. Use --csv for tab-separated output. For prevention and audits.
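For a single container, the same kind of check can be done by hand. The sketch below assumes `df -P /` is run inside the container via `pct exec`; the helper names and the 85% threshold are illustrative and not taken from check-disk-all-vmids.sh.

```shell
# Read `df -P /` output on stdin and print the root filesystem usage as a bare number.
root_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a warning line when usage is at or above a limit (default 85, an arbitrary choice).
warn_if_full() {
  vmid="$1"; pct_used="$2"; limit="${3:-85}"
  if [ "$pct_used" -ge "$limit" ]; then
    echo "WARN: CT $vmid root disk at ${pct_used}%"
  fi
}

# Example on a Proxmox host (pct exec runs a command inside a container):
#   warn_if_full 2101 "$(pct exec 2101 -- df -P / | root_usage_pct)"
```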

run-all-maintenance-via-proxmox-ssh.sh — Runs all maintenance/fix scripts that use SSH to Proxmox VE (r630-01, ml110, r630-02). Runs make-rpc-vmids-writable-via-ssh.sh first (so that 2101 and 2500-2505 are writable), then resolve-and-fix-all, fix-rpc-2101-jna-reinstall, install-besu-permanent-on-missing-nodes, and address-all-remaining-502s, with an optional E2E run via --e2e. Use --no-npm to skip the NPM proxy update, --dry-run to print steps only, and --verbose to show all step output (stderr is not hidden). Step 2 (the 2101 fix) has an optional timeout: STEP2_TIMEOUT=900 (default), or STEP2_TIMEOUT=0 to disable it. Run from project root (LAN).
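The "timeout, or 0 to disable" behaviour of STEP2_TIMEOUT can be pictured with a small wrapper. This is a sketch only: it assumes the value is in seconds and uses `timeout` from GNU coreutils, which may differ from how the script enforces the limit. `run_with_optional_timeout` is a hypothetical name.

```shell
# Run a command under a time limit; a limit of 0 means "no limit".
# Relies on timeout(1) from GNU coreutils.
run_with_optional_timeout() {
  secs="$1"; shift
  if [ "$secs" -gt 0 ]; then
    timeout "$secs" "$@"
  else
    "$@"
  fi
}

# STEP2_TIMEOUT defaults to 900, matching the documented default.
STEP2_TIMEOUT="${STEP2_TIMEOUT:-900}"

# Example invocations of the real script, from the project root:
#   STEP2_TIMEOUT=0 bash scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --dry-run
#   bash scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --no-npm --verbose
```

When the limit is hit, `timeout` exits with status 124, which makes a timed-out step easy to tell apart from an ordinary failure.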

make-rpc-vmids-writable-via-ssh.sh — SSHs to r630-01 and for each VMID 2101, 2500-2505: stops the CT, runs e2fsck -f -y on the rootfs LV, starts the CT. Use before fix-rpc-2101 or install-besu-permanent when CTs are read-only. --dry-run to print only. Run from project root (LAN).

make-validator-vmids-writable-via-ssh.sh — SSHs to r630-01 (1000, 1001, 1002) and ml110 (1003, 1004); stops each validator CT, runs e2fsck -f -y on rootfs, starts the CT. Fixes "Read-only file system" / JNA crash loop on validators. Then run fix-all-validators-and-txpool.sh. See docs/08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.

Sentries 1500-1502 (r630-01) — If deploy-besu-node-lists or set-all-besu-max-peers-32 reports Skip/fail or "Read-only file system" for 1500-1502, they have the same read-only root issue. On the host: pct stop 1500; e2fsck -f -y /dev/pve/vm-1500-disk-0; pct start 1500 (repeat for 1501, 1502). Then re-run deploy and max-peers/restart.
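The stop / fsck / start sequence for the sentries can be wrapped in a small function. This is a sketch: `fix_readonly_ct` and `DRY_RUN` are hypothetical names, and the LV path pattern `/dev/pve/vm-<VMID>-disk-0` is inferred from the 1500 example, so confirm it for 1501 and 1502 before running anything.

```shell
# Print (DRY_RUN=1, the default here) or run the stop / fsck / start sequence for one CT.
fix_readonly_ct() {
  vmid="$1"
  dev="/dev/pve/vm-${vmid}-disk-0"   # assumed naming pattern, see note above
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "pct stop $vmid; e2fsck -f -y $dev; pct start $vmid"
    return 0
  fi
  pct stop "$vmid"
  # e2fsck exits non-zero when it repairs errors, so its status must not abort the sequence.
  e2fsck -f -y "$dev"
  pct start "$vmid"
}

# On the Proxmox host, as root:
#   for vmid in 1500 1501 1502; do DRY_RUN=0 fix_readonly_ct "$vmid"; done
```

Running with the default `DRY_RUN=1` first prints the exact commands, which is a cheap way to check the device path before the container is stopped.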

address-all-remaining-502s.sh — One flow to address remaining E2E 502s: runs fix-all-502s-comprehensive.sh, then (if NPM_PASSWORD set) NPMplus proxy update, then RPC diagnostics (diagnose-rpc-502s.sh), optionally fix-all-besu-nodes.sh and E2E. Use --no-npm, --run-besu-fix, --e2e, --dry-run (print steps only). Run from LAN.

diagnose-rpc-502s.sh — Collects for VMIDs 2101 and 2500-2505: ss -tlnp and journalctl -u besu-rpc / besu. Pipe to a file or use from address-all-remaining-502s.sh.
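To collect the same information by hand on a Proxmox host, something like the following could be used. It is a sketch under assumptions: reaching into each CT with `pct exec`, the `-n 50` line limit, and the helper names are not taken from diagnose-rpc-502s.sh, and the unit may be named `besu` rather than `besu-rpc` on some nodes.

```shell
# VMIDs covered by the diagnostics: 2101 plus 2500 through 2505.
rpc_vmids() {
  echo 2101
  seq 2500 2505
}

# Print the per-container commands instead of running them,
# so the list can be reviewed or piped to sh on the host.
diag_cmds() {
  for vmid in $(rpc_vmids); do
    echo "pct exec $vmid -- ss -tlnp"
    echo "pct exec $vmid -- journalctl -u besu-rpc -n 50 --no-pager"
  done
}

# Example on the host:
#   diag_cmds | sh > rpc-diagnostics.txt 2>&1
```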

fix-all-502s-comprehensive.sh — Starts/serves backends for 10130, 10150/10151, 2101, 2500-2505, Cacti (Python stubs if needed). Use --dry-run to print actions without SSH. Does not update NPMplus; use update-npmplus-proxy-hosts-api.sh from LAN for that.

daily-weekly-checks.sh — Daily (explorer, indexer lag, RPC) and weekly (config API, thin pool, log reminder).
schedule-daily-weekly-cron.sh — Install cron: daily 08:00, weekly Sun 09:00.
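The schedule above corresponds to crontab entries of roughly the following shape. This sketch is not the output of schedule-daily-weekly-cron.sh: `PROJECT_ROOT` is a hypothetical path, the log file name is borrowed from the alerting example in this README, and the `weekly` argument is assumed by analogy with `daily`.

```shell
# Hypothetical checkout location; adjust to where the repository lives.
PROJECT_ROOT="${PROJECT_ROOT:-/path/to/proxmox}"

# Print crontab lines for the documented schedule: daily 08:00, weekly Sunday 09:00.
maintenance_cron_lines() {
  script="scripts/maintenance/daily-weekly-checks.sh"
  log="logs/daily-weekly-checks.log"
  echo "0 8 * * * cd $PROJECT_ROOT && bash $script daily >> $log 2>&1"
  echo "0 9 * * 0 cd $PROJECT_ROOT && bash $script weekly >> $log 2>&1"
}

# To append them to the current user's crontab by hand:
#   { crontab -l 2>/dev/null; maintenance_cron_lines; } | crontab -
```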

check-and-fix-explorer-lag.sh — Checks RPC vs Blockscout block; if lag > threshold (default 500), runs fix-explorer-indexer-lag.sh (restart Blockscout).
schedule-explorer-lag-cron.sh — Install cron for lag check-and-fix: every 6 hours (0, 6, 12, 18). Log: logs/explorer-lag-fix.log. Use --show to print the line, --install to add to crontab, --remove to remove.
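The lag check reduces to a comparison of two block heights against a threshold. A minimal sketch, assuming both heights are already available as decimal integers; `lag_exceeds` is a hypothetical helper, and how check-and-fix-explorer-lag.sh obtains the heights is not shown here.

```shell
# Succeed (exit 0) when the explorer is more than `threshold` blocks behind the RPC node.
lag_exceeds() {
  rpc_block="$1"; explorer_block="$2"; threshold="${3:-500}"
  lag=$(( rpc_block - explorer_block ))
  [ "$lag" -gt "$threshold" ]
}

# Example with made-up block heights (lag of 600 against the default threshold of 500):
if lag_exceeds 120000 119400; then
  echo "lag above threshold: would run fix-explorer-indexer-lag.sh"
fi
```

The comparison is strict, so a lag exactly equal to the threshold does not trigger the fix.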

Optional: Alerting on failures

Each time it runs, the daily/weekly script writes a metric file to the path in MAINTENANCE_METRIC_FILE (default: logs/maintenance-checks.metric):

maintenance_checks_failed 0
maintenance_checks_timestamp 1739123456
  • Use in cron: After the check, if maintenance_checks_failed > 0, send alert.
  • Example wrapper (email on failure):
    cd /path/to/proxmox && bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
    FAILED=$(grep '^maintenance_checks_failed' logs/maintenance-checks.metric 2>/dev/null | awk '{print $2}')
    [ -n "$FAILED" ] && [ "$FAILED" -gt 0 ] && echo "Maintenance checks failed: $FAILED" | mail -s "Explorer/maintenance alert" ops@example.com
    
  • Slack: Use a small script that reads the metric file and posts to a webhook when maintenance_checks_failed > 0.
  • Prometheus/Grafana: Scrape the metric file or run a node_exporter textfile collector on logs/maintenance-checks.metric.
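The Slack option could look roughly like this. It is a sketch under assumptions: `SLACK_WEBHOOK_URL` is a hypothetical variable holding a Slack incoming-webhook URL that you supply, and the function names are illustrative; only the metric name and default file path come from this README.

```shell
METRIC_FILE="${MAINTENANCE_METRIC_FILE:-logs/maintenance-checks.metric}"

# Print the value of maintenance_checks_failed from a metric file (0 if missing or unreadable).
failed_count() {
  n=$(awk '$1 == "maintenance_checks_failed" { print $2 }' "$1" 2>/dev/null) || n=0
  echo "${n:-0}"
}

# Post a plain-text message to a Slack incoming webhook.
post_slack() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

# Alert only when at least one check failed and a webhook URL is configured.
alert_if_failed() {
  n=$(failed_count "$1")
  if [ "$n" -gt 0 ] && [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    post_slack "Maintenance checks failed: $n"
  fi
}

# Example, after the daily run has finished:
#   alert_if_failed "$METRIC_FILE"
```

For Prometheus, the metric file is already two `name value` lines, so copying it to a `.prom` file in the directory watched by node_exporter's textfile collector should be enough.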

To disable the metric file, set MAINTENANCE_METRIC_FILE= (empty) before running the script.