proxmox/scripts/maintenance/README.md

# Maintenance Scripts

**health-check-rpc-2101.sh** — Health check for Besu RPC on VMID 2101: container status, besu-rpc service, port 8545, eth_chainId, eth_blockNumber. Run from project root (LAN). See docs/09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md.

**fix-core-rpc-2101.sh** — One-command fix for Core RPC 2101: start CT if stopped, restart Besu, verify RPC. Options: `--dry-run`, `--restart-only`. If Besu fails with JNA/NoClassDefFoundError, run fix-rpc-2101-jna-reinstall.sh first.

**fix-rpc-2101-jna-reinstall.sh** — Reinstall Besu in CT 2101 to fix JNA/NoClassDefFoundError; then re-run fix-core-rpc-2101.sh. Use `--dry-run` to print steps only.

**check-disk-all-vmids.sh** — Check root disk usage in all running containers on ml110, r630-01, r630-02. Use `--csv` for tab-separated output. For prevention and audits.

**run-all-maintenance-via-proxmox-ssh.sh** — Run all maintenance/fix scripts that use SSH to Proxmox VE (r630-01, ml110, r630-02). **Runs make-rpc-vmids-writable-via-ssh.sh first** (so 2101, 2500-2505 are writable), then resolve-and-fix-all, fix-rpc-2101-jna-reinstall, install-besu-permanent-on-missing-nodes, address-all-remaining-502s; optional E2E with `--e2e`. Use `--no-npm` to skip NPM proxy update, `--dry-run` to print steps only, `--verbose` to show all step output (no stderr hidden). Step 2 (2101 fix) has optional timeout: `STEP2_TIMEOUT=900` (default) or `STEP2_TIMEOUT=0` to disable. Run from project root (LAN).

**make-rpc-vmids-writable-via-ssh.sh** — SSHs to r630-01 and for each VMID 2101, 2500-2505: stops the CT, runs `e2fsck -f -y` on the rootfs LV, starts the CT. Use before fix-rpc-2101 or install-besu-permanent when CTs are read-only. `--dry-run` to print only. Run from project root (LAN).

**make-validator-vmids-writable-via-ssh.sh** — SSHs to r630-01 (1000, 1001, 1002) and ml110 (1003, 1004); stops each validator CT, runs `e2fsck -f -y` on rootfs, starts the CT. Fixes "Read-only file system" / JNA crash loop on validators. Then run `fix-all-validators-and-txpool.sh`. See docs/08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.

**Sentries 1500–1502 (r630-01)** — If deploy-besu-node-lists or set-all-besu-max-peers-32 reports Skip/fail or "Read-only file system" for 1500–1502, they have the same read-only root issue. On the host: `pct stop 1500; e2fsck -f -y /dev/pve/vm-1500-disk-0; pct start 1500` (repeat for 1501, 1502). Then re-run deploy and max-peers/restart.

**address-all-remaining-502s.sh** — One flow to address remaining E2E 502s: runs `fix-all-502s-comprehensive.sh`, then (if `NPM_PASSWORD` set) NPMplus proxy update, then RPC diagnostics (`diagnose-rpc-502s.sh`), optionally `fix-all-besu-nodes.sh` and E2E. Use `--no-npm`, `--run-besu-fix`, `--e2e`, `--dry-run` (print steps only). Run from LAN.

**diagnose-rpc-502s.sh** — Collects for VMIDs 2101 and 2500–2505: `ss -tlnp` and `journalctl -u besu-rpc` / `besu`. Pipe to a file or use from `address-all-remaining-502s.sh`.

**fix-all-502s-comprehensive.sh** — Starts/serves backends for 10130, 10150/10151, 2101, 2500–2505, Cacti (Python stubs if needed). Use `--dry-run` to print actions without SSH. Does not update NPMplus; use `update-npmplus-proxy-hosts-api.sh` from LAN for that.

**daily-weekly-checks.sh** — Daily (explorer, indexer lag, RPC) and weekly (config API, thin pool, log reminder).
**schedule-daily-weekly-cron.sh** — Install cron: daily 08:00, weekly Sun 09:00.

**check-and-fix-explorer-lag.sh** — Checks RPC vs Blockscout block; if lag > threshold (default 500), runs `fix-explorer-indexer-lag.sh` (restart Blockscout).
**schedule-explorer-lag-cron.sh** — Install cron for lag check-and-fix: every 6 hours (0, 6, 12, 18). Log: `logs/explorer-lag-fix.log`. Use `--show` to print the line, `--install` to add to crontab, `--remove` to remove.

## Optional: Alerting on failures

The daily/weekly script writes a **metric file** when run (if `MAINTENANCE_METRIC_FILE` is set or default `logs/maintenance-checks.metric`):

```
maintenance_checks_failed 0
maintenance_checks_timestamp 1739123456
```

- **Use in cron:** After the check, if `maintenance_checks_failed` > 0, send alert.
- **Example wrapper (email on failure):**
  ```bash
  cd /path/to/proxmox && bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
  FAILED=$(grep '^maintenance_checks_failed' logs/maintenance-checks.metric 2>/dev/null | awk '{print $2}')
  [ -n "$FAILED" ] && [ "$FAILED" -gt 0 ] && echo "Maintenance checks failed: $FAILED" | mail -s "Explorer/maintenance alert" ops@example.com
  ```
- **Slack:** Use a small script that reads the metric file and posts to a webhook when `maintenance_checks_failed` > 0.
- **Prometheus/Grafana:** Scrape the metric file or run a node_exporter textfile collector on `logs/maintenance-checks.metric`.

To disable the metric file, set `MAINTENANCE_METRIC_FILE=` (empty) before running the script.