Maintenance scripts review

Date: 2026-02-15
Scope: RPC/502 fix flow, writability step, runner, and related docs.


1. Flow overview

| Step | Script | Purpose |
|------|--------|---------|
| 0 | make-rpc-vmids-writable-via-ssh.sh | Stop 2101 and 2500-2505 on r630-01; run e2fsck on rootfs; start; verify /tmp is writable |
| 1 | resolve-and-fix-all-via-proxmox-ssh.sh | Dev VM IP .59, start containers, DBIS services (r630-01, ml110) |
| 2 | fix-rpc-2101-jna-reinstall.sh | Reinstall Besu in 2101 (JNA fix), use /tmp in CT, set java.io.tmpdir=/data/besu/tmp |
| 3 | install-besu-permanent-on-missing-nodes.sh | Install Besu on 1505-1508 (ml110) and 2500-2505 (r630-01) where missing |
| 4 | address-all-remaining-502s.sh | fix-all-502s-comprehensive + NPM proxy update + RPC diagnostics |
| 5 | verify-end-to-end-routing.sh | E2E (optional via --e2e) |

Single entry point: ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh [--no-npm] [--e2e] [--dry-run]
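
The ordering and the --dry-run flag can be illustrated with a small step wrapper. This is a minimal sketch, not the runner's actual code: `run_step` and the `DRY_RUN` variable are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run_step and DRY_RUN are illustrative names,
# not taken from run-all-maintenance-via-proxmox-ssh.sh.
DRY_RUN="${DRY_RUN:-0}"

run_step() {
  local num="$1" desc="$2"
  shift 2
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of executing it.
    echo "[dry-run] Step ${num}: ${desc}: $*"
    return 0
  fi
  echo "Step ${num}: ${desc}"
  "$@"
}
```

A runner built this way would call `run_step 0 "Make RPC CTs writable" ./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`, then steps 1 to 5 in the same order as the table, so a dry run prints exactly the sequence a real run would execute.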


2. What works well

  • Writability first: Step 0 fixes read-only root (ext4 errors) so steps 2 and 3 can write to CTs. All seven RPC VMIDs (2101, 2500-2505) are handled on r630-01.
  • Clear ordering: Make writable → resolve/start → fix 2101 → install Besu on missing → address 502s → E2E. Dependencies are respected.
  • Config-driven: Hosts and IPs come from config/ip-addresses.conf (PROXMOX_HOST_R630_01, etc.).
  • Idempotent / skip logic: resolve-and-fix skips if already correct; install-besu-permanent skips VMIDs that already have /opt/besu/bin/besu.
  • Docs linked: 502_DEEP_DIVE (§ Read-only CT), CHECK_ALL_UPDATES (§9 Remaining fixes), maintenance README all reference the runner and make-writable script.
  • JNA tmpdir: Standalone installer and 2101 fix set -Djava.io.tmpdir=/data/besu/tmp so Besu/JNA work when /tmp is restricted.
  • Apt resilience: Standalone installer allows apt-get update to fail (e.g. command-not-found I/O error) and still requires java and wget before continuing.
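
The skip logic and the apt tolerance described above can be sketched as two small checks. This is an illustration under assumptions, not the installers' actual code: `besu_installed`, `require_cmds`, and the `BESU_BIN` variable are hypothetical names; only the path /opt/besu/bin/besu and the java/wget requirement come from the review.

```shell
# Hypothetical helpers; names are illustrative, not from the real scripts.
BESU_BIN="${BESU_BIN:-/opt/besu/bin/besu}"

# Succeeds when the Besu binary is present and executable.
# Any arguments act as a command prefix, e.g.: besu_installed pct exec 2101 --
besu_installed() {
  "$@" test -x "$BESU_BIN"
}

# Succeeds only if every named command is found on PATH.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Inside a CT, an installer following this pattern might run `apt-get update || echo "apt-get update failed, continuing" >&2` and then `require_cmds java wget || exit 1`, so a flaky package index does not abort the install but a missing prerequisite still does.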

3. Gaps and risks

  • Step 2 (2101) can be slow: Apt install inside the CT can take 5-15+ minutes; the runner has no per-step timeout, so the whole run can appear to hang at “Installing packages…”.
  • Errors hidden: The runner uses 2>/dev/null on each step and only prints “Done” or “Step had warnings.” Failures (e.g. 2101 install fail, 2505 install fail) are not surfaced unless you read the full output.
  • Disk space: 2502/2504 have historically hit “No space left on device” in /data/besu (RocksDB). The scripts do not check or resize CT disk; that remains manual (e.g. pct resize <vmid> rootfs +50G or free space inside CT).
  • LV name assumption: make-rpc-vmids-writable assumes LVs are /dev/pve/vm-<vmid>-disk-0. Different storage or naming would need script changes.
  • Single host for RPC: make-rpc-vmids-writable only targets r630-01. If any RPC VMIDs are moved to ml110/r630-02, the script would need to be extended (or a second call with a different host).
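
Since the scripts do not check disk space, a pre-flight check is one way to catch the “No space left on device” case early. The sketch below is hypothetical (`df_usage_pct` and `check_usage` are illustrative names, and the 90% default is an assumption); it parses POSIX-format `df -P` output and assumes the filesystem name contains no spaces.

```shell
# Reads `df -P <path>` output on stdin and prints the Capacity column
# of the first data row as a bare integer (e.g. 95).
df_usage_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn and return non-zero when usage exceeds the
# limit (default 90, an assumed threshold).
check_usage() {
  local label="$1" pct="$2" limit="${3:-90}"
  if [ "$pct" -gt "$limit" ]; then
    echo "WARNING: ${label} at ${pct}% (limit ${limit}%)" >&2
    return 1
  fi
  return 0
}
```

On the Proxmox host this could be combined as `pct=$(pct exec 2502 -- df -P /data/besu | df_usage_pct)` followed by `check_usage "2502 /data/besu" "$pct"`, leaving the resize itself as a manual decision.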

4. Recommendations and completion

  1. Optional verbose mode: Done. Runner supports --verbose; when set, step output is not redirected (no 2>/dev/null), so failures are visible.
  2. Optional timeout for step 2: Done. STEP2_TIMEOUT (default 900) applies to the 2101 fix; exit code 124 is detected and a message tells the user to re-run the fix manually. Use STEP2_TIMEOUT=0 to disable.
  3. §9 checklist: CHECK_ALL_UPDATES §9 includes "RPC CTs read-only → make-rpc-vmids-writable first"; operators have a single place for order of operations.
  4. Disk check (future): Not implemented. Optionally run pct exec <vmid> -- df -h / /data/besu before install/fix and warn if usage > 90%.
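
The timeout behaviour in item 2 can be sketched with GNU coreutils `timeout`, which exits with status 124 when the time limit is reached. The wrapper name `run_with_timeout` is hypothetical, and the runner's real implementation may differ; only `STEP2_TIMEOUT`, its default of 900, the 0-disables convention, and exit code 124 come from the review.

```shell
STEP2_TIMEOUT="${STEP2_TIMEOUT:-900}"

# Hypothetical wrapper: run a command under timeout(1) unless the limit is 0.
# Returns the command's exit status, or 124 if it timed out.
run_with_timeout() {
  local limit="$1" rc=0
  shift
  if [ "$limit" = "0" ]; then
    "$@" || rc=$?
  else
    timeout "$limit" "$@" || rc=$?
  fi
  if [ "$rc" -eq 124 ]; then
    echo "Step timed out after ${limit}s; re-run the fix manually." >&2
  fi
  return "$rc"
}
```

For example, `run_with_timeout "$STEP2_TIMEOUT" ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` would bound step 2. Note that `timeout` can only wrap external commands, not shell functions, which is why the step is invoked as a script.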

5. File reference

| File | Role |
|------|------|
| scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh | Main runner (steps 0-5) |
| scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh | e2fsck 2101, 2500-2505 on r630-01 |
| scripts/maintenance/address-all-remaining-502s.sh | Backends + NPM + diagnostics |
| scripts/maintenance/fix-rpc-2101-jna-reinstall.sh | 2101 Besu reinstall, /tmp + JNA tmpdir |
| scripts/install-besu-in-ct-standalone.sh | In-CT Besu install; apt tolerant; JNA tmpdir |
| scripts/besu/install-besu-permanent-on-missing-nodes.sh | Besu on 1505-1508, 2500-2505; writability check |
| docs/00-meta/502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md | Root causes, Read-only CT, 2101/2500-2505 fixes |
| docs/05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md | Config, tunnels, verification, §9 remaining fixes |


6. Quick commands

# Full run (writable → fix → install → 502s → E2E)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e

# Show all step output (no 2>/dev/null)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e --verbose

# Step 2 (2101 fix) timeout: default 900s; disable with 0
STEP2_TIMEOUT=1200 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
STEP2_TIMEOUT=0 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e

# Only make RPC CTs writable
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh

# Dry-run (print steps only)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --dry-run

Reports and diagnostics: docs/04-configuration/verification-evidence/ (RPC diagnostics, E2E reports).