RPC and Validator Testing — Runbook

Last Updated: 2026-02-18
Purpose: Single flow to fix and test the Chain 138 Core RPC (VMID 2101) and the 5 validators (VMIDs 1000-1004). Run from the repo root with SSH access to r630-01 (192.168.11.11) and ml110 (192.168.11.10).


Quick verification (no SSH for RPC-only)

From anywhere with access to http://192.168.11.211:8545:

./scripts/verify/verify-rpc-2101-approve-and-sync.sh
./scripts/monitoring/monitor-blockchain-health.sh
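For a manual spot check without the repo scripts, the same endpoint can be queried directly over JSON-RPC. The sketch below is not taken from the verify scripts. It assumes curl is installed and uses the standard eth_chainId, net_peerCount and eth_blockNumber methods. The helper names rpc_call, rpc_result and hex_to_dec are illustrative only.

```shell
# Hypothetical helpers for a manual spot check; not part of the repo scripts.
RPC_URL="${RPC_URL:-http://192.168.11.211:8545}"

# POST one parameterless JSON-RPC call and print the raw JSON response.
rpc_call() {
  curl -s --max-time 5 -X POST -H 'Content-Type: application/json' \
    --data "{\"jsonrpc\":\"2.0\",\"method\":\"$1\",\"params\":[],\"id\":1}" \
    "$RPC_URL"
}

# Extract the string "result" field from a JSON-RPC response on stdin.
rpc_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Usage (needs network access to the RPC):
#   hex_to_dec "$(rpc_call eth_chainId | rpc_result)"      # chain ID; 138 expected
#   hex_to_dec "$(rpc_call net_peerCount | rpc_result)"    # connected peers
#   hex_to_dec "$(rpc_call eth_blockNumber | rpc_result)"  # current block height
```

Nothing runs until one of the usage lines is executed, so the helpers can be pasted into a shell session and called as needed.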

All possible peers vs connected:

./scripts/verify/check-rpc-2101-all-peers.sh

Lists connected peer IPs and allowlist IPs that are not yet connected (source: config/besu-node-lists/permissions-nodes.toml).
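The comparison the script reports (allowlist IPs minus connected IPs) can be sketched as a set difference. This is an illustration, not the script's actual implementation. It assumes both inputs contain enode URLs of the form enode://<key>@<ip>:<port>, and that the connected peers have been saved to a file by some other means (for example from an admin_peers response, if the ADMIN API is enabled and returns enode URLs). The function names are hypothetical.

```shell
# Print the unique IPv4 addresses found in enode URLs on stdin, sorted.
enode_ips() {
  grep -oE 'enode://[0-9a-fA-F]+@[0-9]+(\.[0-9]+){3}:[0-9]+' \
    | sed -E 's|^enode://[0-9a-fA-F]+@||; s|:[0-9]+$||' \
    | sort -u
}

# Print allowlist IPs (file $1) that are absent from the connected set (file $2).
# Both files hold text containing enode URLs.
unconnected_ips() {
  allow_tmp=$(mktemp) && conn_tmp=$(mktemp) || return 1
  enode_ips < "$1" > "$allow_tmp"
  enode_ips < "$2" > "$conn_tmp"
  # comm -23 keeps lines that appear only in the first (allowlist) file.
  comm -23 "$allow_tmp" "$conn_tmp"
  rm -f "$allow_tmp" "$conn_tmp"
}

# Usage (connected-enodes.txt is a hypothetical file you produce yourself):
#   unconnected_ips config/besu-node-lists/permissions-nodes.toml connected-enodes.txt
```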

Peer topology and plan: See PEER_CONNECTIONS_PLAN.md for peer counts by node, the 9 IPs not connected to 2101, and a plan (allowlist cleanup, 2101↔2102, 2201 P2P, optional more peers).


Full fix and test sequence (from LAN with SSH)

Run in order. Allow 2-5 minutes after restarts for validators to become active and block production to resume.

1. Make validator VMIDs writable (if validators crash with "Read-only file system" / JNA)

Validators 1000-1004 can remount read-only after ext4 errors; Besu then fails with UnsatisfiedLinkError: Read-only file system when writing JNA temp files.

./scripts/maintenance/make-validator-vmids-writable-via-ssh.sh

Then restart validators (step 3).
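Before remounting, it can help to confirm that a container really is read-only. The sketch below parses /proc/mounts-style output and reports whether a given mount point carries the ro option. It assumes the Besu data lives on the container's root filesystem (/); adjust the mount point if it does not. The function name is_mount_ro is illustrative.

```shell
# Succeed if mount point $1 is mounted read-only according to
# /proc/mounts-format lines on stdin (device mountpoint fstype options ...).
# If the mount point appears more than once, the last entry wins.
is_mount_ro() {
  awk -v mp="$1" '
    $2 == mp {
      n = split($4, opts, ",")
      ro = 0
      for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
    }
    END { exit(ro ? 0 : 1) }
  '
}

# Usage (same ssh + pct exec pattern as the log check in Troubleshooting):
#   ssh root@192.168.11.11 "pct exec 1002 -- cat /proc/mounts" \
#     | is_mount_ro / && echo "VMID 1002 root filesystem is read-only"
```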

2. Deploy node lists (permissions + static nodes) to all nodes including RPC 2101

Ensures RPC 2101 and all validators have the same allowlist (all 5 validators, sentries, RPCs).

./scripts/deploy-besu-node-lists-to-all.sh

3. Validator permissioning and restart (validators only; uses /var/lib/besu/)

./scripts/fix-validator-permissioning-toml.sh

4. Validator tx-pool config and restart

./scripts/fix-all-validators-and-txpool.sh

5. Restart RPC 2101 (reload node lists)

./scripts/maintenance/fix-core-rpc-2101.sh --restart-only

6. Make RPC VMIDs writable (if RPC 2101 or 2500-2505 have read-only issues)

./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh

Then re-run step 5 if needed.

7. Verify and monitor

Wait 1-2 minutes, then:

./scripts/verify/verify-rpc-2101-approve-and-sync.sh
./scripts/monitoring/monitor-blockchain-health.sh

Expected: RPC reports Chain 138 with ≥5 peers, all 5 validator IPs in the peer list (or 24+ peers), block height advancing, and 5/5 validators active. Block production can take 3-5 minutes to resume after validator restarts; if it is still stalled with 5/5 validators active, see BLOCK_PRODUCTION_FIX_RUNBOOK.md and check the validator logs for "Proposed block" / QBFT messages.
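Two of the expected conditions, advancing block height and a minimum peer count, reduce to simple comparisons on the hex quantities returned by eth_blockNumber and net_peerCount. The sketch below shows only that comparison logic; how the values are fetched and how long to wait between samples are left to the operator. The function names are hypothetical and not part of the repo scripts.

```shell
# Succeed if block height $2 (later sample) is strictly greater than $1
# (earlier sample). Both are 0x-prefixed hex quantities from eth_blockNumber.
block_advanced() {
  [ "$(printf '%d' "$2")" -gt "$(printf '%d' "$1")" ]
}

# Succeed if peer count $1 (hex, from net_peerCount) is at least $2 (decimal).
enough_peers() {
  [ "$(printf '%d' "$1")" -ge "$2" ]
}

# Usage, with two eth_blockNumber samples taken some seconds apart:
#   block_advanced "$first" "$second" || echo "block height did not advance"
#   enough_peers "$peer_count_hex" 5  || echo "fewer than 5 peers"
```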


One-shot script (optional)

To run the full fix sequence in one go (no validator writable step; add manually if needed):

./scripts/deploy-besu-node-lists-to-all.sh && \
./scripts/fix-validator-permissioning-toml.sh && \
./scripts/fix-all-validators-and-txpool.sh && \
./scripts/maintenance/fix-core-rpc-2101.sh --restart-only

Then wait 90s and run ./scripts/verify/verify-rpc-2101-approve-and-sync.sh and ./scripts/monitoring/monitor-blockchain-health.sh.
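The && chain stops at the first failing script but does not say which one failed. A small wrapper can make that explicit. This is a sketch under the assumption that each script signals failure with a non-zero exit status; run_in_order is a hypothetical helper, not something in the repo.

```shell
# Run each argument as a command line, in order. Stop at the first failure
# and report which step failed on stderr.
run_in_order() {
  step=0
  for cmd in "$@"; do
    step=$((step + 1))
    echo "[$step/$#] $cmd"
    sh -c "$cmd" || { echo "step $step failed: $cmd" >&2; return 1; }
  done
}

# Usage (from the repo root):
#   run_in_order \
#     "./scripts/deploy-besu-node-lists-to-all.sh" \
#     "./scripts/fix-validator-permissioning-toml.sh" \
#     "./scripts/fix-all-validators-and-txpool.sh" \
#     "./scripts/maintenance/fix-core-rpc-2101.sh --restart-only" \
#     && sleep 90 \
#     && ./scripts/verify/verify-rpc-2101-approve-and-sync.sh \
#     && ./scripts/monitoring/monitor-blockchain-health.sh
```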


Troubleshooting

| Symptom | Action |
|---|---|
| Validator "activating" or crash-loop | Check logs: ssh root@192.168.11.11 "pct exec 1002 -- journalctl -u besu-validator -n 50 --no-pager". If "Read-only file system" or JNA: run make-validator-vmids-writable-via-ssh.sh then restart validators. |
| Only 2/5 validator IPs in RPC peers | RPC may connect via sentries; 24 peers is OK. To get all 5 validator IPs in peer list, ensure config/besu-node-lists/permissions-nodes.toml includes all 5 and run deploy-besu-node-lists-to-all.sh, then restart RPC 2101. |
| Block production stalled | Ensure 4/5 or 5/5 validators active (QBFT quorum). Run fix-validator-permissioning-toml.sh and fix-all-validators-and-txpool.sh; if validators are read-only, run make-validator-vmids-writable-via-ssh.sh first. |
| RPC 2101 not responding | Run ./scripts/maintenance/health-check-rpc-2101.sh then ./scripts/maintenance/fix-core-rpc-2101.sh. |
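The 4/5 figure in the stalled-production row follows from the QBFT quorum rule: with N validators, at least ceil(2N/3) must be participating for blocks to be produced. A minimal sketch of that arithmetic, with hypothetical function names:

```shell
# Minimum number of active validators QBFT needs out of $1 total: ceil(2N/3).
qbft_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

# Succeed if $1 active validators out of $2 total meet the quorum.
has_quorum() {
  [ "$1" -ge "$(qbft_quorum "$2")" ]
}

# Usage:
#   qbft_quorum 5                      # minimum active validators for this network
#   has_quorum 3 5 || echo "below quorum: expect block production to stall"
```

With 5 validators the quorum is 4, so the network tolerates one validator being down; a second failure stalls block production until a validator is restored.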

References