RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)

Date: 2026-01-05
VMID: 2505
IP: 192.168.11.201
Role: Named RPC node (Luis / Chain 0x8a)

Symptoms

From client: TCP connection to 192.168.11.201:8545 succeeded, but HTTP never returned any bytes (hung).
pct exec 2505 -- ... timed out repeatedly (container could not spawn commands).

Diagnosis

Container memory pressure was extreme:
- pvesh ... status/current showed memory essentially maxed and swap nearly fully used.
- The container init process (/sbin/init) was in D (uninterruptible sleep) with a stack indicating it was blocked waiting on page-in (filemap_fault / folio_wait_bit_common), consistent with swap/IO thrash.
After restarting the container, RPC still did not come up because:
- The Besu systemd unit had Environment="BESU_OPTS=-Xmx8g -Xms8g", an 8 GB heap, while the container had only ~4GB of memory at the time (later raised to 6GB). A heap larger than the container's memory limit can cause severe memory pressure or OOM behavior and prevent services from becoming responsive.
- Besu logs indicated it was performing RocksDB compaction at startup; the oversized heap made recovery worse.
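
The heap/container mismatch described above can be checked before a service is started. The following is a minimal sketch, not something taken from the report: `heap_to_mib` and `heap_fits` are hypothetical helper names, and the default 1024 MiB of headroom for memory used outside the Java heap is an assumed value that would need tuning.

```shell
# Convert a JVM heap size such as "8g" or "4096m" to MiB.
# Only g/G and m/M suffixes are handled in this sketch.
heap_to_mib() {
  value=${1%[gGmM]}
  unit=${1#"$value"}
  case "$unit" in
    g|G) echo $((value * 1024)) ;;
    m|M) echo "$value" ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Succeed only when Xmx plus a headroom reserve fits in the container limit.
#   $1 = Xmx value (e.g. 4g), $2 = container memory in MiB,
#   $3 = headroom in MiB (assumed default: 1024)
heap_fits() {
  xmx_mib=$(heap_to_mib "$1") || return 2
  limit_mib=$2
  headroom_mib=${3:-1024}
  [ $((xmx_mib + headroom_mib)) -le "$limit_mib" ]
}
```

With the values from this incident, `heap_fits 8g 6144` fails and `heap_fits 4g 6144` succeeds.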

Remediation / Fixes Applied

1) Make storage available to start the container on node `ml110`

Starting VMID 2505 initially failed with:

storage 'local-lvm' is not available on node 'ml110'

Root cause: /etc/pve/storage.cfg restricted local-lvm to node r630-01, but this VMID was running on ml110.
Fix: Backed up /etc/pve/storage.cfg, then updated the lvmthin: local-lvm entry so that it also includes ml110. After this change, local-lvm became active on ml110 and the container could start.
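
A node restriction of this kind can be spotted by reading the storage entry directly. The sketch below is illustrative only: `storage_nodes` and `storage_allows_node` are hypothetical helpers, and they assume the usual storage.cfg layout of a `type: id` header line at column 0 followed by indented properties, with the restriction expressed as a comma-separated `nodes` property.

```shell
# Print the "nodes" value of one storage entry (empty if there is none).
#   $1 = storage id (e.g. local-lvm), $2 = path to a storage.cfg-style file
storage_nodes() {
  awk -v id="$1" '
    /^[a-z]/ { in_section = ($2 == id) }       # header line: "type: id"
    in_section && $1 == "nodes" { print $2 }   # indented property line
  ' "$2"
}

# Succeed when the storage is usable on the given node.
#   $1 = storage id, $2 = node name, $3 = path to the config file
# An entry without a "nodes" line is treated as unrestricted. Note that
# an unknown storage id also yields an empty list in this sketch.
storage_allows_node() {
  nodes=$(storage_nodes "$1" "$3")
  [ -z "$nodes" ] && return 0
  case ",$nodes," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `storage_allows_node local-lvm ml110 /etc/pve/storage.cfg` would have failed before the fix described above.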

2) Increase VMID 2505 memory/swap

Updated VMID 2505 to memory=6144MB, swap=1024MB.
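
The report does not say how the limits were changed. One possible way, assuming the Proxmox `pct` CLI is used on the host, is sketched below. `set_ct_memory` is a hypothetical wrapper that only prints the command unless `APPLY=1` is set, so it can be reviewed first.

```shell
# Build (and optionally run) the pct command that sets memory and swap.
#   $1 = VMID, $2 = memory in MB, $3 = swap in MB
# Dry-run by default: the command is printed, not executed.
set_ct_memory() {
  vmid=$1
  memory_mb=$2
  swap_mb=$3
  cmd="pct set $vmid --memory $memory_mb --swap $swap_mb"
  if [ "${APPLY:-0}" = "1" ]; then
    $cmd
  else
    echo "$cmd"
  fi
}
```

Running `set_ct_memory 2505 6144 1024` prints the command for the values above, and `pct config 2505` can then be used to confirm the new limits.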

3) Reduce Besu heap to fit container memory

Inside VMID 2505:

Updated /etc/systemd/system/besu-rpc.service:
- From: BESU_OPTS=-Xmx8g -Xms8g
- To: BESU_OPTS=-Xms2g -Xmx4g
Ran: systemctl daemon-reload && systemctl restart besu-rpc
Confirmed listeners came up on :8545 (HTTP RPC), :8546 (WS), :9545 (metrics)
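
The unit-file change above could be scripted roughly as follows. This is a sketch under assumptions: `set_besu_heap` is a hypothetical helper, it expects the BESU_OPTS value to contain only heap flags (as in this report), and it replaces that whole value.

```shell
# Rewrite the BESU_OPTS heap settings in a systemd unit file.
#   $1 = unit file path, $2 = new Xms (e.g. 2g), $3 = new Xmx (e.g. 4g)
# A copy of the original is kept next to it as <file>.bak.
set_besu_heap() {
  unit_file=$1
  xms=$2
  xmx=$3
  cp "$unit_file" "$unit_file.bak" || return 1
  sed "s|^Environment=\"BESU_OPTS=.*\"|Environment=\"BESU_OPTS=-Xms$xms -Xmx$xmx\"|" \
    "$unit_file.bak" > "$unit_file"
}

# After editing the real unit, reload and restart (commands from the report):
#   systemctl daemon-reload && systemctl restart besu-rpc
# One possible way to confirm the listeners (an assumption, the report
# does not say which tool was used):
#   ss -ltn | grep -E ':(8545|8546|9545) '
```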

Verification

External JSON-RPC works again:
- eth_chainId returns 0x8a
- eth_blockNumber returns a valid block
Full fleet retest:
- Report: reports/rpc_nodes_test_20260105_062846.md
- Result: Reachable 12/12, Authorized+responding 12/12, Block spread Δ0
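
The external check can be reproduced with a plain JSON-RPC request. The sketch below assumes `curl` is available and that the endpoint needs no authentication headers. `rpc_payload` and `rpc_call` are hypothetical helper names, and the 5 second default timeout is an arbitrary choice.

```shell
# Build a JSON-RPC 2.0 request body for a method that takes no parameters.
rpc_payload() {
  printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":1}' "$1"
}

# POST the request to an HTTP RPC endpoint.
#   $1 = URL, $2 = method, $3 = timeout in seconds (assumed default: 5)
# --max-time makes a hung endpoint fail instead of blocking forever,
# which is the symptom this report started with.
rpc_call() {
  url=$1
  method=$2
  curl --silent --show-error --fail --max-time "${3:-5}" \
    -H 'Content-Type: application/json' \
    --data "$(rpc_payload "$method")" "$url"
}
```

For this node, `rpc_call http://192.168.11.201:8545 eth_chainId` should return a response whose result is 0x8a, and `rpc_call http://192.168.11.201:8545 eth_blockNumber` a hex block number.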

Follow-ups / Recommendations

Keep the Besu heap sized to the container memory: avoid setting Xmx near or above the container memory limit, because the JVM and RocksDB also need memory outside the heap.
Investigate why node ml110 is hosting VMIDs whose storage is restricted to r630-01 in storage.cfg (possible migration/renaming mismatch).
The Proxmox host ml110 showed extremely high load earlier; consider checking IO wait and overall node health if issues recur.
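
As a starting point for the IO wait check, the share of CPU time spent in iowait can be derived from two snapshots of /proc/stat. This is a rough sketch rather than a replacement for tools such as iostat or vmstat. `iowait_pct` is a hypothetical helper, and it assumes the standard Linux field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Print the percentage of CPU time spent in iowait between two snapshots.
#   $1 = earlier copy of /proc/stat, $2 = later copy of /proc/stat
iowait_pct() {
  awk '
    $1 == "cpu" {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user .. steal
      if (NR == FNR) { t0 = total; w0 = $6 } # first file
      else           { t1 = total; w1 = $6 } # second file
    }
    END { if (t1 > t0) printf "%.1f\n", 100 * (w1 - w0) / (t1 - t0) }
  ' "$1" "$2"
}

# Example use on the host (the 5 second interval is arbitrary):
#   cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b
#   iowait_pct /tmp/stat.a /tmp/stat.b
```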
