proxmox/reports/RPC_NODE_2505_TROUBLESHOOTING_20260105.md

RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)

Date: 2026-01-05
VMID: 2505
IP: 192.168.11.201
Role: Named RPC node (Luis / Chain 0x8a)

Symptoms

  • From client: TCP connection to 192.168.11.201:8545 succeeded, but HTTP never returned any bytes (hung).
  • pct exec 2505 -- ... timed out repeatedly (container could not spawn commands).
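The first symptom (TCP connects, HTTP never answers) can be told apart from a plain "port closed" failure with a small probe. The following is a minimal sketch, not the tool used in this incident; the function name `probe_http` and its return labels are hypothetical.

```python
import socket


def probe_http(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify an endpoint as 'tcp_failed', 'tcp_ok_no_response', or 'responded'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the connect itself timed out.
        return "tcp_failed"
    with sock:
        sock.settimeout(timeout)
        request = (
            f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        ).encode("ascii")
        try:
            sock.sendall(request)
            data = sock.recv(1)  # one byte is enough to prove the server answered
        except socket.timeout:
            # Handshake completed but no bytes came back: the hang seen here.
            return "tcp_ok_no_response"
        except OSError:
            return "tcp_failed"
    return "responded" if data else "tcp_ok_no_response"
```

Calling `probe_http("192.168.11.201", 8545)` during the incident would be expected to report `tcp_ok_no_response`: the kernel still completes the handshake for a listening socket even when the process behind it is too starved to read the request.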

Diagnosis

  • Container memory pressure was extreme:
    • pvesh ... status/current showed memory essentially maxed and swap nearly fully used.
    • The container init process (/sbin/init) was in D (uninterruptible sleep) with a stack indicating it was blocked waiting on page-in (filemap_fault / folio_wait_bit_common), consistent with swap/IO thrash.
  • After restarting the container, RPC still did not come up because:
    • The Besu systemd unit had Environment="BESU_OPTS=-Xmx8g -Xms8g" while the container had only ~4GB at that point (6GB after the later increase). An 8g heap does not fit inside either limit, so it can cause severe memory pressure/OOM behavior as the heap fills and prevent the service from becoming responsive.
    • Besu logs indicated it was performing RocksDB compaction at startup; the oversized heap made recovery worse.
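The heap mismatch above is simple arithmetic and can be checked before a service is started. This is a sketch under stated assumptions: the helper names are hypothetical, and the 75% ceiling is an illustrative rule of thumb, not a value taken from this report or from Besu documentation.

```python
import re
from typing import Optional

_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def jvm_size_bytes(value: str) -> int:
    """Convert a JVM size such as '8g', '512m', or '1048576' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value)
    if not match:
        raise ValueError(f"unrecognised JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit.lower(), 1)


def max_heap_bytes(jvm_opts: str) -> Optional[int]:
    """Return the last -Xmx value in an options string, or None if absent."""
    sizes = re.findall(r"-Xmx(\S+)", jvm_opts)
    return jvm_size_bytes(sizes[-1]) if sizes else None


def heap_fits(jvm_opts: str, container_mib: int, max_fraction: float = 0.75) -> bool:
    """True if -Xmx stays within max_fraction of the container memory limit.

    max_fraction is an assumed safety margin for off-heap memory.
    """
    heap = max_heap_bytes(jvm_opts)
    if heap is None:
        return True  # no explicit cap; the JVM picks its own default
    return heap <= container_mib * 1024**2 * max_fraction
```

With the values from this report, `heap_fits("-Xmx8g -Xms8g", 4096)` and `heap_fits("-Xmx8g -Xms8g", 6144)` are both false, while `heap_fits("-Xms2g -Xmx4g", 6144)` is true.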

Remediation / Fixes Applied

1) Make storage available to start the container on node ml110

Starting VMID 2505 initially failed with:

  • storage 'local-lvm' is not available on node 'ml110'

Root cause: /etc/pve/storage.cfg restricted local-lvm to node r630-01, but this VMID was running on ml110.
Fix: Backed up /etc/pve/storage.cfg, then added ml110 to the nodes allowed for lvmthin: local-lvm. After this, local-lvm became active on ml110 and the container could start.
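The storage.cfg change can be sketched as a text transformation. The helper `allow_node` is hypothetical, and the sample stanza in the usage note is illustrative only: the report names the `lvmthin: local-lvm` entry and its restriction to r630-01, but not the other properties.

```python
def allow_node(cfg_text: str, storage_id: str, node: str) -> str:
    """Add `node` to the `nodes` list of the stanza for `storage_id`.

    A stanza without a `nodes` line is left unchanged.
    """
    out, in_target = [], False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if line and not line[0].isspace():
            # Unindented line: a stanza header such as "lvmthin: local-lvm".
            in_target = stripped.partition(":")[2].strip() == storage_id
        elif in_target and stripped.split(None, 1)[:1] == ["nodes"]:
            parts = stripped.split(None, 1)
            nodes = [n.strip() for n in parts[1].split(",")] if len(parts) > 1 else []
            if node not in nodes:
                nodes.append(node)
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}nodes {','.join(nodes)}"
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative stanza; thinpool, vgname and content values are assumptions.
SAMPLE = (
    "lvmthin: local-lvm\n"
    "\tthinpool data\n"
    "\tvgname pve\n"
    "\tcontent rootdir,images\n"
    "\tnodes r630-01\n"
)
```

Here `allow_node(SAMPLE, "local-lvm", "ml110")` rewrites the last line to `nodes r630-01,ml110` and leaves everything else untouched. Applying the function twice gives the same result, so it is safe to re-run.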

2) Increase VMID 2505 memory/swap

  • Updated VMID 2505 to memory=6144MB, swap=1024MB.
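The report does not record the exact command used for this step. One common way to apply such a change is `pct set` on the Proxmox host; the sketch below only builds and prints that command line, and the helper name is hypothetical.

```python
import shlex


def pct_set_command(vmid: int, memory_mb: int, swap_mb: int) -> list[str]:
    """Build the argv for resizing a container's memory and swap (values in MB)."""
    return [
        "pct", "set", str(vmid),
        "--memory", str(memory_mb),
        "--swap", str(swap_mb),
    ]


# Prints: pct set 2505 --memory 6144 --swap 1024
print(shlex.join(pct_set_command(2505, 6144, 1024)))
```

Building the argv as a list keeps it ready for `subprocess.run(..., check=True)` without shell quoting concerns.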

3) Reduce Besu heap to fit container memory

Inside VMID 2505:

  • Updated /etc/systemd/system/besu-rpc.service:
    • From: BESU_OPTS=-Xmx8g -Xms8g
    • To: BESU_OPTS=-Xms2g -Xmx4g
  • Ran: systemctl daemon-reload && systemctl restart besu-rpc
  • Confirmed listeners came up on :8545 (HTTP RPC), :8546 (WS), :9545 (metrics)
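The unit-file edit above can be expressed as a guarded replacement, which is useful if the same fix has to be repeated on other nodes. This is a sketch; `set_besu_opts` is a hypothetical helper, and the sample unit text contains only the line quoted in this report.

```python
import re

# Matches a single Environment= line that sets BESU_OPTS, quoted or not.
_ENV_LINE = re.compile(
    r'^([ \t]*Environment=)"?BESU_OPTS=[^"\n]*"?[ \t]*$', re.MULTILINE
)


def set_besu_opts(unit_text: str, new_opts: str) -> str:
    """Replace the BESU_OPTS value; fail loudly unless exactly one line matches."""
    updated, count = _ENV_LINE.subn(
        lambda m: f'{m.group(1)}"BESU_OPTS={new_opts}"', unit_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one BESU_OPTS line, found {count}")
    return updated


# Other directives of the real unit are omitted here.
UNIT = '[Service]\nEnvironment="BESU_OPTS=-Xmx8g -Xms8g"\n'
```

`set_besu_opts(UNIT, "-Xms2g -Xmx4g")` yields the "To" line shown above. The change still needs `systemctl daemon-reload` and a service restart to take effect, as in the step recorded in this report.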

Verification

  • External JSON-RPC works again:
    • eth_chainId returns 0x8a
    • eth_blockNumber returns a valid block
  • Full fleet retest:
    • Report: reports/rpc_nodes_test_20260105_062846.md
    • Result: Reachable 12/12, Authorized+responding 12/12, Block spread Δ0

Follow-ups / Recommendations

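  • Keep the Besu heap sized to the container: -Xmx should stay well below the container memory limit (never near or above it), leaving headroom for RocksDB and other off-heap memory.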
  • Investigate why node ml110 is hosting VMIDs whose storage is restricted to r630-01 in storage.cfg (possible migration/renaming mismatch).
  • The Proxmox host ml110 showed extremely high load earlier; consider checking IO wait and overall node health if issues recur.
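For the IO wait check suggested above, the share of CPU time spent in iowait can be derived from two samples of `/proc/stat` on the host. This is a sketch with hypothetical helper names; it reports a host-wide average and says nothing about which container or disk is responsible.

```python
def cpu_times(stat_text: str) -> list[int]:
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    a, b = cpu_times(before), cpu_times(after)
    # Fields: user nice system idle iowait irq softirq steal.
    # Guest time is already counted in user, so only the first eight are summed.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0
```

In practice the two samples would come from reading `/proc/stat` a few seconds apart. A persistently high value alongside high load would be consistent with the swap/IO thrash described in the diagnosis.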