- Organized 252 files across the project
- Root directory: 187 → 2 files (a 98.9% reduction)
- Moved configuration guides to docs/04-configuration/
- Moved troubleshooting guides to docs/09-troubleshooting/
- Moved quick start guides to docs/01-getting-started/
- Moved reports to the reports/ directory
- Archived temporary files
- Generated comprehensive reports and documentation
- Created maintenance scripts and guides

All files organized according to established standards.
RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)
Date: 2026-01-05
VMID: 2505
IP: 192.168.11.201
Role: Named RPC node (Luis / Chain 0x8a)
Symptoms
- From a client: the TCP connection to `192.168.11.201:8545` succeeded, but HTTP never returned any bytes (the request hung).
- `pct exec 2505 -- ...` timed out repeatedly (the container could not spawn commands).
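The client-side symptom (TCP connect succeeds, but HTTP never returns a byte) can be reproduced with a short probe. This is an illustrative sketch, not part of the tooling used in the incident; the function name and return labels are ours:

```python
import socket

def probe_rpc(host: str, port: int, timeout: float = 5.0) -> str:
    """Distinguish 'TCP refused', 'TCP ok but HTTP hangs', and 'HTTP responds'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp_failed"
    try:
        # Minimal JSON-RPC POST; a healthy node answers within the timeout.
        body = b'{"jsonrpc":"2.0","id":1,"method":"eth_chainId","params":[]}'
        request = (
            b"POST / HTTP/1.1\r\nHost: " + host.encode() +
            b"\r\nContent-Type: application/json\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body
        )
        sock.sendall(request)
        data = sock.recv(1024)
        return "http_ok" if data else "http_hung"
    except socket.timeout:
        # Connected, but no bytes ever came back: the exact symptom seen here.
        return "http_hung"
    finally:
        sock.close()
```

Run against `192.168.11.201:8545` during the incident, this would have returned `"http_hung"`: the listener's accept queue took the connection, but the swapped-out process never serviced it.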
Diagnosis
- Container memory pressure was extreme: `pvesh ... status/current` showed memory essentially maxed out and swap nearly fully used.
- The container's init process (`/sbin/init`) was in D state (uninterruptible sleep), with a kernel stack indicating it was blocked waiting on page-in (`filemap_fault`/`folio_wait_bit_common`), consistent with swap/IO thrash.
- After restarting the container, RPC still did not come up because:
  - The Besu systemd unit had `Environment="BESU_OPTS=-Xmx8g -Xms8g"`, while the container only had ~4 GB before (and later 6 GB). This can cause severe memory pressure/OOM behavior and prevent services from becoming responsive.
  - Besu logs indicated it was performing RocksDB compaction at startup; the oversized heap made recovery worse.
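Spotting processes stuck in uninterruptible sleep (state D), as `/sbin/init` was here, can be done by scanning `/proc` directly. A minimal sketch; the helper names are ours, not from the incident tooling:

```python
import os

def proc_state(pid: int) -> str:
    """Return the one-letter scheduler state from /proc/<pid>/stat.

    The comm field can contain spaces and parentheses, so split on the
    LAST ')' instead of naively splitting on whitespace.
    """
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    return stat.rsplit(")", 1)[1].split()[0]

def d_state_pids() -> list[int]:
    """PIDs currently in D state, e.g. blocked on page-in during swap thrash."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            if proc_state(int(entry)) == "D":
                pids.append(int(entry))
        except (FileNotFoundError, ProcessLookupError):
            pass  # process exited while we were scanning
    return pids
```

A node with many long-lived D-state PIDs alongside near-full swap is a strong signal of the page-in stall described above, before any service-level debugging.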
Remediation / Fixes Applied
1) Make storage available to start the container on node ml110
Starting VMID 2505 initially failed with:
`storage 'local-lvm' is not available on node 'ml110'`
Root cause: `/etc/pve/storage.cfg` restricted `local-lvm` to node `r630-01`, but this VMID was running on `ml110`.
Fix: Updated `/etc/pve/storage.cfg` to include `ml110` for `lvmthin: local-lvm` (a backup was created first). After this, `local-lvm` became active on `ml110` and the container could start.
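After the fix, the relevant stanza in `/etc/pve/storage.cfg` would look roughly like the following. This is a sketch: only the `nodes` line reflects the actual change; the `thinpool`/`vgname`/`content` values are typical defaults, not taken from the incident:

```
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes r630-01,ml110
```

Omitting the `nodes` line entirely would make the storage available on all cluster nodes; listing nodes explicitly keeps the restriction but extends it to `ml110`.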
2) Increase VMID 2505 memory/swap
- Updated VMID 2505 to `memory=6144` MB and `swap=1024` MB.
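A change like this is applied from the Proxmox host with `pct set`; the commands below are a sketch of the equivalent invocation, not a transcript from the incident:

```shell
# Raise the container's memory and swap limits (values from this report)
pct set 2505 --memory 6144 --swap 1024

# Confirm the new limits in the container config
pct config 2505 | grep -E '^(memory|swap):'
```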
3) Reduce Besu heap to fit container memory
Inside VMID 2505:
- Updated `/etc/systemd/system/besu-rpc.service`:
  - From: `BESU_OPTS=-Xmx8g -Xms8g`
  - To: `BESU_OPTS=-Xms2g -Xmx4g`
- Ran: `systemctl daemon-reload && systemctl restart besu-rpc`
- Confirmed listeners came up on `:8545` (HTTP RPC), `:8546` (WS), and `:9545` (metrics).
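An alternative to editing the unit in place is keeping the override in a systemd drop-in, which survives package upgrades of the unit file. A sketch of `/etc/systemd/system/besu-rpc.service.d/heap.conf`; the drop-in path and filename are our suggestion, while the `BESU_OPTS` value is the one actually applied:

```ini
[Service]
# Cap the JVM heap well below the 6 GiB container limit so Besu's heap,
# RocksDB caches, and OS page cache all fit without pushing into swap.
Environment="BESU_OPTS=-Xms2g -Xmx4g"
```

After creating or editing a drop-in, the same `systemctl daemon-reload && systemctl restart besu-rpc` is required for it to take effect.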
Verification
- External JSON-RPC works again:
  - `eth_chainId` returns `0x8a`
  - `eth_blockNumber` returns a valid block
- Full fleet retest:
  - Report: `reports/rpc_nodes_test_20260105_062846.md`
  - Result: Reachable 12/12, Authorized + responding 12/12, Block spread Δ0
Follow-ups / Recommendations
- Keep the Besu heap aligned to container memory (avoid `-Xmx` near or above the memory limit).
- Investigate why node `ml110` is hosting VMIDs whose storage is restricted to `r630-01` in `storage.cfg` (possible migration/renaming mismatch).
- The Proxmox host `ml110` showed extremely high load earlier; consider checking IO wait and overall node health if issues recur.