# RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)
**Date**: 2026-01-05
**VMID**: 2505
**IP**: 192.168.11.201
**Role**: Named RPC node (Luis / Chain 0x8a)
## Symptoms
- From client: TCP connection to `192.168.11.201:8545` succeeded, but HTTP never returned any bytes (hung).
- `pct exec 2505 -- ...` timed out repeatedly (container could not spawn commands).
## Diagnosis
- **Container memory pressure** was extreme:
  - `pvesh ... status/current` showed memory essentially maxed and swap nearly fully used.
  - The container init process (`/sbin/init`) was in **D (uninterruptible sleep)** with a stack indicating it was blocked waiting on page-in (`filemap_fault` / `folio_wait_bit_common`), consistent with **swap/IO thrash**.
- After restarting the container, RPC still did not come up because:
  - The Besu systemd unit set `Environment="BESU_OPTS=-Xmx8g -Xms8g"` while the container had only **~4GB** of memory (later **6GB**). A fixed 8GB heap inside a smaller container causes severe memory pressure/OOM behavior and can prevent the service from ever becoming responsive.
  - Besu logs indicated it was performing **RocksDB compaction** at startup; the oversized heap made recovery worse.
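The D-state check above can be scripted from the host. A minimal sketch, assuming `ps -eo stat,pid,comm` output; the column layout and sample rows below are illustrative, not captured from the incident:

```python
def d_state_processes(ps_output):
    """Return [state, pid, comm] rows whose state starts with D
    (uninterruptible sleep, typically blocked on IO or page-in)."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if parts and parts[0].startswith("D"):
            rows.append(parts)
    return rows

sample = """STAT   PID COMMAND
S        1 systemd
D      412 init
R      900 ps"""
print(d_state_processes(sample))  # [['D', '412', 'init']]
```

A persistent D-state hit on a container's `/sbin/init`, as seen here, is a strong hint to look at swap usage and host IO wait before blaming the service itself.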
## Remediation / Fixes Applied
### 1) Make storage available to start the container on node `ml110`
Starting VMID 2505 initially failed with:
- `storage 'local-lvm' is not available on node 'ml110'`
Root cause: `/etc/pve/storage.cfg` restricted `local-lvm` to node `r630-01`, but this VMID was running on `ml110`.
Fix: Updated `/etc/pve/storage.cfg` to include `ml110` for `lvmthin: local-lvm` (backup created first). After this, `local-lvm` became active on `ml110` and the container could start.
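The resulting `storage.cfg` stanza looks roughly like this; the `thinpool`, `vgname`, and `content` values are assumed typical defaults, not copied from the cluster:

```
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes r630-01,ml110
```

Note that removing the `nodes` line entirely would make the storage available on all nodes; listing both nodes keeps the restriction explicit.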
### 2) Increase VMID 2505 memory/swap
- Updated VMID 2505 to **memory=6144MB**, **swap=1024MB**.
### 3) Reduce Besu heap to fit container memory
Inside VMID 2505:
- Updated `/etc/systemd/system/besu-rpc.service`:
  - From: `BESU_OPTS=-Xmx8g -Xms8g`
  - To: `BESU_OPTS=-Xms2g -Xmx4g`
- Ran: `systemctl daemon-reload && systemctl restart besu-rpc`
- Confirmed listeners came up on `:8545` (HTTP RPC), `:8546` (WS), `:9545` (metrics)
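An alternative to editing the unit file in place is a systemd drop-in, which survives unit-file replacement; a sketch, where the drop-in path and filename are illustrative:

```ini
# /etc/systemd/system/besu-rpc.service.d/10-heap.conf (hypothetical drop-in)
[Service]
Environment="BESU_OPTS=-Xms2g -Xmx4g"
```

Followed, as above, by `systemctl daemon-reload && systemctl restart besu-rpc`.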
## Verification
- External JSON-RPC works again:
  - `eth_chainId` returns `0x8a`
  - `eth_blockNumber` returns a valid block
- Full fleet retest:
  - Report: `reports/rpc_nodes_test_20260105_062846.md`
  - Result: **Reachable 12/12**, **Authorized+responding 12/12**, **Block spread Δ0**
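The verification calls can be scripted for future retests. A minimal sketch of the request/response handling; the endpoint `http://192.168.11.201:8545` comes from the report, and only the payload/decode helpers below run offline:

```python
import json

def rpc_payload(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request body for an Ethereum-style endpoint."""
    return json.dumps({"jsonrpc": "2.0", "method": method,
                       "params": params or [], "id": req_id})

def decode_quantity(hex_str):
    """Decode a JSON-RPC quantity such as '0x8a' into an int."""
    return int(hex_str, 16)

# POST rpc_payload("eth_chainId") to http://192.168.11.201:8545 and feed the
# response's "result" field through decode_quantity.
print(decode_quantity("0x8a"))  # 138 -- the expected chain id
```

The same pair of helpers covers `eth_blockNumber`, whose result is also a hex quantity.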
## Follow-ups / Recommendations
- Keep the Besu heap aligned to container memory (avoid setting `Xmx` near or above the container's memory limit; leave headroom for off-heap and native allocations).
- Investigate why node `ml110` is hosting VMIDs whose storage is restricted to `r630-01` in `storage.cfg` (possible migration/renaming mismatch).
- The Proxmox host `ml110` showed extremely high load earlier; consider checking IO wait and overall node health if issues recur.
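The first recommendation can be encoded as a simple pre-flight check; a sketch, where the 2GB off-heap headroom is an assumed rule of thumb, not a measured value:

```python
def heap_fits(xmx_mb, container_mb, headroom_mb=2048):
    """True if the JVM max heap leaves room for off-heap/native memory
    inside the container's memory limit."""
    return xmx_mb + headroom_mb <= container_mb

print(heap_fits(8192, 4096))  # failing config: 8g heap in a 4GB container -> False
print(heap_fits(4096, 6144))  # fixed config: 4g heap in a 6GB container -> True
```

Running such a check when a unit file or container config changes would have flagged the `Xmx8g`-in-4GB mismatch before it caused swap thrash.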