# RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)

**Date**: 2026-01-05  
**VMID**: 2505  
**IP**: 192.168.11.201  
**Role**: Named RPC node (Luis / Chain 0x8a)

## Symptoms

- From client: TCP connection to `192.168.11.201:8545` succeeded, but HTTP never returned any bytes (the request hung).
- `pct exec 2505 -- ...` timed out repeatedly (the container could not spawn commands).

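The hang pattern above (TCP accepts, HTTP never answers) can be checked from any client with a bounded-time probe. This is a sketch, not part of the original incident tooling; `probe_rpc` is a hypothetical helper, and the host/port are the ones from this report.

```shell
#!/bin/sh
# probe_rpc: hypothetical helper to distinguish "port closed" from
# "TCP accepts but the HTTP layer hangs" (the symptom seen here).
probe_rpc() {
  host="$1"; port="$2"
  # --connect-timeout bounds the TCP handshake; --max-time bounds the
  # whole request, so a wedged server cannot hang the probe forever.
  if curl -s --connect-timeout 3 --max-time 8 -o /dev/null \
      -X POST -H 'Content-Type: application/json' \
      --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
      "http://$host:$port"; then
    echo "rpc ok on $host:$port"
  else
    echo "no response from $host:$port"
    return 1
  fi
}

# Against the node from this report:
# probe_rpc 192.168.11.201 8545
```

A plain `nc -z` only tests the TCP handshake and would have reported this node as healthy; the probe has to wait for actual HTTP bytes to catch this failure mode.
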
## Diagnosis

- **Container memory pressure** was extreme:
  - `pvesh ... status/current` showed memory essentially maxed out and swap nearly fully used.
  - The container's init process (`/sbin/init`) was in **D (uninterruptible sleep)** with a kernel stack indicating it was blocked waiting on page-in (`filemap_fault` / `folio_wait_bit_common`), consistent with **swap/IO thrash**.
- After restarting the container, RPC still did not come up because:
  - The Besu systemd unit set `Environment="BESU_OPTS=-Xmx8g -Xms8g"` while the container had only **~4GB** of memory (later raised to **6GB**). A JVM heap sized near or above the container limit causes severe memory pressure and OOM behavior, and can prevent services from ever becoming responsive.
  - Besu logs showed it was performing **RocksDB compaction** at startup; the oversized heap made that recovery even slower.

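A D-state init process can be confirmed from the host side even when `pct exec` itself hangs, by reading process state out of `/proc`. A minimal sketch, with `list_dstate` a hypothetical helper:

```shell
#!/bin/sh
# list_dstate: hypothetical helper printing the PID of every process in
# D (uninterruptible sleep) -- the state /sbin/init was stuck in here.
list_dstate() {
  for st in /proc/[0-9]*/stat; do
    pid=${st#/proc/}; pid=${pid%/stat}
    # Strip through the final ')' so command names containing spaces or
    # parentheses don't shift the fields; the next field is the state.
    state=$(sed 's/.*) //' "$st" 2>/dev/null | cut -d' ' -f1)
    [ "$state" = "D" ] && echo "$pid"
  done
  return 0
}

list_dstate
```

For any PID it prints, `cat /proc/<pid>/stack` (as root, where the kernel exposes it) shows the wait channel; that is the kind of output that pointed at `filemap_fault` / `folio_wait_bit_common` above.
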
## Remediation / Fixes Applied

### 1) Make storage available to start the container on node `ml110`

Starting VMID 2505 initially failed with:

- `storage 'local-lvm' is not available on node 'ml110'`

Root cause: `/etc/pve/storage.cfg` restricted `local-lvm` to node `r630-01`, but this VMID was running on `ml110`.

Fix: updated `/etc/pve/storage.cfg` to include `ml110` in the node list for `lvmthin: local-lvm` (a backup was created first). After this, `local-lvm` became active on `ml110` and the container could start.

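For reference, the resulting stanza looks like the following. The `thinpool`, `vgname`, and `content` values here are illustrative defaults, not copied from the cluster; the actual change was adding `ml110` to the `nodes` list.

```
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes r630-01,ml110
```
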
### 2) Increase VMID 2505 memory/swap

- Updated VMID 2505 to **memory=6144MB**, **swap=1024MB**.

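On the Proxmox node this resize is a single `pct set`. The sketch below is a dry run (it only prints the command, via a hypothetical `resize_ct` wrapper), so it is safe to paste anywhere; drop the `echo` on a real PVE node to apply it.

```shell
#!/bin/sh
# resize_ct: hypothetical dry-run wrapper around `pct set`.
# It prints the command instead of executing it; remove the `echo`
# on a real PVE node to apply the change.
resize_ct() {
  vmid="$1"; mem_mb="$2"; swap_mb="$3"
  echo "pct set $vmid --memory $mem_mb --swap $swap_mb"
}

resize_ct 2505 6144 1024   # prints: pct set 2505 --memory 6144 --swap 1024
```
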
### 3) Reduce Besu heap to fit container memory

Inside VMID 2505:

- Updated `/etc/systemd/system/besu-rpc.service`:
  - From: `BESU_OPTS=-Xmx8g -Xms8g`
  - To: `BESU_OPTS=-Xms2g -Xmx4g`
- Ran: `systemctl daemon-reload && systemctl restart besu-rpc`
- Confirmed listeners came up on `:8545` (HTTP RPC), `:8546` (WS), and `:9545` (metrics).

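The `-Xmx4g` choice fits a common rule of thumb: inside a memory-limited container, cap the JVM heap well below the limit to leave room for off-heap memory (RocksDB block cache, metaspace, thread stacks). A sketch of that arithmetic, where `heap_for_container` is a hypothetical helper and the two-thirds ratio is an assumption, not a Besu recommendation:

```shell
#!/bin/sh
# heap_for_container: hypothetical helper that caps the JVM heap at
# roughly two-thirds of the container's memory limit, leaving the rest
# for off-heap allocations (RocksDB cache, metaspace, thread stacks).
heap_for_container() {
  mem_mb="$1"
  echo $(( mem_mb * 2 / 3 ))
}

heap_for_container 6144   # 4096 MB -> matches the -Xmx4g applied here
heap_for_container 4096   # the original ~4GB limit supports only ~2.7g
```

By this rule, the original `-Xmx8g` would have required a ~12GB container, which is roughly three times what VMID 2505 actually had.
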
## Verification

- External JSON-RPC works again:
  - `eth_chainId` returns `0x8a`
  - `eth_blockNumber` returns a valid block
- Full fleet retest:
  - Report: `reports/rpc_nodes_test_20260105_062846.md`
  - Result: **Reachable 12/12**, **Authorized + responding 12/12**, **Block spread Δ0**

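The verification calls can be reproduced with plain curl. A sketch: the loop is gated on `RPC_HOST` being set (e.g. `RPC_HOST=192.168.11.201`) so the script is inert until pointed at a real node, and `hex_to_dec` is a hypothetical helper for decoding JSON-RPC's 0x-hex quantities.

```shell
#!/bin/sh
# hex_to_dec: JSON-RPC returns quantities as 0x-prefixed hex strings.
hex_to_dec() { printf '%d\n' "$1"; }

hex_to_dec 0x8a   # prints 138, i.e. a result of "0x8a" is chain id 138

# Gated so the script does nothing without a target; set RPC_HOST to run.
if [ -n "${RPC_HOST:-}" ]; then
  for method in eth_chainId eth_blockNumber; do
    curl -s --max-time 8 -X POST -H 'Content-Type: application/json' \
      --data "{\"jsonrpc\":\"2.0\",\"method\":\"$method\",\"params\":[],\"id\":1}" \
      "http://$RPC_HOST:8545"
    echo
  done
fi
```
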
## Follow-ups / Recommendations

- Keep the Besu heap aligned with the container's memory limit (avoid setting `-Xmx` near or above the limit).
- Investigate why node `ml110` is hosting VMIDs whose storage is restricted to `r630-01` in `storage.cfg` (possible migration/renaming mismatch).
- The Proxmox host `ml110` showed extremely high load earlier; check IO wait and overall node health if issues recur.