# RPC Node Troubleshooting Report — VMID 2505 (besu-rpc-luis-0x8a)

**Date**: 2026-01-05  
**VMID**: 2505  
**IP**: 192.168.11.201  
**Role**: Named RPC node (Luis / Chain 0x8a)

## Symptoms

- From client: TCP connection to `192.168.11.201:8545` succeeded, but HTTP never returned any bytes (the request hung).
- `pct exec 2505 -- ...` timed out repeatedly (the container could not spawn commands).

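The hang pattern above (TCP accepts, HTTP never answers) can be checked from any client with a bounded-time probe. This is a sketch, not part of the original incident tooling; `probe_rpc` is a hypothetical helper, and the host/port are the ones from this report.

```shell
#!/bin/sh
# probe_rpc: hypothetical helper to distinguish "port closed" from
# "TCP accepts but the HTTP layer hangs" (the symptom seen here).
probe_rpc() {
  host="$1"; port="$2"
  # --connect-timeout bounds the TCP handshake; --max-time bounds the
  # whole request, so a wedged server cannot hang the probe forever.
  if curl -s --connect-timeout 3 --max-time 8 -o /dev/null \
      -X POST -H 'Content-Type: application/json' \
      --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
      "http://$host:$port"; then
    echo "rpc ok on $host:$port"
  else
    echo "no response from $host:$port"
    return 1
  fi
}

# Against the node from this report:
# probe_rpc 192.168.11.201 8545
```

A plain `nc -z` only tests the TCP handshake and would have reported this node as healthy; the probe has to wait for actual HTTP bytes to catch this failure mode.
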
## Diagnosis

- **Container memory pressure** was extreme:
  - `pvesh ... status/current` showed memory essentially maxed out and swap nearly fully used.
  - The container's init process (`/sbin/init`) was in **D (uninterruptible sleep)** with a kernel stack indicating it was blocked waiting on page-in (`filemap_fault` / `folio_wait_bit_common`), consistent with **swap/IO thrash**.
- After restarting the container, RPC still did not come up because:
  - The Besu systemd unit set `Environment="BESU_OPTS=-Xmx8g -Xms8g"` while the container had only **~4GB** of memory (later raised to **6GB**). A JVM heap sized near or above the container limit causes severe memory pressure and OOM behavior, and can prevent services from ever becoming responsive.
  - Besu logs showed it was performing **RocksDB compaction** at startup; the oversized heap made that recovery even slower.

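A D-state init process can be confirmed from the host side even when `pct exec` itself hangs, by reading process state out of `/proc`. A minimal sketch, with `list_dstate` a hypothetical helper:

```shell
#!/bin/sh
# list_dstate: hypothetical helper printing the PID of every process in
# D (uninterruptible sleep) -- the state /sbin/init was stuck in here.
list_dstate() {
  for st in /proc/[0-9]*/stat; do
    pid=${st#/proc/}; pid=${pid%/stat}
    # Strip through the final ')' so command names containing spaces or
    # parentheses don't shift the fields; the next field is the state.
    state=$(sed 's/.*) //' "$st" 2>/dev/null | cut -d' ' -f1)
    [ "$state" = "D" ] && echo "$pid"
  done
  return 0
}

list_dstate
```

For any PID it prints, `cat /proc/<pid>/stack` (as root, where the kernel exposes it) shows the wait channel; that is the kind of output that pointed at `filemap_fault` / `folio_wait_bit_common` above.
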
## Remediation / Fixes Applied

### 1) Make storage available to start the container on node `ml110`

Starting VMID 2505 initially failed with:

- `storage 'local-lvm' is not available on node 'ml110'`

Root cause: `/etc/pve/storage.cfg` restricted `local-lvm` to node `r630-01`, but this VMID was running on `ml110`.

Fix: updated `/etc/pve/storage.cfg` to include `ml110` in the node list for `lvmthin: local-lvm` (a backup was created first). After this, `local-lvm` became active on `ml110` and the container could start.

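For reference, the resulting stanza looks like the following. The `thinpool`, `vgname`, and `content` values here are illustrative defaults, not copied from the cluster; the actual change was adding `ml110` to the `nodes` list.

```
lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes r630-01,ml110
```
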
### 2) Increase VMID 2505 memory/swap

- Updated VMID 2505 to **memory=6144MB**, **swap=1024MB**.

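On the Proxmox node this resize is a single `pct set`. The sketch below is a dry run (it only prints the command, via a hypothetical `resize_ct` wrapper), so it is safe to paste anywhere; drop the `echo` on a real PVE node to apply it.

```shell
#!/bin/sh
# resize_ct: hypothetical dry-run wrapper around `pct set`.
# It prints the command instead of executing it; remove the `echo`
# on a real PVE node to apply the change.
resize_ct() {
  vmid="$1"; mem_mb="$2"; swap_mb="$3"
  echo "pct set $vmid --memory $mem_mb --swap $swap_mb"
}

resize_ct 2505 6144 1024   # prints: pct set 2505 --memory 6144 --swap 1024
```
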
### 3) Reduce Besu heap to fit container memory

Inside VMID 2505:

- Updated `/etc/systemd/system/besu-rpc.service`:
  - From: `BESU_OPTS=-Xmx8g -Xms8g`
  - To: `BESU_OPTS=-Xms2g -Xmx4g`
- Ran: `systemctl daemon-reload && systemctl restart besu-rpc`
- Confirmed listeners came up on `:8545` (HTTP RPC), `:8546` (WS), and `:9545` (metrics).

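The `-Xmx4g` choice fits a common rule of thumb: inside a memory-limited container, cap the JVM heap well below the limit to leave room for off-heap memory (RocksDB block cache, metaspace, thread stacks). A sketch of that arithmetic, where `heap_for_container` is a hypothetical helper and the two-thirds ratio is an assumption, not a Besu recommendation:

```shell
#!/bin/sh
# heap_for_container: hypothetical helper that caps the JVM heap at
# roughly two-thirds of the container's memory limit, leaving the rest
# for off-heap allocations (RocksDB cache, metaspace, thread stacks).
heap_for_container() {
  mem_mb="$1"
  echo $(( mem_mb * 2 / 3 ))
}

heap_for_container 6144   # 4096 MB -> matches the -Xmx4g applied here
heap_for_container 4096   # the original ~4GB limit supports only ~2.7g
```

By this rule, the original `-Xmx8g` would have required a ~12GB container, which is roughly three times what VMID 2505 actually had.
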
## Verification

- External JSON-RPC works again:
  - `eth_chainId` returns `0x8a`
  - `eth_blockNumber` returns a valid block
- Full fleet retest:
  - Report: `reports/rpc_nodes_test_20260105_062846.md`
  - Result: **Reachable 12/12**, **Authorized + responding 12/12**, **Block spread Δ0**

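The verification calls can be reproduced with plain curl. A sketch: the loop is gated on `RPC_HOST` being set (e.g. `RPC_HOST=192.168.11.201`) so the script is inert until pointed at a real node, and `hex_to_dec` is a hypothetical helper for decoding JSON-RPC's 0x-hex quantities.

```shell
#!/bin/sh
# hex_to_dec: JSON-RPC returns quantities as 0x-prefixed hex strings.
hex_to_dec() { printf '%d\n' "$1"; }

hex_to_dec 0x8a   # prints 138, i.e. a result of "0x8a" is chain id 138

# Gated so the script does nothing without a target; set RPC_HOST to run.
if [ -n "${RPC_HOST:-}" ]; then
  for method in eth_chainId eth_blockNumber; do
    curl -s --max-time 8 -X POST -H 'Content-Type: application/json' \
      --data "{\"jsonrpc\":\"2.0\",\"method\":\"$method\",\"params\":[],\"id\":1}" \
      "http://$RPC_HOST:8545"
    echo
  done
fi
```
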
## Follow-ups / Recommendations

- Keep the Besu heap aligned with the container's memory limit (avoid setting `-Xmx` near or above the limit).
- Investigate why node `ml110` is hosting VMIDs whose storage is restricted to `r630-01` in `storage.cfg` (possible migration/renaming mismatch).
- The Proxmox host `ml110` showed extremely high load earlier; check IO wait and overall node health if issues recur.