proxmox/docs/00-meta/502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md

# 502 Deep Dive: Root Causes and Fixes
**Last updated:** 2026-02-14
This document maps each E2E 502 to its backend, root cause, and fix. Use from **LAN** with SSH to Proxmox.
## Full maintenance (all RPC + 502 in one run)
From project root on **LAN** (SSH to r630-01, ml110, r630-02):
```bash
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
```
This runs in order: (0) make RPC VMIDs 2101, 2500-2505 writable (e2fsck); (1) resolve-and-fix (Dev VM IP, start containers, DBIS); (2) fix 2101 JNA reinstall; (3) install Besu on missing nodes (2500-2505, 1505-1508); (4) address-all-502s (backends + NPM + RPC diagnostics); (5) E2E verification. Use `--verbose` to see all step output; `STEP2_TIMEOUT=0` to disable step-2 timeout. See [MAINTENANCE_SCRIPTS_REVIEW.md](MAINTENANCE_SCRIPTS_REVIEW.md) and [CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md](../05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md) §9.
## Backend map (domain → IP:port → VMID, host)
| Domain(s) | Backend | VMID | Proxmox host | Service to start |
|-----------|---------|------|--------------|-------------------|
| dbis-admin.d-bis.org, secure.d-bis.org | 192.168.11.130:80 | 10130 | r630-01 (192.168.11.11) | nginx |
| dbis-api.d-bis.org, dbis-api-2.d-bis.org | 192.168.11.155:3000, .156:3000 | 10150, 10151 | r630-01 | node |
| rpc-http-prv.d-bis.org, rpc-ws-prv.d-bis.org | 192.168.11.211:8545/8546 | 2101 | r630-01 | besu |
| mim4u.org, www.mim4u.org, secure.mim4u.org, training.mim4u.org | 192.168.11.37:80 | 7810 | r630-02 (192.168.11.12) | nginx (or python stub in fix-all-502s-comprehensive.sh) |
| rpc-alltra*.d-bis.org (3) | 192.168.11.172/173/174:8545 | 2500, 2501, 2502 | r630-01 | besu |
| rpc-hybx*.d-bis.org (3) | 192.168.11.246/247/248:8545 | 2503, 2504, 2505 | r630-01 or ml110 | besu |
| cacti-1 (if proxied) | 192.168.11.80:80 | 5200 | r630-02 | nginx/apache2 |
| cacti-alltra.d-bis.org | 192.168.11.177:80 | 5201 | r630-02 | nginx/apache2 |
| cacti-hybx.d-bis.org | 192.168.11.251:80 | 5202 | r630-02 | nginx/apache2 |
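The backend map above can be swept quickly from a LAN host. The sketch below hard-codes one representative target per row of the table (edit the list if the map changes); it is an illustration, not one of the repo scripts:

```shell
#!/usr/bin/env bash
# Quick health sweep over the backend map (sketch; run from a LAN host).
# The list mirrors the table above -- edit it if the map changes.
backends=(
  "dbis-admin   192.168.11.130:80"
  "dbis-api     192.168.11.155:3000"
  "dbis-api-2   192.168.11.156:3000"
  "rpc-http-prv 192.168.11.211:8545"
  "mim4u        192.168.11.37:80"
  "cacti-alltra 192.168.11.177:80"
)
for entry in "${backends[@]}"; do
  name=${entry%% *}      # first field: label
  target=${entry##* }    # last field: ip:port
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "http://${target}/")
  printf '%-14s %-22s %s\n' "$name" "$target" "$code"
done
```

A `000` in the last column means the backend did not answer at all (container stopped or service not listening); see the fix steps below the table.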
## One-command: address all remaining 502s
From a host on the **LAN** (can reach NPMplus and Proxmox):
```bash
# Full flow: backends + NPMplus proxy update (if NPM_PASSWORD set) + RPC diagnostics
./scripts/maintenance/address-all-remaining-502s.sh
# Skip NPMplus update (e.g. no .env yet)
./scripts/maintenance/address-all-remaining-502s.sh --no-npm
# Also run Besu mass-fix (config + restart) and E2E at the end
./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix --e2e
```
This runs in order: (1) `fix-all-502s-comprehensive.sh`, (2) NPMplus proxy update when `NPM_PASSWORD` is set, (3) `diagnose-rpc-502s.sh` (saves report under `docs/04-configuration/verification-evidence/`), (4) optional `fix-all-besu-nodes.sh`, (5) optional E2E.
## Per-step diagnose and fix
From a host that can SSH to Proxmox (r630-01, r630-02, ml110):
```bash
# Comprehensive fix (DBIS 10130 Python, dbis-api, 2101, 2500-2505 Besu, Cacti Python)
./scripts/maintenance/fix-all-502s-comprehensive.sh
# RPC diagnostics only (2101, 2500-2505): ss -tlnp + journalctl, to file
./scripts/maintenance/diagnose-rpc-502s.sh | tee docs/04-configuration/verification-evidence/rpc-502-diagnostics.txt
# Diagnose only (no starts)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh --diagnose-only
# Apply fixes per-backend (start containers + nginx/node/besu)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh
```
The comprehensive fix script will:
- For each backend: SSH to the host, check `pct status <vmid>`, start container if stopped.
- If container is running: curl from host to backend IP:port; if 000/fail, run `systemctl start nginx` / `node` / `besu` as appropriate and show in-CT `ss -tlnp`.
- HYBX (2503-2505): if ML110 has no such VMID, try r630-01.
- Cacti: VMID 5200 (cacti-1), 5201 (cacti-alltra), 5202 (cacti-hybx) on r630-02 (migrated 2026-02-15).
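For a single backend, the check-and-start flow the bullets describe can be sketched as follows (the host/VMID/IP/service values are the dbis-admin row from the table; the script's actual internals may differ):

```shell
# Sketch of the per-backend check-and-start flow (dbis-admin row as example;
# the real script's internals may differ).
host=192.168.11.11 vmid=10130 ip=192.168.11.130 port=80 svc=nginx

# 1) Start the container if it is not running.
if ! ssh "root@${host}" "pct status ${vmid}" | grep -q running; then
  ssh "root@${host}" "pct start ${vmid}"; sleep 5
fi

# 2) Probe the backend from the Proxmox host; on 000/failure, start the
#    service inside the CT and show what is listening.
code=$(ssh "root@${host}" "curl -s -o /dev/null -m 5 -w '%{http_code}' http://${ip}:${port}/")
if [ "$code" = "000" ] || [ -z "$code" ]; then
  ssh "root@${host}" "pct exec ${vmid} -- systemctl start ${svc}"
  ssh "root@${host}" "pct exec ${vmid} -- ss -tlnp"
fi
```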
## Root cause summary (typical)
| 502 | Typical cause | Fix |
|-----|----------------|-----|
| dbis-admin, secure | Container 10130 stopped or nginx not running | `pct start 10130` on r630-01; inside CT: `systemctl start nginx` |
| dbis-api, dbis-api-2 | Containers 10150/10151 stopped or Node app not running | `pct start` on r630-01; inside CT: `systemctl start node` |
| rpc-http-prv | Container 2101 stopped or Besu not listening on 8545 | `pct start 2101`; inside CT: `systemctl start besu` (allow 30-60s) |
| rpc-alltra*, rpc-hybx* | Containers 2500-2505 stopped or Besu not running | Same: `pct start <vmid>`; inside CT: `systemctl start besu` |
| cacti-alltra, cacti-hybx, cacti-1 | 5200/5201/5202 stopped or web server not running | On r630-02: `pct start 5200/5201/5202`; inside CT: `systemctl start nginx` or `apache2` |
| mim4u.org, www/secure/training.mim4u.org | Container 7810 stopped or nothing on port 80 | On r630-02: `pct start 7810`; inside CT: `systemctl start nginx` or run python stub on 80 (see fix-all-502s-comprehensive.sh) |
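The "python stub" mentioned for mim4u can be as small as the sketch below. The exact stub in `fix-all-502s-comprehensive.sh` may differ; the response body here is made up for illustration:

```python
# Minimal placeholder HTTP server, similar in spirit to the python stub the
# comprehensive script can start on port 80 (exact stub may differ; the
# response body here is an assumption).
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"placeholder: backend under maintenance\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def run(port: int = 80) -> None:
    # Binding port 80 inside the CT requires root (or CAP_NET_BIND_SERVICE).
    HTTPServer(("0.0.0.0", port), StubHandler).serve_forever()
```

This only answers `GET /` with a 200 so the proxy stops returning 502; it is a stopgap until nginx is restored.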
### VMID 2400 (ThirdWeb RPC primary, 192.168.11.240)
**Host:** ml110 (192.168.11.10). Service: `besu-rpc` (config: `/etc/besu/config-rpc-thirdweb.toml`). Nginx on 443/80.
**Intermittent RPC timeouts:** If `eth_chainId` to :8545 sometimes fails, Besu may be hitting Vert.x **BlockedThreadChecker** (worker thread blocked >60s during heavy ops). **Fix applied:** In `/etc/systemd/system/besu-rpc.service`, `BESU_OPTS` was extended with `-Dvertx.options.blockedThreadCheckInterval=120000` (120s) so occasional slow operations (e.g. trace, compaction) don't trigger warnings as quickly. Restart: `pct exec 2400 -- systemctl restart besu-rpc.service`. After a restart, Besu may run **RocksDB compaction** before binding 8545; allow 5-15 minutes then re-check RPC. Config already has `host-allowlist=["*"]`. If the node is down, check: `pct exec 2400 -- journalctl -u besu-rpc -n 30` (look for "Compacting database" or "JSON-RPC service started").
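The fix above edited the unit file directly; expressed as a systemd drop-in (hypothetical path, same setting), it would look roughly like:

```ini
# /etc/systemd/system/besu-rpc.service.d/override.conf  (hypothetical drop-in)
[Service]
# Raise Vert.x's blocked-thread warning interval to 120s. Note a drop-in
# Environment= line replaces an earlier BESU_OPTS value for the same
# variable, so carry over any options the unit already sets.
Environment="BESU_OPTS=-Dvertx.options.blockedThreadCheckInterval=120000"
```

After editing, reload units before restarting: `systemctl daemon-reload && systemctl restart besu-rpc.service`.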
## If 502 persists after running the script
1. **Backends verified in-container but public still 502 (dbis-admin, secure, dbis-api, dbis-api-2):**
The origin (76.53.10.36) routes by hostname. Refresh NPMplus proxy targets from LAN so the proxy forwards to .130:80 and .155/.156:3000:
`NPM_PASSWORD=xxx ./scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh`
Then purge Cloudflare cache for those hostnames if needed.
2. **From the Proxmox host** (e.g. SSH to 192.168.11.11):
- `pct exec <vmid> -- ss -tlnp` — see what is listening.
- `pct exec <vmid> -- systemctl status nginx` (or `node`, `besu`) — check unit name and errors.
3. **NPMplus** must be able to reach the backend IP. From the NPMplus host: `curl -s -o /dev/null -w '%{http_code}' http://<backend_ip>:<port>/`.
4. **RPC (2101, 2500-2505):** If Besu still does not respond after 90s:
- Run `./scripts/maintenance/diagnose-rpc-502s.sh` and check the report (or `pct exec <vmid> -- ss -tlnp` and `journalctl -u besu-rpc` / `besu`).
- Fix config/nodekey/genesis per journal errors.
- Run `./scripts/besu/fix-all-besu-nodes.sh` from project root (optionally `--no-restart` first to only fix configs), or use `./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix`.
**Known infrastructure causes and fixes:**
- **2101:** If journal shows `NoClassDefFoundError: com.sun.jna.Native` or "JNA/Udev" or "Read-only file system" for JNA/libjnidispatch, run from project root (LAN):
`./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` — reinstalls Besu and sets JNA to use `/data/besu/tmp`. If the script exits with "Container … /tmp is not writable", make the CT writable (see **Read-only CT** below) then re-run.
The fix script also sets **p2p-host** in `/etc/besu/config-rpc.toml` to **192.168.11.211** (RPC_CORE_1). If 2101 had `p2p-host="192.168.11.250"` (RPC_ALLTRA_1), other nodes would see the wrong advertised address; correct node lists are in repo `config/besu-node-lists/static-nodes.json` and `permissions-nodes.toml` (2101 = .211).
- **2500-2505:** If journal shows "Failed to locate executable /opt/besu/bin/besu", install Besu in each CT:
`./scripts/besu/install-besu-permanent-on-missing-nodes.sh` — installs Besu (23.10.3) in 1505-1508 and 2500-2505 where missing, deploys config/genesis/node lists, enables and starts the service. Allow ~5-10 minutes per node. Use `--dry-run` to see which VMIDs would be updated. If install fails with "Read-only file system", make the CT writable first.
### VMID 2101: checklist of causes
When 2101 (Core RPC at 192.168.11.211) is down or crash-looping, check in order:
| Cause | What to check | Fix |
|-------|----------------|-----|
| **Read-only root (emergency_ro)** | `pct exec 2101 -- mount \| grep 'on / '` — if `ro` or `emergency_ro`, root is read-only (e.g. after ext4 errors). | Run `./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` (stops 2101, e2fsck on host, starts CT). Or on host: stop 2101, `e2fsck -f -y /dev/pve/vm-2101-disk-0`, start 2101. |
| **Wrong p2p-host** | `pct exec 2101 -- grep p2p-host /etc/besu/config-rpc.toml` — must be `192.168.11.211` (not .250). | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (it sets p2p-host to RPC_CORE_1). Or manually: `sed -i 's|p2p-host=.*|p2p-host="192.168.11.211"|' /etc/besu/config-rpc.toml` in CT. |
| **Static / permissioned node lists** | In CT: `/etc/besu/static-nodes.json` and `/etc/besu/permissions-nodes.toml` should list 2101 as `...@192.168.11.211:30303`. Repo: `config/besu-node-lists/`. | Deploy from repo: the fix script copies `static-nodes.json` and `permissions-nodes.toml` when present. Or run `./scripts/deploy-besu-node-lists-to-all.sh`. |
| **No space / RocksDB compaction** | Journal: "No space left on device" during "Compacting database". Host thin pool: `lvs` on r630-01. | Free thin pool (see **LVM thin pool full** below). If root was emergency_ro, fix that first; then restart `besu-rpc`. Optionally start with fresh `/data/besu` to resync. |
| **JNA / Besu binary** | Journal: `NoClassDefFoundError: com.sun.jna.Native` or missing `/opt/besu/bin/besu`. | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (reinstalls Besu in CT). |
After any fix: `pct exec 2101 -- systemctl restart besu-rpc` then wait ~60s and `curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' http://192.168.11.211:8545/`.
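The checklist can be walked in one pass from the r630-01 shell with something like the following sketch (unit and file names follow the table above):

```shell
# One-pass walk of the 2101 checklist (run on r630-01; sketch only).
vmid=2101
# 1) Read-only root?
pct exec "$vmid" -- mount | grep 'on / ' | grep -q '(ro[,)]' && echo "ROOT READ-ONLY"
# 2) p2p-host must be 192.168.11.211
pct exec "$vmid" -- grep p2p-host /etc/besu/config-rpc.toml
# 3) Node lists should reference .211:30303
pct exec "$vmid" -- grep -c '192.168.11.211:30303' /etc/besu/static-nodes.json
# 4) Thin pool usage on the host
lvs -o lv_name,data_percent,metadata_percent
# 5) Recent Besu journal (JNA / no-space / compaction errors)
pct exec "$vmid" -- journalctl -u besu-rpc -n 20 --no-pager
```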
### Read-only CT (2101, 2500-2505)
If fix or install scripts fail with **"Read-only file system"** (e.g. when creating files in `/root`, `/tmp`, or `/opt`), the container's root (or key mounts) is read-only. Besu/JNA also needs a writable `java.io.tmpdir` (e.g. `/data/besu/tmp`); the install and fix scripts set that when they can write to the CT.
**Make all RPC VMIDs writable in one go (from project root, LAN):**
`./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` — SSHs to r630-01 and for each of 2101, 2500-2505 runs: stop CT, e2fsck -f -y on rootfs LV, start CT. Then re-run the fix or install script. The full maintenance runner (`run-all-maintenance-via-proxmox-ssh.sh`) runs this step first automatically.
**Make a single CT writable (from the Proxmox host):**
1. **Check mount:** `pct exec <vmid> -- mount | grep 'on / '` — if you see `ro,` then root is mounted read-only.
2. **Remount from inside (if allowed):** `pct exec <vmid> -- mount -o remount,rw /`
If that fails (e.g. "Operation not permitted"), the CT may be running with a read-only rootfs by design.
3. **From the host:** Inspect the CT config: `pct config <vmid>`. If `rootfs` has an option making it read-only, remove or change it (Proxmox UI: CT → Hardware → Root disk; or `pct set <vmid> --rootfs <storage>:<size>` to recreate only if you have a backup).
4. **Alternative:** Ensure at least `/tmp` and `/opt` are writable (e.g. bind-mount writable storage or tmpfs for `/tmp`). Then re-run the fix/install script.
After the CT is writable, run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (2101) or `./scripts/besu/install-besu-permanent-on-missing-nodes.sh` (2500-2505) again.
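The single-CT equivalent of the bulk script's stop/e2fsck/start cycle can be sketched as follows (the LV path assumes the `pve/vm-<vmid>-disk-0` naming used elsewhere in this doc):

```shell
# Per-CT stop / fsck / start cycle (sketch of what
# make-rpc-vmids-writable-via-ssh.sh does per VMID on r630-01).
vmid=2101
lv="/dev/pve/vm-${vmid}-disk-0"   # rootfs LV; naming per this doc
pct stop "$vmid"
e2fsck -f -y "$lv"
pct start "$vmid"
pct exec "$vmid" -- mount | grep 'on / '   # confirm rw
```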
### LVM thin pool full (2101 / 2500-2505 "No space left on device")
If Besu fails with **"No space left on device"** on `/data/besu/database/*.dbtmp` while `df` inside the CT shows free space, the **host** LVM thin pool is full. The CT's disk is thin-provisioned; writes fail when the pool has no free space.
**Check on the Proxmox host (e.g. r630-01):**
```bash
lvs -o lv_name,data_percent,metadata_percent # data at 100% = pool full
```
**Fix:** Free space in the thin pool on that host:
- Remove or shrink unused CT/VM disks, or move VMs to another storage.
- Optionally expand the thin pool (add PV or resize).
- After freeing space, restart the affected service: `pct exec <vmid> -- systemctl restart besu-rpc` (or `besu`).
Until the pool has free space, Besu on 2101 (and any other CT on that host that does large writes) will keep failing with "No space left on device".
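A non-destructive first step that often recovers some pool space is an fstrim sweep over all running CTs, roughly:

```shell
# fstrim every running CT on the host, then re-check thin pool usage
# (sketch; non-destructive first step before removing or migrating disks).
for vmid in $(pct list | awk 'NR>1 && $2=="running"{print $1}'); do
  echo "== fstrim CT ${vmid}"
  pct fstrim "$vmid" || true
done
lvs -o lv_name,data_percent,metadata_percent
```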
**2026-02-15 actions on r630-01:** Ran `fstrim` in all running CTs (pool 100% → 98.33%). Destroyed six **stopped** CTs to free thin pool space: **106, 107, 108, 10000, 10001, 10020** (purge). Migrated **5200-5202, 6000-6002, 6400-6402, 5700** to r630-02. Pool **74.48%**. If 2101 still crash-loops during RocksDB compaction, retry `systemctl restart besu-rpc` or start Besu with a fresh `/data/besu` (resync). See [MIGRATE_CT_R630_01_TO_R630_02.md](../03-deployment/MIGRATE_CT_R630_01_TO_R630_02.md).
## Re-run E2E after fixes
```bash
./scripts/verify/verify-end-to-end-routing.sh
```
Report: `docs/04-configuration/verification-evidence/e2e-verification-<timestamp>/verification_report.md`.
To allow exit 0 when only 502s remain (e.g. CI):
`E2E_ACCEPT_502_INTERNAL=1 ./scripts/verify/verify-end-to-end-routing.sh`
**See also:** [NEXT_STEPS_FOR_YOU.md](NEXT_STEPS_FOR_YOU.md) §3 (LAN steps), [STEPS_FROM_PROXMOX_OR_LAN_WITH_SECRETS.md](STEPS_FROM_PROXMOX_OR_LAN_WITH_SECRETS.md) §3 (fix 502s), [NEXT_STEPS_OPERATOR.md](NEXT_STEPS_OPERATOR.md) (quick commands).