502 Deep Dive: Root Causes and Fixes
Last updated: 2026-02-14
This document maps each E2E 502 to its backend, root cause, and fix. Use it from the LAN, with SSH access to the Proxmox hosts.
Full maintenance (all RPC + 502 in one run)
From project root on LAN (SSH to r630-01, ml110, r630-02):
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
This runs in order: (0) make RPC VMIDs 2101, 2500–2505 writable (e2fsck); (1) resolve-and-fix (Dev VM IP, start containers, DBIS); (2) fix 2101 JNA reinstall; (3) install Besu on missing nodes (2500–2505, 1505–1508); (4) address-all-502s (backends + NPM + RPC diagnostics); (5) E2E verification. Use --verbose to see all step output; STEP2_TIMEOUT=0 to disable step-2 timeout. See MAINTENANCE_SCRIPTS_REVIEW.md and CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md §9.
Backend map (domain → IP:port → VMID, host)
| Domain(s) | Backend | VMID | Proxmox host | Service to start |
|---|---|---|---|---|
| dbis-admin.d-bis.org, secure.d-bis.org | 192.168.11.130:80 | 10130 | r630-01 (192.168.11.11) | nginx |
| dbis-api.d-bis.org, dbis-api-2.d-bis.org | 192.168.11.155:3000, .156:3000 | 10150, 10151 | r630-01 | node |
| rpc-http-prv.d-bis.org, rpc-ws-prv.d-bis.org | 192.168.11.211:8545/8546 | 2101 | r630-01 | besu |
| mim4u.org, www.mim4u.org, secure.mim4u.org, training.mim4u.org | 192.168.11.37:80 | 7810 | r630-02 (192.168.11.12) | nginx (or python stub in fix-all-502s-comprehensive.sh) |
| rpc-alltra*.d-bis.org (3) | 192.168.11.172/173/174:8545 | 2500, 2501, 2502 | r630-01 | besu |
| rpc-hybx*.d-bis.org (3) | 192.168.11.246/247/248:8545 | 2503, 2504, 2505 | r630-01 or ml110 | besu |
| cacti-1 (if proxied) | 192.168.11.80:80 | 5200 | r630-02 | nginx/apache2 |
| cacti-alltra.d-bis.org | 192.168.11.177:80 | 5201 | r630-02 | nginx/apache2 |
| cacti-hybx.d-bis.org | 192.168.11.251:80 | 5202 | r630-02 | nginx/apache2 |
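The backend map above can be probed in one pass from a LAN host. A minimal sketch; the three targets shown are abridged from the table (extend the list with the remaining rows):

```shell
#!/bin/sh
# Probe each backend and print "ip:port http_code".
# curl's %{http_code} is 000 when there is no TCP/HTTP answer.
check_backend() {
  target="$1"
  code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "http://${target}/" 2>/dev/null)
  printf '%s %s\n' "$target" "${code:-000}"
}

# Abridged from the backend map above.
for b in 192.168.11.130:80 192.168.11.155:3000 192.168.11.211:8545; do
  check_backend "$b"
done
```

Any line ending in `000` is a candidate for the per-step fixes below; anything else means the backend answered and the 502 is upstream (NPMplus or Cloudflare).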
One-command: address all remaining 502s
From a host on the LAN (can reach NPMplus and Proxmox):
# Full flow: backends + NPMplus proxy update (if NPM_PASSWORD set) + RPC diagnostics
./scripts/maintenance/address-all-remaining-502s.sh
# Skip NPMplus update (e.g. no .env yet)
./scripts/maintenance/address-all-remaining-502s.sh --no-npm
# Also run Besu mass-fix (config + restart) and E2E at the end
./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix --e2e
This runs in order: (1) fix-all-502s-comprehensive.sh, (2) NPMplus proxy update when NPM_PASSWORD is set, (3) diagnose-rpc-502s.sh (saves report under docs/04-configuration/verification-evidence/), (4) optional fix-all-besu-nodes.sh, (5) optional E2E.
Per-step diagnose and fix
From a host that can SSH to Proxmox (r630-01, r630-02, ml110):
# Comprehensive fix (DBIS 10130 Python, dbis-api, 2101, 2500-2505 Besu, Cacti Python)
./scripts/maintenance/fix-all-502s-comprehensive.sh
# RPC diagnostics only (2101, 2500-2505): ss -tlnp + journalctl, to file
./scripts/maintenance/diagnose-rpc-502s.sh | tee docs/04-configuration/verification-evidence/rpc-502-diagnostics.txt
# Diagnose only (no starts)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh --diagnose-only
# Apply fixes per-backend (start containers + nginx/node/besu)
./scripts/maintenance/diagnose-and-fix-502s-via-ssh.sh
The comprehensive fix script will:
- For each backend: SSH to the host, check `pct status <vmid>`, and start the container if stopped.
- If the container is running: `curl` from the host to the backend IP:port; if 000/fail, run `systemctl start nginx/node/besu` as appropriate and show in-CT `ss -tlnp`.
- HYBX (2503–2505): if ml110 has no such VMID, try r630-01.
- Cacti: VMID 5200 (cacti-1), 5201 (cacti-alltra), 5202 (cacti-hybx) on r630-02 (migrated 2026-02-15).
Root cause summary (typical)
| 502 | Typical cause | Fix |
|---|---|---|
| dbis-admin, secure | Container 10130 stopped or nginx not running | pct start 10130 on r630-01; inside CT: systemctl start nginx |
| dbis-api, dbis-api-2 | Containers 10150/10151 stopped or Node app not running | pct start on r630-01; inside CT: systemctl start node |
| rpc-http-prv | Container 2101 stopped or Besu not listening on 8545 | pct start 2101; inside CT: systemctl start besu (allow 30–60s) |
| rpc-alltra*, rpc-hybx* | Containers 2500–2505 stopped or Besu not running | Same: pct start <vmid>; inside CT: systemctl start besu |
| cacti-alltra, cacti-hybx, cacti-1 | 5200/5201/5202 stopped or web server not running | On r630-02: pct start 5200/5201/5202; inside CT: systemctl start nginx or apache2 |
| mim4u.org, www/secure/training.mim4u.org | Container 7810 stopped or nothing on port 80 | On r630-02: pct start 7810; inside CT: systemctl start nginx or run python stub on 80 (see fix-all-502s-comprehensive.sh) |
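Every row in the table follows the same decision: start the CT if it is stopped, otherwise start the service inside it. A sketch of that logic, using the `status: running` / `status: stopped` strings that `pct status` prints (the returned action labels are illustrative, not script output):

```shell
# Decide the next fix action from `pct status <vmid>` output and the
# HTTP code from curling the backend.
decide_fix() {
  pct_status="$1"   # e.g. "status: stopped"
  http_code="$2"    # e.g. 000 or 200
  case "$pct_status" in
    *stopped*) echo "pct start" ;;
    *running*)
      if [ "$http_code" = "000" ]; then
        # nginx, node, or besu per the root-cause table above.
        echo "systemctl start service"
      else
        echo "ok"
      fi ;;
    *) echo "unknown" ;;
  esac
}
```

This is the pattern `fix-all-502s-comprehensive.sh` applies per backend, as described in the per-step section above.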
VMID 2400 (ThirdWeb RPC primary, 192.168.11.240)
Host: ml110 (192.168.11.10). Service: besu-rpc (config: /etc/besu/config-rpc-thirdweb.toml). Nginx on 443/80.
- Intermittent RPC timeouts: if eth_chainId to :8545 sometimes fails, Besu may be hitting the Vert.x BlockedThreadChecker (a worker thread blocked >60s during heavy ops).
- Fix applied: in /etc/systemd/system/besu-rpc.service, BESU_OPTS was extended with `-Dvertx.options.blockedThreadCheckInterval=120000` (120s) so occasional slow operations (e.g. trace, compaction) don't trigger warnings as quickly.
- Restart: `pct exec 2400 -- systemctl restart besu-rpc.service`. After a restart, Besu may run RocksDB compaction before binding 8545; allow 5–15 minutes, then re-check RPC. Config already has `host-allowlist=["*"]`.
- If the node is down, check: `pct exec 2400 -- journalctl -u besu-rpc -n 30` (look for "Compacting database" or "JSON-RPC service started").
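One way to persist the BESU_OPTS change is a systemd drop-in rather than editing the unit file. A sketch (the drop-in directory and file name are illustrative; it assumes no other BESU_OPTS flags need to be preserved — if the unit already sets other flags, merge them into the same Environment line):

```shell
#!/bin/sh
# Write a systemd drop-in carrying the Vert.x flag described above.
# The target dir is a parameter so this can be exercised outside the CT;
# on 2400 it would be /etc/systemd/system/besu-rpc.service.d (as root).
write_vertx_dropin() {
  dir="$1"
  mkdir -p "$dir"
  cat > "$dir/vertx.conf" <<'EOF'
[Service]
Environment="BESU_OPTS=-Dvertx.options.blockedThreadCheckInterval=120000"
EOF
}

demo_dir=$(mktemp -d)
write_vertx_dropin "$demo_dir"
cat "$demo_dir/vertx.conf"
# Inside the CT you would then run:
#   systemctl daemon-reload && systemctl restart besu-rpc.service
```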
If 502 persists after running the script
- Backends verified in-container but public still 502 (dbis-admin, secure, dbis-api, dbis-api-2): the origin (76.53.10.36) routes by hostname. Refresh NPMplus proxy targets from the LAN so the proxy forwards to 130:80 and 155/156:3000:
  `NPM_PASSWORD=xxx ./scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh`
  Then purge the Cloudflare cache for those hostnames if needed.
- From the Proxmox host (e.g. SSH to 192.168.11.11): `pct exec <vmid> -- ss -tlnp` — see what is listening; `pct exec <vmid> -- systemctl status nginx` (or `node`, `besu`) — check the unit name and errors.
- NPMplus must be able to reach the backend IP. From the NPMplus host: `curl -s -o /dev/null -w '%{http_code}' http://<backend_ip>:<port>/`.
- RPC (2101, 2500–2505): if Besu still does not respond after 90s:
  - Run `./scripts/maintenance/diagnose-rpc-502s.sh` and check the report (or `pct exec <vmid> -- ss -tlnp` and `journalctl -u besu-rpc`/`besu`).
  - Fix config/nodekey/genesis per journal errors.
  - Run `./scripts/besu/fix-all-besu-nodes.sh` from the project root (optionally `--no-restart` first to only fix configs), or use `./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix`.
Known infrastructure causes and fixes:
- 2101: If the journal shows `NoClassDefFoundError: com.sun.jna.Native`, "JNA/Udev", or "Read-only file system" for JNA/libjnidispatch, run from the project root (LAN): `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` — reinstalls Besu and sets JNA to use `/data/besu/tmp`. If the script exits with "Container … /tmp is not writable", make the CT writable (see Read-only CT below), then re-run. The fix script also sets `p2p-host` in `/etc/besu/config-rpc.toml` to 192.168.11.211 (RPC_CORE_1). If 2101 had `p2p-host="192.168.11.250"` (RPC_ALLTRA_1), other nodes would see the wrong advertised address; the correct node lists are in the repo at `config/besu-node-lists/static-nodes.json` and `permissions-nodes.toml` (2101 = .211).
- 2500–2505: If the journal shows "Failed to locate executable /opt/besu/bin/besu", install Besu in each CT: `./scripts/besu/install-besu-permanent-on-missing-nodes.sh` — installs Besu (23.10.3) in 1505–1508 and 2500–2505 where missing, deploys config/genesis/node lists, and enables and starts the service. Allow ~5–10 minutes per node. Use `--dry-run` to see which VMIDs would be updated. If the install fails with "Read-only file system", make the CT writable first.
VMID 2101: checklist of causes
When 2101 (Core RPC at 192.168.11.211) is down or crash-looping, check in order:
| Cause | What to check | Fix |
|---|---|---|
| Read-only root (emergency_ro) | `pct exec 2101 -- mount \| grep 'on / '` — if `ro` or `emergency_ro`, root is read-only (e.g. after ext4 errors). | Run `./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` (stops 2101, runs `e2fsck` on the host, starts the CT). Or on the host: stop 2101, `e2fsck -f -y /dev/pve/vm-2101-disk-0`, start 2101. |
| Wrong p2p-host | `pct exec 2101 -- grep p2p-host /etc/besu/config-rpc.toml` — must be 192.168.11.211 (not .250). | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (it sets `p2p-host` to RPC_CORE_1), or correct the value manually with `sed`. |
| Static / permissioned node lists | In the CT, `/etc/besu/static-nodes.json` and `/etc/besu/permissions-nodes.toml` should list 2101 as `...@192.168.11.211:30303`. Repo: `config/besu-node-lists/`. | Deploy from the repo: the fix script copies `static-nodes.json` and `permissions-nodes.toml` when present, or run `./scripts/deploy-besu-node-lists-to-all.sh`. |
| No space / RocksDB compaction | Journal: "No space left on device" during "Compacting database". Host thin pool: `lvs` on r630-01. | Free the thin pool (see LVM thin pool full below). If root was emergency_ro, fix that first; then restart besu-rpc. Optionally start with a fresh /data/besu to resync. |
| JNA / Besu binary | Journal: `NoClassDefFoundError: com.sun.jna.Native` or missing /opt/besu/bin/besu. | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (reinstalls Besu in the CT). |
After any fix: pct exec 2101 -- systemctl restart besu-rpc then wait ~60s and curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' http://192.168.11.211:8545/.
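The restart-then-wait step can be wrapped in a poll loop so you do not re-check by hand. A minimal sketch (the endpoint, payload, and ~60s wait come from the text above; the loop itself is illustrative):

```shell
#!/bin/sh
# Poll a Besu JSON-RPC endpoint with eth_chainId until it answers
# or attempts run out. Defaults: 12 attempts x 5s = ~60s.
wait_for_chain_id() {
  url="$1"; attempts="${2:-12}"; delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    resp=$(curl -s -m 5 -X POST -H 'Content-Type: application/json' \
      -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' "$url")
    case "$resp" in
      *'"result"'*) echo "$resp"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
# Example (2101): wait_for_chain_id http://192.168.11.211:8545/
```

A non-zero exit after the full window means Besu still is not serving RPC; go back to the checklist rows above (read-only root, p2p-host, thin pool, JNA).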
Read-only CT (2101, 2500–2505)
If fix or install scripts fail with "Read-only file system" (e.g. when creating files in /root, /tmp, or /opt), the container’s root (or key mounts) are read-only. Besu/JNA also needs a writable java.io.tmpdir (e.g. /data/besu/tmp); the install and fix scripts set that when they can write to the CT.
Make all RPC VMIDs writable in one go (from project root, LAN):
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh — SSHs to r630-01 and for each of 2101, 2500–2505 runs: stop CT, e2fsck -f -y on rootfs LV, start CT. Then re-run the fix or install script. The full maintenance runner (run-all-maintenance-via-proxmox-ssh.sh) runs this step first automatically.
Make a single CT writable (from the Proxmox host):
- Check the mount: `pct exec <vmid> -- mount | grep 'on / '` — if you see `ro`, root is mounted read-only.
- Remount from inside (if allowed): `pct exec <vmid> -- mount -o remount,rw /`. If that fails (e.g. "Operation not permitted"), the CT may be running with a read-only rootfs by design.
- From the host: inspect the CT config with `pct config <vmid>`. If `rootfs` has an option making it read-only, remove or change it (Proxmox UI: CT → Hardware → Root disk; or `pct set <vmid> --rootfs <storage>:<size>` to recreate, only if you have a backup).
- Alternative: ensure at least `/tmp` and `/opt` are writable (e.g. bind-mount writable storage or tmpfs for `/tmp`). Then re-run the fix/install script.
After the CT is writable, run ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh (2101) or ./scripts/besu/install-besu-permanent-on-missing-nodes.sh (2500–2505) again.
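The per-CT stop/fsck/start sequence run by `make-rpc-vmids-writable-via-ssh.sh` can be sketched as a dry-run helper that prints the commands instead of executing them. The LV path pattern `/dev/pve/vm-<vmid>-disk-0` follows the 2101 example above and is an assumption for the other VMIDs (check `pct config <vmid>` for the real rootfs volume):

```shell
#!/bin/sh
# Print (dry run) the stop / e2fsck / start sequence for one CT,
# as it would be run on the Proxmox host.
make_ct_writable_cmds() {
  vmid="$1"
  echo "pct stop $vmid"
  echo "e2fsck -f -y /dev/pve/vm-${vmid}-disk-0"   # assumed LV path pattern
  echo "pct start $vmid"
}

# The RPC VMIDs covered by the section above.
for v in 2101 2500 2501 2502 2503 2504 2505; do
  make_ct_writable_cmds "$v"
done
```

Piping the output through `ssh root@192.168.11.11 sh` would execute it on r630-01, but the existing script is the supported path.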
LVM thin pool full (2101 / 2500–2505 "No space left on device")
If Besu fails with "No space left on device" on /data/besu/database/*.dbtmp while df inside the CT shows free space, the host LVM thin pool is full. The CT’s disk is thin-provisioned; writes fail when the pool has no free space.
Check on the Proxmox host (e.g. r630-01):
lvs -o lv_name,data_percent,metadata_percent # data at 100% = pool full
Fix: Free space in the thin pool on that host:
- Remove or shrink unused CT/VM disks, or move VMs to another storage.
- Optionally expand the thin pool (add PV or resize).
- After freeing space, restart the affected service: `pct exec <vmid> -- systemctl restart besu-rpc` (or `besu`).
Until the pool has free space, Besu on 2101 (and any other CT on that host that does large writes) will keep failing with "No space left on device".
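The `lvs` check above can be turned into a simple threshold test for monitoring. A sketch that takes the `data_percent` column value (e.g. from `lvs --noheadings -o data_percent <vg>/<pool>`); the 95% threshold is illustrative, not a documented limit:

```shell
#!/bin/sh
# Return 0 (pool critically full) when data_percent is at or above
# the threshold. lvs prints values like "100.00"; compare the integer part.
pool_is_full() {
  pct="$1"; threshold="${2:-95}"
  [ "${pct%%.*}" -ge "$threshold" ]
}

# Example: pool_is_full "$(lvs --noheadings -o data_percent pve/data)" && \
#   echo "thin pool full: free space before restarting besu"
```

With the 2026-02-15 numbers below, 100% and 98.33% would trip this check and 74.48% would not.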
2026-02-15 actions on r630-01: Ran fstrim in all running CTs (pool 100% → 98.33%). Destroyed six stopped CTs to free thin pool space: 106, 107, 108, 10000, 10001, 10020 (purge). Migrated 5200–5202, 6000–6002, 6400–6402, 5700 to r630-02. Pool 74.48%. If 2101 still crash-loops during RocksDB compaction, retry systemctl restart besu-rpc or start Besu with a fresh /data/besu (resync). See MIGRATE_CT_R630_01_TO_R630_02.md.
Re-run E2E after fixes
./scripts/verify/verify-end-to-end-routing.sh
Report: docs/04-configuration/verification-evidence/e2e-verification-<timestamp>/verification_report.md.
To allow exit 0 when only 502s remain (e.g. CI):
E2E_ACCEPT_502_INTERNAL=1 ./scripts/verify/verify-end-to-end-routing.sh
See also: NEXT_STEPS_FOR_YOU.md §3 (LAN steps), STEPS_FROM_PROXMOX_OR_LAN_WITH_SECRETS.md §3 (fix 502s), NEXT_STEPS_OPERATOR.md (quick commands).