This runs in order: (0) make RPC VMIDs 2101, 2500–2505 writable (e2fsck); (1) resolve-and-fix (Dev VM IP, start containers, DBIS); (2) fix 2101 JNA reinstall; (3) install Besu on missing nodes (2500–2505, 1505–1508); (4) address-all-502s (backends + NPM + RPC diagnostics); (5) E2E verification. Use `--verbose` to see all step output; `STEP2_TIMEOUT=0` to disable step-2 timeout. See [MAINTENANCE_SCRIPTS_REVIEW.md](MAINTENANCE_SCRIPTS_REVIEW.md) and [CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md](../05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md) §9.
## Backend map (domain → IP:port → VMID, host)
| Domain(s) | Backend | VMID | Proxmox host | Service to start |
|-----------|---------|------|--------------|------------------|
- For each backend: SSH to the host, check `pct status <vmid>`, start container if stopped.
- If container is running: curl from host to backend IP:port; if 000/fail, run `systemctl start nginx` / `node` / `besu` as appropriate and show in-CT `ss -tlnp`.
- HYBX (2503–2505): if ML110 has no such VMID, try r630-01.
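The per-backend checks above can be sketched as a loop. This is a sketch only: the `vmid ip port` tuples and the `DRY_RUN` flag are illustrative placeholders, not values or options from the real maintenance scripts.

```shell
#!/usr/bin/env bash
# Per-backend check sketch: start the CT if stopped, then curl its IP:port.
# DRY_RUN=1 (default here) records the commands instead of executing them,
# so the sequence can be reviewed before running on a Proxmox host.
DRY_RUN="${DRY_RUN:-1}"
PLAN=""

run() {
  if [ "$DRY_RUN" = 1 ]; then
    PLAN+="$*"$'\n'
  else
    "$@"
  fi
}

# Illustrative "vmid ip port" rows -- substitute the real backend map.
while read -r vmid ip port; do
  [ -z "$vmid" ] && continue
  run pct status "$vmid"                                   # is the CT running?
  run pct start "$vmid"                                    # no-op if already up
  run curl -s -o /dev/null -w '%{http_code}' "http://$ip:$port/"
done <<'EOF'
5200 192.168.11.200 80
7810 192.168.11.78 80
EOF

printf '%s' "$PLAN"
```

Run with `DRY_RUN=0` on the Proxmox host itself once the printed plan looks right.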
| Domain(s) | Likely cause | Fix |
|-----------|--------------|-----|
| cacti-alltra, cacti-hybx, cacti-1 | 5200/5201/5202 stopped or web server not running | On r630-02: `pct start 5200/5201/5202`; inside CT: `systemctl start nginx` or `apache2` |
| mim4u.org, www/secure/training.mim4u.org | Container 7810 stopped or nothing on port 80 | On r630-02: `pct start 7810`; inside CT: `systemctl start nginx` or run python stub on 80 (see fix-all-502s-comprehensive.sh) |
**Host:** ml110 (192.168.11.10). Service: `besu-rpc` (config: `/etc/besu/config-rpc-thirdweb.toml`). Nginx on 443/80.
**Intermittent RPC timeouts:** If `eth_chainId` to :8545 sometimes fails, Besu may be hitting Vert.x **BlockedThreadChecker** (worker thread blocked >60s during heavy ops). **Fix applied:** In `/etc/systemd/system/besu-rpc.service`, `BESU_OPTS` was extended with `-Dvertx.options.blockedThreadCheckInterval=120000` (120s) so occasional slow operations (e.g. trace, compaction) don’t trigger warnings as quickly. Restart: `pct exec 2400 -- systemctl restart besu-rpc.service`. After a restart, Besu may run **RocksDB compaction** before binding 8545; allow 5–15 minutes then re-check RPC. Config already has `host-allowlist=["*"]`. If the node is down, check: `pct exec 2400 -- journalctl -u besu-rpc -n 30` (look for "Compacting database" or "JSON-RPC service started").
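For reference, the same `BESU_OPTS` change can be carried as a systemd drop-in instead of editing the unit file in place. This is a sketch; note that `Environment=` here replaces any existing `BESU_OPTS` value, so merge in whatever flags the unit already sets before applying.

```ini
# /etc/systemd/system/besu-rpc.service.d/override.conf  (sketch)
[Service]
# Check for blocked Vert.x worker threads every 120 s instead of the
# default, so slow trace calls / RocksDB compaction warn less aggressively.
Environment="BESU_OPTS=-Dvertx.options.blockedThreadCheckInterval=120000"
```

Apply with `pct exec 2400 -- systemctl daemon-reload`, then restart `besu-rpc` as above.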
## If 502 persists after running the script
1. **Backends verified in-container but public still 502 (dbis-admin, secure, dbis-api, dbis-api-2):**
   The origin (76.53.10.36) routes by hostname. Refresh the NPMplus proxy targets from the LAN so the proxy forwards to 130:80 and 155/156:3000.
   Then purge the Cloudflare cache for those hostnames if needed.
2. **From the Proxmox host** (e.g. SSH to 192.168.11.11):
   - `pct exec <vmid> -- ss -tlnp` — see what is listening.
   - `pct exec <vmid> -- systemctl status nginx` (or `node`, `besu`) — check unit name and errors.
3. **NPMplus** must be able to reach the backend IP. From the NPMplus host: `curl -s -o /dev/null -w '%{http_code}' http://<backend_ip>:<port>/`.
4. **RPC (2101, 2500–2505):** If Besu still does not respond after 90s:
- Run `./scripts/maintenance/diagnose-rpc-502s.sh` and check the report (or `pct exec <vmid> -- ss -tlnp` and `journalctl -u besu-rpc` / `besu`).
- Fix config/nodekey/genesis per journal errors.
- Run `./scripts/besu/fix-all-besu-nodes.sh` from project root (optionally `--no-restart` first to only fix configs), or use `./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix`.
**Known infrastructure causes and fixes:**
- **2101:** If journal shows `NoClassDefFoundError: com.sun.jna.Native` or "JNA/Udev" or "Read-only file system" for JNA/libjnidispatch, run from project root (LAN):
`./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` — reinstalls Besu and sets JNA to use `/data/besu/tmp`. If the script exits with "Container … /tmp is not writable", make the CT writable (see **Read-only CT** below) then re-run.
The fix script also sets **p2p-host** in `/etc/besu/config-rpc.toml` to **192.168.11.211** (RPC_CORE_1). If 2101 had `p2p-host="192.168.11.250"` (RPC_ALLTRA_1), other nodes would see the wrong advertised address; correct node lists are in repo `config/besu-node-lists/static-nodes.json` and `permissions-nodes.toml` (2101 = .211).
- **2500–2505:** If journal shows "Failed to locate executable /opt/besu/bin/besu", install Besu in each CT:
`./scripts/besu/install-besu-permanent-on-missing-nodes.sh` — installs Besu (23.10.3) in 1505–1508 and 2500–2505 where missing, deploys config/genesis/node lists, enables and starts the service. Allow ~5–10 minutes per node. Use `--dry-run` to see which VMIDs would be updated. If install fails with "Read-only file system", make the CT writable first.
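The cause→fix mapping above (and in the 2101 checklist below) can be sketched as a classifier over captured journal output. `classify_besu_failure` is illustrative and not one of the repo scripts; the patterns and script paths mirror this document.

```shell
#!/usr/bin/env bash
# Sketch: map a besu journal excerpt to the matching repair action.
classify_besu_failure() {
  local log="$1"
  case "$log" in
    *"NoClassDefFoundError: com.sun.jna.Native"*|*libjnidispatch*)
      echo "./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh" ;;
    *"Failed to locate executable /opt/besu/bin/besu"*)
      echo "./scripts/besu/install-besu-permanent-on-missing-nodes.sh" ;;
    *"Read-only file system"*)
      echo "./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh" ;;
    *"No space left on device"*)
      echo "free the host LVM thin pool, then restart besu-rpc" ;;
    *)
      echo "unknown -- inspect journalctl -u besu-rpc manually" ;;
  esac
}

# Example against a captured journal line (a literal string here):
classify_besu_failure "java.lang.NoClassDefFoundError: com.sun.jna.Native"
```

In practice you would feed it something like `pct exec <vmid> -- journalctl -u besu-rpc -n 30`.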
### VMID 2101: checklist of causes
When 2101 (Core RPC at 192.168.11.211) is down or crash-looping, check in order:
| Cause | What to check | Fix |
|-------|----------------|-----|
| **Read-only root (emergency_ro)** | `pct exec 2101 -- mount \| grep 'on / '` — if `ro` or `emergency_ro`, root is read-only (e.g. after ext4 errors). | Run `./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` (stops 2101, e2fsck on host, starts CT). Or on host: stop 2101, `e2fsck -f -y /dev/pve/vm-2101-disk-0`, start 2101. |
| **Wrong p2p-host** | `pct exec 2101 -- grep p2p-host /etc/besu/config-rpc.toml` — must be `192.168.11.211` (not .250). | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (it sets p2p-host to RPC_CORE_1). Or manually: `sed -i 's|p2p-host=.*|p2p-host="192.168.11.211"|' /etc/besu/config-rpc.toml` in CT. |
| **Static / permissioned node lists** | In CT: `/etc/besu/static-nodes.json` and `/etc/besu/permissions-nodes.toml` should list 2101 as `...@192.168.11.211:30303`. Repo: `config/besu-node-lists/`. | Deploy from repo: the fix script copies `static-nodes.json` and `permissions-nodes.toml` when present. Or run `./scripts/deploy-besu-node-lists-to-all.sh`. |
| **No space / RocksDB compaction** | Journal: "No space left on device" during "Compacting database". Host thin pool: `lvs` on r630-01. | Free thin pool (see **LVM thin pool full** below). If root was emergency_ro, fix that first; then restart `besu-rpc`. Optionally start with fresh `/data/besu` to resync. |
| **JNA / Besu binary** | Journal: `NoClassDefFoundError: com.sun.jna.Native` or missing `/opt/besu/bin/besu`. | Run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (reinstalls Besu in CT). |
After any fix: `pct exec 2101 -- systemctl restart besu-rpc` then wait ~60s and `curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' http://192.168.11.211:8545/`.
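Rather than a fixed ~60 s wait, you can poll the RPC port until it accepts connections. This is a sketch using bash's built-in `/dev/tcp` (no curl needed); the host, port, and retry counts are illustrative.

```shell
#!/usr/bin/env bash
# Poll host:port until a TCP connect succeeds, or give up after N tries.
wait_for_port() {
  local host="$1" port="$2" tries="${3:-30}" delay="${4:-2}" i
  for ((i = 1; i <= tries; i++)); do
    # The subshell opens (and on exit closes) the probe connection.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still down after $tries attempts"
  return 1
}

# Example: after `pct exec 2101 -- systemctl restart besu-rpc`,
# wait up to ~90 s before sending the eth_chainId curl shown above:
#   wait_for_port 192.168.11.211 8545 45 2
```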
### Read-only CT (2101, 2500–2505)
If fix or install scripts fail with **"Read-only file system"** (e.g. when creating files in `/root`, `/tmp`, or `/opt`), the container’s root (or key mounts) are read-only. Besu/JNA also needs a writable `java.io.tmpdir` (e.g. `/data/besu/tmp`); the install and fix scripts set that when they can write to the CT.
**Make all RPC VMIDs writable in one go (from project root, LAN):**
`./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` — SSHs to r630-01 and for each of 2101, 2500–2505 runs: stop CT, e2fsck -f -y on rootfs LV, start CT. Then re-run the fix or install script. The full maintenance runner (`run-all-maintenance-via-proxmox-ssh.sh`) runs this step first automatically.
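What that runner does per container can be sketched as the following host-side plan. A sketch only: the `/dev/pve/vm-<vmid>-disk-0` LV naming follows the 2101 example in the checklist above and may differ with your storage layout.

```shell
#!/usr/bin/env bash
# Build the stop -> fsck -> start sequence for each RPC container,
# as run on r630-01 by make-rpc-vmids-writable-via-ssh.sh.
PLAN=""
for vmid in 2101 2500 2501 2502 2503 2504 2505; do
  PLAN+="pct stop $vmid
e2fsck -f -y /dev/pve/vm-$vmid-disk-0
pct start $vmid
"
done
printf '%s' "$PLAN"   # review, then execute line by line on the host
```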
**Make a single CT writable (from the Proxmox host):**
1. **Check mount:** `pct exec <vmid> -- mount | grep 'on / '` — if you see `ro,`, root is mounted read-only.
2. **Remount from inside (if allowed):** `pct exec <vmid> -- mount -o remount,rw /`
   If that fails (e.g. "Operation not permitted"), the CT may be running with a read-only rootfs by design.
3. **From the host:** Inspect the CT config: `pct config <vmid>`. If `rootfs` has an option making it read-only, remove or change it (Proxmox UI: CT → Hardware → Root disk; or `pct set <vmid> --rootfs <storage>:<size>` to recreate, only if you have a backup).
4. **Alternative:** Ensure at least `/tmp` and `/opt` are writable (e.g. bind-mount writable storage or tmpfs for `/tmp`). Then re-run the fix/install script.
After the CT is writable, run `./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` (2101) or `./scripts/besu/install-besu-permanent-on-missing-nodes.sh` (2500–2505) again.
### LVM thin pool full (2101 / 2500–2505 "No space left on device")
If Besu fails with **"No space left on device"** on `/data/besu/database/*.dbtmp` while `df` inside the CT shows free space, the **host** LVM thin pool is full. The CT’s disk is thin-provisioned; writes fail when the pool has no free space.
**Check on the Proxmox host (e.g. r630-01):**
```bash
lvs -o lv_name,data_percent,metadata_percent # data at 100% = pool full
```
**Fix:** Free space in the thin pool on that host:
- Remove or shrink unused CT/VM disks, or move VMs to another storage.
- Optionally expand the thin pool (add PV or resize).
- After freeing space, restart the affected service: `pct exec <vmid> -- systemctl restart besu-rpc` (or `besu`).
Until the pool has free space, Besu on 2101 (and any other CT on that host that does large writes) will keep failing with "No space left on device".
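The `lvs` check above can be wrapped in a small guard that flags pools near capacity. A sketch: `check_pool` and the 95% threshold are illustrative; on a host you would feed it `lvs --noheadings -o lv_name,data_percent` instead of the embedded sample.

```shell
#!/usr/bin/env bash
# Flag thin pools whose data usage is at or above a threshold.
check_pool() {
  local threshold="${1:-95}" name pct_used
  while read -r name pct_used; do
    [ -z "$name" ] && continue
    pct_used="${pct_used%%.*}"   # drop the fractional part: 74.48 -> 74
    if [ "$pct_used" -ge "$threshold" ]; then
      echo "WARN: $name at ${pct_used}% -- free space before restarting besu"
    else
      echo "ok:   $name at ${pct_used}%"
    fi
  done
}

# Sample input echoing the r630-01 numbers from the 2026-02-15 log
# (100% before cleanup, 74.48% after):
check_pool 95 <<'EOF'
data-before 100.00
data-after 74.48
EOF
```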
**2026-02-15 actions on r630-01:** Ran `fstrim` in all running CTs (pool 100% → 98.33%). Destroyed six **stopped** CTs to free thin pool space: **106, 107, 108, 10000, 10001, 10020** (purge). Migrated **5200–5202, 6000–6002, 6400–6402, 5700** to r630-02. Pool **74.48%**. If 2101 still crash-loops during RocksDB compaction, retry `systemctl restart besu-rpc` or start Besu with a fresh `/data/besu` (resync). See [MIGRATE_CT_R630_01_TO_R630_02.md](../03-deployment/MIGRATE_CT_R630_01_TO_R630_02.md).