# Next Steps — Operator Runbook
**Last Updated:** 2026-02-20
**Purpose:** Single runbook of copy-paste commands for all remaining operator/LAN/creds steps. Use after automated steps are done.
**References:**
- [REMAINING_WORK_DETAILED_STEPS.md](REMAINING_WORK_DETAILED_STEPS.md), [WAVE2_WAVE3_OPERATOR_CHECKLIST.md](WAVE2_WAVE3_OPERATOR_CHECKLIST.md), [INFRA_DEPLOYMENT_LOCKED_AND_LOADED.md](../03-deployment/INFRA_DEPLOYMENT_LOCKED_AND_LOADED.md)
- **Single fixes checklist (required + optional):** [FIXES_PREPARED.md](../04-configuration/FIXES_PREPARED.md)
- **Full fixes (validators, block/tx, Sentries, RPCs, network, optional):** [FULL_FIXES_PREPARED.md](../04-configuration/FULL_FIXES_PREPARED.md)
- **All next steps (consolidated):** [NEXT_STEPS_ALL.md](NEXT_STEPS_ALL.md)
- **Dev/Codespaces (76.53.10.40):** [DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md](../04-configuration/DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md)
- **Dev/Codespaces completion evidence:** [DEV_CODESPACES_COMPLETION_20260207.md](../04-configuration/verification-evidence/DEV_CODESPACES_COMPLETION_20260207.md)
---
## Completed in this session (2026-02-20)
| Item | Result |
|------|--------|
| Completable tasks | `run-completable-tasks-from-anywhere.sh` — config validation OK, on-chain 45/45, run-all-validation --skip-genesis OK, reconcile-env --print. |
| Doc consolidation | NEXT_STEPS_INDEX, DOCUMENTATION_CONSOLIDATION_PLAN; Batch 4+5 → 00-meta-pruned; root cleanup → archive/root-cleanup-20260220; ARCHIVE_CANDIDATES "Last reviewed" set. |
## Completed in previous session (2026-02-19)
| Item | Result |
|------|--------|
| Completable tasks | `run-completable-tasks-from-anywhere.sh` — config, 46 on-chain, validation passed. |
| Operator script | `run-all-operator-tasks-from-lan.sh` — W0-1 skipped (off-LAN); Blockscout verify attempted (Blockscout unreachable). |
| RPC 2101 verify | `verify-rpc-2101-approve-and-sync.sh` — ✅ Chain 138, 19 peers, 5 validators, blocks advancing. |
| 502 script | `address-all-remaining-502s.sh` — backends 10130/10150/10151 OK; Besu 2101 restarted (finish from LAN for NPMplus). |
| Optional Phase 9 | Smart accounts kit (informational) — ran; next: deploy EntryPoint/AccountFactory/Paymaster. |
| E2E verification | `verify-end-to-end-routing.sh` with E2E_ACCEPT_502_INTERNAL=1 — run (report in verification-evidence). |
**Still to run from LAN:** NPMplus backup, Blockscout verification, full 502/NPMplus proxy update. See [COMPLETION_STATUS_20260215](../archive/00-meta-pruned/COMPLETION_STATUS_20260215.md).
---
## Completed in previous session (2026-02-06)
| Item | Result |
|------|--------|
| Validation | `run-all-validation.sh --skip-genesis` — passed |
| W1-1 dry-run | `setup-ssh-key-auth.sh --dry-run` — steps printed |
| W1-2 dry-run | `firewall-proxmox-8006.sh --dry-run` — UFW commands printed (ADMIN_CIDR=192.168.11.0/24) |
| NPMplus backup | `backup-npmplus.sh` — ran successfully (local + on host); backup pulled to `backups/npmplus/backup-20260206_171756.tar.gz` |
| Bridge dry-run | `run-send-cross-chain.sh 0.01 --dry-run` — simulated (real run when PRIVATE_KEY/LINK ready) |
| .env NPM | NPM_URL/NPM_HOST set to 192.168.11.167:81 (use .167 if .166 refuses) |
| **Copy to host** | Scripts copied to **root@192.168.11.11:/tmp/proxmox-scripts-run** (wave0, backup, secure-validator-keys, create-missing-containers, schedule cron scripts, daily-weekly-checks) |
| **Wave 0 on host** | Ran on r630-01: W0-1 (19 NPMplus proxy hosts updated), W0-3 (backup); backup also on host at `.../backups/npmplus/backup-20260206_171756.tar.gz` |
| **Backup pulled** | Host backup copied to local `backups/npmplus/backup-20260206_171756.tar.gz` |
| **Validator keys** | `secure-validator-keys.sh --dry-run` run on host: 1000-1002 would be secured; 1003-1004 not running, skipped. Use `--apply` on host when ready. |
| **Cron scripts on host** | schedule-npmplus-backup-cron.sh and schedule-daily-weekly-cron.sh (and daily-weekly-checks.sh) copied; use `--show` then `--install` from `/tmp/proxmox-scripts-run` if you want cron there (note: /tmp may be cleared on reboot; for permanent cron, clone repo to a persistent path on the host). |
| **Cron installed on host** | NPMplus backup cron (03:00) and daily/weekly cron (08:00 daily, Sun 09:00 weekly) installed on root@192.168.11.11. Logs: `/tmp/proxmox-scripts-run/logs/npmplus-backup.log`, `daily-weekly-checks.log`. |
| **Validator keys applied** | `secure-validator-keys.sh` run on host (no --dry-run): VMIDs 1000, 1001, 1002 secured (chmod 600/700, chown besu); 1003, 1004 not running, skipped. |
---
## Wave 0 — Gates
### W0-2: sendCrossChain (real)
**When:** PRIVATE_KEY and LINK (or fee token) approved in `.env`; you are ready to broadcast.
```bash
cd /path/to/proxmox
# Optional: dry-run first
bash scripts/bridge/run-send-cross-chain.sh 0.01 --dry-run
# Real (no --dry-run)
bash scripts/bridge/run-send-cross-chain.sh 0.01
# Or with recipient:
bash scripts/bridge/run-send-cross-chain.sh 0.01 0xYourRecipientAddress
```
Bridge contract (reference): `0xcacfd227A040002e49e2e01626363071324f820a`. Ensure `CCIPWETH9_BRIDGE_CHAIN138` and `RPC_URL_138`/`CHAIN138_RPC` in `.env`.
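Before the real run, a pre-flight check along these lines can confirm that the variables named above are present. This is a minimal sketch, not a script from the repository: the function name `check_bridge_env` is hypothetical, and it assumes `.env` has already been sourced into the current shell.

```bash
# Hypothetical pre-flight helper (not a repo script). Assumes .env was sourced.
# Reports each missing variable on stderr and returns 1 if any is missing.
check_bridge_env() {
  local missing=0
  if [ -z "${PRIVATE_KEY:-}" ]; then
    echo "missing: PRIVATE_KEY" >&2
    missing=1
  fi
  if [ -z "${CCIPWETH9_BRIDGE_CHAIN138:-}" ]; then
    echo "missing: CCIPWETH9_BRIDGE_CHAIN138" >&2
    missing=1
  fi
  # Either RPC variable name is accepted, as noted above.
  if [ -z "${RPC_URL_138:-}" ] && [ -z "${CHAIN138_RPC:-}" ]; then
    echo "missing: RPC_URL_138 or CHAIN138_RPC" >&2
    missing=1
  fi
  return "$missing"
}
```

One possible use is `check_bridge_env && bash scripts/bridge/run-send-cross-chain.sh 0.01 --dry-run`, so the dry-run only starts when the environment looks complete. The check does not cover the LINK (or fee token) approval, which has to be confirmed on-chain.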
### W0-3: NPMplus backup (re-run anytime)
Backup already ran once; re-run when NPMplus is up and you want a fresh backup:
```bash
cd /path/to/proxmox
bash scripts/verify/backup-npmplus.sh
```
From a host without NPM API access, use: `bash scripts/run-via-proxmox-ssh.sh wave0 --host 192.168.11.11` (r630-01) to run W0-1 + W0-3 on the host.
---
## Crontab (install on jump host or Proxmox node)
```bash
cd /path/to/proxmox
# Show lines
bash scripts/maintenance/schedule-npmplus-backup-cron.sh --show
bash scripts/maintenance/schedule-daily-weekly-cron.sh --show
# Install
bash scripts/maintenance/schedule-npmplus-backup-cron.sh --install
bash scripts/maintenance/schedule-daily-weekly-cron.sh --install
```
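For orientation, the schedules recorded in this runbook (NPMplus backup at 03:00, daily checks at 08:00, weekly checks on Sunday at 09:00) correspond to crontab entries of roughly the following shape. This is an illustration only: the repository path, the location of `daily-weekly-checks.sh`, and its `daily`/`weekly` arguments are assumptions, and the `--show` output of the scheduling scripts remains authoritative.

```bash
# Hypothetical illustration of the crontab entries; not generated by the repo.
# The default path, the checks script location, and its arguments are assumed.
print_example_cron() {
  local repo_dir="${1:-/opt/proxmox}"
  cat <<EOF
# m h dom mon dow command
0 3 * * * cd $repo_dir && bash scripts/verify/backup-npmplus.sh >> logs/npmplus-backup.log 2>&1
0 8 * * * cd $repo_dir && bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
0 9 * * 0 cd $repo_dir && bash scripts/maintenance/daily-weekly-checks.sh weekly >> logs/daily-weekly-checks.log 2>&1
EOF
}
```

After `--install`, `crontab -l` shows what was actually written and can be compared against the expected times.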
---
## Wave 1 — Security (run on each Proxmox host or via SSH)
### W1-1: SSH key-based auth (disable password)
**Prerequisite:** Deploy SSH keys to all hosts (`ssh-copy-id root@<host>`); test login; have break-glass access.
```bash
cd /path/to/proxmox
# On each Proxmox host (or: ssh root@192.168.11.11 'cd /path/to/proxmox && bash scripts/security/setup-ssh-key-auth.sh --apply')
bash scripts/security/setup-ssh-key-auth.sh --apply
```
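Since this step disables password login, it can help to confirm key-only access before applying and the effective sshd setting afterwards. The helpers below are a sketch with hypothetical names, not part of the repository; they rely only on standard OpenSSH behaviour (`BatchMode`, and `sshd -T` printing the effective configuration with lowercase keys).

```bash
# Returns 0 when key-only login to the given host works without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password.
key_login_works() {
  ssh -o BatchMode=yes -o PasswordAuthentication=no -o ConnectTimeout=5 "root@$1" true
}

# Reads `sshd -T` output on stdin; returns 0 when password authentication
# is reported as disabled in the effective configuration.
password_auth_disabled() {
  grep -qi '^passwordauthentication[[:space:]]\{1,\}no$'
}
```

For example, run `key_login_works 192.168.11.11` before `--apply`, and `ssh root@192.168.11.11 'sshd -T' | password_auth_disabled` afterwards. Keep an existing session open while testing so a failed change can still be reverted.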
### W1-2: Firewall — restrict Proxmox API port 8006
**Prerequisite:** Run on a host where UFW is in use (or apply equivalent iptables rules). Default CIDR: 192.168.11.0/24.
```bash
cd /path/to/proxmox
# Dry-run (already done)
bash scripts/security/firewall-proxmox-8006.sh --dry-run
# Apply (allow only ADMIN_CIDR)
bash scripts/security/firewall-proxmox-8006.sh --apply
# Or with custom CIDR:
bash scripts/security/firewall-proxmox-8006.sh --apply 192.168.11.0/24
```
Then verify that `https://<proxmox-ip>:8006` is reachable only from allowed IPs.
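The reachability check can also be scripted with `curl` and run once from a machine inside the allowed CIDR and once from a machine outside it. This is a minimal sketch; the function name is hypothetical, and `-k` is used on the assumption that the Proxmox hosts present a self-signed certificate.

```bash
# Returns 0 if the Proxmox web/API port answers over HTTPS from this machine.
# -k skips certificate verification (assumes a self-signed certificate).
# The port defaults to 8006 and can be overridden with a second argument.
proxmox_api_reachable() {
  local host="$1" port="${2:-8006}"
  curl -k -s -o /dev/null --connect-timeout 5 --max-time 10 "https://$host:$port/"
}
```

From an allowed IP, `proxmox_api_reachable <proxmox-ip>` should succeed; from any other address it should fail or time out once the firewall rules are applied.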
### W1-19: Secure validator keys (on Proxmox host as root)
```bash
cd /path/to/proxmox
bash scripts/secure-validator-keys.sh --dry-run # review
bash scripts/secure-validator-keys.sh # apply (chmod 600, chown besu)
```
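After applying, the resulting file modes can be spot-checked. The helper below is a sketch, not a repository script: it lists any regular file under a directory whose mode is not exactly 600. The key directory passed to it is an assumption, because this runbook does not state where the validator keys live inside each container.

```bash
# Lists files under a directory whose mode is not exactly 600.
# Prints nothing and returns 0 when every file is locked down;
# prints the offending paths and returns 1 otherwise.
find_loose_key_files() {
  local dir="$1" loose
  loose="$(find "$dir" -type f ! -perm 600)"
  [ -z "$loose" ] || { printf '%s\n' "$loose"; return 1; }
}
```

On the Proxmox host, the same `find` expression can be run inside a container with `pct exec <vmid> -- find <key-dir> -type f ! -perm 600`, substituting the real key directory.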
---
## VMIDs 2506, 2507, 2508 — Destroyed 2026-02-08
Containers 2506, 2507, 2508 were **removed and destroyed** on all Proxmox hosts. Script: `scripts/destroy-vmids-2506-2508.sh`. Besu RPC range is **2500-2505** only. See [MISSING_CONTAINERS_LIST.md](../03-deployment/MISSING_CONTAINERS_LIST.md).
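To confirm on a given host that none of the destroyed containers remain, the output of `pct list` can be filtered for the 2506-2508 range. The filter below is a sketch with a hypothetical name; it only assumes that the VMID is the first column of `pct list` output.

```bash
# Reads `pct list` output on stdin and prints any VMID in the destroyed
# range 2506-2508 that still appears. Returns 1 if any is found.
leftover_destroyed_vmids() {
  awk '$1 ~ /^[0-9]+$/ && $1 >= 2506 && $1 <= 2508 { print $1; found = 1 }
       END { exit found + 0 }'
}
```

Example: `ssh root@192.168.11.11 pct list | leftover_destroyed_vmids` prints nothing and exits 0 when the host is clean.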
---
## Dev/Codespaces (76.53.10.40) — Full completion
**Single ordered checklist:** [04-configuration/DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md](../04-configuration/DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md): Phases 1-7 (fourth NPMplus, dev VM, UDM port forward, Cloudflare tunnel, NPMplus proxy hosts, projects/dotenv, verification).
**Key commands (after fourth NPMplus and dev VM exist):**
| Step | Command |
|------|---------|
| Create fourth NPMplus LXC (10236 @ 192.168.11.170) | `bash scripts/npmplus/create-npmplus-fourth-container.sh` |
| Create dev VM (5700 @ 192.168.11.59) | `bash scripts/create-dev-vm-5700.sh` |
| Setup dev VM users + Gitea | `ssh root@192.168.11.11 "pct exec 5700 -- bash -s" < scripts/setup-dev-vm-users-and-gitea.sh` |
| Tunnel + DNS (set CLOUDFLARE_TUNNEL_ID_DEV_CODESPACES in .env first) | `bash scripts/cloudflare/configure-dev-codespaces-tunnel-and-dns.sh` |
| Fourth NPMplus proxy hosts | `NPM_URL=https://192.168.11.170:81 NPM_PASSWORD='...' bash scripts/nginx-proxy-manager/update-npmplus-fourth-proxy-hosts.sh` |
UDM Pro: add port forward 76.53.10.40 → 192.168.11.170 (80/81/443), optional 22 → 192.168.11.59. See [UDM_PRO_DEV_CODESPACES_PORT_FORWARD.md](../04-configuration/UDM_PRO_DEV_CODESPACES_PORT_FORWARD.md).
---
## Wave 2 & Wave 3 — Full checklist
Use the ordered checklist:
- **[WAVE2_WAVE3_OPERATOR_CHECKLIST.md](WAVE2_WAVE3_OPERATOR_CHECKLIST.md)** — W2-1 (monitoring) through W2-8 (NPMplus HA), then W3-1 (CCIP Fleet), W3-2 (Phase 4 isolation).
Summary:
| Wave | Tasks |
|------|--------|
| W2-1 | Monitoring stack (Prometheus, Grafana, Loki, Alertmanager) |
| W2-2 | Grafana via Cloudflare Access; alerts |
| W2-3 | VLAN enablement (UDM Pro, Proxmox bridge) |
| W2-4 | Phase 3 CCIP: Ops/Admin (5400-5401); NAT; scripts |
| W2-5 | Phase 4 sovereign tenant VLANs |
| W2-6 | ~~2506-2508~~ Destroyed 2026-02-08 (RPC 2500-2505 only) |
| W2-7 | DBIS services (10100-10151) |
| W2-8 | NPMplus HA (optional) |
| W3-1 | CCIP Fleet (commit/execute/RMN nodes) |
| W3-2 | Phase 4 tenant isolation enforcement |
---
## Explorer SSL (manual)
If **explorer.d-bis.org** shows "Your connection isn't private":
1. Open NPMplus: **https://192.168.11.167:81** (credentials: `NPM_EMAIL`, `NPM_PASSWORD` from `.env`).
2. SSL Certificates → Add Let's Encrypt for `explorer.d-bis.org` (DNS Challenge + Cloudflare credential if needed).
3. Proxy Hosts → explorer.d-bis.org → SSL tab → assign cert, Force SSL, Save.
See [EXPLORER_TROUBLESHOOTING.md](../04-configuration/EXPLORER_TROUBLESHOOTING.md).
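Once the certificate is assigned, the certificate actually being served can be inspected from the command line with `openssl`. The helpers below are a sketch with hypothetical names. They use only standard `openssl s_client` and `openssl x509` options, and assume the host is reachable on port 443 from where the check runs.

```bash
# Prints issuer and expiry date of the certificate a host serves on port 443.
served_cert_info() {
  local host="$1"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -issuer -enddate
}

# Reads that output on stdin; returns 0 if the issuer line names Let's Encrypt.
issued_by_lets_encrypt() {
  grep -q "^issuer=.*Let's Encrypt"
}
```

Example: `served_cert_info explorer.d-bis.org | issued_by_lets_encrypt && echo "Let's Encrypt cert is live"`. If the issuer is something else, the proxy host is probably still serving a default or self-signed certificate.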
---
## E2E 502s (when public domains return 502)
From **LAN** (SSH to Proxmox + reach NPMplus):
| Goal | Command |
|------|---------|
| Fix all 502 backends + NPMplus proxy + RPC diagnostics | `./scripts/maintenance/address-all-remaining-502s.sh` |
| Also Besu config fix + E2E at end | `./scripts/maintenance/address-all-remaining-502s.sh --run-besu-fix --e2e` |
| Re-run E2E only | `./scripts/verify/verify-end-to-end-routing.sh` |
**Runbook:** [502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md](502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md).
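For a quick manual look before or after running the scripts above, a domain's HTTP status can be probed with `curl`. This sketch is not a replacement for `verify-end-to-end-routing.sh`; the function names and the labels they print are hypothetical.

```bash
# Prints the HTTP status code for a URL ("000" when the connection fails).
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Labels a status code: 502 means the proxy answered but could not get a
# valid response from its backend; 000 means no HTTP response at all.
classify_status() {
  case "$1" in
    502) echo "backend-down" ;;
    000) echo "unreachable" ;;
    2??|3??) echo "ok" ;;
    *) echo "other" ;;
  esac
}
```

Example: `classify_status "$(http_status https://explorer.d-bis.org/)"`. A `backend-down` result points at the service behind NPMplus, while `unreachable` points at DNS, tunnel, or proxy reachability.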
---
## Remaining (operator only)
- **W0-2** — sendCrossChain real (when PRIVATE_KEY/LINK ready).
- **W1-1 / W1-2** — SSH key auth and firewall 8006 `--apply` on each Proxmox host (after keys deployed / CIDR decided).
- **Cron** — ✅ Installed on root@192.168.11.11 (NPMplus 03:00; daily 08:00; weekly Sun 09:00). Re-install if you move repo to a permanent path.
- **Validator keys** - ✅ Applied on host for 1000-1002; 1003-1004 skipped (not running). Re-run when 1003/1004 are up if needed.
- **2506-2508** - Destroyed 2026-02-08; no action.
- **Wave 2 / 3** — Monitoring, VLAN, CCIP, NPMplus HA, Phase 4 per WAVE2_WAVE3_OPERATOR_CHECKLIST.
- **Explorer SSL** — Let's Encrypt for explorer.d-bis.org in NPMplus UI (see above). One-time (and after NPMplus restore if certs lost).
- **Explorer VM 5000 thin pool** — If thin1-r630-02 is >85% or full, migrate VMID 5000 to thin5 per [BLOCKSCOUT_FIX_RUNBOOK.md](../03-deployment/BLOCKSCOUT_FIX_RUNBOOK.md) § "Fix: Migrate VM 5000 to thin5". Weekly cron now checks thin pool (138a); act when it warns or fails.
- **NPMplus cert 134 (cross-all.defi-oracle.io)** — If verification reports "cert files missing" for cert ID 134: in NPMplus at https://192.168.11.167:81 → SSL Certificates → find cross-all.defi-oracle.io → re-save or request Let's Encrypt again to restore cert files on disk.
- **Dev/Codespaces (76.53.10.40)** — Complete all phases in [DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md](../04-configuration/DEV_CODESPACES_NEXT_STEPS_CHECKLIST.md): fourth NPMplus (10236), dev VM (5700), UDM port forward, Cloudflare tunnel, NPMplus fourth proxy hosts, Let's Encrypt, rsync/dotenv, verification.
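For the thin pool item above, the 85% threshold can be checked by hand on the Proxmox node with `lvs`. The filter below is a sketch with a hypothetical name. It assumes the two-column output of `lvs --noheadings -o lv_name,data_percent`, and the LVM names on the node may differ from the Proxmox storage name `thin1-r630-02`.

```bash
# Reads `lvs --noheadings -o lv_name,data_percent` output on stdin and prints
# volumes whose data usage exceeds the limit (default 85 percent).
# Returns 1 if any volume is over the limit, 0 otherwise.
pools_over_limit() {
  awk -v limit="${1:-85}" '$2 + 0 > limit + 0 { print $1, $2; hit = 1 }
                           END { exit hit + 0 }'
}
```

Example on r630-02: `lvs --noheadings -o lv_name,data_percent | pools_over_limit 85`. Volumes without a data percentage (non-thin volumes) are treated as 0 and never reported.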
---
## After running "complete all next steps"
1. **Automated (workspace):** `bash scripts/run-all-next-steps.sh` — report in `docs/04-configuration/verification-evidence/NEXT_STEPS_RUN_*.md`.
2. **Validators + tx-pool:** `bash scripts/fix-all-validators-and-txpool.sh` (requires SSH to .10, .11).
3. **Flush stuck tx (if any):** `bash scripts/flush-stuck-tx-rpc-and-validators.sh --full` (clears RPC 2101 + validators 1000-1004).
4. **Verify from LAN:** From a host on 192.168.11.x run `bash scripts/monitoring/monitor-blockchain-health.sh` and `bash scripts/skip-stuck-transactions.sh`. See [NEXT_STEPS_COMPLETION_RUN_20260208.md](../04-configuration/verification-evidence/NEXT_STEPS_COMPLETION_RUN_20260208.md) § Verify from LAN.
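As a lightweight cross-check of the LAN verification step, the RPC node can be queried directly with standard Ethereum JSON-RPC methods (`eth_chainId`, `net_peerCount`, `eth_blockNumber`). The helpers below are a sketch with hypothetical names, not repository scripts; they assume `RPC_URL_138` points at the RPC endpoint and that the node has the `NET` API enabled for the peer count.

```bash
# Extracts the string "result" value from a JSON-RPC response on stdin.
parse_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Sends a parameterless JSON-RPC call and prints the raw hex result.
rpc_result() {
  local url="$1" method="$2"
  curl -s --max-time 10 -H 'Content-Type: application/json' \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"$method\",\"params\":[],\"id\":1}" "$url" |
    parse_result
}

# Converts a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}
```

Example: `hex_to_dec "$(rpc_result "$RPC_URL_138" eth_chainId)"` should print 138 for this chain. Calling `eth_blockNumber` twice a few seconds apart shows whether blocks are advancing.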
---
## Quick command index
| Goal | Command |
|------|---------|
| **Run all automated next steps** | `bash scripts/run-all-next-steps.sh` (validation, E2E, explorer check, dry-runs; report in verification-evidence/NEXT_STEPS_RUN_*.md) |
| W0-2 real | `bash scripts/bridge/run-send-cross-chain.sh 0.01` |
| W0-3 backup | `bash scripts/verify/backup-npmplus.sh` |
| W0 from LAN | `bash scripts/run-wave0-from-lan.sh` |
| W1-1 apply | `bash scripts/security/setup-ssh-key-auth.sh --apply` (on each host) |
| W1-2 apply | `bash scripts/security/firewall-proxmox-8006.sh --apply` |
| NPMplus cron | `bash scripts/maintenance/schedule-npmplus-backup-cron.sh --install` |
| Daily/weekly cron | `bash scripts/maintenance/schedule-daily-weekly-cron.sh --install` |
| Validator keys | On Proxmox: `bash scripts/secure-validator-keys.sh` (after --dry-run) |
| Wave 0 via SSH | `bash scripts/run-via-proxmox-ssh.sh wave0 --host 192.168.11.11` |
| Request cert (via SSH) | `bash scripts/run-via-proxmox-ssh.sh request-cert --host 192.168.11.11` |
| Fourth NPMplus container | `bash scripts/npmplus/create-npmplus-fourth-container.sh` |
| Dev VM create | `bash scripts/create-dev-vm-5700.sh` |
| Dev/Codespaces tunnel+DNS | `bash scripts/cloudflare/configure-dev-codespaces-tunnel-and-dns.sh` (set CLOUDFLARE_TUNNEL_ID_DEV_CODESPACES in .env) |
| Fourth NPMplus proxy hosts | `NPM_URL=https://192.168.11.170:81 NPM_PASSWORD='...' bash scripts/nginx-proxy-manager/update-npmplus-fourth-proxy-hosts.sh` |
| **Address all 502s (LAN)** | `./scripts/maintenance/address-all-remaining-502s.sh` (use `--run-besu-fix --e2e` for full flow) |
| E2E routing (after NPMplus/DNS change) | `bash scripts/verify/verify-end-to-end-routing.sh` |
| Explorer E2E from LAN (after frontend/Blockscout deploy) | `bash explorer-monorepo/scripts/e2e-test-explorer.sh` |
| Blockscout migrations (version/config change) | On r630-02: `bash scripts/fix-blockscout-ssl-and-migrations.sh` — see [BLOCKSCOUT_FIX_RUNBOOK.md](../03-deployment/BLOCKSCOUT_FIX_RUNBOOK.md) |
| When decommissioning RPC used by explorer | Update Blockscout RPC URL on VM 5000; restart Blockscout — see [OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md) § "When decommissioning or changing RPC nodes" |