# Operational Runbooks - Master Index
**Navigation:** [Home](../01-getting-started/README.md) > [Deployment](README.md) > Operational Runbooks
**Last Updated:** 2026-02-18
**Document Version:** 1.3
**Status:** Active Documentation
---
## Overview
This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment. For issue-specific troubleshooting (RPC, QBFT, SSH, tunnel, etc.), see **[../09-troubleshooting/README.md](../09-troubleshooting/README.md)** and [TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md).
---
## Quick Reference
### Emergency Procedures
- **[Emergency Access](#emergency-access)** - Break-glass access procedures
- **[Service Recovery](#service-recovery)** - Recovering failed services
- **[Network Recovery](#network-recovery)** - Network connectivity issues
### VM/Container Restart
To restart all stopped containers across Proxmox hosts via SSH:
```bash
# From project root; source config for host IPs
source config/ip-addresses.conf
# List stopped per host
for host in $PROXMOX_HOST_ML110 $PROXMOX_HOST_R630_01 $PROXMOX_HOST_R630_02; do
  ssh root@$host "pct list | awk '\$2==\"stopped\" {print \$1}'"
done
# Start each (replace HOST and VMID)
ssh root@HOST "pct start VMID"
```
**Verification:** `scripts/verify/verify-backend-vms.sh` | **Report:** [VM_RESTART_AND_VERIFICATION_20260203.md](../../reports/status/VM_RESTART_AND_VERIFICATION_20260203.md)
**CT 2301 corrupted rootfs:** If besu-rpc-private-1 (ml110) fails at its pre-start hook, run `scripts/fix-ct-2301-corrupted-rootfs.sh`
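The list/start steps above can be wrapped in one helper. A sketch (the `DRY_RUN` guard and the loop structure are additions, not an existing script): with `DRY_RUN=1`, the default here, it only prints what it would do per host.

```shell
#!/usr/bin/env bash
# Sketch: start every stopped CT on each Proxmox host passed as an argument.
# DRY_RUN=1 (default) prints the intent instead of running ssh/pct.
start_stopped_cts() {
  local host vmid
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "DRY: ssh root@$host -> pct start <each stopped VMID>"
      continue
    fi
    # Column 2 of `pct list` is the status; pick the stopped VMIDs.
    for vmid in $(ssh "root@$host" "pct list | awk '\$2==\"stopped\" {print \$1}'"); do
      ssh "root@$host" "pct start $vmid"
    done
  done
}

# Usage (after: source config/ip-addresses.conf):
# DRY_RUN=0 start_stopped_cts "$PROXMOX_HOST_ML110" "$PROXMOX_HOST_R630_01" "$PROXMOX_HOST_R630_02"
```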
### Common Operations
- **[Adding a Validator](#adding-a-validator)** - Add new validator node
- **[Removing a Validator](#removing-a-validator)** - Remove validator node
- **[Upgrading Besu](#upgrading-besu)** - Besu version upgrade
- **[Key Rotation](#key-rotation)** - Validator key rotation
---
## Network Operations
### ER605 Router Configuration
- **[ER605_ROUTER_CONFIGURATION.md](../04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Complete router configuration guide
- **VLAN Configuration** - Setting up VLANs on ER605
- **NAT Pool Configuration** - Configuring role-based egress NAT
- **Failover Configuration** - Setting up WAN failover
### VLAN Management
- **VLAN Migration** - Migrating from flat LAN to VLANs
- **VLAN Troubleshooting** - Common VLAN issues and solutions
- **Inter-VLAN Routing** - Configuring routing between VLANs
### Edge and DNS (Fastly / Direct to NPMplus)
- **[EDGE_PORT_VERIFICATION_RUNBOOK.md](../05-network/EDGE_PORT_VERIFICATION_RUNBOOK.md)** - Phase 0: verify 76.53.10.36:80/443 from internet
- **[CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md)** - Edge routing (Fastly or direct → UDM Pro → NPMplus; Option B for RPC)
- **[OPTION_B_RPC_VIA_TUNNEL_RUNBOOK.md](../05-network/OPTION_B_RPC_VIA_TUNNEL_RUNBOOK.md)** - RPC via Cloudflare Tunnel (6 hostnames → NPMplus); [TUNNEL_SFVALLEY01_INSTALL.md](../04-configuration/cloudflare/TUNNEL_SFVALLEY01_INSTALL.md) - connector install
- **Fastly:** Purge cache, health checks, origin 76.53.10.36 (see Fastly dashboard; optional restrict UDM Pro to Fastly IPs)
- **NPMplus HA failover:** [NPMPLUS_HA_SETUP_GUIDE.md](../04-configuration/NPMPLUS_HA_SETUP_GUIDE.md) - Keepalived/HAProxy; failover to 10234
- **502 runbook:** Check (1) NPMplus (192.168.11.167) up and proxy hosts correct, (2) backend VMID 2201 (RPC) or 5000 (Blockscout) up and reachable, (3) if using Fastly, origin reachability from Fastly to 76.53.10.36; if Option B RPC, tunnel connector (e.g. VMID 102) running. Blockscout 502: [BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)
### Cloudflare (DNS and optional Access)
- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup (DNS retained; Option B tunnel for RPC only)
- **Application Publishing** - Publishing applications via Cloudflare Access (optional)
- **Access Policy Management** - Managing access policies
---
## Smart Accounts (Chain 138 / ERC-4337)
- **Location:** `smom-dbis-138/script/smart-accounts/DeploySmartAccountsKit.s.sol`
- **Env (required for deploy/use):** `PRIVATE_KEY`, `RPC_URL_138`. Optional: `ENTRY_POINT`, `SMART_ACCOUNT_FACTORY`, `PAYMASTER` — set to deployed addresses to use existing contracts; otherwise deploy EntryPoint (ERC-4337), AccountFactory (e.g. MetaMask Smart Accounts Kit), and optionally Paymaster, then set in `.env` and re-run.
- **Run:** `forge script script/smart-accounts/DeploySmartAccountsKit.s.sol --rpc-url $RPC_URL_138 --broadcast` (from `smom-dbis-138`). If addresses are in env, script logs them; else it logs next steps.
- **See:** [PLACEHOLDERS_AND_TBD.md](../PLACEHOLDERS_AND_TBD.md) — Smart Accounts Kit.
---
## Besu Operations
### Node Management
#### Adding a Validator
**Prerequisites:**
- Validator key generated
- VMID allocated (1000-1499 range)
- VLAN 110 configured (if migrated)
**Steps:**
1. Create LXC container with VMID
2. Install Besu
3. Configure validator key
4. Add to static-nodes.json on all nodes
5. Update allowlist (if using permissioning)
6. Start Besu service
7. Verify validator is participating
**See:** [VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)
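Steps 1-3 can be sketched as a printed command plan (the template path, bridge name, and VMID 1005 are illustrative assumptions; review the output, then run it on the Proxmox host). Steps 4-7 (static-nodes.json, allowlist, start, verify) follow the linked guide.

```shell
#!/usr/bin/env bash
# Hypothetical plan printer for steps 1-3 of "Adding a Validator".
plan_add_validator() {
  local vmid="$1" host="$2"
  echo "ssh root@$host 'pct create $vmid <debian-template> --hostname besu-validator-$vmid --net0 name=eth0,bridge=vmbr0,tag=110,ip=dhcp'"
  echo "ssh root@$host 'pct start $vmid'"
  echo "ssh root@$host 'pct push $vmid ./nodekey /etc/besu/key'"
}

plan_add_validator 1005 192.168.11.10   # hypothetical next VMID in the 1000-1499 range
```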
#### Removing a Validator
**Prerequisites:**
- Validator is not critical (check quorum requirements)
- Backup validator key
**Steps:**
1. Stop Besu service
2. Remove from static-nodes.json on all nodes
3. Update allowlist (if using permissioning)
4. Remove container (optional)
5. Document removal
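The removal steps as a mirror-image plan (VMID/host and the key-backup path are illustrative assumptions; it prints commands for review rather than executing them):

```shell
#!/usr/bin/env bash
# Hypothetical plan printer for "Removing a Validator" (steps 1-4).
plan_remove_validator() {
  local vmid="$1" host="$2"
  echo "ssh root@$host 'pct exec $vmid -- systemctl stop besu'"
  echo "ssh root@$host 'pct exec $vmid -- tar czf /root/besu-key-backup.tgz /etc/besu'"
  echo "# remove its enode from static-nodes.json and the allowlist on all remaining nodes"
  echo "ssh root@$host 'pct destroy $vmid'   # optional (step 4)"
}

plan_remove_validator 1004 192.168.11.10
```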
#### Upgrading Besu
**Prerequisites:**
- Backup current configuration
- Test upgrade in dev environment
- Create snapshot before upgrade
**Steps:**
1. Create snapshot: `pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)`
2. Stop Besu service
3. Backup configuration and keys
4. Install new Besu version
5. Update configuration if needed
6. Start Besu service
7. Verify node is syncing
8. Monitor for issues
**Rollback:**
- If issues occur: `pct rollback <vmid> pre-upgrade-YYYYMMDD`
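The snapshot, stop, backup, and rollback steps condensed into one plan (the `besu` unit name and the in-CT config path are assumptions for this sketch):

```shell
#!/usr/bin/env bash
# Print the upgrade plan for one CT, including the matching rollback command.
plan_upgrade() {
  local vmid="$1" stamp
  stamp="pre-upgrade-$(date +%Y%m%d)"
  echo "pct snapshot $vmid $stamp"
  echo "pct exec $vmid -- systemctl stop besu"
  echo "pct exec $vmid -- tar czf /root/besu-pre-upgrade.tgz /etc/besu"
  echo "# install the new Besu inside the CT, update config, then:"
  echo "pct exec $vmid -- systemctl start besu"
  echo "# rollback if needed: pct rollback $vmid $stamp"
}

plan_upgrade 1000   # hypothetical validator VMID
```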
### Node list deploy and verify (static-nodes.json / permissions-nodes.toml)
**Canonical source:** `config/besu-node-lists/` (single source of truth; 30 nodes in allowlist after 203/204 removed; 32 Besu nodes total).
- **Deploy** to all nodes: `scripts/deploy-besu-node-lists-to-all.sh` (optionally `--dry-run`). Pushes `static-nodes.json` and `permissions-nodes.toml` to `/etc/besu/` on every validator, sentry, and RPC (VMIDs 1000-1004, 1500-1508, 2101, 2102, 2201, 2301, 2303-2308, 2400-2403, 2500-2505).
- **Verify** presence and match canonical: `scripts/verify/verify-static-permissions-on-all-besu-nodes.sh --checksum`.
- **Restart Besu** to reload lists: `scripts/besu/restart-besu-reload-node-lists.sh` (optional; lists are read at startup).
- **Full-mesh peering (all 32 nodes):** Every node needs **max-peers=32**. Repo configs updated; to apply on running nodes run `scripts/maintenance/set-all-besu-max-peers-32.sh` then restart. See [08-monitoring/PEER_CONNECTIONS_PLAN.md](../08-monitoring/PEER_CONNECTIONS_PLAN.md).
**See:** [06-besu/BESU_NODES_FILE_REFERENCE.md](../06-besu/BESU_NODES_FILE_REFERENCE.md), [08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md](../08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md).
### RPC block production (chain 138 / current block)
If an RPC node returns wrong chain ID or block 0 / no block: use the dedicated runbook for status checks and common fixes (host-allowlist, tx-pool-min-score, permissions/static-nodes paths, discovery, Besu binary/genesis).
- **Runbook:** [09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md](../09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md)
### Allowlist Management
- **[BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** - Complete allowlist guide
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Quick start for allowlist issues
**Common Operations:**
- Generate allowlist from nodekeys
- Update allowlist on all nodes
- Verify allowlist is correct
- Troubleshoot allowlist issues
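For the "verify allowlist is correct" step, Besu's PERM JSON-RPC API exposes `perm_getNodesAllowlist`; this sketch counts enode entries in the response body (the 30-entry expectation comes from the node-list section above; the RPC IP in the comment is illustrative, and the PERM API must be enabled on the node).

```shell
#!/usr/bin/env bash
# Count enode entries in a perm_getNodesAllowlist JSON-RPC response.
count_allowlist_entries() {
  grep -o 'enode://' <<<"$1" | wc -l | tr -d ' '
}

# From the LAN (IP illustrative):
# resp=$(curl -s -X POST -H 'Content-Type: application/json' \
#   -d '{"jsonrpc":"2.0","method":"perm_getNodesAllowlist","params":[],"id":1}' \
#   http://192.168.11.221:8545)
# count_allowlist_entries "$resp"   # expect 30 per the canonical list
```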
### Consensus Troubleshooting
- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **Block Production Issues** - [BLOCK_PRODUCTION_FIX_RUNBOOK.md](../08-monitoring/BLOCK_PRODUCTION_FIX_RUNBOOK.md) — restore block production (permissioning TOML, tx-pool, restart validators 1000-1004)
- **Validator Recognition** - Validator not being recognized
---
## Liquidity & Multi-Chain (cUSDT/cUSDC)
- **[CUSDT_CUSDC_MULTICHAIN_LIQUIDITY_RUNBOOK.md](../../smom-dbis-138/docs/deployment/CUSDT_CUSDC_MULTICHAIN_LIQUIDITY_RUNBOOK.md)** — Deploy cUSDT/cUSDC to other chains (Ethereum, BSC, Polygon, Base, etc.); create Dodo PMM and Uniswap pools; add to Balancer, Curve. Scripts: `deploy-cusdt-cusdc-all-chains.sh`, `deploy-pmm-all-l2s.sh`, `create-uniswap-v3-pool-cusdt-cusdc.sh`.
- **[LIQUIDITY_POOL_CONTROLS_RUNBOOK.md](LIQUIDITY_POOL_CONTROLS_RUNBOOK.md)** — Trustless LiquidityPoolETH, DODO PMM, PoolManager, LiquidityManager controls and funding.
- **Runbooks master index:** [../RUNBOOKS_MASTER_INDEX.md](../RUNBOOKS_MASTER_INDEX.md) — All runbooks across the repo.
---
## GRU M1 Listing Operations
### GRU M1 Listing Dry-Run
- **[GRU_M1_LISTING_DRY_RUN_RUNBOOK.md](../runbooks/GRU_M1_LISTING_DRY_RUN_RUNBOOK.md)** - Procedural runbook for cUSDC/cUSDT listing dry-runs, dominance simulation, peg stress-tests, CMC/CG submission
**See also:** [docs/gru-m1/](../gru-m1/)
---
## Blockscout & Contract Verification
### Blockscout (VMID 5000)
- **[BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)** — Troubleshooting, migration from thin1, 502/DB issues
- **IP:** 192.168.11.140 (fixed; see [VMID_IP_FIXED_REFERENCE.md](../11-references/VMID_IP_FIXED_REFERENCE.md))
### Forge Contract Verification
Forge `verify-contract` fails against Blockscout with "Params 'module' and 'action' are required". Use the dedicated proxy.
**Preferred (orchestrated; starts proxy if needed):**
```bash
source smom-dbis-138/.env 2>/dev/null
./scripts/verify/run-contract-verification-with-proxy.sh
```
**Manual (proxy + verify):**
1. Start proxy: `BLOCKSCOUT_URL=http://192.168.11.140:4000 node forge-verification-proxy/server.js`
2. Run: `./scripts/verify-contracts-blockscout.sh`
**Alternative:** Nginx fix (`scripts/fix-blockscout-forge-verification.sh`) or manual verification at https://explorer.d-bis.org/address/<ADDR>#verify-contract
**See:**
- **[BLOCKSCOUT_FORGE_VERIFICATION_EVALUATION.md](BLOCKSCOUT_FORGE_VERIFICATION_EVALUATION.md)** — Evaluation and design
- **[forge-verification-proxy/README.md](../../forge-verification-proxy/README.md)** — Proxy usage
- **[CONTRACT_DEPLOYMENT_RUNBOOK.md](CONTRACT_DEPLOYMENT_RUNBOOK.md)** — Deploy and verify workflow
---
## CCIP Operations
### CCIP Relay Service (Chain 138 → Mainnet)
**Status:** ✅ Deployed on r630-01 (192.168.11.11) at `/opt/smom-dbis-138/services/relay`
- **[CCIP_RELAY_DEPLOYMENT.md](../07-ccip/CCIP_RELAY_DEPLOYMENT.md)** - Relay deployment, config, start/restart/logs, troubleshooting
**Quick commands:**
```bash
# View logs
ssh root@192.168.11.11 "tail -f /opt/smom-dbis-138/services/relay/relay-service.log"
# Restart
ssh root@192.168.11.11 "pkill -f 'node index.js' 2>/dev/null; sleep 2; cd /opt/smom-dbis-138/services/relay && nohup ./start-relay.sh >> relay-service.log 2>&1 &"
```
**Configuration:** Uses **RPC_URL_138_PUBLIC** (VMID 2201, 192.168.11.221:8545) for Chain 138; `START_BLOCK=latest`.
### CCIP Deployment
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - Complete CCIP deployment specification
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment orchestration
**WETH9 Bridge (Chain 138) Router mismatch fix:** Run `scripts/deploy-and-configure-weth9-bridge-chain138.sh` (requires `PRIVATE_KEY`); then set `CCIPWETH9_BRIDGE_CHAIN138` to the printed address. Deploy scripts now default to working CCIP router (0x8078A...). See [07-ccip/README.md](../07-ccip/README.md), [COMPREHENSIVE_STATUS_BRIDGE_READY.md](../../COMPREHENSIVE_STATUS_BRIDGE_READY.md), [scripts/README.md](../../scripts/README.md).
**Deployment Phases:**
1. Deploy Ops/Admin nodes (5400-5401)
2. Deploy Monitoring nodes (5402-5403)
3. Deploy Commit nodes (5410-5425)
4. Deploy Execute nodes (5440-5455)
5. Deploy RMN nodes (5470-5476)
### CCIP Node Management
- **Adding CCIP Node** - Add new CCIP node to fleet
- **Removing CCIP Node** - Remove CCIP node from fleet
- **CCIP Node Troubleshooting** - Common CCIP issues
---
## Admin Runner (Scripts / MCP) — Phase 4.4
**Purpose:** Run admin scripts and MCP tooling with central audit (who ran what, when, outcome). Design and implementation are planned alongside the infra admin view.
- **Design:** Runner service or wrapper that (1) authenticates (e.g. JWT or API key), (2) executes script/MCP action, (3) appends to central audit (dbis_core POST `/api/admin/central/audit`) with actor, action, resource, outcome.
- **Docs:** [MASTER_PLAN.md](../00-meta/MASTER_PLAN.md) §4.4; [admin-console-frontend-plan.md](../../dbis_core/docs/admin-console-frontend-plan.md).
- **When:** Implement with org-level panel and infra admin view.
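The design bullet above can be sketched as a thin wrapper (the audit endpoint path is from the bullet; the field names, `run-script` action label, and Bearer-token auth are assumptions): it runs a command, records the outcome, and emits the audit record it would POST.

```shell
#!/usr/bin/env bash
# Design sketch for the admin runner wrapper (Phase 4.4).
audit_record() {
  # actor, action, resource, outcome -> one JSON audit record (field names assumed)
  printf '{"actor":"%s","action":"%s","resource":"%s","outcome":"%s"}' "$1" "$2" "$3" "$4"
}

run_with_audit() {
  local actor="$1"; shift
  local outcome=ok
  "$@" || outcome=failed          # execute the script/MCP action, capture outcome
  audit_record "$actor" "run-script" "$1" "$outcome"
  # A real runner would then POST the record, e.g.:
  # curl -s -X POST -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
  #   -d "$record" "$DBIS_CORE_URL/api/admin/central/audit"
}

# run_with_audit alice ./scripts/verify/verify-backend-vms.sh
```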
---
## Phase 2 & 3 Deployment (Infrastructure)
**Phase 2 — Monitoring stack:** Deploy Prometheus, Grafana, Loki, Alertmanager; configure Cloudflare Access; enable health-check alerting. See [MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md), [MASTER_PLAN.md](../00-meta/MASTER_PLAN.md) §5.
**Phase 2 — Security:** SSH key-based auth (disable password); firewall Proxmox API (port 8006); secure validator keys; audits VLT-024, ISO-024; bridge integrations BRG-VLT, BRG-ISO. See [SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md), [IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md).
**Phase 2 — Backups:** Automated backup script; encrypted validator keys; NPMplus backup (NPM_PASSWORD); config backup. See [BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md), `scripts/backup-proxmox-configs.sh`, `scripts/verify/backup-npmplus.sh`.
**Phase 3 — CCIP fleet:** Ops/Admin nodes (5400-5401), commit/execute/RMN nodes, NAT pools. See [CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md), [OPERATIONAL_RUNBOOKS.md § CCIP Operations](OPERATIONAL_RUNBOOKS.md#ccip-operations).
**Phase 4 — Sovereign tenants (docs/runbook):** VLANs 200-203 (Phoenix Sovereign Cloud Band), Block #6 egress NAT, tenant isolation. **Script:** `scripts/deployment/phase4-sovereign-tenants.sh [--show-steps|--dry-run]`. **Docs:** [ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md) § Phase 4, [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md) (VLANs 200-203), [UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md](../04-configuration/UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md) (sovereign tenant isolation rules).
---
## Monitoring & Observability
### Monitoring Setup
- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_FIX_RUNBOOK.md](../08-monitoring/BLOCK_PRODUCTION_FIX_RUNBOOK.md)** - Restore block production (permissioning, tx-pool, validators 1000-1004)
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring
**Components:**
- Prometheus metrics collection
- Grafana dashboards
- Loki log aggregation
- Alertmanager alerting
### Health Checks
- **Node Health Checks** - Check individual node health
- **Service Health Checks** - Check service status
- **Network Health Checks** - Check network connectivity
**Scripts:**
- `check-node-health.sh` - Node health check script
- `check-service-status.sh` - Service status check
---
## Backup & Recovery
### Backup Procedures
- **Configuration Backup** - Backup all configuration files
- **Validator Key Backup** - Encrypted backup of validator keys
- **Container Backup** - Backup container configurations
**Automated Backups:**
- Scheduled daily backups
- Encrypted storage
- Multiple locations
- 30-day retention
### Disaster Recovery
- **Service Recovery** - Recover failed services
- **Network Recovery** - Recover network connectivity
- **Full System Recovery** - Complete system recovery
**Recovery Procedures:**
1. Identify failure point
2. Restore from backup
3. Verify service status
4. Monitor for issues
---
## Maintenance (ALL_IMPROVEMENTS 135-139)
| # | Task | Frequency | Command / Script |
|---|------|------------|------------------|
| 135 | Monitor explorer sync status | Daily | `curl -s http://192.168.11.140:4000/api/v1/stats \| jq .indexer` or Blockscout admin; check indexer lag |
| 136 | Monitor RPC node health (e.g. VMID 2201) | Daily | `bash scripts/verify/verify-backend-vms.sh`; `curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' $RPC_URL_138_PUBLIC` (or http://192.168.11.221:8545) |
| 137 | Check config API uptime | Weekly | `curl -sI https://dbis-api.d-bis.org/health` or target config API URL |
| 138 | Review explorer logs **(O-4)** | Weekly | See **O-4** below. `ssh root@<explorer-host> "journalctl -u blockscout -n 200 --no-pager"` or `pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager`. Explorer: VMID 5000 (r630-02, 192.168.11.140). |
| 139 | Update token list **(O-5)** | As needed | See **O-5** below. Canonical list: `token-lists/lists/dbis-138.tokenlist.json`. Guide: [TOKEN_LIST_AUTHORING_GUIDE.md](../11-references/TOKEN_LIST_AUTHORING_GUIDE.md). Bump `version` and `timestamp`; validate schema; deploy/public URL per runbook. |
**O-4 (Review explorer logs, weekly):** Run weekly or after incidents. From a host with SSH to the Blockscout node: `ssh root@192.168.11.XX "journalctl -u blockscout -n 200 --no-pager"` (replace with actual Proxmox/container host for VMID 5000), or from Proxmox host: `pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager`. Check for indexer errors, DB connection issues, OOM.
**O-5 (Update token list, as needed):** Edit `token-lists/lists/dbis-138.tokenlist.json`; bump `version.major|minor|patch` and `timestamp`; run validation (see TOKEN_LIST_AUTHORING_GUIDE); update any public URL (e.g. tokens.d-bis.org) and explorer/config API token list reference.
**Script:** `scripts/maintenance/daily-weekly-checks.sh [daily|weekly|all]` — daily: explorer, RPC, indexer lag, in-CT disk (138b); weekly: config API, thin pool all hosts (138a), fstrim (138c), journal vacuum (138d). **Cron:** `schedule-daily-weekly-cron.sh --install` (daily 08:00, weekly Sun 09:00). **Storage:** `schedule-storage-growth-cron.sh --install` (collect every 6h, prune snapshots+history Sun 08:00); `schedule-storage-monitor-cron.sh --install` (host alerts daily 07:00). See [04-configuration/STORAGE_GROWTH_AND_HEALTH.md](../04-configuration/STORAGE_GROWTH_AND_HEALTH.md).
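For task 135's "check indexer lag", a small helper can turn the two heights into a pass/fail decision (the 50-block threshold is an assumption, not a documented SLO; adjust to taste):

```shell
#!/usr/bin/env bash
# Return success (0) when the indexer trails the chain head by more than
# max_lag blocks; usable as the condition in a cron alert.
indexer_lagging() {
  local indexed="$1" head="$2" max_lag="${3:-50}"
  [ $((head - indexed)) -gt "$max_lag" ]
}

# head=$(... eth_blockNumber via $RPC_URL_138_PUBLIC ...)
# indexed=$(curl -s http://192.168.11.140:4000/api/v1/stats | jq -r '...')
# if indexer_lagging "$indexed" "$head"; then echo "ALERT: indexer lag"; fi
```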
### When decommissioning or changing RPC nodes
**Explorer (VMID 5000) depends on RPC** at `ETHEREUM_JSONRPC_HTTP_URL` (use **RPC_URL_138_PUBLIC** = VMID 2201, 192.168.11.221:8545). When you **decommission or change the IP of an RPC node** that Blockscout might use:
1. **Check** Blockscout env on VM 5000:
`pct exec 5000 -- bash -c 'grep -E "ETHEREUM_JSONRPC|RPC" /opt/blockscout/.env 2>/dev/null || docker inspect blockscout 2>/dev/null | grep -A5 Env'` (run from root@r630-02, 192.168.11.12).
2. **If** it points to the affected node, **update** to a live RPC (set to `$RPC_URL_138_PUBLIC` or http://192.168.11.221:8545) in Blockscout env and **restart** Blockscout.
3. **Update** any script defaults and `config/ip-addresses.conf` / docs that reference the old RPC.
See **[BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)** § "Proactive: When changing RPC or decommissioning nodes" and **[SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md](../04-configuration/verification-evidence/SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md)**.
### After NPMplus or DNS changes
Run **E2E routing** (includes explorer.d-bis.org):
`bash scripts/verify/verify-end-to-end-routing.sh`
### After frontend or Blockscout deploy
From a host on LAN that can reach 192.168.11.140, run **full explorer E2E**:
`bash explorer-monorepo/scripts/e2e-test-explorer.sh`
### Before/after Blockscout version or config change
Run **migrations** (SSL-disabled DB URL):
`bash scripts/fix-blockscout-ssl-and-migrations.sh` (on Proxmox host r630-02 or via SSH).
See [BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md).
---
## Security Operations
### Key Management
- **[SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets and keys management
- **Validator Key Rotation** - Rotate validator keys
- **API Token Rotation** - Rotate API tokens
### Access Control (Phase 2 — Security)
- **SSH key-based auth; disable password auth:** On each Proxmox host and key VMs: `sudo sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config`; `sudo systemctl reload sshd`. Ensure SSH keys are deployed first. See [IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md). Scripts: `scripts/security/setup-ssh-key-auth.sh [--dry-run|--apply]`.
- **Firewall: restrict Proxmox API (port 8006):** Allow only admin IPs. Example (iptables): `iptables -A INPUT -p tcp --dport 8006 -s <ADMIN_CIDR> -j ACCEPT`; `iptables -A INPUT -p tcp --dport 8006 -j DROP`. Or use Proxmox firewall / UDM Pro rules. Script: `scripts/security/firewall-proxmox-8006.sh [--dry-run|--apply] [CIDR]`. Document in [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md).
- **Secure validator keys (W1-19):** On Proxmox host as root: `scripts/secure-validator-keys.sh [--dry-run]` — chmod 600/700, chown besu:besu on VMIDs 1000-1004.
- **Cloudflare Access** - Manage Cloudflare Access policies
---
## Troubleshooting
### Common Issues
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting
### Diagnostic Procedures
1. **Check Service Status**
```bash
systemctl status besu-validator
```
2. **Check Logs**
```bash
journalctl -u besu-validator -f
```
3. **Check Network Connectivity**
```bash
ping <node-ip>
```
4. **Check Node Health**
```bash
./scripts/health/check-node-health.sh <vmid>
```
---
## Emergency Procedures
### Emergency Access
**Break-glass Access:**
1. Use emergency SSH endpoint (if configured)
2. Access via Cloudflare Access (if available)
3. Physical console access (last resort)
**Emergency Contacts:**
- Infrastructure Team: [contact info]
- On-call Engineer: [contact info]
### Service Recovery
**Priority Order:**
1. Validators (critical for consensus)
2. RPC nodes (critical for access)
3. Monitoring (important for visibility)
4. Other services
**Recovery Steps:**
1. Identify failed service
2. Check service logs
3. Restart service
4. If restart fails, restore from backup
5. Verify service is operational
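Recovery steps 1-3 can be run in the priority order above with a short loop (the systemd unit names in the example are assumptions for this deployment; restore-from-backup remains a manual step):

```shell
#!/usr/bin/env bash
# Check each service in priority order; restart anything inactive and flag
# restart failures for manual backup restore.
recover_services() {
  local svc
  for svc in "$@"; do
    if systemctl is-active --quiet "$svc"; then
      echo "$svc: active"
    else
      echo "$svc: restarting"
      systemctl restart "$svc" || echo "$svc: restart failed - restore from backup"
    fi
  done
}

# Example priority order (unit names are assumptions):
# recover_services besu-validator besu-rpc prometheus grafana
```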
### Network Recovery
**Network Issues:**
1. Check ER605 router status
2. Check switch status
3. Check VLAN configuration
4. Check firewall rules
5. Test connectivity
**VLAN Issues:**
1. Verify VLAN configuration on switches
2. Verify VLAN configuration on ER605
3. Verify Proxmox bridge configuration
4. Test inter-VLAN routing
---
## Maintenance Windows
### Scheduled Maintenance
- **Weekly:** Health checks, log review
- **Monthly:** Security updates, configuration review
- **Quarterly:** Full system review, backup testing
### Maintenance Procedures
1. **Notify Stakeholders** - Send maintenance notification
2. **Create Snapshots** - Snapshot all containers before changes
3. **Perform Maintenance** - Execute maintenance tasks
4. **Verify Services** - Verify all services are operational
5. **Document Changes** - Document all changes made
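Step 2 (snapshot all containers) can be sketched as a loop over VMIDs (it prints the `pct snapshot` commands; drop the `echo` to execute on the Proxmox host that owns each VMID):

```shell
#!/usr/bin/env bash
# Print a dated pre-maintenance snapshot command per VMID.
snapshot_all() {
  local stamp vmid
  stamp="pre-maint-$(date +%Y%m%d)"
  for vmid in "$@"; do
    echo "pct snapshot $vmid $stamp"
  done
}

snapshot_all 1000 1001 1002 1003 1004   # e.g. the validator CTs
```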
### Maintenance procedures (Ongoing)
| Task | Frequency | Reference |
|------|-----------|-----------|
| Monitor explorer sync **(O-1)** | Daily 08:00 | Cron: `schedule-daily-weekly-cron.sh`; script: `daily-weekly-checks.sh daily` |
| Monitor RPC 2201 **(O-2)** | Daily 08:00 | Same cron/script |
| Config API uptime **(O-3)** | Weekly (Sun 09:00) | `daily-weekly-checks.sh weekly` |
| Review explorer logs **(O-4)** | Weekly | Runbook [138] above; `pct exec 5000 -- journalctl -u blockscout -n 200` or SSH to Blockscout host |
| Update token list **(O-5)** | As needed | Runbook [139] above; `token-lists/lists/dbis-138.tokenlist.json`; [TOKEN_LIST_AUTHORING_GUIDE.md](../11-references/TOKEN_LIST_AUTHORING_GUIDE.md) |
| NPMplus backup | When NPMplus is up | `scripts/verify/backup-npmplus.sh` |
| Validator key/config backup | Per backup policy | W1-8; [BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md) |
| Start firefly-ali-1 (6201) | Optional, when needed | `scripts/maintenance/start-firefly-6201.sh` (r630-02) |
---
## Related Documentation
### Troubleshooting
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions - **Start here for problems**
- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting
### Architecture & Design
- **[NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)** - Network architecture (incl. §7 VMID/network table — service connectivity)
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment guide
- **[VMID_ALLOCATION_FINAL.md](../02-architecture/VMID_ALLOCATION_FINAL.md)** - VMID allocation
- **[MISSING_CONTAINERS_LIST.md](MISSING_CONTAINERS_LIST.md)** - Missing containers and IP assignments
### Configuration
- **[ER605_ROUTER_CONFIGURATION.md](../04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup
- **[SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets management
### Deployment
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment
- **[DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Deployment readiness
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status
### Monitoring
- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring
### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index
---
**Document Status:** Active
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2026-02-18