# Operational Runbooks - Master Index

**Navigation:** [Home](../01-getting-started/README.md) > [Deployment](README.md) > Operational Runbooks

**Last Updated:** 2026-02-18
**Document Version:** 1.3
**Status:** Active Documentation

---

## Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment. For issue-specific troubleshooting (RPC, QBFT, SSH, tunnel, etc.), see **[../09-troubleshooting/README.md](../09-troubleshooting/README.md)** and [TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md).

---
## Quick Reference

### Emergency Procedures

- **[Emergency Access](#emergency-access)** - Break-glass access procedures
- **[Service Recovery](#service-recovery)** - Recovering failed services
- **[Network Recovery](#network-recovery)** - Network connectivity issues

### VM/Container Restart

To restart all stopped containers across Proxmox hosts via SSH:

```bash
# From project root; source config for host IPs
source config/ip-addresses.conf

# List stopped containers per host
for host in $PROXMOX_HOST_ML110 $PROXMOX_HOST_R630_01 $PROXMOX_HOST_R630_02; do
  ssh root@$host "pct list | awk '\$2==\"stopped\" {print \$1}'"
done

# Start each (replace HOST and VMID)
ssh root@HOST "pct start VMID"
```

**Verification:** `scripts/verify/verify-backend-vms.sh` | **Report:** [VM_RESTART_AND_VERIFICATION_20260203.md](../../reports/status/VM_RESTART_AND_VERIFICATION_20260203.md)

**CT 2301 corrupted rootfs:** If besu-rpc-private-1 (ml110) fails at the pre-start hook, run `scripts/fix-ct-2301-corrupted-rootfs.sh`.

### Common Operations

- **[Adding a Validator](#adding-a-validator)** - Add new validator node
- **[Removing a Validator](#removing-a-validator)** - Remove validator node
- **[Upgrading Besu](#upgrading-besu)** - Besu version upgrade
- **[Key Rotation](#key-rotation)** - Validator key rotation

---
## Network Operations

### ER605 Router Configuration

- **[ER605_ROUTER_CONFIGURATION.md](../04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Complete router configuration guide
- **VLAN Configuration** - Setting up VLANs on ER605
- **NAT Pool Configuration** - Configuring role-based egress NAT
- **Failover Configuration** - Setting up WAN failover

### VLAN Management

- **VLAN Migration** - Migrating from flat LAN to VLANs
- **VLAN Troubleshooting** - Common VLAN issues and solutions
- **Inter-VLAN Routing** - Configuring routing between VLANs

### Edge and DNS (Fastly / Direct to NPMplus)

- **[EDGE_PORT_VERIFICATION_RUNBOOK.md](../05-network/EDGE_PORT_VERIFICATION_RUNBOOK.md)** - Phase 0: verify 76.53.10.36:80/443 from the internet
- **[CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md)** - Edge routing (Fastly or direct → UDM Pro → NPMplus; Option B for RPC)
- **[OPTION_B_RPC_VIA_TUNNEL_RUNBOOK.md](../05-network/OPTION_B_RPC_VIA_TUNNEL_RUNBOOK.md)** - RPC via Cloudflare Tunnel (6 hostnames → NPMplus); [TUNNEL_SFVALLEY01_INSTALL.md](../04-configuration/cloudflare/TUNNEL_SFVALLEY01_INSTALL.md) - connector install
- **Fastly:** Purge cache, health checks, origin 76.53.10.36 (see the Fastly dashboard; optionally restrict UDM Pro to Fastly IPs)
- **NPMplus HA failover:** [NPMPLUS_HA_SETUP_GUIDE.md](../04-configuration/NPMPLUS_HA_SETUP_GUIDE.md) - Keepalived/HAProxy; failover to 10234
- **502 runbook:** Check (1) NPMplus (192.168.11.167) is up and proxy hosts are correct, (2) backend VMID 2201 (RPC) or 5000 (Blockscout) is up and reachable, (3) if using Fastly, origin reachability from Fastly to 76.53.10.36; if using Option B RPC, the tunnel connector (e.g. VMID 102) is running. Blockscout 502: [BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)

### Cloudflare (DNS and optional Access)

- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup (DNS retained; Option B tunnel for RPC only)
- **Application Publishing** - Publishing applications via Cloudflare Access (optional)
- **Access Policy Management** - Managing access policies

---
## Smart Accounts (Chain 138 / ERC-4337)

- **Location:** `smom-dbis-138/script/smart-accounts/DeploySmartAccountsKit.s.sol`
- **Env (required for deploy/use):** `PRIVATE_KEY`, `RPC_URL_138`. Optional: `ENTRY_POINT`, `SMART_ACCOUNT_FACTORY`, `PAYMASTER` — set to deployed addresses to use existing contracts; otherwise deploy EntryPoint (ERC-4337), AccountFactory (e.g. MetaMask Smart Accounts Kit), and optionally Paymaster, then set them in `.env` and re-run.
- **Run:** `forge script script/smart-accounts/DeploySmartAccountsKit.s.sol --rpc-url $RPC_URL_138 --broadcast` (from `smom-dbis-138`). If addresses are in env, the script logs them; otherwise it logs next steps.
- **See:** [PLACEHOLDERS_AND_TBD.md](../PLACEHOLDERS_AND_TBD.md) — Smart Accounts Kit.

---
## Besu Operations

### Node Management

#### Adding a Validator

**Prerequisites:**
- Validator key generated
- VMID allocated (1000-1499 range)
- VLAN 110 configured (if migrated)

**Steps:**
1. Create LXC container with VMID
2. Install Besu
3. Configure validator key
4. Add to static-nodes.json on all nodes
5. Update allowlist (if using permissioning)
6. Start Besu service
7. Verify validator is participating

**See:** [VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)
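
The steps above can be sketched as a dry-run shell script. The VMID, VLAN 110 address, container template, key path, and the `besu` package/service name are all illustrative assumptions; the `run` helper only prints each command, so nothing executes until you swap its body for real execution.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the validator-add steps; all names are placeholders.
set -euo pipefail

VMID=1005                 # hypothetical next VMID in the 1000-1499 range
NODE_IP="192.168.110.15"  # hypothetical VLAN 110 address for the container config

run() { echo "WOULD RUN: $*"; }   # replace the body with "$@" to execute for real

run pct create "$VMID" local:vztmpl/debian-12-standard.tar.zst --hostname "validator-$VMID"  # step 1
run pct start "$VMID"
run pct exec "$VMID" -- apt-get install -y besu                  # step 2: install Besu
run pct push "$VMID" ./keys/nodekey /etc/besu/key                # step 3: validator key
run scripts/deploy-besu-node-lists-to-all.sh                     # steps 4-5: static-nodes + allowlist
run pct exec "$VMID" -- systemctl start besu                     # step 6: start service
run echo "verify via qbft_getValidatorsByBlockNumber on an RPC node"  # step 7
```

The final verification (step 7) maps to Besu's `qbft_getValidatorsByBlockNumber` JSON-RPC method, which lists the validators active at a given block.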

#### Removing a Validator

**Prerequisites:**
- Validator is not critical (check quorum requirements)
- Backup validator key

**Steps:**
1. Stop Besu service
2. Remove from static-nodes.json on all nodes
3. Update allowlist (if using permissioning)
4. Remove container (optional)
5. Document removal
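
A matching dry-run sketch of the removal steps; the VMID and the `vzdump`-based backup are illustrative assumptions, and the `run` helper only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the validator-removal steps; VMID is a placeholder.
set -euo pipefail

VMID=1004   # hypothetical validator being retired

run() { echo "WOULD RUN: $*"; }   # replace the body with "$@" to execute for real

run pct exec "$VMID" -- systemctl stop besu       # step 1: stop Besu
run scripts/deploy-besu-node-lists-to-all.sh      # steps 2-3: push updated node lists/allowlist
run vzdump "$VMID" --dumpdir /var/lib/vz/dump     # back up the container (and its key) first
run pct destroy "$VMID"                           # step 4: remove container (optional)
run echo "record the removal in the change log"   # step 5: document
```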

#### Upgrading Besu

**Prerequisites:**
- Backup current configuration
- Test upgrade in dev environment
- Create snapshot before upgrade

**Steps:**
1. Create snapshot: `pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)`
2. Stop Besu service
3. Backup configuration and keys
4. Install new Besu version
5. Update configuration if needed
6. Start Besu service
7. Verify node is syncing
8. Monitor for issues

**Rollback:**
- If issues occur: `pct rollback <vmid> pre-upgrade-YYYYMMDD`
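
The snapshot-gated upgrade can be sketched the same way. The VMID is a placeholder and step 4 is deliberately left as a stub, since the real install command depends on how Besu is packaged on the node; the `run` helper only prints each command.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the snapshot-gated Besu upgrade; VMID is a placeholder.
set -euo pipefail

VMID=1000
SNAP="pre-upgrade-$(date +%Y%m%d)"

run() { echo "WOULD RUN: $*"; }   # replace the body with "$@" to execute for real

run pct snapshot "$VMID" "$SNAP"                                        # step 1: snapshot
run pct exec "$VMID" -- systemctl stop besu                             # step 2: stop
run pct exec "$VMID" -- tar czf /root/besu-config-backup.tgz /etc/besu  # step 3: backup config/keys
run pct exec "$VMID" -- echo "install new Besu version here"            # step 4 (method varies)
run pct exec "$VMID" -- systemctl start besu                            # step 6: start
echo "Rollback if needed: pct rollback $VMID $SNAP"
```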

### Node list deploy and verify (static-nodes.json / permissions-nodes.toml)

**Canonical source:** `config/besu-node-lists/` (single source of truth; 30 nodes in the allowlist after 203/204 were removed; 32 Besu nodes total).

- **Deploy** to all nodes: `scripts/deploy-besu-node-lists-to-all.sh` (optionally `--dry-run`). Pushes `static-nodes.json` and `permissions-nodes.toml` to `/etc/besu/` on every validator, sentry, and RPC node (VMIDs 1000–1004, 1500–1508, 2101, 2102, 2201, 2301, 2303–2308, 2400–2403, 2500–2505).
- **Verify** presence and match against canonical: `scripts/verify/verify-static-permissions-on-all-besu-nodes.sh --checksum`.
- **Restart Besu** to reload lists: `scripts/besu/restart-besu-reload-node-lists.sh` (optional; lists are read at startup).
- **Full-mesh peering (all 32 nodes):** Every node needs **max-peers=32**. Repo configs are updated; to apply on running nodes, run `scripts/maintenance/set-all-besu-max-peers-32.sh` then restart. See [08-monitoring/PEER_CONNECTIONS_PLAN.md](../08-monitoring/PEER_CONNECTIONS_PLAN.md).

**See:** [06-besu/BESU_NODES_FILE_REFERENCE.md](../06-besu/BESU_NODES_FILE_REFERENCE.md), [08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md](../08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md).

### RPC block production (chain 138 / current block)

If an RPC node returns the wrong chain ID, block 0, or no block, use the dedicated runbook for status checks and common fixes (host allowlist, tx-pool-min-score, permissions/static-nodes paths, discovery, Besu binary/genesis).

- **Runbook:** [09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md](../09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md)
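
As a quick first check before opening the runbook, the chain ID and head block can be queried over JSON-RPC; chain 138 is `0x8a` in hex. The URL default follows RPC_URL_138_PUBLIC; `result_of` is a minimal sed helper written for this sketch, not part of any tooling here.

```shell
# Quick RPC sanity check: chain ID should be 0x8a (138) and the head block nonzero.
RPC_URL="${RPC_URL_138_PUBLIC:-http://192.168.11.221:8545}"

result_of() {  # pull the "result" field out of a JSON-RPC response
  sed -n 's/.*"result" *: *"\([^"]*\)".*/\1/p'
}

rpc() {  # POST a parameterless JSON-RPC call for the given method
  curl -s --max-time 5 -X POST -H 'Content-Type: application/json' \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"$1\",\"params\":[],\"id\":1}" "$RPC_URL"
}

chain=$(rpc eth_chainId | result_of)
block=$(rpc eth_blockNumber | result_of)
[ "$chain" = "0x8a" ] || echo "WRONG CHAIN: got '$chain', expected 0x8a (138)"
{ [ -n "$block" ] && [ "$block" != "0x0" ]; } || echo "NO/ZERO BLOCK: got '$block'"
```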

### Allowlist Management

- **[BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** - Complete allowlist guide
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Quick start for allowlist issues

**Common Operations:**
- Generate allowlist from nodekeys
- Update allowlist on all nodes
- Verify allowlist is correct
- Troubleshoot allowlist issues
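
One way to spot-check "allowlist is correct" is to read back the node's effective allowlist via Besu's `perm_getNodesAllowlist` JSON-RPC method. This is a sketch under assumptions: it requires the PERM RPC API to be enabled on the node, and the URL is illustrative; the live `curl` is left commented for hosts that can actually reach the node.

```shell
# Read back the effective node allowlist over JSON-RPC. Requires the PERM
# API to be enabled on the node; URL is an assumption.
RPC_URL="${RPC_URL:-http://192.168.11.221:8545}"
payload='{"jsonrpc":"2.0","method":"perm_getNodesAllowlist","params":[],"id":1}'

echo "POST $RPC_URL"
echo "$payload"
# Uncomment on a host that can reach the node; with jq, count the entries and
# compare against the canonical list (30 nodes) in config/besu-node-lists/:
# curl -s --max-time 5 -X POST -H 'Content-Type: application/json' \
#   -d "$payload" "$RPC_URL" | jq '.result | length'
```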

### Consensus Troubleshooting

- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **Block Production Issues** - [BLOCK_PRODUCTION_FIX_RUNBOOK.md](../08-monitoring/BLOCK_PRODUCTION_FIX_RUNBOOK.md) — restore block production (permissioning TOML, tx-pool, restart validators 1000–1004)
- **Validator Recognition** - Validator not being recognized

---

## Liquidity & Multi-Chain (cUSDT/cUSDC)

- **[CUSDT_CUSDC_MULTICHAIN_LIQUIDITY_RUNBOOK.md](../../smom-dbis-138/docs/deployment/CUSDT_CUSDC_MULTICHAIN_LIQUIDITY_RUNBOOK.md)** — Deploy cUSDT/cUSDC to other chains (Ethereum, BSC, Polygon, Base, etc.); create Dodo PMM and Uniswap pools; add to Balancer, Curve. Scripts: `deploy-cusdt-cusdc-all-chains.sh`, `deploy-pmm-all-l2s.sh`, `create-uniswap-v3-pool-cusdt-cusdc.sh`.
- **[LIQUIDITY_POOL_CONTROLS_RUNBOOK.md](LIQUIDITY_POOL_CONTROLS_RUNBOOK.md)** — Trustless LiquidityPoolETH, DODO PMM, PoolManager, LiquidityManager controls and funding.
- **Runbooks master index:** [../RUNBOOKS_MASTER_INDEX.md](../RUNBOOKS_MASTER_INDEX.md) — All runbooks across the repo.

---

## GRU M1 Listing Operations

### GRU M1 Listing Dry-Run

- **[GRU_M1_LISTING_DRY_RUN_RUNBOOK.md](../runbooks/GRU_M1_LISTING_DRY_RUN_RUNBOOK.md)** - Procedural runbook for cUSDC/cUSDT listing dry-runs, dominance simulation, peg stress-tests, and CMC/CG submission

**See also:** [docs/gru-m1/](../gru-m1/)

---
## Blockscout & Contract Verification

### Blockscout (VMID 5000)

- **[BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)** — Troubleshooting, migration from thin1, 502/DB issues
- **IP:** 192.168.11.140 (fixed; see [VMID_IP_FIXED_REFERENCE.md](../11-references/VMID_IP_FIXED_REFERENCE.md))

### Forge Contract Verification

Forge `verify-contract` fails against Blockscout with "Params 'module' and 'action' are required". Use the dedicated proxy.

**Preferred (orchestrated; starts proxy if needed):**
```bash
source smom-dbis-138/.env 2>/dev/null
./scripts/verify/run-contract-verification-with-proxy.sh
```

**Manual (proxy + verify):**
1. Start proxy: `BLOCKSCOUT_URL=http://192.168.11.140:4000 node forge-verification-proxy/server.js`
2. Run: `./scripts/verify-contracts-blockscout.sh`

**Alternative:** Nginx fix (`scripts/fix-blockscout-forge-verification.sh`) or manual verification at https://explorer.d-bis.org/address/<ADDR>#verify-contract

**See:**
- **[BLOCKSCOUT_FORGE_VERIFICATION_EVALUATION.md](BLOCKSCOUT_FORGE_VERIFICATION_EVALUATION.md)** — Evaluation and design
- **[forge-verification-proxy/README.md](../../forge-verification-proxy/README.md)** — Proxy usage
- **[CONTRACT_DEPLOYMENT_RUNBOOK.md](CONTRACT_DEPLOYMENT_RUNBOOK.md)** — Deploy and verify workflow

---
## CCIP Operations

### CCIP Relay Service (Chain 138 → Mainnet)

**Status:** ✅ Deployed on r630-01 (192.168.11.11) at `/opt/smom-dbis-138/services/relay`

- **[CCIP_RELAY_DEPLOYMENT.md](../07-ccip/CCIP_RELAY_DEPLOYMENT.md)** - Relay deployment, config, start/restart/logs, troubleshooting

**Quick commands:**
```bash
# View logs
ssh root@192.168.11.11 "tail -f /opt/smom-dbis-138/services/relay/relay-service.log"

# Restart
ssh root@192.168.11.11 "pkill -f 'node index.js' 2>/dev/null; sleep 2; cd /opt/smom-dbis-138/services/relay && nohup ./start-relay.sh >> relay-service.log 2>&1 &"
```

**Configuration:** Uses **RPC_URL_138_PUBLIC** (VMID 2201, 192.168.11.221:8545) for Chain 138; `START_BLOCK=latest`.

### CCIP Deployment

- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - Complete CCIP deployment specification
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment orchestration

**WETH9 Bridge (Chain 138) – Router mismatch fix:** Run `scripts/deploy-and-configure-weth9-bridge-chain138.sh` (requires `PRIVATE_KEY`); then set `CCIPWETH9_BRIDGE_CHAIN138` to the printed address. Deploy scripts now default to the working CCIP router (0x8078A...). See [07-ccip/README.md](../07-ccip/README.md), [COMPREHENSIVE_STATUS_BRIDGE_READY.md](../../COMPREHENSIVE_STATUS_BRIDGE_READY.md), [scripts/README.md](../../scripts/README.md).

**Deployment Phases:**
1. Deploy Ops/Admin nodes (5400-5401)
2. Deploy Monitoring nodes (5402-5403)
3. Deploy Commit nodes (5410-5425)
4. Deploy Execute nodes (5440-5455)
5. Deploy RMN nodes (5470-5476)

### CCIP Node Management

- **Adding CCIP Node** - Add new CCIP node to fleet
- **Removing CCIP Node** - Remove CCIP node from fleet
- **CCIP Node Troubleshooting** - Common CCIP issues

---
## Admin Runner (Scripts / MCP) — Phase 4.4

**Purpose:** Run admin scripts and MCP tooling with central audit (who ran what, when, and with what outcome). Design and implement when the infra admin view is built.

- **Design:** Runner service or wrapper that (1) authenticates (e.g. JWT or API key), (2) executes the script/MCP action, (3) appends to the central audit (dbis_core POST `/api/admin/central/audit`) with actor, action, resource, and outcome.
- **Docs:** [MASTER_PLAN.md](../00-meta/MASTER_PLAN.md) §4.4; [admin-console-frontend-plan.md](../../dbis_core/docs/admin-console-frontend-plan.md).
- **When:** Implement with the org-level panel and infra admin view.

---

## Phase 2 & 3 Deployment (Infrastructure)

**Phase 2 — Monitoring stack:** Deploy Prometheus, Grafana, Loki, Alertmanager; configure Cloudflare Access; enable health-check alerting. See [MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md), [MASTER_PLAN.md](../00-meta/MASTER_PLAN.md) §5.

**Phase 2 — Security:** SSH key-based auth (disable password auth); firewall the Proxmox API (port 8006); secure validator keys; audits VLT-024, ISO-024; bridge integrations BRG-VLT, BRG-ISO. See [SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md), [IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md).

**Phase 2 — Backups:** Automated backup script; encrypted validator keys; NPMplus backup (NPM_PASSWORD); config backup. See [BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md), `scripts/backup-proxmox-configs.sh`, `scripts/verify/backup-npmplus.sh`.

**Phase 3 — CCIP fleet:** Ops/Admin nodes (5400-5401), commit/execute/RMN nodes, NAT pools. See [CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md), [OPERATIONAL_RUNBOOKS.md § CCIP Operations](OPERATIONAL_RUNBOOKS.md#ccip-operations).

**Phase 4 — Sovereign tenants (docs/runbook):** VLANs 200–203 (Phoenix Sovereign Cloud Band), Block #6 egress NAT, tenant isolation. **Script:** `scripts/deployment/phase4-sovereign-tenants.sh [--show-steps|--dry-run]`. **Docs:** [ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md) § Phase 4, [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md) (VLAN 200–203), [UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md](../04-configuration/UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md) (sovereign tenant isolation rules).

---
## Monitoring & Observability

### Monitoring Setup

- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_FIX_RUNBOOK.md](../08-monitoring/BLOCK_PRODUCTION_FIX_RUNBOOK.md)** - Restore block production (permissioning, tx-pool, validators 1000–1004)
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

**Components:**
- Prometheus metrics collection
- Grafana dashboards
- Loki log aggregation
- Alertmanager alerting

### Health Checks

- **Node Health Checks** - Check individual node health
- **Service Health Checks** - Check service status
- **Network Health Checks** - Check network connectivity

**Scripts:**
- `check-node-health.sh` - Node health check script
- `check-service-status.sh` - Service status check

---
## Backup & Recovery

### Backup Procedures

- **Configuration Backup** - Backup all configuration files
- **Validator Key Backup** - Encrypted backup of validator keys
- **Container Backup** - Backup container configurations

**Automated Backups:**
- Scheduled daily backups
- Encrypted storage
- Multiple locations
- 30-day retention
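
The 30-day retention boils down to an age-based sweep. This sketch exercises it against a throwaway directory (the real jobs live in `scripts/backup-proxmox-configs.sh`); `touch -d` is GNU coreutils.

```shell
# 30-day retention sweep, demonstrated against a throwaway directory.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/config-new.tgz"
touch -d '40 days ago' "$BACKUP_DIR/config-old.tgz"   # simulate an expired backup

# Remove backups last modified more than 30 days ago
find "$BACKUP_DIR" -name '*.tgz' -mtime +30 -delete

ls "$BACKUP_DIR"   # only config-new.tgz remains
```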

### Disaster Recovery

- **Service Recovery** - Recover failed services
- **Network Recovery** - Recover network connectivity
- **Full System Recovery** - Complete system recovery

**Recovery Procedures:**
1. Identify failure point
2. Restore from backup
3. Verify service status
4. Monitor for issues

---
## Maintenance (ALL_IMPROVEMENTS 135–139)

| # | Task | Frequency | Command / Script |
|---|------|-----------|------------------|
| 135 | Monitor explorer sync status | Daily | `curl -s http://192.168.11.140:4000/api/v1/stats \| jq .indexer` or Blockscout admin; check indexer lag |
| 136 | Monitor RPC node health (e.g. VMID 2201) | Daily | `bash scripts/verify/verify-backend-vms.sh`; `curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' $RPC_URL_138_PUBLIC` (or http://192.168.11.221:8545) |
| 137 | Check config API uptime | Weekly | `curl -sI https://dbis-api.d-bis.org/health` or target config API URL |
| 138 | Review explorer logs **(O-4)** | Weekly | See **O-4** below. `ssh root@<explorer-host> "journalctl -u blockscout -n 200 --no-pager"` or `pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager`. Explorer: VMID 5000 (r630-02, 192.168.11.140). |
| 139 | Update token list **(O-5)** | As needed | See **O-5** below. Canonical list: `token-lists/lists/dbis-138.tokenlist.json`. Guide: [TOKEN_LIST_AUTHORING_GUIDE.md](../11-references/TOKEN_LIST_AUTHORING_GUIDE.md). Bump `version` and `timestamp`; validate schema; deploy/public URL per runbook. |

**O-4 (Review explorer logs, weekly):** Run weekly or after incidents. From a host with SSH to the Blockscout node: `ssh root@192.168.11.XX "journalctl -u blockscout -n 200 --no-pager"` (replace with the actual Proxmox/container host for VMID 5000), or from the Proxmox host: `pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager`. Check for indexer errors, DB connection issues, and OOM.

**O-5 (Update token list, as needed):** Edit `token-lists/lists/dbis-138.tokenlist.json`; bump `version.major|minor|patch` and `timestamp`; run validation (see TOKEN_LIST_AUTHORING_GUIDE); update any public URL (e.g. tokens.d-bis.org) and the explorer/config API token list reference.
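
The O-5 version/timestamp bump can be scripted. This sketch works on a temporary sample file and assumes compact JSON; for the real, possibly pretty-printed list, a JSON-aware tool such as jq is safer than sed.

```shell
# Bump the patch version and timestamp of a token list (sample file shown;
# point f at token-lists/lists/dbis-138.tokenlist.json in practice).
f=$(mktemp)
printf '%s\n' '{"name":"dbis-138","timestamp":"2026-01-01T00:00:00Z","version":{"major":1,"minor":0,"patch":3}}' > "$f"

patch=$(sed -n 's/.*"patch": *\([0-9]*\).*/\1/p' "$f")
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
sed -i "s/\"patch\": *$patch/\"patch\":$((patch + 1))/; s/\"timestamp\":\"[^\"]*\"/\"timestamp\":\"$now\"/" "$f"

cat "$f"   # patch is now 4; timestamp is the current UTC time
```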

**Script:** `scripts/maintenance/daily-weekly-checks.sh [daily|weekly|all]` — daily: explorer, RPC, indexer lag, in-CT disk (138b); weekly: config API, thin pool on all hosts (138a), fstrim (138c), journal vacuum (138d). **Cron:** `schedule-daily-weekly-cron.sh --install` (daily 08:00, weekly Sun 09:00). **Storage:** `schedule-storage-growth-cron.sh --install` (collect every 6h, prune snapshots+history Sun 08:00); `schedule-storage-monitor-cron.sh --install` (host alerts daily 07:00). See [04-configuration/STORAGE_GROWTH_AND_HEALTH.md](../04-configuration/STORAGE_GROWTH_AND_HEALTH.md).

### When decommissioning or changing RPC nodes

**The explorer (VMID 5000) depends on the RPC node** set in `ETHEREUM_JSONRPC_HTTP_URL` (use **RPC_URL_138_PUBLIC** = VMID 2201, 192.168.11.221:8545). When you **decommission or change the IP of an RPC node** that Blockscout might use:

1. **Check** the Blockscout env on VM 5000:
   `pct exec 5000 -- bash -c 'grep -E "ETHEREUM_JSONRPC|RPC" /opt/blockscout/.env 2>/dev/null || docker inspect blockscout 2>/dev/null | grep -A5 Env'` (run from root@r630-02, 192.168.11.12).
2. **If** it points to the affected node, **update** it to a live RPC (set to `$RPC_URL_138_PUBLIC` or http://192.168.11.221:8545) in the Blockscout env and **restart** Blockscout.
3. **Update** any script defaults and `config/ip-addresses.conf` / docs that reference the old RPC.

See **[BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md)** § "Proactive: When changing RPC or decommissioning nodes" and **[SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md](../04-configuration/verification-evidence/SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md)**.

### After NPMplus or DNS changes

Run the **E2E routing check** (includes explorer.d-bis.org):
`bash scripts/verify/verify-end-to-end-routing.sh`

### After frontend or Blockscout deploy

From a host on the LAN that can reach 192.168.11.140, run the **full explorer E2E test**:
`bash explorer-monorepo/scripts/e2e-test-explorer.sh`

### Before/after Blockscout version or config change

Run **migrations** (SSL-disabled DB URL):
`bash scripts/fix-blockscout-ssl-and-migrations.sh` (on Proxmox host r630-02 or via SSH).
See [BLOCKSCOUT_FIX_RUNBOOK.md](BLOCKSCOUT_FIX_RUNBOOK.md).

---
## Security Operations

### Key Management

- **[SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets and keys management
- **Validator Key Rotation** - Rotate validator keys
- **API Token Rotation** - Rotate API tokens

### Access Control (Phase 2 — Security)

- **SSH key-based auth; disable password auth:** On each Proxmox host and key VMs: `sudo sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config`; `sudo systemctl reload sshd`. Ensure SSH keys are deployed first. See [IMPLEMENTATION_CHECKLIST.md](../10-best-practices/IMPLEMENTATION_CHECKLIST.md). Script: `scripts/security/setup-ssh-key-auth.sh [--dry-run|--apply]`.
- **Firewall: restrict Proxmox API (port 8006):** Allow only admin IPs. Example (iptables): `iptables -A INPUT -p tcp --dport 8006 -s <ADMIN_CIDR> -j ACCEPT`; `iptables -A INPUT -p tcp --dport 8006 -j DROP`. Or use the Proxmox firewall / UDM Pro rules. Script: `scripts/security/firewall-proxmox-8006.sh [--dry-run|--apply] [CIDR]`. Document in [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md).
- **Secure validator keys (W1-19):** On the Proxmox host as root: `scripts/secure-validator-keys.sh [--dry-run]` — chmod 600/700, chown besu:besu on VMIDs 1000–1004.
- **Cloudflare Access** - Manage Cloudflare Access policies

---
## Troubleshooting

### Common Issues

- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Diagnostic Procedures

1. **Check Service Status**
   ```bash
   systemctl status besu-validator
   ```

2. **Check Logs**
   ```bash
   journalctl -u besu-validator -f
   ```

3. **Check Network Connectivity**
   ```bash
   ping <node-ip>
   ```

4. **Check Node Health**
   ```bash
   ./scripts/health/check-node-health.sh <vmid>
   ```

---
## Emergency Procedures

### Emergency Access

**Break-glass Access:**
1. Use emergency SSH endpoint (if configured)
2. Access via Cloudflare Access (if available)
3. Physical console access (last resort)

**Emergency Contacts:**
- Infrastructure Team: [contact info]
- On-call Engineer: [contact info]

### Service Recovery

**Priority Order:**
1. Validators (critical for consensus)
2. RPC nodes (critical for access)
3. Monitoring (important for visibility)
4. Other services

**Recovery Steps:**
1. Identify failed service
2. Check service logs
3. Restart service
4. If restart fails, restore from backup
5. Verify service is operational
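
The five steps can be wrapped as a small checklist helper for a single systemd service; the service name is illustrative, and the function only prints the commands to run rather than executing them.

```shell
# Print the recovery checklist for one service; the name is illustrative.
recover_checklist() {
  svc="$1"
  echo "1. status:   systemctl status $svc"
  echo "2. logs:     journalctl -u $svc -n 100 --no-pager"
  echo "3. restart:  systemctl restart $svc"
  echo "4. fallback: restore from backup (see BACKUP_AND_RESTORE.md)"
  echo "5. verify:   systemctl is-active $svc"
}

recover_checklist besu-validator
```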

### Network Recovery

**Network Issues:**
1. Check ER605 router status
2. Check switch status
3. Check VLAN configuration
4. Check firewall rules
5. Test connectivity

**VLAN Issues:**
1. Verify VLAN configuration on switches
2. Verify VLAN configuration on ER605
3. Verify Proxmox bridge configuration
4. Test inter-VLAN routing

---

## Maintenance Windows

### Scheduled Maintenance

- **Weekly:** Health checks, log review
- **Monthly:** Security updates, configuration review
- **Quarterly:** Full system review, backup testing

### Maintenance Procedures

1. **Notify Stakeholders** - Send maintenance notification
2. **Create Snapshots** - Snapshot all containers before changes
3. **Perform Maintenance** - Execute maintenance tasks
4. **Verify Services** - Verify all services are operational
5. **Document Changes** - Document all changes made

### Maintenance procedures (Ongoing)

| Task | Frequency | Reference |
|------|-----------|-----------|
| Monitor explorer sync **(O-1)** | Daily 08:00 | Cron: `schedule-daily-weekly-cron.sh`; script: `daily-weekly-checks.sh daily` |
| Monitor RPC 2201 **(O-2)** | Daily 08:00 | Same cron/script |
| Config API uptime **(O-3)** | Weekly (Sun 09:00) | `daily-weekly-checks.sh weekly` |
| Review explorer logs **(O-4)** | Weekly | Runbook [138] above; `pct exec 5000 -- journalctl -u blockscout -n 200` or SSH to the Blockscout host |
| Update token list **(O-5)** | As needed | Runbook [139] above; `token-lists/lists/dbis-138.tokenlist.json`; [TOKEN_LIST_AUTHORING_GUIDE.md](../11-references/TOKEN_LIST_AUTHORING_GUIDE.md) |
| NPMplus backup | When NPMplus is up | `scripts/verify/backup-npmplus.sh` |
| Validator key/config backup | Per backup policy | W1-8; [BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md) |
| Start firefly-ali-1 (6201) | Optional, when needed | `scripts/maintenance/start-firefly-6201.sh` (r630-02) |

---
## Related Documentation

### Troubleshooting
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions - **Start here for problems**
- **[QBFT_TROUBLESHOOTING.md](../09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Architecture & Design
- **[NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)** - Network architecture (incl. §7 VMID/network table — service connectivity)
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment guide
- **[VMID_ALLOCATION_FINAL.md](../02-architecture/VMID_ALLOCATION_FINAL.md)** - VMID allocation
- **[MISSING_CONTAINERS_LIST.md](MISSING_CONTAINERS_LIST.md)** - Missing containers and IP assignments

### Configuration
- **[ER605_ROUTER_CONFIGURATION.md](../04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup
- **[SECRETS_KEYS_CONFIGURATION.md](../04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets management

### Deployment
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](../07-ccip/CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment
- **[DEPLOYMENT_READINESS.md](../03-deployment/DEPLOYMENT_READINESS.md)** - Deployment readiness
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status

### Monitoring
- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index

---

**Document Status:** Active
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2026-02-18