proxmox/docs/03-deployment/OPERATIONAL_RUNBOOKS.md


Operational Runbooks - Master Index

Navigation: Home > Deployment > Operational Runbooks

Last Updated: 2026-02-18
Document Version: 1.3
Status: Active Documentation


Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment. For issue-specific troubleshooting (RPC, QBFT, SSH, tunnel, etc.), see ../09-troubleshooting/README.md and TROUBLESHOOTING_FAQ.md.


Quick Reference

Emergency Procedures

VM/Container Restart

To restart all stopped containers across Proxmox hosts via SSH:

# From project root; source config for host IPs
source config/ip-addresses.conf

# List stopped per host
for host in $PROXMOX_HOST_ML110 $PROXMOX_HOST_R630_01 $PROXMOX_HOST_R630_02; do
  ssh root@$host "pct list | awk '\$2==\"stopped\" {print \$1}'"
done

# Start each (replace HOST and VMID)
ssh root@HOST "pct start VMID"
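
The listing and start steps above can be combined into one pass. A sketch, assuming the same passwordless root SSH and host variables sourced from config/ip-addresses.conf; the function name is ours, not an existing script:

```shell
# Sketch: start every stopped LXC container on each Proxmox host.
# Assumes passwordless SSH as root and the host variables sourced
# from config/ip-addresses.conf (as in the snippet above).
start_stopped_cts() {
  local host vmid
  for host in "$@"; do
    # List VMIDs whose status column reads "stopped", then start each.
    for vmid in $(ssh root@"$host" "pct list | awk '\$2==\"stopped\" {print \$1}'"); do
      echo "Starting CT $vmid on $host"
      ssh root@"$host" "pct start $vmid"
    done
  done
}

# Usage (after sourcing config/ip-addresses.conf):
# start_stopped_cts "$PROXMOX_HOST_ML110" "$PROXMOX_HOST_R630_01" "$PROXMOX_HOST_R630_02"
```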

Verification: scripts/verify/verify-backend-vms.sh | Report: VM_RESTART_AND_VERIFICATION_20260203.md

CT 2301 corrupted rootfs: If besu-rpc-private-1 (on ml110) fails to start with a pre-start hook error, run scripts/fix-ct-2301-corrupted-rootfs.sh

Common Operations


Network Operations

ER605 Router Configuration

  • ER605_ROUTER_CONFIGURATION.md - Complete router configuration guide
  • VLAN Configuration - Setting up VLANs on ER605
  • NAT Pool Configuration - Configuring role-based egress NAT
  • Failover Configuration - Setting up WAN failover

VLAN Management

  • VLAN Migration - Migrating from flat LAN to VLANs
  • VLAN Troubleshooting - Common VLAN issues and solutions
  • Inter-VLAN Routing - Configuring routing between VLANs

Edge and DNS (Fastly / Direct to NPMplus)

Cloudflare (DNS and optional Access)

  • CLOUDFLARE_ZERO_TRUST_GUIDE.md - Cloudflare setup (DNS retained; Option B tunnel for RPC only)
  • Application Publishing - Publishing applications via Cloudflare Access (optional)
  • Access Policy Management - Managing access policies

Smart Accounts (Chain 138 / ERC-4337)

  • Location: smom-dbis-138/script/smart-accounts/DeploySmartAccountsKit.s.sol
  • Env (required for deploy/use): PRIVATE_KEY, RPC_URL_138. Optional: ENTRY_POINT, SMART_ACCOUNT_FACTORY, PAYMASTER — set to deployed addresses to use existing contracts; otherwise deploy EntryPoint (ERC-4337), AccountFactory (e.g. MetaMask Smart Accounts Kit), and optionally Paymaster, then set in .env and re-run.
  • Run: forge script script/smart-accounts/DeploySmartAccountsKit.s.sol --rpc-url $RPC_URL_138 --broadcast (from smom-dbis-138). If addresses are in env, script logs them; else it logs next steps.
  • See: PLACEHOLDERS_AND_TBD.md — Smart Accounts Kit.
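
A minimal .env sketch for the deploy above. The values shown are placeholders (substitute your own); the optional variables stay commented out until you have deployed addresses to reuse:

```shell
# .env sketch for smom-dbis-138 (placeholder values -- substitute your own).
PRIVATE_KEY=0xYOUR_DEPLOYER_KEY          # deployer EOA key (never commit)
RPC_URL_138=http://192.168.11.221:8545   # Chain 138 RPC (VMID 2201)

# Optional: set only after the contracts exist, to reuse instead of redeploy.
# ENTRY_POINT=0x...
# SMART_ACCOUNT_FACTORY=0x...
# PAYMASTER=0x...
```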

Besu Operations

Node Management

Adding a Validator

Prerequisites:

  • Validator key generated
  • VMID allocated (1000-1499 range)
  • VLAN 110 configured (if migrated)

Steps:

  1. Create LXC container with VMID
  2. Install Besu
  3. Configure validator key
  4. Add to static-nodes.json on all nodes
  5. Update allowlist (if using permissioning)
  6. Start Besu service
  7. Verify validator is participating
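
The steps above can be sketched as shell, run from a machine with SSH to the Proxmox host. The template path, bridge, and hostname pattern are assumptions; match them to your existing validator containers before running anything:

```shell
# Sketch only: the validator-add steps as commands. Template path,
# storage, and network settings are assumptions -- align them with
# your existing validator CTs first.
VMID=1005                      # next free ID in the 1000-1499 range
HOST=192.168.11.11             # Proxmox host for the new CT (assumed)

add_validator() {
  ssh root@"$HOST" "pct create $VMID local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
    --hostname besu-validator-$VMID --net0 name=eth0,bridge=vmbr0,tag=110,ip=dhcp"
  ssh root@"$HOST" "pct start $VMID"
  # Then: install Besu, place the validator key, add the new enode to
  # static-nodes.json on all nodes, update the allowlist, start the
  # service, and confirm the validator participates (see step 7).
}
```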

See: VALIDATED_SET_DEPLOYMENT_GUIDE.md

Removing a Validator

Prerequisites:

  • Validator is not critical (check quorum requirements)
  • Backup validator key

Steps:

  1. Stop Besu service
  2. Remove from static-nodes.json on all nodes
  3. Update allowlist (if using permissioning)
  4. Remove container (optional)
  5. Document removal
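
The quorum prerequisite above can be checked before stopping anything, using the standard QBFT RPC method. The endpoint is an example; point it at any live RPC node:

```shell
# Count current validators via QBFT RPC before removing one.
# The RPC endpoint is an assumption -- use any live RPC node.
check_quorum() {
  local rpc=${1:-http://192.168.11.221:8545}
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"qbft_getValidatorsByBlockNumber","params":["latest"],"id":1}' \
    "$rpc"
  # QBFT tolerates f faulty validators out of n = 3f + 1; with five
  # validators, one removal keeps quorum, two simultaneous losses break it.
}
```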

Upgrading Besu

Prerequisites:

  • Backup current configuration
  • Test upgrade in dev environment
  • Create snapshot before upgrade

Steps:

  1. Create snapshot: pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)
  2. Stop Besu service
  3. Backup configuration and keys
  4. Install new Besu version
  5. Update configuration if needed
  6. Start Besu service
  7. Verify node is syncing
  8. Monitor for issues

Rollback:

  • If issues occur: pct rollback <vmid> pre-upgrade-YYYYMMDD
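
The snapshot, stop, start, verify, and rollback steps can be gathered into one helper run on the Proxmox host. The service name besu-validator and the use of pct exec are assumptions; match them to your unit names:

```shell
# Sketch: snapshot -> stop -> upgrade -> verify cycle for one CT.
# Run on the Proxmox host; the besu-validator unit name is assumed.
upgrade_besu_ct() {
  local vmid=$1
  local snap="pre-upgrade-$(date +%Y%m%d)"
  pct snapshot "$vmid" "$snap"                       # step 1
  pct exec "$vmid" -- systemctl stop besu-validator  # step 2
  # Steps 3-5: back up config/keys, install the new Besu build,
  # and adjust the config if the release notes require it.
  pct exec "$vmid" -- systemctl start besu-validator # step 6
  pct exec "$vmid" -- journalctl -u besu-validator -n 50 --no-pager  # steps 7-8
  echo "Rollback if needed: pct rollback $vmid $snap"
}
```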

Node list deploy and verify (static-nodes.json / permissions-nodes.toml)

Canonical source: config/besu-node-lists/ (single source of truth; 30 nodes in allowlist after 203/204 removed; 32 Besu nodes total).

  • Deploy to all nodes: scripts/deploy-besu-node-lists-to-all.sh (optionally --dry-run). Pushes static-nodes.json and permissions-nodes.toml to /etc/besu/ on every validator, sentry, and RPC (VMIDs 1000-1004, 1500-1508, 2101, 2102, 2201, 2301, 2303-2308, 2400-2403, 2500-2505).
  • Verify presence and match canonical: scripts/verify/verify-static-permissions-on-all-besu-nodes.sh --checksum.
  • Restart Besu to reload lists: scripts/besu/restart-besu-reload-node-lists.sh (optional; lists are read at startup).
  • Full-mesh peering (all 32 nodes): Every node needs max-peers=32. Repo configs updated; to apply on running nodes run scripts/maintenance/set-all-besu-max-peers-32.sh then restart. See 08-monitoring/PEER_CONNECTIONS_PLAN.md.
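
For a manual spot-check of one node against the canonical source (the verify script above does this fleet-wide), a hedged sketch run from the repo root; the helper name and SSH/pct access pattern are assumptions:

```shell
# Spot-check: compare the canonical static-nodes.json checksum against
# one node's deployed copy. Run from the repo root; assumes SSH to the
# Proxmox host carrying the CT.
check_node_lists() {
  local vmid=$1 host=$2
  local want have
  want=$(sha256sum config/besu-node-lists/static-nodes.json | awk '{print $1}')
  have=$(ssh root@"$host" "pct exec $vmid -- sha256sum /etc/besu/static-nodes.json" | awk '{print $1}')
  if [ "$want" = "$have" ]; then
    echo "CT $vmid: OK"
  else
    echo "CT $vmid: MISMATCH (redeploy node lists)"
  fi
}
```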

See: 06-besu/BESU_NODES_FILE_REFERENCE.md, 08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.

RPC block production (chain 138 / current block)

If an RPC node returns the wrong chain ID, or reports block 0 / no block production: use the dedicated runbook for status checks and common fixes (host-allowlist, tx-pool-min-score, permissions/static-nodes paths, discovery, Besu binary/genesis).
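
A quick status probe covering both symptoms, using the standard eth_chainId and eth_blockNumber methods; the endpoint is an example, substitute the node under test:

```shell
# Probe an RPC node: chain ID and head block in one pass.
rpc_status() {
  local rpc=${1:-http://192.168.11.221:8545} method
  for method in eth_chainId eth_blockNumber; do
    printf '%s: ' "$method"
    curl -s -X POST -H "Content-Type: application/json" \
      -d "{\"jsonrpc\":\"2.0\",\"method\":\"$method\",\"params\":[],\"id\":1}" "$rpc"
    echo
  done
  # Expect chain ID 0x8a (138); a result of 0x0 or an error points to
  # the fixes listed in the dedicated runbook.
}
```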

Allowlist Management

Common Operations:

  • Generate allowlist from nodekeys
  • Update allowlist on all nodes
  • Verify allowlist is correct
  • Troubleshoot allowlist issues

Consensus Troubleshooting


Liquidity & Multi-Chain (cUSDT/cUSDC)


GRU M1 Listing Operations

GRU M1 Listing Dry-Run

See also: docs/gru-m1/


Blockscout & Contract Verification

Blockscout (VMID 5000)

Forge Contract Verification

Forge verify-contract fails against Blockscout with "Params 'module' and 'action' are required". Use the dedicated proxy.

Preferred (orchestrated; starts proxy if needed):

source smom-dbis-138/.env 2>/dev/null
./scripts/verify/run-contract-verification-with-proxy.sh

Manual (proxy + verify):

  1. Start proxy: BLOCKSCOUT_URL=http://192.168.11.140:4000 node forge-verification-proxy/server.js
  2. Run: ./scripts/verify-contracts-blockscout.sh

Alternative: Nginx fix (scripts/fix-blockscout-forge-verification.sh) or manual verification at https://explorer.d-bis.org/address/#verify-contract
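
Once the proxy is up, the forge invocation routed through it looks roughly like this. The proxy URL, contract address, and source path are placeholders for your deployment; only the --verifier and --verifier-url flags are standard forge options:

```shell
# Hypothetical invocation through the verification proxy. Arguments:
# proxy URL, deployed address, and source path (e.g. src/Token.sol:Token).
verify_via_proxy() {
  local proxy_url=$1 address=$2 contract=$3
  forge verify-contract "$address" "$contract" \
    --verifier blockscout \
    --verifier-url "$proxy_url/api" \
    --rpc-url "$RPC_URL_138"
}
```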


CCIP Operations

CCIP Relay Service (Chain 138 → Mainnet)

Status: Deployed on r630-01 (192.168.11.11) at /opt/smom-dbis-138/services/relay

Quick commands:

# View logs
ssh root@192.168.11.11 "tail -f /opt/smom-dbis-138/services/relay/relay-service.log"

# Restart
ssh root@192.168.11.11 "pkill -f 'node index.js' 2>/dev/null; sleep 2; cd /opt/smom-dbis-138/services/relay && nohup ./start-relay.sh >> relay-service.log 2>&1 &"

Configuration: Uses RPC_URL_138_PUBLIC (VMID 2201, 192.168.11.221:8545) for Chain 138; START_BLOCK=latest.

CCIP Deployment

WETH9 Bridge (Chain 138) Router mismatch fix: Run scripts/deploy-and-configure-weth9-bridge-chain138.sh (requires PRIVATE_KEY); then set CCIPWETH9_BRIDGE_CHAIN138 to the printed address. Deploy scripts now default to working CCIP router (0x8078A...). See 07-ccip/README.md, COMPREHENSIVE_STATUS_BRIDGE_READY.md, scripts/README.md.

Deployment Phases:

  1. Deploy Ops/Admin nodes (5400-5401)
  2. Deploy Monitoring nodes (5402-5403)
  3. Deploy Commit nodes (5410-5425)
  4. Deploy Execute nodes (5440-5455)
  5. Deploy RMN nodes (5470-5476)

CCIP Node Management

  • Adding CCIP Node - Add new CCIP node to fleet
  • Removing CCIP Node - Remove CCIP node from fleet
  • CCIP Node Troubleshooting - Common CCIP issues

Admin Runner (Scripts / MCP) — Phase 4.4

Purpose: Run admin scripts and MCP tooling with central audit (who ran what, when, outcome). Design and implementation are deferred until the infra admin view is built.

  • Design: Runner service or wrapper that (1) authenticates (e.g. JWT or API key), (2) executes script/MCP action, (3) appends to central audit (dbis_core POST /api/admin/central/audit) with actor, action, resource, outcome.
  • Docs: MASTER_PLAN.md §4.4; admin-console-frontend-plan.md.
  • When: Implement with org-level panel and infra admin view.

Phase 2 & 3 Deployment (Infrastructure)

Phase 2 — Monitoring stack: Deploy Prometheus, Grafana, Loki, Alertmanager; configure Cloudflare Access; enable health-check alerting. See MONITORING_SUMMARY.md, MASTER_PLAN.md §5.

Phase 2 — Security: SSH key-based auth (disable password); firewall Proxmox API (port 8006); secure validator keys; audits VLT-024, ISO-024; bridge integrations BRG-VLT, BRG-ISO. See SECRETS_KEYS_CONFIGURATION.md, IMPLEMENTATION_CHECKLIST.md.

Phase 2 — Backups: Automated backup script; encrypted validator keys; NPMplus backup (NPM_PASSWORD); config backup. See BACKUP_AND_RESTORE.md, scripts/backup-proxmox-configs.sh, scripts/verify/backup-npmplus.sh.

Phase 3 — CCIP fleet: Ops/Admin nodes (5400-5401), commit/execute/RMN nodes, NAT pools. See CCIP_DEPLOYMENT_SPEC.md, OPERATIONAL_RUNBOOKS.md § CCIP Operations.

Phase 4 — Sovereign tenants (docs/runbook): VLANs 200-203 (Phoenix Sovereign Cloud Band), Block #6 egress NAT, tenant isolation. Script: scripts/deployment/phase4-sovereign-tenants.sh [--show-steps|--dry-run]. Docs: ORCHESTRATION_DEPLOYMENT_GUIDE.md § Phase 4, NETWORK_ARCHITECTURE.md (VLAN 200-203), UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md (sovereign tenant isolation rules).


Monitoring & Observability

Monitoring Setup

Components:

  • Prometheus metrics collection
  • Grafana dashboards
  • Loki log aggregation
  • Alertmanager alerting

Health Checks

  • Node Health Checks - Check individual node health
  • Service Health Checks - Check service status
  • Network Health Checks - Check network connectivity

Scripts:

  • check-node-health.sh - Node health check script
  • check-service-status.sh - Service status check

Backup & Recovery

Backup Procedures

  • Configuration Backup - Backup all configuration files
  • Validator Key Backup - Encrypted backup of validator keys
  • Container Backup - Backup container configurations

Automated Backups:

  • Scheduled daily backups
  • Encrypted storage
  • Multiple locations
  • 30-day retention
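
The scheduled container backups above can be sketched with Proxmox's vzdump, run on each host. The storage name "backup-store" and the VMID list are assumptions; adjust to your storage config and fleet:

```shell
# Sketch: snapshot-mode vzdump of the validator CTs (run on the
# Proxmox host). Storage name and VMID list are assumptions.
backup_validators() {
  local vmid
  for vmid in 1000 1001 1002 1003 1004; do
    vzdump "$vmid" --mode snapshot --compress zstd --storage backup-store
  done
}
# The 30-day retention can also be enforced by vzdump itself,
# e.g. --prune-backups keep-daily=30; encrypted off-host copies are
# handled per BACKUP_AND_RESTORE.md.
```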

Disaster Recovery

  • Service Recovery - Recover failed services
  • Network Recovery - Recover network connectivity
  • Full System Recovery - Complete system recovery

Recovery Procedures:

  1. Identify failure point
  2. Restore from backup
  3. Verify service status
  4. Monitor for issues

Maintenance (ALL_IMPROVEMENTS 135-139)

  • 135 Monitor explorer sync status (Daily): curl -s http://192.168.11.140:4000/api/v1/stats
  • 136 Monitor RPC node health, e.g. VMID 2201 (Daily): bash scripts/verify/verify-backend-vms.sh; curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' $RPC_URL_138_PUBLIC (or http://192.168.11.221:8545)
  • 137 Check config API uptime (Weekly): curl -sI https://dbis-api.d-bis.org/health or the target config API URL
  • 138 Review explorer logs, O-4 (Weekly): see O-4 below. ssh root@<explorer-host> "journalctl -u blockscout -n 200 --no-pager" or pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager. Explorer: VMID 5000 (r630-02, 192.168.11.140).
  • 139 Update token list, O-5 (As needed): see O-5 below. Canonical list: token-lists/lists/dbis-138.tokenlist.json. Guide: TOKEN_LIST_AUTHORING_GUIDE.md. Bump version and timestamp; validate schema; deploy/public URL per runbook.

O-4 (Review explorer logs, weekly): Run weekly or after incidents. From a host with SSH to the Blockscout node: ssh root@192.168.11.XX "journalctl -u blockscout -n 200 --no-pager" (replace with actual Proxmox/container host for VMID 5000), or from Proxmox host: pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager. Check for indexer errors, DB connection issues, OOM.

O-5 (Update token list, as needed): Edit token-lists/lists/dbis-138.tokenlist.json; bump version.major|minor|patch and timestamp; run validation (see TOKEN_LIST_AUTHORING_GUIDE); update any public URL (e.g. tokens.d-bis.org) and explorer/config API token list reference.
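
The O-5 version bump can be sketched with jq (assumed installed); the helper name is ours, and schema validation per the guide still follows:

```shell
# Sketch: bump the token list patch version and refresh the timestamp.
bump_tokenlist() {
  local f=token-lists/lists/dbis-138.tokenlist.json
  jq '.version.patch += 1 | .timestamp = (now | todate)' "$f" > "$f.tmp" \
    && mv "$f.tmp" "$f"
}
# Follow with schema validation and a diff review before deploying,
# per TOKEN_LIST_AUTHORING_GUIDE.md.
```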

Script: scripts/maintenance/daily-weekly-checks.sh [daily|weekly|all] — daily: explorer, RPC, indexer lag, in-CT disk (138b); weekly: config API, thin pool all hosts (138a), fstrim (138c), journal vacuum (138d). Cron: schedule-daily-weekly-cron.sh --install (daily 08:00, weekly Sun 09:00). Storage: schedule-storage-growth-cron.sh --install (collect every 6h, prune snapshots+history Sun 08:00); schedule-storage-monitor-cron.sh --install (host alerts daily 07:00). See 04-configuration/STORAGE_GROWTH_AND_HEALTH.md.

When decommissioning or changing RPC nodes

Explorer (VMID 5000) depends on RPC at ETHEREUM_JSONRPC_HTTP_URL (use RPC_URL_138_PUBLIC = VMID 2201, 192.168.11.221:8545). When you decommission or change the IP of an RPC node that Blockscout might use:

  1. Check Blockscout env on VM 5000:
    pct exec 5000 -- bash -c 'grep -E "ETHEREUM_JSONRPC|RPC" /opt/blockscout/.env 2>/dev/null || docker inspect blockscout 2>/dev/null | grep -A5 Env' (run from root@r630-02, 192.168.11.12).
  2. If it points to the affected node, update to a live RPC (set to $RPC_URL_138_PUBLIC or http://192.168.11.221:8545) in Blockscout env and restart Blockscout.
  3. Update any script defaults and config/ip-addresses.conf / docs that reference the old RPC.

See BLOCKSCOUT_FIX_RUNBOOK.md § "Proactive: When changing RPC or decommissioning nodes" and SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md.

After NPMplus or DNS changes

Run E2E routing (includes explorer.d-bis.org):
bash scripts/verify/verify-end-to-end-routing.sh --profile=public

After frontend or Blockscout deploy

From a host on LAN that can reach 192.168.11.140, run full explorer E2E:
bash explorer-monorepo/scripts/e2e-test-explorer.sh

Before/after Blockscout version or config change

Run migrations (SSL-disabled DB URL):
bash scripts/fix-blockscout-ssl-and-migrations.sh (on Proxmox host r630-02 or via SSH).
See BLOCKSCOUT_FIX_RUNBOOK.md.


Security Operations

Key Management

  • SECRETS_KEYS_CONFIGURATION.md - Secrets and keys management
  • Validator Key Rotation - Rotate validator keys
  • API Token Rotation - Rotate API tokens

Access Control (Phase 2 — Security)

  • SSH key-based auth; disable password auth: On each Proxmox host and key VMs: sudo sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config; sudo systemctl reload sshd. Ensure SSH keys are deployed first. See IMPLEMENTATION_CHECKLIST.md. Scripts: scripts/security/setup-ssh-key-auth.sh [--dry-run|--apply].
  • Firewall: restrict Proxmox API (port 8006): Allow only admin IPs. Example (iptables): iptables -A INPUT -p tcp --dport 8006 -s <ADMIN_CIDR> -j ACCEPT; iptables -A INPUT -p tcp --dport 8006 -j DROP. Or use Proxmox firewall / UDM Pro rules. Script: scripts/security/firewall-proxmox-8006.sh [--dry-run|--apply] [CIDR]. Document in NETWORK_ARCHITECTURE.md.
  • Secure validator keys (W1-19): On Proxmox host as root: scripts/secure-validator-keys.sh [--dry-run] — chmod 600/700, chown besu:besu on VMIDs 1000-1004.
  • Cloudflare Access - Manage Cloudflare Access policies

Troubleshooting

Common Issues

Diagnostic Procedures

  1. Check Service Status

    systemctl status besu-validator
    
  2. Check Logs

    journalctl -u besu-validator -f
    
  3. Check Network Connectivity

    ping <node-ip>
    
  4. Check Node Health

    ./scripts/health/check-node-health.sh <vmid>
    

Emergency Procedures

Emergency Access

Break-glass Access:

  1. Use emergency SSH endpoint (if configured)
  2. Access via Cloudflare Access (if available)
  3. Physical console access (last resort)

Emergency Contacts:

  • Infrastructure Team: [contact info]
  • On-call Engineer: [contact info]

Service Recovery

Priority Order:

  1. Validators (critical for consensus)
  2. RPC nodes (critical for access)
  3. Monitoring (important for visibility)
  4. Other services

Recovery Steps:

  1. Identify failed service
  2. Check service logs
  3. Restart service
  4. If restart fails, restore from backup
  5. Verify service is operational

Network Recovery

Network Issues:

  1. Check ER605 router status
  2. Check switch status
  3. Check VLAN configuration
  4. Check firewall rules
  5. Test connectivity

VLAN Issues:

  1. Verify VLAN configuration on switches
  2. Verify VLAN configuration on ER605
  3. Verify Proxmox bridge configuration
  4. Test inter-VLAN routing

Maintenance Windows

Scheduled Maintenance

  • Weekly: Health checks, log review
  • Monthly: Security updates, configuration review
  • Quarterly: Full system review, backup testing

Maintenance Procedures

  1. Notify Stakeholders - Send maintenance notification
  2. Create Snapshots - Snapshot all containers before changes
  3. Perform Maintenance - Execute maintenance tasks
  4. Verify Services - Verify all services are operational
  5. Document Changes - Document all changes made

Maintenance procedures (Ongoing)

  • Monitor explorer sync, O-1 (Daily 08:00): cron schedule-daily-weekly-cron.sh; script daily-weekly-checks.sh daily
  • Monitor RPC 2201, O-2 (Daily 08:00): same cron/script
  • Config API uptime, O-3 (Weekly, Sun 09:00): daily-weekly-checks.sh weekly
  • Review explorer logs, O-4 (Weekly): runbook item 138 above; pct exec 5000 -- journalctl -u blockscout -n 200 or SSH to the Blockscout host
  • Update token list, O-5 (As needed): runbook item 139 above; token-lists/lists/dbis-138.tokenlist.json; TOKEN_LIST_AUTHORING_GUIDE.md
  • NPMplus backup (when NPMplus is up): scripts/verify/backup-npmplus.sh
  • Validator key/config backup (per backup policy): W1-8; BACKUP_AND_RESTORE.md
  • Start firefly-ali-1, VMID 6201 (optional, when needed): scripts/maintenance/start-firefly-6201.sh (r630-02)

Troubleshooting

Architecture & Design

Configuration

Deployment

Monitoring

Reference


Document Status: Active
Maintained By: Infrastructure Team
Review Cycle: Monthly
Last Updated: 2026-02-18