proxmox/docs/03-deployment/OPERATIONAL_RUNBOOKS.md


Operational Runbooks - Master Index

Navigation: Home > Deployment > Operational Runbooks

Last Updated: 2026-02-18
Document Version: 1.3
Status: Active Documentation


Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment. For issue-specific troubleshooting (RPC, QBFT, SSH, tunnel, etc.), see ../09-troubleshooting/README.md and TROUBLESHOOTING_FAQ.md.


Quick Reference

Emergency Procedures

VM/Container Restart

To restart all stopped containers across Proxmox hosts via SSH:

# From project root; source config for host IPs
source config/ip-addresses.conf

# List stopped per host
for host in $PROXMOX_HOST_ML110 $PROXMOX_HOST_R630_01 $PROXMOX_HOST_R630_02; do
  ssh root@$host "pct list | awk '\$2==\"stopped\" {print \$1}'"
done

# Start each (replace HOST and VMID)
ssh root@HOST "pct start VMID"
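
The listing and start steps above can be combined into one pass. A sketch, assuming the same passwordless root SSH and host variables sourced from config/ip-addresses.conf; the function name is ours, not an existing script:

```shell
# Sketch: start every stopped LXC container on each Proxmox host.
# Assumes passwordless SSH as root and the host variables sourced
# from config/ip-addresses.conf (as in the snippet above).
start_stopped_cts() {
  local host vmid
  for host in "$@"; do
    # List VMIDs whose status column reads "stopped", then start each.
    for vmid in $(ssh root@"$host" "pct list | awk '\$2==\"stopped\" {print \$1}'"); do
      echo "Starting CT $vmid on $host"
      ssh root@"$host" "pct start $vmid"
    done
  done
}

# Usage (after sourcing config/ip-addresses.conf):
# start_stopped_cts "$PROXMOX_HOST_ML110" "$PROXMOX_HOST_R630_01" "$PROXMOX_HOST_R630_02"
```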

Verification: scripts/verify/verify-backend-vms.sh | Report: VM_RESTART_AND_VERIFICATION_20260203.md

CT 2301 corrupted rootfs: If besu-rpc-private-1 (on ml110) fails to start with a pre-start hook error, run scripts/fix-ct-2301-corrupted-rootfs.sh

Common Operations


Network Operations

ER605 Router Configuration

  • ER605_ROUTER_CONFIGURATION.md - Complete router configuration guide
  • VLAN Configuration - Setting up VLANs on ER605
  • NAT Pool Configuration - Configuring role-based egress NAT
  • Failover Configuration - Setting up WAN failover

VLAN Management

  • VLAN Migration - Migrating from flat LAN to VLANs
  • VLAN Troubleshooting - Common VLAN issues and solutions
  • Inter-VLAN Routing - Configuring routing between VLANs

Edge and DNS (Fastly / Direct to NPMplus)

Cloudflare (DNS and optional Access)

  • CLOUDFLARE_ZERO_TRUST_GUIDE.md - Cloudflare setup (DNS retained; Option B tunnel for RPC only)
  • Application Publishing - Publishing applications via Cloudflare Access (optional)
  • Access Policy Management - Managing access policies

Smart Accounts (Chain 138 / ERC-4337)

  • Location: smom-dbis-138/script/smart-accounts/DeploySmartAccountsKit.s.sol
  • Env (required for deploy/use): PRIVATE_KEY, RPC_URL_138. Optional: ENTRY_POINT, SMART_ACCOUNT_FACTORY, PAYMASTER — set to deployed addresses to use existing contracts; otherwise deploy EntryPoint (ERC-4337), AccountFactory (e.g. MetaMask Smart Accounts Kit), and optionally Paymaster, then set in .env and re-run.
  • Run: forge script script/smart-accounts/DeploySmartAccountsKit.s.sol --rpc-url $RPC_URL_138 --broadcast (from smom-dbis-138). If addresses are in env, script logs them; else it logs next steps.
  • See: PLACEHOLDERS_AND_TBD.md — Smart Accounts Kit.
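
A minimal .env sketch for the deploy above. The values shown are placeholders (substitute your own); the optional variables stay commented out until you have deployed addresses to reuse:

```shell
# .env sketch for smom-dbis-138 (placeholder values -- substitute your own).
PRIVATE_KEY=0xYOUR_DEPLOYER_KEY          # deployer EOA key (never commit)
RPC_URL_138=http://192.168.11.221:8545   # Chain 138 RPC (VMID 2201)

# Optional: set only after the contracts exist, to reuse instead of redeploy.
# ENTRY_POINT=0x...
# SMART_ACCOUNT_FACTORY=0x...
# PAYMASTER=0x...
```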

Besu Operations

Node Management

Adding a Validator

Prerequisites:

  • Validator key generated
  • VMID allocated (1000-1499 range)
  • VLAN 110 configured (if migrated)

Steps:

  1. Create LXC container with VMID
  2. Install Besu
  3. Configure validator key
  4. Add to static-nodes.json on all nodes
  5. Update allowlist (if using permissioning)
  6. Start Besu service
  7. Verify validator is participating
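
The steps above can be sketched as shell, run from a machine with SSH to the Proxmox host. The template path, bridge, and hostname pattern are assumptions; match them to your existing validator containers before running anything:

```shell
# Sketch only: the validator-add steps as commands. Template path,
# storage, and network settings are assumptions -- align them with
# your existing validator CTs first.
VMID=1005                      # next free ID in the 1000-1499 range
HOST=192.168.11.11             # Proxmox host for the new CT (assumed)

add_validator() {
  ssh root@"$HOST" "pct create $VMID local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
    --hostname besu-validator-$VMID --net0 name=eth0,bridge=vmbr0,tag=110,ip=dhcp"
  ssh root@"$HOST" "pct start $VMID"
  # Then: install Besu, place the validator key, add the new enode to
  # static-nodes.json on all nodes, update the allowlist, start the
  # service, and confirm the validator participates (see step 7).
}
```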

See: VALIDATED_SET_DEPLOYMENT_GUIDE.md

Removing a Validator

Prerequisites:

  • Validator is not critical (check quorum requirements)
  • Backup validator key

Steps:

  1. Stop Besu service
  2. Remove from static-nodes.json on all nodes
  3. Update allowlist (if using permissioning)
  4. Remove container (optional)
  5. Document removal
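
The quorum prerequisite above can be checked before stopping anything, using the standard QBFT RPC method. The endpoint is an example; point it at any live RPC node:

```shell
# Count current validators via QBFT RPC before removing one.
# The RPC endpoint is an assumption -- use any live RPC node.
check_quorum() {
  local rpc=${1:-http://192.168.11.221:8545}
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"qbft_getValidatorsByBlockNumber","params":["latest"],"id":1}' \
    "$rpc"
  # QBFT tolerates f faulty validators out of n = 3f + 1; with five
  # validators, one removal keeps quorum, two simultaneous losses break it.
}
```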

Upgrading Besu

Prerequisites:

  • Backup current configuration
  • Test upgrade in dev environment
  • Create snapshot before upgrade

Steps:

  1. Create snapshot: pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)
  2. Stop Besu service
  3. Backup configuration and keys
  4. Install new Besu version
  5. Update configuration if needed
  6. Start Besu service
  7. Verify node is syncing
  8. Monitor for issues

Rollback:

  • If issues occur: pct rollback <vmid> pre-upgrade-YYYYMMDD
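
The snapshot, stop, start, verify, and rollback steps can be gathered into one helper run on the Proxmox host. The service name besu-validator and the use of pct exec are assumptions; match them to your unit names:

```shell
# Sketch: snapshot -> stop -> upgrade -> verify cycle for one CT.
# Run on the Proxmox host; the besu-validator unit name is assumed.
upgrade_besu_ct() {
  local vmid=$1
  local snap="pre-upgrade-$(date +%Y%m%d)"
  pct snapshot "$vmid" "$snap"                       # step 1
  pct exec "$vmid" -- systemctl stop besu-validator  # step 2
  # Steps 3-5: back up config/keys, install the new Besu build,
  # and adjust the config if the release notes require it.
  pct exec "$vmid" -- systemctl start besu-validator # step 6
  pct exec "$vmid" -- journalctl -u besu-validator -n 50 --no-pager  # steps 7-8
  echo "Rollback if needed: pct rollback $vmid $snap"
}
```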

Node list deploy and verify (static-nodes.json / permissions-nodes.toml)

Canonical source: config/besu-node-lists/ (single source of truth; 30 nodes in allowlist after 203/204 removed; 32 Besu nodes total).

  • Deploy to all nodes: scripts/deploy-besu-node-lists-to-all.sh (optionally --dry-run). Pushes static-nodes.json and permissions-nodes.toml to /etc/besu/ on every validator, sentry, and RPC (VMIDs 1000-1004, 1500-1508, 2101, 2102, 2201, 2301, 2303-2308, 2400-2403, 2500-2505).
  • Verify presence and match canonical: scripts/verify/verify-static-permissions-on-all-besu-nodes.sh --checksum.
  • Restart Besu to reload lists: scripts/besu/restart-besu-reload-node-lists.sh (optional; lists are read at startup).
  • Full-mesh peering (all 32 nodes): Every node needs max-peers=32. Repo configs updated; to apply on running nodes run scripts/maintenance/set-all-besu-max-peers-32.sh then restart. See 08-monitoring/PEER_CONNECTIONS_PLAN.md.
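
For a manual spot-check of one node against the canonical source (the verify script above does this fleet-wide), a hedged sketch run from the repo root; the helper name and SSH/pct access pattern are assumptions:

```shell
# Spot-check: compare the canonical static-nodes.json checksum against
# one node's deployed copy. Run from the repo root; assumes SSH to the
# Proxmox host carrying the CT.
check_node_lists() {
  local vmid=$1 host=$2
  local want have
  want=$(sha256sum config/besu-node-lists/static-nodes.json | awk '{print $1}')
  have=$(ssh root@"$host" "pct exec $vmid -- sha256sum /etc/besu/static-nodes.json" | awk '{print $1}')
  if [ "$want" = "$have" ]; then
    echo "CT $vmid: OK"
  else
    echo "CT $vmid: MISMATCH (redeploy node lists)"
  fi
}
```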

See: 06-besu/BESU_NODES_FILE_REFERENCE.md, 08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.

RPC block production (chain 138 / current block)

If an RPC node returns the wrong chain ID, or reports block 0 / no block production: use the dedicated runbook for status checks and common fixes (host-allowlist, tx-pool-min-score, permissions/static-nodes paths, discovery, Besu binary/genesis).
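
A quick status probe covering both symptoms, using the standard eth_chainId and eth_blockNumber methods; the endpoint is an example, substitute the node under test:

```shell
# Probe an RPC node: chain ID and head block in one pass.
rpc_status() {
  local rpc=${1:-http://192.168.11.221:8545} method
  for method in eth_chainId eth_blockNumber; do
    printf '%s: ' "$method"
    curl -s -X POST -H "Content-Type: application/json" \
      -d "{\"jsonrpc\":\"2.0\",\"method\":\"$method\",\"params\":[],\"id\":1}" "$rpc"
    echo
  done
  # Expect chain ID 0x8a (138); a result of 0x0 or an error points to
  # the fixes listed in the dedicated runbook.
}
```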

Allowlist Management

Common Operations:

  • Generate allowlist from nodekeys
  • Update allowlist on all nodes
  • Verify allowlist is correct
  • Troubleshoot allowlist issues

Consensus Troubleshooting


Liquidity & Multi-Chain (cUSDT/cUSDC)


GRU M1 Listing Operations

GRU M1 Listing Dry-Run

See also: docs/gru-m1/


Blockscout & Contract Verification

Blockscout (VMID 5000)

Forge Contract Verification

Forge verify-contract fails against Blockscout with "Params 'module' and 'action' are required". Use the dedicated proxy.

Preferred (orchestrated; starts proxy if needed):

source smom-dbis-138/.env 2>/dev/null
./scripts/verify/run-contract-verification-with-proxy.sh

Manual (proxy + verify):

  1. Start proxy: BLOCKSCOUT_URL=http://192.168.11.140:4000 node forge-verification-proxy/server.js
  2. Run: ./scripts/verify-contracts-blockscout.sh

Alternative: Nginx fix (scripts/fix-blockscout-forge-verification.sh) or manual verification at https://explorer.d-bis.org/address/#verify-contract
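
Once the proxy is up, the forge invocation routed through it looks roughly like this. The proxy URL, contract address, and source path are placeholders for your deployment; only the --verifier and --verifier-url flags are standard forge options:

```shell
# Hypothetical invocation through the verification proxy. Arguments:
# proxy URL, deployed address, and source path (e.g. src/Token.sol:Token).
verify_via_proxy() {
  local proxy_url=$1 address=$2 contract=$3
  forge verify-contract "$address" "$contract" \
    --verifier blockscout \
    --verifier-url "$proxy_url/api" \
    --rpc-url "$RPC_URL_138"
}
```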


CCIP Operations

CCIP Relay Service (Chain 138 → Mainnet)

Status: Deployed on r630-01 (192.168.11.11) at /opt/smom-dbis-138/services/relay

Quick commands:

# View logs
ssh root@192.168.11.11 "tail -f /opt/smom-dbis-138/services/relay/relay-service.log"

# Restart
ssh root@192.168.11.11 "pkill -f 'node index.js' 2>/dev/null; sleep 2; cd /opt/smom-dbis-138/services/relay && nohup ./start-relay.sh >> relay-service.log 2>&1 &"

Configuration: Uses RPC_URL_138_PUBLIC (VMID 2201, 192.168.11.221:8545) for Chain 138; START_BLOCK=latest.

CCIP Deployment

WETH9 Bridge (Chain 138) Router mismatch fix: Run scripts/deploy-and-configure-weth9-bridge-chain138.sh (requires PRIVATE_KEY); then set CCIPWETH9_BRIDGE_CHAIN138 to the printed address. Deploy scripts now default to working CCIP router (0x8078A...). See 07-ccip/README.md, COMPREHENSIVE_STATUS_BRIDGE_READY.md, scripts/README.md.

Deployment Phases:

  1. Deploy Ops/Admin nodes (5400-5401)
  2. Deploy Monitoring nodes (5402-5403)
  3. Deploy Commit nodes (5410-5425)
  4. Deploy Execute nodes (5440-5455)
  5. Deploy RMN nodes (5470-5476)

CCIP Node Management

  • Adding CCIP Node - Add new CCIP node to fleet
  • Removing CCIP Node - Remove CCIP node from fleet
  • CCIP Node Troubleshooting - Common CCIP issues

Admin Runner (Scripts / MCP) — Phase 4.4

Purpose: Run admin scripts and MCP tooling with central audit (who ran what, when, outcome). Design and implementation are deferred until the infra admin view is built.

  • Design: Runner service or wrapper that (1) authenticates (e.g. JWT or API key), (2) executes script/MCP action, (3) appends to central audit (dbis_core POST /api/admin/central/audit) with actor, action, resource, outcome.
  • Docs: MASTER_PLAN.md §4.4; admin-console-frontend-plan.md.
  • When: Implement with org-level panel and infra admin view.

Phase 2 & 3 Deployment (Infrastructure)

Phase 2 — Monitoring stack: Deploy Prometheus, Grafana, Loki, Alertmanager; configure Cloudflare Access; enable health-check alerting. See MONITORING_SUMMARY.md, MASTER_PLAN.md §5.

Phase 2 — Security: SSH key-based auth (disable password); firewall Proxmox API (port 8006); secure validator keys; audits VLT-024, ISO-024; bridge integrations BRG-VLT, BRG-ISO. See SECRETS_KEYS_CONFIGURATION.md, IMPLEMENTATION_CHECKLIST.md.

Phase 2 — Backups: Automated backup script; encrypted validator keys; NPMplus backup (NPM_PASSWORD); config backup. See BACKUP_AND_RESTORE.md, scripts/backup-proxmox-configs.sh, scripts/verify/backup-npmplus.sh.

Phase 3 — CCIP fleet: Ops/Admin nodes (5400-5401), commit/execute/RMN nodes, NAT pools. See CCIP_DEPLOYMENT_SPEC.md, OPERATIONAL_RUNBOOKS.md § CCIP Operations.

Phase 4 — Sovereign tenants (docs/runbook): VLANs 200-203 (Phoenix Sovereign Cloud Band), Block #6 egress NAT, tenant isolation. Script: scripts/deployment/phase4-sovereign-tenants.sh [--show-steps|--dry-run]. Docs: ORCHESTRATION_DEPLOYMENT_GUIDE.md § Phase 4, NETWORK_ARCHITECTURE.md (VLAN 200-203), UDM_PRO_FIREWALL_MANUAL_CONFIGURATION.md (sovereign tenant isolation rules).


Monitoring & Observability

Monitoring Setup

Components:

  • Prometheus metrics collection
  • Grafana dashboards
  • Loki log aggregation
  • Alertmanager alerting

Health Checks

  • Node Health Checks - Check individual node health
  • Service Health Checks - Check service status
  • Network Health Checks - Check network connectivity

Scripts:

  • check-node-health.sh - Node health check script
  • check-service-status.sh - Service status check

Backup & Recovery

Backup Procedures

  • Configuration Backup - Backup all configuration files
  • Validator Key Backup - Encrypted backup of validator keys
  • Container Backup - Backup container configurations

Automated Backups:

  • Scheduled daily backups
  • Encrypted storage
  • Multiple locations
  • 30-day retention
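
The scheduled container backups above can be sketched with Proxmox's vzdump, run on each host. The storage name "backup-store" and the VMID list are assumptions; adjust to your storage config and fleet:

```shell
# Sketch: snapshot-mode vzdump of the validator CTs (run on the
# Proxmox host). Storage name and VMID list are assumptions.
backup_validators() {
  local vmid
  for vmid in 1000 1001 1002 1003 1004; do
    vzdump "$vmid" --mode snapshot --compress zstd --storage backup-store
  done
}
# The 30-day retention can also be enforced by vzdump itself,
# e.g. --prune-backups keep-daily=30; encrypted off-host copies are
# handled per BACKUP_AND_RESTORE.md.
```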

Disaster Recovery

  • Service Recovery - Recover failed services
  • Network Recovery - Recover network connectivity
  • Full System Recovery - Complete system recovery

Recovery Procedures:

  1. Identify failure point
  2. Restore from backup
  3. Verify service status
  4. Monitor for issues

Maintenance (ALL_IMPROVEMENTS 135-139)

  • 135 Monitor explorer sync status (Daily): curl -s http://192.168.11.140:4000/api/v1/stats
  • 136 Monitor RPC node health, e.g. VMID 2201 (Daily): bash scripts/verify/verify-backend-vms.sh; curl -s -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' $RPC_URL_138_PUBLIC (or http://192.168.11.221:8545)
  • 137 Check config API uptime (Weekly): curl -sI https://dbis-api.d-bis.org/health or the target config API URL
  • 138 Review explorer logs, O-4 (Weekly): see O-4 below. ssh root@<explorer-host> "journalctl -u blockscout -n 200 --no-pager" or pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager. Explorer: VMID 5000 (r630-02, 192.168.11.140).
  • 139 Update token list, O-5 (As needed): see O-5 below. Canonical list: token-lists/lists/dbis-138.tokenlist.json. Guide: TOKEN_LIST_AUTHORING_GUIDE.md. Bump version and timestamp; validate schema; deploy/public URL per runbook.

O-4 (Review explorer logs, weekly): Run weekly or after incidents. From a host with SSH to the Blockscout node: ssh root@192.168.11.XX "journalctl -u blockscout -n 200 --no-pager" (replace with actual Proxmox/container host for VMID 5000), or from Proxmox host: pct exec 5000 -- journalctl -u blockscout -n 200 --no-pager. Check for indexer errors, DB connection issues, OOM.

O-5 (Update token list, as needed): Edit token-lists/lists/dbis-138.tokenlist.json; bump version.major|minor|patch and timestamp; run validation (see TOKEN_LIST_AUTHORING_GUIDE); update any public URL (e.g. tokens.d-bis.org) and explorer/config API token list reference.
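
The O-5 version bump can be sketched with jq (assumed installed); the helper name is ours, and schema validation per the guide still follows:

```shell
# Sketch: bump the token list patch version and refresh the timestamp.
bump_tokenlist() {
  local f=token-lists/lists/dbis-138.tokenlist.json
  jq '.version.patch += 1 | .timestamp = (now | todate)' "$f" > "$f.tmp" \
    && mv "$f.tmp" "$f"
}
# Follow with schema validation and a diff review before deploying,
# per TOKEN_LIST_AUTHORING_GUIDE.md.
```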

Script: scripts/maintenance/daily-weekly-checks.sh [daily|weekly|all] — daily: explorer, RPC, indexer lag, in-CT disk (138b); weekly: config API, thin pool all hosts (138a), fstrim (138c), journal vacuum (138d). Cron: schedule-daily-weekly-cron.sh --install (daily 08:00, weekly Sun 09:00). Storage: schedule-storage-growth-cron.sh --install (collect every 6h, prune snapshots+history Sun 08:00); schedule-storage-monitor-cron.sh --install (host alerts daily 07:00). See 04-configuration/STORAGE_GROWTH_AND_HEALTH.md.

When decommissioning or changing RPC nodes

Explorer (VMID 5000) depends on RPC at ETHEREUM_JSONRPC_HTTP_URL (use RPC_URL_138_PUBLIC = VMID 2201, 192.168.11.221:8545). When you decommission or change the IP of an RPC node that Blockscout might use:

  1. Check Blockscout env on VM 5000:
    pct exec 5000 -- bash -c 'grep -E "ETHEREUM_JSONRPC|RPC" /opt/blockscout/.env 2>/dev/null || docker inspect blockscout 2>/dev/null | grep -A5 Env' (run from root@r630-02, 192.168.11.12).
  2. If it points to the affected node, update to a live RPC (set to $RPC_URL_138_PUBLIC or http://192.168.11.221:8545) in Blockscout env and restart Blockscout.
  3. Update any script defaults and config/ip-addresses.conf / docs that reference the old RPC.

See BLOCKSCOUT_FIX_RUNBOOK.md § "Proactive: When changing RPC or decommissioning nodes" and SOLACESCANSCOUT_DEEP_DIVE_FIXES_AND_TIMING.md.

After NPMplus or DNS changes

Run E2E routing (includes explorer.d-bis.org):
bash scripts/verify/verify-end-to-end-routing.sh --profile=public

After frontend or Blockscout deploy

From a host on LAN that can reach 192.168.11.140, run full explorer E2E:
bash explorer-monorepo/scripts/e2e-test-explorer.sh

Before/after Blockscout version or config change

Run migrations (SSL-disabled DB URL):
bash scripts/fix-blockscout-ssl-and-migrations.sh (on Proxmox host r630-02 or via SSH).
See BLOCKSCOUT_FIX_RUNBOOK.md.


Security Operations

Key Management

  • SECRETS_KEYS_CONFIGURATION.md - Secrets and keys management
  • Validator Key Rotation - Rotate validator keys
  • API Token Rotation - Rotate API tokens

Access Control (Phase 2 — Security)

  • SSH key-based auth; disable password auth: On each Proxmox host and key VMs: sudo sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config; sudo systemctl reload sshd. Ensure SSH keys are deployed first. See IMPLEMENTATION_CHECKLIST.md. Scripts: scripts/security/setup-ssh-key-auth.sh [--dry-run|--apply].
  • Firewall: restrict Proxmox API (port 8006): Allow only admin IPs. Example (iptables): iptables -A INPUT -p tcp --dport 8006 -s <ADMIN_CIDR> -j ACCEPT; iptables -A INPUT -p tcp --dport 8006 -j DROP. Or use Proxmox firewall / UDM Pro rules. Script: scripts/security/firewall-proxmox-8006.sh [--dry-run|--apply] [CIDR]. Document in NETWORK_ARCHITECTURE.md.
  • Secure validator keys (W1-19): On Proxmox host as root: scripts/secure-validator-keys.sh [--dry-run] — chmod 600/700, chown besu:besu on VMIDs 1000-1004.
  • Cloudflare Access - Manage Cloudflare Access policies

Troubleshooting

Common Issues

Diagnostic Procedures

  1. Check Service Status

    systemctl status besu-validator
    
  2. Check Logs

    journalctl -u besu-validator -f
    
  3. Check Network Connectivity

    ping <node-ip>
    
  4. Check Node Health

    ./scripts/health/check-node-health.sh <vmid>
    

Emergency Procedures

Emergency Access

Break-glass Access:

  1. Use emergency SSH endpoint (if configured)
  2. Access via Cloudflare Access (if available)
  3. Physical console access (last resort)

Emergency Contacts:

  • Infrastructure Team: [contact info]
  • On-call Engineer: [contact info]

Service Recovery

Priority Order:

  1. Validators (critical for consensus)
  2. RPC nodes (critical for access)
  3. Monitoring (important for visibility)
  4. Other services

Recovery Steps:

  1. Identify failed service
  2. Check service logs
  3. Restart service
  4. If restart fails, restore from backup
  5. Verify service is operational

Network Recovery

Network Issues:

  1. Check ER605 router status
  2. Check switch status
  3. Check VLAN configuration
  4. Check firewall rules
  5. Test connectivity

VLAN Issues:

  1. Verify VLAN configuration on switches
  2. Verify VLAN configuration on ER605
  3. Verify Proxmox bridge configuration
  4. Test inter-VLAN routing

Maintenance Windows

Scheduled Maintenance

  • Weekly: Health checks, log review
  • Monthly: Security updates, configuration review
  • Quarterly: Full system review, backup testing

Maintenance Procedures

  1. Notify Stakeholders - Send maintenance notification
  2. Create Snapshots - Snapshot all containers before changes
  3. Perform Maintenance - Execute maintenance tasks
  4. Verify Services - Verify all services are operational
  5. Document Changes - Document all changes made

Maintenance procedures (Ongoing)

  • Monitor explorer sync, O-1 (Daily 08:00): cron schedule-daily-weekly-cron.sh; script daily-weekly-checks.sh daily
  • Monitor RPC 2201, O-2 (Daily 08:00): same cron/script
  • Config API uptime, O-3 (Weekly, Sun 09:00): daily-weekly-checks.sh weekly
  • Review explorer logs, O-4 (Weekly): runbook item 138 above; pct exec 5000 -- journalctl -u blockscout -n 200 or SSH to the Blockscout host
  • Update token list, O-5 (As needed): runbook item 139 above; token-lists/lists/dbis-138.tokenlist.json; TOKEN_LIST_AUTHORING_GUIDE.md
  • NPMplus backup (when NPMplus is up): scripts/verify/backup-npmplus.sh
  • Validator key/config backup (per backup policy): W1-8; BACKUP_AND_RESTORE.md
  • Start firefly-ali-1, VMID 6201 (optional, when needed): scripts/maintenance/start-firefly-6201.sh (r630-02)

Troubleshooting

Architecture & Design

Configuration

Deployment

Monitoring

Reference


Document Status: Active
Maintained By: Infrastructure Team
Review Cycle: Monthly
Last Updated: 2026-02-18