# Operational Runbooks - Master Index

**Last Updated:** 2025-01-20
**Document Version:** 1.0

---

## Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment.

---

## Quick Reference

### Emergency Procedures

- **[Emergency Access](#emergency-access)** - Break-glass access procedures
- **[Service Recovery](#service-recovery)** - Recovering failed services
- **[Network Recovery](#network-recovery)** - Network connectivity issues

### Common Operations

- **[Adding a Validator](#adding-a-validator)** - Add new validator node
- **[Removing a Validator](#removing-a-validator)** - Remove validator node
- **[Upgrading Besu](#upgrading-besu)** - Besu version upgrade
- **[Key Rotation](#key-rotation)** - Validator key rotation

---

## Network Operations

### ER605 Router Configuration

- **[ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Complete router configuration guide
- **VLAN Configuration** - Setting up VLANs on ER605
- **NAT Pool Configuration** - Configuring role-based egress NAT
- **Failover Configuration** - Setting up WAN failover

### VLAN Management

- **VLAN Migration** - Migrating from flat LAN to VLANs
- **VLAN Troubleshooting** - Common VLAN issues and solutions
- **Inter-VLAN Routing** - Configuring routing between VLANs

### Cloudflare Zero Trust

- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Complete Cloudflare setup
- **Tunnel Management** - Managing cloudflared tunnels
- **Application Publishing** - Publishing applications via Cloudflare Access
- **Access Policy Management** - Managing access policies

---

## Besu Operations

### Node Management

#### Adding a Validator

**Prerequisites:**

- Validator key generated
- VMID allocated (1000-1499 range)
- VLAN 110 configured (if migrated)

**Steps:**

1. Create LXC container with the allocated VMID
2. Install Besu
3. Configure validator key
4. Add to static-nodes.json on all nodes
5. Update allowlist (if using permissioning)
6. Start Besu service
7. Verify validator is participating

**See:** [VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)

#### Removing a Validator

**Prerequisites:**

- Validator is not critical (check quorum requirements)
- Back up validator key

**Steps:**

1. Stop Besu service
2. Remove from static-nodes.json on all nodes
3. Update allowlist (if using permissioning)
4. Remove container (optional)
5. Document removal

#### Upgrading Besu

**Prerequisites:**

- Back up current configuration
- Test upgrade in dev environment
- Create snapshot before upgrade

**Steps:**

1. Create snapshot: `pct snapshot <VMID> pre-upgrade-$(date +%Y%m%d)`
2. Stop Besu service
3. Back up configuration and keys
4. Install new Besu version
5. Update configuration if needed
6. Start Besu service
7. Verify node is syncing
8. Monitor for issues

**Rollback:**

- If issues occur: `pct rollback <VMID> pre-upgrade-YYYYMMDD`

### Allowlist Management

- **[BESU_ALLOWLIST_RUNBOOK.md](BESU_ALLOWLIST_RUNBOOK.md)** - Complete allowlist guide
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Quick start for allowlist issues

**Common Operations:**

- Generate allowlist from nodekeys
- Update allowlist on all nodes
- Verify allowlist is correct
- Troubleshoot allowlist issues

### Consensus Troubleshooting

- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **Block Production Issues** - Troubleshooting block production
- **Validator Recognition** - Validator not being recognized

---

## CCIP Operations

### CCIP Deployment

- **[CCIP_DEPLOYMENT_SPEC.md](CCIP_DEPLOYMENT_SPEC.md)** - Complete CCIP deployment specification
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment orchestration

**Deployment Phases:**

1. Deploy Ops/Admin nodes (5400-5401)
2. Deploy Monitoring nodes (5402-5403)
3. Deploy Commit nodes (5410-5425)
4. Deploy Execute nodes (5440-5455)
5. Deploy RMN nodes (5470-5476)

### CCIP Node Management

- **Adding CCIP Node** - Add new CCIP node to fleet
- **Removing CCIP Node** - Remove CCIP node from fleet
- **CCIP Node Troubleshooting** - Common CCIP issues

---

## Monitoring & Observability

### Monitoring Setup

- **[MONITORING_SUMMARY.md](MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

**Components:**

- Prometheus metrics collection
- Grafana dashboards
- Loki log aggregation
- Alertmanager alerting

### Health Checks

- **Node Health Checks** - Check individual node health
- **Service Health Checks** - Check service status
- **Network Health Checks** - Check network connectivity

**Scripts:**

- `check-node-health.sh` - Node health check script
- `check-service-status.sh` - Service status check

---

## Backup & Recovery

### Backup Procedures

- **Configuration Backup** - Back up all configuration files
- **Validator Key Backup** - Encrypted backup of validator keys
- **Container Backup** - Back up container configurations

**Automated Backups:**

- Scheduled daily backups
- Encrypted storage
- Multiple locations
- 30-day retention

### Disaster Recovery

- **Service Recovery** - Recover failed services
- **Network Recovery** - Recover network connectivity
- **Full System Recovery** - Complete system recovery

**Recovery Procedures:**

1. Identify failure point
2. Restore from backup
3. Verify service status
4. Monitor for issues

---

## Security Operations

### Key Management

- **[SECRETS_KEYS_CONFIGURATION.md](/docs/04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets and keys management
- **Validator Key Rotation** - Rotate validator keys
- **API Token Rotation** - Rotate API tokens

### Access Control

- **SSH Key Management** - Manage SSH keys
- **Cloudflare Access** - Manage Cloudflare Access policies
- **Firewall Rules** - Manage firewall rules

---

## Troubleshooting

### Common Issues

- **[TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Diagnostic Procedures

1. **Check Service Status**
   ```bash
   systemctl status besu-validator
   ```

2. **Check Logs**
   ```bash
   journalctl -u besu-validator -f
   ```

3. **Check Network Connectivity**
   ```bash
   ping <target-ip>
   ```

4. **Check Node Health**
   ```bash
   ./scripts/health/check-node-health.sh
   ```

---

## Emergency Procedures

### Emergency Access

**Break-glass Access:**

1. Use emergency SSH endpoint (if configured)
2. Access via Cloudflare Access (if available)
3. Physical console access (last resort)

**Emergency Contacts:**

- Infrastructure Team: [contact info]
- On-call Engineer: [contact info]

### Service Recovery

**Priority Order:**

1. Validators (critical for consensus)
2. RPC nodes (critical for access)
3. Monitoring (important for visibility)
4. Other services

**Recovery Steps:**

1. Identify failed service
2. Check service logs
3. Restart service
4. If restart fails, restore from backup
5. Verify service is operational

### Network Recovery

**Network Issues:**

1. Check ER605 router status
2. Check switch status
3. Check VLAN configuration
4. Check firewall rules
5. Test connectivity

**VLAN Issues:**

1. Verify VLAN configuration on switches
2. Verify VLAN configuration on ER605
3. Verify Proxmox bridge configuration
4. Test inter-VLAN routing

---

## Maintenance Windows

### Scheduled Maintenance

- **Weekly:** Health checks, log review
- **Monthly:** Security updates, configuration review
- **Quarterly:** Full system review, backup testing

### Maintenance Procedures

1. **Notify Stakeholders** - Send maintenance notification
2. **Create Snapshots** - Snapshot all containers before changes
3. **Perform Maintenance** - Execute maintenance tasks
4. **Verify Services** - Verify all services are operational
5. **Document Changes** - Document all changes made

---

## Related Documentation

### Troubleshooting

- **[TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions - **Start here for problems**
- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Architecture & Design

- **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md)** - Network architecture
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment guide
- **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)** - VMID allocation

### Configuration

- **[ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup
- **[SECRETS_KEYS_CONFIGURATION.md](/docs/04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets management

### Deployment

- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment
- **[DEPLOYMENT_READINESS.md](DEPLOYMENT_READINESS.md)** - Deployment readiness
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status

### Monitoring

- **[MONITORING_SUMMARY.md](MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

### Reference

- **[MASTER_INDEX.md](MASTER_INDEX.md)** - Complete documentation index

---

**Document Status:** Active
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2025-01-20
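The snapshot, stop, and rollback commands listed under "Upgrading Besu" can be wrapped in a small helper so the snapshot name is built the same way every time. The following is a minimal sketch, not the deployment's actual tooling: the function names (`run`, `snapshot_name`, `pre_upgrade`, `rollback`) and the `DRY_RUN` variable are hypothetical, and it assumes Besu runs inside the container as the `besu-validator` systemd unit shown under "Diagnostic Procedures". The container VMID is passed as an argument because `pct snapshot` and `pct rollback` both require it.

```bash
#!/usr/bin/env bash
# Sketch of the snapshot -> stop -> rollback flow from "Upgrading Besu".
# Assumption: Besu runs inside the container as the besu-validator unit.
# DRY_RUN, run, snapshot_name, pre_upgrade and rollback are hypothetical names.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build the runbook's snapshot name: pre-upgrade-YYYYMMDD.
# An explicit date may be passed; otherwise today's date is used.
snapshot_name() {
  printf 'pre-upgrade-%s\n' "${1:-$(date +%Y%m%d)}"
}

# Steps 1-2: snapshot the container, then stop Besu inside it.
pre_upgrade() {
  local vmid="$1" snap
  snap="$(snapshot_name "${2:-}")"
  run pct snapshot "$vmid" "$snap"
  run pct exec "$vmid" -- systemctl stop besu-validator
}

# Rollback: return the container to the pre-upgrade snapshot.
rollback() {
  local vmid="$1" snap
  snap="$(snapshot_name "${2:-}")"
  run pct rollback "$vmid" "$snap"
}
```

With the default `DRY_RUN=1`, calling `pre_upgrade 1000` only prints the two `pct` commands for review; setting `DRY_RUN=0` on a Proxmox host would execute them. A rollback performed on a later day needs the original date passed explicitly, for example `rollback 1000 20250120`, since the snapshot name embeds the day it was taken.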