Operational Runbooks - Master Index
Last Updated: 2025-01-20
Document Version: 1.0
Overview
This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment.
Quick Reference
Emergency Procedures
- Emergency Access - Break-glass access procedures
- Service Recovery - Recovering failed services
- Network Recovery - Network connectivity issues
Common Operations
- Adding a Validator - Add new validator node
- Removing a Validator - Remove validator node
- Upgrading Besu - Besu version upgrade
- Key Rotation - Validator key rotation
Network Operations
ER605 Router Configuration
- ER605_ROUTER_CONFIGURATION.md - Complete router configuration guide
- VLAN Configuration - Setting up VLANs on ER605
- NAT Pool Configuration - Configuring role-based egress NAT
- Failover Configuration - Setting up WAN failover
VLAN Management
- VLAN Migration - Migrating from flat LAN to VLANs
- VLAN Troubleshooting - Common VLAN issues and solutions
- Inter-VLAN Routing - Configuring routing between VLANs
Cloudflare Zero Trust
- CLOUDFLARE_ZERO_TRUST_GUIDE.md - Complete Cloudflare setup
- Tunnel Management - Managing cloudflared tunnels
- Application Publishing - Publishing applications via Cloudflare Access
- Access Policy Management - Managing access policies
Besu Operations
Node Management
Adding a Validator
Prerequisites:
- Validator key generated
- VMID allocated (1000-1499 range)
- VLAN 110 configured (if migrated)
Steps:
- Create LXC container with VMID
- Install Besu
- Configure validator key
- Add to static-nodes.json on all nodes
- Update allowlist (if using permissioning)
- Start Besu service
- Verify validator is participating
See: VALIDATED_SET_DEPLOYMENT_GUIDE.md
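Before creating the container, it can help to confirm that the chosen VMID falls inside the validator range listed in the prerequisites. The following is a minimal sketch; the function name is_validator_vmid is hypothetical and is not part of any script referenced in this index.

```shell
#!/bin/sh
# Hypothetical helper (not an existing project script): succeeds only if the
# VMID is inside the validator range 1000-1499 given in the prerequisites.
is_validator_vmid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1000 ] && [ "$1" -le 1499 ]
}

# Example: report on a candidate VMID before running pct create.
if is_validator_vmid 1004; then
    echo "1004: valid validator VMID"
else
    echo "1004: outside validator range"
fi
```

A check like this could run at the top of a provisioning script so that a mistyped VMID is caught before any container is created.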
Removing a Validator
Prerequisites:
- Validator is not critical (check quorum requirements)
- Back up validator key
Steps:
- Stop Besu service
- Remove from static-nodes.json on all nodes
- Update allowlist (if using permissioning)
- Remove container (optional)
- Document removal
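The quorum prerequisite above can be estimated with the general Byzantine fault tolerance bound: n validators tolerate floor((n - 1) / 3) faulty validators, so at least 4 validators are needed to tolerate one fault. The sketch below applies that bound; the function names are hypothetical, and the exact quorum rules should be confirmed against the Besu QBFT documentation for the version in use.

```shell
#!/bin/sh
# General BFT bound: n validators tolerate floor((n - 1) / 3) faulty ones.
tolerated_faults() {
    echo $(( ($1 - 1) / 3 ))
}

# Hypothetical pre-removal check: print fault tolerance before and after
# removing one validator. Fails if fewer than 4 validators would remain,
# because such a network cannot tolerate any faulty validator.
removal_impact() {
    n_before=$1
    n_after=$(( n_before - 1 ))
    echo "before: $n_before validators, tolerates $(tolerated_faults "$n_before") faulty"
    echo "after:  $n_after validators, tolerates $(tolerated_faults "$n_after") faulty"
    [ "$n_after" -ge 4 ]
}

# Example: a 5-validator network losing one validator.
removal_impact 5 || echo "warning: removal leaves no fault tolerance"
```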
Upgrading Besu
Prerequisites:
- Back up current configuration
- Test upgrade in dev environment
- Create snapshot before upgrade
Steps:
- Create snapshot: pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)
- Stop Besu service
- Back up configuration and keys
- Install new Besu version
- Update configuration if needed
- Start Besu service
- Verify node is syncing
- Monitor for issues
Rollback:
- If issues occur, roll back to the pre-upgrade snapshot: pct rollback <vmid> pre-upgrade-YYYYMMDD
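Because the rollback must name exactly the snapshot that was created, it may be convenient to build the snapshot name in one place. The sketch below only prints the pct commands so they can be reviewed before being run on the Proxmox host; the helper names snapshot_name and print_upgrade_commands are hypothetical.

```shell
#!/bin/sh
# Build the snapshot name used in the steps above: pre-upgrade-YYYYMMDD.
# An optional argument overrides today's date (needed when rolling back to a
# snapshot taken on an earlier day).
snapshot_name() {
    echo "pre-upgrade-${1:-$(date +%Y%m%d)}"
}

# Print, rather than execute, the matching snapshot and rollback commands.
print_upgrade_commands() {
    vmid=$1
    snap=$(snapshot_name "$2")
    echo "pct snapshot $vmid $snap"
    echo "pct rollback $vmid $snap"
}

# Example: commands for VMID 1000 with an explicit date.
print_upgrade_commands 1000 20250120
```

Recording the printed snapshot name in the change log makes it easier to roll back later without guessing the date.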
Allowlist Management
- BESU_ALLOWLIST_RUNBOOK.md - Complete allowlist guide
- BESU_ALLOWLIST_QUICK_START.md - Quick start for allowlist issues
Common Operations:
- Generate allowlist from nodekeys
- Update allowlist on all nodes
- Verify allowlist is correct
- Troubleshoot allowlist issues
Consensus Troubleshooting
- QBFT_TROUBLESHOOTING.md - QBFT consensus troubleshooting
- Block Production Issues - Troubleshooting block production
- Validator Recognition - Validator not being recognized
CCIP Operations
CCIP Deployment
- CCIP_DEPLOYMENT_SPEC.md - Complete CCIP deployment specification
- ORCHESTRATION_DEPLOYMENT_GUIDE.md - Deployment orchestration
Deployment Phases:
- Deploy Ops/Admin nodes (5400-5401)
- Deploy Monitoring nodes (5402-5403)
- Deploy Commit nodes (5410-5425)
- Deploy Execute nodes (5440-5455)
- Deploy RMN nodes (5470-5476)
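The VMID ranges in the deployment phases above can be encoded as a lookup, which is useful when a script needs to decide how to treat a container based on its VMID. This is a sketch only; the function name and the role labels it prints are hypothetical, while the ranges are taken from the list above.

```shell
#!/bin/sh
# Map a VMID to its CCIP role using the ranges from the deployment phases.
# Prints "unknown" and returns non-zero for anything outside those ranges.
ccip_role_for_vmid() {
    case "$1" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;   # not a number
    esac
    if   [ "$1" -ge 5400 ] && [ "$1" -le 5401 ]; then echo "ops-admin"
    elif [ "$1" -ge 5402 ] && [ "$1" -le 5403 ]; then echo "monitoring"
    elif [ "$1" -ge 5410 ] && [ "$1" -le 5425 ]; then echo "commit"
    elif [ "$1" -ge 5440 ] && [ "$1" -le 5455 ]; then echo "execute"
    elif [ "$1" -ge 5470 ] && [ "$1" -le 5476 ]; then echo "rmn"
    else echo "unknown"; return 1
    fi
}

# Example: classify a few VMIDs.
for id in 5400 5412 5476; do
    echo "$id -> $(ccip_role_for_vmid "$id")"
done
```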
CCIP Node Management
- Adding CCIP Node - Add new CCIP node to fleet
- Removing CCIP Node - Remove CCIP node from fleet
- CCIP Node Troubleshooting - Common CCIP issues
Monitoring & Observability
Monitoring Setup
- MONITORING_SUMMARY.md - Monitoring setup
- BLOCK_PRODUCTION_MONITORING.md - Block production monitoring
Components:
- Prometheus metrics collection
- Grafana dashboards
- Loki log aggregation
- Alertmanager alerting
Health Checks
- Node Health Checks - Check individual node health
- Service Health Checks - Check service status
- Network Health Checks - Check network connectivity
Scripts:
- check-node-health.sh - Node health check script
- check-service-status.sh - Service status check
Backup & Recovery
Backup Procedures
- Configuration Backup - Back up all configuration files
- Validator Key Backup - Encrypted backup of validator keys
- Container Backup - Back up container configurations
Automated Backups:
- Scheduled daily backups
- Encrypted storage
- Multiple locations
- 30-day retention
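The 30-day retention rule could be enforced with find. The sketch below only lists expired files so the result can be reviewed before anything is deleted. The function name, the backup directory, and the .tar.gz.enc file suffix are all assumptions; this index does not specify the actual backup layout.

```shell
#!/bin/sh
# List backup files past the retention window (default: 30 days).
# Note: find -mtime +N matches files whose age in whole days exceeds N,
# so with N=30 a file must be at least 31 days old to be listed.
# The *.tar.gz.enc suffix is an assumed naming convention.
list_expired_backups() {
    backup_dir=$1
    retention_days=${2:-30}
    find "$backup_dir" -type f -name '*.tar.gz.enc' -mtime +"$retention_days"
}
```

A call such as list_expired_backups /path/to/backups 30 prints one path per line; once the list has been reviewed, the same find expression can be given -delete to prune the files.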
Disaster Recovery
- Service Recovery - Recover failed services
- Network Recovery - Recover network connectivity
- Full System Recovery - Complete system recovery
Recovery Procedures:
- Identify failure point
- Restore from backup
- Verify service status
- Monitor for issues
Security Operations
Key Management
- SECRETS_KEYS_CONFIGURATION.md - Secrets and keys management
- Validator Key Rotation - Rotate validator keys
- API Token Rotation - Rotate API tokens
Access Control
- SSH Key Management - Manage SSH keys
- Cloudflare Access - Manage Cloudflare Access policies
- Firewall Rules - Manage firewall rules
Troubleshooting
Common Issues
- TROUBLESHOOTING_FAQ.md - Common issues and solutions
- QBFT_TROUBLESHOOTING.md - QBFT troubleshooting
- BESU_ALLOWLIST_QUICK_START.md - Allowlist troubleshooting
Diagnostic Procedures
- Check Service Status: systemctl status besu-validator
- Check Logs: journalctl -u besu-validator -f
- Check Network Connectivity: ping <node-ip>
- Check Node Health: ./scripts/health/check-node-health.sh <vmid>
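The four checks above can be chained into a single pass. The sketch below defaults to a dry run that prints each command instead of executing it; setting DRY_RUN=0 runs them for real. It uses bounded variants of the commands (the last 100 log lines instead of following, three pings instead of continuous) so the pass terminates on its own. The wrapper names run and diagnose_node are hypothetical, and 192.0.2.10 is a documentation placeholder address.

```shell
#!/bin/sh
# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run the four diagnostic checks in order for one node.
diagnose_node() {
    node_ip=$1
    vmid=$2
    run systemctl status besu-validator --no-pager
    run journalctl -u besu-validator -n 100 --no-pager
    run ping -c 3 "$node_ip"
    run ./scripts/health/check-node-health.sh "$vmid"
}

# Example (dry run): show what would be executed for VMID 1000.
diagnose_node 192.0.2.10 1000
```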
Emergency Procedures
Emergency Access
Break-glass Access:
- Use emergency SSH endpoint (if configured)
- Access via Cloudflare Access (if available)
- Physical console access (last resort)
Emergency Contacts:
- Infrastructure Team: [contact info]
- On-call Engineer: [contact info]
Service Recovery
Priority Order:
- Validators (critical for consensus)
- RPC nodes (critical for access)
- Monitoring (important for visibility)
- Other services
Recovery Steps:
- Identify failed service
- Check service logs
- Restart service
- If restart fails, restore from backup
- Verify service is operational
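The restart, verify, and escalate portion of the recovery steps might look like the following. This is a sketch under assumptions: the function name recover_service is hypothetical, and the SYSTEMCTL variable exists only so the logic can be exercised without touching a real service. The restore-from-backup step is left as a manual action because this index does not define it.

```shell
#!/bin/sh
# SYSTEMCTL can be overridden for testing; it defaults to the real binary.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Restart a service and confirm it is active. On failure, report that the
# next step is a restore from backup and return non-zero.
recover_service() {
    svc=$1
    if "$SYSTEMCTL" restart "$svc" && "$SYSTEMCTL" is-active --quiet "$svc"; then
        echo "$svc: operational"
        return 0
    fi
    echo "$svc: restart failed, restore from backup" >&2
    return 1
}
```

Calling recover_service besu-validator first, and other services afterwards, follows the priority order given above.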
Network Recovery
Network Issues:
- Check ER605 router status
- Check switch status
- Check VLAN configuration
- Check firewall rules
- Test connectivity
VLAN Issues:
- Verify VLAN configuration on switches
- Verify VLAN configuration on ER605
- Verify Proxmox bridge configuration
- Test inter-VLAN routing
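When testing connectivity, probing the path hop by hop (router, then switch, then node) narrows down where the failure sits. The sketch below reports the first target that does not answer a ping. The function name first_unreachable and the PING override are hypothetical, the example addresses are placeholders, and the -W timeout flag assumes the Linux iputils version of ping.

```shell
#!/bin/sh
# PING can be overridden for testing; it defaults to the real binary.
PING=${PING:-ping}

# Ping each target in the order given. Print the first one that fails and
# return non-zero; return zero silently if every target responds.
first_unreachable() {
    for target in "$@"; do
        if ! "$PING" -c 1 -W 2 "$target" >/dev/null 2>&1; then
            echo "$target"
            return 1
        fi
    done
    return 0
}
```

For example, first_unreachable 192.0.2.1 192.0.2.2 192.0.2.10 (placeholder addresses for router, switch, and node) prints the first hop that is down, which indicates which of the checks above to start with.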
Maintenance Windows
Scheduled Maintenance
- Weekly: Health checks, log review
- Monthly: Security updates, configuration review
- Quarterly: Full system review, backup testing
Maintenance Procedures
- Notify Stakeholders - Send maintenance notification
- Create Snapshots - Snapshot all containers before changes
- Perform Maintenance - Execute maintenance tasks
- Verify Services - Verify all services are operational
- Document Changes - Document all changes made
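The Create Snapshots step can be scripted by turning the output of pct list into one snapshot command per container. The sketch below reads pct list output on standard input and prints the commands for review instead of running them. The function name snapshot_commands and the pre-maintenance-YYYYMMDD snapshot name are assumptions, not an established convention in this deployment.

```shell
#!/bin/sh
# Read `pct list` output on stdin (header row, then one row per container
# with the VMID in the first column) and print a snapshot command for each.
snapshot_commands() {
    snap="pre-maintenance-$(date +%Y%m%d)"   # assumed naming scheme
    awk -v snap="$snap" \
        'NR > 1 && $1 ~ /^[0-9]+$/ { print "pct snapshot " $1 " " snap }'
}
```

On the Proxmox host, pct list | snapshot_commands shows the commands; after review, piping that output to sh executes them.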
Related Documentation
Troubleshooting
- TROUBLESHOOTING_FAQ.md - Common issues and solutions - Start here for problems
- QBFT_TROUBLESHOOTING.md - QBFT consensus troubleshooting
- BESU_ALLOWLIST_QUICK_START.md - Allowlist troubleshooting
Architecture & Design
- NETWORK_ARCHITECTURE.md - Network architecture
- ORCHESTRATION_DEPLOYMENT_GUIDE.md - Deployment guide
- VMID_ALLOCATION_FINAL.md - VMID allocation
Configuration
- ER605_ROUTER_CONFIGURATION.md - Router configuration
- CLOUDFLARE_ZERO_TRUST_GUIDE.md - Cloudflare setup
- SECRETS_KEYS_CONFIGURATION.md - Secrets management
Deployment
- VALIDATED_SET_DEPLOYMENT_GUIDE.md - Validated set deployment
- CCIP_DEPLOYMENT_SPEC.md - CCIP deployment
- DEPLOYMENT_READINESS.md - Deployment readiness
- DEPLOYMENT_STATUS_CONSOLIDATED.md - Current deployment status
Monitoring
- MONITORING_SUMMARY.md - Monitoring setup
- BLOCK_PRODUCTION_MONITORING.md - Block production monitoring
Reference
- MASTER_INDEX.md - Complete documentation index
Document Status: Active
Maintained By: Infrastructure Team
Review Cycle: Monthly
Last Updated: 2025-01-20