# Operational Runbooks - Master Index

**Last Updated:** 2025-01-20
**Document Version:** 1.0

---

## Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment.

---

## Quick Reference

### Emergency Procedures

- **[Emergency Access](#emergency-access)** - Break-glass access procedures
- **[Service Recovery](#service-recovery)** - Recovering failed services
- **[Network Recovery](#network-recovery)** - Network connectivity issues

### Common Operations

- **[Adding a Validator](#adding-a-validator)** - Add new validator node
- **[Removing a Validator](#removing-a-validator)** - Remove validator node
- **[Upgrading Besu](#upgrading-besu)** - Besu version upgrade
- **[Key Rotation](#key-management)** - Validator key rotation

---

## Network Operations

### ER605 Router Configuration

- **[ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Complete router configuration guide
- **VLAN Configuration** - Setting up VLANs on ER605
- **NAT Pool Configuration** - Configuring role-based egress NAT
- **Failover Configuration** - Setting up WAN failover

### VLAN Management

- **VLAN Migration** - Migrating from flat LAN to VLANs
- **VLAN Troubleshooting** - Common VLAN issues and solutions
- **Inter-VLAN Routing** - Configuring routing between VLANs

### Cloudflare Zero Trust

- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Complete Cloudflare setup
- **Tunnel Management** - Managing cloudflared tunnels
- **Application Publishing** - Publishing applications via Cloudflare Access
- **Access Policy Management** - Managing access policies

---

## Besu Operations

### Node Management

#### Adding a Validator

**Prerequisites:**
- Validator key generated
- VMID allocated (1000-1499 range)
- VLAN 110 configured (if migrated)

**Steps:**
1. Create LXC container with VMID
2. Install Besu
3. Configure validator key
4. Add to `static-nodes.json` on all nodes
5. Update allowlist (if using permissioning)
6. Start Besu service
7. Verify validator is participating

**See:** [VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)

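
The VMID prerequisite and the final verification step can be checked with a couple of small helpers. This is a minimal sketch, not the project's tooling: the function names are hypothetical, and it assumes the node exposes Besu's HTTP JSON-RPC endpoint with the QBFT API enabled, so that `qbft_getValidatorsByBlockNumber` is available.

```bash
# Sketch only: function names are hypothetical, not part of this repository.

# Succeed if the VMID is numeric and inside the validator range (1000-1499).
is_validator_vmid() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1000 ] && [ "$1" -le 1499 ]
}

# Print the JSON-RPC request body that lists the current QBFT validators.
validators_request() {
  printf '%s' '{"jsonrpc":"2.0","method":"qbft_getValidatorsByBlockNumber","params":["latest"],"id":1}'
}

# Succeed if the validator address ($2) appears in the validator set
# reported by the node at RPC URL ($1). Requires curl.
validator_is_listed() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "$(validators_request)" "$1" | grep -qi "${2#0x}"
}
```

For example, `validator_is_listed http://<node-ip>:8545 <validator-address>` would check one node; 8545 is Besu's default HTTP RPC port, and this deployment may use a different one.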
#### Removing a Validator

**Prerequisites:**
- Validator is not critical (check quorum requirements)
- Backup validator key

**Steps:**
1. Stop Besu service
2. Remove from `static-nodes.json` on all nodes
3. Update allowlist (if using permissioning)
4. Remove container (optional)
5. Document removal

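
The quorum check in the prerequisites can be made concrete with a little arithmetic. A QBFT validator set of size n tolerates up to floor((n - 1) / 3) faulty validators and needs at least four validators to be Byzantine fault tolerant. The sketch below assumes the current validator count is already known; both function names are hypothetical.

```bash
# Sketch only: function names are hypothetical.

# Print how many faulty validators a set of size $1 can tolerate.
max_faulty() {
  echo $(( ($1 - 1) / 3 ))
}

# Succeed if removing one validator from a set of size $1
# still leaves at least four validators.
safe_to_remove() {
  [ $(( $1 - 1 )) -ge 4 ]
}
```

With five validators, `safe_to_remove 5` succeeds and the remaining four still tolerate one fault; with four validators it fails, because the remaining three would tolerate none.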
#### Upgrading Besu

**Prerequisites:**
- Backup current configuration
- Test upgrade in dev environment
- Create snapshot before upgrade

**Steps:**
1. Create snapshot: `pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)`
2. Stop Besu service
3. Backup configuration and keys
4. Install new Besu version
5. Update configuration if needed
6. Start Besu service
7. Verify node is syncing
8. Monitor for issues

**Rollback:**
- If issues occur: `pct rollback <vmid> pre-upgrade-YYYYMMDD`, where `YYYYMMDD` is the date of the snapshot created in step 1 (`pct listsnapshot <vmid>` lists the available snapshot names)

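
The snapshot, stop, and start steps can be scripted from the Proxmox host. The following is a minimal sketch under stated assumptions: the `run`, `snapshot_name`, and `upgrade_besu` helpers are hypothetical, the service name `besu-validator` is the one used in the diagnostic procedures in this document, and the install step is left as a placeholder because it depends on how Besu is packaged here. Commands are only printed unless `DRY_RUN=0` is set.

```bash
# Sketch only: helper names are hypothetical.

# Print the command instead of running it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Snapshot name in the form used in step 1, e.g. pre-upgrade-20250120.
snapshot_name() {
  echo "pre-upgrade-$(date +%Y%m%d)"
}

# Snapshot container $1, stop Besu, leave room for the install, start Besu.
upgrade_besu() {
  vmid="$1"
  snap="$(snapshot_name)"
  run pct snapshot "$vmid" "$snap"
  run pct exec "$vmid" -- systemctl stop besu-validator
  echo "# install the new Besu version in container $vmid here"
  run pct exec "$vmid" -- systemctl start besu-validator
  echo "# rollback if needed: pct rollback $vmid $snap"
}
```

Running `upgrade_besu <vmid>` first in the default dry-run mode shows the exact snapshot name that a later rollback would need.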
### Allowlist Management

- **[BESU_ALLOWLIST_RUNBOOK.md](BESU_ALLOWLIST_RUNBOOK.md)** - Complete allowlist guide
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Quick start for allowlist issues

**Common Operations:**
- Generate allowlist from nodekeys
- Update allowlist on all nodes
- Verify allowlist is correct
- Troubleshoot allowlist issues

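
Generating an allowlist entry from a nodekey comes down to building an enode URL from the node's public key. The sketch below assumes the public key has already been exported (for example with Besu's `public-key export` subcommand) and that nodes listen on Besu's default P2P port 30303; the `enode_url` helper is hypothetical, and BESU_ALLOWLIST_RUNBOOK.md remains the authoritative procedure.

```bash
# Sketch only: enode_url is a hypothetical helper.

# Build an enode URL from a public key ($1), IP address ($2), and optional
# P2P port ($3, default 30303). An optional 0x prefix on the key is removed.
enode_url() {
  pubkey="${1#0x}"
  if [ "${#pubkey}" -ne 128 ]; then
    echo "public key must be 128 hex characters" >&2
    return 1
  fi
  echo "enode://${pubkey}@${2}:${3:-30303}"
}
```

The length check catches the common mistake of passing a private key or a truncated value, since an uncompressed node public key is 64 bytes (128 hex characters).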
### Consensus Troubleshooting

- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **Block Production Issues** - Troubleshooting block production
- **Validator Recognition** - Validator not being recognized

---

## CCIP Operations

### CCIP Deployment

- **[CCIP_DEPLOYMENT_SPEC.md](CCIP_DEPLOYMENT_SPEC.md)** - Complete CCIP deployment specification
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment orchestration

**Deployment Phases:**
1. Deploy Ops/Admin nodes (5400-5401)
2. Deploy Monitoring nodes (5402-5403)
3. Deploy Commit nodes (5410-5425)
4. Deploy Execute nodes (5440-5455)
5. Deploy RMN nodes (5470-5476)

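
The VMID ranges in the deployment phases map directly to node roles, which is handy when validating an inventory before deployment. A minimal sketch, assuming only the ranges listed above; the `ccip_role` helper and the role labels it prints are hypothetical.

```bash
# Sketch only: ccip_role and its role labels are hypothetical.

# Print the CCIP role for VMID $1, based on the phase ranges above.
# Returns non-zero for non-numeric input or a VMID outside every range.
ccip_role() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if   [ "$1" -ge 5400 ] && [ "$1" -le 5401 ]; then echo "ops-admin"
  elif [ "$1" -ge 5402 ] && [ "$1" -le 5403 ]; then echo "monitoring"
  elif [ "$1" -ge 5410 ] && [ "$1" -le 5425 ]; then echo "commit"
  elif [ "$1" -ge 5440 ] && [ "$1" -le 5455 ]; then echo "execute"
  elif [ "$1" -ge 5470 ] && [ "$1" -le 5476 ]; then echo "rmn"
  else echo "unassigned"; return 1
  fi
}
```

The gaps between ranges (for example 5404-5409) are reported as `unassigned`, so a typo in a VMID is caught before a container is created.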
### CCIP Node Management

- **Adding CCIP Node** - Add new CCIP node to fleet
- **Removing CCIP Node** - Remove CCIP node from fleet
- **CCIP Node Troubleshooting** - Common CCIP issues

---

## Monitoring & Observability

### Monitoring Setup

- **[MONITORING_SUMMARY.md](MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

**Components:**
- Prometheus metrics collection
- Grafana dashboards
- Loki log aggregation
- Alertmanager alerting

### Health Checks

- **Node Health Checks** - Check individual node health
- **Service Health Checks** - Check service status
- **Network Health Checks** - Check network connectivity

**Scripts:**
- `check-node-health.sh` - Node health check script
- `check-service-status.sh` - Service status check

---

## Backup & Recovery

### Backup Procedures

- **Configuration Backup** - Backup all configuration files
- **Validator Key Backup** - Encrypted backup of validator keys
- **Container Backup** - Backup container configurations

**Automated Backups:**
- Scheduled daily backups
- Encrypted storage
- Multiple locations
- 30-day retention

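
The 30-day retention rule could be enforced with a small pruning step after each daily backup. This is a sketch under assumptions, not the deployed backup job: `prune_backups` is a hypothetical helper, backups are assumed to be plain files in a single directory, and `find -delete` requires GNU or BSD `find`.

```bash
# Sketch only: prune_backups is a hypothetical helper.

# Delete regular files under directory $1 that were last modified more
# than $2 days ago (default 30), printing each path that is removed.
prune_backups() {
  dir="$1"
  days="${2:-30}"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  find "$dir" -type f -mtime +"$days" -print -delete
}
```

Running the same `find` expression without `-delete` first gives a preview of what would be removed.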
### Disaster Recovery

- **Service Recovery** - Recover failed services
- **Network Recovery** - Recover network connectivity
- **Full System Recovery** - Complete system recovery

**Recovery Procedures:**
1. Identify failure point
2. Restore from backup
3. Verify service status
4. Monitor for issues

---

## Security Operations

### Key Management

- **[SECRETS_KEYS_CONFIGURATION.md](/docs/04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets and keys management
- **Validator Key Rotation** - Rotate validator keys
- **API Token Rotation** - Rotate API tokens

### Access Control

- **SSH Key Management** - Manage SSH keys
- **Cloudflare Access** - Manage Cloudflare Access policies
- **Firewall Rules** - Manage firewall rules

---

## Troubleshooting

### Common Issues

- **[TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions
- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Diagnostic Procedures

1. **Check Service Status**
   ```bash
   # Show whether the Besu validator unit is running and its recent log lines
   systemctl status besu-validator
   ```

2. **Check Logs**
   ```bash
   # Follow the validator's journal output (Ctrl+C to stop)
   journalctl -u besu-validator -f
   ```

3. **Check Network Connectivity**
   ```bash
   # Send four echo requests, then exit
   ping -c 4 <node-ip>
   ```

4. **Check Node Health**
   ```bash
   # Run the node health check script for one container
   ./scripts/health/check-node-health.sh <vmid>
   ```

---

## Emergency Procedures

### Emergency Access

**Break-glass Access:**
1. Use emergency SSH endpoint (if configured)
2. Access via Cloudflare Access (if available)
3. Physical console access (last resort)

**Emergency Contacts:**
- Infrastructure Team: [contact info]
- On-call Engineer: [contact info]

### Service Recovery

**Priority Order:**
1. Validators (critical for consensus)
2. RPC nodes (critical for access)
3. Monitoring (important for visibility)
4. Other services

**Recovery Steps:**
1. Identify failed service
2. Check service logs
3. Restart service
4. If restart fails, restore from backup
5. Verify service is operational

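
The recovery steps for a single systemd service can be sketched as one function. This is an illustration under assumptions: `recover_service` and the `RECOVER_WAIT` variable are hypothetical, the service is assumed to be a systemd unit on the local host, and the restore-from-backup step is only signalled through the return code, not performed.

```bash
# Sketch only: recover_service and RECOVER_WAIT are hypothetical.

# Check, restart, and verify systemd unit $1.
# Returns non-zero if the unit is still down after the restart attempt.
recover_service() {
  svc="$1"
  if systemctl is-active --quiet "$svc"; then
    echo "$svc: already active"
    return 0
  fi
  journalctl -u "$svc" -n 50 --no-pager    # step 2: recent logs
  systemctl restart "$svc"                 # step 3: restart
  sleep "${RECOVER_WAIT:-5}"               # give the unit time to start
  if systemctl is-active --quiet "$svc"; then
    echo "$svc: recovered"                 # step 5: verified
  else
    echo "$svc: restart failed, restore from backup" >&2   # step 4
    return 1
  fi
}
```

Calling it in the priority order above, for example `recover_service besu-validator` before any monitoring units, keeps consensus-critical services first.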
### Network Recovery

**Network Issues:**
1. Check ER605 router status
2. Check switch status
3. Check VLAN configuration
4. Check firewall rules
5. Test connectivity

**VLAN Issues:**
1. Verify VLAN configuration on switches
2. Verify VLAN configuration on ER605
3. Verify Proxmox bridge configuration
4. Test inter-VLAN routing

---

## Maintenance Windows

### Scheduled Maintenance

- **Weekly:** Health checks, log review
- **Monthly:** Security updates, configuration review
- **Quarterly:** Full system review, backup testing

### Maintenance Procedures

1. **Notify Stakeholders** - Send maintenance notification
2. **Create Snapshots** - Snapshot all containers before changes
3. **Perform Maintenance** - Execute maintenance tasks
4. **Verify Services** - Verify all services are operational
5. **Document Changes** - Document all changes made

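
The "Create Snapshots" step can be prepared by generating one `pct snapshot` command per container on a Proxmox host. A minimal sketch, assuming `pct list` prints a header row followed by one row per container with the VMID in the first column; the helper names and the `pre-maintenance-` snapshot prefix are hypothetical. The commands are printed for review and are not executed.

```bash
# Sketch only: helper names and the snapshot prefix are hypothetical.

# Print the VMID of every container reported by `pct list` (header skipped).
container_vmids() {
  pct list | awk 'NR > 1 { print $1 }'
}

# Print one snapshot command per container, dated like pre-maintenance-20250120.
snapshot_all() {
  snap="pre-maintenance-$(date +%Y%m%d)"
  for vmid in $(container_vmids); do
    echo "pct snapshot $vmid $snap"
  done
}
```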
---

## Related Documentation

### Troubleshooting
- **[TROUBLESHOOTING_FAQ.md](/docs/09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Common issues and solutions - **Start here for problems**
- **[QBFT_TROUBLESHOOTING.md](/docs/09-troubleshooting/QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Architecture & Design
- **[NETWORK_ARCHITECTURE.md](NETWORK_ARCHITECTURE.md)** - Network architecture
- **[ORCHESTRATION_DEPLOYMENT_GUIDE.md](ORCHESTRATION_DEPLOYMENT_GUIDE.md)** - Deployment guide
- **[VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)** - VMID allocation

### Configuration
- **[ER605_ROUTER_CONFIGURATION.md](/docs/04-configuration/ER605_ROUTER_CONFIGURATION.md)** - Router configuration
- **[CLOUDFLARE_ZERO_TRUST_GUIDE.md](CLOUDFLARE_ZERO_TRUST_GUIDE.md)** - Cloudflare setup
- **[SECRETS_KEYS_CONFIGURATION.md](/docs/04-configuration/SECRETS_KEYS_CONFIGURATION.md)** - Secrets management

### Deployment
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Validated set deployment
- **[CCIP_DEPLOYMENT_SPEC.md](CCIP_DEPLOYMENT_SPEC.md)** - CCIP deployment
- **[DEPLOYMENT_READINESS.md](DEPLOYMENT_READINESS.md)** - Deployment readiness
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status

### Monitoring
- **[MONITORING_SUMMARY.md](MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

### Reference
- **[MASTER_INDEX.md](MASTER_INDEX.md)** - Complete documentation index

---

**Document Status:** Active
**Maintained By:** Infrastructure Team
**Review Cycle:** Monthly
**Last Updated:** 2025-01-20