Operational Runbooks - Master Index

Last Updated: 2025-01-20
Document Version: 1.0


Overview

This document provides a master index of all operational runbooks and procedures for the Sankofa/Phoenix/PanTel Proxmox deployment.


Quick Reference

Emergency Procedures

Common Operations


Network Operations

ER605 Router Configuration

  • ER605_ROUTER_CONFIGURATION.md - Complete router configuration guide
  • VLAN Configuration - Setting up VLANs on ER605
  • NAT Pool Configuration - Configuring role-based egress NAT
  • Failover Configuration - Setting up WAN failover

VLAN Management

  • VLAN Migration - Migrating from flat LAN to VLANs
  • VLAN Troubleshooting - Common VLAN issues and solutions
  • Inter-VLAN Routing - Configuring routing between VLANs

Cloudflare Zero Trust

  • CLOUDFLARE_ZERO_TRUST_GUIDE.md - Complete Cloudflare setup
  • Tunnel Management - Managing cloudflared tunnels
  • Application Publishing - Publishing applications via Cloudflare Access
  • Access Policy Management - Managing access policies

Besu Operations

Node Management

Adding a Validator

Prerequisites:

  • Validator key generated
  • VMID allocated (1000-1499 range)
  • VLAN 110 configured (if migrated)

Steps:

  1. Create LXC container with VMID
  2. Install Besu
  3. Configure validator key
  4. Add to static-nodes.json on all nodes
  5. Update allowlist (if using permissioning)
  6. Start Besu service
  7. Verify validator is participating
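Step 4 can be sketched as follows. Besu reads static-nodes.json as a JSON array of enode URLs, so the helpers below build one URL and render the file. This is a minimal sketch: the function names are hypothetical, the public key is assumed to be the 128-character hex node public key, and port 30303 is assumed as the P2P port (Besu's default).

```shell
# Hypothetical helpers; function names and sample values are illustrative only.

# Build an enode URL from a node public key (128 hex chars),
# an IP address, and an optional P2P port (assumed default: 30303).
make_enode() (
  pubkey=${1#0x}            # tolerate a leading 0x prefix
  ip=$2
  port=${3:-30303}
  if [ "${#pubkey}" -ne 128 ]; then
    echo "public key must be 128 hex characters" >&2
    exit 1
  fi
  case $pubkey in
    *[!0-9a-fA-F]*) echo "public key must be hexadecimal" >&2; exit 1 ;;
  esac
  printf 'enode://%s@%s:%s\n' "$pubkey" "$ip" "$port"
)

# Render a static-nodes.json document (a JSON array of strings)
# from the enode URLs passed as arguments.
render_static_nodes() (
  printf '[\n'
  sep=''
  for url in "$@"; do
    printf '%s  "%s"' "$sep" "$url"
    sep=',
'
  done
  printf '\n]\n'
)
```

Rendering the file from one list and copying the result to every node helps keep the copies identical, which the step requires.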

See: VALIDATED_SET_DEPLOYMENT_GUIDE.md

Removing a Validator

Prerequisites:

  • Removing the validator does not break consensus (check quorum requirements against the remaining validator count)
  • Back up the validator key

Steps:

  1. Stop Besu service
  2. Remove from static-nodes.json on all nodes
  3. Update allowlist (if using permissioning)
  4. Remove container (optional)
  5. Document removal
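The quorum check in the prerequisites can be done with simple arithmetic before a validator is stopped. The sketch below assumes the commonly cited QBFT rules: a quorum of ceil(2n/3) validators and tolerance of floor((n-1)/3) faulty validators. Verify these against the Besu version in use. The function names are hypothetical.

```shell
# Hypothetical helpers; assumes quorum = ceil(2n/3) and
# fault tolerance = floor((n-1)/3) for n validators.

# Number of validators that must agree for n validators.
qbft_quorum() ( n=$1; echo $(( (2 * n + 2) / 3 )) )

# Number of faulty validators that n validators can tolerate.
qbft_fault_tolerance() ( n=$1; echo $(( (n - 1) / 3 )) )

# Succeeds only if the set still tolerates at least one fault
# after a single validator is removed.
safe_to_remove_one() (
  n=$1
  [ "$(qbft_fault_tolerance $(( n - 1 )))" -ge 1 ]
)
```

For example, `safe_to_remove_one 5` succeeds because four validators remain, while `safe_to_remove_one 4` fails because three validators cannot tolerate a faulty node under these assumptions.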

Upgrading Besu

Prerequisites:

  • Back up current configuration
  • Test upgrade in dev environment
  • Create snapshot before upgrade

Steps:

  1. Create snapshot: pct snapshot <vmid> pre-upgrade-$(date +%Y%m%d)
  2. Stop Besu service
  3. Back up configuration and keys
  4. Install new Besu version
  5. Update configuration if needed
  6. Start Besu service
  7. Verify node is syncing
  8. Monitor for issues

Rollback:

  • If issues occur, roll back to the snapshot created in step 1: pct rollback <vmid> pre-upgrade-YYYYMMDD (use the date from that snapshot's name)
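Because the rollback must name exactly the snapshot created before the upgrade, it can help to derive the name once and print the whole plan before running anything. The following is a dry-run sketch only: it prints commands rather than executing them. The function names are hypothetical, and the `besu-validator` unit name is taken from the Diagnostic Procedures section and may differ per node.

```shell
# Hypothetical dry-run helpers: they only print commands, nothing is executed.

# Snapshot name for a given date (YYYYMMDD); defaults to today.
snapshot_name() ( printf 'pre-upgrade-%s\n' "${1:-$(date +%Y%m%d)}" )

# Print the upgrade plan for one container.
# Usage: upgrade_plan <vmid> [YYYYMMDD]
upgrade_plan() (
  vmid=$1
  snap=$(snapshot_name "$2")
  echo "pct snapshot $vmid $snap"
  echo "pct exec $vmid -- systemctl stop besu-validator"
  echo "# back up configuration and keys, then install the new Besu version"
  echo "pct exec $vmid -- systemctl start besu-validator"
  echo "# rollback if needed: pct rollback $vmid $snap"
)
```

Recording the printed plan in the change log also documents which snapshot name to use if a rollback is needed on a later day.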

Allowlist Management

Common Operations:

  • Generate allowlist from nodekeys
  • Update allowlist on all nodes
  • Verify allowlist is correct
  • Troubleshoot allowlist issues
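The generate and verify operations above might look like the sketch below. It assumes local, file-based node permissioning, where the permissions file carries a `nodes-allowlist` TOML array of enode URLs. The function names and file paths are hypothetical.

```shell
# Hypothetical helpers; assumes file-based permissioning with a
# nodes-allowlist=[...] entry in the permissions configuration file.

# Render the nodes-allowlist line from enode URLs passed as arguments.
render_nodes_allowlist() (
  printf 'nodes-allowlist=['
  sep=''
  for url in "$@"; do
    printf '%s"%s"' "$sep" "$url"
    sep=','
  done
  printf ']\n'
)

# Verify that every allowlist file is byte-identical to a reference copy.
# Usage: allowlists_match <reference-file> <file>...
allowlists_match() (
  ref=$1
  shift
  for f in "$@"; do
    cmp -s "$ref" "$f" || { echo "mismatch: $f" >&2; exit 1; }
  done
)
```

A mismatch reported by `allowlists_match` is a reasonable first thing to check when troubleshooting peers that refuse to connect.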

Consensus Troubleshooting

  • QBFT_TROUBLESHOOTING.md - QBFT consensus troubleshooting
  • Block Production Issues - Troubleshooting block production
  • Validator Recognition - Validator not being recognized

CCIP Operations

CCIP Deployment

Deployment Phases:

  1. Deploy Ops/Admin nodes (5400-5401)
  2. Deploy Monitoring nodes (5402-5403)
  3. Deploy Commit nodes (5410-5425)
  4. Deploy Execute nodes (5440-5455)
  5. Deploy RMN nodes (5470-5476)
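The VMID ranges in the deployment phases can be encoded in a small lookup, which is useful for validating a VMID before creating a container. This is a sketch: the function name and the role labels are hypothetical, and only the numeric ranges come from the list above.

```shell
# Hypothetical lookup; ranges are taken from the deployment phases above.
# Prints the role for a VMID and fails for VMIDs outside every range.
ccip_role() (
  vmid=$1
  case $vmid in
    ''|*[!0-9]*) echo "invalid"; exit 1 ;;
  esac
  if   [ "$vmid" -ge 5400 ] && [ "$vmid" -le 5401 ]; then echo "ops-admin"
  elif [ "$vmid" -ge 5402 ] && [ "$vmid" -le 5403 ]; then echo "monitoring"
  elif [ "$vmid" -ge 5410 ] && [ "$vmid" -le 5425 ]; then echo "commit"
  elif [ "$vmid" -ge 5440 ] && [ "$vmid" -le 5455 ]; then echo "execute"
  elif [ "$vmid" -ge 5470 ] && [ "$vmid" -le 5476 ]; then echo "rmn"
  else echo "unassigned"; exit 1
  fi
)
```

For example, `ccip_role 5412` prints `commit`, while `ccip_role 5430` prints `unassigned` and returns a non-zero status because that VMID falls in a gap between ranges.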

CCIP Node Management

  • Adding CCIP Node - Add new CCIP node to fleet
  • Removing CCIP Node - Remove CCIP node from fleet
  • CCIP Node Troubleshooting - Common CCIP issues

Monitoring & Observability

Monitoring Setup

Components:

  • Prometheus metrics collection
  • Grafana dashboards
  • Loki log aggregation
  • Alertmanager alerting

Health Checks

  • Node Health Checks - Check individual node health
  • Service Health Checks - Check service status
  • Network Health Checks - Check network connectivity

Scripts:

  • check-node-health.sh - Node health check script
  • check-service-status.sh - Service status check

Backup & Recovery

Backup Procedures

  • Configuration Backup - Back up all configuration files
  • Validator Key Backup - Encrypted backup of validator keys
  • Container Backup - Back up container configurations

Automated Backups:

  • Scheduled daily backups
  • Encrypted storage
  • Multiple locations
  • 30-day retention
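The 30-day retention rule can be checked with `find`. The sketch below only lists files past the retention window and deletes nothing. The function name is hypothetical, and the backup directory layout is an assumption, since the runbook does not specify one.

```shell
# Hypothetical helper: list backup files older than the retention window.
# Usage: list_expired_backups <backup-dir> [days]   (days defaults to 30)
list_expired_backups() (
  dir=$1
  days=${2:-30}
  # -mtime +N matches files last modified more than N full days ago.
  find "$dir" -type f -mtime +"$days" -print
)
```

Listing before deleting lets the output be reviewed, or compared with the other backup locations, before anything is removed.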

Disaster Recovery

  • Service Recovery - Recover failed services
  • Network Recovery - Recover network connectivity
  • Full System Recovery - Complete system recovery

Recovery Procedures:

  1. Identify failure point
  2. Restore from backup
  3. Verify service status
  4. Monitor for issues

Security Operations

Key Management

  • SECRETS_KEYS_CONFIGURATION.md - Secrets and keys management
  • Validator Key Rotation - Rotate validator keys
  • API Token Rotation - Rotate API tokens

Access Control

  • SSH Key Management - Manage SSH keys
  • Cloudflare Access - Manage Cloudflare Access policies
  • Firewall Rules - Manage firewall rules

Troubleshooting

Common Issues

Diagnostic Procedures

  1. Check Service Status

    systemctl status besu-validator
    
  2. Check Logs

    journalctl -u besu-validator -f
    
  3. Check Network Connectivity

    ping <node-ip>
    
  4. Check Node Health

    ./scripts/health/check-node-health.sh <vmid>
    

Emergency Procedures

Emergency Access

Break-glass Access:

  1. Use emergency SSH endpoint (if configured)
  2. Access via Cloudflare Access (if available)
  3. Physical console access (last resort)

Emergency Contacts:

  • Infrastructure Team: [contact info]
  • On-call Engineer: [contact info]

Service Recovery

Priority Order:

  1. Validators (critical for consensus)
  2. RPC nodes (critical for access)
  3. Monitoring (important for visibility)
  4. Other services

Recovery Steps:

  1. Identify failed service
  2. Check service logs
  3. Restart service
  4. If restart fails, restore from backup
  5. Verify service is operational
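When several services fail at once, the priority order above can be applied mechanically by ranking each service and sorting. This is a sketch under assumptions: the function names are hypothetical, and it assumes service names start with their role (for example `validator-1` or `rpc-1`), which may not match the actual naming in this deployment.

```shell
# Hypothetical helpers; assumes service names are prefixed with their role.

# Map a service name to its recovery priority (1 = recover first).
recovery_rank() (
  case $1 in
    validator*) echo 1 ;;
    rpc*)       echo 2 ;;
    monitoring*|prometheus*|grafana*|loki*|alertmanager*) echo 3 ;;
    *)          echo 4 ;;
  esac
)

# Print the given services in recovery order (ties sorted by name).
order_for_recovery() (
  for svc in "$@"; do
    printf '%s %s\n' "$(recovery_rank "$svc")" "$svc"
  done | sort -k1,1n -k2,2 | cut -d' ' -f2-
)
```

For example, `order_for_recovery grafana rpc-1 validator-2 blockscout validator-1` lists both validators first, then `rpc-1`, then `grafana`, and `blockscout` last.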

Network Recovery

Network Issues:

  1. Check ER605 router status
  2. Check switch status
  3. Check VLAN configuration
  4. Check firewall rules
  5. Test connectivity

VLAN Issues:

  1. Verify VLAN configuration on switches
  2. Verify VLAN configuration on ER605
  3. Verify Proxmox bridge configuration
  4. Test inter-VLAN routing

Maintenance Windows

Scheduled Maintenance

  • Weekly: Health checks, log review
  • Monthly: Security updates, configuration review
  • Quarterly: Full system review, backup testing

Maintenance Procedures

  1. Notify Stakeholders - Send maintenance notification
  2. Create Snapshots - Snapshot all containers before changes
  3. Perform Maintenance - Execute maintenance tasks
  4. Verify Services - Verify all services are operational
  5. Document Changes - Document all changes made

Troubleshooting

Architecture & Design

Configuration

Deployment

Monitoring

Reference


Document Status: Active
Maintained By: Infrastructure Team
Review Cycle: Monthly
Last Updated: 2025-01-20