Best Practices and Recommendations

A guide to production deployment, multi-node setups, elastic storage, and day-to-day operations.

🏗️ Architecture Recommendations

Multi-Node Deployment

Benefits:

  • High availability and redundancy
  • Load distribution across nodes
  • Disaster recovery capabilities
  • Better resource utilization

Node Assignment Strategies:

  1. Auto (Recommended): Automatically selects nodes with available resources
  2. Round-Robin: Distributes containers evenly across nodes
  3. Manual: Specify node assignments in configuration file

Configuration:

# In config/proxmox.conf
PROXMOX_NODES="pve,pve2,pve3"
NODE_ASSIGNMENT_STRATEGY="auto"
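
As an illustration of the round-robin strategy, the sketch below maps container indexes to the nodes listed in PROXMOX_NODES. The function name assign_round_robin is hypothetical and is not taken from the project's scripts; the actual deploy-multi-node.sh may assign nodes differently.

```shell
#!/bin/sh
# Hypothetical helper: print "<index> <node>" pairs, cycling through the nodes.
# $1: comma-separated node list (same format as PROXMOX_NODES)
# $2: number of containers to place
assign_round_robin() {
  awk -v nodes="$1" -v count="$2" 'BEGIN {
    n = split(nodes, list, ",")          # awk arrays are 1-indexed
    for (i = 0; i < count; i++)
      print i, list[(i % n) + 1]
  }'
}

# Usage: assign_round_robin "pve,pve2,pve3" 4
# The fourth container (index 3) wraps around to the first node, pve.
```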

Deployment:

# Deploy with multi-node support
./scripts/manage/deploy-multi-node.sh validators 4

Elastic Storage Configuration

Storage Expansion:

# Expand container storage
./scripts/manage/expand-storage.sh <VMID> <additional_GB>

# Example: Expand validator by 50GB
./scripts/manage/expand-storage.sh 1000 50

Automatic Expansion:

  • Enable AUTO_EXPAND_STORAGE=true in config
  • Set STORAGE_ALERT_THRESHOLD for proactive expansion
  • Monitor storage usage with pvesm status
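
A proactive check could be built on the output of pvesm status, which prints one row per storage with the usage percentage in the last column. The sketch below assumes STORAGE_ALERT_THRESHOLD is a usage percentage; the function name storages_over_threshold and the fallback value 80 are hypothetical, and how the project's scripts actually interpret the variable is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: read `pvesm status` output on stdin and print the names
# of active storages whose usage is at or above the given percentage.
# $1: threshold in percent (assumed meaning of STORAGE_ALERT_THRESHOLD)
storages_over_threshold() {
  awk -v limit="$1" 'NR > 1 && $3 == "active" {
    used = $7
    sub(/%/, "", used)                   # "85.00%" -> "85.00"
    if (used + 0 >= limit + 0) print $1
  }'
}

# Usage on a Proxmox host:
#   pvesm status | storages_over_threshold "${STORAGE_ALERT_THRESHOLD:-80}"
```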

Storage Pool Types:

  • local-lvm: Fast, local storage (default)
  • local-zfs: Advanced features, snapshots
  • shared-storage: Network storage for HA

Container Migration

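Migration:

Proxmox VE cannot live-migrate LXC containers. A running container is shut down, moved, and restarted on the target node, so plan for a short outage.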

# Migrate container to another node
./scripts/manage/migrate-container.sh <VMID> <target_node>

# Example: Migrate validator to pve2
./scripts/manage/migrate-container.sh 1000 pve2

# With storage migration
./scripts/manage/migrate-container.sh 1000 pve2 local-lvm true

Migration Best Practices:

  1. Perform during maintenance windows
  2. Test migration on non-critical containers first
  3. Ensure network connectivity between nodes
  4. Monitor during and after migration
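
Some of these checks can be automated before the migration script is called. The following is a minimal sketch, assuming the target node name resolves from the current host and that PROXMOX_NODES holds the comma-separated node list; is_known_node and premigrate_check are hypothetical names, not part of the project's scripts.

```shell
#!/bin/sh
# Hypothetical helper: succeed if node $2 appears in the comma-separated list $1.
is_known_node() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical pre-flight check for a migration target.
# $1: target node; reads PROXMOX_NODES from the environment.
premigrate_check() {
  if ! is_known_node "${PROXMOX_NODES:-}" "$1"; then
    echo "unknown node: $1" >&2
    return 1
  fi
  # One ICMP probe with a 2-second timeout (Linux iputils ping options).
  if ! ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "node unreachable: $1" >&2
    return 1
  fi
}

# Usage:
#   premigrate_check pve2 && ./scripts/manage/migrate-container.sh 1000 pve2
```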

📊 Resource Management

Resource Allocation

Recommended Allocations:

  • Validators: 8GB RAM, 4 CPU, 100GB disk (expandable)
  • RPC Nodes: 16GB RAM, 4 CPU, 200GB disk (expandable)
  • Services: 2-4GB RAM, 2 CPU, 20-50GB disk
  • Monitoring: 4GB RAM, 4 CPU, 50GB disk (with retention)

Resource Monitoring

Enable Monitoring:

# In config/proxmox.conf
RESOURCE_MONITORING_ENABLED="true"
RESOURCE_ALERT_CPU="90"
RESOURCE_ALERT_MEMORY="85"
RESOURCE_ALERT_DISK="80"

Check Resources:

# Check all nodes
./scripts/manage/deploy-multi-node.sh check-resources

# Check specific container
pct exec <VMID> -- free -h
pct exec <VMID> -- df -h
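
To compare a reading against the alert thresholds, the output of free can be reduced to a percentage. This is a sketch under the assumption that RESOURCE_ALERT_MEMORY is a percentage of total memory; mem_used_percent and check_threshold are hypothetical helpers. Note that it parses plain free output (numeric values), not free -h.

```shell
#!/bin/sh
# Hypothetical helper: read `free` output on stdin and print used memory
# as a whole-number percentage of total memory.
mem_used_percent() {
  awk '/^Mem:/ { printf "%d\n", ($3 * 100) / $2 }'
}

# Hypothetical helper: compare a value against a limit and report the result.
# $1: label, $2: measured percentage, $3: alert threshold
check_threshold() {
  if [ "$2" -ge "$3" ]; then
    echo "ALERT $1 ${2}% >= ${3}%"
  else
    echo "OK $1 ${2}%"
  fi
}

# Usage on a Proxmox host (VMID 1000 is the validator example used earlier):
#   used=$(pct exec 1000 -- free | mem_used_percent)
#   check_threshold memory "$used" "${RESOURCE_ALERT_MEMORY:-85}"
```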

🔄 High Availability

HA Configuration

Enable HA:

# In config/proxmox.conf
HA_ENABLED="true"
HA_GROUP="smom-dbis-138"

Benefits:

  • Automatic failover
  • Service continuity
  • Reduced downtime

Redundancy

Recommended Redundancy:

  • Validators: Minimum 4 nodes (BFT consensus needs 3f+1 validators to tolerate f faulty ones, so 4 validators tolerate 1 failure)
  • RPC Nodes: 3+ nodes for load balancing
  • Sentries: 3+ nodes for DDoS protection
  • Services: 2+ instances for critical services
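
The validator minimum follows from Byzantine fault tolerance arithmetic: a network of n validators tolerates f = (n - 1) / 3 faulty validators, rounded down. The sketch below assumes a BFT protocol of the IBFT 2.0 or QBFT kind, as used by Besu; the function names are hypothetical.

```shell
#!/bin/sh
# Number of faulty validators a BFT network of $1 validators can tolerate.
max_faulty() {
  echo $(( ($1 - 1) / 3 ))
}

# Minimum number of validators needed to tolerate $1 faulty validators.
min_validators() {
  echo $(( 3 * $1 + 1 ))
}

# Usage:
#   max_faulty 4       # 4 validators tolerate 1 failure
#   max_faulty 6       # 6 validators still tolerate only 1 failure
#   min_validators 2   # tolerating 2 failures needs 7 validators
```

One consequence worth noting: growing from 4 to 6 validators does not raise fault tolerance; the next step up is 7.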

💾 Backup Strategy

Backup Configuration

# In config/proxmox.conf
BACKUP_ENABLED="1"
BACKUP_RETENTION_DAYS="30"
BACKUP_SCHEDULE="02:00"
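
Retention can be enforced by listing archives older than BACKUP_RETENTION_DAYS. The sketch below assumes backups are vzdump archives stored in a single directory; the directory path and the function name list_expired_backups are hypothetical, and the project's backup scripts may handle retention themselves.

```shell
#!/bin/sh
# Hypothetical helper: print vzdump archives last modified more than N days ago.
# $1: backup directory, $2: retention in days
list_expired_backups() {
  find "$1" -type f -name 'vzdump-*' -mtime +"$2" -print
}

# Usage (the directory is an assumption; review the list before deleting):
#   list_expired_backups /var/lib/vz/dump "${BACKUP_RETENTION_DAYS:-30}"
```

Printing the candidates instead of deleting them keeps the step safe to run; deletion can be added once the list looks right.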

Backup Best Practices

  1. Regular Backups: Daily automated backups
  2. Snapshot Before Changes: Create snapshots before upgrades
  3. Off-Site Storage: Store backups on storage that is physically separate from the cluster, ideally at another site
  4. Test Restores: Regularly test backup restoration

Backup Scripts

# Manual backup
./scripts/backup/backup-all.sh

# Restore from backup
./scripts/backup/restore-container.sh <VMID> <backup_file>

🔐 Security Best Practices

Network Security

VLAN Isolation:

  • Validators: VLAN 100 (private)
  • Sentries: VLAN 101 (semi-private)
  • RPC Nodes: VLAN 102 (public)
  • Services: VLAN 103 (internal)
  • Monitoring: VLAN 104 (management)
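
A container's VLAN assignment can be verified from its net0 line, where Proxmox records the VLAN as a tag= option. The helper below is a hypothetical sketch for auditing that a container sits on the expected VLAN; the name net0_vlan_tag is not part of the project's scripts.

```shell
#!/bin/sh
# Hypothetical helper: read `pct config` output on stdin and print the VLAN tag
# of net0. Prints nothing if net0 has no tag.
net0_vlan_tag() {
  sed -n 's/^net0:.*[,[:space:]]tag=\([0-9][0-9]*\).*/\1/p'
}

# Usage on a Proxmox host (a validator is expected on VLAN 100):
#   [ "$(pct config 1000 | net0_vlan_tag)" = "100" ] || echo "unexpected VLAN"
```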

Firewall Rules:

  • Restrict validator RPC access
  • Limit public RPC access with rate limiting
  • Isolate management networks

Access Control

API Tokens:

  • Use API tokens instead of passwords
  • Rotate tokens regularly
  • Use least privilege principle
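
For reference, the Proxmox VE API accepts a token in an Authorization header of the form PVEAPIToken=USER@REALM!TOKENID=SECRET. The sketch below builds that header; the function name, the token ID monitor@pve!readonly, and the host name are placeholders chosen for illustration.

```shell
#!/bin/sh
# Build the Authorization header for a Proxmox VE API token.
# $1: full token ID (USER@REALM!TOKENID), $2: token secret
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s=%s\n' "$1" "$2"
}

# Usage (placeholders; keep the secret out of shell history and version control):
#   curl -sS -H "$(pve_auth_header 'monitor@pve!readonly' "$PVE_TOKEN_SECRET")" \
#     "https://pve.example.com:8006/api2/json/version"
```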

Container Security:

  • Use unprivileged containers where possible
  • Enable AppArmor/SELinux
  • Keep containers updated

📈 Scaling Recommendations

Horizontal Scaling

Adding More Nodes:

  1. Add node to cluster
  2. Update PROXMOX_NODES configuration
  3. Migrate containers using migration script
  4. Verify connectivity

Scaling Services:

# Deploy additional validators
./scripts/deployment/deploy-besu-nodes.sh --validators 6

# Deploy additional RPC nodes
./scripts/deployment/deploy-besu-nodes.sh --rpc 5

Vertical Scaling

Increasing Resources:

# Expand storage
./scripts/manage/expand-storage.sh <VMID> <GB>

# Increase memory (requires container restart)
pct set <VMID> -memory <MB>

# Increase CPU
pct set <VMID> -cores <count>

🔍 Monitoring and Alerting

Prometheus Integration

Metrics Collection:

  • Container resource usage
  • Service health metrics
  • Network metrics
  • Storage metrics

Alerting:

  • Configure Alertmanager
  • Set up notification channels
  • Define alert rules

Health Checks

Enable Health Checks:

# In config/proxmox.conf
HEALTH_CHECK_ENABLED="true"
HEALTH_CHECK_INTERVAL="300"

Check Service Health:

# Check container status
pct status <VMID>

# Check service status
pct exec <VMID> -- systemctl status <service>

# Check logs
pct exec <VMID> -- journalctl -u <service> -n 50
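
These checks can be combined into a small polling loop, for example after a restart. This is a minimal sketch that relies on pct status printing a line of the form "status: running"; container_state and wait_until_running are hypothetical names, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: read `pct status` output on stdin and print the state.
container_state() {
  awk '/^status:/ { print $2 }'
}

# Hypothetical helper: poll until the container reports "running".
# $1: VMID, $2: number of attempts, $3: seconds between attempts
wait_until_running() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    if [ "$(pct status "$1" 2>/dev/null | container_state)" = "running" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  return 1
}

# Usage on a Proxmox host:
#   wait_until_running 1000 10 5 || echo "container 1000 did not come up" >&2
```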

🚀 Performance Optimization

Storage Optimization

Use Appropriate Storage:

  • SSD: For validators and RPC nodes (performance)
  • NVMe: For high-performance requirements
  • Network Storage: For shared data (Ceph, NFS)

Disk I/O Optimization:

  • Use separate storage for logs
  • Enable write-back caching where appropriate
  • Monitor disk I/O with iostat

Network Optimization

Network Configuration:

  • Use dedicated network for cluster communication
  • Enable jumbo frames for inter-node communication
  • Configure network bonding for redundancy

Connection Pooling:

  • Configure RPC connection limits
  • Use connection pooling for services
  • Monitor network usage

🔧 Maintenance Procedures

Upgrade Procedure

  1. Create Snapshots:

    ./scripts/backup/backup-all.sh
    
  2. Rolling Upgrades:

    ./scripts/upgrade/upgrade-all.sh
    
  3. Verify Services:

    ./scripts/verify/verify-deployment.sh
    

Maintenance Window

Best Practices:

  • Schedule during low-traffic periods
  • Perform rolling updates
  • Test on non-production first
  • Have rollback plan ready
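
A rolling update applies a step to one container at a time and halts at the first failure, so the remaining containers keep serving. The sketch below shows that control flow only; rolling_apply and the upgrade_one step are hypothetical, and the project's upgrade-all.sh may work differently.

```shell
#!/bin/sh
# Hypothetical helper: run a step on each container in order, stopping at the
# first failure so untouched containers stay on the known-good version.
# $1: command or function to run with a VMID as its argument; rest: VMIDs
rolling_apply() {
  step=$1
  shift
  for vmid in "$@"; do
    if ! "$step" "$vmid"; then
      echo "stopped at $vmid" >&2
      return 1
    fi
    echo "done $vmid"
  done
}

# Usage (upgrade_one is a placeholder for an upgrade-and-verify step):
#   upgrade_one() { pct exec "$1" -- apt-get -y upgrade; }
#   rolling_apply upgrade_one 1000 1001 1002 1003
```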

📝 Documentation

Keep Documentation Updated

  1. Update Configuration:

    • Document all configuration changes
    • Keep network diagrams current
    • Maintain inventory list
  2. Change Log:

    • Track all deployments
    • Document issues and resolutions
    • Maintain runbooks

Checklist

Pre-Deployment

  • Review resource requirements
  • Configure storage pools
  • Set up network VLANs
  • Configure backup storage
  • Test node connectivity

Deployment

  • Deploy to staging first
  • Verify container creation
  • Check network connectivity
  • Verify service health
  • Test failover scenarios

Post-Deployment

  • Configure monitoring
  • Set up alerting
  • Schedule backups
  • Document configuration
  • Train operations team

🆘 Troubleshooting

Common Issues

Storage Full:

# Check storage usage
pvesm status

# Expand storage
./scripts/manage/expand-storage.sh <VMID> <GB>

Container Won't Start:

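# Check startup logs from the Proxmox host
# (pct exec only works once the container is running)
pct start <VMID> --debug
journalctl -u pve-container@<VMID> -n 50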

# Check resource limits
pct config <VMID>

Network Issues:

# Check network configuration
pct config <VMID> | grep net0

# Test connectivity
pct exec <VMID> -- ping -c 4 <gateway>

Support Resources