Disaster Recovery Procedures
Last Updated: 2025-01-20
Document Version: 1.0
Status: Active Documentation
Overview
This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.
Recovery Scenarios
1. Complete Host Failure
Scenario: A Proxmox host (R630 or ML110) fails completely and cannot be recovered.
Recovery Steps:
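- Assess Impact:
    # Check cluster quorum and which nodes are still members
    pvecm status
    pvecm nodes
    # List guests and the node each one is assigned to
    pvesh get /cluster/resources --type vm
- Recover from Backup:
  - Identify backup location (Proxmox Backup Server or external storage)
  - Restore VMs/containers to another host in the cluster
  - Verify network connectivity and services
- Rejoin Cluster (if host is replaced):
    # On a remaining cluster node, if the host was reinstalled: remove the stale membership
    pvecm delnode <failed-node-name>
    # On new/repaired host: join using the IP of an existing cluster node
    pvecm add <existing-node-ip> --link0 <local-address-for-link0>
- Verify Services:
  - Check all critical services are running
  - Verify network connectivity
  - Test application functionality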
Recovery Time Objective (RTO): 4 hours
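Recovery Point Objective (RPO): Last backup (up to 24 hours old for critical VMs/containers backed up daily, up to 7 days old for standard VMs/containers backed up weekly)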
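To see which guests were assigned to the failed host, one option is to read the per-node config directories that Proxmox keeps under /etc/pve/nodes, which stay readable from a surviving node while the cluster has quorum. The sketch below is illustrative only: the function name list_guests_on_node and its optional base-directory argument are assumptions made for this example, not part of any Proxmox tooling.

```shell
# Hypothetical helper: print the guests whose config files live under a node's
# directory. $1 = node name, $2 = base directory (defaults to /etc/pve/nodes).
list_guests_on_node() {
    local node="$1"
    local base="${2:-/etc/pve/nodes}"
    local conf
    for conf in "$base/$node"/qemu-server/*.conf "$base/$node"/lxc/*.conf; do
        [ -e "$conf" ] || continue   # skip glob patterns that matched nothing
        case "$conf" in
            */qemu-server/*) printf 'vm %s\n' "$(basename "$conf" .conf)" ;;
            */lxc/*)         printf 'ct %s\n' "$(basename "$conf" .conf)" ;;
        esac
    done
}
```

Run from a surviving node as `list_guests_on_node <failed-node-name>`; each output line is a guest type (vm or ct) followed by its ID, which gives the list of guests to restore elsewhere.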
2. Storage Failure
Scenario: Storage pool fails (ZFS pool corruption, disk failure, etc.)
Recovery Steps:
- Immediate Actions:
  - Stop all VMs/containers using affected storage
  - Assess extent of damage
  - Check backup availability
- Storage Recovery:
    # For ZFS pools: check health, then force-import the pool if it is not imported
    zpool status
    zpool import -f <pool-name>
    # Verify on-disk checksums after the import
    zpool scrub <pool-name>
- Data Recovery:
  - Restore from backups if pool cannot be recovered
  - Use Proxmox Backup Server if available
  - Restore individual VMs/containers as needed
- Verification:
  - Verify data integrity
  - Test restored VMs/containers
  - Document lessons learned
RTO: 8 hours
RPO: Last backup
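When assessing the extent of damage, it can help to reduce the pool listing to only the pools that need attention. The following is a minimal sketch, assuming the tab-separated output of `zpool list -H -o name,health`; the function name unhealthy_pools is hypothetical.

```shell
# Hypothetical filter: read "name<TAB>health" lines on stdin and print every
# pool whose health is anything other than ONLINE.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 " " $2 }'
}
```

Typical use would be `zpool list -H -o name,health | unhealthy_pools`. Empty output means every imported pool reports ONLINE; it says nothing about pools that failed to import at all, so `zpool import` with no arguments is still worth running to list importable pools.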
3. Network Outage
Scenario: Complete network failure or misconfiguration
Recovery Steps:
- Local Access:
  - Use console access (iDRAC, iLO, or physical console)
  - Verify Proxmox host is running
  - Check network configuration
- Network Restoration:
    # Check network interfaces
    ip addr show
    ip link show
    # Check routing
    ip route show
    # Restart networking if needed
    systemctl restart networking
- VLAN Restoration:
  - Verify VLAN configuration on switches
  - Check Proxmox bridge configuration
  - Test connectivity between VLANs
- Service Verification:
  - Test internal services
  - Verify external connectivity (if applicable)
  - Check Cloudflare tunnels (if used)
RTO: 2 hours
RPO: No data loss (network issue only)
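On a host with many bridges and VLAN interfaces, the full `ip link show` output is slow to scan from a console. A small filter over the brief format can narrow it down. This is a sketch under the assumption that input comes from `ip -br link show`, where the first column is the interface name and the second its operational state; the function name down_interfaces is made up for illustration.

```shell
# Hypothetical filter: read `ip -br link show` output on stdin and print every
# interface (except loopback) whose state is not UP, together with that state.
down_interfaces() {
    awk '$1 != "lo" && $2 != "UP" { print $1, $2 }'
}
```

Typical use would be `ip -br link show | down_interfaces`. Some virtual interfaces can legitimately report a state such as UNKNOWN, so treat the output as a list of things to check rather than a list of faults.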
4. Data Corruption
Scenario: VM/container data corruption or accidental deletion
Recovery Steps:
- Immediate Actions:
  - Stop affected VM/container
  - Do not attempt repairs that might worsen corruption
  - Document what was lost
- Recovery Options:
  - From Snapshot: Restore from most recent snapshot
  - From Backup: Restore from Proxmox Backup Server
  - From External Backup: Use external backup solution
- Restoration:
    # Restore a VM from PBS or a backup archive
    qmrestore <backup-volume-or-archive> <vmid> --storage <storage>
    # Restore a container from PBS or a backup archive
    pct restore <vmid> <backup-volume-or-archive> --storage <storage>
    # Or roll back to a snapshot (use pct rollback for containers)
    qm rollback <vmid> <snapshot-name>
- Verification:
  - Verify data integrity
  - Test application functionality
  - Update documentation
RTO: 4 hours
RPO: Last snapshot/backup
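Restoring from the most recent backup first requires finding it. The sketch below assumes the listing comes from `pvesm list <storage> --vmid <vmid>` and that backup volume IDs embed a sortable timestamp, so that plain lexical ordering puts the newest last; that holds for one guest's backups in the usual naming schemes but is an assumption worth verifying on your storage. The function name latest_backup is hypothetical.

```shell
# Hypothetical filter: read a `pvesm list` style listing on stdin, keep only
# volume IDs that contain ":backup/", and print the one that sorts last.
latest_backup() {
    awk '$1 ~ /:backup\// { print $1 }' | LC_ALL=C sort | tail -n 1
}
```

Typical use would be `pvesm list <storage> --vmid <vmid> | latest_backup`, with the printed volume ID then passed to the restore command. If the corruption predates the newest backup, pick an older entry from the full listing instead.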
5. Security Incident
Scenario: Security breach, unauthorized access, or malware
Recovery Steps:
- Immediate Containment:
  - Isolate affected systems
  - Disconnect from network if necessary
  - Preserve evidence (logs, snapshots)
- Assessment:
  - Identify scope of breach
  - Determine what was accessed/modified
  - Check for data exfiltration
- Recovery:
  - Restore from known-good backups (pre-incident)
  - Rebuild affected systems if necessary
  - Update all credentials and keys
- Hardening:
  - Review and update security policies
  - Patch vulnerabilities
  - Enhance monitoring
- Documentation:
  - Document incident timeline
  - Update security procedures
  - Conduct post-incident review
RTO: 24 hours
RPO: Pre-incident state
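For the evidence-preservation step, one possible approach is to archive the relevant log directory and record a checksum before any rebuild starts, so later changes to the copy can be detected. This is a sketch, not a forensic procedure: the function name preserve_evidence, the archive naming, and the choice of SHA-256 are all assumptions made for illustration.

```shell
# Hypothetical helper: archive a directory into a timestamped tarball and write
# a SHA-256 checksum file next to it. $1 = source directory, $2 = destination.
preserve_evidence() {
    local src="$1" dest="$2"
    local stamp name
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    name="evidence-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # Archive relative to the parent so the tarball holds one top-level folder
    tar -czf "$dest/$name" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Store the checksum with a relative file name so it verifies from $dest
    ( cd "$dest" && sha256sum "$name" > "$name.sha256" ) || return 1
    printf '%s\n' "$dest/$name"
}
```

For example, `preserve_evidence /var/log /mnt/evidence` prints the path of the new archive; running `sha256sum -c` on the accompanying .sha256 file from the destination directory later confirms the copy is unchanged. Writing to storage that the affected system cannot modify afterwards is the safer choice.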
Backup Strategy
Backup Schedule
- Critical VMs/Containers: Daily backups
- Standard VMs/Containers: Weekly backups
- Configuration: Daily backups of Proxmox configuration
- Network Configuration: Version controlled (Git)
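The schedule above implies a maximum acceptable backup age per tier: one day for critical guests and seven days for standard ones. A check along those lines can flag guests whose newest backup is older than the schedule allows. The sketch below assumes timestamps are given as Unix epoch seconds; the function name backup_is_stale and the tier labels critical and standard are invented for this example.

```shell
# Hypothetical check: succeed (exit 0) when the last backup is older than the
# tier allows. $1 = tier, $2 = last backup time, $3 = current time (optional).
backup_is_stale() {
    local tier="$1" last="$2" now="${3:-$(date +%s)}" max_age
    case "$tier" in
        critical) max_age=86400 ;;    # daily backups: 24 hours
        standard) max_age=604800 ;;   # weekly backups: 7 days
        *) echo "unknown tier: $tier" >&2; return 2 ;;
    esac
    [ $(( now - last )) -gt "$max_age" ]
}
```

For example, `backup_is_stale critical "$last_backup_epoch" && echo "backup overdue"` would print a warning once a critical guest's newest backup is more than 24 hours old. In practice some slack on top of the exact interval avoids false alarms while a scheduled job is still running.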
Backup Locations
- Primary: Proxmox Backup Server (if available)
- Secondary: External storage (NFS, SMB, or USB)
- Offsite: Cloud storage or remote location
Backup Verification
- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy
Recovery Contacts
Primary Contacts
- Infrastructure Lead: [Contact Information]
- Network Administrator: [Contact Information]
- Security Team: [Contact Information]
Escalation
- Level 1: Infrastructure team (4 hours)
- Level 2: Management (8 hours)
- Level 3: External support (24 hours)
Testing and Maintenance
Quarterly DR Drills
- Test Scenario: Simulate host failure
- Test Scenario: Simulate storage failure
- Test Scenario: Simulate network outage
- Document Results: Update procedures based on findings
Annual Full DR Test
- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation
Related Documentation
- BACKUP_AND_RESTORE.md - Detailed backup procedures
- OPERATIONAL_RUNBOOKS.md - Operational procedures
- ../../09-troubleshooting/TROUBLESHOOTING_FAQ.md - Troubleshooting guide
Review Cycle: Quarterly