
Disaster Recovery Procedures

Last Updated: 2025-01-20
Document Version: 1.0
Status: Active Documentation

Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.

Recovery Scenarios

1. Complete Host Failure

Scenario: A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

Recovery Steps:

Assess Impact:

# Check cluster quorum and which nodes are still members
pvecm status
pvecm nodes

# List VMs/containers and the node each one is assigned to
pvesh get /cluster/resources --type vm

Recover from Backup:
- Identify backup location (Proxmox Backup Server or external storage)
- Restore VMs/containers to another host in the cluster
- Verify network connectivity and services

Rejoin Cluster (if host is replaced):

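# If the failed node is still listed, remove it from a surviving node first: pvecm delnode <node-name>
# On new/repaired host (join via the IP of an existing cluster node)
pvecm add <existing-node-ip> --link0 <local-ip-address>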

Verify Services:
- Check all critical services are running
- Verify network connectivity
- Test application functionality

Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): Last backup (typically daily)
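
The "Assess Impact" step can be scripted from any surviving node. The following is a minimal sketch, not part of the original procedure: `cluster_is_quorate` and `list_guests_on_node` are hypothetical helper names, the quorum check assumes `pvecm status` prints a line starting with `Quorate:`, and the guest listing assumes the standard `/etc/pve/nodes/<node>/{qemu-server,lxc}/<vmid>.conf` layout.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for assessing a failed host; neither is a Proxmox command.

# Succeeds when `pvecm status` reports a quorate cluster.
# Assumption: the status output contains a line of the form "Quorate:   Yes".
cluster_is_quorate() {
    pvecm status 2>/dev/null | grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Prints "<kind> <vmid>" for every guest config stored under a node's directory.
# $1 = node name, $2 = config root (defaults to /etc/pve).
list_guests_on_node() {
    local node="$1" root="${2:-/etc/pve}" kind conf
    for kind in qemu-server lxc; do
        for conf in "$root/nodes/$node/$kind"/*.conf; do
            [ -e "$conf" ] || continue   # glob matched nothing
            printf '%s %s\n' "$kind" "$(basename "$conf" .conf)"
        done
    done
}
```

For example, `cluster_is_quorate && list_guests_on_node <failed-node>` gives the list of guests to restore elsewhere. Checking quorum first matters because `/etc/pve` becomes read-only when the cluster is not quorate.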

2. Storage Failure

Scenario: Storage pool fails (ZFS pool corruption, disk failure, etc.)

Recovery Steps:

Immediate Actions:
- Stop all VMs/containers using affected storage
- Assess extent of damage
- Check backup availability

Storage Recovery:

# For ZFS pools
zpool status
zpool import                  # list pools available for import
zpool import -f <pool-name>   # force only if no other host is using the pool
zpool scrub <pool-name>

Data Recovery:
- Restore from backups if pool cannot be recovered
- Use Proxmox Backup Server if available
- Restore individual VMs/containers as needed

Verification:
- Verify data integrity
- Test restored VMs/containers
- Document lessons learned

RTO: 8 hours
RPO: Last backup
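
The choice between repairing a pool and restoring from backup can be driven by the pool's reported health. The sketch below is illustrative only: `pool_action` is a hypothetical helper, and the mapping from health state to next step is an assumption that should be adapted to local policy. It relies on `zpool list -H -o health <pool>`, which prints a single state such as `ONLINE` or `DEGRADED`.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper: suggests a next step from a pool's health state.
# The state-to-action mapping is an assumption, not an official recommendation.
pool_action() {
    local pool="$1" health
    # If the pool cannot be queried at all, treat it as unavailable.
    health=$(zpool list -H -o health "$pool" 2>/dev/null) || health="UNAVAIL"
    case "$health" in
        ONLINE)   echo "scrub" ;;                # verify data with zpool scrub
        DEGRADED) echo "replace-disk" ;;         # pool is usable, redundancy is reduced
        *)        echo "restore-from-backup" ;;  # FAULTED, UNAVAIL, or unknown
    esac
}
```

Running `pool_action <pool-name>` before stopping or restoring guests gives a quick first indication of which branch of the procedure applies.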

3. Network Outage

Scenario: Complete network failure or misconfiguration

Recovery Steps:

Local Access:
- Use console access (iDRAC, iLO, or physical console)
- Verify Proxmox host is running
- Check network configuration

Network Restoration:

# Check network interfaces
ip addr show
ip link show

# Check routing
ip route show

# Reapply network configuration if needed
# (ifreload -a requires ifupdown2; without it, use: systemctl restart networking)
ifreload -a

VLAN Restoration:
- Verify VLAN configuration on switches
- Check Proxmox bridge configuration
- Test connectivity between VLANs

Service Verification:
- Test internal services
- Verify external connectivity (if applicable)
- Check Cloudflare tunnels (if used)

RTO: 2 hours
RPO: No data loss (network issue only)
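
The interface and routing checks above can be wrapped into small functions for use from a console session. This is a sketch under assumptions: `has_default_route` and `link_state` are hypothetical helper names, and the bridge name `vmbr0` in the usage note is only the common Proxmox default, not something this document specifies.

```bash
#!/usr/bin/env bash
# Hypothetical helpers wrapping the iproute2 checks from the procedure.

# Succeeds when the routing table contains a default route.
has_default_route() {
    [ -n "$(ip route show default 2>/dev/null)" ]
}

# Prints the operational state (for example UP or DOWN) of one interface.
# `ip -br link show <dev>` prints: name, state, MAC address, flags.
link_state() {
    ip -br link show "$1" 2>/dev/null | awk '{print $2}'
}
```

A typical check would be `has_default_route || echo "no default route"` followed by `link_state vmbr0` for each bridge defined on the host.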

4. Data Corruption

Scenario: VM/container data corruption or accidental deletion

Recovery Steps:

Immediate Actions:
- Stop affected VM/container
- Do not attempt repairs that might worsen corruption
- Document what was lost

Recovery Options:
- From Snapshot: Restore from most recent snapshot
- From Backup: Restore from Proxmox Backup Server
- From External Backup: Use external backup solution

Restoration:

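# Restore a VM from a backup archive or PBS volume ID
qmrestore <backup-archive> <vmid> --storage <storage>

# Restore a container
pct restore <vmid> <backup-archive> --storage <storage>

# Or roll back to a snapshot (use pct rollback for containers)
qm rollback <vmid> <snapshot-name>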

Verification:
- Verify data integrity
- Test application functionality
- Update documentation

RTO: 4 hours
RPO: Last snapshot/backup
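
VMs and containers are restored with different commands that take their arguments in a different order, which is easy to get wrong under pressure. The sketch below prints the command instead of running it, so it can be reviewed first. `restore_cmd` is a hypothetical helper, and the archive and storage names in the usage note are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the restore command for a guest type.
# $1 = qemu|lxc, $2 = backup archive or volume ID, $3 = VMID, $4 = target storage.
restore_cmd() {
    local type="$1" archive="$2" vmid="$3" storage="$4"
    case "$type" in
        qemu) printf 'qmrestore %s %s --storage %s\n' "$archive" "$vmid" "$storage" ;;
        lxc)  printf 'pct restore %s %s --storage %s\n' "$vmid" "$archive" "$storage" ;;
        *)    echo "unknown guest type: $type" >&2; return 1 ;;
    esac
}
```

For example, `restore_cmd lxc <backup-archive> 200 <storage>` prints `pct restore 200 <backup-archive> --storage <storage>`, which can then be checked and executed by hand.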

5. Security Incident

Scenario: Security breach, unauthorized access, or malware

Recovery Steps:

Immediate Containment:
- Isolate affected systems
- Disconnect from network if necessary
- Preserve evidence (logs, snapshots)

Assessment:
- Identify scope of breach
- Determine what was accessed/modified
- Check for data exfiltration

Recovery:
- Restore from known-good backups (pre-incident)
- Rebuild affected systems if necessary
- Update all credentials and keys

Hardening:
- Review and update security policies
- Patch vulnerabilities
- Enhance monitoring

Documentation:
- Document incident timeline
- Update security procedures
- Conduct post-incident review

RTO: 24 hours
RPO: Pre-incident state
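
For the "Preserve evidence" step, one possible approach is to archive a log directory and record a checksum so later tampering can be detected. This is a minimal sketch, not a forensic procedure: `preserve_evidence` is a hypothetical helper, the archive naming scheme is an assumption, and the destination should be storage that the affected system cannot write to afterwards.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archives a directory and writes a SHA-256 checksum next to it.
# $1 = source directory (for example a copy of /var/log), $2 = destination directory.
# Prints the path of the created archive on success.
preserve_evidence() {
    local src="$1" dest="$2" stamp archive name
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    name="evidence-$stamp.tar.gz"
    archive="$dest/$name"
    mkdir -p "$dest" || return 1
    tar -czf "$archive" -C "$src" . || return 1
    # Store the checksum with a relative file name so it verifies from any mount point.
    ( cd "$dest" && sha256sum "$name" > "$name.sha256" ) || return 1
    echo "$archive"
}
```

The archive can be verified later with `sha256sum -c <archive-name>.sha256` from the destination directory. Archiving a copy or snapshot of the logs is safer than a live directory, since `tar` reports an error when files change while being read.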

Backup Strategy

Backup Schedule

- Critical VMs/Containers: Daily backups
- Standard VMs/Containers: Weekly backups
- Configuration: Daily backups of Proxmox configuration
- Network Configuration: Version controlled (Git)

Backup Locations

- Primary: Proxmox Backup Server (if available)
- Secondary: External storage (NFS, SMB, or USB)
- Offsite: Cloud storage or remote location

Backup Verification

- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy
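
Between restore tests, a simple freshness check can flag a backup target that has stopped receiving data, which would silently stretch the RPO beyond the daily or weekly schedule. The sketch below assumes backups land as regular files in a directory (as with NFS, SMB, or USB targets). `backup_is_fresh` is a hypothetical helper, and it complements rather than replaces an actual restore test.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a directory contains at least one file
# modified within the last N hours.
# $1 = backup directory, $2 = maximum acceptable age in hours.
backup_is_fresh() {
    local dir="$1" max_age_hours="$2" recent
    recent=$(find "$dir" -type f -mmin "-$((max_age_hours * 60))" 2>/dev/null | head -n 1)
    [ -n "$recent" ]
}
```

For example, `backup_is_fresh <backup-dir> 24 || echo "daily backup missing"` suits the critical tier, while a 168-hour threshold matches the weekly tier.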

Recovery Contacts

Primary Contacts

- Infrastructure Lead: [Contact Information]
- Network Administrator: [Contact Information]
- Security Team: [Contact Information]

Escalation

- Level 1: Infrastructure team (4 hours)
- Level 2: Management (8 hours)
- Level 3: External support (24 hours)

Testing and Maintenance

Quarterly DR Drills

- Test Scenario: Simulate host failure
- Test Scenario: Simulate storage failure
- Test Scenario: Simulate network outage
- Document Results: Update procedures based on findings

Annual Full DR Test

- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation

Related Documentation

- BACKUP_AND_RESTORE.md - Detailed backup procedures
- OPERATIONAL_RUNBOOKS.md - Operational procedures
- ../../09-troubleshooting/TROUBLESHOOTING_FAQ.md - Troubleshooting guide

Last Updated: 2025-01-20
Review Cycle: Quarterly
