Disaster Recovery Procedures

Last Updated: 2025-01-20
Document Version: 1.0
Status: Active Documentation


Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.


Recovery Scenarios

1. Complete Host Failure

Scenario: A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

Recovery Steps:

  1. Assess Impact:

    # Check cluster quorum and membership from a surviving node
    pvecm status
    pvecm nodes

    # List all guests and the node each one is assigned to
    pvesh get /cluster/resources --type vm
    
  2. Recover from Backup:

    • Identify backup location (Proxmox Backup Server or external storage)
    • Restore VMs/containers to another host in the cluster
    • Verify network connectivity and services
  3. Rejoin Cluster (if host is replaced):

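    # On a surviving node, if the failed node is still listed in `pvecm nodes`
    pvecm delnode <failed-node-name>
    # On the new/repaired host: join via the address of an existing cluster node
    pvecm add <existing-node-address> --link0 <local-cluster-link-address>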
    
  4. Verify Services:

    • Check all critical services are running
    • Verify network connectivity
    • Test application functionality

Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): Last backup (daily for critical guests, weekly for standard guests; see Backup Schedule)
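
The impact assessment in step 1 can be scripted from any surviving node, because guest configuration files are replicated cluster-wide under `/etc/pve/nodes/<node>/`. A minimal sketch follows. The function names and the optional root-directory argument are illustrative and not part of Proxmox, and the quorum check assumes that `pvecm status` prints a `Quorate:` line.

```shell
#!/bin/sh
# Hypothetical helpers for assessing a failed node from a surviving cluster member.

# Print "<type> <vmid>" for every guest config stored under the given node.
# $1 = node name, $2 = config root (defaults to /etc/pve; override for dry runs).
list_guests_on_node() {
    node="$1"
    root="${2:-/etc/pve}"
    for kind in qemu-server lxc; do
        dir="$root/nodes/$node/$kind"
        [ -d "$dir" ] || continue
        for conf in "$dir"/*.conf; do
            [ -e "$conf" ] || continue   # the glob matched nothing
            printf '%s %s\n' "$kind" "$(basename "$conf" .conf)"
        done
    done
}

# Read `pvecm status` output on stdin; succeed only if the cluster is quorate.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}
```

For example, `list_guests_on_node <failed-node-name>` prints the guests that need to be restored elsewhere, and `pvecm status | is_quorate || echo "no quorum"` warns when the remaining nodes cannot yet accept configuration changes.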


2. Storage Failure

Scenario: Storage pool fails (ZFS pool corruption, disk failure, etc.)

Recovery Steps:

  1. Immediate Actions:

    • Stop all VMs/containers using affected storage
    • Assess extent of damage
    • Check backup availability
  2. Storage Recovery:

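    # For ZFS pools: inspect health and per-device errors
    zpool status -v
    # Import the pool if it is not listed; add -f only if it was last used by another system
    zpool import <pool-name>
    # Verify all data against checksums once the pool is imported
    zpool scrub <pool-name>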
    
  3. Data Recovery:

    • Restore from backups if pool cannot be recovered
    • Use Proxmox Backup Server if available
    • Restore individual VMs/containers as needed
  4. Verification:

    • Verify data integrity
    • Test restored VMs/containers
    • Document lessons learned

RTO: 8 hours
RPO: Last backup
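
Before deciding between repair and restore, it helps to see at a glance which pools are degraded. A small sketch, assuming the tab-separated output of `zpool list -H -o name,health`; the function name `unhealthy_pools` is hypothetical.

```shell
#!/bin/sh
# Read "name<TAB>health" lines on stdin and print every pool whose
# health is anything other than ONLINE (for example DEGRADED or FAULTED).
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Intended use on a Proxmox host (not executed here):
#   zpool list -H -o name,health | unhealthy_pools
```

Pools that are not imported do not appear in this listing; running `zpool import` with no arguments shows pools that are available for import.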


3. Network Outage

Scenario: Complete network failure or misconfiguration

Recovery Steps:

  1. Local Access:

    • Use console access (iDRAC, iLO, or physical console)
    • Verify Proxmox host is running
    • Check network configuration
  2. Network Restoration:

    # Check network interfaces
    ip addr show
    ip link show
    
    # Check routing
    ip route show
    
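    # Reload network configuration if needed (requires ifupdown2)
    ifreload -a
    # Fallback; a full restart may detach running guests from their bridges
    systemctl restart networking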
    
  3. VLAN Restoration:

    • Verify VLAN configuration on switches
    • Check Proxmox bridge configuration
    • Test connectivity between VLANs
  4. Service Verification:

    • Test internal services
    • Verify external connectivity (if applicable)
    • Check Cloudflare tunnels (if used)

RTO: 2 hours
RPO: No data loss (network issue only)
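
Once local console access is available, a quick reachability check against the default gateway helps separate host-side misconfiguration from upstream failures. A sketch, assuming the usual `default via <address> dev <interface>` line format of `ip route show default`; `default_gateway` is an illustrative name.

```shell
#!/bin/sh
# Read `ip route show default` output on stdin and print the first gateway address.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Intended use from the console (not executed here):
#   gw="$(ip route show default | default_gateway)"
#   [ -n "$gw" ] && ping -c 3 "$gw"
```

An empty result means the host has no default route, which points at the host's own network configuration rather than the switches.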


4. Data Corruption

Scenario: VM/container data corruption or accidental deletion

Recovery Steps:

  1. Immediate Actions:

    • Stop affected VM/container
    • Do not attempt repairs that might worsen corruption
    • Document what was lost
  2. Recovery Options:

    • From Snapshot: Roll back to the most recent snapshot taken before the corruption occurred
    • From Backup: Restore from Proxmox Backup Server
    • From External Backup: Use external backup solution
  3. Restoration:

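    # Restore a VM from a backup (PBS volume ID or archive path)
    qmrestore <backup-volume-or-archive> <vmid> --storage <storage>

    # Restore a container from a backup
    pct restore <vmid> <backup-volume-or-archive> --storage <storage>

    # Or roll back to a snapshot (use pct rollback for containers)
    qm rollback <vmid> <snapshot-name>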
    
  4. Verification:

    • Verify data integrity
    • Test application functionality
    • Update documentation

RTO: 4 hours
RPO: Last snapshot/backup
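
VMs and containers are restored with different tools (`qmrestore` and `pct restore`), and the argument order differs between them. The sketch below only prints the command for review rather than running it. `restore_command` is a hypothetical helper, and the guest type keywords `vm` and `ct` are assumptions of this example.

```shell
#!/bin/sh
# Print the restore command for a guest without executing it.
# $1 = guest type (vm|ct), $2 = vmid,
# $3 = backup volume ID or archive path, $4 = target storage.
restore_command() {
    case "$1" in
        vm) printf 'qmrestore %s %s --storage %s\n' "$3" "$2" "$4" ;;
        ct) printf 'pct restore %s %s --storage %s\n' "$2" "$3" "$4" ;;
        *)  printf 'unknown guest type: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

Printing first leaves room to confirm the VMID and target storage before anything is overwritten.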


5. Security Incident

Scenario: Security breach, unauthorized access, or malware

Recovery Steps:

  1. Immediate Containment:

    • Isolate affected systems
    • Disconnect from network if necessary
    • Preserve evidence (logs, snapshots)
  2. Assessment:

    • Identify scope of breach
    • Determine what was accessed/modified
    • Check for data exfiltration
  3. Recovery:

    • Restore from known-good backups (pre-incident)
    • Rebuild affected systems if necessary
    • Update all credentials and keys
  4. Hardening:

    • Review and update security policies
    • Patch vulnerabilities
    • Enhance monitoring
  5. Documentation:

    • Document incident timeline
    • Update security procedures
    • Conduct post-incident review

RTO: 24 hours
RPO: Pre-incident state
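
Preserving evidence is easier to do correctly when log copies are archived with a checksum at collection time. A minimal sketch, assuming GNU `tar` and `sha256sum` as shipped with Debian-based Proxmox hosts; the function name, archive naming scheme, and directory arguments are illustrative.

```shell
#!/bin/sh
# Archive a directory of logs and record a SHA-256 checksum next to the archive.
# $1 = directory to preserve, $2 = destination directory.
# Prints the path of the created archive.
preserve_logs() {
    src="$1"
    dest="$2"
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    archive="$dest/evidence-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # Store paths relative to the parent so the archive unpacks into one directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Write the checksum with a relative file name so `sha256sum -c` works
    # from inside the destination directory.
    ( cd "$dest" && sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" ) || return 1
    printf '%s\n' "$archive"
}
```

Writing the archive to storage outside the affected system, and verifying it later with `sha256sum -c`, helps show that the copy was not altered after collection.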


Backup Strategy

Backup Schedule

  • Critical VMs/Containers: Daily backups
  • Standard VMs/Containers: Weekly backups
  • Configuration: Daily backups of Proxmox configuration
  • Network Configuration: Version controlled (Git)
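
The daily configuration backup can be a small archive of the host's cluster and network configuration. A sketch under assumptions: the path list (`/etc/pve`, `/etc/network/interfaces`, `/etc/hosts`) is a common minimum rather than this site's confirmed list, and `backup_host_config` with its optional root argument is hypothetical.

```shell
#!/bin/sh
# Archive key host configuration paths into a dated tarball.
# $1 = destination directory, $2 = filesystem root (defaults to /; override for dry runs).
# Prints the path of the created archive.
backup_host_config() {
    dest="$1"
    root="${2:-/}"
    archive="$dest/pve-config-$(uname -n)-$(date -u +%Y%m%d).tar.gz"
    mkdir -p "$dest" || return 1
    set --   # reuse "$@" as the list of paths that actually exist
    for path in etc/pve etc/network/interfaces etc/hosts; do
        if [ -e "$root/$path" ]; then set -- "$@" "$path"; fi
    done
    [ "$#" -gt 0 ] || return 1   # nothing to archive
    tar -czf "$archive" -C "$root" "$@" || return 1
    printf '%s\n' "$archive"
}
```

Running it once a day from cron or a systemd timer, with the destination on one of the backup locations, would match the daily configuration schedule.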

Backup Locations

  1. Primary: Proxmox Backup Server (if available)
  2. Secondary: External storage (NFS, SMB, or USB)
  3. Offsite: Cloud storage or remote location

Backup Verification

  • Weekly restore tests
  • Monthly full disaster recovery drill
  • Quarterly review of backup strategy
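
Between restore tests, backup freshness can be checked against the schedule: 24 hours for daily backups and 168 hours for weekly ones. A sketch of that comparison; `backup_is_stale` is a hypothetical helper that takes timestamps as epoch seconds so the arithmetic is easy to verify.

```shell
#!/bin/sh
# Succeed (exit 0) if a backup is older than the allowed age.
# $1 = backup time (epoch seconds), $2 = current time (epoch seconds),
# $3 = maximum age in hours (24 for daily, 168 for weekly).
backup_is_stale() {
    age_seconds=$(( $2 - $1 ))
    [ "$age_seconds" -gt $(( $3 * 3600 )) ]
}

# Intended use (GNU stat; not executed here):
#   backup_is_stale "$(stat -c %Y <backup-file>)" "$(date +%s)" 24 && echo "backup is stale"
```

A stale result for a critical guest means the effective RPO is already worse than the daily target, which is worth knowing before a failure rather than during one.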

Recovery Contacts

Primary Contacts

  • Infrastructure Lead: [Contact Information]
  • Network Administrator: [Contact Information]
  • Security Team: [Contact Information]

Escalation

  • Level 1: Infrastructure team (4 hours)
  • Level 2: Management (8 hours)
  • Level 3: External support (24 hours)

Testing and Maintenance

Quarterly DR Drills

  1. Test Scenario: Simulate host failure
  2. Test Scenario: Simulate storage failure
  3. Test Scenario: Simulate network outage
  4. Document Results: Update procedures based on findings

Annual Full DR Test

  • Complete infrastructure rebuild from backups
  • Verify all services
  • Update documentation


Review Cycle: Quarterly