
R630-02 Container Startup Failures Analysis

Date: January 19, 2026
Node: r630-02 (192.168.11.12)
Status: ⚠️ CRITICAL - 33 CONTAINERS FAILED TO START


Executive Summary

A bulk container startup operation on r630-02 left 33 containers unable to start. The failures fall into three categories:

  1. Logical Volume Missing (8 containers) - The referenced storage volumes do not exist
  2. Startup Failures (24 containers) - Containers fail to start; the cause is not yet identified
  3. Lock Error (1 container) - Container is locked in "create" state

Total Impact: 33 containers unable to start, affecting multiple services.
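
The three categories map directly onto the error text that pct start prints, so captured errors can be sorted mechanically. The following is a minimal sketch; the function name and the category labels (missing-lv, locked, startup-failed, unknown) are illustrative assumptions, not Proxmox terminology.

```shell
# Map one line of `pct start` error output to a failure category.
# Labels are hypothetical names chosen for this report.
classify_failure() {
    local msg="$1"
    case "$msg" in
        *"no such logical volume"*)          echo "missing-lv" ;;
        *"CT is locked"*)                    echo "locked" ;;
        *"startup for container"*"failed"*)  echo "startup-failed" ;;
        *)                                   echo "unknown" ;;
    esac
}
```

For example, `classify_failure "CT is locked (create)"` prints `locked`. Anything that matches none of the three patterns is reported as `unknown` rather than guessed at.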


Failure Breakdown

Category 1: Missing Logical Volumes (8 containers)

Error Pattern: no such logical volume pve/vm-XXXX-disk-X

Affected Containers:

  • CT 3000: pve/vm-3000-disk-1
  • CT 3001: pve/vm-3001-disk-1
  • CT 3002: pve/vm-3002-disk-2
  • CT 3003: pve/vm-3003-disk-1
  • CT 3500: pve/vm-3500-disk-1
  • CT 3501: pve/vm-3501-disk-2
  • CT 6000: pve/vm-6000-disk-1
  • CT 6400: pve/vm-6400-disk-1

Root Cause Analysis:

  • Storage volumes were likely deleted, migrated, or never created
  • Containers may have been migrated to another node but configs not updated
  • Storage pool may have been recreated/reset, losing volume metadata
  • Container configs may reference the wrong storage pool (e.g., thin1 vs thin1-r630-02)

Diagnostic Steps:

  1. Check if volumes exist on other storage pools:

    ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
    
  2. Check container storage configuration:

    ssh root@192.168.11.12 "pct config 3000 | grep rootfs"
    
  3. Check available storage pools:

    ssh root@192.168.11.12 "pvesm status"
    

Resolution Options:

  • Option A: Recreate missing volumes if data is not critical
  • Option B: Migrate containers to existing storage pool
  • Option C: Restore volumes from backup if available
  • Option D: Update container configs to point to correct storage
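
To decide between these options it helps to know which storage ID and volume each config actually references. The sketch below splits a rootfs line of the form "rootfs: <storage>:<volume>,<options>" into its parts. The function name and the size value in the example are assumptions for illustration.

```shell
# Split a container rootfs config line into storage ID and volume name.
# Example input (size is a made-up value): "rootfs: thin1:vm-3000-disk-1,size=20G"
parse_rootfs() {
    local line="$1" spec
    spec="${line#rootfs: }"        # drop the "rootfs: " key
    spec="${spec%%,*}"             # drop ",size=..." and any other options
    echo "${spec%%:*} ${spec#*:}"  # print "<storage> <volume>"
}
```

On the node, the input could come from `pct config 3000 | grep '^rootfs:'`. The storage ID can then be compared against the names listed by `pvesm status`, and the volume name against the output of `lvs`.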

Category 2: Startup Failures (24 containers)

Error Pattern: startup for container 'XXXX' failed

Affected Containers:

  • CT 5200
  • CT 10000, 10001, 10020, 10030, 10040, 10050, 10060
  • CT 10070, 10080, 10090, 10091, 10092
  • CT 10100, 10101, 10120, 10130
  • CT 10150, 10151
  • CT 10200, 10201, 10202, 10210, 10230

Root Cause Analysis: Startup failures can have multiple causes:

  1. Missing configuration files - Container config deleted or not migrated
  2. Storage issues - Storage accessible but corrupted or misconfigured
  3. Network issues - Network configuration problems
  4. Resource constraints - Insufficient memory/CPU
  5. Container corruption - Container filesystem issues
  6. Dependencies - Missing required services or mounts

Diagnostic Steps:

  1. Check if config files exist:

    ssh root@192.168.11.12 "ls -la /etc/pve/lxc/ | grep -E '5200|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230'"
    
  2. Check detailed startup error:

    ssh root@192.168.11.12 "pct start 5200 2>&1"
    
  3. Check container status and locks:

    ssh root@192.168.11.12 "pct list | grep -E '5200|10000|10001'"
    
  4. Check system resources:

    ssh root@192.168.11.12 "free -h; df -h"
    
  5. Check container logs:

    ssh root@192.168.11.12 "journalctl -u pve-container@5200 -n 50 --no-pager"
    

Resolution Options:

  • Option A: Fix configuration issues (network, storage, etc.)
  • Option B: Recreate containers if configs are missing
  • Option C: Check and resolve resource constraints
  • Option D: Restore from backup if corruption detected
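
Whether Option B applies depends on which configs are actually missing. The grep-based check above matches substrings, so a pattern such as 10000 would also match a hypothetical CT 110000. A sketch of an exact per-VMID check follows; the function name is an assumption, and the directory is passed in so the function can be tried against a copy of the configs.

```shell
# Report whether each VMID has a config file in the given directory.
# On a Proxmox node the directory would be /etc/pve/lxc.
check_configs() {
    local dir="$1" vmid
    shift
    for vmid in "$@"; do
        if [ -f "$dir/$vmid.conf" ]; then
            echo "$vmid present"
        else
            echo "$vmid missing"
        fi
    done
}
```

Run on the node as, for example, `check_configs /etc/pve/lxc 5200 10000 10001`. Containers reported as missing are candidates for recreation; the rest need their detailed startup error inspected.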

Category 3: Lock Error (1 container)

Error Pattern: CT is locked (create)

Affected Container:

  • CT 10232

Root Cause Analysis:

  • Container is stuck in "create" state
  • Previous creation operation may have been interrupted
  • A "create" lock is still set on the container even though creation did not complete

Diagnostic Steps:

  1. Check lock status:

    ssh root@192.168.11.12 "pct list | grep 10232"
    
  2. Check for a lock entry in the container config:

    ssh root@192.168.11.12 "grep '^lock' /etc/pve/lxc/10232.conf"
    
  3. Check recent Proxmox tasks for this container:

    ssh root@192.168.11.12 "grep 10232 /var/log/pve/tasks/active"
    

Resolution Options:

  • Option A: Clear lock manually:
    ssh root@192.168.11.12 "pct unlock 10232"
    
  • Option B: Complete or cancel the creation task
  • Option C: Delete and recreate container if creation incomplete
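
Proxmox records a container lock as a lock: line in the container's config file under /etc/pve/lxc. Before clearing anything, it is worth confirming what kind of lock is set. The sketch below reads that value; the function name is an assumption, and parsing stops at the first snapshot section so that only the current config is considered.

```shell
# Print the lock value recorded in a container config file, or nothing
# if no lock is set. Reading stops at the first section header (e.g. a
# snapshot section such as "[snap1]") so only the main config is checked.
get_lock() {
    awk '/^\[/ { exit } /^lock:/ { print $2; exit }' "$1"
}
```

On the node, `get_lock /etc/pve/lxc/10232.conf` would be expected to print `create` for this container. If it does and no creation task is still running, `pct unlock 10232` clears the lock.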

Successfully Started Containers

The following containers started successfully:

  • CT 10030, 10040, 10050, 10060, 10070, 10080, 10090, 10091, 10092, 10100, 10101, 10120, 10130, 10150, 10151, 10200, 10201, 10202, 10210, 10230, 10232

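Note: Every container in this list also appears in a failure category above (20 in Category 2, plus CT 10232 in Category 3). They may have started initially and then failed, so re-verify their current state before relying on this list.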
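
The overlap between the started list and the failure lists can be checked mechanically rather than by eye. A minimal sketch, assuming both lists are supplied as space-separated VMIDs (the function name is illustrative):

```shell
# Print every VMID that appears in both space-separated lists, one per line.
in_both() {
    local a="$1" b="$2" id
    for id in $a; do
        # Pad with spaces so only whole VMIDs match, not substrings.
        case " $b " in
            *" $id "*) echo "$id" ;;
        esac
    done
}
```

Calling `in_both "$started" "$failed"` with the two lists from this report shows which containers need their current state confirmed with `pct list`.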


Immediate Actions (Priority 1)

  1. Run Diagnostic Script:

    ./scripts/diagnose-r630-02-startup-failures.sh
    

    This should gather the per-container details needed to identify the root cause of each failure.

  2. Check Storage Status:

    ssh root@192.168.11.12 "pvesm status; lvs; vgs"
    
  3. Check System Resources:

    ssh root@192.168.11.12 "free -h; df -h; uptime"
    

Short-term Actions (Priority 2)

  1. Fix Logical Volume Issues:

    • Identify where volumes should be or if they need recreation
    • Update container configs to use correct storage pools
    • Recreate volumes if data is not critical
  2. Resolve Startup Failures:

    • Check each container's detailed error message
    • Fix configuration issues
    • Recreate containers if configs are missing
  3. Clear Lock on CT 10232:

    • Clear the lock with pct unlock 10232, then retry creation or delete the container

Long-term Actions (Priority 3)

  1. Implement Monitoring:

    • Set up alerts for container startup failures
    • Monitor storage pool health
    • Track container status changes
  2. Documentation:

    • Document container dependencies
    • Create runbooks for common failure scenarios
    • Maintain container inventory with storage mappings
  3. Prevention:

    • Implement pre-startup validation
    • Add storage health checks
    • Create backup procedures for container configs
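
As one possible starting point for the config backup procedure, the sketch below archives a config directory into a timestamped tarball. The function name, archive naming scheme, and destination path are assumptions; retention and off-node copies are left out.

```shell
# Archive the contents of a config directory into a timestamped tarball
# and print the path of the archive that was written.
backup_configs() {
    local src="$1" dest="$2" archive
    archive="$dest/lxc-configs-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps paths inside the archive relative to the config directory.
    tar -czf "$archive" -C "$src" . || return 1
    echo "$archive"
}
```

On the node this could be run as `backup_configs /etc/pve/lxc /root/lxc-config-backups` before any bulk operation, where the destination directory is a hypothetical location.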

Diagnostic Commands Reference

Check Container Status

ssh root@192.168.11.12 "pct list | grep -E '3000|3001|3002|3003|3500|3501|5200|6000|6400|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230|10232'"

Check Storage Configuration

ssh root@192.168.11.12 "pvesm status"
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"

Check Container Configs

ssh root@192.168.11.12 "for vmid in 3000 3001 3002 3003 3500 3501 5200 6000 6400; do echo \"=== CT \$vmid ===\"; pct config \$vmid 2>&1 | head -5; done"

Check Detailed Errors

ssh root@192.168.11.12 "for vmid in 3000 5200 10000 10232; do echo \"=== CT \$vmid ===\"; pct start \$vmid 2>&1; echo; done"


Next Steps

  1. Run the diagnostic script to gather detailed information
  2. Review diagnostic output and categorize failures
  3. Execute fix script for automated resolution where possible
  4. Manually resolve remaining issues based on diagnostic findings
  5. Verify all containers can start successfully
  6. Document resolution steps for future reference
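
For step 5, the verification can be reduced to listing every container that is not running. The sketch below filters pct list output read from stdin; it assumes the usual column order (VMID, Status, Lock, Name) with a single header line, and the function name is illustrative.

```shell
# Read `pct list` output on stdin and print the VMID of every container
# whose status column is not "running". The header line is skipped.
not_running() {
    awk 'NR > 1 && $2 != "running" { print $1 }'
}
```

From a workstation this could be used as `ssh root@192.168.11.12 "pct list" | not_running`. Empty output would indicate that every container on the node is running.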