Sankofa/docs/status/vms/VM_CREATION_FAILURE_ANALYSIS.md

VM Creation Failure Analysis & Prevention Guide

Executive Summary

This document catalogs all working and non-working attempts at VM creation, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.

Critical Finding: The importdisk API endpoint (POST /nodes/{node}/qemu/{vmid}/importdisk) is NOT IMPLEMENTED on ml110-01 (pve-manager 9.1.1) and returns HTTP 501. Every VM creation that uses a cloud image therefore fails after the blank VM already exists, leaving an orphaned VM with a stuck lock file.


1. Root Cause Analysis

Primary Failure: importdisk API Not Implemented

Location: crossplane-provider-proxmox/pkg/proxmox/client.go:397-400

Error:

501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented

Impact:

  • VM is created successfully (blank disk)
  • Image import fails immediately
  • VM remains in locked state (lock-{vmid}.conf)
  • Controller retries indefinitely (VMID never set in status)
  • Each retry creates a NEW VM (perpetual creation loop)

Code Path:

// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
    // ... stops VM ...
    // Line 397: Attempts importdisk API call
    if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
        // Line 399: Returns error, VM already created but orphaned
        return nil, errors.Wrapf(err, "failed to import image...")
    }
}

Controller Behavior:

// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Returns error, but VM already exists in Proxmox
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop

2. Working vs Non-Working Attempts

WORKING Approaches

2.1 VM Deletion (Force Removal)

Script: scripts/force-remove-all-remaining.sh

Method:

  • Multiple unlock attempts (10x with delays)
  • Stop VM if running
  • Delete with purge=1&skiplock=1 parameters
  • Wait for task completion (up to 60 seconds)
  • Verify deletion

Success Rate: 100% (all 66 VMs eventually deleted)

Key Success Factors:

  1. Aggressive unlocking: 10 unlock attempts with 1-second delays
  2. Long wait times: 60-second timeout for delete tasks
  3. Verification: Confirms VM is actually deleted before proceeding
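The three success factors above can be sketched in Go as follows. This is a minimal illustration, not the script's actual logic: the `vmAPI` interface, `removeOpts`, `forceRemove`, and the in-memory `fakeAPI` are all hypothetical names, and the sketch assumes every unlock attempt is made regardless of earlier results, since the script's stop condition is not shown here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// vmAPI is a hypothetical stand-in for the Proxmox calls the script makes.
type vmAPI interface {
	Unlock(ctx context.Context, vmid int) error
	Stop(ctx context.Context, vmid int) error
	Delete(ctx context.Context, vmid int) error // assumed to send purge=1&skiplock=1
	Exists(ctx context.Context, vmid int) (bool, error)
}

// removeOpts carries the tunables described above (10 attempts, 1s delay, 60s timeout).
type removeOpts struct {
	UnlockAttempts int
	UnlockDelay    time.Duration
	DeleteTimeout  time.Duration
	PollInterval   time.Duration
}

var errStillPresent = errors.New("VM still present after delete timeout")

// forceRemove unlocks repeatedly, stops and deletes the VM, then polls until
// the VM is confirmed gone or the timeout elapses.
func forceRemove(ctx context.Context, api vmAPI, vmid int, o removeOpts) error {
	for i := 0; i < o.UnlockAttempts; i++ {
		_ = api.Unlock(ctx, vmid) // individual unlock failures are ignored
		time.Sleep(o.UnlockDelay)
	}
	_ = api.Stop(ctx, vmid) // best effort: the VM may already be stopped
	if err := api.Delete(ctx, vmid); err != nil {
		return fmt.Errorf("delete VM %d: %w", vmid, err)
	}
	deadline := time.Now().Add(o.DeleteTimeout)
	for {
		exists, err := api.Exists(ctx, vmid)
		if err == nil && !exists {
			return nil // verified: the VM is gone
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("VM %d: %w", vmid, errStillPresent)
		}
		time.Sleep(o.PollInterval)
	}
}

// fakeAPI simulates a VM that disappears a few polls after Delete is called.
type fakeAPI struct {
	unlockCalls    int
	deleteCalled   bool
	pollsUntilGone int
}

func (f *fakeAPI) Unlock(context.Context, int) error { f.unlockCalls++; return nil }
func (f *fakeAPI) Stop(context.Context, int) error   { return nil }
func (f *fakeAPI) Delete(context.Context, int) error { f.deleteCalled = true; return nil }
func (f *fakeAPI) Exists(context.Context, int) (bool, error) {
	if !f.deleteCalled {
		return true, nil
	}
	if f.pollsUntilGone > 0 {
		f.pollsUntilGone--
		return true, nil
	}
	return false, nil
}

func main() {
	api := &fakeAPI{pollsUntilGone: 2}
	opts := removeOpts{
		UnlockAttempts: 10,
		UnlockDelay:    time.Millisecond, // 1s in the script; shortened for the demo
		DeleteTimeout:  time.Second,      // 60s in the script
		PollInterval:   time.Millisecond,
	}
	err := forceRemove(context.Background(), api, 100, opts)
	fmt.Println("unlock calls:", api.unlockCalls, "error:", err)
}
```

Polling `Exists` instead of trusting the delete call's result is what covers the case where a task appears to fail but completes later.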

2.2 Controller Scaling

Command: kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0

Result: Immediately stops all VM creation processes

Status: Effective

NON-WORKING Approaches

2.3 importdisk API Usage

Location: crossplane-provider-proxmox/pkg/proxmox/client.go:397

Problem: API endpoint not implemented in the Proxmox version on ml110-01

Error: 501 Method not implemented

Impact: All VM creations with cloud images fail

2.4 Single Unlock Attempt

Problem: Lock files persist after a single unlock

Result: Delete operations time out with "can't lock file" errors

Solution: Multiple unlock attempts (10x) required

2.5 Short Timeouts

Problem: A 20-second timeout is insufficient for delete operations

Result: Tasks appear to fail but actually complete later

Solution: 60-second timeout with verification

2.6 No Error Recovery

Problem: Controller doesn't handle partial VM creation

Result: Orphaned VMs accumulate when importdisk fails

Impact: Status never updates, infinite retry loop


3. Codebase Inconsistencies & Repeated Failures

3.1 CRITICAL: No Error Recovery for Partial VM Creation

Location: crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145

Problem:

createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // ❌ VM already created in Proxmox, but error returned
    // ❌ No cleanup of orphaned VM
    // ❌ Status never updated (VMID stays 0)
    // ❌ Controller will retry forever, creating new VMs
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}

Fix Required:

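createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Check if VM was partially created.
    // NOTE: this only fires once CreateVM returns the partially created VM
    // together with the error; the current import-failure path returns nil.
    if createdVM != nil && createdVM.ID > 0 {
        // Attempt cleanup
        logger.Error(err, "VM creation failed, attempting cleanup", "vmID", createdVM.ID)
        cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
        if cleanupErr != nil {
            logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
        }
    } else {
        logger.Error(err, "cannot create VM")
    }
    // Don't requeue immediately - wait longer to prevent rapid retries.
    // controller-runtime ignores Result when a non-nil error is returned and
    // requeues through its own rate limiter, so the error is logged above and
    // nil is returned here to make the 5-minute delay take effect.
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}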

3.2 CRITICAL: importdisk API Not Checked Before Use

Location: crossplane-provider-proxmox/pkg/proxmox/client.go:350-400

Problem: Code assumes importdisk API exists without checking Proxmox version or API availability.

Fix Required:

// Before attempting importdisk, check if API is available.
// Run this check before the VM is created so a failure leaves nothing behind.
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
    // Report the lookup failure itself instead of an empty version string
    return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
    return nil, errors.Errorf("importdisk API not supported in Proxmox version %s; use template cloning or pre-imported images instead", pveVersion)
}

// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
3.3 CRITICAL: No Status Update on Partial Failure

Location: crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156

Problem: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.

Current Flow:

  1. VM created in Proxmox (VMID assigned)
  2. importdisk fails
  3. Error returned, status never updated
  4. vm.Status.VMID == 0 still true
  5. Controller retries, creates new VM

Fix Required: Add intermediate status updates or cleanup on failure.
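A minimal sketch of the intermediate status update, assuming the VMID is recorded as soon as Proxmox assigns it and before the import is attempted. Everything here is an in-memory simulation with hypothetical names (`vmStatus`, `fakeProxmox`, `reconcile`); a real controller would persist the status through the Kubernetes API at the marked line.

```go
package main

import (
	"errors"
	"fmt"
)

// vmStatus stands in for the resource status; VMID == 0 means "not created yet".
type vmStatus struct{ VMID int }

// fakeProxmox counts how many VMs were created and always fails the import,
// mimicking the 501 response described in this document.
type fakeProxmox struct {
	nextID  int
	created int
}

func (p *fakeProxmox) createBlankVM() int {
	p.nextID++
	p.created++
	return p.nextID
}

func (p *fakeProxmox) importDisk(vmid int) error {
	return errors.New("501 importdisk not implemented")
}

// reconcile creates the VM only when no VMID is recorded, and records the
// VMID before the step that can fail.
func reconcile(p *fakeProxmox, st *vmStatus) error {
	if st.VMID == 0 {
		st.VMID = p.createBlankVM() // a real controller would persist status here
	}
	if err := p.importDisk(st.VMID); err != nil {
		return fmt.Errorf("import image into VM %d: %w", st.VMID, err)
	}
	return nil
}

func main() {
	p := &fakeProxmox{nextID: 99}
	st := &vmStatus{}
	for i := 0; i < 3; i++ {
		_ = reconcile(p, st) // every attempt fails at the import step
	}
	fmt.Println("VMs created:", p.created, "recorded VMID:", st.VMID)
	// prints "VMs created: 1 recorded VMID: 100"
}
```

With the VMID recorded first, repeated failures retry the import against the same VM rather than creating a new one on every pass.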

3.4 Inconsistent Error Handling

Location: Multiple locations

Problem: Some errors trigger requeue, others don't. No consistent strategy for retryable vs non-retryable errors.

Examples:

  • Line 53: Credentials error → requeue after 30s
  • Line 60: Site error → requeue after 30s
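  • Line 144: VM creation error → no explicit requeue delay; the returned error is requeued by controller-runtime's rate limiter, so retries start almost immediately (a longer delay is needed)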

Fix Required: Define error categories and consistent requeue strategies.

3.5 Lock File Handling Inconsistency

Location: crossplane-provider-proxmox/pkg/proxmox/client.go:803-821 (UnlockVM)

Problem: UnlockVM function exists but is never called during VM creation failure recovery.

Fix Required: Call UnlockVM before DeleteVM in error recovery paths.


4. ml110-01 Node Status: "Unknown" in Web Portal

Investigation Results

API Status Check: Node is healthy

  • CPU: 0.027 (2.7% usage)
  • Memory: 9.2GB used / 270GB total
  • Uptime: 460,486 seconds (~5.3 days)
  • PVE Version: pve-manager/9.1.1/42db4a6cf33dac83
  • Kernel: 6.17.2-1-pve

Web Portal Issue: Likely a display/UI issue, not an actual node problem.

Possible Causes:

  1. Web UI cache issue
  2. Cluster quorum/communication issue (if in cluster)
  3. Web UI version mismatch
  4. Browser cache

Recommendation:

  • Refresh web portal
  • Check cluster status: pvecm status (if in cluster)
  • Verify node is reachable: ping ml110-01
  • Check Proxmox logs: /var/log/pveproxy/access.log

5. Recommendations to Prevent Future Failures

5.1 Immediate Fixes (Critical)

  1. Add Error Recovery for Partial VM Creation

    • Detect when VM is created but import fails
    • Clean up orphaned VMs automatically
    • Update status to prevent infinite retries
  2. Check importdisk API Availability

    • Verify Proxmox version supports importdisk
    • Provide fallback method (template cloning, pre-imported images)
    • Document supported Proxmox versions
  3. Improve Status Update Logic

    • Update status even on partial failures
    • Add conditions to track failure states
    • Prevent infinite retry loops

5.2 Short-term Improvements

  1. Add VM Cleanup on Controller Startup

    • Scan for orphaned VMs (created but no corresponding Kubernetes resource)
    • Clean up VMs with stuck locks
    • Log cleanup actions
  2. Implement Exponential Backoff

    • Current: Fixed 30s requeue
    • Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
    • Prevents rapid retry storms
  3. Add Health Checks

    • Verify Proxmox API endpoints before use
    • Check node status before VM creation
    • Validate image availability
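The backoff steps recommended above (30s, 1m, 2m, 5m, 10m) can be expressed as a small lookup keyed by the number of consecutive failures. This is a sketch under the assumption that the controller tracks a per-resource failure count; `backoffSchedule` and `requeueDelay` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule holds the delays recommended in section 5.2.
var backoffSchedule = []time.Duration{
	30 * time.Second,
	1 * time.Minute,
	2 * time.Minute,
	5 * time.Minute,
	10 * time.Minute,
}

// requeueDelay returns the delay to use after the given number of consecutive
// failures. Counts below 1 use the first step; counts beyond the schedule are
// capped at the last step.
func requeueDelay(failures int) time.Duration {
	if failures < 1 {
		return backoffSchedule[0]
	}
	if failures > len(backoffSchedule) {
		return backoffSchedule[len(backoffSchedule)-1]
	}
	return backoffSchedule[failures-1]
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5, 6} {
		fmt.Printf("failure %d -> requeue after %s\n", n, requeueDelay(n))
	}
}
```

Capping at the last step keeps a permanently failing resource at one attempt every ten minutes rather than growing the delay without bound.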

5.3 Long-term Improvements

  1. Alternative Image Import Methods

    • Use qm disk import via SSH (if available)
    • Pre-import images as templates
    • Use Proxmox templates instead of cloud images
  2. Better Observability

    • Add metrics for VM creation success/failure rates
    • Track orphaned VM counts
    • Alert on stuck VM creation loops
  3. Comprehensive Testing

    • Test with different Proxmox versions
    • Test error recovery scenarios
    • Test lock file handling
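Both the startup orphan scan (5.2) and the orphaned-VM count metric need the same primitive: the set of VMIDs present in Proxmox that no Kubernetes resource records. A minimal sketch follows, with `findOrphans` as a hypothetical helper. It assumes the Proxmox list has already been restricted to VMs this provider owns (for example by a naming convention or tag, which this document does not specify); without that filter, unrelated VMs would be reported as orphans.

```go
package main

import (
	"fmt"
	"sort"
)

// findOrphans returns, in ascending order, the VMIDs that exist in Proxmox
// but are not recorded in any Kubernetes resource status. A tracked VMID of 0
// means "never set" and is ignored.
func findOrphans(proxmoxVMIDs, trackedVMIDs []int) []int {
	tracked := make(map[int]struct{}, len(trackedVMIDs))
	for _, id := range trackedVMIDs {
		if id != 0 {
			tracked[id] = struct{}{}
		}
	}
	var orphans []int
	for _, id := range proxmoxVMIDs {
		if _, ok := tracked[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	sort.Ints(orphans)
	return orphans
}

func main() {
	inProxmox := []int{103, 100, 101}
	inKubernetes := []int{0, 101} // one resource still has VMID 0
	fmt.Println(findOrphans(inProxmox, inKubernetes)) // prints [100 103]
}
```

The length of the returned slice can feed the orphan count metric directly, and the slice itself is the work list for cleanup.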

6. Code Locations Requiring Fixes

High Priority

  1. crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145

    • Add error recovery for partial VM creation
    • Implement cleanup logic
  2. crossplane-provider-proxmox/pkg/proxmox/client.go:350-400

    • Check importdisk API availability
    • Add fallback methods
    • Improve error messages
  3. crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156

    • Add intermediate status updates
    • Prevent infinite retry loops

Medium Priority

  1. crossplane-provider-proxmox/pkg/proxmox/client.go:803-821

    • Use UnlockVM in error recovery paths
  2. Error handling throughout controller

    • Standardize requeue strategies
    • Add error categorization

7. Testing Checklist

Before deploying fixes, test:

  • VM creation with importdisk API (if supported)
  • VM creation with template cloning
  • Error recovery when importdisk fails
  • Cleanup of orphaned VMs
  • Lock file handling
  • Controller retry behavior
  • Status update on partial failures
  • Multiple concurrent VM creations
  • Node status checks
  • Proxmox version compatibility

8. Documentation Updates Needed

  1. README.md: Document supported Proxmox versions
  2. API Compatibility: List which APIs are required
  3. Troubleshooting Guide: Add section on orphaned VMs
  4. Error Recovery: Document automatic cleanup features
  5. Image Requirements: Clarify template vs cloud image usage

9. Lessons Learned

  1. Always verify API availability before using it
  2. Implement error recovery for partial resource creation
  3. Update status early to prevent infinite retry loops
  4. Test with actual infrastructure versions, not just mocks
  5. Monitor for orphaned resources and implement cleanup
  6. Use exponential backoff for retries
  7. Document failure modes and recovery procedures

10. Summary

Primary Issue: importdisk API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop

Root Causes:

  1. No API availability check
  2. No error recovery for partial creation
  3. No status update on failure
  4. No cleanup of orphaned resources

Solutions:

  1. Check API availability before use
  2. Implement error recovery and cleanup
  3. Update status even on partial failures
  4. Add health checks and monitoring

Status: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.


Last Updated: 2025-12-12

Document Version: 1.0