VM Creation Failure Analysis & Prevention Guide
Executive Summary
This document catalogs all working and non-working attempts at VM creation, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.
Critical Finding: The importdisk API endpoint (POST /nodes/{node}/qemu/{vmid}/importdisk) is NOT IMPLEMENTED in the Proxmox version running on ml110-01, causing all VM creation attempts with cloud images to fail and create orphaned VMs with stuck lock files.
1. Root Cause Analysis
Primary Failure: importdisk API Not Implemented
Location: crossplane-provider-proxmox/pkg/proxmox/client.go:397-400
Error:
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
Impact:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (lock-{vmid}.conf)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)
Code Path:
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
    // ... stops VM ...
    // Line 397: Attempts importdisk API call
    if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
        // Line 399: Returns error, VM already created but orphaned
        return nil, errors.Wrapf(err, "failed to import image...")
    }
}
Controller Behavior:
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Returns error, but VM already exists in Proxmox
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
2. Working vs Non-Working Attempts
✅ WORKING Approaches
2.1 VM Deletion (Force Removal)
Script: scripts/force-remove-all-remaining.sh
Method:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with purge=1&skiplock=1 parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion
Success Rate: 100% (all 66 VMs eventually deleted)
Key Success Factors:
- Aggressive unlocking: 10 unlock attempts with 1-second delays
- Long wait times: 60-second timeout for delete tasks
- Verification: Confirms VM is actually deleted before proceeding
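The sequence above can be sketched in Go, the language of the provider. This is a minimal sketch under stated assumptions, not the contents of scripts/force-remove-all-remaining.sh: the vmOps struct, its function fields, and forceRemove are hypothetical names introduced here, and the Delete hook is assumed to issue the delete request with purge=1 and skiplock=1.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// vmOps is a hypothetical set of hooks for the Proxmox calls the script makes.
// None of these names come from the provider's client.
type vmOps struct {
	Unlock func(vmid int) error
	Stop   func(vmid int) error
	Delete func(vmid int) error // assumed to send purge=1 and skiplock=1
	Exists func(vmid int) (bool, error)
}

// errVerifyTimeout is returned when the VM is still listed after the timeout.
var errVerifyTimeout = errors.New("VM still present after timeout")

// forceRemove follows the script's order: repeated unlocks, stop, delete,
// then poll until the VM is gone or the timeout expires.
func forceRemove(ops vmOps, vmid, unlockAttempts int, unlockDelay, pollEvery, timeout time.Duration) error {
	for i := 0; i < unlockAttempts; i++ {
		_ = ops.Unlock(vmid) // a failed unlock is tolerated; a later attempt may succeed
		time.Sleep(unlockDelay)
	}
	_ = ops.Stop(vmid) // stopping an already stopped VM is not treated as fatal
	if err := ops.Delete(vmid); err != nil {
		return fmt.Errorf("delete VM %d: %w", vmid, err)
	}
	deadline := time.Now().Add(timeout)
	for {
		exists, err := ops.Exists(vmid)
		if err == nil && !exists {
			return nil // verified: the VM is no longer listed
		}
		if time.Now().After(deadline) {
			return errVerifyTimeout
		}
		time.Sleep(pollEvery)
	}
}
```

With the values reported above, a caller would pass 10 unlock attempts, a 1-second unlock delay, and a 60-second timeout.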
2.2 Controller Scaling
Command: kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0
Result: Immediately stops all VM creation processes
Status: ✅ Effective
❌ NON-WORKING Approaches
2.3 importdisk API Usage
Location: crossplane-provider-proxmox/pkg/proxmox/client.go:397
Problem: API endpoint not implemented in Proxmox version
Error: 501 Method not implemented
Impact: All VM creations with cloud images fail
2.4 Single Unlock Attempt
Problem: Lock files persist after a single unlock
Result: Delete operations time out with "can't lock file" errors
Solution: Multiple unlock attempts (10x) required
2.5 Short Timeouts
Problem: 20-second timeout insufficient for delete operations
Result: Tasks appear to fail but actually complete later
Solution: 60-second timeout with verification
2.6 No Error Recovery
Problem: Controller doesn't handle partial VM creation
Result: Orphaned VMs accumulate when importdisk fails
Impact: Status never updates, infinite retry loop
3. Codebase Inconsistencies & Repeated Failures
3.1 CRITICAL: No Error Recovery for Partial VM Creation
Location: crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145
Problem:
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // ❌ VM already created in Proxmox, but error returned
    // ❌ No cleanup of orphaned VM
    // ❌ Status never updated (VMID stays 0)
    // ❌ Controller will retry forever, creating new VMs
    return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
Fix Required:
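createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
    // Check if VM was partially created. This requires CreateVM to return the
    // partially created VM together with the error; it currently returns nil
    // on import failure (client.go:399).
    if createdVM != nil && createdVM.ID > 0 {
        // Attempt cleanup (unlock first if the VM is locked, see 3.5)
        logger.Error(err, "VM creation failed, attempting cleanup", "vmID", createdVM.ID)
        cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
        if cleanupErr != nil {
            logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
        }
    }
    // Don't requeue immediately - wait longer to prevent rapid retries.
    // controller-runtime ignores RequeueAfter when a non-nil error is returned
    // and uses its rate limiter instead, so log the error and return nil.
    logger.Error(err, "cannot create VM, retrying in 5 minutes")
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}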
3.2 CRITICAL: importdisk API Not Checked Before Use
Location: crossplane-provider-proxmox/pkg/proxmox/client.go:350-400
Problem: Code assumes importdisk API exists without checking Proxmox version or API availability.
Fix Required:
// Before attempting importdisk, check if API is available.
// Run this check before the VM is created so that a failure leaves nothing behind.
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
    // Report the lookup failure separately; pveVersion is not usable here
    return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
    return nil, errors.Errorf("importdisk API not supported in Proxmox version %s. Use template cloning or pre-imported images instead", pveVersion)
}
// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
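A version check of this kind needs the version string parsed into comparable numbers first. The sketch below shows one way to do that for strings shaped like the one reported for ml110-01 (pve-manager/9.1.1/42db4a6cf33dac83). The names pveRelease, parsePVEVersion, and atLeast are hypothetical, and the minimum release for any given endpoint is not established in this document, so the caller supplies it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pveRelease is a hypothetical holder for the numeric part of a version
// string such as "pve-manager/9.1.1/42db4a6cf33dac83".
type pveRelease struct{ Major, Minor, Patch int }

// parsePVEVersion accepts "pve-manager/9.1.1/<hash>" or a bare "9.1.1".
func parsePVEVersion(s string) (pveRelease, error) {
	num := strings.TrimSpace(s)
	if parts := strings.Split(num, "/"); len(parts) >= 2 {
		num = parts[1] // the numeric version is the second path segment
	}
	fields := strings.Split(num, ".")
	if len(fields) < 2 || len(fields) > 3 {
		return pveRelease{}, fmt.Errorf("unrecognized version %q", s)
	}
	var n [3]int // a missing patch component stays 0
	for i, f := range fields {
		v, err := strconv.Atoi(f)
		if err != nil || v < 0 {
			return pveRelease{}, fmt.Errorf("unrecognized version %q", s)
		}
		n[i] = v
	}
	return pveRelease{n[0], n[1], n[2]}, nil
}

// atLeast reports whether v is the same as or newer than want.
func (v pveRelease) atLeast(want pveRelease) bool {
	if v.Major != want.Major {
		return v.Major > want.Major
	}
	if v.Minor != want.Minor {
		return v.Minor > want.Minor
	}
	return v.Patch >= want.Patch
}
```

Since ml110-01 reports 9.1.1 and still returns 501 for this endpoint, a version comparison alone may not be enough; treating a 501 response as a definitive "unsupported" signal is probably the more reliable test.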
3.3 CRITICAL: No Status Update on Partial Failure
Location: crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156
Problem: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.
Current Flow:
- VM created in Proxmox (VMID assigned)
- importdisk fails
- Error returned, status never updated
- vm.Status.VMID == 0 still true
- Controller retries, creates new VM
Fix Required: Add intermediate status updates or cleanup on failure.
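One way to read "intermediate status updates" is to persist the VMID as soon as Proxmox assigns it, so that a later import failure cannot trigger another creation. The sketch below illustrates that ordering with simplified stand-ins: vmStatus, reconcileCreate, the phase strings, and the callback parameters are all hypothetical and do not mirror the provider's types.

```go
package main

import (
	"errors"
	"fmt"
)

// vmStatus is a simplified, hypothetical stand-in for the resource status.
type vmStatus struct {
	VMID   int
	Phase  string
	Reason string
}

// errNotImplemented is an example error, standing in for the 501 response.
var errNotImplemented = errors.New("501 importdisk not implemented")

// reconcileCreate creates the VM only when no VMID is recorded, saves the
// VMID immediately, and records an import failure instead of losing it.
func reconcileCreate(st *vmStatus, create func() (int, error), importDisk func(vmid int) error, save func(*vmStatus) error) error {
	if st.VMID == 0 {
		id, err := create()
		if err != nil {
			return err
		}
		st.VMID, st.Phase = id, "Created"
		if err := save(st); err != nil { // persist before any step that can fail
			return err
		}
	}
	if err := importDisk(st.VMID); err != nil {
		st.Phase, st.Reason = "ImportFailed", err.Error()
		if saveErr := save(st); saveErr != nil {
			return fmt.Errorf("%w (status save also failed: %v)", err, saveErr)
		}
		return err
	}
	st.Phase, st.Reason = "Ready", ""
	return save(st)
}
```

Because the VMID is saved before the import runs, repeated reconciles retry the import against the same VM rather than allocating a new one each time.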
3.4 Inconsistent Error Handling
Location: Multiple locations
Problem: Some errors trigger requeue, others don't. No consistent strategy for retryable vs non-retryable errors.
Examples:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → no explicit requeue delay; the returned error is retried by controller-runtime's rate limiter (a longer, explicit delay is needed)
Fix Required: Define error categories and consistent requeue strategies.
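A possible shape for such categories is a single function that maps an error to a retry decision. In the sketch below, apiError, decision, and classify are hypothetical names, and the delays are illustrative apart from the 30-second value already used for credentials errors. It assumes the client surfaces the HTTP status code in a typed error, which this document does not confirm.

```go
package main

import (
	"errors"
	"time"
)

// apiError is a hypothetical error type carrying the HTTP status code.
type apiError struct {
	StatusCode int
	Msg        string
}

func (e *apiError) Error() string { return e.Msg }

// decision says whether to retry and how long to wait first.
type decision struct {
	Retry bool
	After time.Duration
}

// classify maps an error to a requeue decision.
func classify(err error) decision {
	if err == nil {
		return decision{}
	}
	var ae *apiError
	if errors.As(err, &ae) {
		switch {
		case ae.StatusCode == 501:
			// Endpoint not implemented: retrying cannot succeed.
			return decision{Retry: false}
		case ae.StatusCode == 401 || ae.StatusCode == 403:
			// Credentials problems: keep the existing 30s requeue.
			return decision{Retry: true, After: 30 * time.Second}
		case ae.StatusCode >= 500:
			// Other server-side failures: retry, but less eagerly.
			return decision{Retry: true, After: time.Minute}
		}
	}
	// Unknown errors default to a retry after 30s.
	return decision{Retry: true, After: 30 * time.Second}
}
```

Under this scheme the 501 from importdisk would be classified as non-retryable, which stops the creation loop at its source.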
3.5 Lock File Handling Inconsistency
Location: crossplane-provider-proxmox/pkg/proxmox/client.go:803-821 (UnlockVM)
Problem: UnlockVM function exists but is never called during VM creation failure recovery.
Fix Required: Call UnlockVM before DeleteVM in error recovery paths.
4. ml110-01 Node Status: "Unknown" in Web Portal
Investigation Results
API Status Check: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2GB used / 270GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: pve-manager/9.1.1/42db4a6cf33dac83
- Kernel: 6.17.2-1-pve
Web Portal Issue: Likely a display/UI issue, not an actual node problem.
Possible Causes:
- Web UI cache issue
- Cluster quorum/communication issue (if in cluster)
- Web UI version mismatch
- Browser cache
Recommendation:
- Refresh web portal
- Check cluster status: pvecm status (if in cluster)
- Verify node is reachable: ping ml110-01
- Check Proxmox logs: /var/log/pveproxy/access.log
5. Recommendations to Prevent Future Failures
5.1 Immediate Fixes (Critical)
- Add Error Recovery for Partial VM Creation
  - Detect when VM is created but import fails
  - Clean up orphaned VMs automatically
  - Update status to prevent infinite retries
- Check importdisk API Availability
  - Verify Proxmox version supports importdisk
  - Provide fallback method (template cloning, pre-imported images)
  - Document supported Proxmox versions
- Improve Status Update Logic
  - Update status even on partial failures
  - Add conditions to track failure states
  - Prevent infinite retry loops
5.2 Short-term Improvements
- Add VM Cleanup on Controller Startup
  - Scan for orphaned VMs (created but no corresponding Kubernetes resource)
  - Clean up VMs with stuck locks
  - Log cleanup actions
- Implement Exponential Backoff
  - Current: Fixed 30s requeue
  - Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
  - Prevents rapid retry storms
- Add Health Checks
  - Verify Proxmox API endpoints before use
  - Check node status before VM creation
  - Validate image availability
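The recommended backoff sequence (30s, 1m, 2m, 5m, 10m) is not a strict doubling, so a lookup table is probably the simplest way to express it. The sketch below assumes the controller tracks a per-resource count of consecutive failures; backoffSteps and requeueDelay are hypothetical names.

```go
package main

import "time"

// backoffSteps holds the delays recommended above, in order.
var backoffSteps = []time.Duration{
	30 * time.Second,
	time.Minute,
	2 * time.Minute,
	5 * time.Minute,
	10 * time.Minute,
}

// requeueDelay returns the delay for the given number of consecutive
// failures (1-based), holding at the last step once the table is exhausted.
func requeueDelay(failures int) time.Duration {
	if failures < 1 {
		failures = 1
	}
	if failures > len(backoffSteps) {
		failures = len(backoffSteps)
	}
	return backoffSteps[failures-1]
}
```

The returned value would be used as RequeueAfter, with the failure count reset after a successful reconcile.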
5.3 Long-term Improvements
- Alternative Image Import Methods
  - Use qm disk import via SSH (if available)
  - Pre-import images as templates
  - Use Proxmox templates instead of cloud images
- Better Observability
  - Add metrics for VM creation success/failure rates
  - Track orphaned VM counts
  - Alert on stuck VM creation loops
- Comprehensive Testing
  - Test with different Proxmox versions
  - Test error recovery scenarios
  - Test lock file handling
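Tracking orphaned VM counts, like the startup scan suggested in 5.2, reduces to a set difference between the VMIDs Proxmox reports and the VMIDs recorded on Kubernetes resources. The sketch below shows that comparison only; findOrphans is a hypothetical helper. It assumes the Proxmox list has already been narrowed to VMs the provider created (for example by a naming or tagging convention, which this document does not specify), since otherwise unrelated VMs would be flagged.

```go
package main

import "sort"

// findOrphans returns, in ascending order, the VMIDs present in Proxmox
// that no Kubernetes resource claims.
func findOrphans(proxmoxVMIDs, managedVMIDs []int) []int {
	managed := make(map[int]struct{}, len(managedVMIDs))
	for _, id := range managedVMIDs {
		managed[id] = struct{}{}
	}
	orphans := []int{}
	for _, id := range proxmoxVMIDs {
		if _, ok := managed[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	sort.Ints(orphans) // stable output makes logs and metrics easier to compare
	return orphans
}
```

The length of the result can feed an orphan-count metric, while the IDs themselves are the candidates for unlock and delete.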
6. Code Locations Requiring Fixes
High Priority
- crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145
  - Add error recovery for partial VM creation
  - Implement cleanup logic
- crossplane-provider-proxmox/pkg/proxmox/client.go:350-400
  - Check importdisk API availability
  - Add fallback methods
  - Improve error messages
- crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156
  - Add intermediate status updates
  - Prevent infinite retry loops
Medium Priority
- crossplane-provider-proxmox/pkg/proxmox/client.go:803-821
  - Use UnlockVM in error recovery paths
- Error handling throughout controller
  - Standardize requeue strategies
  - Add error categorization
7. Testing Checklist
Before deploying fixes, test:
- VM creation with importdisk API (if supported)
- VM creation with template cloning
- Error recovery when importdisk fails
- Cleanup of orphaned VMs
- Lock file handling
- Controller retry behavior
- Status update on partial failures
- Multiple concurrent VM creations
- Node status checks
- Proxmox version compatibility
8. Documentation Updates Needed
- README.md: Document supported Proxmox versions
- API Compatibility: List which APIs are required
- Troubleshooting Guide: Add section on orphaned VMs
- Error Recovery: Document automatic cleanup features
- Image Requirements: Clarify template vs cloud image usage
9. Lessons Learned
- Always verify API availability before using it
- Implement error recovery for partial resource creation
- Update status early to prevent infinite retry loops
- Test with actual infrastructure versions, not just mocks
- Monitor for orphaned resources and implement cleanup
- Use exponential backoff for retries
- Document failure modes and recovery procedures
10. Summary
Primary Issue: importdisk API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop
Root Causes:
- No API availability check
- No error recovery for partial creation
- No status update on failure
- No cleanup of orphaned resources
Solutions:
- Check API availability before use
- Implement error recovery and cleanup
- Update status even on partial failures
- Add health checks and monitoring
Status: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.
Last Updated: 2025-12-12
Document Version: 1.0