# VM Creation Failure Analysis & Prevention Guide
## Executive Summary
This document catalogs the approaches that worked and those that failed during the VM creation incident, identifies codebase inconsistencies that repeat previous failures, and recommends fixes to prevent future issues.
**Critical Finding**: The `importdisk` API endpoint (`POST /nodes/{node}/qemu/{vmid}/importdisk`) is **NOT IMPLEMENTED** in the Proxmox version running on ml110-01, causing all VM creation attempts with cloud images to fail and create orphaned VMs with stuck lock files.
---
## 1. Root Cause Analysis
### Primary Failure: importdisk API Not Implemented
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397-400`
**Error**:
```
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
```
**Impact**:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (`lock-{vmid}.conf`)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)
**Code Path**:
```go
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
	// ... stops VM ...
	// Line 397: Attempts importdisk API call
	if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
		// Line 399: Returns error, VM already created but orphaned
		return nil, errors.Wrapf(err, "failed to import image...")
	}
}
```
**Controller Behavior**:
```go
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Returns error, but VM already exists in Proxmox
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
```
---
## 2. Working vs Non-Working Attempts
### ✅ WORKING Approaches
#### 2.1 VM Deletion (Force Removal)
**Script**: `scripts/force-remove-all-remaining.sh`
**Method**:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with `purge=1&skiplock=1` parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion
**Success Rate**: 100% (all 66 VMs eventually deleted)
**Key Success Factors**:
1. **Aggressive unlocking**: 10 unlock attempts with 1-second delays
2. **Long wait times**: 60-second timeout for delete tasks
3. **Verification**: Confirms VM is actually deleted before proceeding
#### 2.2 Controller Scaling
**Command**: `kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0`
**Result**: Immediately stops all VM creation processes
**Status**: ✅ Effective
### ❌ NON-WORKING Approaches
#### 2.3 importdisk API Usage
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`
**Problem**: API endpoint not implemented in Proxmox version
**Error**: `501 Method not implemented`
**Impact**: All VM creations with cloud images fail
#### 2.4 Single Unlock Attempt
**Problem**: Lock files persist after a single unlock
**Result**: Delete operations time out with "can't lock file" errors
**Solution**: Multiple unlock attempts (10x) required
#### 2.5 Short Timeouts
**Problem**: A 20-second timeout is insufficient for delete operations
**Result**: Tasks appear to fail but actually complete later
**Solution**: 60-second timeout with verification
#### 2.6 No Error Recovery
**Problem**: Controller doesn't handle partial VM creation
**Result**: Orphaned VMs accumulate when importdisk fails
**Impact**: Status never updates, infinite retry loop
---
## 3. Codebase Inconsistencies & Repeated Failures
### 3.1 CRITICAL: No Error Recovery for Partial VM Creation
**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`
**Problem**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// ❌ VM already created in Proxmox, but error returned
	// ❌ No cleanup of orphaned VM
	// ❌ Status never updated (VMID stays 0)
	// ❌ Controller will retry forever, creating new VMs
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
```
**Fix Required**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	logger.Error(err, "cannot create VM")
	// Check if VM was partially created. This only works if CreateVM returns
	// the partially created VM alongside the error; today it returns nil.
	if createdVM != nil && createdVM.ID > 0 {
		// Attempt cleanup
		logger.Info("VM creation failed, attempting cleanup", "vmID", createdVM.ID)
		cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
		if cleanupErr != nil {
			logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
		}
	}
	// Don't requeue immediately - wait longer to prevent rapid retries.
	// controller-runtime ignores Result when the returned error is non-nil,
	// so the error is logged above and nil is returned to make the delay apply.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```
### 3.2 CRITICAL: importdisk API Not Checked Before Use
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`
**Problem**: Code assumes `importdisk` API exists without checking Proxmox version or API availability.
**Fix Required**:
```go
// Before attempting importdisk (and before the VM is created), check if API is available
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
	// Report the lookup failure itself; pveVersion is empty here
	return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
	return nil, errors.Errorf("importdisk API not supported in Proxmox version %s. Use template cloning or pre-imported images instead", pveVersion)
}
// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
```
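A version check needs the version string in comparable form. The sketch below parses the `pve-manager/9.1.1/42db4a6cf33dac83` format reported by ml110-01; `PVEVersion`, `ParsePVEVersion`, and `AtLeast` are hypothetical helpers, and the minimum version to compare against is deliberately left to the caller because this analysis does not establish which versions, if any, expose the endpoint.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PVEVersion is a parsed major.minor.patch triple.
type PVEVersion struct{ Major, Minor, Patch int }

// ParsePVEVersion accepts "pve-manager/9.1.1/42db4a6cf33dac83" or a bare
// "9.1.1". Other formats are rejected rather than guessed at.
func ParsePVEVersion(s string) (PVEVersion, error) {
	parts := strings.Split(strings.TrimSpace(s), "/")
	num := parts[0]
	if parts[0] == "pve-manager" && len(parts) >= 2 {
		num = parts[1]
	}
	fields := strings.Split(num, ".")
	if len(fields) != 3 {
		return PVEVersion{}, fmt.Errorf("unexpected version format %q", s)
	}
	var n [3]int
	for i, f := range fields {
		v, err := strconv.Atoi(f)
		if err != nil {
			return PVEVersion{}, fmt.Errorf("unexpected version format %q: %w", s, err)
		}
		n[i] = v
	}
	return PVEVersion{n[0], n[1], n[2]}, nil
}

// AtLeast reports whether v is greater than or equal to other.
func (v PVEVersion) AtLeast(other PVEVersion) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	if v.Minor != other.Minor {
		return v.Minor > other.Minor
	}
	return v.Patch >= other.Patch
}
```

Because the failing node already runs 9.1.1, a "newer means supported" gate would not have caught this failure on its own. Treating an HTTP 501 from the endpoint as a definitive "unsupported" signal is likely the more reliable check.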
### 3.3 CRITICAL: No Status Update on Partial Failure
**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`
**Problem**: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.
**Current Flow**:
1. VM created in Proxmox (VMID assigned)
2. importdisk fails
3. Error returned, status never updated
4. `vm.Status.VMID == 0` still true
5. Controller retries, creates new VM
**Fix Required**: Add intermediate status updates or cleanup on failure.
### 3.4 Inconsistent Error Handling
**Location**: Multiple locations
**Problem**: Some errors trigger requeue, others don't. No consistent strategy for retryable vs non-retryable errors.
**Examples**:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → no explicit requeue delay (the returned error is still retried; a longer delay is needed)
**Fix Required**: Define error categories and consistent requeue strategies.
### 3.5 Lock File Handling Inconsistency
**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821` (UnlockVM)
**Problem**: UnlockVM function exists but is never called during VM creation failure recovery.
**Fix Required**: Call UnlockVM before DeleteVM in error recovery paths.
---
## 4. ml110-01 Node Status: "Unknown" in Web Portal
### Investigation Results
**API Status Check**: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2GB used / 270GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`
**Web Portal Issue**: Likely a display/UI issue, not an actual node problem.
**Possible Causes**:
1. Web UI cache issue
2. Cluster quorum/communication issue (if in cluster)
3. Web UI version mismatch
4. Browser cache
**Recommendation**:
- Refresh web portal
- Check cluster status: `pvecm status` (if in cluster)
- Verify node is reachable: `ping ml110-01`
- Check Proxmox logs: `/var/log/pveproxy/access.log`
---
## 5. Recommendations to Prevent Future Failures
### 5.1 Immediate Fixes (Critical)
1. **Add Error Recovery for Partial VM Creation**
- Detect when VM is created but import fails
- Clean up orphaned VMs automatically
- Update status to prevent infinite retries
2. **Check importdisk API Availability**
- Verify Proxmox version supports importdisk
- Provide fallback method (template cloning, pre-imported images)
- Document supported Proxmox versions
3. **Improve Status Update Logic**
- Update status even on partial failures
- Add conditions to track failure states
- Prevent infinite retry loops
### 5.2 Short-term Improvements
1. **Add VM Cleanup on Controller Startup**
- Scan for orphaned VMs (created but no corresponding Kubernetes resource)
- Clean up VMs with stuck locks
- Log cleanup actions
2. **Implement Exponential Backoff**
- Current: Fixed 30s requeue
- Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
- Prevents rapid retry storms
3. **Add Health Checks**
- Verify Proxmox API endpoints before use
- Check node status before VM creation
- Validate image availability
### 5.3 Long-term Improvements
1. **Alternative Image Import Methods**
- Use `qm disk import` via SSH (if available)
- Pre-import images as templates
- Use Proxmox templates instead of cloud images
2. **Better Observability**
- Add metrics for VM creation success/failure rates
- Track orphaned VM counts
- Alert on stuck VM creation loops
3. **Comprehensive Testing**
- Test with different Proxmox versions
- Test error recovery scenarios
- Test lock file handling
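Tracking orphaned VM counts, like the startup cleanup scan, comes down to a set difference between the VMIDs Proxmox reports and the VMIDs recorded on Kubernetes resources. The sketch below shows only that comparison; `FindOrphans` is a hypothetical helper, and it assumes the Proxmox list has already been restricted to VMs created by the provider so that unrelated VMs are not flagged.

```go
package main

import "sort"

// FindOrphans returns the VMIDs that exist in Proxmox but are not referenced
// by any managed resource, sorted ascending.
func FindOrphans(proxmoxVMIDs, managedVMIDs []int) []int {
	managed := make(map[int]struct{}, len(managedVMIDs))
	for _, id := range managedVMIDs {
		managed[id] = struct{}{}
	}
	var orphans []int
	for _, id := range proxmoxVMIDs {
		if _, ok := managed[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	sort.Ints(orphans)
	return orphans
}
```

The length of the returned slice can feed an orphan-count metric, and the IDs themselves can be logged or handed to a cleanup routine.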
---
## 6. Code Locations Requiring Fixes
### High Priority
1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`**
- Add error recovery for partial VM creation
- Implement cleanup logic
2. **`crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`**
- Check importdisk API availability
- Add fallback methods
- Improve error messages
3. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`**
- Add intermediate status updates
- Prevent infinite retry loops
### Medium Priority
4. **`crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`**
- Use UnlockVM in error recovery paths
5. **Error handling throughout controller**
- Standardize requeue strategies
- Add error categorization
---
## 7. Testing Checklist
Before deploying fixes, test:
- [ ] VM creation with importdisk API (if supported)
- [ ] VM creation with template cloning
- [ ] Error recovery when importdisk fails
- [ ] Cleanup of orphaned VMs
- [ ] Lock file handling
- [ ] Controller retry behavior
- [ ] Status update on partial failures
- [ ] Multiple concurrent VM creations
- [ ] Node status checks
- [ ] Proxmox version compatibility
---
## 8. Documentation Updates Needed
1. **README.md**: Document supported Proxmox versions
2. **API Compatibility**: List which APIs are required
3. **Troubleshooting Guide**: Add section on orphaned VMs
4. **Error Recovery**: Document automatic cleanup features
5. **Image Requirements**: Clarify template vs cloud image usage
---
## 9. Lessons Learned
1. **Always verify API availability** before using it
2. **Implement error recovery** for partial resource creation
3. **Update status early** to prevent infinite retry loops
4. **Test with actual infrastructure** versions, not just mocks
5. **Monitor for orphaned resources** and implement cleanup
6. **Use exponential backoff** for retries
7. **Document failure modes** and recovery procedures
---
## 10. Summary
**Primary Issue**: `importdisk` API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop
**Root Causes**:
1. No API availability check
2. No error recovery for partial creation
3. No status update on failure
4. No cleanup of orphaned resources
**Solutions**:
1. Check API availability before use
2. Implement error recovery and cleanup
3. Update status even on partial failures
4. Add health checks and monitoring
**Status**: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.
---
*Last Updated: 2025-12-12*
*Document Version: 1.0*