# VM Creation Failure Analysis & Prevention Guide

## Executive Summary

This document catalogs all working and non-working attempts at VM creation, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.

**Critical Finding**: The `importdisk` API endpoint (`POST /nodes/{node}/qemu/{vmid}/importdisk`) is **NOT IMPLEMENTED** in the Proxmox version running on ml110-01, causing all VM creation attempts with cloud images to fail and create orphaned VMs with stuck lock files.

---

## 1. Root Cause Analysis

### Primary Failure: importdisk API Not Implemented

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397-400`

**Error**:
```
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
```

**Impact**:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (`lock-{vmid}.conf`)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)

**Code Path**:
```go
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
	// ... stops VM ...
	// Line 397: Attempts importdisk API call
	if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
		// Line 399: Returns error, VM already created but orphaned
		return nil, errors.Wrapf(err, "failed to import image...")
	}
}
```

**Controller Behavior**:
```go
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Returns error, but VM already exists in Proxmox
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
```

---

## 2. Working vs Non-Working Attempts

### ✅ WORKING Approaches

#### 2.1 VM Deletion (Force Removal)

**Script**: `scripts/force-remove-all-remaining.sh`

**Method**:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with `purge=1&skiplock=1` parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion

**Success Rate**: 100% (all 66 VMs eventually deleted)

**Key Success Factors**:
1. **Aggressive unlocking**: 10 unlock attempts with 1-second delays
2. **Long wait times**: 60-second timeout for delete tasks
3. **Verification**: Confirms VM is actually deleted before proceeding

#### 2.2 Controller Scaling

- **Command**: `kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0`
- **Result**: Immediately stops all VM creation processes
- **Status**: ✅ Effective

### ❌ NON-WORKING Approaches

#### 2.3 importdisk API Usage

- **Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`
- **Problem**: API endpoint not implemented in the Proxmox version in use
- **Error**: `501 Method not implemented`
- **Impact**: All VM creations with cloud images fail

#### 2.4 Single Unlock Attempt

- **Problem**: Lock files persist after a single unlock
- **Result**: Delete operations time out with "can't lock file" errors
- **Solution**: Multiple unlock attempts (10x) required

#### 2.5 Short Timeouts

- **Problem**: A 20-second timeout is insufficient for delete operations
- **Result**: Tasks appear to fail but actually complete later
- **Solution**: 60-second timeout with verification

#### 2.6 No Error Recovery

- **Problem**: Controller doesn't handle partial VM creation
- **Result**: Orphaned VMs accumulate when importdisk fails
- **Impact**: Status never updates, infinite retry loop

---

## 3. Codebase Inconsistencies & Repeated Failures

### 3.1 CRITICAL: No Error Recovery for Partial VM Creation

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`

**Problem**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// ❌ VM already created in Proxmox, but error returned
	// ❌ No cleanup of orphaned VM
	// ❌ Status never updated (VMID stays 0)
	// ❌ Controller will retry forever, creating new VMs
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
```

**Fix Required**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Check if VM was partially created.
	// NOTE: this only works if CreateVM returns the partially created VM
	// together with the error; the current code path returns nil on import failure.
	if createdVM != nil && createdVM.ID > 0 {
		// Attempt cleanup
		logger.Error(err, "VM creation failed, attempting cleanup", "vmID", createdVM.ID)
		if cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID); cleanupErr != nil {
			logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
		}
	} else {
		logger.Error(err, "cannot create VM")
	}
	// Don't requeue immediately - wait longer to prevent rapid retries.
	// controller-runtime ignores Result when a non-nil error is returned and
	// applies its own rate-limited backoff, so return nil to honor RequeueAfter.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```

### 3.2 CRITICAL: importdisk API Not Checked Before Use

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`

**Problem**: Code assumes the `importdisk` API exists without checking the Proxmox version or API availability.

**Fix Required**:
```go
// Before attempting importdisk, check if API is available
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
	// Report the lookup failure separately; pveVersion is not usable here.
	return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
	return nil, errors.Errorf("importdisk API not supported in Proxmox version %s. Use template cloning or pre-imported images instead", pveVersion)
}

// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
```
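
As an illustration of what a version check might involve, the sketch below parses a version string of the form reported by ml110-01 (`pve-manager/9.1.1/42db4a6cf33dac83`) and compares it with a caller-supplied minimum. `pveVersion`, `parsePVEVersion`, and `atLeast` are hypothetical names, and the minimum used in `main` is a placeholder: this document does not establish which Proxmox versions, if any, expose the endpoint.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pveVersion is a hypothetical parsed form of a Proxmox version.
type pveVersion struct{ major, minor, patch int }

// parsePVEVersion accepts "pve-manager/9.1.1/42db4a6cf33dac83" or plain "9.1.1".
// Anything that is not exactly three dot-separated integers is rejected.
func parsePVEVersion(s string) (pveVersion, error) {
	parts := strings.Split(s, "/")
	num := parts[0]
	if len(parts) >= 2 {
		num = parts[1] // the numeric part follows the "pve-manager" prefix
	}
	fields := strings.Split(num, ".")
	if len(fields) != 3 {
		return pveVersion{}, fmt.Errorf("unexpected version %q", s)
	}
	var v [3]int
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return pveVersion{}, fmt.Errorf("unexpected version %q: %w", s, err)
		}
		v[i] = n
	}
	return pveVersion{v[0], v[1], v[2]}, nil
}

// atLeast reports whether v is greater than or equal to want.
func (v pveVersion) atLeast(want pveVersion) bool {
	if v.major != want.major {
		return v.major > want.major
	}
	if v.minor != want.minor {
		return v.minor > want.minor
	}
	return v.patch >= want.patch
}

func main() {
	v, err := parsePVEVersion("pve-manager/9.1.1/42db4a6cf33dac83")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	placeholderMinimum := pveVersion{9, 0, 0} // assumption for the demo only
	fmt.Println(v, v.atLeast(placeholderMinimum))
}
```

Because the 501 was observed on a recent release, a version comparison alone may not be a sufficient guard; treating a 501 response as a non-retryable error is a complementary check.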

### 3.3 CRITICAL: No Status Update on Partial Failure

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`

**Problem**: If VM creation fails after VM is created but before status update, the VMID remains 0, causing infinite retries.

**Current Flow**:
1. VM created in Proxmox (VMID assigned)
2. importdisk fails
3. Error returned, status never updated
4. `vm.Status.VMID == 0` still true
5. Controller retries, creates new VM

**Fix Required**: Add intermediate status updates or cleanup on failure.
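
The intermediate status update can be illustrated with a small in-memory model: the VMID is recorded as soon as the VM exists, so a failed import leads to a retry of the import rather than another create. This is a sketch under simplifying assumptions. `status`, `fakeProxmox`, and `reconcile` are hypothetical stand-ins, and a real controller would persist the status through the Kubernetes API instead of mutating a struct.

```go
package main

import (
	"errors"
	"fmt"
)

// status mimics the VMID field the controller checks (hypothetical shape).
type status struct {
	VMID     int
	Imported bool
}

// fakeProxmox counts created VMs; its import always fails, as on ml110-01.
type fakeProxmox struct{ nextID, created int }

func (p *fakeProxmox) create() int {
	p.created++
	p.nextID++
	return p.nextID
}

func (p *fakeProxmox) importDisk(int) error { return errors.New("501 not implemented") }

// reconcile records the VMID before attempting the import, so a failed import
// does not trigger another create on the next attempt.
func reconcile(p *fakeProxmox, st *status) error {
	if st.VMID == 0 {
		st.VMID = p.create() // a real controller would update the resource status here
	}
	if !st.Imported {
		if err := p.importDisk(st.VMID); err != nil {
			return fmt.Errorf("import into VM %d: %w", st.VMID, err)
		}
		st.Imported = true
	}
	return nil
}

func main() {
	p := &fakeProxmox{nextID: 99}
	st := &status{}
	for i := 0; i < 3; i++ {
		_ = reconcile(p, st) // every attempt fails at the import step
	}
	fmt.Println(p.created, st.VMID) // prints: 1 100
}
```

Three failed reconciles leave exactly one VM behind, in contrast to the current flow where each retry creates a new one.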

### 3.4 Inconsistent Error Handling

**Location**: Multiple locations

**Problem**: Some errors trigger a requeue, others don't. There is no consistent strategy for retryable vs non-retryable errors.

**Examples**:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → no explicit requeue delay (the error is returned as-is; a longer delay is needed)

**Fix Required**: Define error categories and consistent requeue strategies.

### 3.5 Lock File Handling Inconsistency

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821` (UnlockVM)

**Problem**: UnlockVM function exists but is never called during VM creation failure recovery.

**Fix Required**: Call UnlockVM before DeleteVM in error recovery paths.

---

## 4. ml110-01 Node Status: "Unknown" in Web Portal

### Investigation Results

**API Status Check**: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2 GB used / 270 GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`

**Web Portal Issue**: Likely a display/UI issue, not an actual node problem.

**Possible Causes**:
1. Web UI cache issue
2. Cluster quorum/communication issue (if in cluster)
3. Web UI version mismatch
4. Browser cache

**Recommendation**:
- Refresh web portal
- Check cluster status: `pvecm status` (if in cluster)
- Verify node is reachable: `ping ml110-01`
- Check Proxmox logs: `/var/log/pveproxy/access.log`

---

## 5. Recommendations to Prevent Future Failures

### 5.1 Immediate Fixes (Critical)

1. **Add Error Recovery for Partial VM Creation**
   - Detect when VM is created but import fails
   - Clean up orphaned VMs automatically
   - Update status to prevent infinite retries

2. **Check importdisk API Availability**
   - Verify Proxmox version supports importdisk
   - Provide fallback method (template cloning, pre-imported images)
   - Document supported Proxmox versions

3. **Improve Status Update Logic**
   - Update status even on partial failures
   - Add conditions to track failure states
   - Prevent infinite retry loops

### 5.2 Short-term Improvements

1. **Add VM Cleanup on Controller Startup**
   - Scan for orphaned VMs (created but no corresponding Kubernetes resource)
   - Clean up VMs with stuck locks
   - Log cleanup actions

2. **Implement Exponential Backoff**
   - Current: Fixed 30s requeue
   - Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
   - Prevents rapid retry storms

3. **Add Health Checks**
   - Verify Proxmox API endpoints before use
   - Check node status before VM creation
   - Validate image availability

### 5.3 Long-term Improvements

1. **Alternative Image Import Methods**
   - Use `qm disk import` via SSH (if available)
   - Pre-import images as templates
   - Use Proxmox templates instead of cloud images

2. **Better Observability**
   - Add metrics for VM creation success/failure rates
   - Track orphaned VM counts
   - Alert on stuck VM creation loops

3. **Comprehensive Testing**
   - Test with different Proxmox versions
   - Test error recovery scenarios
   - Test lock file handling

---

## 6. Code Locations Requiring Fixes

### High Priority

1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`**
   - Add error recovery for partial VM creation
   - Implement cleanup logic

2. **`crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`**
   - Check importdisk API availability
   - Add fallback methods
   - Improve error messages

3. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`**
   - Add intermediate status updates
   - Prevent infinite retry loops

### Medium Priority

4. **`crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`**
   - Use UnlockVM in error recovery paths

5. **Error handling throughout controller**
   - Standardize requeue strategies
   - Add error categorization

---

## 7. Testing Checklist

Before deploying fixes, test:

- [ ] VM creation with importdisk API (if supported)
- [ ] VM creation with template cloning
- [ ] Error recovery when importdisk fails
- [ ] Cleanup of orphaned VMs
- [ ] Lock file handling
- [ ] Controller retry behavior
- [ ] Status update on partial failures
- [ ] Multiple concurrent VM creations
- [ ] Node status checks
- [ ] Proxmox version compatibility

---

## 8. Documentation Updates Needed

1. **README.md**: Document supported Proxmox versions
2. **API Compatibility**: List which APIs are required
3. **Troubleshooting Guide**: Add section on orphaned VMs
4. **Error Recovery**: Document automatic cleanup features
5. **Image Requirements**: Clarify template vs cloud image usage

---

## 9. Lessons Learned

1. **Always verify API availability** before using it
2. **Implement error recovery** for partial resource creation
3. **Update status early** to prevent infinite retry loops
4. **Test with actual infrastructure** versions, not just mocks
5. **Monitor for orphaned resources** and implement cleanup
6. **Use exponential backoff** for retries
7. **Document failure modes** and recovery procedures

---

## 10. Summary

**Primary Issue**: `importdisk` API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop

**Root Causes**:
1. No API availability check
2. No error recovery for partial creation
3. No status update on failure
4. No cleanup of orphaned resources

**Solutions**:
1. Check API availability before use
2. Implement error recovery and cleanup
3. Update status even on partial failures
4. Add health checks and monitoring

**Status**: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.

---

*Last Updated: 2025-12-12*
*Document Version: 1.0*