# Code Review Summary: VM Creation Failures & Inconsistencies

**Date**: 2025-12-12
**Status**: Complete Analysis

---

## Executive Summary

This review covers the VM creation failures, the codebase inconsistencies behind them, and recommendations to prevent the failure cycle from recurring.

**Key Findings**:
1. ✅ **All orphaned VMs cleaned up** (66 VMs removed)
2. ✅ **Controller stopped** (no active VM creation processes)
3. ❌ **Critical bug identified**: the importdisk API is not implemented, causing all cloud image VM creations to fail
4. ⚠️ **ml110-01 node status**: the API reports the node as healthy; the "unknown" status in the web portal is likely a UI issue

---

## 1. Working vs Non-Working Attempts

### ✅ WORKING Methods

| Method | Location | Success Rate | Notes |
|--------|----------|--------------|-------|
| **Force VM Deletion** | `scripts/force-remove-all-remaining.sh` | 100% | 10 unlock attempts, 60s timeout, verification |
| **Controller Scaling** | `kubectl scale deployment` | 100% | Immediately stops all processes |
| **Aggressive Unlocking** | Multiple unlock attempts with delays | 100% | Required for stuck lock files |

### ❌ NON-WORKING Methods

| Method | Location | Failure Reason | Impact |
|--------|----------|----------------|--------|
| **importdisk API** | `pkg/proxmox/client.go:397` | API not implemented (501 error) | All cloud image VMs fail |
| **Single Unlock** | Initial attempts | Insufficient for stuck locks | Delete operations time out |
| **Short Timeouts** | 20-second waits | Tasks complete after the timeout expires | False failure reports |
| **No Error Recovery** | `pkg/controller/.../controller.go:142` | No cleanup on partial creation | Orphaned VMs accumulate |

---

## 2. Critical Code Inconsistencies

### 2.1 No Error Recovery for Partial VM Creation

**File**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`

**Problem**: When `CreateVM()` fails after the VM is created but before the status is updated:
- The VM exists in Proxmox (orphaned)
- The status is never updated (VMID stays 0)
- The controller retries forever
- Each retry creates a NEW VM

**Fix Required**: Add cleanup logic in the error path.

### 2.2 importdisk API Used Without Availability Check

**File**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`

**Problem**: Code assumes the `importdisk` API exists without checking the Proxmox version.

**Error**: `501 Method 'POST /nodes/{node}/qemu/{vmid}/importdisk' not implemented`

**Fix Required**:
- Check Proxmox version before use
- Provide fallback methods (template cloning, pre-imported images)
- Document supported versions

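
One way to keep the 501 from triggering endless retries is to classify it as a permanent error at the client boundary. The sketch below is a minimal illustration, not the provider's actual code: `APIError`, `ErrImportDiskUnsupported`, `classifyImportDiskError`, and `isPermanent` are hypothetical names, and it assumes the client can surface the HTTP status code of a failed request.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// APIError is a hypothetical error type carrying the HTTP status of a failed
// Proxmox API call. The real client may expose this differently.
type APIError struct {
	StatusCode int
	Message    string
}

func (e *APIError) Error() string {
	return fmt.Sprintf("proxmox API error %d: %s", e.StatusCode, e.Message)
}

// ErrImportDiskUnsupported marks a failure that retrying cannot fix.
var ErrImportDiskUnsupported = errors.New("importdisk API not supported by this Proxmox node")

// classifyImportDiskError wraps a 501 response in the permanent sentinel error
// and passes every other error (including nil) through unchanged.
func classifyImportDiskError(err error) error {
	var apiErr *APIError
	if errors.As(err, &apiErr) && apiErr.StatusCode == http.StatusNotImplemented {
		return fmt.Errorf("%w: %s", ErrImportDiskUnsupported, apiErr.Message)
	}
	return err
}

// isPermanent reports whether the controller should stop retrying.
func isPermanent(err error) bool {
	return errors.Is(err, ErrImportDiskUnsupported)
}

func main() {
	err := classifyImportDiskError(&APIError{
		StatusCode: 501,
		Message:    "Method 'POST /nodes/{node}/qemu/{vmid}/importdisk' not implemented",
	})
	fmt.Println(isPermanent(err)) // permanent: fall back to another import method

	err = classifyImportDiskError(&APIError{StatusCode: 503, Message: "service unavailable"})
	fmt.Println(isPermanent(err)) // transient: retry with backoff
}
```

A caller could use `isPermanent` to choose between switching to a fallback (template cloning, pre-imported images) and requeueing.
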
### 2.3 Inconsistent Client Creation

**File**: `crossplane-provider-proxmox/pkg/controller/vmscaleset/controller.go:47`

**Problem**: Creates the client with empty parameters:
```go
proxmoxClient := proxmox.NewClient("", "", "")
```

**Fix Required**: Use proper credentials from ProviderConfig.

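
A guard like the following could catch empty parameters before a client is ever built. It is only a sketch: the `Credentials` type and its field names are hypothetical, since this review does not show what the three `NewClient` arguments represent or how ProviderConfig is resolved.

```go
package main

import (
	"fmt"
	"strings"
)

// Credentials is a hypothetical stand-in for the values resolved from a
// ProviderConfig; the real field names and types may differ.
type Credentials struct {
	Endpoint string
	Username string
	Token    string
}

// Validate returns an error listing every empty field, or nil when all
// fields are set.
func (c Credentials) Validate() error {
	var missing []string
	if c.Endpoint == "" {
		missing = append(missing, "endpoint")
	}
	if c.Username == "" {
		missing = append(missing, "username")
	}
	if c.Token == "" {
		missing = append(missing, "token")
	}
	if len(missing) > 0 {
		return fmt.Errorf("incomplete Proxmox credentials, missing: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Equivalent to the NewClient("", "", "") call: rejected up front.
	fmt.Println(Credentials{}.Validate())

	ok := Credentials{Endpoint: "https://pve.example:8006", Username: "user@pve", Token: "secret"}
	fmt.Println(ok.Validate())
}
```

Failing fast here turns a silent misconfiguration into an explicit reconcile error.
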
### 2.4 Lock File Handling Not Used

**File**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`

**Problem**: The `UnlockVM()` function exists but is never called during error recovery.

**Fix Required**: Call `UnlockVM()` before `DeleteVM()` in cleanup operations.

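
The unlock-then-delete order, combined with the repeated unlock attempts that worked in Section 1, might look like the sketch below. The `vmAPI` interface is an assumption: the real `UnlockVM` and `DeleteVM` signatures are not shown in this review, and `fakeAPI` and `simulateCleanup` exist only to make the example runnable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// vmAPI is a hypothetical subset of the Proxmox client; the real method
// signatures may differ.
type vmAPI interface {
	UnlockVM(ctx context.Context, vmID int) error
	DeleteVM(ctx context.Context, vmID int) error
}

// cleanupVM tries to unlock the VM up to attempts times, waiting delay between
// tries, and deletes it as soon as an unlock succeeds.
func cleanupVM(ctx context.Context, api vmAPI, vmID, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		if lastErr = api.UnlockVM(ctx, vmID); lastErr == nil {
			return api.DeleteVM(ctx, vmID)
		}
		if i == attempts {
			break // no point waiting after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
	return fmt.Errorf("unlock VM %d failed after %d attempts: %w", vmID, attempts, lastErr)
}

// fakeAPI fails the first failUnlocks unlock calls, mimicking a stuck lock.
type fakeAPI struct {
	failUnlocks int
	unlockCalls int
	deleted     []int
}

func (f *fakeAPI) UnlockVM(_ context.Context, _ int) error {
	f.unlockCalls++
	if f.unlockCalls <= f.failUnlocks {
		return errors.New("VM is locked")
	}
	return nil
}

func (f *fakeAPI) DeleteVM(_ context.Context, vmID int) error {
	f.deleted = append(f.deleted, vmID)
	return nil
}

// simulateCleanup runs cleanupVM for VM 234 against a fake client.
func simulateCleanup(failUnlocks, attempts int) (*fakeAPI, error) {
	api := &fakeAPI{failUnlocks: failUnlocks}
	err := cleanupVM(context.Background(), api, 234, attempts, time.Millisecond)
	return api, err
}

func main() {
	api, err := simulateCleanup(2, 10)
	fmt.Println(err, api.unlockCalls, api.deleted)
}
```

The delete is skipped entirely when every unlock attempt fails, so the caller sees one clear error instead of a delete that times out on a locked VM.
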
---

## 3. ml110-01 Node Status Investigation

### API Status Check Results

**Command**: `curl -k -b "PVEAuthCookie=..." "https://192.168.11.10:8006/api2/json/nodes/ml110-01/status"`

**Results**:
- ✅ **Node is healthy** (API confirms)
- CPU: 2.7% usage
- Memory: 9.2GB / 270GB used
- Uptime: 5.3 days
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`

### Web Portal "Unknown" Status

**Likely Causes**:
1. Web UI cache issue
2. Cluster quorum or communication problem (if the node is in a cluster)
3. Browser cache
4. Web UI version mismatch

**Recommendations**:
1. Refresh the web portal (hard refresh: Ctrl+F5)
2. Check cluster status: `pvecm status` (if in a cluster)
3. Verify node reachability: `ping ml110-01`
4. Check Proxmox logs: `/var/log/pveproxy/access.log`
5. Restart the web UI: `systemctl restart pveproxy`

**Conclusion**: The node is healthy per the API. The web portal issue is likely cosmetic/UI-related, not a functional problem.

---

## 4. Failure Cycle Analysis

### The Perpetual VM Creation Loop

**Sequence of Events**:

1. **User creates ProxmoxVM resource** with cloud image (`local:iso/ubuntu-22.04-cloud.img`)
2. **Controller reconciles** → `vm.Status.VMID == 0` → triggers creation
3. **VM created in Proxmox** → VMID assigned (e.g., 234)
4. **importdisk API called** → **FAILS** (501 not implemented)
5. **Error returned** → Status never updated (VMID still 0)
6. **Controller retries** → `vm.Status.VMID == 0` still true
7. **New VM created** → VMID 235
8. **Loop repeats** → VMs 236, 237, 238... created indefinitely

### Why It Happened

1. **No API availability check** before using importdisk
2. **No error recovery** for partial VM creation
3. **No status update** on failure (VMID stays 0)
4. **No cleanup** of orphaned VMs
5. **Retries without backoff** → rapid VM creation

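
The loop can be reproduced with a small simulation. This is an illustrative model rather than the controller's code: `fakeProxmox`, `vmStatus`, and both reconcile functions are hypothetical. The second variant shows one possible way to break the cycle, recording the VMID as soon as the VM exists so that a retry reuses it.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeProxmox hands out sequential VMIDs and always fails the disk import,
// mirroring the 501 response in the sequence above.
type fakeProxmox struct {
	nextID  int
	created []int
}

func (p *fakeProxmox) createVM() int {
	id := p.nextID
	p.nextID++
	p.created = append(p.created, id)
	return id
}

func (p *fakeProxmox) importDisk(vmID int) error {
	return errors.New("501 importdisk not implemented")
}

// vmStatus is a hypothetical stand-in for the resource status.
type vmStatus struct{ VMID int }

// reconcileBuggy records the VMID only after every step succeeds, so a failed
// import leaves VMID at 0 and the next pass creates another VM.
func reconcileBuggy(p *fakeProxmox, s *vmStatus) error {
	if s.VMID != 0 {
		return nil
	}
	id := p.createVM()
	if err := p.importDisk(id); err != nil {
		return err
	}
	s.VMID = id
	return nil
}

// reconcileFixed records the VMID immediately after creation, so later passes
// retry the import against the same VM instead of creating a new one.
func reconcileFixed(p *fakeProxmox, s *vmStatus) error {
	if s.VMID == 0 {
		s.VMID = p.createVM()
	}
	return p.importDisk(s.VMID)
}

// countCreated runs the given reconcile function for a number of passes and
// returns how many VMs were created in the fake Proxmox.
func countCreated(reconcile func(*fakeProxmox, *vmStatus) error, passes int) int {
	p := &fakeProxmox{nextID: 234}
	s := &vmStatus{}
	for i := 0; i < passes; i++ {
		_ = reconcile(p, s) // errors are expected; the controller would requeue
	}
	return len(p.created)
}

func main() {
	fmt.Println("buggy:", countCreated(reconcileBuggy, 5)) // one VM per pass
	fmt.Println("fixed:", countCreated(reconcileFixed, 5)) // a single VM
}
```

Persisting the VMID early does not make the import succeed; it only stops the retries from multiplying VMs while the underlying error is handled.
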
---

## 5. Recommendations to Prevent Repeating Failures

### Immediate (Critical)

1. **Add Error Recovery**
   ```go
   // logger is assumed to be the reconciler's logr.Logger.
   createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
   if err != nil {
       // Check if VM was partially created
       if createdVM != nil && createdVM.ID > 0 {
           // Clean up the orphaned VM; log a failed delete so it is not lost silently
           if delErr := proxmoxClient.DeleteVM(ctx, createdVM.ID); delErr != nil {
               logger.Error(delErr, "failed to delete orphaned VM", "vmid", createdVM.ID)
           }
       }
       // Longer requeue to prevent rapid retries. controller-runtime ignores
       // RequeueAfter when a non-nil error is returned, so log the error and
       // return nil alongside the requeue delay.
       logger.Error(err, "VM creation failed")
       return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
   }
   ```

2. **Check API Availability**
   ```go
   // Before using importdisk
   if !c.supportsImportDisk() {
       return errors.New("importdisk API not supported; use template cloning instead")
   }
   ```

3. **Update Status on Partial Failure**
   ```go
   // Even if creation fails, update status to prevent infinite retries.
   // meta.SetStatusCondition (k8s.io/apimachinery/pkg/api/meta) replaces an
   // existing condition of the same type instead of appending a duplicate
   // on every retry.
   meta.SetStatusCondition(&vm.Status.Conditions, metav1.Condition{
       Type:    "Failed",
       Status:  metav1.ConditionTrue,
       Reason:  "ImportDiskNotSupported",
       Message: err.Error(),
   })
   if updateErr := r.Status().Update(ctx, &vm); updateErr != nil {
       return ctrl.Result{}, updateErr
   }
   ```

### Short-term

4. **Implement Exponential Backoff**
   - Current: Fixed 30s requeue
   - Recommended: 30s → 1m → 2m → 5m → 10m

5. **Add Health Checks**
   - Verify Proxmox API endpoints before use
   - Check node status before VM creation
   - Validate image availability

6. **Cleanup on Startup**
   - Scan for orphaned VMs on controller startup
   - Clean up VMs with stuck locks
   - Log cleanup actions

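
The recommended backoff progression can be expressed as a capped lookup table. In this sketch `backoffSchedule` and `requeueDelay` are hypothetical names, and it assumes the controller tracks a count of consecutive failures somewhere (for example in the resource status). Because the listed steps are not strict doublings, a table follows them more faithfully than a multiply-by-two formula.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule mirrors the recommended progression: 30s → 1m → 2m → 5m → 10m.
var backoffSchedule = []time.Duration{
	30 * time.Second,
	1 * time.Minute,
	2 * time.Minute,
	5 * time.Minute,
	10 * time.Minute,
}

// requeueDelay returns the delay to use after the given number of consecutive
// failures (0 for the first failure), capped at the last schedule entry.
func requeueDelay(failures int) time.Duration {
	if failures < 0 {
		failures = 0
	}
	if failures >= len(backoffSchedule) {
		failures = len(backoffSchedule) - 1
	}
	return backoffSchedule[failures]
}

func main() {
	for n := 0; n <= 6; n++ {
		fmt.Printf("failure %d -> requeue after %s\n", n, requeueDelay(n))
	}
}
```

The result would be returned as `ctrl.Result{RequeueAfter: requeueDelay(n)}` with a nil error, and the counter reset to zero after a successful reconcile.
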
### Long-term

7. **Alternative Image Import**
   - Use `qm disk import` via SSH (if available)
   - Pre-import images as templates
   - Use Proxmox templates instead of cloud images

8. **Better Observability**
   - Metrics for VM creation success/failure
   - Track orphaned VM counts
   - Alert on stuck creation loops

9. **Comprehensive Testing**
   - Test with different Proxmox versions
   - Test error recovery scenarios
   - Test lock file handling

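
Both the startup scan and the orphaned-VM metric reduce to the same comparison: VMs that exist in Proxmox but are not referenced by any managed resource. The sketch below shows only that comparison; `findOrphans` is a hypothetical helper, and a real scan would first have to restrict `existing` to VMs the provider itself created (for example by tag or name prefix) so that unrelated VMs are never flagged.

```go
package main

import (
	"fmt"
	"sort"
)

// findOrphans returns, in ascending order, the VMIDs present in existing that
// are not marked as tracked by any managed resource.
func findOrphans(existing []int, tracked map[int]bool) []int {
	var orphans []int
	for _, id := range existing {
		if !tracked[id] {
			orphans = append(orphans, id)
		}
	}
	sort.Ints(orphans)
	return orphans
}

func main() {
	// Assumed inputs: VMIDs listed from Proxmox, and VMIDs found in resource statuses.
	existing := []int{237, 234, 236, 235}
	tracked := map[int]bool{237: true}

	orphans := findOrphans(existing, tracked)
	fmt.Println(len(orphans), orphans)
}
```

The length of the result is the value an orphaned-VM gauge would report, and the slice itself is the candidate list for cleanup.
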
---

## 6. Files Requiring Fixes

### High Priority

1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go`**
   - Lines 142-145: Add error recovery
   - Lines 75-156: Add status update on failure

2. **`crossplane-provider-proxmox/pkg/proxmox/client.go`**
   - Lines 350-400: Check importdisk availability
   - Lines 803-821: Use UnlockVM in cleanup

### Medium Priority

3. **`crossplane-provider-proxmox/pkg/controller/vmscaleset/controller.go`**
   - Line 47: Fix client creation

4. **Error handling throughout**
   - Standardize requeue strategies
   - Add error categorization

---

## 7. Documentation Created

1. **`docs/VM_CREATION_FAILURE_ANALYSIS.md`** (12KB)
   - Comprehensive failure analysis
   - Working vs non-working attempts
   - Root cause analysis
   - Recommendations

2. **`docs/CODE_INCONSISTENCIES.md`** (4KB)
   - Code inconsistencies found
   - Required fixes
   - Priority levels

3. **`docs/REVIEW_SUMMARY.md`** (This file)
   - Executive summary
   - Quick reference
   - Action items

---

## 8. Action Items

### Immediate Actions

- [ ] Fix error recovery in VM creation controller
- [ ] Add importdisk API availability check
- [ ] Implement cleanup on partial VM creation
- [ ] Fix vmscaleset controller client creation

### Short-term Actions

- [ ] Implement exponential backoff for retries
- [ ] Add health checks before VM creation
- [ ] Add cleanup on controller startup
- [ ] Standardize error handling patterns

### Long-term Actions

- [ ] Implement alternative image import methods
- [ ] Add comprehensive metrics and monitoring
- [ ] Create test suite for error scenarios
- [ ] Document supported Proxmox versions

---

## 9. Testing Checklist

Before deploying fixes:

- [ ] Test VM creation with importdisk (if supported)
- [ ] Test VM creation with template cloning
- [ ] Test error recovery when importdisk fails
- [ ] Test cleanup of orphaned VMs
- [ ] Test lock file handling
- [ ] Test controller retry behavior
- [ ] Test status update on partial failures
- [ ] Test multiple concurrent VM creations
- [ ] Test node status checks
- [ ] Test Proxmox version compatibility

---

## 10. Conclusion

**Current Status**:
- ✅ All orphaned VMs cleaned up
- ✅ Controller stopped (no active processes)
- ✅ Root cause identified
- ✅ Inconsistencies documented
- ⚠️ Fixes required before re-enabling controller

**Next Steps**:
1. Implement error recovery fixes
2. Add API availability checks
3. Test thoroughly
4. Re-enable controller with monitoring

**Risk Level**: **HIGH** - Controller should remain scaled to 0 until fixes are deployed.

---

*Last Updated: 2025-12-12*
*Reviewer: AI Assistant*
*Status: Complete*