core: smart recovery for failed installs | extend exit_codes (#11221)
* feat(build.func): smart error recovery menu for failed installations
Replace simple Y/n removal prompt with interactive recovery menu:
- Option 1: Remove container and exit (default, auto after 60s timeout)
- Option 2: Keep container for debugging
- Option 3: Retry installation with verbose mode enabled
- Option 4: Retry with 1.5x RAM and +1 CPU core (OOM errors only)
Improvements:
- Detect OOM errors (exit codes 137, 243) and offer resource increase
- Show human-readable error explanation using explain_exit_code()
- Recursive rebuild preserves ALL settings from advanced/app.vars/default.vars
- Settings preserved: Network (IP, Gateway, VLAN, MTU, Bridge), Features
(Nesting, FUSE, TUN, GPU), Storage, SSH keys, Tags, Hostname, etc.
- Show rebuild summary before retry (old→new CTID, resources, network)
- New container ID generated automatically for rebuilds
This helps users recover from transient failures without re-running
the entire script manually.
* fix(api.func): fix duplicate exit codes and add missing error codes
Exit code fixes:
- Remove duplicate definitions for codes 243, 254 (Node.js vs DB)
- Reassign MySQL/MariaDB to 240-242, 244 (was 241-244)
- Reassign MongoDB to 250-253 (was 251-254)
New exit codes added (based on GitHub issues analysis):
- 6: curl couldn't resolve host (DNS failure)
- 7: curl failed to connect (network unreachable)
- 22: curl HTTP error (404, 429 rate limit, 500)
- 28: curl timeout (very common in download failures)
- 35: curl SSL error
- 102: APT lock held by another process
- 124: Command timeout
- 141: SIGPIPE (broken pipe)
Also update OOM detection to include exit code 134 (SIGABRT)
which is commonly seen in Node.js heap overflow issues.
Fixes based on analysis of ~500 GitHub issues.
* fix(exit-codes): sync error_handler.func and api.func with conflict-free code ranges
- Add curl error codes (6, 7, 22, 28, 35)
- Add APT lock code (102), timeout (124), signals (134, 141)
- Move Python codes: 210-212 → 160-162 (avoid Proxmox conflict)
- Move PostgreSQL codes: 231-234 → 170-173
- Move MySQL/MariaDB codes: 241-244 → 180-183
- Move MongoDB codes: 251-254 → 190-193
- Keep Node.js at 243-249, Proxmox at 200-231
- Both files now synchronized with identical mappings
* feat(exit-codes): add systemd and build error codes (150-154)
- 150: Systemd service failed to start
- 151: Systemd service unit not found
- 152: Permission denied (EACCES)
- 153: Build/compile failed (make/gcc/cmake)
- 154: Node.js native addon build failed (node-gyp)
Based on issue analysis: 57 service failures, 25 build failures, 22 node-gyp issues
* fix(build): restore smart recovery and add OOM/DNS retry paths
* feat(build): APT in-place repair, exit 1 subclassification, new exit codes
- Add APT/DPKG in-place recovery: detects exit 100/101/102/255 and exit 1
with APT log patterns, offers to repair dpkg state and re-run install
script without destroying the container
- Add exit 1 subclassification: analyzes combined log to identify root
cause (APT, OOM, network, command-not-found) and routes to appropriate
recovery option
- Add exit 10 hint: shows privileged mode / nesting suggestion
- Add exit 127 hint: extracts missing command name from logs
- Refactor recovery menu: use named option variables (APT_OPTION,
OOM_OPTION, DNS_OPTION) instead of hardcoded option numbers, supports
up to 6 dynamic options cleanly
- Map missing exit codes in api.func: curl 27/36/45/47/55, signals
129 (SIGHUP) / 131 (SIGQUIT), npm 239
* feat(api+build): map 25 more exit codes, add SIGHUP trap, network/perm hints
api.func:
- Map 25+ new exit codes that were showing as 'Unknown' in telemetry:
curl: 3, 16, 18, 24, 26, 32-34, 39, 44, 46, 48, 51, 52, 57, 59, 61,
63, 79, 92, 95; signals: 125, 132, 144, 146
- Update code 8 description (FTP + apk untrusted key)
- Update header comment with full supported ranges
build.func:
- Add SIGHUP trap: reports 'failed/129' to API when terminal is closed,
should significantly reduce the 2841 stuck 'installing' records
- Add exit 52 (empty reply) and 57 (poll error) to network issue
detection for DNS override recovery option
- Add exit 125/126 hint: suggests privileged mode for permission errors
* fix: sync error_handler fallback, Alpine APK repair, retry limit
error_handler.func:
- Sync fallback explain_exit_code() with api.func: add 25+ codes that
were missing (curl 16/18/24/26/27/32-34/36/39/44-48/51/52/55/57/59/
61/63/79/92/95, signals 125/129/131/132/144/146, npm 239, code 3/8)
- Ensures consistent error descriptions even when api.func isn't loaded
build.func:
- Alpine APK repair: detect var_os=alpine and run 'apk fix && apk
cache clean && apk update' instead of apt-get/dpkg commands
- Show 'Repair APK state' instead of 'APT/DPKG' in menu for Alpine
- Retry safety counter: OOM x2 retry limited to max 2 attempts
(prevents infinite RAM doubling via RECOVERY_ATTEMPT env var)
- Show attempt count in rebuild summary
* fix(build): preserve exit code in ERR trap to prevent false exit_code=0
The ERR trap called ensure_log_on_host before post_update_to_api,
which reset \True to 0 (success). This caused ~15-20 records/day to be
reported as 'failed' with exit_code=0 instead of the actual error code.
Root cause chain:
1. Command fails with exit code N → ERR trap fires (\True = N)
2. ensure_log_on_host succeeds → \True becomes 0
3. post_update_to_api 'failed' '\True' → sends 'failed/0' (wrong!)
4. POST_UPDATE_DONE=true → EXIT trap skips the correct code
Fix: capture \True into _ERR_CODE before ensure_log_on_host runs.
* Implement telemetry settings and repo source detection
Add telemetry configuration and repository source detection function.