Document r630-04 MEV storage remediation plan

defiQUG
2026-04-14 05:13:59 -07:00
parent 76253586e7
commit 367e98446a
2 changed files with 353 additions and 1 deletion


@@ -0,0 +1,352 @@
# r630-04 Storage Remediation And MEV Plan
**Last updated:** 2026-04-14
**Purpose:** Surgical remediation plan for `r630-04` local storage after live read errors on `sda`, with exact next steps for either migrating MEV off the node or rebuilding a safe dedicated local MEV pool on the two hidden Samsung SSDs.
## 1. Current facts
### 1.1 Confirmed bad drive
The failing drive is:
- `megaraid,0` / `/dev/sda`
- Model: `ST9300653SS`
- Serial: `6XN7PB91`
Observed failure evidence:
- kernel `Medium Error`
- `Unrecovered read error`
- read failures on the old swap LV
- SMART still says `OK`, but:
- `Elements in grown defect list: 2804`
- `Non-medium error count: 1700390`
Treat `sda` as **degraded / unsafe** for continued production use.
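The defect counters above can be re-pulled on demand; a minimal sketch, assuming `smartctl` is installed on `r630-04` and using the controller index noted above:

```shell
# Re-pull the two counters that matter for the failing SAS drive.
# -d megaraid,0 addresses the physical disk behind the controller
# (the same index as above); run this on r630-04 itself.
smartctl -a -d megaraid,0 /dev/sda \
  | grep -E 'grown defect list|Non-medium error count'
```

If either counter is still climbing between pulls, that is further confirmation the drive is actively degrading.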
### 1.2 Other currently visible disks
- `megaraid,1` / `/dev/sdb` — healthy 300G SAS, serial `PQHTWUVB`, currently part of VG `pve`
- `megaraid,4` / `/dev/sdc` — Crucial MX500 250G, serial `2202E5FB4CC9`, currently used by Ceph
- `megaraid,5` / `/dev/sdd` — Crucial MX500 250G, serial `2203E5FE0911`, currently used by Ceph
- `megaraid,6` / `/dev/sde` — Crucial MX500 250G, serial `2203E5FE0912`, currently used by Ceph
- `megaraid,7` / `/dev/sdf` — Crucial MX500 250G, serial `2202E5FB4CC2`, currently used by Ceph
### 1.3 Hidden controller-visible SSDs
The MegaRAID controller sees two additional healthy SSDs that Linux does not currently expose as `/dev/sd*` devices:
- `megaraid,2` — Samsung SSD 860 EVO 250GB, serial `S3YHNB0K308072M`
- `megaraid,3` — Samsung SSD 860 EVO 250GB, serial `S3YJNB0K597631B`
Health indicators for both:
- SMART overall health: `PASSED`
- reallocated sectors: `0`
- no uncorrectables observed in the SMART summary we pulled
These two drives are the best candidates for a dedicated local MEV storage pool on `r630-04`, but they are currently hidden behind the controller.
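Even while hidden from Linux, both Samsungs can be interrogated through the controller; a sketch, assuming `/dev/bus/0` is the MegaRAID controller handle (confirm with `smartctl --scan` if unsure):

```shell
# SMART health for the two hidden Samsung SSDs, addressed through the
# controller rather than a /dev/sd* node. Indices 2 and 3 are the
# megaraid slots listed above.
for idx in 2 3; do
  smartctl -H -d "megaraid,${idx}" /dev/bus/0
done
```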
## 2. Immediate operating posture
Already applied live:
- host swap disabled
- `/etc/fstab` swap line commented out
- `vm.swappiness=1`
- `vm.vfs_cache_pressure=50`
- CT `2421` now runs with:
- `memory: 49152`
- `swap: 0`
- `cpuunits: 4096`
These changes reduce the blast radius, but they do **not** make `r630-04` local storage trustworthy while `sda` remains in the I/O path for:
- `pve-root`
- thin-pool metadata
- part of `pve/data`
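For reference, the already-applied posture (swap off, sysctl tuning, CT limits) can be re-applied after a reboot or config drift; a sketch of the equivalent commands, run as root on `r630-04`:

```shell
# Disable swap now and keep the sysctl tuning live; the commented-out
# /etc/fstab swap line is what makes the swap change persistent.
swapoff -a
sysctl -w vm.swappiness=1
sysctl -w vm.vfs_cache_pressure=50
# Pin CT 2421's resources to the values listed above.
pct set 2421 --memory 49152 --swap 0 --cpuunits 4096
```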
## 3. Decision paths
There are two valid paths.
### Path A: Fastest risk reduction
Move CT `2421` off `r630-04` to `r630-03`.
Use this when:
- you want MEV risk reduced immediately
- you do not want to touch controller config first
- you are willing to keep the `r630-04` storage redesign as a second phase
### Path B: Keep MEV on r630-04, but move it off the degraded local pool
Expose the two Samsung SSDs to Linux, build a dedicated thinpool on them, and move CT `2421` onto that new storage.
Use this when:
- you want MEV to stay on `r630-04`
- you are comfortable making controller-level storage changes
- you want a clean local storage class for MEV that is not mixed with the failing `sda`
## 4. Recommended order
The safest overall sequence is:
1. keep MEV stable with the already-applied tuning
2. choose one of:
- Path A first, then redesign `r630-04`
- Path B directly if you want to keep `2421` local to `r630-04`
3. replace / retire `sda`
4. only after that, reuse `r630-04` broadly for new CT locality
## 5. Path A: Migrate CT 2421 to r630-03
### 5.1 Why this is still the safest immediate option
- `r630-03` has active `local-lvm`
- `r630-03` has large free thin capacity
- `r630-03` has enough available memory for CT `2421`
- this avoids controller work during a live application incident window
### 5.2 Preflight
```bash
ssh root@192.168.11.14 'pct status 2421'
ssh root@192.168.11.13 'pvecm status; pvesm status | egrep "^(data|local-lvm|local)"'
```
### 5.3 Preferred migration
```bash
ssh root@192.168.11.14 'pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock'
ssh root@192.168.11.14 'pct migrate 2421 r630-03 --storage local-lvm --online 0'
```
### 5.4 Fallback if direct migration fails
```bash
ssh root@192.168.11.14 'vzdump 2421 --mode stop --compress zstd --storage local'
scp root@192.168.11.14:/var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst /tmp/
scp /tmp/vzdump-lxc-2421-*.tar.zst root@192.168.11.13:/var/lib/vz/dump/
ssh root@192.168.11.13 'pct restore 2421 /var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst --storage local-lvm'
```
### 5.5 Post-migration verification
```bash
ssh root@192.168.11.13 'pct start 2421'
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 6. Path B: Build a dedicated MEV thinpool on the hidden Samsung SSDs
### 6.1 Preconditions
You need controller-level access to the MegaRAID state for the two Samsung SSDs:
- `S3YHNB0K308072M`
- `S3YJNB0K597631B`
The host does **not** currently have `storcli`, `perccli`, `megacli`, or `omreport` installed from apt. That means one of these must be used:
- Dell Lifecycle Controller / iDRAC storage UI
- vendor-provided `storcli` / `perccli` package copied in manually
- bootable maintenance environment with MegaRAID tooling
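A quick way to confirm none of that tooling is present before planning around iDRAC or a copied-in binary (the tool names below cover the common vendor spellings):

```shell
# Probe PATH for the usual MegaRAID/PERC CLIs; on r630-04 all of these
# are expected to report "missing" per the note above.
found=0; missing=0
for t in storcli storcli64 perccli perccli64 megacli MegaCli64 omreport; do
  if command -v "$t" >/dev/null 2>&1; then
    echo "found: $t"; found=$((found+1))
  else
    echo "missing: $t"; missing=$((missing+1))
  fi
done
echo "summary: ${found} found, ${missing} missing"
```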
### 6.2 What must be determined first
Before changing anything, identify the controller slot / enclosure and current state for those two serial numbers. Possible states:
- `UGood`
- `JBOD`
- `Hot Spare`
- `Foreign`
- `Offline`
- part of an old single-drive virtual disk
### 6.3 If using storcli / perccli
Typical discovery flow:
```bash
storcli /c0 show
storcli /c0 /eall /sall show all
```
Find the rows matching:
- `S3YHNB0K308072M`
- `S3YJNB0K597631B`
Record:
- enclosure id
- slot id
- state
### 6.4 Controller actions by state
These are the safe controller actions by scenario.
If the drives are **global hot spares**:
```bash
storcli /c0 /e<ENC> /s<SLOT> delete hotsparedrive
```
If the drives are **foreign**:
```bash
storcli /c0 /fall del
```
If the drives are **unconfigured-good** and the controller supports JBOD:
```bash
storcli /c0 /e<ENC> /s<SLOT> set jbod
```
If the controller does **not** support JBOD in its current mode, create two single-drive RAID0 virtual disks instead. Linux will then see two new logical disks, which can still back a dedicated MEV VG.
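If the RAID0 fallback is needed, the usual storcli form is below, one command per drive, with `<ENC>`/`<SLOT>` as recorded in 6.3. Treat the exact syntax as an assumption to verify against this controller's storcli help output before running:

```shell
# One single-drive RAID0 VD per Samsung SSD; confirm syntax first with:
#   storcli /c0 add vd help
storcli /c0 add vd type=raid0 drives=<ENC>:<SLOT1>
storcli /c0 add vd type=raid0 drives=<ENC>:<SLOT2>
```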
### 6.5 OS-level confirmation after exposure
After the controller exposes the disks, verify new block devices appear:
```bash
lsblk -d -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE,TRAN
```
You want to see the two Samsung serials appear as Linux devices.
Then verify they are unused:
```bash
blkid /dev/sdX
wipefs -n /dev/sdX
pvs
```
If they are clean, continue.
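Once the devices are clean, the two serials can be resolved to stable `/dev/disk/by-id/` paths; a sketch:

```shell
# Map each Samsung serial (section 1.3) to its /dev/disk/by-id symlink.
# Caveat: if the drives were exposed as single-drive RAID0 VDs, the
# controller hides the drive serial and these greps will not match —
# identify the VDs by size/WWN instead.
checked=0
for s in S3YHNB0K308072M S3YJNB0K597631B; do
  ls /dev/disk/by-id/ 2>/dev/null | grep -- "$s" || echo "not visible yet: $s"
  checked=$((checked+1))
done
```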
### 6.6 Create a dedicated MEV VG and thinpool
Use stable disk identifiers by serial, not guessed `/dev/sdX` names.
Example:
```bash
pvcreate /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
vgcreate pve-mev /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
lvcreate -l 95%VG -T -n data pve-mev
```
Verify:
```bash
vgs pve-mev
lvs pve-mev
```
### 6.7 Add storage to Proxmox
```bash
pvesm add lvmthin mev-local-lvm \
--vgname pve-mev \
--thinpool data \
--content images,rootdir \
--nodes r630-04
```
Verify:
```bash
pvesm status | egrep "mev-local-lvm|local-lvm|data"
```
### 6.8 Move CT 2421 onto the new storage
The cleanest move is while the CT is stopped.
Preflight backup:
```bash
vzdump 2421 --mode stop --compress zstd --storage local
```
Stop the CT:
```bash
pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock
```
Preferred move in place:
```bash
pct move-volume 2421 rootfs mev-local-lvm --delete 1
```
Then confirm:
```bash
pct config 2421 | grep '^rootfs:'
```
Start the CT:
```bash
pct start 2421
```
### 6.9 Post-move verification
```bash
pct status 2421
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
curl -fsS https://mev.defi-oracle.io/api/stats/freshness
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 7. Rollback points
### Path A rollback
- if migration fails before cutover, CT `2421` is still present on `r630-04`; restart it there
- if restore-on-target fails, original CT still exists on source until explicitly destroyed
### Path B rollback
- if controller exposure step looks wrong, stop and do **not** create PVs
- if VG/thinpool creation fails, remove only the new VG and leave CT `2421` where it is
- if `pct move-volume` fails, the preflight `vzdump` is the safety net
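The Path B rollback in LVM terms, assuming only the new `pve-mev` objects are touched (destructive to the new pool only; never point these at VG `pve`):

```shell
# Tear down only the new MEV pool; double-check the VG name before running.
lvremove -y pve-mev/data
vgremove -y pve-mev
pvremove /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
# If the storage entry was already added to Proxmox, drop it too:
pvesm remove mev-local-lvm
```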
## 8. Recommendation
If the priority is **lowest operational risk**, do:
1. **Path A now** — move CT `2421` to `r630-03`
2. then repair `r630-04` storage at leisure
If the priority is **keeping MEV on r630-04**, do:
1. expose the two Samsung SSDs from the controller
2. build `pve-mev`
3. move CT `2421` to `mev-local-lvm`
4. then retire / replace `sda`
## 9. Current practical recommendation
Because the two Samsung SSDs are healthy and already identified by serial, `r630-04` does have a viable long-term local storage redesign path.
But because `sda` is already erroring in production, the lowest-risk sequence remains:
1. keep MEV stable with the applied hardening
2. migrate `2421` to `r630-03` if you want immediate risk removal
3. redesign `r630-04` local storage afterward


@@ -16,7 +16,7 @@
| **Agent / IDE instructions** | [AGENTS.md](../AGENTS.md) (repo root) |
| **Local green-path tests** | Root `pnpm test` → [`scripts/verify/run-repo-green-test-path.sh`](../scripts/verify/run-repo-green-test-path.sh) |
| **Git submodule hygiene + explorer remotes** | [00-meta/SUBMODULE_HYGIENE.md](00-meta/SUBMODULE_HYGIENE.md) — detached HEAD, push order, Gitea/GitHub, `submodules-clean.sh` |
| **MEV intel + public GUI (`mev.defi-oracle.io`)** | Framing: [../MEV_Bot/docs/framing/README.md](../MEV_Bot/docs/framing/README.md); deploy: [04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md](04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md); LAN bring-up: [04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md](04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md) (dedicated backend CT on `r630-04`); completion list: [04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md](04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md); execution values/readiness: [04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md](04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md); `r630-04` storage repair / MEV pool redesign: [04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md](04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md); specs: [../MEV_Bot/specs/README.md](../MEV_Bot/specs/README.md) |
| **What to do next** | [00-meta/NEXT_STEPS_INDEX.md](00-meta/NEXT_STEPS_INDEX.md) — ordered actions, by audience, execution plan |
| **Recent cleanup / handoff summary** | [00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md](00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md) |
| **Live verification evidence (dated)** | [00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md](00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md) |