Document r630-04 MEV storage remediation plan

defiQUG
2026-04-14 05:13:59 -07:00
parent 76253586e7
commit 367e98446a
2 changed files with 353 additions and 1 deletion


@@ -0,0 +1,352 @@
# r630-04 Storage Remediation And MEV Plan
**Last updated:** 2026-04-14
**Purpose:** Surgical remediation plan for `r630-04` local storage after live read errors on `sda`, with exact next steps for either migrating MEV off the node or rebuilding a safe dedicated local MEV pool on the two hidden Samsung SSDs.
## 1. Current facts
### 1.1 Confirmed bad drive
The failing drive is:
- `megaraid,0` / `/dev/sda`
- Model: `ST9300653SS`
- Serial: `6XN7PB91`
Observed failure evidence:
- kernel `Medium Error`
- `Unrecovered read error`
- read failures on the old swap LV
- SMART still says `OK`, but:
- `Elements in grown defect list: 2804`
- `Non-medium error count: 1700390`
Treat `sda` as **degraded / unsafe** for continued production use.
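The defect counters above can be re-pulled on demand; a minimal sketch, assuming `smartctl` is installed on `r630-04` and using the controller index noted above:

```shell
# Re-pull the two counters that matter for the failing SAS drive.
# -d megaraid,0 addresses the physical disk behind the controller
# (the same index as above); run this on r630-04 itself.
smartctl -a -d megaraid,0 /dev/sda \
  | grep -E 'grown defect list|Non-medium error count'
```

If either counter is still climbing between pulls, that is further confirmation the drive is actively degrading.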
### 1.2 Other currently visible disks
- `megaraid,1` / `/dev/sdb` — healthy 300G SAS, serial `PQHTWUVB`, currently part of VG `pve`
- `megaraid,4` / `/dev/sdc` — Crucial MX500 250G, serial `2202E5FB4CC9`, currently used by Ceph
- `megaraid,5` / `/dev/sdd` — Crucial MX500 250G, serial `2203E5FE0911`, currently used by Ceph
- `megaraid,6` / `/dev/sde` — Crucial MX500 250G, serial `2203E5FE0912`, currently used by Ceph
- `megaraid,7` / `/dev/sdf` — Crucial MX500 250G, serial `2202E5FB4CC2`, currently used by Ceph
### 1.3 Hidden controller-visible SSDs
The MegaRAID controller sees two additional healthy SSDs that Linux does not currently expose as `/dev/sd*` devices:
- `megaraid,2` — Samsung SSD 860 EVO 250GB, serial `S3YHNB0K308072M`
- `megaraid,3` — Samsung SSD 860 EVO 250GB, serial `S3YJNB0K597631B`
Health indicators for both:
- SMART overall health: `PASSED`
- reallocated sectors: `0`
- no uncorrectables observed in the SMART summary we pulled
These two drives are the best candidates for a dedicated local MEV storage pool on `r630-04`, but they are currently hidden behind the controller.
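Even while hidden from Linux, both Samsungs can be interrogated through the controller; a sketch, assuming `/dev/bus/0` is the MegaRAID controller handle (confirm with `smartctl --scan` if unsure):

```shell
# SMART health for the two hidden Samsung SSDs, addressed through the
# controller rather than a /dev/sd* node. Indices 2 and 3 are the
# megaraid slots listed above.
for idx in 2 3; do
  smartctl -H -d "megaraid,${idx}" /dev/bus/0
done
```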
## 2. Immediate operating posture
Already applied live:
- host swap disabled
- `/etc/fstab` swap line commented out
- `vm.swappiness=1`
- `vm.vfs_cache_pressure=50`
- CT `2421` now runs with:
- `memory: 49152`
- `swap: 0`
- `cpuunits: 4096`
These changes reduce the blast radius, but they do **not** make `r630-04` local storage trustworthy while `sda` remains in the I/O path for:
- `pve-root`
- thin-pool metadata
- part of `pve/data`
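For reference, the already-applied posture (swap off, sysctl tuning, CT limits) can be re-applied after a reboot or config drift; a sketch of the equivalent commands, run as root on `r630-04`:

```shell
# Disable swap now and keep the sysctl tuning live; the commented-out
# /etc/fstab swap line is what makes the swap change persistent.
swapoff -a
sysctl -w vm.swappiness=1
sysctl -w vm.vfs_cache_pressure=50
# Pin CT 2421's resources to the values listed above.
pct set 2421 --memory 49152 --swap 0 --cpuunits 4096
```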
## 3. Decision paths
There are two valid paths.
### Path A: Fastest risk reduction
Move CT `2421` off `r630-04` to `r630-03`.
Use this when:
- you want MEV risk reduced immediately
- you do not want to touch controller config first
- you are willing to keep the `r630-04` storage redesign as a second phase
### Path B: Keep MEV on r630-04, but move it off the degraded local pool
Expose the two Samsung SSDs to Linux, build a dedicated thinpool on them, and move CT `2421` onto that new storage.
Use this when:
- you want MEV to stay on `r630-04`
- you are comfortable making controller-level storage changes
- you want a clean local storage class for MEV that is not mixed with the failing `sda`
## 4. Recommended order
The safest overall sequence is:
1. keep MEV stable with the already-applied tuning
2. choose one of:
- Path A first, then redesign `r630-04`
- Path B directly if you want to keep `2421` local to `r630-04`
3. replace / retire `sda`
4. only after that, reuse `r630-04` broadly for new CT locality
## 5. Path A: Migrate CT 2421 to r630-03
### 5.1 Why this is still the safest immediate option
- `r630-03` has active `local-lvm`
- `r630-03` has large free thin capacity
- `r630-03` has enough available memory for CT `2421`
- this avoids controller work during a live application incident window
### 5.2 Preflight
```bash
ssh root@192.168.11.14 'pct status 2421'
ssh root@192.168.11.13 'pvecm status; pvesm status | egrep "^(data|local-lvm|local)"'
```
### 5.3 Preferred migration
```bash
ssh root@192.168.11.14 'pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock'
ssh root@192.168.11.14 'pct migrate 2421 r630-03 --storage local-lvm --online 0'
```
### 5.4 Fallback if direct migration fails
```bash
ssh root@192.168.11.14 'vzdump 2421 --mode stop --compress zstd --storage local'
scp root@192.168.11.14:/var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst /tmp/
scp /tmp/vzdump-lxc-2421-*.tar.zst root@192.168.11.13:/var/lib/vz/dump/
ssh root@192.168.11.13 'pct restore 2421 /var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst --storage local-lvm'
```
### 5.5 Post-migration verification
```bash
ssh root@192.168.11.13 'pct start 2421'
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 6. Path B: Build a dedicated MEV thinpool on the hidden Samsung SSDs
### 6.1 Preconditions
You need controller-level access to the MegaRAID state for the two Samsung SSDs:
- `S3YHNB0K308072M`
- `S3YJNB0K597631B`
The host does **not** currently have `storcli`, `perccli`, `megacli`, or `omreport` installed from apt. That means one of these must be used:
- Dell Lifecycle Controller / iDRAC storage UI
- vendor-provided `storcli` / `perccli` package copied in manually
- bootable maintenance environment with MegaRAID tooling
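A quick way to confirm none of that tooling is present before planning around iDRAC or a copied-in binary (the tool names below cover the common vendor spellings):

```shell
# Probe PATH for the usual MegaRAID/PERC CLIs; on r630-04 all of these
# are expected to report "missing" per the note above.
found=0; missing=0
for t in storcli storcli64 perccli perccli64 megacli MegaCli64 omreport; do
  if command -v "$t" >/dev/null 2>&1; then
    echo "found: $t"; found=$((found+1))
  else
    echo "missing: $t"; missing=$((missing+1))
  fi
done
echo "summary: ${found} found, ${missing} missing"
```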
### 6.2 What must be determined first
Before changing anything, identify the controller slot / enclosure and current state for those two serial numbers. Possible states:
- `UGood`
- `JBOD`
- `Hot Spare`
- `Foreign`
- `Offline`
- part of an old single-drive virtual disk
### 6.3 If using storcli / perccli
Typical discovery flow:
```bash
storcli /c0 show
storcli /c0 /eall /sall show all
```
Find the rows matching:
- `S3YHNB0K308072M`
- `S3YJNB0K597631B`
Record:
- enclosure id
- slot id
- state
### 6.4 Controller actions by state
These are the safe controller actions by scenario.
If the drives are **global hot spares**:
```bash
storcli /c0 /e<ENC> /s<SLOT> delete hotsparedrive
```
If the drives are **foreign**:
```bash
storcli /c0 /fall del
```
If the drives are **unconfigured-good** and the controller supports JBOD:
```bash
storcli /c0 /e<ENC> /s<SLOT> set jbod
```
If the controller does **not** support JBOD in its current mode, create two single-drive RAID0 virtual disks instead. Linux will then see two new logical disks, which can still back a dedicated MEV VG.
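If the RAID0 fallback is needed, the usual storcli form is below, one command per drive, with `<ENC>`/`<SLOT>` as recorded in 6.3. Treat the exact syntax as an assumption to verify against this controller's storcli help output before running:

```shell
# One single-drive RAID0 VD per Samsung SSD; confirm syntax first with:
#   storcli /c0 add vd help
storcli /c0 add vd type=raid0 drives=<ENC>:<SLOT1>
storcli /c0 add vd type=raid0 drives=<ENC>:<SLOT2>
```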
### 6.5 OS-level confirmation after exposure
After the controller exposes the disks, verify new block devices appear:
```bash
lsblk -d -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE,TRAN
```
You want to see the two Samsung serials appear as Linux devices.
Then verify they are unused:
```bash
blkid /dev/sdX
wipefs -n /dev/sdX
pvs
```
If they are clean, continue.
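Once the devices are clean, the two serials can be resolved to stable `/dev/disk/by-id/` paths; a sketch:

```shell
# Map each Samsung serial (section 1.3) to its /dev/disk/by-id symlink.
# Caveat: if the drives were exposed as single-drive RAID0 VDs, the
# controller hides the drive serial and these greps will not match —
# identify the VDs by size/WWN instead.
checked=0
for s in S3YHNB0K308072M S3YJNB0K597631B; do
  ls /dev/disk/by-id/ 2>/dev/null | grep -- "$s" || echo "not visible yet: $s"
  checked=$((checked+1))
done
```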
### 6.6 Create a dedicated MEV VG and thinpool
Use stable disk identifiers by serial, not guessed `/dev/sdX` names.
Example:
```bash
pvcreate /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
vgcreate pve-mev /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
lvcreate -l 95%VG -T -n data pve-mev
```
Verify:
```bash
vgs pve-mev
lvs pve-mev
```
### 6.7 Add storage to Proxmox
```bash
pvesm add lvmthin mev-local-lvm \
--vgname pve-mev \
--thinpool data \
--content images,rootdir \
--nodes r630-04
```
Verify:
```bash
pvesm status | egrep "mev-local-lvm|local-lvm|data"
```
### 6.8 Move CT 2421 onto the new storage
The cleanest move is while the CT is stopped.
Preflight backup:
```bash
vzdump 2421 --mode stop --compress zstd --storage local
```
Stop the CT:
```bash
pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock
```
Preferred move in place:
```bash
pct move-volume 2421 rootfs mev-local-lvm --delete 1
```
Then confirm:
```bash
pct config 2421 | grep '^rootfs:'
```
Start the CT:
```bash
pct start 2421
```
### 6.9 Post-move verification
```bash
pct status 2421
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
curl -fsS https://mev.defi-oracle.io/api/stats/freshness
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 7. Rollback points
### Path A rollback
- if migration fails before cutover, CT `2421` is still present on `r630-04`; restart it there
- if restore-on-target fails, original CT still exists on source until explicitly destroyed
### Path B rollback
- if controller exposure step looks wrong, stop and do **not** create PVs
- if VG/thinpool creation fails, remove only the new VG and leave CT `2421` where it is
- if `pct move-volume` fails, the preflight `vzdump` is the safety net
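The Path B rollback in LVM terms, assuming only the new `pve-mev` objects are touched (destructive to the new pool only; never point these at VG `pve`):

```shell
# Tear down only the new MEV pool; double-check the VG name before running.
lvremove -y pve-mev/data
vgremove -y pve-mev
pvremove /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
# If the storage entry was already added to Proxmox, drop it too:
pvesm remove mev-local-lvm
```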
## 8. Recommendation
If the priority is **lowest operational risk**, do:
1. **Path A now** — move CT `2421` to `r630-03`
2. then repair `r630-04` storage at leisure
If the priority is **keeping MEV on r630-04**, do:
1. expose the two Samsung SSDs from the controller
2. build `pve-mev`
3. move CT `2421` to `mev-local-lvm`
4. then retire / replace `sda`
## 9. Current practical recommendation
Because the two Samsung SSDs are healthy and already identified by serial, `r630-04` does have a viable long-term local storage redesign path.
But because `sda` is already erroring in production, the lowest-risk sequence remains:
1. keep MEV stable with the applied hardening
2. migrate `2421` to `r630-03` if you want immediate risk removal
3. redesign `r630-04` local storage afterward


@@ -16,7 +16,7 @@
| **Agent / IDE instructions** | [AGENTS.md](../AGENTS.md) (repo root) |
| **Local green-path tests** | Root `pnpm test` → [`scripts/verify/run-repo-green-test-path.sh`](../scripts/verify/run-repo-green-test-path.sh) |
| **Git submodule hygiene + explorer remotes** | [00-meta/SUBMODULE_HYGIENE.md](00-meta/SUBMODULE_HYGIENE.md) — detached HEAD, push order, Gitea/GitHub, `submodules-clean.sh` |
| **MEV intel + public GUI (`mev.defi-oracle.io`)** | Framing: [../MEV_Bot/docs/framing/README.md](../MEV_Bot/docs/framing/README.md); deploy: [04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md](04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md); LAN bring-up: [04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md](04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md) (dedicated backend CT on `r630-04`); completion list: [04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md](04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md); execution values/readiness: [04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md](04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md); `r630-04` storage repair / MEV pool redesign: [04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md](04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md); specs: [../MEV_Bot/specs/README.md](../MEV_Bot/specs/README.md) |
| **What to do next** | [00-meta/NEXT_STEPS_INDEX.md](00-meta/NEXT_STEPS_INDEX.md) — ordered actions, by audience, execution plan |
| **Recent cleanup / handoff summary** | [00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md](00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md) |
| **Live verification evidence (dated)** | [00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md](00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md) |