# r630-04 Storage Remediation And MEV Plan

**Last updated:** 2026-04-14

**Purpose:** Surgical remediation plan for `r630-04` local storage after live read errors on `sda`, with exact next steps for either migrating MEV off the node or rebuilding a safe dedicated local MEV pool on the two hidden Samsung SSDs.

## 1. Current facts

### 1.1 Confirmed bad drive

The failing drive is:

- `megaraid,0` / `/dev/sda`
- Model: `ST9300653SS`
- Serial: `6XN7PB91`

Observed failure evidence:

- kernel `Medium Error`
- `Unrecovered read error`
- read failures on the old swap LV
- SMART still says `OK`, but:
  - `Elements in grown defect list: 2804`
  - `Non-medium error count: 1700390`

Treat `sda` as **degraded / unsafe** for continued production locality.
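Because SMART overall health can stay `OK` while the grown defect list climbs (exactly the pattern above), a tiny helper makes the call explicit. This is a sketch, not vendor tooling: the function name and the default threshold of 100 are arbitrary assumptions.

```shell
# Flag a SAS drive as suspect when the grown defect list is non-trivial,
# even though SMART overall health still reports OK.
# The threshold is an assumption, not vendor guidance.
check_grown_defects() {
  local smart_output="$1" threshold="${2:-100}"
  local defects
  defects=$(printf '%s\n' "$smart_output" \
    | awk -F': *' '/grown defect list/ {print $2}')
  if [ "${defects:-0}" -gt "$threshold" ]; then
    echo "SUSPECT (grown defects: $defects)"
  else
    echo "OK (grown defects: ${defects:-0})"
  fi
}
```

Usage against the live drive would be something like `check_grown_defects "$(smartctl -d megaraid,0 -a /dev/sda)"`.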
### 1.2 Other currently visible disks

- `megaraid,1` / `/dev/sdb` — healthy 300G SAS, serial `PQHTWUVB`, currently part of VG `pve`
- `megaraid,4` / `/dev/sdc` — Crucial MX500 250G, serial `2202E5FB4CC9`, currently used by Ceph
- `megaraid,5` / `/dev/sdd` — Crucial MX500 250G, serial `2203E5FE0911`, currently used by Ceph
- `megaraid,6` / `/dev/sde` — Crucial MX500 250G, serial `2203E5FE0912`, currently used by Ceph
- `megaraid,7` / `/dev/sdf` — Crucial MX500 250G, serial `2202E5FB4CC2`, currently used by Ceph

### 1.3 Hidden controller-visible SSDs

The MegaRAID controller sees two additional healthy SSDs that Linux does not currently expose as `/dev/sd*` devices:

- `megaraid,2` — Samsung SSD 860 EVO 250GB, serial `S3YHNB0K308072M`
- `megaraid,3` — Samsung SSD 860 EVO 250GB, serial `S3YJNB0K597631B`

Health indicators for both:

- SMART overall health: `PASSED`
- reallocated sectors: `0`
- no uncorrectables observed in the SMART summary we pulled

These two drives are the best candidates for a dedicated local MEV storage pool on `r630-04`, but they are currently hidden behind the controller.
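Even while hidden, both Samsungs can be rechecked from the host: smartmontools addresses drives behind a MegaRAID controller by device id (`-d megaraid,N`) through any `/dev/sd*` handle on that HBA. The helper below only builds the command strings; the device ids 2 and 3 come from the inventory above, and `/dev/sda` is just the pass-through handle, not the drive being queried.

```shell
# Build the smartctl invocation for a MegaRAID device id.
# /dev/sda here is only the controller pass-through path.
megaraid_smart_cmd() {
  printf 'smartctl -d megaraid,%s -H -A /dev/sda\n' "$1"
}

# Print the recheck commands for both hidden Samsungs.
for id in 2 3; do
  megaraid_smart_cmd "$id"
done
```

Run the printed commands (or pipe the loop to `sh`) before trusting the drives for a new pool.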
## 2. Immediate operating posture

Already applied live:

- host swap disabled
- `/etc/fstab` swap line commented out
- `vm.swappiness=1`
- `vm.vfs_cache_pressure=50`
- CT `2421` now runs with:
  - `memory: 49152`
  - `swap: 0`
  - `cpuunits: 4096`
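A quick way to confirm the CT settings above survived later edits is to check `pct config 2421` for the exact values. The function below is a sketch: the only assumption is the `key: value` line format that `pct config` prints.

```shell
# Verify CT 2421 still carries the anti-swap tuning listed above.
# Expected values mirror the list in this section.
verify_ct_tuning() {
  local cfg="$1"
  printf '%s\n' "$cfg" | grep -qx 'memory: 49152' &&
  printf '%s\n' "$cfg" | grep -qx 'swap: 0' &&
  printf '%s\n' "$cfg" | grep -qx 'cpuunits: 4096'
}
```

Usage: `verify_ct_tuning "$(pct config 2421)" && echo tuned`.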
These changes reduce the blast radius, but they do **not** make `r630-04` local storage trustworthy while `sda` remains in the path for:

- `pve-root`
- thin-pool metadata
- part of `pve/data`

## 3. Decision paths

There are two valid paths.

### Path A: Fastest risk reduction

Move CT `2421` off `r630-04` to `r630-03`.

Use this when:

- you want MEV risk reduced immediately
- you do not want to touch controller config first
- you are willing to keep the `r630-04` storage redesign as a second phase

### Path B: Keep MEV on r630-04, but move it off the degraded local pool

Expose the two Samsung SSDs to Linux, build a dedicated thinpool on them, and move CT `2421` onto that new storage.

Use this when:

- you want MEV to stay on `r630-04`
- you are comfortable making controller-level storage changes
- you want a clean local storage class for MEV that is not mixed with the failing `sda`

## 4. Recommended order

The safest overall sequence is:
1. keep MEV stable with the already-applied tuning
2. choose one of:
   - Path A first, then redesign `r630-04`
   - Path B directly if you want to keep `2421` local to `r630-04`
3. replace / retire `sda`
4. only after that, reuse `r630-04` broadly for new CT locality

## 5. Path A: Migrate CT 2421 to r630-03

### 5.1 Why this is still the safest immediate option

- `r630-03` has active `local-lvm`
- `r630-03` has large free thin capacity
- `r630-03` has enough available memory for CT `2421`
- this avoids controller work during a live application incident window

### 5.2 Preflight

```bash
ssh root@192.168.11.14 'pct status 2421'
ssh root@192.168.11.13 'pvecm status; pvesm status | egrep "^(data|local-lvm|local)"'
```

### 5.3 Preferred migration

```bash
ssh root@192.168.11.14 'pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock'
ssh root@192.168.11.14 'pct migrate 2421 r630-03 --storage local-lvm --online 0'
```

### 5.4 Fallback if direct migrate is unhappy

```bash
ssh root@192.168.11.14 'vzdump 2421 --mode stop --compress zstd --storage local'
scp root@192.168.11.14:/var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst /tmp/
scp /tmp/vzdump-lxc-2421-*.tar.zst root@192.168.11.13:/var/lib/vz/dump/
ssh root@192.168.11.13 'pct restore 2421 /var/lib/vz/dump/vzdump-lxc-2421-*.tar.zst --storage local-lvm'
```

### 5.5 Post-migration verification

```bash
ssh root@192.168.11.13 'pct start 2421'
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 6. Path B: Build a dedicated MEV thinpool on the hidden Samsung SSDs

### 6.1 Preconditions

You need controller-level access to the MegaRAID state for the two Samsung SSDs:

- `S3YHNB0K308072M`
- `S3YJNB0K597631B`

The host does **not** currently have `storcli`, `perccli`, `megacli`, or `omreport` installed from apt. That means one of these must be used:

- Dell Lifecycle Controller / iDRAC storage UI
- vendor-provided `storcli` / `perccli` package copied in manually
- bootable maintenance environment with MegaRAID tooling

### 6.2 What must be determined first

Before changing anything, identify the controller slot / enclosure and current state for those two serial numbers. Possible states:

- `UGood`
- `JBOD`
- `Hot Spare`
- `Foreign`
- `Offline`
- part of an old single-drive virtual disk

### 6.3 If using storcli / perccli

Typical discovery flow:

```bash
storcli /c0 show
storcli /c0 /eall /sall show all
```

Find the rows matching:

- `S3YHNB0K308072M`
- `S3YJNB0K597631B`

Record:

- enclosure id
- slot id
- state
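The enclosure/slot lookup can be scripted. This is a sketch under an assumed output layout (`Drive /c0/eX/sY` section headers followed later by an `SN = ...` line), which matches typical `storcli show all` output but should be eyeballed against the real controller before relying on it.

```shell
# Pull the /c0/eX/sY location for a given serial out of
# `storcli /c0 /eall /sall show all` output read on stdin.
# Assumed layout: "Drive /c0/eX/sY" headers, then "SN = <serial>" lines.
drive_location_for_serial() {
  awk -v sn="$1" '
    /^Drive \/c[0-9]+\/e[0-9]+\/s[0-9]+/ { loc = $2 }
    $1 == "SN" && $3 == sn              { print loc; exit }
  '
}
```

Usage: `storcli /c0 /eall /sall show all | drive_location_for_serial S3YHNB0K308072M`.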
### 6.4 Controller actions by state

These are the safe controller actions by scenario.

If the drives are **global hot spares**:

```bash
storcli /c0 /e<ENC> /s<SLOT> delete hotsparedrive
```

If the drives are **foreign**:

```bash
storcli /c0 /fall del
```

If the drives are **unconfigured-good** and the controller supports JBOD:

```bash
storcli /c0 /e<ENC> /s<SLOT> set jbod
```

If the controller does **not** support JBOD for the chosen mode, create two single-drive RAID0 virtual disks instead. In that case, Linux will see two new logical disks, and they can still be used for a dedicated MEV VG.
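The scenarios above reduce to a plain lookup table, which is handy in a runbook script. The state strings are the ones listed in section 6.2; anything unexpected means stop and re-check rather than guess.

```shell
# Map a recorded drive state to the next controller action for this plan.
# Unknown states deliberately map to "stop", not to a default action.
action_for_state() {
  case "$1" in
    'Hot Spare') echo 'delete hotsparedrive' ;;
    'Foreign')   echo 'clear foreign config: /c0 /fall del' ;;
    'UGood')     echo 'set jbod (or single-drive RAID0 if JBOD unsupported)' ;;
    'JBOD')      echo 'already exposed: nothing to do on the controller' ;;
    *)           echo "unexpected state '$1': stop and investigate" ;;
  esac
}
```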
### 6.5 OS-level confirmation after exposure

After the controller exposes the disks, verify new block devices appear:

```bash
lsblk -d -o NAME,SIZE,MODEL,SERIAL,ROTA,TYPE,TRAN
```

You want to see the two Samsung serials appear as Linux devices.

Then verify they are unused:

```bash
blkid /dev/sdX
wipefs -n /dev/sdX
pvs
```

If they are clean, continue.
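"Clean" here means both probes came back empty: no filesystem signature from `blkid` and no signature rows from `wipefs -n`. A small wrapper over the captured outputs makes that decision explicit (the function name is ours, not a standard tool):

```shell
# Decide whether a freshly exposed disk is safe to claim for LVM.
# $1: output of `blkid /dev/sdX`; $2: output of `wipefs -n /dev/sdX`.
# Both empty means nothing recognizable lives on the disk.
disk_is_clean() {
  local blkid_out="$1" wipefs_out="$2"
  [ -z "$blkid_out" ] && [ -z "$wipefs_out" ]
}
```

Usage: `disk_is_clean "$(blkid /dev/sdX)" "$(wipefs -n /dev/sdX)" && echo clean`.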
### 6.6 Create a dedicated MEV VG and thinpool

Use stable disk identifiers by serial, not guessed `/dev/sdX` names.

Example:

```bash
pvcreate /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
vgcreate pve-mev /dev/disk/by-id/<samsung-serial-1> /dev/disk/by-id/<samsung-serial-2>
lvcreate -l 95%VG -T -n data pve-mev
```

Verify:

```bash
vgs pve-mev
lvs pve-mev
```

### 6.7 Add storage to Proxmox

```bash
pvesm add lvmthin mev-local-lvm \
  --vgname pve-mev \
  --thinpool data \
  --content images,rootdir \
  --nodes r630-04
```

Verify:

```bash
pvesm status | egrep "mev-local-lvm|local-lvm|data"
```
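For an automated gate rather than an eyeball check, the `pvesm status` output can be parsed for the new storage id. This assumes the usual column order (name, type, status, ...), which is stable across recent Proxmox releases but worth confirming once by hand.

```shell
# Exit 0 iff the given storage id appears as "active" in `pvesm status`
# output read on stdin (columns assumed: name, type, status, ...).
storage_active() {
  awk -v s="$1" '$1 == s && $3 == "active" { found = 1 } END { exit !found }'
}
```

Usage: `pvesm status | storage_active mev-local-lvm && echo ok`.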
### 6.8 Move CT 2421 onto the new storage

The cleanest move is while the CT is stopped.

Preflight backup:

```bash
vzdump 2421 --mode stop --compress zstd --storage local
```

Stop the CT:

```bash
pct shutdown 2421 --timeout 120 || pct stop 2421 --skiplock
```

Preferred move in place:

```bash
pct move-volume 2421 rootfs mev-local-lvm --delete 1
```

Then confirm:

```bash
pct config 2421 | grep '^rootfs:'
```

Start the CT:

```bash
pct start 2421
```

### 6.9 Post-move verification

```bash
pct status 2421
curl -fsS https://mev.defi-oracle.io/api/health
curl -fsS https://mev.defi-oracle.io/api/infra
curl -fsS https://mev.defi-oracle.io/api/stats/freshness
API_KEY='cc49035e743863aba6a8bd4aa925fb59efb2f991ccab0898e61fa96cadfc951a' \
bash scripts/verify/run-mev-roadmap-validation.sh --live-per-chain
```
## 7. Rollback points

### Path A rollback

- if migration fails before cutover, keep `2421` stopped on `r630-04` and restart it there
- if restore-on-target fails, the original CT still exists on the source until explicitly destroyed

### Path B rollback

- if the controller exposure step looks wrong, stop and do **not** create PVs
- if VG/thinpool creation fails, remove only the new VG and leave CT `2421` where it is
- if `pct move-volume` fails, the preflight `vzdump` is the safety net

## 8. Recommendation

If the priority is **lowest operational risk**, do:

1. **Path A now** — move CT `2421` to `r630-03`
2. then repair `r630-04` storage at leisure

If the priority is **keeping MEV on r630-04**, do:

1. expose the two Samsung SSDs from the controller
2. build `pve-mev`
3. move CT `2421` to `mev-local-lvm`
4. then retire / replace `sda`

## 9. Current practical recommendation

Because the two Samsung SSDs are healthy and already identified by serial, `r630-04` does have a viable long-term local storage redesign path.

But because `sda` is already erroring in production, the lowest-risk sequence remains:

1. keep MEV stable with the applied hardening
2. migrate `2421` to `r630-03` if you want immediate risk removal
3. redesign `r630-04` local storage afterward
| **Agent / IDE instructions** | [AGENTS.md](../AGENTS.md) (repo root) |
| **Local green-path tests** | Root `pnpm test` → [`scripts/verify/run-repo-green-test-path.sh`](../scripts/verify/run-repo-green-test-path.sh) |
| **Git submodule hygiene + explorer remotes** | [00-meta/SUBMODULE_HYGIENE.md](00-meta/SUBMODULE_HYGIENE.md) — detached HEAD, push order, Gitea/GitHub, `submodules-clean.sh` |
| **MEV intel + public GUI (`mev.defi-oracle.io`)** | Framing: [../MEV_Bot/docs/framing/README.md](../MEV_Bot/docs/framing/README.md); deploy: [04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md](04-configuration/MEV_CONTROL_DEFI_ORACLE_IO_DEPLOYMENT.md); LAN bring-up: [04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md](04-configuration/MEV_CONTROL_LAN_BRINGUP_CHECKLIST.md) (dedicated backend CT on `r630-04`); completion list: [04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md](04-configuration/MEV_CONTROL_COMPLETION_PUNCHLIST.md); execution values/readiness: [04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md](04-configuration/MEV_EXECUTION_VALUE_SOURCES_AND_READINESS.md); `r630-04` storage repair / MEV pool redesign: [04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md](04-configuration/R630_04_STORAGE_REMEDIATION_AND_MEV_PLAN.md); specs: [../MEV_Bot/specs/README.md](../MEV_Bot/specs/README.md) |
| **What to do next** | [00-meta/NEXT_STEPS_INDEX.md](00-meta/NEXT_STEPS_INDEX.md) — ordered actions, by audience, execution plan |
| **Recent cleanup / handoff summary** | [00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md](00-meta/OPERATOR_HANDOFF_2026-04-13_CLEANUP_AND_PLATFORM_SUMMARY.md) |
| **Live verification evidence (dated)** | [00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md](00-meta/LIVE_VERIFICATION_LOG_2026-03-30.md) |