# Fix Block Production — Runbook
**Last Updated:** 2026-03-04
**When:** Block production is stalled on Chain 138 (no new blocks; validators active).
---
## 1. Confirm the problem
```bash
# Block not advancing (run twice, 10s apart)
cast block-number --rpc-url http://192.168.11.211:8545
sleep 10
cast block-number --rpc-url http://192.168.11.211:8545
# If same → stalled
```
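If `cast` is not installed on the machine you are working from, the same check can be sketched with raw JSON-RPC over `curl`. The helper names below (`hex_to_dec`, `rpc_block_number`, `is_stalled`) are hypothetical and not part of this repo; the RPC URL is the one used above.

```bash
#!/usr/bin/env bash
# Sketch only: stall check without `cast`. Helper names are hypothetical.
RPC_URL="${RPC_URL:-http://192.168.11.211:8545}"

# Convert a 0x-prefixed hex quantity (as returned by eth_blockNumber) to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Fetch the current block number via JSON-RPC and print it in decimal.
# Prints nothing and returns non-zero if the RPC gave no usable result.
rpc_block_number() {
  local hex
  hex=$(curl -s -m 5 -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "$RPC_URL" | sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')
  [ -n "$hex" ] && hex_to_dec "$hex"
}

# Return 0 (true) if the chain did not advance between two samples.
is_stalled() {
  local before=$1 after=$2
  [ "$after" -le "$before" ]
}

# Usage:
#   a=$(rpc_block_number); sleep 10; b=$(rpc_block_number)
#   is_stalled "$a" "$b" && echo "stalled at block $b"
```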
```bash
./scripts/monitoring/monitor-blockchain-health.sh
# Look for: "Block production stalled (no new blocks in 5s)"
```
---
## 2. Check validator status and height
All 5 validators (VMIDs 1000-1004) must be **active** and ideally at **chain head**:
```bash
# Service status (from repo root)
for spec in "1000:192.168.11.11" "1001:192.168.11.11" "1002:192.168.11.11" "1003:192.168.11.10" "1004:192.168.11.10"; do
  # Split "vmid:host" into its two fields
  IFS=: read -r vmid host <<< "$spec"
  # "?" means the host or container could not be reached
  s=$(ssh -o ConnectTimeout=6 root@"$host" "pct exec $vmid -- systemctl is-active besu-validator 2>/dev/null" || echo "?")
  echo "Validator $vmid: $s"
done
```
Optional: check block height per validator (metrics on port 9545):
```bash
ssh root@192.168.11.11 "pct exec 1000 -- curl -s -m 4 http://127.0.0.1:9545/metrics" | grep -E '^ethereum_best_known_block_number |^besu_blockchain_difficulty_total '
# Should be ~2547803 (chain head)
```
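To compare heights across validators it helps to reduce the metrics text to a plain integer. The following is a minimal sketch, assuming the metric is exposed as an unlabeled line of the form `name value` (which is what the `grep` pattern above expects); `metric_value` and `blocks_behind` are hypothetical helpers.

```bash
# Sketch only: extract one unlabeled Prometheus metric as an integer.
# Reads metrics text on stdin; "# HELP" / "# TYPE" lines are skipped
# because their first field is "#", not the metric name.
metric_value() {
  local name=$1
  awk -v m="$name" '$1 == m { printf "%d\n", $2; exit }'
}

# How many blocks a node is behind a reference height (head - local).
blocks_behind() {
  echo $(( $1 - $2 ))
}

# Usage:
#   h=$(ssh root@192.168.11.11 "pct exec 1000 -- curl -s -m 4 http://127.0.0.1:9545/metrics" \
#         | metric_value ethereum_best_known_block_number)
#   head=$(cast block-number --rpc-url http://192.168.11.211:8545)
#   echo "Validator 1000 is $(blocks_behind "$head" "$h") blocks behind"
```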
---
## 3. Apply fix: staggered restart
Restart validators **one at a time** so the rest stay at head and the restarted node syncs quickly. This preserves quorum and avoids a situation where every validator is running a full sync at the same time.
```bash
cd /home/intlc/projects/proxmox
./scripts/maintenance/fix-block-production-staggered-restart.sh
```
- **Dry run:** `./scripts/maintenance/fix-block-production-staggered-restart.sh --dry-run`
- **Duration:** ~7-8 minutes (90s wait between each of 5 restarts + final 30s).
- **Order:** 1004 → 1003 → 1002 → 1001 → 1000 (ML110 first, then R630-01).
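
For orientation, the order and timing above can be sketched as a plan that only prints commands. This is not the contents of `fix-block-production-staggered-restart.sh`; it assumes each restart is a `systemctl restart besu-validator` inside the container, and the function names are hypothetical. The VMID-to-host mapping is the one from section 2.

```bash
# Sketch only: print (do not run) a staggered restart plan.
RESTART_ORDER=(
  "1004:192.168.11.10" "1003:192.168.11.10"
  "1002:192.168.11.11" "1001:192.168.11.11" "1000:192.168.11.11"
)
WAIT_BETWEEN=90   # seconds to wait after each restart
FINAL_WAIT=30     # extra seconds after the last restart

# Emit the commands a staggered restart would run, one per line.
print_restart_plan() {
  local spec vmid host
  for spec in "${RESTART_ORDER[@]}"; do
    IFS=: read -r vmid host <<< "$spec"
    echo "ssh root@$host pct exec $vmid -- systemctl restart besu-validator"
    echo "sleep $WAIT_BETWEEN"
  done
  echo "sleep $FINAL_WAIT"
}

# Total waiting time implied by the plan, in seconds (5 * 90 + 30).
plan_duration() {
  echo $(( ${#RESTART_ORDER[@]} * WAIT_BETWEEN + FINAL_WAIT ))
}

# Usage:
#   print_restart_plan
#   echo "Total wait: $(plan_duration)s"
```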
---
## 4. Verify block production
```bash
./scripts/monitoring/monitor-blockchain-health.sh
# Expect: "Block production" advancing (block diff > 0 in 5s window)
```
Or:
```bash
watch -n 5 'cast block-number --rpc-url http://192.168.11.211:8545'
# Block number should increase every ~2s (genesis blockperiodseconds=2)
```
---
## 5. If still stalled
**Quorum:** With 5 validators, QBFT needs **4 at chain head** to produce blocks. Besu computes the quorum as ceil(2N/3), which is 4 for N = 5 (2F+1 with F = 1 would give only 3). If only 3 are at head (e.g. 1000, 1001, 1002), blocks will not advance until at least one of 1003 and 1004 syncs to head. Check each validator's `ethereum_best_known_block_number` or `besu_blockchain_difficulty_total` (metrics on port 9545); all should match the RPC block number.
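
The quorum check can be sketched as a small calculation. This assumes the ceil(2N/3) rule and treats a validator as "at head" when its height equals the RPC height; `qbft_quorum` and `count_at_head` are hypothetical helpers, and the heights in the usage comment are made up for illustration.

```bash
# Sketch only: is a quorum of validators at chain head?

# Quorum size for n validators, computed as ceil(2n/3) with integer math.
qbft_quorum() {
  local n=$1
  echo $(( (2 * n + 2) / 3 ))
}

# Count how many of the given heights are within `tolerance` blocks of `head`.
# Arguments: head tolerance height1 height2 ...
count_at_head() {
  local head=$1 tolerance=$2
  shift 2
  local h count=0
  for h in "$@"; do
    if [ $(( head - h )) -le "$tolerance" ]; then
      count=$(( count + 1 ))
    fi
  done
  echo "$count"
}

# Usage (illustrative heights):
#   at_head=$(count_at_head 2547803 0 2547803 2547803 2547803 2547000 2546000)
#   [ "$at_head" -ge "$(qbft_quorum 5)" ] || echo "Only $at_head/5 at head: no quorum"
```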
1. **Validator peer count:** Validators must peer with each other. On a validator:
`pct exec <vmid> -- curl -s http://127.0.0.1:9545/metrics | grep besu_peers_connected_total`
Should be several (e.g. 4+). If 0, check static-nodes / permissions and P2P ports (30303).
2. **Check validator logs** for QBFT/consensus errors:
```bash
ssh root@192.168.11.11 "pct exec 1000 -- journalctl -u besu-validator -n 100 --no-pager" | grep -iE 'qbft|consensus|propos|round|error'
```
3. **Check time sync:** QBFT block periods and round timeouts depend on the system clock; ensure NTP is working on all Proxmox hosts and containers.
4. **Enable INFO logging** (see [CRITICAL_ISSUE_BLOCK_PRODUCTION_STOPPED.md](CRITICAL_ISSUE_BLOCK_PRODUCTION_STOPPED.md) § Enable Verbose Logging) and restart one validator; watch logs for round/proposal messages.
5. **Genesis:** Confirm `config.qbft.blockperiodseconds` (e.g. 2) and the validator set in genesis match the running nodes.
---
## References
- [CRITICAL_ISSUE_BLOCK_PRODUCTION_STOPPED.md](CRITICAL_ISSUE_BLOCK_PRODUCTION_STOPPED.md)
- [SOLUTION_QUORUM_LOSS.md](SOLUTION_QUORUM_LOSS.md) — if fewer than 4/5 validators are running
- Script: `scripts/maintenance/fix-block-production-staggered-restart.sh`