proxmox/docs/06-besu/FIX_BLOCK_PRODUCTION_RUNBOOK.md

Fix Block Production — Runbook

Last Updated: 2026-03-04
When: Block production is stalled on Chain 138 (no new blocks; validators active).


1. Confirm the problem

# Block not advancing (run twice, 10s apart)
cast block-number --rpc-url http://192.168.11.211:8545
sleep 10
cast block-number --rpc-url http://192.168.11.211:8545
# If same → stalled
./scripts/monitoring/monitor-blockchain-health.sh
# Look for: "Block production stalled (no new blocks in 5s)"
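
For scripting, the two-sample comparison above can be wrapped in a small helper. This is only a sketch: the function name block_status is hypothetical (it is not part of the repo scripts), and it takes two block numbers that have already been fetched rather than calling the RPC itself.

```shell
# block_status BEFORE AFTER
# Prints "stalled" if the height did not advance, otherwise "advancing (+N)".
# Hypothetical helper for illustration; not part of the repo scripts.
block_status() {
  before=$1
  after=$2
  if [ "$after" -gt "$before" ]; then
    echo "advancing (+$(( after - before )))"
  else
    echo "stalled"
  fi
}

# Example with fixed numbers (no RPC call is made here):
block_status 2547803 2547803
block_status 2547803 2547808

# With the RPC endpoint from this runbook it could be used as:
#   b1=$(cast block-number --rpc-url http://192.168.11.211:8545); sleep 10
#   b2=$(cast block-number --rpc-url http://192.168.11.211:8545)
#   block_status "$b1" "$b2"
```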

2. Check validator status and height

All 5 validators (1000-1004) must be active and ideally at chain head:

# Service status (from repo root)
for spec in "1000:192.168.11.11" "1001:192.168.11.11" "1002:192.168.11.11" "1003:192.168.11.10" "1004:192.168.11.10"; do
  IFS=: read -r vmid host <<< "$spec"
  # Keep systemctl's own answer (e.g. "inactive"); fall back to "?" only when nothing came back
  s=$(ssh -o ConnectTimeout=6 root@"$host" "pct exec $vmid -- systemctl is-active besu-validator 2>/dev/null") || s="${s:-?}"
  echo "Validator $vmid: $s"
done

Optional: check block height per validator (metrics on port 9545):

ssh root@192.168.11.11 "pct exec 1000 -- curl -s -m 4 http://127.0.0.1:9545/metrics" | grep -E '^ethereum_best_known_block_number |^besu_blockchain_difficulty_total '
# Should be ~2547803 (chain head)
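
To compare heights across validators it helps to reduce the metrics text to a single integer. The sketch below assumes the metric appears as a plain `name value` line in Prometheus text format and that the value may be rendered with a decimal part (this can vary with the Besu version); the function name best_block and the sample values are illustrative only.

```shell
# best_block: read Prometheus text from stdin and print the value of
# ethereum_best_known_block_number as an integer.
# Hypothetical helper; the sample input below is made up.
best_block() {
  awk '$1 == "ethereum_best_known_block_number" { printf "%d\n", $2; exit }'
}

# Sample input piped through the helper (no SSH or curl involved):
printf '%s\n' \
  '# TYPE ethereum_best_known_block_number gauge' \
  'ethereum_best_known_block_number 2547803.0' \
  'besu_blockchain_difficulty_total 2547804.0' | best_block

# Against a validator it could be used as:
#   ssh root@192.168.11.11 "pct exec 1000 -- curl -s -m 4 http://127.0.0.1:9545/metrics" | best_block
```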

3. Apply fix: staggered restart

Restart validators one at a time so the rest stay at head and the restarted node syncs quickly. This preserves quorum and avoids a situation where every validator is syncing at the same time.

cd /home/intlc/projects/proxmox
./scripts/maintenance/fix-block-production-staggered-restart.sh
  • Dry run: ./scripts/maintenance/fix-block-production-staggered-restart.sh --dry-run
  • Duration: ~7-8 minutes (90s wait after each of 5 restarts + final 30s).
  • Order: 1004 → 1003 → 1002 → 1001 → 1000 (ML110 first, then R630-01).
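
The order and timing above can be sketched as a plan printer. This is not the contents of fix-block-production-staggered-restart.sh, which is not shown here; it only prints the commands such a restart would plausibly run, using the VMID-to-host mapping from step 2 and the 90s/30s waits listed above. The function name print_restart_plan is hypothetical.

```shell
# print_restart_plan: print (do not execute) one restart command per validator
# in the documented order, plus the total wait time.
# Hypothetical sketch; the real logic lives in the repo's restart script.
print_restart_plan() {
  wait_s=90    # pause after each restart (from this runbook)
  final_s=30   # final settle time (from this runbook)
  total=0
  for spec in "1004:192.168.11.10" "1003:192.168.11.10" \
              "1002:192.168.11.11" "1001:192.168.11.11" "1000:192.168.11.11"; do
    vmid=${spec%%:*}   # text before the colon
    host=${spec#*:}    # text after the colon
    echo "ssh root@$host \"pct exec $vmid -- systemctl restart besu-validator\"; sleep $wait_s"
    total=$(( total + wait_s ))
  done
  echo "sleep $final_s"
  total=$(( total + final_s ))
  echo "total wait: ${total}s (~$(( total / 60 )) min)"
}

print_restart_plan
```

The last line of the output gives the total wait, which matches the duration estimate above.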

4. Verify block production

./scripts/monitoring/monitor-blockchain-health.sh
# Expect: "Block production" advancing (block diff > 0 in 5s window)

Or:

watch -n 5 'cast block-number --rpc-url http://192.168.11.211:8545'
# Block number should increase every ~2s (genesis blockperiodseconds=2)
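
To check the observed rate against the expected ~2s block period, two samples and the elapsed time are enough. A minimal sketch, with a hypothetical function name and made-up sample numbers:

```shell
# avg_block_interval START_BLOCK END_BLOCK ELAPSED_SECONDS
# Prints the average seconds per block over the window, or "no new blocks"
# (and returns 1) if the height did not advance.
# Hypothetical helper for illustration.
avg_block_interval() {
  diff=$(( $2 - $1 ))
  if [ "$diff" -le 0 ]; then
    echo "no new blocks"
    return 1
  fi
  awk -v d="$diff" -v t="$3" 'BEGIN { printf "%.1f\n", t / d }'
}

# 30 new blocks in 60 seconds:
avg_block_interval 2547803 2547833 60
```

A result close to 2.0 is consistent with blockperiodseconds=2; a much larger value suggests rounds are timing out before a block is agreed.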

5. If still stalled

Quorum: With 5 validators, QBFT needs 4 at chain head to produce blocks (Besu's quorum is ceil(2N/3), which is 4 for N=5; the number of tolerated faulty validators F is 1). If only 3 are at head (e.g. 1000, 1001, 1002), blocks will not advance until at least one of 1003 or 1004 syncs to head. Check each validator's ethereum_best_known_block_number or besu_blockchain_difficulty_total (metrics on port 9545); all should match the RPC block number.
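
The quorum check can be sketched in a few lines of shell. This assumes Besu's QBFT quorum is ceil(2N/3), which gives 4 for 5 validators; treat that formula as an assumption to confirm against the Besu version in use. The function names and sample heights are hypothetical.

```shell
# qbft_quorum N: print ceil(2N/3) using integer arithmetic.
# Assumption: this matches the quorum rule of the Besu version in use.
qbft_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

# count_at_head HEAD H1 H2 ...: print how many heights are at or above HEAD.
count_at_head() {
  head_block=$1
  shift
  n=0
  for h in "$@"; do
    if [ "$h" -ge "$head_block" ]; then
      n=$(( n + 1 ))
    fi
  done
  echo "$n"
}

# Made-up example: three validators at head, two lagging.
need=$(qbft_quorum 5)
have=$(count_at_head 2547803 2547803 2547803 2547803 2547790 2547791)
if [ "$have" -ge "$need" ]; then
  echo "quorum met ($have/$need at head)"
else
  echo "quorum NOT met ($have/$need at head)"
fi
```

With these sample heights only 3 of the required 4 validators are at head, which is the stalled case described above.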

  1. Validator peer count: Validators must peer with each other. On a validator:
    pct exec <vmid> -- curl -s http://127.0.0.1:9545/metrics | grep besu_peers_connected_total
    Should be several (e.g. 4+). If 0, check static-nodes / permissions and P2P ports (30303).
  2. Check validator logs for QBFT/consensus errors:
    ssh root@192.168.11.11 "pct exec 1000 -- journalctl -u besu-validator -n 100 --no-pager" | grep -iE 'qbft|consensus|propos|round|error'
    
  3. Check time sync: QBFT relies on block timestamps and round timers, so clock drift between validators can stall consensus; ensure NTP is running on all Proxmox hosts and containers.
  4. Enable INFO logging (see CRITICAL_ISSUE_BLOCK_PRODUCTION_STOPPED.md § Enable Verbose Logging) and restart one validator; watch logs for round/proposal messages.
  5. Genesis: Confirm config.qbft.blockperiodseconds (e.g. 2) and validator set in genesis match running nodes.

References