proxmox/docs/06-besu/BLOCKCHAIN_STABILITY_REMEDIATION_PLAN.md

Blockchain Stability Remediation Plan

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation


Date: 2025-01-20
Status: 📋 COMPREHENSIVE PLAN
Priority: 🔴 CRITICAL


Executive Summary

This document outlines a comprehensive remediation plan to ensure blockchain stability, prevent block production failures, resolve stuck transactions, and eliminate faults that cause network disruptions.


Problem Analysis

Issues Identified

  1. Block Production Failures

    • Validators stop without detection
    • Configuration file path mismatches
    • Missing required files (genesis, permissions, static-nodes)
    • Node permissioning conflicts
    • Validators fail to reach consensus
  2. Stuck Transactions

    • Transactions persist in mempool indefinitely
    • Nonce conflicts block subsequent transactions
    • Transaction pool contents persist in the database
    • Network sync/replay re-adds cleared transactions
  3. Configuration Issues

    • File path mismatches (expected vs actual locations)
    • Missing symlinks
    • Invalid TOML files
    • Permissioning misconfigurations
  4. Validator Stability

    • Services crash and restart repeatedly
    • No health monitoring
    • No automatic recovery
    • No alerting for failures
  5. Network Resilience

    • No redundancy checks
    • No automatic failover
    • No consensus health monitoring

Remediation Plan

Phase 1: Configuration Standardization (IMMEDIATE)

1.1 Standardize File Paths

Problem: Validators expect files under /genesis/ and /permissions/, but the files actually live under /etc/besu/ and /config/

Solution:

  • Create standardized directory structure on all validators
  • Use consistent paths across all nodes
  • Create symlinks as fallback, but prefer direct paths

Implementation:

# Standard structure for all validators
/etc/besu/
  ├── genesis.json
  ├── static-nodes.json
  ├── permissions-nodes.toml
  └── permissions-accounts.toml

# Create symlinks for compatibility
/genesis/ -> /etc/besu/
/permissions/ -> /etc/besu/
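
The layout above could be applied with a small helper. This is a minimal sketch, not the project's deployment script: the function name standardize_besu_paths and its optional root-prefix argument are assumptions, added so the layout can be rehearsed in a scratch directory before it touches a validator.

```shell
# Hypothetical helper: create /etc/besu plus the compatibility symlinks.
# $1 is an optional root prefix (empty means the real filesystem).
standardize_besu_paths() {
    root="${1:-}"
    mkdir -p "$root/etc/besu" || return 1
    for name in genesis permissions; do
        link="$root/$name"
        # Refuse to clobber a real directory; it may still hold live files.
        if [ -e "$link" ] && [ ! -L "$link" ]; then
            echo "skip: $link exists and is not a symlink" >&2
            return 1
        fi
        # -n replaces an existing symlink instead of descending into it.
        ln -sfn "$root/etc/besu" "$link" || return 1
    done
}
```

Calling the function with no argument targets the real filesystem and needs root. Calling it with a temporary directory gives a dry run. Existing files under /genesis/ or /permissions/ still have to be moved by hand, which is why the helper stops when it finds a real directory.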

Action Items:

  • Create deployment script to standardize paths on all validators
  • Update Besu config files to use standardized paths
  • Remove dependency on symlinks
  • Test on all validators

1.2 Fix Configuration Files

Problem: Invalid TOML files, missing required sections

Solution:

  • Validate all TOML files before deployment
  • Create proper empty configurations (not just comments)
  • Ensure all required sections exist

Implementation:

# Proper empty permissions-accounts.toml
accounts-allowlist=[]

# Proper empty permissions-nodes.toml (if needed)
nodes-allowlist=[]
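
A validation step along these lines could run before deployment. It is a sketch under stated assumptions: has_allowlist_key is a hypothetical name, and the check is a pattern match for an active (uncommented) key, not a full TOML parse.

```shell
# Hypothetical check: succeed only if FILE has an uncommented `KEY=[...` line.
# $1 = file path, $2 = key name (for example accounts-allowlist).
has_allowlist_key() {
    file="$1"; key="$2"
    [ -r "$file" ] || { echo "missing: $file" >&2; return 1; }
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\[" "$file" || {
        echo "no active '${key}' entry in $file" >&2
        return 1
    }
}
```

For example, `has_allowlist_key /etc/besu/permissions-accounts.toml accounts-allowlist` fails for a file that contains only comments, which is the case this section sets out to fix.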

Action Items:

  • Create validation script for all config files
  • Fix permissions-accounts.toml on all validators
  • Fix permissions-nodes.toml on all validators
  • Add config validation to deployment process

1.3 Disable Problematic Permissioning

Problem: Node permissioning blocks static nodes from connecting

Solution:

  • Disable node permissioning for development/stability
  • OR: Properly configure allowlist with all static nodes
  • Use account permissioning only if needed

Implementation:

# config-validator.toml
permissions-nodes-config-file-enabled=false  # Disable node permissioning
permissions-accounts-config-file-enabled=true  # Keep account permissioning if needed

Action Items:

  • Update all validator configs to disable node permissioning
  • OR: Add all static nodes to allowlist
  • Test validator connectivity
  • Document permissioning strategy

Phase 2: Validator Health Monitoring (CRITICAL)

2.1 Health Check Script

Problem: No monitoring of validator health

Solution: Create comprehensive health check script

Implementation:

#!/usr/bin/env bash
# check-validator-health.sh

# Check service status
# Check if validator is producing blocks
# Check if validator is synced
# Check for errors in logs
# Check peer connections
# Check consensus participation
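
One possible shape for the first few checks in that outline is sketched below. It covers service status, block height, peer count, and sync state only; log scanning and consensus participation are left out. The endpoint URL, the unit name besu-validator, and all function names are assumptions. The JSON-RPC methods used (eth_blockNumber, net_peerCount, eth_syncing) are standard Ethereum calls.

```shell
RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"   # assumed local JSON-RPC endpoint

# POST a parameterless JSON-RPC request and print the raw response.
rpc_call() {
    curl -s -m 5 -H 'Content-Type: application/json' \
        -d "{\"jsonrpc\":\"2.0\",\"method\":\"$1\",\"params\":[],\"id\":1}" \
        "$RPC_URL"
}

# Extract a quoted hex "result" value from a response and print it as decimal.
result_to_dec() {
    hex="$(printf '%s' "$1" |
        sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')"
    [ -n "$hex" ] || return 1
    printf '%d\n' "$hex"
}

# Exit status: 0 healthy, 1 degraded (no peers or still syncing), 2 critical.
check_validator_health() {
    service="${1:-besu-validator}"   # assumed systemd unit name
    systemctl is-active --quiet "$service" ||
        { echo "CRITICAL: $service not active"; return 2; }
    block="$(result_to_dec "$(rpc_call eth_blockNumber)")" ||
        { echo "CRITICAL: RPC unreachable"; return 2; }
    peers="$(result_to_dec "$(rpc_call net_peerCount)")" || peers=0
    case "$(rpc_call eth_syncing | tr -d '[:space:]')" in
        *'"result":false'*) syncing=no ;;
        *) syncing=yes ;;
    esac
    echo "block=$block peers=$peers syncing=$syncing"
    [ "$peers" -gt 0 ] && [ "$syncing" = no ]
}
```

A cron entry or monitoring agent could call `check_validator_health` and alert on any non-zero exit status.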

Action Items:

  • Create health check script
  • Deploy to all validators
  • Set up cron job (every 1-2 minutes)
  • Configure alerting on failures

2.2 Automatic Service Recovery

Problem: Services crash and may not restart properly

Solution: Enhanced systemd service configuration

Implementation:

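[Unit]
# Give up after 5 failed starts within 300 seconds
# (current systemd reads the start limit from [Unit]; StartLimitInterval= under [Service] is the legacy spelling)
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=always
RestartSec=10
# Pre-start prerequisite check and post-start verification hooks
ExecStartPre=/usr/local/bin/check-validator-prerequisites.sh
ExecStartPost=/usr/local/bin/verify-validator-started.sh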
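
The plan does not define what check-validator-prerequisites.sh contains. A minimal sketch of one possible body, assuming it only needs to confirm that the four standardized files exist and are non-empty (the function name check_prerequisites is hypothetical):

```shell
# Hypothetical body for check-validator-prerequisites.sh.
# $1 = config directory (defaults to /etc/besu).
check_prerequisites() {
    dir="${1:-/etc/besu}"
    missing=0
    for f in genesis.json static-nodes.json permissions-nodes.toml permissions-accounts.toml; do
        if [ ! -s "$dir/$f" ]; then
            echo "prerequisite missing or empty: $dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

The script would end with `check_prerequisites /etc/besu || exit 1`. A non-zero exit from an ExecStartPre command makes systemd abort the start, so the validator never launches against an incomplete configuration.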

Action Items:

  • Update systemd service files
  • Add restart policies
  • Add health check hooks
  • Test service recovery

2.3 Validator Status Dashboard

Problem: No visibility into validator status

Solution: Create monitoring dashboard/script

Implementation:

  • Real-time status of all validators
  • Block production rate
  • Consensus participation
  • Error tracking

Action Items:

  • Create status monitoring script
  • Set up regular status reports
  • Create alerting thresholds
  • Document monitoring procedures

Phase 3: Transaction Management (HIGH PRIORITY)

3.1 Transaction Pool Monitoring

Problem: Transactions get stuck in mempool

Solution: Monitor and manage transaction pool

Implementation:

# Monitor transaction pool
# Check for stuck transactions
# Clear stuck transactions automatically
# Alert on transaction pool issues
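
Detection of stuck transactions could look roughly like the sketch below. It only detects; it does not clear anything. It assumes jq is installed, that Besu's TXPOOL API is enabled on the endpoint, and that txpool_besuTransactions reports a hash and an ISO-8601 addedToPoolAt timestamp per entry. Function names and the endpoint URL are hypothetical.

```shell
RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"   # assumed endpoint with TXPOOL enabled

# Print "<hash> <added-epoch-seconds>" for every pooled transaction.
# Assumption: each entry carries "hash" and an ISO-8601 "addedToPoolAt".
list_pool_entries() {
    curl -s -m 5 -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","method":"txpool_besuTransactions","params":[],"id":1}' \
        "$RPC_URL" |
    jq -r '.result[] | [.hash, (.addedToPoolAt | sub("\\.[0-9]+Z$"; "Z") | fromdate | tostring)] | join(" ")'
}

# Read "<hash> <added-epoch-seconds>" lines on stdin and print
# "<hash> <age-seconds>" for entries older than MAX_AGE.
# $1 = MAX_AGE in seconds, $2 = current epoch time (defaults to now).
stale_transactions() {
    max_age="$1"; now="${2:-$(date +%s)}"
    while read -r hash added; do
        [ -n "$hash" ] || continue
        age=$((now - added))
        [ "$age" -gt "$max_age" ] && echo "$hash $age"
    done
    return 0
}
```

For example, `list_pool_entries | stale_transactions 300` lists transactions that have sat in the pool for more than five minutes. The threshold is a placeholder until the timeout policy in 3.3 is defined.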

Action Items:

  • Create transaction pool monitoring script
  • Implement automatic stuck transaction detection
  • Create transaction pool cleanup procedures
  • Set up alerts for stuck transactions

3.2 Nonce Management

Problem: Nonce conflicts block transactions

Solution: Proper nonce tracking and management

Implementation:

  • Track latest vs pending nonces
  • Detect nonce gaps
  • Automatically handle nonce conflicts
  • Provide nonce skip functionality
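
Tracking latest versus pending nonces can be sketched with eth_getTransactionCount, which accepts both the latest and pending block tags. The function names and endpoint URL below are assumptions. The difference is the number of transactions from the account that the node counts as pending but not yet mined; a value that stays above zero across several block periods suggests a stuck transaction.

```shell
RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"   # assumed local JSON-RPC endpoint

# Print the hex nonce for an account. $1 = address, $2 = latest | pending.
nonce_at() {
    curl -s -m 5 -H 'Content-Type: application/json' \
        -d "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getTransactionCount\",\"params\":[\"$1\",\"$2\"],\"id\":1}" \
        "$RPC_URL" |
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p'
}

# Print pending nonce minus latest nonce. Accepts hex (0x...) or decimal input.
pending_nonce_depth() {
    latest="$(printf '%d' "$1")" || return 1
    pending="$(printf '%d' "$2")" || return 1
    echo $((pending - latest))
}
```

Usage, for a hypothetical address held in ADDR: `pending_nonce_depth "$(nonce_at "$ADDR" latest)" "$(nonce_at "$ADDR" pending)"`.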

Action Items:

  • Create nonce monitoring script
  • Implement nonce conflict detection
  • Create nonce skip utilities
  • Document nonce management procedures

3.3 Transaction Timeout Handling

Problem: Transactions can wait indefinitely

Solution: Implement transaction timeouts

Implementation:

  • Set maximum transaction age
  • Automatically cancel/retry old transactions
  • Alert on transactions exceeding timeout

Action Items:

  • Define transaction timeout policy
  • Implement timeout detection
  • Create automatic cleanup
  • Document timeout procedures

Phase 4: Block Production Stability (CRITICAL)

4.1 Consensus Health Monitoring

Problem: No monitoring of consensus health

Solution: Monitor QBFT consensus status

Implementation:

# Check validator participation
# Monitor block production rate
# Detect consensus failures
# Alert on consensus issues
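
Validator participation could be checked with Besu's qbft_getSignerMetrics call, as in the sketch below. It assumes the QBFT API is enabled on the endpoint, that jq is installed, and that each result entry carries an address and a hex lastProposedBlockNumber. The function names and the window value are hypothetical.

```shell
RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"   # assumed endpoint with the QBFT API enabled

# Print "<address> <last-proposed-block-hex>" for each validator.
signer_metrics() {
    curl -s -m 5 -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","method":"qbft_getSignerMetrics","params":[],"id":1}' \
        "$RPC_URL" |
    jq -r '.result[] | "\(.address) \(.lastProposedBlockNumber)"'
}

# Read "<address> <last-proposed-block>" lines on stdin (hex or decimal) and
# print validators whose last proposal is more than WINDOW blocks behind TIP.
# $1 = TIP (decimal chain height), $2 = WINDOW (blocks).
silent_validators() {
    tip="$1"; window="$2"
    while read -r addr last; do
        [ -n "$addr" ] || continue
        last_dec="$(printf '%d' "$last")" || continue
        [ $((tip - last_dec)) -gt "$window" ] && echo "$addr"
    done
    return 0
}
```

Assuming round-robin proposer rotation, each of 5 validators should propose about once every 5 blocks, so a window of 15 leaves room for a couple of missed turns: `signer_metrics | silent_validators "$tip" 15`.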

Action Items:

  • Create consensus monitoring script
  • Monitor block production rate
  • Detect when consensus fails
  • Set up alerts for consensus issues

4.2 Validator Quorum Monitoring

Problem: No monitoring of validator quorum

Solution: Monitor active validator count

Implementation:

  • Check how many validators are active
  • Verify minimum quorum (QBFT needs at least ceil(2N/3) validators online: 4 of 5, so only one validator may be down)
  • Alert if quorum is lost
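
The quorum threshold follows from QBFT's rule that at least two thirds of validators, rounded up, must take part in each round. A small sketch of that arithmetic, with hypothetical function names:

```shell
# Minimum validators QBFT needs online: ceil(2N/3), using integer arithmetic.
# $1 = total validator count N.
required_quorum() {
    echo $(( (2 * $1 + 2) / 3 ))
}

# Succeed when the active count meets the quorum. $1 = active, $2 = total.
has_quorum() {
    [ "$1" -ge "$(required_quorum "$2")" ]
}
```

For the 5-validator network described here, `required_quorum 5` prints 4, so `has_quorum 3 5` fails and should raise the quorum-lost alert.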

Action Items:

  • Create quorum monitoring script
  • Set up quorum alerts
  • Document quorum requirements
  • Create recovery procedures

4.3 Block Production Rate Monitoring

Problem: No detection of stalled block production

Solution: Monitor block production continuously

Implementation:

  • Track block number progression
  • Detect when blocks stop advancing
  • Alert on block production stalls
  • Automatic recovery attempts
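
Stall detection can be as simple as comparing the current height with the height recorded by the previous run. The sketch below assumes a periodic caller such as cron, a writable state file, and a decimal block height obtained elsewhere (for example from eth_blockNumber). The function name is hypothetical.

```shell
# Compare the current height with the one saved by the previous run.
# Returns 0 when the chain advanced (or on the first run), 1 when it did not.
# $1 = current block height (decimal), $2 = path to the state file.
block_advanced() {
    current="$1"; state_file="$2"
    previous=""
    [ -r "$state_file" ] && previous="$(cat "$state_file")"
    echo "$current" > "$state_file"
    [ -z "$previous" ] || [ "$current" -gt "$previous" ]
}
```

The polling interval needs to be longer than the block period, otherwise two runs inside one block period would report a false stall.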

Action Items:

  • Create block production monitor
  • Set up continuous monitoring
  • Configure alerts for stalls
  • Create recovery procedures

Phase 5: Network Resilience (HIGH PRIORITY)

5.1 Peer Connection Monitoring

Problem: No monitoring of peer connections

Solution: Monitor validator peer connections

Implementation:

  • Check peer count for each validator
  • Verify validators can communicate
  • Alert on peer connection issues

Action Items:

  • Create peer monitoring script
  • Monitor peer connections
  • Set up alerts for connection issues
  • Document peer requirements

5.2 Network Sync Monitoring

Problem: No monitoring of network sync status

Solution: Monitor sync status across network

Implementation:

  • Check if validators are synced
  • Detect sync delays
  • Alert on sync issues
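
Sync delay can be derived from an eth_syncing response, which is either false (in sync) or an object containing currentBlock and highestBlock. The sketch below parses a response passed in as a string; the function name is an assumption.

```shell
# Print how many blocks a node is behind, given a raw eth_syncing response.
# Prints 0 when the node reports it is not syncing; fails on unparseable input.
sync_lag() {
    flat="$(printf '%s' "$1" | tr -d '[:space:]')"
    case "$flat" in
        *'"result":false'*) echo 0; return 0 ;;
    esac
    cur="$(printf '%s' "$flat" |
        sed -n 's/.*"currentBlock":"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')"
    high="$(printf '%s' "$flat" |
        sed -n 's/.*"highestBlock":"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')"
    [ -n "$cur" ] && [ -n "$high" ] || return 1
    echo $(( $(printf '%d' "$high") - $(printf '%d' "$cur") ))
}
```

An alert threshold on the printed lag (for example, more than a handful of blocks for several consecutive checks) would still need to be chosen as part of the alerting work in Phase 7.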

Action Items:

  • Create sync monitoring script
  • Monitor sync status
  • Set up alerts for sync issues
  • Document sync procedures

5.3 Redundancy and Failover

Problem: No redundancy for critical components

Solution: Implement redundancy where possible

Implementation:

  • Multiple RPC nodes
  • Validator redundancy (5 validators already deployed; QBFT tolerates one being down)
  • Backup configurations

Action Items:

  • Document redundancy strategy
  • Implement failover procedures
  • Test failover scenarios
  • Document recovery procedures

Phase 6: Automated Recovery (CRITICAL)

6.1 Automatic Validator Restart

Problem: Validators stop and don't restart properly

Solution: Enhanced auto-restart with health checks

Implementation:

  • Systemd restart policies
  • Health check before restart
  • Escalation if restart fails

Action Items:

  • Update systemd services
  • Add health checks
  • Test restart procedures
  • Document restart policies

6.2 Automatic Configuration Fix

Problem: Configuration issues cause failures

Solution: Automatic configuration validation and fix

Implementation:

  • Validate configuration on startup
  • Automatically fix common issues
  • Alert on unfixable issues

Action Items:

  • Create config validation script
  • Implement auto-fix for common issues
  • Test auto-fix procedures
  • Document manual fix procedures

6.3 Automatic Transaction Pool Cleanup

Problem: Stuck transactions block new transactions

Solution: Automatic detection and cleanup

Implementation:

  • Monitor transaction pool
  • Detect stuck transactions
  • Automatically clear if needed
  • Alert on cleanup actions

Action Items:

  • Create transaction pool cleanup script
  • Implement automatic cleanup
  • Set up alerts
  • Document cleanup procedures

Phase 7: Monitoring and Alerting (HIGH PRIORITY)

7.1 Comprehensive Monitoring System

Problem: No centralized monitoring

Solution: Implement comprehensive monitoring

Components:

  • Validator health
  • Block production
  • Transaction pool
  • Network status
  • Consensus health

Action Items:

  • Design monitoring architecture
  • Implement monitoring scripts
  • Set up data collection
  • Create monitoring dashboard

7.2 Alerting System

Problem: No alerts for critical issues

Solution: Implement alerting for all critical metrics

Alerts Needed:

  • Validator service down
  • Block production stalled
  • Consensus failure
  • Transaction pool issues
  • Network connectivity issues

Action Items:

  • Define alert thresholds
  • Implement alerting system
  • Configure alert channels
  • Test alerting system

7.3 Logging and Diagnostics

Problem: Insufficient logging for diagnostics

Solution: Enhanced logging and diagnostics

Implementation:

  • Structured logging
  • Log aggregation
  • Error tracking
  • Performance metrics

Action Items:

  • Configure enhanced logging
  • Set up log aggregation
  • Create diagnostic tools
  • Document logging procedures

Phase 8: Preventive Measures (ONGOING)

8.1 Pre-Deployment Validation

Problem: Issues discovered after deployment

Solution: Validate everything before deployment

Checks:

  • Configuration files valid
  • Required files present
  • Services can start
  • Network connectivity
  • Consensus can be reached

Action Items:

  • Create pre-deployment validation script
  • Run validation before all deployments
  • Document validation procedures
  • Integrate into deployment process

8.2 Regular Health Audits

Problem: Issues accumulate over time

Solution: Regular comprehensive health audits

Audit Areas:

  • Validator health
  • Configuration consistency
  • Network status
  • Transaction pool health
  • Consensus health

Action Items:

  • Create health audit script
  • Schedule regular audits (daily/weekly)
  • Document audit procedures
  • Create audit reports

8.3 Change Management

Problem: Changes cause unexpected issues

Solution: Proper change management process

Process:

  • Test changes in non-production
  • Validate before applying
  • Rollback procedures
  • Change documentation

Action Items:

  • Create change management process
  • Document change procedures
  • Create rollback procedures
  • Test change process

Implementation Priority

🔴 CRITICAL - Immediate (Week 1)

  1. Configuration standardization (Phase 1)
  2. Validator health monitoring (Phase 2.1, 2.2)
  3. Block production monitoring (Phase 4.1, 4.2, 4.3)
  4. Automatic recovery (Phase 6.1, 6.2)

🟠 HIGH PRIORITY - Short Term (Week 2-3)

  1. Transaction management (Phase 3)
  2. Network resilience (Phase 5)
  3. Monitoring and alerting (Phase 7)

🟡 MEDIUM PRIORITY - Medium Term (Week 4+)

  1. Preventive measures (Phase 8)
  2. Advanced monitoring
  3. Performance optimization

Success Criteria

Stability Metrics

  • Block Production Uptime: > 99.9%
  • Validator Availability: > 99.5%
  • Transaction Confirmation Time: < 30 seconds
  • Mean Time to Recovery (MTTR): < 5 minutes

Monitoring Coverage

  • All validators monitored
  • Block production monitored
  • Transaction pool monitored
  • Consensus health monitored
  • Network status monitored

Alerting Coverage

  • Critical issues trigger an alert within 1 minute
  • All validators have alerting
  • All RPC nodes have alerting
  • Block production alerts configured

Risk Mitigation

Identified Risks

  1. Configuration Drift: Validators get out of sync
  2. Silent Failures: Issues not detected
  3. Cascading Failures: One issue causes others
  4. Human Error: Manual mistakes cause issues

Mitigation Strategies

  1. Automated Configuration Management: Prevent drift
  2. Comprehensive Monitoring: Detect issues early
  3. Isolation: Prevent cascading failures
  4. Automation: Reduce human error

Documentation Requirements

Required Documentation

  1. Deployment Procedures: Step-by-step deployment guides
  2. Configuration Reference: All configuration options documented
  3. Troubleshooting Guide: Common issues and solutions
  4. Monitoring Guide: How to monitor and interpret metrics
  5. Recovery Procedures: Step-by-step recovery guides

Next Steps

Immediate Actions

  1. Create configuration standardization script
  2. Create validator health check script
  3. Create block production monitor
  4. Update systemd service files
  5. Create monitoring dashboard

Short-term Actions

  1. Implement transaction pool monitoring
  2. Set up alerting system
  3. Create recovery automation
  4. Document all procedures

Status: Comprehensive plan created
Priority: Implement critical items immediately
Timeline: Phased implementation over 4+ weeks