Blockchain Stability Remediation Plan
Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation
Date: 2025-01-20
Status: 📋 COMPREHENSIVE PLAN
Priority: 🔴 CRITICAL
Executive Summary
This document outlines a comprehensive remediation plan to ensure blockchain stability, prevent block production failures, resolve stuck transactions, and eliminate faults that cause network disruptions.
Problem Analysis
Issues Identified
- Block Production Failures
  - Validators stop without detection
  - Configuration file path mismatches
  - Missing required files (genesis, permissions, static-nodes)
  - Node permissioning conflicts
  - Validators fail to reach consensus
- Stuck Transactions
  - Transactions persist in mempool indefinitely
  - Nonce conflicts block subsequent transactions
  - Transaction pool database persistence
  - Network sync/replay re-adds cleared transactions
- Configuration Issues
  - File path mismatches (expected vs actual locations)
  - Missing symlinks
  - Invalid TOML files
  - Permissioning misconfigurations
- Validator Stability
  - Services crash and restart repeatedly
  - No health monitoring
  - No automatic recovery
  - No alerting for failures
- Network Resilience
  - No redundancy checks
  - No automatic failover
  - No consensus health monitoring
Remediation Plan
Phase 1: Configuration Standardization ✅ IMMEDIATE
1.1 Standardize File Paths
Problem: Validators expect files at /genesis/, /permissions/, but files are at /etc/besu/, /config/
Solution:
- Create standardized directory structure on all validators
- Use consistent paths across all nodes
- Create symlinks as fallback, but prefer direct paths
Implementation:
# Standard structure for all validators
/etc/besu/
├── genesis.json
├── static-nodes.json
├── permissions-nodes.toml
└── permissions-accounts.toml
# Create symlinks for compatibility
/genesis/ -> /etc/besu/
/permissions/ -> /etc/besu/
Action Items:
- Create deployment script to standardize paths on all validators
- Update Besu config files to use standardized paths
- Remove dependency on symlinks
- Test on all validators
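The standardization step could be scripted roughly as follows. This is a minimal sketch, not the deployment script itself: the function name `standardize_besu_paths` and the optional `root` prefix (used for dry runs in a scratch directory) are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: copy the required Besu files into <root>/etc/besu and create the
# compatibility symlinks <root>/genesis and <root>/permissions.
# The function name and the <root> prefix argument are illustrative assumptions.

REQUIRED_FILES="genesis.json static-nodes.json permissions-nodes.toml permissions-accounts.toml"

standardize_besu_paths() {
    local src_dir="$1"      # directory that currently holds the files
    local root="${2:-}"     # optional prefix for dry runs; empty means the real /
    local target="$root/etc/besu"
    local f missing=0

    mkdir -p "$target" || return 1
    for f in $REQUIRED_FILES; do
        if [ -f "$src_dir/$f" ]; then
            cp "$src_dir/$f" "$target/$f" || return 1
        elif [ ! -f "$target/$f" ]; then
            echo "MISSING: $f" >&2
            missing=1
        fi
    done

    # Fallback symlinks; -n replaces an existing link instead of following it.
    # A real directory already sitting at these paths must be moved away first.
    ln -sfn "$target" "$root/genesis"
    ln -sfn "$target" "$root/permissions"
    return "$missing"
}

# Example dry run (paths are placeholders):
#   standardize_besu_paths /config /tmp/besu-dry-run
```

The function returns non-zero when any required file is absent from both the source and the target, so a deployment wrapper can stop before restarting Besu.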
1.2 Fix Configuration Files
Problem: Invalid TOML files, missing required sections
Solution:
- Validate all TOML files before deployment
- Create proper empty configurations (not just comments)
- Ensure all required sections exist
Implementation:
# Proper empty permissions-accounts.toml
accounts-allowlist=[]
# Proper empty permissions-nodes.toml (if needed)
nodes-allowlist=[]
Action Items:
- Create validation script for all config files
- Fix permissions-accounts.toml on all validators
- Fix permissions-nodes.toml on all validators
- Add config validation to deployment process
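A validation script for the permissions files might start from a check like the one below. It is a line-based sketch under the assumption that a "proper" file has an uncommented allowlist key with a list value; it is not a TOML parser, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: reject permissions files that are missing, empty, or comment-only.
# Line-based check only, not a full TOML parser; names are illustrative.

has_allowlist_key() {   # has_allowlist_key <file> <key>
    local file="$1" key="$2"
    [ -f "$file" ] || return 1
    # Match an uncommented "key = [" at the start of a line
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*\[" "$file"
}

validate_permissions_files() {   # validate_permissions_files <config_dir>
    local dir="$1" rc=0
    has_allowlist_key "$dir/permissions-accounts.toml" "accounts-allowlist" \
        || { echo "INVALID: permissions-accounts.toml" >&2; rc=1; }
    has_allowlist_key "$dir/permissions-nodes.toml" "nodes-allowlist" \
        || { echo "INVALID: permissions-nodes.toml" >&2; rc=1; }
    return "$rc"
}

# Example: validate_permissions_files /etc/besu || exit 1
```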
1.3 Disable Problematic Permissioning
Problem: Node permissioning blocks static nodes from connecting
Solution:
- Disable node permissioning for development/stability
- OR: Properly configure allowlist with all static nodes
- Use account permissioning only if needed
Implementation:
# config-validator.toml
permissions-nodes-config-file-enabled=false # Disable node permissioning
permissions-accounts-config-file-enabled=true # Keep account permissioning if needed
Action Items:
- Update all validator configs to disable node permissioning
- OR: Add all static nodes to allowlist
- Test validator connectivity
- Document permissioning strategy
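If node permissioning stays enabled, the "add all static nodes to allowlist" option can be checked mechanically. The sketch below lists every enode URL found in `static-nodes.json` that does not appear verbatim in `permissions-nodes.toml`. The function name is hypothetical, and the exact-string comparison is a simplifying assumption; it may not match how Besu itself compares node identities.

```bash
#!/usr/bin/env bash
# Sketch: print static nodes that are absent from the node allowlist.
# Compares enode URLs as plain strings (an assumption, see lead-in).

missing_from_allowlist() {   # missing_from_allowlist <static-nodes.json> <permissions-nodes.toml>
    grep -o 'enode://[^"]*' "$1" | while IFS= read -r enode; do
        grep -qF -- "$enode" "$2" || echo "$enode"
    done
}

# Example:
#   missing=$(missing_from_allowlist /etc/besu/static-nodes.json /etc/besu/permissions-nodes.toml)
#   [ -z "$missing" ] || echo "Not allowlisted: $missing"
```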
Phase 2: Validator Health Monitoring ✅ CRITICAL
2.1 Health Check Script
Problem: No monitoring of validator health
Solution: Create comprehensive health check script
Implementation:
#!/usr/bin/env bash
# check-validator-health.sh
# Check service status
# Check if validator is producing blocks
# Check if validator is synced
# Check for errors in logs
# Check peer connections
# Check consensus participation
Action Items:
- Create health check script
- Deploy to all validators
- Set up cron job (every 1-2 minutes)
- Configure alerting on failures
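As a starting point, the RPC-based part of the health check could look like the sketch below. It covers block production, sync status, and peer count using the standard JSON-RPC methods `eth_blockNumber`, `eth_syncing`, and `net_peerCount`; service status, log scanning, and consensus participation are left out. The RPC URL, the `BLOCK_WAIT` and `MIN_PEERS` defaults, and all helper names are assumptions.

```bash
#!/usr/bin/env bash
# Sketch of the RPC checks in check-validator-health.sh.
# RPC_URL, BLOCK_WAIT, MIN_PEERS and the helper names are assumptions.

RPC_URL="${RPC_URL:-http://127.0.0.1:8545}"

rpc_call() {      # rpc_call <method> -> raw JSON response
    curl -s -m 5 -X POST -H 'Content-Type: application/json' \
        --data "{\"jsonrpc\":\"2.0\",\"method\":\"$1\",\"params\":[],\"id\":1}" \
        "$RPC_URL"
}

json_result() {   # read a JSON-RPC response on stdin, print a scalar "result"
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\{0,1\}\([^",}]*\)"\{0,1\}.*/\1/p'
}

hex_to_dec() { printf '%d\n' "$1"; }   # 0x1a -> 26

check_health() {
    local before after peers syncing
    before=$(rpc_call eth_blockNumber | json_result)
    [ -n "$before" ] || { echo "FAIL: RPC not answering"; return 1; }
    sleep "${BLOCK_WAIT:-10}"          # should exceed the block period
    after=$(rpc_call eth_blockNumber | json_result)
    peers=$(rpc_call net_peerCount | json_result)
    syncing=$(rpc_call eth_syncing | json_result)
    [ "$(hex_to_dec "$after")" -gt "$(hex_to_dec "$before")" ] \
        || { echo "FAIL: chain head not advancing"; return 1; }
    [ "$(hex_to_dec "$peers")" -ge "${MIN_PEERS:-1}" ] \
        || { echo "FAIL: too few peers"; return 1; }
    # eth_syncing returns false when synced and an object while syncing
    [ "$syncing" = "false" ] || { echo "FAIL: node still syncing"; return 1; }
    echo "OK"
}

# Example cron usage: check_health || <send alert>
```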
2.2 Automatic Service Recovery
Problem: Services crash and may not restart properly
Solution: Enhanced systemd service configuration
Implementation:
[Unit]
# Stop restarting after 5 failed starts within 300 seconds
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=always
RestartSec=10
# Health check hooks run before and after the main process starts
ExecStartPre=/usr/local/bin/check-validator-prerequisites.sh
ExecStartPost=/usr/local/bin/verify-validator-started.sh
Action Items:
- Update systemd service files
- Add restart policies
- Add health check hooks
- Test service recovery
2.3 Validator Status Dashboard
Problem: No visibility into validator status
Solution: Create monitoring dashboard/script
Implementation:
- Real-time status of all validators
- Block production rate
- Consensus participation
- Error tracking
Action Items:
- Create status monitoring script
- Set up regular status reports
- Create alerting thresholds
- Document monitoring procedures
Phase 3: Transaction Management ✅ HIGH PRIORITY
3.1 Transaction Pool Monitoring
Problem: Transactions get stuck in mempool
Solution: Monitor and manage transaction pool
Implementation:
# Monitor transaction pool
# Check for stuck transactions
# Clear stuck transactions automatically
# Alert on transaction pool issues
Action Items:
- Create transaction pool monitoring script
- Implement automatic stuck transaction detection
- Create transaction pool cleanup procedures
- Set up alerts for stuck transactions
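One simple way to detect stuck transactions is to compare two snapshots of the pool taken some interval apart: a hash present in both has been waiting at least that long. The sketch below assumes each snapshot is a file with one transaction hash per line. How the snapshots are produced is not shown; Besu's `txpool_besuTransactions` method is one possible source if the TXPOOL API is enabled. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: hashes present in both pool snapshots are treated as stuck.
# Snapshot files hold one transaction hash per line (an assumption).

stuck_hashes() {   # stuck_hashes <previous_snapshot> <current_snapshot>
    local prev cur
    prev=$(mktemp) || return 1
    cur=$(mktemp) || { rm -f "$prev"; return 1; }
    sort -u "$1" > "$prev"
    sort -u "$2" > "$cur"
    comm -12 "$prev" "$cur"     # print only lines common to both snapshots
    rm -f "$prev" "$cur"
}

# Example: stuck_hashes /var/tmp/pool.prev /var/tmp/pool.now
```

The snapshot interval then doubles as the "stuck" threshold, which keeps the detection free of timestamp parsing.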
3.2 Nonce Management
Problem: Nonce conflicts block transactions
Solution: Proper nonce tracking and management
Implementation:
- Track latest vs pending nonces
- Detect nonce gaps
- Automatically handle nonce conflicts
- Provide nonce skip functionality
Action Items:
- Create nonce monitoring script
- Implement nonce conflict detection
- Create nonce skip utilities
- Document nonce management procedures
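Tracking latest versus pending nonces can be sketched with `eth_getTransactionCount`, which accepts the `latest` and `pending` block tags. The difference between the two counts is the number of transactions from that account still waiting in the pool. The RPC URL and function names below are assumptions, and this measures a backlog only; it does not by itself locate a gap in the nonce sequence.

```bash
#!/usr/bin/env bash
# Sketch: measure an account's pending-transaction backlog.
# RPC_URL and the function names are assumptions.

get_nonce() {   # get_nonce <address> <latest|pending> -> hex count
    curl -s -m 5 -X POST -H 'Content-Type: application/json' \
        --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getTransactionCount\",\"params\":[\"$1\",\"$2\"],\"id\":1}" \
        "${RPC_URL:-http://127.0.0.1:8545}" \
        | sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

nonce_backlog() {   # nonce_backlog <latest_hex> <pending_hex> -> pending - latest
    echo $(( $(printf '%d' "$2") - $(printf '%d' "$1") ))
}

# Example (address is a placeholder):
#   latest=$(get_nonce "$ADDR" latest); pending=$(get_nonce "$ADDR" pending)
#   [ "$(nonce_backlog "$latest" "$pending")" -eq 0 ] || echo "transactions waiting"
```

A backlog that stays above zero across several checks is the signal worth alerting on; a single non-zero reading is normal while a transaction waits for the next block.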
3.3 Transaction Timeout Handling
Problem: Transactions can wait indefinitely
Solution: Implement transaction timeouts
Implementation:
- Set maximum transaction age
- Automatically cancel/retry old transactions
- Alert on transactions exceeding timeout
Action Items:
- Define transaction timeout policy
- Implement timeout detection
- Create automatic cleanup
- Document timeout procedures
Phase 4: Block Production Stability ✅ CRITICAL
4.1 Consensus Health Monitoring
Problem: No monitoring of consensus health
Solution: Monitor QBFT consensus status
Implementation:
# Check validator participation
# Monitor block production rate
# Detect consensus failures
# Alert on consensus issues
Action Items:
- Create consensus monitoring script
- Monitor block production rate
- Detect when consensus fails
- Set up alerts for consensus issues
4.2 Validator Quorum Monitoring
Problem: No monitoring of validator quorum
Solution: Monitor active validator count
Implementation:
- Check how many validators are active
- Verify minimum quorum (4 of 5 validators for QBFT, which needs at least two-thirds of validators active)
- Alert if quorum is lost
Action Items:
- Create quorum monitoring script
- Set up quorum alerts
- Document quorum requirements
- Create recovery procedures
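The quorum arithmetic can be captured in a few helper functions. This sketch assumes the two-thirds rule usually stated for QBFT: with N validators, at least ceil(2N/3) must be active, and up to floor((N-1)/3) may fail. The function names are hypothetical, and how the active count is obtained (for example by probing each validator's RPC endpoint, with the validator set taken from `qbft_getValidatorsByBlockNumber`) is left to the monitoring script.

```bash
#!/usr/bin/env bash
# Sketch: quorum arithmetic under the two-thirds rule (an assumption stated
# in the lead-in). Function names are illustrative.

required_quorum() { echo $(( (2 * $1 + 2) / 3 )); }   # ceil(2N/3)
max_faulty()      { echo $(( ($1 - 1) / 3 )); }       # floor((N-1)/3)

quorum_ok() {   # quorum_ok <total_validators> <active_validators>
    [ "$2" -ge "$(required_quorum "$1")" ]
}

# Example: quorum_ok 5 "$active" || echo "ALERT: quorum lost"
```

For a 5-validator network this gives a required quorum of 4 and a tolerance of 1 failed validator, so losing a second validator halts block production.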
4.3 Block Production Rate Monitoring
Problem: No detection of stalled block production
Solution: Monitor block production continuously
Implementation:
- Track block number progression
- Detect when blocks stop advancing
- Alert on block production stalls
- Automatic recovery attempts
Action Items:
- Create block production monitor
- Set up continuous monitoring
- Configure alerts for stalls
- Create recovery procedures
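Stall detection needs some memory between runs. The sketch below keeps the last seen block number and the time it changed in a small state file, and reports a stall once the head has not advanced for longer than a threshold. The state file layout, the function name, and the threshold are assumptions; the current block number would come from `eth_blockNumber`, converted to decimal.

```bash
#!/usr/bin/env bash
# Sketch: detect a stalled chain head across repeated invocations.
# State file format: "<last_block> <epoch_when_it_last_advanced>" (assumed).

check_block_progress() {   # <state_file> <current_block_dec> <now_epoch> <max_idle_seconds>
    local state="$1" block="$2" now="$3" max_idle="$4"
    local last_block=-1 last_ts="$3"
    [ -f "$state" ] && read -r last_block last_ts < "$state"
    if [ "$block" -gt "$last_block" ]; then
        echo "$block $now" > "$state"      # head advanced: record it
        echo "OK"
        return 0
    fi
    if [ $(( now - last_ts )) -gt "$max_idle" ]; then
        echo "STALLED"
        return 1
    fi
    echo "OK"                              # unchanged, but still within tolerance
}

# Example: check_block_progress /var/tmp/block.state "$block" "$(date +%s)" 30
```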
Phase 5: Network Resilience ✅ HIGH PRIORITY
5.1 Peer Connection Monitoring
Problem: No monitoring of peer connections
Solution: Monitor validator peer connections
Implementation:
- Check peer count for each validator
- Verify validators can communicate
- Alert on peer connection issues
Action Items:
- Create peer monitoring script
- Monitor peer connections
- Set up alerts for connection issues
- Document peer requirements
5.2 Network Sync Monitoring
Problem: No monitoring of network sync status
Solution: Monitor sync status across network
Implementation:
- Check if validators are synced
- Detect sync delays
- Alert on sync issues
Action Items:
- Create sync monitoring script
- Monitor sync status
- Set up alerts for sync issues
- Document sync procedures
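Sync delays across the network can be spotted by comparing the block height reported by each validator. The helper below is a sketch that takes decimal heights as arguments and prints the gap between the highest and the lowest; the function name and the idea of alerting on a fixed gap are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: spread between the highest and lowest reported block heights.
# Heights are decimal integers, one per argument.

height_spread() {
    local min="$1" max="$1" h
    for h in "$@"; do
        [ "$h" -lt "$min" ] && min="$h"
        [ "$h" -gt "$max" ] && max="$h"
    done
    echo $(( max - min ))
}

# Example: [ "$(height_spread 1200 1200 1199 1185 1200)" -le 5 ] || echo "ALERT: a validator is lagging"
```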
5.3 Redundancy and Failover
Problem: No redundancy for critical components
Solution: Implement redundancy where possible
Implementation:
- Multiple RPC nodes
- Validator redundancy (already have 5)
- Backup configurations
Action Items:
- Document redundancy strategy
- Implement failover procedures
- Test failover scenarios
- Document recovery procedures
Phase 6: Automated Recovery ✅ CRITICAL
6.1 Automatic Validator Restart
Problem: Validators stop and don't restart properly
Solution: Enhanced auto-restart with health checks
Implementation:
- Systemd restart policies
- Health check before restart
- Escalation if restart fails
Action Items:
- Update systemd services
- Add health checks
- Test restart procedures
- Document restart policies
6.2 Automatic Configuration Fix
Problem: Configuration issues cause failures
Solution: Automatic configuration validation and fix
Implementation:
- Validate configuration on startup
- Automatically fix common issues
- Alert on unfixable issues
Action Items:
- Create config validation script
- Implement auto-fix for common issues
- Test auto-fix procedures
- Document manual fix procedures
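One of the "common issues" that lends itself to an automatic fix is a permissions file that is missing or contains only comments. The sketch below appends an empty allowlist when no uncommented key is present and leaves the file untouched otherwise. The function name is hypothetical, and whether an empty allowlist is the right default depends on the permissioning strategy chosen in Phase 1.3.

```bash
#!/usr/bin/env bash
# Sketch: make sure a permissions file defines its allowlist key.
# Appends "<key>=[]" only when no uncommented key exists; creates the file
# if it is missing. The function name is illustrative.

ensure_allowlist() {   # ensure_allowlist <file> <key>
    local file="$1" key="$2"
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" 2>/dev/null; then
        echo "${key}=[]" >> "$file"
        echo "FIXED: added empty ${key} to $file"
    fi
}

# Example:
#   ensure_allowlist /etc/besu/permissions-accounts.toml accounts-allowlist
#   ensure_allowlist /etc/besu/permissions-nodes.toml nodes-allowlist
```

The function is idempotent, so it is safe to run from a pre-start hook on every service start.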
6.3 Automatic Transaction Pool Cleanup
Problem: Stuck transactions block new transactions
Solution: Automatic detection and cleanup
Implementation:
- Monitor transaction pool
- Detect stuck transactions
- Automatically clear if needed
- Alert on cleanup actions
Action Items:
- Create transaction pool cleanup script
- Implement automatic cleanup
- Set up alerts
- Document cleanup procedures
Phase 7: Monitoring and Alerting ✅ HIGH PRIORITY
7.1 Comprehensive Monitoring System
Problem: No centralized monitoring
Solution: Implement comprehensive monitoring
Components:
- Validator health
- Block production
- Transaction pool
- Network status
- Consensus health
Action Items:
- Design monitoring architecture
- Implement monitoring scripts
- Set up data collection
- Create monitoring dashboard
7.2 Alerting System
Problem: No alerts for critical issues
Solution: Implement alerting for all critical metrics
Alerts Needed:
- Validator service down
- Block production stalled
- Consensus failure
- Transaction pool issues
- Network connectivity issues
Action Items:
- Define alert thresholds
- Implement alerting system
- Configure alert channels
- Test alerting system
7.3 Logging and Diagnostics
Problem: Insufficient logging for diagnostics
Solution: Enhanced logging and diagnostics
Implementation:
- Structured logging
- Log aggregation
- Error tracking
- Performance metrics
Action Items:
- Configure enhanced logging
- Set up log aggregation
- Create diagnostic tools
- Document logging procedures
Phase 8: Preventive Measures ✅ ONGOING
8.1 Pre-Deployment Validation
Problem: Issues discovered after deployment
Solution: Validate everything before deployment
Checks:
- Configuration files valid
- Required files present
- Services can start
- Network connectivity
- Consensus can be reached
Action Items:
- Create pre-deployment validation script
- Run validation before all deployments
- Document validation procedures
- Integrate into deployment process
8.2 Regular Health Audits
Problem: Issues accumulate over time
Solution: Regular comprehensive health audits
Audit Areas:
- Validator health
- Configuration consistency
- Network status
- Transaction pool health
- Consensus health
Action Items:
- Create health audit script
- Schedule regular audits (daily/weekly)
- Document audit procedures
- Create audit reports
8.3 Change Management
Problem: Changes cause unexpected issues
Solution: Proper change management process
Process:
- Test changes in non-production
- Validate before applying
- Rollback procedures
- Change documentation
Action Items:
- Create change management process
- Document change procedures
- Create rollback procedures
- Test change process
Implementation Priority
🔴 CRITICAL - Immediate (Week 1)
- Configuration standardization (Phase 1)
- Validator health monitoring (Phase 2.1, 2.2)
- Block production monitoring (Phase 4.1, 4.2, 4.3)
- Automatic recovery (Phase 6.1, 6.2)
🟠 HIGH PRIORITY - Short Term (Week 2-3)
- Transaction management (Phase 3)
- Network resilience (Phase 5)
- Monitoring and alerting (Phase 7)
🟡 MEDIUM PRIORITY - Medium Term (Week 4+)
- Preventive measures (Phase 8)
- Advanced monitoring
- Performance optimization
Success Criteria
Stability Metrics
- Block Production Uptime: > 99.9%
- Validator Availability: > 99.5%
- Transaction Confirmation Time: < 30 seconds
- Mean Time to Recovery (MTTR): < 5 minutes
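To make the uptime targets concrete, they can be converted into a downtime budget. The helper below is a small arithmetic sketch; the function name and the choice of a 30-day window are assumptions. Uptime is passed in basis points (99.9% is 9990) to stay within integer arithmetic.

```bash
#!/usr/bin/env bash
# Sketch: allowed downtime for a given uptime target over a period.
# Uptime is given in basis points, e.g. 9990 for 99.9%.

downtime_budget_seconds() {   # <period_seconds> <uptime_basis_points>
    echo $(( $1 * (10000 - $2) / 10000 ))
}

# 30 days = 2592000 seconds
# downtime_budget_seconds 2592000 9990   # block production target
# downtime_budget_seconds 2592000 9950   # validator availability target
```

Over 30 days, 99.9% leaves 2592 seconds (about 43 minutes) of downtime and 99.5% leaves 12960 seconds (3.6 hours). With recoveries taking the full 5-minute MTTR target, the 99.9% budget covers roughly 8 incidents per month.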
Monitoring Coverage
- ✅ All validators monitored
- ✅ Block production monitored
- ✅ Transaction pool monitored
- ✅ Consensus health monitored
- ✅ Network status monitored
Alerting Coverage
- ✅ Critical issues alert within 1 minute
- ✅ All validators have alerting
- ✅ All RPC nodes have alerting
- ✅ Block production alerts configured
Risk Mitigation
Identified Risks
- Configuration Drift: Validators get out of sync
- Silent Failures: Issues not detected
- Cascading Failures: One issue causes others
- Human Error: Manual mistakes cause issues
Mitigation Strategies
- Automated Configuration Management: Prevent drift
- Comprehensive Monitoring: Detect issues early
- Isolation: Prevent cascading failures
- Automation: Reduce human error
Documentation Requirements
Required Documentation
- Deployment Procedures: Step-by-step deployment guides
- Configuration Reference: All configuration options documented
- Troubleshooting Guide: Common issues and solutions
- Monitoring Guide: How to monitor and interpret metrics
- Recovery Procedures: Step-by-step recovery guides
Next Steps
Immediate Actions
- ✅ Create configuration standardization script
- ✅ Create validator health check script
- ✅ Create block production monitor
- ✅ Update systemd service files
- ✅ Create monitoring dashboard
Short-term Actions
- Implement transaction pool monitoring
- Set up alerting system
- Create recovery automation
- Document all procedures
Status: Comprehensive plan created
Priority: Implement critical items immediately
Timeline: Phased implementation over 4+ weeks