# Service State Machine
**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Status:** Active Documentation
---
## Overview
This document defines the state machine for services in the infrastructure, including valid states, transitions, and recovery actions.
---
## Service State Diagram
```mermaid
stateDiagram-v2
[*] --> Stopped
Stopped --> Starting: start()
Starting --> Running: initialized successfully
Starting --> Error: initialization failed
Running --> Stopping: stop()
Running --> Error: runtime error
Stopping --> Stopped: stopped successfully
Stopping --> Error: stop failed
Error --> Stopped: reset()
Error --> Starting: restart()
Running --> Restarting: restart()
Restarting --> Starting: restart initiated
```
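
The transitions in the diagram can be read as a lookup from (current state, event) to next state. The following is a minimal sketch of that lookup in POSIX shell, not an implementation used by any service here. The function name `next_state` and the event labels (`init_ok`, `stop_failed`, and so on) are hypothetical names chosen for illustration. For example, `next_state Running restart` prints `Restarting`, and a pair the diagram does not define returns a non-zero status.

```bash
# next_state CURRENT EVENT
# Prints the state that follows CURRENT when EVENT occurs, per the diagram.
# Prints nothing and returns 1 if the diagram defines no such transition.
# Event labels are illustrative; they are not part of any real tool.
next_state() {
    case "$1:$2" in
        Stopped:start)                echo Starting ;;
        Starting:init_ok)             echo Running ;;
        Starting:init_failed)         echo Error ;;
        Running:stop)                 echo Stopping ;;
        Running:runtime_error)        echo Error ;;
        Running:restart)              echo Restarting ;;
        Stopping:stop_ok)             echo Stopped ;;
        Stopping:stop_failed)         echo Error ;;
        Error:reset)                  echo Stopped ;;
        Error:restart)                echo Starting ;;
        Restarting:restart_initiated) echo Starting ;;
        *)                            return 1 ;;
    esac
}
```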
---
## State Definitions
### Stopped
**Description:** Service is not running
**Characteristics:**
- No processes active
- No resources allocated
- Configuration may be present
**Entry Conditions:**
- Initial state
- After successful stop
- After reset from error
**Exit Conditions:**
- Service started (`start()`)
---
### Starting
**Description:** Service is initializing
**Characteristics:**
- Process starting
- Configuration loading
- Resources being allocated
- Network connections being established
**Entry Conditions:**
- Start requested from Stopped (`start()`)
- Restart initiated (from Restarting, or `restart()` from Error)
**Exit Conditions:**
- Initialization successful → Running
- Initialization failed → Error
**Typical Duration:**
- 10-60 seconds (depending on service)
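
Because Starting is expected to resolve within a bounded time, a caller can poll for the target state with a timeout instead of waiting indefinitely. The sketch below assumes a probe command that prints the current state name. Both `wait_for_state` and the probe are hypothetical helpers, not part of any existing tooling. A call such as `wait_for_state Running 60 my_probe` would return 0 once `my_probe` prints `Running`, or 1 after roughly 60 seconds.

```bash
# wait_for_state EXPECTED TIMEOUT_SECONDS PROBE [ARGS...]
# Runs PROBE about once per second until it prints EXPECTED (returns 0)
# or TIMEOUT_SECONDS have elapsed (returns 1).
# PROBE is any command that prints the current state name (hypothetical).
wait_for_state() {
    expected=$1
    remaining=$2
    shift 2
    while :; do
        [ "$("$@")" = "$expected" ] && return 0
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}
```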
---
### Running
**Description:** Service is operational
**Characteristics:**
- Process active
- Handling requests
- Monitoring active
- Health checks passing
**Entry Conditions:**
- Initialization completed successfully (from Starting)
**Exit Conditions:**
- Stop requested → Stopping
- Runtime error → Error
- Restart requested → Restarting
**Verification:**
- Health check endpoint responding
- Service logs showing normal operation
- Metrics indicating activity
---
### Stopping
**Description:** Service is shutting down
**Characteristics:**
- Graceful shutdown in progress
- Finishing current requests
- Releasing resources
- Closing connections
**Entry Conditions:**
- Stop requested
- Service shutdown initiated
**Exit Conditions:**
- Shutdown successful → Stopped
- Shutdown failed → Error
**Typical Duration:**
- 5-30 seconds (graceful shutdown)
---
### Error
**Description:** Service is in an error state
**Characteristics:**
- Service not functioning correctly
- Error logs present
- May be partially running
- Requires intervention
**Entry Conditions:**
- Initialization failed
- Runtime error occurred
- Stop operation failed
**Exit Conditions:**
- Reset requested → Stopped
- Restart requested → Starting
**Recovery Actions:**
- Check error logs
- Verify configuration
- Check dependencies
- Restart service
---
### Restarting
**Description:** Service restart in progress
**Characteristics:**
- Stop operation initiated
- Will transition to Starting after stop
**Entry Conditions:**
- Restart requested while Running
**Exit Conditions:**
- Stop complete → Starting
---
## State Transitions
### Transition: start()
**From:** Stopped
**To:** Starting
**Action:** Start service process
**Verification:** Process started, logs show initialization
---
### Transition: initialized successfully
**From:** Starting
**To:** Running
**Condition:** All initialization steps completed
**Verification:** Health check passes, service responding
---
### Transition: initialization failed
**From:** Starting
**To:** Error
**Condition:** Initialization error occurred
**Action:** Log error, stop process
**Recovery:** Check logs, fix configuration, restart
---
### Transition: stop()
**From:** Running
**To:** Stopping
**Action:** Initiate graceful shutdown
**Verification:** Shutdown process started
---
### Transition: stopped successfully
**From:** Stopping
**To:** Stopped
**Condition:** Shutdown completed
**Verification:** Process terminated, resources released
---
### Transition: stop failed
**From:** Stopping
**To:** Error
**Condition:** Shutdown error occurred
**Action:** Force stop if needed
**Recovery:** Manual intervention may be required
---
### Transition: runtime error
**From:** Running
**To:** Error
**Condition:** Runtime error detected
**Action:** Log error, attempt recovery
**Recovery:** Check logs, fix issue, restart
---
### Transition: reset()
**From:** Error
**To:** Stopped
**Action:** Reset service to clean state
**Verification:** Service stopped, error state cleared
---
### Transition: restart()
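**From:** Error
**To:** Starting
**Action:** Restart service from error state
**Verification:** Service starting, initialization in progress
**Note:** From Running, `restart()` leads to Restarting instead; the service then moves to Starting once the stop completes (`restart initiated`).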
---
## Service-Specific State Machines
### Besu Node States
**Additional States:**
- **Syncing:** Blockchain synchronization in progress
- **Synced:** Blockchain fully synchronized
- **Consensus:** Participating in consensus (validators)
**State Flow:**
```
Starting → Syncing → Synced → Running (with Consensus if validator)
```
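
One rough way to tell Syncing from Synced is the standard `eth_syncing` JSON-RPC method, which returns `false` when the node is not syncing and an object with block counters otherwise. The sketch below classifies a response body that has already been fetched. The helper name `classify_sync` is hypothetical, and the endpoint in the commented example (`http://localhost:8545`) is an assumption about the local configuration. A node that has not yet found peers may also report `false`, so treat the result as a hint, not proof.

```bash
# classify_sync RESPONSE
# Prints "Synced" if an eth_syncing response reports "result": false,
# "Syncing" if it carries any other result, and "Unknown" (returning 1)
# if the response is empty or has no result field (for example, an error).
classify_sync() {
    if printf '%s' "$1" | grep -Eq '"result"[[:space:]]*:[[:space:]]*false'; then
        echo Synced
    elif printf '%s' "$1" | grep -q '"result"'; then
        echo Syncing
    else
        echo Unknown
        return 1
    fi
}

# Example call (assumes the JSON-RPC HTTP endpoint is enabled on localhost:8545):
# response=$(curl -s -X POST -H 'Content-Type: application/json' \
#   --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
#   http://localhost:8545)
# classify_sync "$response"
```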
---
### Cloudflare Tunnel States
**Additional States:**
- **Connecting:** Establishing tunnel connection
- **Connected:** Tunnel connected to Cloudflare
- **Reconnecting:** Reconnecting after disconnection
**State Flow:**
```
Starting → Connecting → Connected → Running
Running → Reconnecting → Connected → Running
```
---
## Monitoring and Alerts
### State Monitoring
**Metrics to Track:**
- Current state
- State transition frequency
- Time in each state
- Error state occurrences
**Alerts:**
- Service in Error state for more than 5 minutes
- Frequent state transitions (thrashing)
- Service in Starting state for more than 10 minutes
- Service in Stopping state for more than 2 minutes
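
The time-based alerts above reduce to a per-state threshold check. The following sketch assumes the monitoring system already knows the current state and how many seconds the service has spent in it. The function name `check_alert` and the message format are hypothetical. Thrashing detection is not covered because the list gives no numeric threshold for it.

```bash
# check_alert STATE SECONDS_IN_STATE
# Prints an alert line and returns 0 if the time-in-state threshold for
# STATE is exceeded; prints nothing and returns 1 otherwise.
check_alert() {
    case "$1" in
        Error)    limit=300 ;;   # more than 5 minutes
        Starting) limit=600 ;;   # more than 10 minutes
        Stopping) limit=120 ;;   # more than 2 minutes
        *)        return 1 ;;    # no time-based alert listed for other states
    esac
    if [ "$2" -gt "$limit" ]; then
        echo "ALERT: service in $1 for ${2}s (limit ${limit}s)"
        return 0
    fi
    return 1
}
```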
---
## Recovery Procedures
### From Error State
**Step 1: Diagnose**
```bash
# Check service logs
journalctl -u <service> -n 100
# Check service status
systemctl status <service>
# Check error messages
journalctl -u <service> | grep -i error
```
**Step 2: Fix Issue**
- Fix configuration errors
- Resolve dependency issues
- Address resource constraints
- Fix network problems
**Step 3: Recover**
```bash
# Option 1: Restart
systemctl restart <service>
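# Option 2: Stop, fix, clear the failed state, then start
systemctl stop <service>
# Fix the underlying issue identified in Steps 1 and 2
systemctl reset-failed <service>
systemctl start <service>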
```
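
After recovery, it can help to confirm which state the unit has reached. The sketch below translates a systemd `ActiveState` value into the state names used in this document. The mapping is an assumption made for illustration, and `map_active_state` is a hypothetical helper. systemd has no direct equivalent of Restarting, and values such as `reloading` are reported as `Unknown`.

```bash
# map_active_state ACTIVE_STATE
# Translates a systemd ActiveState value into a state name from this document.
# The mapping is an illustrative assumption, not an official correspondence.
map_active_state() {
    case "$1" in
        inactive)     echo Stopped ;;
        activating)   echo Starting ;;
        active)       echo Running ;;
        deactivating) echo Stopping ;;
        failed)       echo Error ;;
        *)            echo Unknown; return 1 ;;
    esac
}

# Example (requires systemd):
# map_active_state "$(systemctl show -p ActiveState --value <service>)"
```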
---
## Related Documentation
- **[OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** ⭐⭐ - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** ⭐⭐⭐ - Troubleshooting guide
- **[BESU_ALLOWLIST_RUNBOOK.md](../06-besu/BESU_ALLOWLIST_RUNBOOK.md)** ⭐ - Besu allowlist and node operations
---
**Review Cycle:** Quarterly