# CCIP Operations Runbook
## Overview
This runbook provides operational procedures for managing CCIP (Chainlink Cross-Chain Interoperability Protocol) on the DeFi Oracle Meta Mainnet.
## Daily Operations
### Health Checks
1. **Check CCIP Router Status**
```bash
kubectl get pods -n besu-network -l app=ccip-router
kubectl logs -n besu-network -l app=ccip-router --tail=100
```
2. **Check Message Processing**
```bash
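# MessageSent events from the last 100 blocks; set MESSAGE_SENT_EVENT to the event signature from the sender contract's ABI
FROM_BLOCK=$(( $(cast block-number --rpc-url $RPC_URL) - 100 ))
cast logs --from-block $FROM_BLOCK --address $CCIP_SENDER "$MESSAGE_SENT_EVENT" --rpc-url $RPC_URL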
```
3. **Monitor Metrics**
- Message send success rate
- Message delivery latency
- Fee consumption
- Error rates
### LINK Balance Monitoring
1. **Check Balance**
```bash
# Returns the balance in wei (LINK has 18 decimals)
cast call $LINK_TOKEN "balanceOf(address)(uint256)" $CCIP_SENDER --rpc-url $RPC_URL
```
2. **Set Alert Thresholds**
- Warning alert when balance < 10 LINK
- Critical alert when balance < 5 LINK
3. **Refill Balance**
```bash
# AMOUNT is in wei; e.g. AMOUNT=$(cast to-wei 50) for 50 LINK (18 decimals)
cast send $LINK_TOKEN "transfer(address,uint256)" $CCIP_SENDER $AMOUNT \
--rpc-url $RPC_URL --private-key $PRIVATE_KEY
```
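The thresholds above can be checked from a cron job or alert hook. The sketch below assumes the input is the raw wei integer returned by the `balanceOf(address)(uint256)` call (LINK uses 18 decimals); `link_balance_status`, `uint_lt` and the two threshold variables are hypothetical names. Values are compared as decimal strings because 10 LINK (10^19 wei) does not fit in bash's 64-bit arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a LINK balance in wei against this runbook's
# thresholds (< 5 LINK -> CRITICAL, < 10 LINK -> WARNING, otherwise OK).
LINK_WARN_WEI=10000000000000000000   # 10 LINK
LINK_CRIT_WEI=5000000000000000000    # 5 LINK

# Exact "a < b" for non-negative decimal strings of any length.
uint_lt() {
  local a b
  a=$(sed 's/^0*//' <<<"$1")   # strip leading zeros so lengths are comparable
  b=$(sed 's/^0*//' <<<"$2")
  if (( ${#a} != ${#b} )); then
    (( ${#a} < ${#b} ))        # fewer digits means a smaller number
  else
    [[ "$a" < "$b" ]]          # same length: digit-by-digit comparison
  fi
}

link_balance_status() {
  local wei="$1"
  if [[ ! "$wei" =~ ^[0-9]+$ ]]; then
    echo "invalid balance: '$wei'" >&2
    return 2
  fi
  if uint_lt "$wei" "$LINK_CRIT_WEI"; then
    echo CRITICAL
  elif uint_lt "$wei" "$LINK_WARN_WEI"; then
    echo WARNING
  else
    echo OK
  fi
}

# Usage (needs Foundry's cast and the variables used in this runbook):
#   BAL=$(cast call "$LINK_TOKEN" "balanceOf(address)(uint256)" "$CCIP_SENDER" \
#         --rpc-url "$RPC_URL" | awk '{print $1}')
#   link_balance_status "$BAL"
```

Newer `cast` releases may append a scientific-notation hint (for example `[1e19]`) to decoded integers, which is why the usage line keeps only the first field.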
## Weekly Operations
### Review Message Statistics
1. **Message Volume**
- Total messages sent
- Success rate
- Average latency
2. **Fee Analysis**
- Total fees spent
- Average fee per message
- Fee trends
3. **Error Analysis**
- Error types
- Error frequency
- Root causes
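The weekly volume, fee and error figures can be pulled from the counters listed under Key Metrics. A minimal sketch, assuming a Prometheus server at a hypothetical `PROMETHEUS_URL` scrapes those metrics and that `ccip_fees_total` is denominated in LINK; `weekly_queries` and `prom_query` are illustrative names. Success rate can be read as `1 - error_ratio` only if each error counted in `ccip_errors_total` corresponds to one failed send.

```bash
#!/usr/bin/env bash
# Sketch: weekly CCIP statistics from Prometheus.
# Assumptions (hypothetical): PROMETHEUS_URL points at the Prometheus server
# scraping the "Key Metrics", and ccip_fees_total is denominated in LINK.

# Print "name<TAB>PromQL" pairs for a look-back window such as 7d.
weekly_queries() {
  local w="${1:-7d}"
  local sent="sum(increase(ccip_messages_sent_total[$w]))"
  local errs="sum(increase(ccip_errors_total[$w]))"
  local fees="sum(increase(ccip_fees_total[$w]))"
  printf '%s\t%s\n' \
    messages_sent "$sent" \
    errors        "$errs" \
    error_ratio   "$errs / $sent" \
    fees_link     "$fees" \
    avg_fee_link  "$fees / $sent"
}

# Run one instant query through the Prometheus HTTP API and print the raw JSON.
# Exits the script if PROMETHEUS_URL is unset.
prom_query() {
  curl -fsS -G "${PROMETHEUS_URL:?set PROMETHEUS_URL}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   weekly_queries 7d | while IFS=$'\t' read -r name query; do
#     echo "== $name"; prom_query "$query"; echo
#   done
```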
### Performance Review
1. **Latency Analysis**
- Average delivery time
- P95/P99 latency
- Outliers
2. **Throughput**
- Messages per hour/day
- Peak load times
- Capacity planning
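P95/P99 latency needs per-bucket data. This sketch assumes `ccip_message_latency_seconds` is exported as a Prometheus histogram (so `_bucket`, `_sum` and `_count` series exist); if it is a gauge or summary instead, these queries do not apply. The function names are hypothetical; each prints a PromQL expression that can be sent to the Prometheus `/api/v1/query` endpoint.

```bash
#!/usr/bin/env bash
# Sketch: PromQL for the weekly latency and throughput review.
# Assumption (hypothetical): ccip_message_latency_seconds is a histogram,
# so ccip_message_latency_seconds_bucket/_sum/_count series exist.

# Latency quantile over a window, e.g. latency_quantile 0.99 7d.
latency_quantile() {
  printf 'histogram_quantile(%s, sum by (le) (rate(ccip_message_latency_seconds_bucket[%s])))\n' "$1" "$2"
}

# Average delivery time over a window (histogram _sum divided by _count).
latency_avg() {
  printf 'sum(rate(ccip_message_latency_seconds_sum[%s])) / sum(rate(ccip_message_latency_seconds_count[%s]))\n' "$1" "$1"
}

# Highest hourly message count within the window (a subquery evaluated hourly).
throughput_hourly_peak() {
  printf 'max_over_time(sum(increase(ccip_messages_sent_total[1h]))[%s:1h])\n' "$1"
}

# Usage:
#   curl -fsS -G "$PROMETHEUS_URL/api/v1/query" \
#     --data-urlencode "query=$(latency_quantile 0.95 7d)"
```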
## Monthly Operations
### Security Review
1. **Access Control Audit**
- Review authorized senders/receivers
- Check for unauthorized access
- Verify role assignments
2. **Message Validation**
- Review message format compliance
- Check for anomalies
- Verify replay protection
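Role assignments can be spot-checked on-chain with `cast`. This is a sketch under the assumption that the audited contracts use OpenZeppelin `AccessControl`, where `hasRole(bytes32,address)` exists, `DEFAULT_ADMIN_ROLE` is `bytes32(0)` and other role ids are `keccak256` of the role name; `role_id` and `check_role` are hypothetical helpers. Confirm against the contract ABI before relying on it.

```bash
#!/usr/bin/env bash
# Sketch: role checks for the monthly access-control audit.
# Assumption (hypothetical): contracts use OpenZeppelin AccessControl;
# DEFAULT_ADMIN_ROLE is bytes32(0), other role ids are keccak256(role name).

# Print the bytes32 role id for a role name.
role_id() {
  if [[ "$1" == DEFAULT_ADMIN_ROLE ]]; then
    printf '0x%064d\n' 0      # 32 zero bytes
  else
    cast keccak "$1"
  fi
}

# Print "<role> <account> true|false" for one contract/role/account triple.
check_role() {
  local contract="$1" role="$2" account="$3" id result
  id=$(role_id "$role") || return 1
  result=$(cast call "$contract" "hasRole(bytes32,address)(bool)" "$id" "$account" \
    --rpc-url "$RPC_URL") || return 1
  printf '%s %s %s\n' "$role" "$account" "$result"
}

# Usage: check_role "$CCIP_SENDER" DEFAULT_ADMIN_ROLE 0xYourAdminAddress
```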
### Cost Optimization
1. **Fee Optimization**
- Review fee trends
- Identify optimization opportunities
- Implement improvements
2. **Message Optimization**
- Reduce message size
- Batch updates when possible
- Optimize encoding
## Incident Response
### Message Delivery Failure
1. **Identify Issue**
- Check message status
- Verify target chain status
- Check router logs
2. **Diagnose**
- Check LINK balance
- Verify router configuration
- Check target chain connectivity
3. **Resolve**
- Fix underlying issue
- Resend message if needed
- Update configuration if required
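The identify and diagnose steps can be collected into one pass. A sketch that takes the source-chain transaction hash that called the CCIP sender and assumes a hypothetical `DEST_RPC_URL` for the target chain; `RPC_URL`, `LINK_TOKEN` and `CCIP_SENDER` are the variables used earlier in this runbook, and `diagnose_delivery` is an illustrative name.

```bash
#!/usr/bin/env bash
# Sketch: gather diagnostics for one failed CCIP message.
# Assumptions (hypothetical): $1 is the source-chain send transaction hash,
# DEST_RPC_URL is an RPC endpoint on the target chain.
diagnose_delivery() {
  local tx="${1:-}"
  if [[ -z "$tx" ]]; then
    echo "usage: diagnose_delivery <send-tx-hash>" >&2
    return 2
  fi
  echo "== Source transaction receipt"
  cast receipt "$tx" --rpc-url "$RPC_URL"

  echo "== LINK balance of the sender (wei)"
  cast call "$LINK_TOKEN" "balanceOf(address)(uint256)" "$CCIP_SENDER" --rpc-url "$RPC_URL"

  echo "== Target chain head (connectivity check)"
  cast block-number --rpc-url "$DEST_RPC_URL" || echo "target chain RPC unreachable"

  echo "== Recent router errors"
  kubectl logs -n besu-network -l app=ccip-router --tail=200 | grep -iE 'error|fail' || true
}
```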
### High Error Rate
1. **Investigate**
- Check error logs
- Identify error patterns
- Review recent changes
2. **Mitigate**
- Pause sending if critical
- Fix root cause
- Resume operations
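To identify error patterns, it helps to group log lines by shape rather than reading them one by one. This sketch masks hex values and numbers so that the same failure on different transactions collapses into one line, then counts each shape. The keyword filter (`error|fail|revert`) and the name `summarize_errors` are assumptions; adjust them to the router's real log format.

```bash
#!/usr/bin/env bash
# Sketch: count recurring error "shapes" in log lines read from stdin.
# Hex values become <hex> and numbers become <n> before counting.
# The keyword list is an assumption about the router's log format.
summarize_errors() {
  local top="${1:-10}"
  grep -iE 'error|fail|revert' \
    | sed -E 's/0x[0-9a-fA-F]+/<hex>/g; s/[0-9]+/<n>/g' \
    | sort | uniq -c | sort -rn | head -n "$top"
}

# Usage: last hour of logs from all router pods, 20 most frequent patterns.
# --tail=-1 is explicit because kubectl logs shows only 10 lines per pod
# when a label selector is used.
#   kubectl logs -n besu-network -l app=ccip-router --since=1h --tail=-1 \
#     | summarize_errors 20
```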
### Router Unavailable
1. **Check Status**
- Verify router deployment
- Check service health
- Review logs
2. **Recovery**
- Restart router if needed
- Verify connectivity
- Test message sending
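If a restart is needed, waiting on the rollout (rather than on pod readiness, which the old pods can still satisfy) gives a clear completion signal. This sketch assumes the router runs as one or more Deployments labelled `app=ccip-router` in `besu-network`, matching the health-check commands above; `restart_router` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: rolling restart of the CCIP router, waiting for each rollout.
# Assumption (hypothetical): the router is deployed as Deployment(s)
# labelled app=ccip-router in the besu-network namespace.
NAMESPACE="${NAMESPACE:-besu-network}"
SELECTOR="${SELECTOR:-app=ccip-router}"

restart_router() {
  local deployments d
  deployments=$(kubectl get deployment -n "$NAMESPACE" -l "$SELECTOR" -o name) || return 1
  if [[ -z "$deployments" ]]; then
    echo "no deployment matches $SELECTOR in $NAMESPACE" >&2
    return 1
  fi
  for d in $deployments; do
    kubectl rollout restart "$d" -n "$NAMESPACE" || return 1
    # Blocks until the new pods are rolled out, or fails after 5 minutes.
    kubectl rollout status "$d" -n "$NAMESPACE" --timeout=300s || return 1
  done
}
```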
## Maintenance
### Contract Upgrades
1. **Plan Upgrade**
- Review upgrade proposal
- Test in staging
- Schedule maintenance window
2. **Execute Upgrade**
- Pause operations
- Deploy new contracts
- Update configurations
- Resume operations
3. **Verify**
- Test message sending
- Verify message receiving
- Monitor for issues
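How operations are paused depends on the contracts, which this runbook does not specify. As a sketch only, assuming the CCIP sender implements OpenZeppelin `Pausable`-style `pause()`, `unpause()` and `paused()` and that `PRIVATE_KEY` holds the required role, the pause and resume steps plus a basic bytecode check on a newly deployed contract could look like this; `set_paused`, `has_code` and `NEW_ADDRESS` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: pause / verify / resume around a contract upgrade.
# Assumptions (hypothetical): the CCIP sender exposes Pausable-style
# pause(), unpause() and paused(); PRIVATE_KEY is authorised to call them.

# Succeeds if the address has deployed bytecode (cast prints "0x" for none).
has_code() {
  local code
  code=$(cast code "$1" --rpc-url "$RPC_URL") || return 1
  [[ -n "$code" && "$code" != "0x" ]]
}

# Send pause() or unpause() to the sender, then print paused() to confirm.
set_paused() {
  local fn
  case "$1" in
    on)  fn="pause()" ;;
    off) fn="unpause()" ;;
    *)   echo "usage: set_paused on|off" >&2; return 2 ;;
  esac
  cast send "$CCIP_SENDER" "$fn" --rpc-url "$RPC_URL" --private-key "$PRIVATE_KEY" || return 1
  cast call "$CCIP_SENDER" "paused()(bool)" --rpc-url "$RPC_URL"
}

# Usage: set_paused on; <deploy and configure>; has_code "$NEW_ADDRESS" && set_paused off
```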
### Configuration Changes
1. **Review Impact**
- Assess change impact
- Test in staging
- Plan rollback
2. **Apply Changes**
- Update configuration
- Monitor closely
- Verify functionality
## Monitoring
### Key Metrics
- `ccip_messages_sent_total`: Total messages sent
- `ccip_messages_received_total`: Total messages received
- `ccip_message_latency_seconds`: Message delivery latency
- `ccip_fees_total`: Total LINK spent on fees
- `ccip_errors_total`: Total errors
### Alerts
- High error rate (> 5%)
- Low success rate (< 95%)
- High latency (> 5 minutes)
- Low LINK balance (< 10 LINK)
- Router unavailable
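The alert conditions above can be written in PromQL against the Key Metrics. A sketch under stated assumptions: `ccip_message_latency_seconds` is a histogram and the latency alert uses its P95, the router is scraped as `job="ccip-router"`, and the window defaults to 5m. The low-success-rate alert is the complement of the error-rate alert if every counted error is a failed send, so it is not repeated; no LINK balance metric is listed, so that alert is left to the balance check under LINK Balance Monitoring. `alert_expr` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: PromQL conditions for the CCIP alerts.
# Assumptions (hypothetical): latency metric is a histogram (P95 is used),
# router scrape job is "ccip-router", default evaluation window is 5m.
alert_expr() {
  local w="${2:-5m}"
  case "$1" in
    high_error_rate)
      echo "sum(rate(ccip_errors_total[$w])) / sum(rate(ccip_messages_sent_total[$w])) > 0.05" ;;
    high_latency)
      # 5 minutes = 300 seconds
      echo "histogram_quantile(0.95, sum by (le) (rate(ccip_message_latency_seconds_bucket[$w]))) > 300" ;;
    router_unavailable)
      echo 'up{job="ccip-router"} == 0' ;;
    *)
      echo "unknown alert: $1" >&2
      return 2 ;;
  esac
}
```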
## References
- [CCIP Integration Guide](../docs/CCIP_INTEGRATION.md)
- [CCIP Router Setup](../docs/CCIP_ROUTER_SETUP.md)
- [CCIP Troubleshooting](../docs/CCIP_TROUBLESHOOTING.md)
- [CCIP Incident Response](ccip-incident-response.md)