

# Troubleshooting Guide
## Common Issues and Solutions
### Network Issues
#### Blocks Not Being Produced
**Symptoms**: No new blocks, validators not responding
**Diagnosis**:
```bash
# Check validator status
kubectl get pods -n besu-network -l component=validator
# Check logs
kubectl logs -n besu-network <validator-pod> --tail=100
# Check block number
curl -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://<rpc-endpoint>
```
**Solutions**:
1. Restart validators: `kubectl rollout restart statefulset/besu-validator -n besu-network`
2. Check network connectivity
3. Verify validator keys
4. Check IBFT configuration
5. Verify genesis file
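
To tell a stalled chain from a slow one, it can help to sample the block height twice and compare the two values. The following is a minimal sketch rather than part of the deployed tooling: the helper names (`rpc_result`, `hex_to_dec`, `block_height`, `is_advancing`), the 5-second timeout, and the 10-second wait are illustrative assumptions, and `<rpc-endpoint>` is the same placeholder used above.

```bash
# Sketch only: helper names, timeout, and sampling interval are assumptions.

# Extract a quoted "result" value from a JSON-RPC response on stdin (no jq needed).
rpc_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Convert a hex quantity such as 0x1a to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Print the current block height (decimal) for the RPC URL given as $1.
block_height() {
  local hex
  hex=$(curl -s --max-time 5 -X POST -H "Content-Type: application/json" \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "$1" | rpc_result)
  [ -n "$hex" ] && hex_to_dec "$hex"
}

# Succeed only if the second sample ($2) is higher than the first ($1).
is_advancing() {
  [ "$2" -gt "$1" ]
}

# Example (not executed here):
#   first=$(block_height "http://<rpc-endpoint>"); sleep 10
#   second=$(block_height "http://<rpc-endpoint>")
#   is_advancing "$first" "$second" || echo "block height did not advance"
```

If the height does not advance across several samples, the validator logs and consensus configuration are the next places to look.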
#### Validators Not Peering
**Symptoms**: Validators not connecting to each other
**Diagnosis**:
```bash
# List connected peers (admin_peers requires the ADMIN RPC API to be enabled on the node)
kubectl exec -n besu-network <validator-pod> -- \
curl -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
http://localhost:8545
# Check static nodes
kubectl get configmap besu-validator-config -n besu-network -o yaml
```
**Solutions**:
1. Verify static-nodes.json configuration
2. Check network policies
3. Verify firewall rules
4. Check P2P port (30303) connectivity
5. Verify enode addresses
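
A malformed enode URL is an easy mistake to miss by eye. The sketch below checks only the shape of each entry (a 128-hex-character node ID, a host, and a port); it does not test reachability. The function names are hypothetical, and the pattern assumes an IPv4 address or hostname, so IPv6 literals are not handled.

```bash
# Sketch only: validates the format of enode URLs, not whether the peer is reachable.
# Assumes an IPv4 address or hostname as the host part.

# Succeed if $1 looks like enode://<128 hex chars>@<host>:<port>[?discport=<port>].
is_valid_enode() {
  printf '%s\n' "$1" |
    grep -Eq '^enode://[0-9a-fA-F]{128}@[A-Za-z0-9.-]+:[0-9]{1,5}(\?discport=[0-9]{1,5})?$'
}

# Read one enode URL per line on stdin and report malformed entries.
report_bad_enodes() {
  while IFS= read -r enode; do
    [ -z "$enode" ] && continue
    is_valid_enode "$enode" || printf 'malformed: %s\n' "$enode"
  done
}

# Example (not executed here), assuming a local copy of static-nodes.json:
#   grep -oE 'enode://[^"]+' static-nodes.json | report_bad_enodes
```

No output means every entry passed the format check; peering can still fail for network reasons such as a blocked P2P port.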
### RPC Issues
#### RPC Endpoints Not Responding
**Symptoms**: RPC calls failing, timeouts
**Diagnosis**:
```bash
# Check RPC pod status
kubectl get pods -n besu-network -l component=rpc
# Check logs
kubectl logs -n besu-network <rpc-pod> --tail=100
# Test RPC endpoint
curl -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://<rpc-endpoint>
```
**Solutions**:
1. Restart RPC pods: `kubectl rollout restart statefulset/besu-rpc -n besu-network`
2. Check Application Gateway status
3. Verify network policies
4. Check rate limiting
5. Scale RPC nodes if needed
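
When triaging, it is useful to separate "nothing answered" from "the node answered with an error" from "something other than the node answered". The sketch below classifies a response body along those lines. The function names and the 5-second timeout are assumptions, not part of the existing tooling.

```bash
# Sketch only: function names and the 5-second timeout are assumptions.

# Classify a JSON-RPC response body passed as $1.
classify_rpc_response() {
  case "$1" in
    "")           echo "no-response" ;;  # timeout, refused connection, or empty reply
    *'"error"'*)  echo "rpc-error" ;;    # the node answered but rejected the call
    *'"result"'*) echo "ok" ;;
    *)            echo "unexpected" ;;   # e.g. an HTML error page from a gateway
  esac
}

# Call eth_blockNumber against the URL in $1 and classify the outcome.
probe_rpc() {
  local body
  body=$(curl -s --max-time 5 -X POST -H "Content-Type: application/json" \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' "$1")
  classify_rpc_response "$body"
}

# Example (not executed here):
#   probe_rpc "http://<rpc-endpoint>"
```

A result of `unexpected` usually points at a layer in front of the node, while `no-response` points at the pods or network policies.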
#### High Latency
**Symptoms**: Slow RPC responses
**Diagnosis**:
```bash
# Check pod resources
kubectl top pods -n besu-network -l component=rpc
# Check metrics
curl http://<rpc-pod>:9545/metrics
# Check sync status
curl -X POST -H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
http://<rpc-endpoint>
```
**Solutions**:
1. Scale RPC nodes
2. Increase resource limits
3. Check disk I/O
4. Verify network connectivity
5. Check for sync issues
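
`eth_syncing` returns `false` when the node considers itself in sync, and otherwise an object that includes `currentBlock` and `highestBlock`. The sketch below turns that response into a single lag figure. The function names are hypothetical, and the parsing uses `sed` rather than a JSON parser, so treat it as a quick check rather than a robust tool.

```bash
# Sketch only: function names are assumptions. Parses an eth_syncing response body.

# Print the hex value of field $1 found in the JSON text $2.
json_hex_field() {
  printf '%s\n' "$2" |
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Print how many blocks the node is behind; prints 0 when the result is false.
# Returns non-zero if the expected fields are missing.
sync_lag() {
  local compact current highest
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$compact" in
    *'"result":false'*) echo 0; return 0 ;;
  esac
  current=$(json_hex_field currentBlock "$1")
  highest=$(json_hex_field highestBlock "$1")
  [ -n "$current" ] && [ -n "$highest" ] || return 1
  echo $(( $highest - $current ))
}

# Example (not executed here):
#   body=$(curl -s -X POST -H "Content-Type: application/json" \
#     --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
#     "http://<rpc-endpoint>")
#   sync_lag "$body"
```

A lag that keeps growing suggests the node cannot keep up, which makes disk I/O and resource limits more likely culprits than the RPC layer.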
### Oracle Issues
#### Oracle Not Updating
**Symptoms**: Oracle price not updating, circuit breaker open
**Diagnosis**:
```bash
# Check oracle publisher status
kubectl get pods -n besu-network -l app=oracle-publisher
# Check logs
kubectl logs -n besu-network <oracle-pod> --tail=100
# Check health endpoint
curl http://<oracle-pod>:8080/health
# Check metrics
curl http://<oracle-pod>:8000/metrics
```
**Solutions**:
1. Restart oracle publisher
2. Check data sources
3. Verify RPC connectivity
4. Check private key access
5. Verify circuit breaker configuration
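
"Not updating" is easier to act on once it is expressed as an age. The sketch below compares a last-update timestamp against a maximum allowed age. How that timestamp is obtained is an assumption left to the operator (for example from the publisher's metrics or from the on-chain round data); the function name and the 300-second limit in the example are hypothetical.

```bash
# Sketch only: the source of the timestamp and the age limit are assumptions.

# Succeed if the last update ($1, Unix seconds) is older than the max age ($2, seconds).
# An optional $3 overrides "now" so the check can be exercised offline.
is_stale() {
  local updated_at=$1 max_age=$2 now=${3:-$(date +%s)}
  [ $(( now - updated_at )) -gt "$max_age" ]
}

# Example (not executed here), with a hypothetical 300-second limit:
#   is_stale "$last_update" 300 && echo "oracle data is stale"
```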
#### Data Source Failures
**Symptoms**: Failed to fetch from data sources
**Diagnosis**:
```bash
# Check data source connectivity
curl <data-source-url>
# Check oracle publisher logs
kubectl logs -n besu-network <oracle-pod> | grep -i "data source"
```
**Solutions**:
1. Verify data source URLs
2. Check network connectivity
3. Verify API keys
4. Check rate limiting
5. Update data source configuration
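
The HTTP status code returned by a data source often points straight at one of the items above. The sketch below maps a status code to a likely cause. The mapping is a heuristic and the function names are hypothetical; `000` is what `curl` reports through `%{http_code}` when no HTTP response was received.

```bash
# Sketch only: the mapping from status code to likely cause is a heuristic.

# Print the HTTP status code for URL $1 ("000" when no HTTP response was received).
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Map a status code to the most likely item in the checklist above.
diagnose_status() {
  case "$1" in
    000)     echo "no response: check network connectivity and the URL" ;;
    2??)     echo "ok" ;;
    401|403) echo "auth rejected: verify API keys" ;;
    404)     echo "not found: verify the data source URL" ;;
    429)     echo "rate limited: reduce request frequency" ;;
    5??)     echo "upstream failure: data source is unhealthy" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Example (not executed here):
#   diagnose_status "$(http_status '<data-source-url>')"
```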
### Storage Issues
#### Disk Full
**Symptoms**: Pods failing, disk space errors
**Diagnosis**:
```bash
# Check disk usage
kubectl exec -n besu-network <pod> -- df -h
# Check PVC usage
kubectl get pvc -n besu-network
# Check pod logs
kubectl logs -n besu-network <pod> | grep -i "disk\|space\|full"
```
**Solutions**:
1. Increase PVC size
2. Clean up old data
3. Archive chaindata
4. Use snap sync for RPC nodes
5. Implement data retention policies
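
To catch a filling volume before pods start failing, the `df` output can be filtered against a threshold. The sketch below reads POSIX-format `df -P` output on stdin and prints mount points at or above a given percentage. The function name and the 80% threshold in the example are assumptions, and mount points containing spaces are not handled.

```bash
# Sketch only: function name and threshold are assumptions.

# Read `df -P` output on stdin; print "<mount> <use>%" for each filesystem
# whose usage is at or above the percentage given as $1.
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6 " " use "%"
  }'
}

# Example (not executed here):
#   kubectl exec -n besu-network <pod> -- df -P | over_threshold 80
```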
#### Slow Disk I/O
**Symptoms**: Slow sync, high latency
**Diagnosis**:
```bash
# Check disk I/O (five one-second samples; requires iostat from sysstat in the container image)
kubectl exec -n besu-network <pod> -- iostat -x 1 5
# Check metrics
curl http://<pod>:9545/metrics | grep -i "disk\|io"
```
**Solutions**:
1. Upgrade to Premium SSD
2. Increase disk size
3. Optimize Besu configuration
4. Check for disk contention
5. Use faster storage class
### Monitoring Issues
#### Metrics Not Collecting
**Symptoms**: No metrics in Prometheus
**Diagnosis**:
```bash
# Check Prometheus targets
curl http://<prometheus>:9090/api/v1/targets
# Check service discovery
kubectl get servicemonitors -n besu-network
# Check pod metrics endpoint
curl http://<pod>:9545/metrics
```
**Solutions**:
1. Verify ServiceMonitor configuration
2. Check network policies
3. Verify metrics endpoint
4. Restart Prometheus
5. Check service discovery configuration
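
The targets API response can be long, so a quick summary helps. The sketch below counts targets that Prometheus reports as `down` and lists any non-empty scrape errors. It uses `grep` on the raw JSON instead of a JSON parser, so error messages that contain escaped quotes may be truncated; the function names are hypothetical.

```bash
# Sketch only: text matching on raw JSON, not a full parser. Names are assumptions.

# Count targets reported as down; reads the /api/v1/targets response on stdin.
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d '[:space:]'
}

# List non-empty lastError fields; reads the /api/v1/targets response on stdin.
scrape_errors() {
  grep -o '"lastError":"[^"]*"' | grep -v '"lastError":""'
}

# Example (not executed here):
#   curl -s "http://<prometheus>:9090/api/v1/targets" | count_down_targets
#   curl -s "http://<prometheus>:9090/api/v1/targets" | scrape_errors
```

If no Besu targets appear in the response at all, the problem is more likely service discovery than the metrics endpoint itself.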
#### Alerts Not Firing
**Symptoms**: Alerts not triggering
**Diagnosis**:
```bash
# Check Alertmanager status (v2 API; the v1 API was removed in Alertmanager 0.27)
curl http://<alertmanager>:9093/api/v2/status
# Check alert rules
kubectl get prometheusrules -n besu-network
# Check notification channels (values in the secret are base64-encoded)
kubectl get secret alertmanager-config -n besu-network -o yaml
```
**Solutions**:
1. Verify alert rules
2. Check Alertmanager configuration
3. Verify notification channels
4. Check alert thresholds
5. Test alert rules
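
One way to exercise the notification path independently of Prometheus is to post a synthetic alert directly to Alertmanager's v2 API. The sketch below builds a minimal payload and sends it. It tests routing and notification channels only, not the alert rules themselves. The function names, the alert name `RunbookTestAlert`, and the default `warning` severity are hypothetical.

```bash
# Sketch only: alert name, severity, and function names are assumptions.

# Build a minimal Alertmanager v2 payload with alertname $1 and severity $2.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# Post a test alert to the Alertmanager base URL in $1.
# Optional: $2 = alert name, $3 = severity.
send_test_alert() {
  test_alert_payload "${2:-RunbookTestAlert}" "${3:-warning}" |
    curl -s --max-time 5 -X POST -H "Content-Type: application/json" \
      --data @- "$1/api/v2/alerts"
}

# Example (not executed here):
#   send_test_alert "http://<alertmanager>:9093"
```

If the test alert shows up in Alertmanager but no notification arrives, the receiver configuration is the likely cause.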
## Debugging Commands
### Network Debugging
```bash
# Check pod networking
kubectl exec -n besu-network <pod> -- ip addr
# Check DNS
kubectl exec -n besu-network <pod> -- nslookup <service>
# Check connectivity
kubectl exec -n besu-network <pod> -- ping <target>
```
### Besu Debugging
```bash
# Check Besu version
kubectl exec -n besu-network <pod> -- /opt/besu/bin/besu --version
# Check configuration
kubectl exec -n besu-network <pod> -- cat /config/besu-config.toml
# Check logs
kubectl logs -n besu-network <pod> --tail=100 -f
```
### Kubernetes Debugging
```bash
# Check pod status
kubectl describe pod <pod> -n besu-network
# Check events
kubectl get events -n besu-network --sort-by='.lastTimestamp'
# Check resources
kubectl top nodes
kubectl top pods -n besu-network
```
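
For a quick overview before describing individual pods, the pod listing can be filtered down to the ones that need attention. The sketch below reads `kubectl get pods --no-headers` output on stdin and prints pods that are not `Running`, or that are `Running` but not fully ready. The function name is hypothetical.

```bash
# Sketch only: function name is an assumption.

# Read `kubectl get pods --no-headers` output on stdin (NAME READY STATUS ...).
# Print pods that are not Running, and Running pods whose READY count is incomplete.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 != "Running") print $1 " " $3
    else if (ready[1] != ready[2]) print $1 " NotReady(" $2 ")"
  }'
}

# Example (not executed here):
#   kubectl get pods -n besu-network --no-headers | unhealthy_pods
```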
## Useful Resources
- [Besu Documentation](https://besu.hyperledger.org/)
- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
## Getting Help
- Check logs first
- Review monitoring dashboards
- Consult runbooks
- Contact on-call engineer
- Escalate if needed