# Troubleshooting Guide

## Common Issues and Solutions

### Network Issues

#### Blocks Not Being Produced

**Symptoms**: No new blocks, validators not responding

**Diagnosis**:
```bash
# Check validator status
kubectl get pods -n besu-network -l component=validator

# Check logs
kubectl logs -n besu-network <validator-pod> --tail=100

# Check block number
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://<rpc-endpoint>
```
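`eth_blockNumber` returns the height as a hex string, so two readings taken a few seconds apart are easier to compare after converting them to decimal. The following is a minimal sketch of that comparison; the function names are hypothetical helpers, not part of this deployment, and the JSON extraction assumes the compact single-line responses Besu normally returns.

```bash
# Hypothetical helpers; the names are illustrative only.

# Extract the "result" string from a JSON-RPC response read on stdin,
# e.g. {"jsonrpc":"2.0","id":1,"result":"0x1a"} -> 0x1a
rpc_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Succeed (exit 0) when the second reading is not higher than the first,
# meaning no block was produced between the two readings.
chain_stalled() {
  local before after
  before=$(hex_to_dec "$1")
  after=$(hex_to_dec "$2")
  [ "$after" -le "$before" ]
}
```

To use it, pipe the `curl` call above into `rpc_result` twice, with a pause longer than the block period in between, and pass both values to `chain_stalled`.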

**Solutions**:
1. Restart validators: `kubectl rollout restart statefulset/besu-validator -n besu-network`
2. Check network connectivity
3. Verify validator keys
4. Check IBFT configuration
5. Verify genesis file
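When checking the IBFT configuration, it helps to know how many validators must be online before blocks can be produced at all. Besu's IBFT 2.0 documentation describes a requirement of at least two-thirds of the validators; the sketch below turns that into arithmetic. The function names are hypothetical, and the exact rounding (a ceiling of 2n/3) is an assumption to confirm against the Besu version in use.

```bash
# Hypothetical helpers; rounding is an assumption, see the note above.

# Minimum number of validators that must be online: ceil(2n / 3).
bft_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

# Number of faulty validators the network tolerates: floor((n - 1) / 3).
bft_max_faulty() {
  echo $(( ($1 - 1) / 3 ))
}

# can_produce_blocks <total-validators> <online-validators>
# Succeeds when the online validators reach the quorum.
can_produce_blocks() {
  [ "$2" -ge "$(bft_quorum "$1")" ]
}
```

With four validators the quorum is three, so a single failed validator should not stop the chain, while two failed validators will. Compare the number of `Running` validator pods against this figure before restarting anything.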

#### Validators Not Peering

**Symptoms**: Validators not connecting to each other

**Diagnosis**:
```bash
# List connected peers (admin_peers needs ADMIN in --rpc-http-api; net_peerCount returns only the count)
kubectl exec -n besu-network <validator-pod> -- \
  curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
  http://localhost:8545

# Check static nodes
kubectl get configmap besu-validator-config -n besu-network -o yaml
```

**Solutions**:
1. Verify static-nodes.json configuration
2. Check network policies
3. Verify firewall rules
4. Check P2P port (30303) connectivity
5. Verify enode addresses
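Malformed enode URLs in `static-nodes.json` are an easy mistake to miss by eye. The sketch below validates the general shape `enode://<128 hex chars>@<host>:<port>` and prints the host and port so they can be probed separately. `parse_enode` is a hypothetical helper, and the pattern deliberately ignores bracketed IPv6 hosts.

```bash
# Hypothetical helper; does not handle bracketed IPv6 hosts.
# Prints "<host> <port>" for a well-formed enode URL, returns 1 otherwise.
parse_enode() {
  local url=$1 hostport
  printf '%s\n' "$url" |
    grep -Eq '^enode://[0-9a-fA-F]{128}@[^:@/]+:[0-9]+(\?discport=[0-9]+)?$' ||
    return 1
  hostport=${url#*@}          # drop "enode://<node-id>@"
  hostport=${hostport%%\?*}   # drop an optional "?discport=..." suffix
  printf '%s %s\n' "${hostport%:*}" "${hostport##*:}"
}
```

A URL that fails this check usually has a truncated node ID or a missing port. A URL that passes can still point at the wrong address, so compare the printed host against the pod or service IPs.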

### RPC Issues

#### RPC Endpoints Not Responding

**Symptoms**: RPC calls failing, timeouts

**Diagnosis**:
```bash
# Check RPC pod status
kubectl get pods -n besu-network -l component=rpc

# Check logs
kubectl logs -n besu-network <rpc-pod> --tail=100

# Test RPC endpoint
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://<rpc-endpoint>
```

**Solutions**:
1. Restart RPC pods: `kubectl rollout restart statefulset/besu-rpc -n besu-network`
2. Check Application Gateway status
3. Verify network policies
4. Check rate limiting
5. Scale RPC nodes if needed
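The HTTP status of a failed call often indicates which of the steps above to try first. This sketch captures the status with `curl` and maps it to a likely cause. Both function names are hypothetical, and the mapping reflects common gateway behaviour rather than anything specific to this deployment.

```bash
# Hypothetical helpers; the status-to-cause mapping is a rule of thumb.

# rpc_http_status <url> [timeout-seconds]
# Prints the HTTP status of an eth_blockNumber call; curl prints 000 when
# no HTTP response was received (refused connection, DNS failure, timeout).
rpc_http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time "${2:-5}" \
    -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "$1"
}

# Map a status code to the area worth checking first.
classify_status() {
  case "$1" in
    000)         echo "no response: check pods, network policies, DNS" ;;
    2??)         echo "ok" ;;
    429)         echo "rate limited" ;;
    502|503|504) echo "gateway cannot reach a healthy backend" ;;
    *)           echo "unexpected status $1" ;;
  esac
}
```

For example, `classify_status "$(rpc_http_status http://<rpc-endpoint>)"` prints one of the labels above.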

#### High Latency

**Symptoms**: Slow RPC responses

**Diagnosis**:
```bash
# Check pod resources
kubectl top pods -n besu-network -l component=rpc

# Check metrics
curl http://<rpc-pod>:9545/metrics

# Check sync status
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://<rpc-endpoint>
```

**Solutions**:
1. Scale RPC nodes
2. Increase resource limits
3. Check disk I/O
4. Verify network connectivity
5. Check for sync issues
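`eth_syncing` returns `false` when the node is in sync, and otherwise an object with `currentBlock` and `highestBlock` as hex strings. The sketch below reduces that response to the number of blocks the node is behind. `sync_lag` is a hypothetical helper, and it assumes the response arrives as compact JSON.

```bash
# Hypothetical helper. Reads an eth_syncing response on stdin and prints
# how many blocks the node is behind (0 when the result is false).
sync_lag() {
  local body current highest
  body=$(cat)
  case "$body" in
    *'"result":false'*|*'"result": false'*) echo 0; return 0 ;;
  esac
  current=$(printf '%s\n' "$body" |
    sed -n 's/.*"currentBlock"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')
  highest=$(printf '%s\n' "$body" |
    sed -n 's/.*"highestBlock"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')
  # Fail if either field is missing from the response.
  [ -n "$current" ] && [ -n "$highest" ] || return 1
  echo $(( $(printf '%d' "$highest") - $(printf '%d' "$current") ))
}
```

Pipe the `eth_syncing` call above into `sync_lag`. A node that is still many blocks behind is busy importing blocks, which can explain slow RPC responses on its own.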

### Oracle Issues

#### Oracle Not Updating

**Symptoms**: Oracle price not updating, circuit breaker open

**Diagnosis**:
```bash
# Check oracle publisher status
kubectl get pods -n besu-network -l app=oracle-publisher

# Check logs
kubectl logs -n besu-network <oracle-pod> --tail=100

# Check health endpoint
curl http://<oracle-pod>:8080/health

# Check metrics
curl http://<oracle-pod>:8000/metrics
```

**Solutions**:
1. Restart oracle publisher
2. Check data sources
3. Verify RPC connectivity
4. Check private key access
5. Verify circuit breaker configuration

#### Data Source Failures

**Symptoms**: Failed to fetch from data sources

**Diagnosis**:
```bash
# Check data source connectivity
curl <data-source-url>

# Check oracle publisher logs
kubectl logs -n besu-network <oracle-pod> | grep -i "data source"
```

**Solutions**:
1. Verify data source URLs
2. Check network connectivity
3. Verify API keys
4. Check rate limiting
5. Update data source configuration
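A single failed `curl` does not distinguish a transient outage from a persistent one. The sketch below retries a command a fixed number of times with a pause between attempts; `retry` is a hypothetical helper and is not part of the oracle publisher.

```bash
# Hypothetical helper.
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1          # every attempt failed
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 3 5 curl -fsS --max-time 10 -o /dev/null <data-source-url>` fails only if three attempts, five seconds apart, all fail. That pattern points to configuration, credentials, or network policy rather than a momentary outage.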

### Storage Issues

#### Disk Full

**Symptoms**: Pods failing, disk space errors

**Diagnosis**:
```bash
# Check disk usage
kubectl exec -n besu-network <pod> -- df -h

# Check PVC usage
kubectl get pvc -n besu-network

# Check pod logs
kubectl logs -n besu-network <pod> | grep -i "disk\|space\|full"
```

**Solutions**:
1. Increase PVC size
2. Clean up old data
3. Archive chaindata
4. Use snap sync for RPC nodes
5. Implement data retention policies
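To spot volumes that are close to full before pods start failing, the `df` output can be filtered by usage. The sketch below reads POSIX-format output (`df -P`) and prints mount points at or above a threshold. `over_threshold` is a hypothetical helper, and the default of 80 percent is an arbitrary choice.

```bash
# Hypothetical helper. Reads `df -P` output on stdin and prints
# "<mount-point> <usage>" for filesystems at or above the given percentage.
# Mount points containing spaces are not handled.
over_threshold() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)             # "90%" -> "90"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Usage: `kubectl exec -n besu-network <pod> -- df -P | over_threshold 80`.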

#### Slow Disk I/O

**Symptoms**: Slow sync, high latency

**Diagnosis**:
```bash
# Check disk I/O (five one-second samples; requires iostat from sysstat in the container image)
kubectl exec -n besu-network <pod> -- iostat -x 1 5

# Check metrics
curl http://<pod>:9545/metrics | grep -i "disk\|io"
```

**Solutions**:
1. Upgrade to Premium SSD
2. Increase disk size
3. Optimize Besu configuration
4. Check for disk contention
5. Use faster storage class

### Monitoring Issues

#### Metrics Not Collecting

**Symptoms**: No metrics in Prometheus

**Diagnosis**:
```bash
# Check Prometheus targets
curl http://<prometheus>:9090/api/v1/targets

# Check service discovery
kubectl get servicemonitors -n besu-network

# Check pod metrics endpoint
curl http://<pod>:9545/metrics
```

**Solutions**:
1. Verify ServiceMonitor configuration
2. Check network policies
3. Verify metrics endpoint
4. Restart Prometheus
5. Check service discovery configuration
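When verifying the metrics endpoint, it is worth confirming that a specific metric is actually exposed and not just that the endpoint answers. The sketch below pulls the value of one metric out of the Prometheus text format. `metric_value` is a hypothetical helper, and any metric name you pass to it should be taken from the endpoint's own output.

```bash
# Hypothetical helper. Reads Prometheus text exposition format on stdin and
# prints the value of every sample whose metric name matches exactly.
# Label values containing "}" are not handled.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                   # skip HELP and TYPE comment lines
    {
      line = $0
      sub(/\{[^}]*\}/, "", line)    # drop the {label="..."} block
      split(line, field, " ")
      if (field[1] == name) print field[2]
    }'
}
```

Usage: `curl -s http://<pod>:9545/metrics | metric_value <metric-name>`. Empty output means the endpoint does not expose that metric, which points to the Besu metrics configuration rather than to Prometheus.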

#### Alerts Not Firing

**Symptoms**: Alerts not triggering

**Diagnosis**:
```bash
# Check Alertmanager status (API v2; the v1 API was removed in Alertmanager 0.27)
curl http://<alertmanager>:9093/api/v2/status

# Check alert rules
kubectl get prometheusrules -n besu-network

# Check notification channels (Secret values in the output are base64-encoded)
kubectl get secret alertmanager-config -n besu-network -o yaml
```

**Solutions**:
1. Verify alert rules
2. Check Alertmanager configuration
3. Verify notification channels
4. Check alert thresholds
5. Test alert rules

## Debugging Commands

### Network Debugging

```bash
# Check pod networking
kubectl exec -n besu-network <pod> -- ip addr

# Check DNS
kubectl exec -n besu-network <pod> -- nslookup <service>

# Check connectivity
kubectl exec -n besu-network <pod> -- ping <target>
```

### Besu Debugging

```bash
# Check Besu version
kubectl exec -n besu-network <pod> -- /opt/besu/bin/besu --version

# Check configuration
kubectl exec -n besu-network <pod> -- cat /config/besu-config.toml

# Check logs
kubectl logs -n besu-network <pod> --tail=100 -f
```

### Kubernetes Debugging

```bash
# Check pod status
kubectl describe pod <pod> -n besu-network

# Check events
kubectl get events -n besu-network --sort-by='.lastTimestamp'

# Check resources
kubectl top nodes
kubectl top pods -n besu-network
```

## Useful Resources

- [Besu Documentation](https://besu.hyperledger.org/)
- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)

## Getting Help

1. Check the logs of the affected pods first
2. Review the monitoring dashboards
3. Consult the relevant runbook
4. Contact the on-call engineer
5. Escalate if the issue remains unresolved