# Troubleshooting FAQ
**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation
---
Common issues and solutions for Besu validated set deployment.
## Table of Contents
**Estimated Reading Time:** 30 minutes
**Progress:** Check off sections as you read
1. ✅ [Container Issues](#container-issues) - *Container troubleshooting*
2. ✅ [Service Issues](#service-issues) - *Service troubleshooting*
3. ✅ [Network Issues](#network-issues) - *Network troubleshooting*
4. ✅ [Consensus Issues](#consensus-issues) - *Consensus troubleshooting*
5. ✅ [Configuration Issues](#configuration-issues) - *Configuration troubleshooting*
6. ✅ [Performance Issues](#performance-issues) - *Performance troubleshooting*
7. ✅ [Additional Common Questions](#additional-common-questions) - *More FAQs*
8. [RPC errors -32001 / -32602 / gas 32xxx](RPC_ERRORS_32001_32602.md) - *Nonce too low, Invalid params, gas when deploying*
---
## Troubleshooting Flow (Decision Tree)
1. **Is the service/container down?** → Check logs (`journalctl -u pve-container@<vmid>`, `systemctl status`), then [Container Issues](#container-issues) or [Service Issues](#service-issues).
2. **Network/connectivity issue?** → Check ping, curl, DNS, firewall; see [Network Issues](#network-issues).
3. **Consensus / QBFT?** → See [QBFT_TROUBLESHOOTING.md](QBFT_TROUBLESHOOTING.md) and [Consensus Issues](#consensus-issues).
4. **Configuration or performance?** → See [Configuration Issues](#configuration-issues), [Performance Issues](#performance-issues), or [Additional Common Questions](#additional-common-questions).
---
## Container Issues
### Q: Container won't start
**Symptoms**: `pct status <vmid>` shows "stopped" or errors during startup
**Solutions**:
```bash
# Check container status
pct status <vmid>
# View container console
pct console <vmid>
# Check logs
journalctl -u pve-container@<vmid>
# Check container configuration
pct config <vmid>
# Try starting manually
pct start <vmid>
```
**Common Causes**:
- Insufficient resources (RAM, disk)
- Network configuration errors
- Invalid container configuration
- OS template issues
<details>
<summary>Click to expand advanced troubleshooting steps</summary>
**Advanced Diagnostics:**
```bash
# Check container resources (pct list takes no --full option)
pct config <vmid> | grep -E '^(cores|memory|swap|rootfs)'
# Check Proxmox host resources
free -h
df -h
# Check container logs in detail
journalctl -u pve-container@<vmid> -n 100 --no-pager
# Verify container template
pveam list | grep <template-name>
```
</details>
---
### Q: Container runs out of disk space
**Symptoms**: Services fail, "No space left on device" errors
**Solutions**:
```bash
# Check disk usage
pct exec <vmid> -- df -h
# Check Besu database size
pct exec <vmid> -- du -sh /data/besu/database/
# Clean up old logs
pct exec <vmid> -- journalctl --vacuum-time=7d
# Increase disk size (if using LVM)
pct resize <vmid> rootfs +10G
```
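When several filesystems are mounted, it can help to filter the `df` output by a usage threshold. The following is a minimal sketch, assuming POSIX-format output from `df -P`; the helper name `flag_full_filesystems` and the 85% default are illustrative and not part of the project scripts.

```bash
# Hypothetical helper: print "<mount point> <use%>" for every filesystem
# at or above the threshold. Reads `df -P` output on stdin.
# The 85% default is an assumed value, not a project setting.
flag_full_filesystems() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= t) print $6 " " $5
  }'
}

# Usage on the Proxmox host (replace <vmid>):
#   pct exec <vmid> -- df -P | flag_full_filesystems 85
```

Mount points containing spaces are not handled by this sketch.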
---
### Q: Container network issues
**Symptoms**: Cannot ping, cannot connect to services
**Solutions**:
```bash
# Check network configuration
pct config <vmid> | grep net0
# Check if container has IP
pct exec <vmid> -- ip addr show
# Check routing
pct exec <vmid> -- ip route
# Restart container networking
pct stop <vmid>
pct start <vmid>
```
---
## Service Issues
### Q: Besu service won't start
**Symptoms**: `systemctl status besu-validator` shows failed
**Solutions**:
```bash
# Check service status
pct exec <vmid> -- systemctl status besu-validator
# View service logs
pct exec <vmid> -- journalctl -u besu-validator -n 100
# Check for configuration errors
pct exec <vmid> -- besu --config-file=/etc/besu/config-validator.toml --help
# Verify configuration file syntax
pct exec <vmid> -- cat /etc/besu/config-validator.toml
```
**Common Causes**:
- Missing configuration files
- Invalid configuration syntax
- Missing validator keys
- Port conflicts
- Insufficient resources
---
### Q: Service starts but crashes
**Symptoms**: Service starts then stops, high restart count
**Solutions**:
```bash
# Check crash logs
pct exec <vmid> -- journalctl -u besu-validator --since "10 minutes ago"
# Check for out of memory
pct exec <vmid> -- dmesg | grep -i "out of memory"
# Check system resources
pct exec <vmid> -- free -h
pct exec <vmid> -- df -h
# Check JVM heap settings
pct exec <vmid> -- cat /etc/systemd/system/besu-validator.service | grep BESU_OPTS
```
---
### Q: Service shows as active but not responding
**Symptoms**: Service status shows "active" but RPC/P2P not responding
**Solutions**:
```bash
# Check if process is actually running
pct exec <vmid> -- ps aux | grep besu
# Check if ports are listening
pct exec <vmid> -- netstat -tuln | grep -E "30303|8545|9545"
# Check firewall rules
pct exec <vmid> -- iptables -L -n
# Test connectivity
pct exec <vmid> -- curl -s http://localhost:8545
```
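The port check above can be turned into a list of ports that are expected but not listening. This is a sketch only: `missing_ports` is a hypothetical helper, and the port numbers are the ones used in the `netstat` command above.

```bash
# Hypothetical helper: print each expected port that has no listening socket.
# Reads `netstat -tuln` or `ss -tuln` output on stdin; ports are the arguments.
missing_ports() {
  local listing port
  listing="$(cat)"
  for port in "$@"; do
    # Match ":<port>" followed by whitespace, i.e. the local address column.
    if ! printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]"; then
      echo "$port"
    fi
  done
}

# Usage on the Proxmox host (replace <vmid>):
#   pct exec <vmid> -- netstat -tuln | missing_ports 30303 8545 9545
```

An empty result means every listed port has a listener; it does not prove the service behind the port is healthy.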
---
## Network Issues
### Q: Nodes cannot connect to peers
**Symptoms**: Low or zero peer count, "No peers" in logs
**Solutions**:
```bash
# Check static-nodes.json
pct exec <vmid> -- cat /etc/besu/static-nodes.json
# Check permissions-nodes.toml
pct exec <vmid> -- cat /etc/besu/permissions-nodes.toml
# Verify enode URLs are correct
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode
# Check P2P port is open
pct exec <vmid> -- netstat -tuln | grep 30303
# Test connectivity to peer
pct exec <vmid> -- ping -c 3 <peer-ip>
```
**Common Causes**:
- Incorrect enode URLs in static-nodes.json
- Firewall blocking P2P port (30303)
- Nodes not in permissions-nodes.toml
- Network connectivity issues
---
### Q: Invalid enode URL errors
**Symptoms**: "Invalid enode URL syntax" or "Invalid node ID" in logs
**Solutions**:
```bash
# Check node ID length (must be 128 hex chars)
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode | \
sed 's|^enode://||' | cut -d'@' -f1 | wc -c
# Should output 129 (128 chars + newline)
# Fix node IDs using allowlist scripts
./scripts/besu-collect-all-enodes.sh
./scripts/besu-generate-allowlist.sh
./scripts/besu-deploy-allowlist.sh
```
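Before regenerating the allowlist, an individual enode URL can be checked against the expected shape `enode://<128 hex chars>@<host>:<port>`. The sketch below assumes IPv4 addresses or hostnames (bracketed IPv6 hosts are not handled); `is_valid_enode` is a hypothetical helper, not one of the allowlist scripts.

```bash
# Hypothetical helper: succeed only if the argument looks like a well-formed
# enode URL. The node ID must be exactly 128 hex characters; the host is only
# loosely checked, and an optional ?discport=<port> suffix is accepted.
is_valid_enode() {
  printf '%s\n' "$1" |
    grep -Eq '^enode://[0-9a-fA-F]{128}@[^:@/]+:[0-9]{1,5}(\?discport=[0-9]{1,5})?$'
}

# Usage:
#   is_valid_enode "$enode" && echo "OK" || echo "INVALID: $enode"
```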
---
### Q: RPC endpoint not accessible
**Symptoms**: Cannot connect to RPC on port 8545
**Solutions**:
```bash
# Check if RPC is enabled (validators typically don't have RPC)
pct exec <vmid> -- grep -i "rpc-http-enabled" /etc/besu/config-*.toml
# Check if RPC port is listening
pct exec <vmid> -- netstat -tuln | grep 8545
# Check firewall
pct exec <vmid> -- iptables -L -n | grep 8545
# Test from container
pct exec <vmid> -- curl -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://localhost:8545
# Check host allowlist in config
pct exec <vmid> -- grep -i "host-allowlist\|rpc-http-host" /etc/besu/config-*.toml
```
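`eth_blockNumber` returns the height as a hex string, which is awkward to compare by eye. A small sketch for converting it to decimal follows; `rpc_result_to_decimal` is a hypothetical helper and assumes the compact single-line JSON that Besu normally returns.

```bash
# Hypothetical helper: read a JSON-RPC response on stdin and print its hex
# "result" field as a decimal number. Returns non-zero when there is no hex
# result (for example, when the response carries an "error" object).
rpc_result_to_decimal() {
  local hex
  hex="$(sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')"
  [ -n "$hex" ] || return 1
  printf '%d\n' "$hex"
}

# Usage on the Proxmox host (replace <vmid>):
#   pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
#     -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
#     http://localhost:8545 | rpc_result_to_decimal
```

Running the command twice a few seconds apart shows whether the height is advancing.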
---
## Consensus Issues
### Q: No blocks being produced
**Symptoms**: Block height not increasing, "No blocks" in logs
**Solutions**:
```bash
# Check validator service is running
pct exec <vmid> -- systemctl status besu-validator
# Check validator keys
pct exec <vmid> -- ls -la /keys/validators/
# Check consensus logs
pct exec <vmid> -- journalctl -u besu-validator | grep -i "consensus\|qbft\|proposing"
# Verify validators are in genesis (if static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | grep -A 20 "qbft"
# Check peer connectivity
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
http://localhost:8545
```
**Common Causes**:
- Validator keys missing or incorrect
- Not enough validators online
- Network connectivity issues
- Consensus configuration errors
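"Not enough validators online" can be made concrete with a little arithmetic. Besu's QBFT implementation needs a quorum of ceil(2N/3) validators to commit a block, where N is the size of the validator set; this FAQ does not state the formula itself, so treat the sketch below as background rather than project policy. The function names are hypothetical.

```bash
# Quorum of validators needed to commit a block: ceil(2N/3),
# computed with integer arithmetic as (2N + 2) / 3.
qbft_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

# Number of validators that can be offline while a quorum is still reachable.
qbft_max_offline() {
  echo $(( $1 - $(qbft_quorum "$1") ))
}

# Usage:
#   qbft_quorum 5        # validators that must be online out of 5
#   qbft_max_offline 5   # validators that may be down out of 5
```

With five validators the quorum is four, so block production stops as soon as two validators are down at the same time.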
---
### Q: Validator not participating in consensus
**Symptoms**: Validator running but not producing blocks
**Solutions**:
```bash
# Verify validator address
pct exec <vmid> -- cat /keys/validators/validator-*/address.txt
# Check if address is in validator contract (for dynamic validators)
# Or check genesis.json (for static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | python3 -m json.tool | grep -A 10 "qbft"
# Verify validator keys are loaded
pct exec <vmid> -- journalctl -u besu-validator | grep -i "validator.*key"
# Check for permission errors
pct exec <vmid> -- journalctl -u besu-validator | grep -i "permission\|denied"
```
---
## Configuration Issues
### Q: Configuration file not found
**Symptoms**: "File not found" errors, service won't start
**Solutions**:
```bash
# List all config files
pct exec <vmid> -- ls -la /etc/besu/
# Verify required files exist
pct exec <vmid> -- test -f /etc/besu/genesis.json && echo "genesis.json OK" || echo "genesis.json MISSING"
pct exec <vmid> -- test -f /etc/besu/config-validator.toml && echo "config OK" || echo "config MISSING"
# Copy missing files
# (Use copy-besu-config.sh script)
./scripts/copy-besu-config.sh /path/to/smom-dbis-138
```
---
### Q: Invalid configuration syntax
**Symptoms**: "Invalid option" or syntax errors in logs
**Solutions**:
```bash
# Validate TOML syntax (tomllib requires Python 3.11+; the file must be opened in binary mode)
pct exec <vmid> -- python3 -c "import tomllib; tomllib.load(open('/etc/besu/config-validator.toml', 'rb'))" 2>&1
# Validate JSON syntax
pct exec <vmid> -- python3 -m json.tool /etc/besu/genesis.json > /dev/null
# Check for deprecated options
pct exec <vmid> -- journalctl -u besu-validator | grep -i "deprecated\|unknown option"
# Review Besu documentation for current options
```
---
### Q: Path errors in configuration
**Symptoms**: "File not found" errors with paths like "/config/genesis.json"
**Solutions**:
```bash
# Check configuration file paths
pct exec <vmid> -- grep -E "genesis-file|data-path" /etc/besu/config-validator.toml
# Correct paths should be:
# genesis-file="/etc/besu/genesis.json"
# data-path="/data/besu"
# Fix paths if needed
pct exec <vmid> -- sed -i 's|/config/|/etc/besu/|g' /etc/besu/config-validator.toml
```
---
## Performance Issues
### Q: High CPU usage
**Symptoms**: Container CPU usage > 80% consistently
**Solutions**:
```bash
# Check CPU usage
pct exec <vmid> -- top -bn1 | head -20
# Check JVM GC activity
pct exec <vmid> -- journalctl -u besu-validator | grep -i "gc\|pause"
# Adjust JVM settings if needed
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS and JAVA_OPTS
# Consider allocating more CPU cores
pct set <vmid> --cores 4
```
---
### Q: High memory usage
**Symptoms**: Container running out of memory, OOM kills
**Solutions**:
```bash
# Check memory usage
pct exec <vmid> -- free -h
# Check JVM heap settings
pct exec <vmid> -- ps aux | grep besu | grep -oP 'Xm[xs]\K[0-9]+[gm]'
# Reduce heap size if too large
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS="-Xmx4g" to appropriate size
# Or increase container memory
pct set <vmid> --memory 8192
```
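Whether a heap setting fits inside the container can be checked with simple arithmetic before editing the unit file. The sketch below is illustrative: `heap_to_mib` and `heap_fits` are hypothetical helpers, and the headroom for non-heap JVM memory and the operating system is an assumed figure that should be tuned per node.

```bash
# Hypothetical helper: convert a JVM heap size such as 4g or 512m to MiB.
# Returns non-zero for values without a g/m suffix.
heap_to_mib() {
  local value unit
  value="${1%[gGmM]}"
  unit="${1#"$value"}"
  case "$unit" in
    g|G) echo $(( value * 1024 )) ;;
    m|M) echo "$value" ;;
    *)   return 1 ;;
  esac
}

# Succeed if heap + headroom fits in the container memory.
# Arguments: <Xmx value> <container memory in MiB> <headroom in MiB>
heap_fits() {
  local heap_mib
  heap_mib="$(heap_to_mib "$1")" || return 1
  [ $(( heap_mib + $3 )) -le "$2" ]
}

# Usage (1024 MiB headroom is an assumption):
#   heap_fits 4g 8192 1024 && echo "heap fits" || echo "heap too large"
```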
---
### Q: Slow sync or block processing
**Symptoms**: Blocks processing slowly, falling behind
**Solutions**:
```bash
# Check database size and health
pct exec <vmid> -- du -sh /data/besu/database/
# Check disk I/O
pct exec <vmid> -- iostat -x 1 5
# Consider using SSD storage
# Check network latency
pct exec <vmid> -- ping -c 10 <peer-ip>
# Verify sufficient peers
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
http://localhost:8545 | python3 -c "import sys, json; print(len(json.load(sys.stdin).get('result', [])))"
```
---
## General Troubleshooting Commands
```bash
# View all container statuses
for vmid in 1000 1001 1002 1003 1004 1500 1501 1502 1503 2500 2501 2502; do
  echo "=== Container $vmid ==="
  pct status "$vmid"
done
# Check all service statuses
for vmid in 1000 1001 1002 1003 1004; do
  pct exec "$vmid" -- systemctl status besu-validator --no-pager -l | head -10
done
# View recent logs from all nodes
for vmid in 1000 1001 1002 1003 1004; do
  echo "=== Logs for container $vmid ==="
  pct exec "$vmid" -- journalctl -u besu-validator -n 20 --no-pager
done
# Check network connectivity between nodes
pct exec 1000 -- ping -c 3 192.168.11.14 # validator to validator
# Verify RPC endpoint (RPC nodes only)
pct exec 2500 -- curl -s -X POST -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://localhost:8545 | python3 -m json.tool
```
---
## Getting Help
If issues persist:
1. **Collect Information**:
   - Service logs: `pct exec <vmid> -- journalctl -u besu-validator -n 100`
   - Container status: `pct status <vmid>`
   - Configuration: `pct exec <vmid> -- cat /etc/besu/config-validator.toml`
   - Network: `pct exec <vmid> -- ip addr show`
2. **Check Documentation**:
   - [Besu Nodes File Reference](../06-besu/BESU_NODES_FILE_REFERENCE.md)
   - [Deployment Guide](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)
   - [Besu Documentation](https://besu.hyperledger.org/)
3. **Validate Configuration**:
   - Run prerequisites check: `./scripts/validation/check-prerequisites.sh`
   - Validate validators: `./scripts/validation/validate-validator-set.sh`
4. **Review Logs**:
   - Check deployment logs: `logs/deploy-validated-set-*.log`
   - Check service logs in containers
   - Check Proxmox host logs
---
## Additional Common Questions
### Q: How do I add a new VMID?
**Answer:**
1. Check available VMID ranges in [VMID_ALLOCATION_FINAL.md](../02-architecture/VMID_ALLOCATION_FINAL.md)
2. Select an appropriate VMID from the designated range for your service
3. Verify the VMID is not already in use: `pct list | grep <vmid>` or `qm list | grep <vmid>`
4. Document the assignment in VMID_ALLOCATION_FINAL.md
5. Use the VMID when creating containers/VMs
**Example:**
```bash
# Check if VMID 2503 is available
pct list | grep 2503
qm list | grep 2503
# If available, create container with VMID 2503
pct create 2503 ...
```
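A plain `grep 2503` also matches VMID 12503 or a container whose name contains "2503". A stricter check compares only the first column. The sketch below uses a hypothetical helper, `vmid_in_use`, and reads the listing from stdin so it works with either `pct list` or `qm list`.

```bash
# Hypothetical helper: succeed if the VMID appears in the first column of
# `pct list` or `qm list` output read from stdin (exact match, header skipped).
vmid_in_use() {
  awk -v id="$1" 'NR > 1 && $1 == id { found = 1 } END { exit !found }'
}

# Usage on the Proxmox host:
#   if pct list | vmid_in_use 2503 || qm list | vmid_in_use 2503; then
#     echo "VMID 2503 is already in use"
#   else
#     echo "VMID 2503 is free"
#   fi
```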
**Related Documentation:**
- [VMID Allocation Registry](../02-architecture/VMID_ALLOCATION_FINAL.md) ⭐⭐⭐
- [Quick Reference Cards](../12-quick-reference/QUICK_REFERENCE_CARDS.md) (VMID and network) ⭐⭐⭐
---
### Q: What's the difference between public and private RPC?
**Answer:**
| Feature | Public RPC | Private RPC |
|---------|-----------|-------------|
| **Discovery** | Enabled | Disabled |
| **Permissioning** | Disabled | Enabled |
| **Access** | Public (CORS: *) | Restricted (internal only) |
| **APIs** | ETH, NET, WEB3 (read-only) | ETH, NET, WEB3, ADMIN, DEBUG (full) |
| **Use Case** | dApps, external users | Internal services, admin |
| **ChainID** | 0x8a (138) or 0x1 (wallet compatibility) | 0x8a (138) |
| **Domain** | rpc-http-pub.d-bis.org | rpc-http-prv.d-bis.org |
**Public RPC:**
- Accessible from the internet
- Used by dApps and external tools
- Read-only APIs for security
- May report chainID 0x1 for MetaMask compatibility
**Private RPC:**
- Internal network only
- Used by internal services and administration
- Full API access including ADMIN and DEBUG
- Strict permissioning and access control
**Related Documentation:**
- [RPC Node Types Architecture](../05-network/RPC_NODE_TYPES_ARCHITECTURE.md) ⭐⭐
- [RPC Template Types](../05-network/RPC_TEMPLATE_TYPES.md) ⭐
---
### Q: How do I troubleshoot Cloudflare tunnel issues?
**Answer:**
**Step 1: Check Tunnel Status**
```bash
# Check cloudflared container status
pct status 102
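# Check tunnel logs (pct has no logs subcommand; assumes cloudflared runs as a systemd unit named cloudflared)
pct exec 102 -- journalctl -u cloudflared -n 50 --no-pager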
# Verify tunnel is running
pct exec 102 -- ps aux | grep cloudflared
```
**Step 2: Verify Configuration**
```bash
# Check tunnel configuration
pct exec 102 -- cat /etc/cloudflared/config.yaml
# Verify credentials file exists
pct exec 102 -- ls -la /etc/cloudflared/*.json
```
**Step 3: Test Connectivity**
```bash
# Test from internal network
curl -I http://192.168.11.21:80
# Test from external (through Cloudflare)
curl -I https://explorer.d-bis.org
```
**Step 4: Check Cloudflare Dashboard**
- Verify tunnel is healthy in Cloudflare Zero Trust dashboard
- Check ingress rules are configured correctly
- Verify DNS records point to tunnel
**Common Issues:**
- Tunnel not running → Restart: `pct reboot 102`
- Configuration error → Check YAML syntax
- Credentials invalid → Regenerate tunnel token
- DNS not resolving → Check Cloudflare DNS settings
**Related Documentation:**
- [Cloudflare Tunnel Routing Architecture](../05-network/CLOUDFLARE_TUNNEL_ROUTING_ARCHITECTURE.md) ⭐⭐⭐
- [Cloudflare Routing Master Reference](../05-network/CLOUDFLARE_ROUTING_MASTER.md) ⭐⭐⭐
- [Troubleshooting Quick Reference](../12-quick-reference/TROUBLESHOOTING_QUICK_REFERENCE.md) ⭐⭐⭐
---
### Q: What's the recommended storage configuration?
**Answer:**
**For R630 Compute Nodes:**
- **Boot drives (2×600GB):** ZFS mirror (recommended) or hardware RAID1
- **Data SSDs (6×250GB):** ZFS pool with one of:
  - Striped mirrors (if pairs available)
  - RAIDZ1 (single parity, 5 drives usable)
  - RAIDZ2 (double parity, 4 drives usable)
- **High-write workloads:** Dedicated dataset with quotas
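The trade-off between the three data-pool layouts is easier to see as raw capacity. The following sketch computes approximate usable space for equal-sized drives; it ignores ZFS metadata overhead and reserved space, and `zfs_usable_gb` is a hypothetical helper rather than a project script.

```bash
# Approximate usable capacity in GB for N equal drives, ignoring ZFS overhead.
# Arguments: <layout: mirror|raidz1|raidz2> <drive count> <drive size in GB>
zfs_usable_gb() {
  local layout="$1" drives="$2" size_gb="$3"
  case "$layout" in
    mirror) echo $(( drives / 2 * size_gb )) ;;     # striped two-way mirrors
    raidz1) echo $(( (drives - 1) * size_gb )) ;;   # one drive of parity
    raidz2) echo $(( (drives - 2) * size_gb )) ;;   # two drives of parity
    *)      return 1 ;;
  esac
}

# Usage for the 6×250GB data SSDs:
#   for layout in mirror raidz1 raidz2; do
#     echo "$layout: $(zfs_usable_gb "$layout" 6 250) GB"
#   done
```

For six 250GB drives this gives roughly 750GB for striped mirrors, 1250GB for RAIDZ1, and 1000GB for RAIDZ2.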
**For ML110 Management Node:**
- Standard Proxmox storage configuration
- Sufficient space for templates and backups
**Storage Best Practices:**
- Use ZFS for data integrity and snapshots
- Enable compression for space efficiency
- Set quotas for containers to prevent disk exhaustion
- Regular backups to external storage
**Related Documentation:**
- [Network Architecture - Storage Orchestration](../02-architecture/NETWORK_ARCHITECTURE.md#53-storage-orchestration-r630) ⭐⭐⭐
- [Backup and Restore](../03-deployment/BACKUP_AND_RESTORE.md) ⭐⭐
---
### Q: How do I migrate from flat LAN to VLANs?
**Answer:**
**Phase 1: Preparation**
1. Review VLAN plan in [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)
2. Document current IP assignments
3. Plan IP address migration for each service
4. Create rollback plan
**Phase 2: Network Configuration**
1. Configure ES216G switches with VLAN trunks
2. Enable VLAN-aware bridge on Proxmox hosts
3. Create VLAN interfaces on ER605 router
4. Test VLAN connectivity
**Phase 3: Service Migration**
1. Migrate services one VLAN at a time
2. Start with non-critical services
3. Update container/VM network configuration
4. Verify connectivity after each migration
**Phase 4: Validation**
1. Test all services on new VLANs
2. Verify routing between VLANs
3. Test egress NAT pools
4. Document final configuration
**Migration Order (Recommended):**
1. Management services (VLAN 11) - Already active
2. Monitoring/observability (VLAN 120, 121)
3. Besu network (VLANs 110, 111, 112)
4. CCIP network (VLANs 130, 132, 133, 134)
5. Service layer (VLAN 160)
6. Sovereign tenants (VLANs 200-203)
**Related Documentation:**
- [Network Architecture - VLAN Orchestration](../02-architecture/NETWORK_ARCHITECTURE.md#3-layer-2--vlan-orchestration-plan) ⭐⭐⭐
- [Orchestration Deployment Guide - VLAN Enablement](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md#phase-1--vlan-enablement) ⭐⭐⭐
---
## Additional Common Questions (Expanded)
### Q: How do I find which VMID uses a given IP?
**Answer:** See [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) for IP ranges by service type and VMID. On the Proxmox host, use `pct list` or `qm list` to enumerate containers/VMs, then `pct config <vmid>` or `qm config <vmid>` to inspect each one's network settings. For containers, the IP appears in the `net0` line.
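For containers with a static address, the IP is stored in the `net0` line of `pct config`, so the lookup can be scripted. This is a sketch under that assumption: `container_ip` is a hypothetical helper, and it prints nothing for DHCP-configured containers.

```bash
# Hypothetical helper: print the IPv4 address from `pct config` net lines on
# stdin. Prints nothing when ip=dhcp or when no ip= field is present.
container_ip() {
  sed -n 's/^net[0-9]*:.*[,[:space:]]ip=\([0-9.]*\)\/[0-9]*.*/\1/p'
}

# Usage on the Proxmox host: find the container that owns 192.168.11.14.
#   for vmid in $(pct list | awk 'NR > 1 { print $1 }'); do
#     if [ "$(pct config "$vmid" | container_ip)" = "192.168.11.14" ]; then
#       echo "VMID $vmid"
#     fi
#   done
```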
### Q: What's the difference between public and private RPC?
**Answer:** **Public RPC** (e.g. rpc-http-pub.d-bis.org) is exposed to external clients and may enforce rate limits and JWT authentication. **Private RPC** (e.g. rpc-http-prv.d-bis.org) is for internal or trusted clients. See [05-network/CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md) for domain → backend mapping.
### Q: Cloudflare tunnel not connecting: where do I start?
**Answer:** 1) Check the cloudflared service on the tunnel host (VMID 102 or NPMplus). 2) Verify the credentials and tunnel ID. 3) Check [04-configuration/cloudflare/CLOUDFLARE_TUNNEL_CONFIGURATION_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_TUNNEL_CONFIGURATION_GUIDE.md) and [05-network/CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md). 4) Confirm NPMplus (192.168.11.167) is reachable through the UDM Pro port forward.
### Q: Recommended storage configuration for RPC nodes?
**Answer:** Use SSD for Besu data directory; avoid NFS for Besu unless tested. See [02-architecture/NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md) and deployment guides for node layout. Run `scripts/audit-proxmox-rpc-storage.sh` to check restrictions.
---
## Related Documentation
### Operational Procedures
- **[OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** - Complete operational runbooks
- **[QBFT_TROUBLESHOOTING.md](QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting
### Deployment & Configuration
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status
- **[NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)** - Network architecture reference
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Deployment guide
### Monitoring
- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring
### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index