# Troubleshooting FAQ

**Last Updated:** 2026-01-31
**Document Version:** 1.0
**Status:** Active Documentation

---

Common issues and solutions for Besu validated set deployment.

## Table of Contents

**Estimated Reading Time:** 30 minutes
**Progress:** Check off sections as you read

1. ✅ [Container Issues](#container-issues) - *Container troubleshooting*
2. ✅ [Service Issues](#service-issues) - *Service troubleshooting*
3. ✅ [Network Issues](#network-issues) - *Network troubleshooting*
4. ✅ [Consensus Issues](#consensus-issues) - *Consensus troubleshooting*
5. ✅ [Configuration Issues](#configuration-issues) - *Configuration troubleshooting*
6. ✅ [Performance Issues](#performance-issues) - *Performance troubleshooting*
7. ✅ [Additional Common Questions](#additional-common-questions) - *More FAQs*
8. [RPC errors -32001 / -32602 / gas 32xxx](RPC_ERRORS_32001_32602.md) - *Nonce too low, Invalid params, gas when deploying*

---

## Troubleshooting Flow (Decision Tree)

1. **Is the service/container down?** → Check logs (`journalctl -u pve-container@<vmid>`, `systemctl status`), then [Container Issues](#container-issues) or [Service Issues](#service-issues).
2. **Network/connectivity issue?** → Check ping, curl, DNS, firewall; see [Network Issues](#network-issues).
3. **Consensus / QBFT?** → See [QBFT_TROUBLESHOOTING.md](QBFT_TROUBLESHOOTING.md) and [Consensus Issues](#consensus-issues).
4. **Configuration or performance?** → See [Configuration Issues](#configuration-issues), [Performance Issues](#performance-issues), or [Additional Common Questions](#additional-common-questions).

---

## Container Issues

### Q: Container won't start

**Symptoms**: `pct status <vmid>` shows "stopped" or errors during startup

**Solutions**:
```bash
# Check container status
pct status <vmid>

# View container console
pct console <vmid>

# Check logs
journalctl -u pve-container@<vmid>

# Check container configuration
pct config <vmid>

# Try starting manually
pct start <vmid>
```

**Common Causes**:
- Insufficient resources (RAM, disk)
- Network configuration errors
- Invalid container configuration
- OS template issues

<details>
<summary>Click to expand advanced troubleshooting steps</summary>

**Advanced Diagnostics:**
```bash
# Check container status and resource usage
pct status <vmid> --verbose

# Check Proxmox host resources
free -h
df -h

# Check container logs in detail
journalctl -u pve-container@<vmid> -n 100 --no-pager

# Verify container template (pveam list requires a storage name, e.g. local)
pveam list <storage> | grep <template-name>
```

</details>

---

### Q: Container runs out of disk space

**Symptoms**: Services fail, "No space left on device" errors

**Solutions**:
```bash
# Check disk usage
pct exec <vmid> -- df -h

# Check Besu database size
pct exec <vmid> -- du -sh /data/besu/database/

# Clean up old logs
pct exec <vmid> -- journalctl --vacuum-time=7d

# Increase disk size (if using LVM)
pct resize <vmid> rootfs +10G
```
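
To catch a filling disk before services fail, the `df` output can be filtered for mount points above a usage threshold. This is a minimal sketch: the helper name `disk_over_threshold` and the 85% default are illustrative assumptions, not project settings.

```bash
# Reads `df -P` output on stdin and prints mount points at or above a threshold.
# Hypothetical helper; the 85% default is only an example value.
disk_over_threshold() {
    local threshold="${1:-85}"
    # df -P columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
    awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6, $5 }'
}

# Usage (on the Proxmox host):
#   pct exec <vmid> -- df -P | disk_over_threshold 85
```

The function only parses text, so it can be tried against saved `df -P` output as well as a live container.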

---

### Q: Container network issues

**Symptoms**: Cannot ping, cannot connect to services

**Solutions**:
```bash
# Check network configuration
pct config <vmid> | grep net0

# Check if container has IP
pct exec <vmid> -- ip addr show

# Check routing
pct exec <vmid> -- ip route

# Restart container networking
pct stop <vmid>
pct start <vmid>
```

---

## Service Issues

### Q: Besu service won't start

**Symptoms**: `systemctl status besu-validator` shows failed

**Solutions**:
```bash
# Check service status
pct exec <vmid> -- systemctl status besu-validator

# View service logs
pct exec <vmid> -- journalctl -u besu-validator -n 100

# Check for configuration errors (TOML syntax and unknown options)
pct exec <vmid> -- besu validate-config --config-file=/etc/besu/config-validator.toml

# Verify configuration file syntax
pct exec <vmid> -- cat /etc/besu/config-validator.toml
```

**Common Causes**:
- Missing configuration files
- Invalid configuration syntax
- Missing validator keys
- Port conflicts
- Insufficient resources

---

### Q: Service starts but crashes

**Symptoms**: Service starts then stops, high restart count

**Solutions**:
```bash
# Check crash logs
pct exec <vmid> -- journalctl -u besu-validator --since "10 minutes ago"

# Check for out of memory
pct exec <vmid> -- dmesg | grep -i "out of memory"

# Check system resources
pct exec <vmid> -- free -h
pct exec <vmid> -- df -h

# Check JVM heap settings
pct exec <vmid> -- cat /etc/systemd/system/besu-validator.service | grep BESU_OPTS
```

---

### Q: Service shows as active but not responding

**Symptoms**: Service status shows "active" but RPC/P2P not responding

**Solutions**:
```bash
# Check if process is actually running
pct exec <vmid> -- ps aux | grep besu

# Check if ports are listening
pct exec <vmid> -- netstat -tuln | grep -E "30303|8545|9545"

# Check firewall rules
pct exec <vmid> -- iptables -L -n

# Test connectivity
pct exec <vmid> -- curl -s http://localhost:8545
```

---

## Network Issues

### Q: Nodes cannot connect to peers

**Symptoms**: Low or zero peer count, "No peers" in logs

**Solutions**:
```bash
# Check static-nodes.json
pct exec <vmid> -- cat /etc/besu/static-nodes.json

# Check permissions-nodes.toml
pct exec <vmid> -- cat /etc/besu/permissions-nodes.toml

# Verify enode URLs are correct
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode

# Check P2P port is open
pct exec <vmid> -- netstat -tuln | grep 30303

# Test connectivity to peer
pct exec <vmid> -- ping -c 3 <peer-ip>
```

**Common Causes**:
- Incorrect enode URLs in static-nodes.json
- Firewall blocking P2P port (30303)
- Nodes not in permissions-nodes.toml
- Network connectivity issues

---

### Q: Invalid enode URL errors

**Symptoms**: "Invalid enode URL syntax" or "Invalid node ID" in logs

**Solutions**:
```bash
# Check node ID length (must be 128 hex chars)
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode | \
    sed 's|^enode://||' | cut -d'@' -f1 | wc -c

# Should output 129 (128 chars + newline)

# Fix node IDs using allowlist scripts
./scripts/besu-collect-all-enodes.sh
./scripts/besu-generate-allowlist.sh
./scripts/besu-deploy-allowlist.sh
```
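
Before regenerating allowlists, it can help to check the shape of individual enode URLs. The sketch below assumes the form `enode://<128 hex chars>@<host>:<port>` with an optional `?discport=` suffix, and handles only IPv4 addresses or hostnames. The helper name `validate_enode` is hypothetical.

```bash
# Hypothetical helper: checks the overall shape of an enode URL.
# Assumes IPv4 or hostname; bracketed IPv6 hosts are not handled.
validate_enode() {
    local url="$1"
    local re='^enode://[0-9a-fA-F]{128}@[^:@/]+:[0-9]{1,5}(\?discport=[0-9]{1,5})?$'
    if [[ "$url" =~ $re ]]; then
        echo "OK"
    else
        echo "INVALID"
        return 1
    fi
}

# Usage:
#   validate_enode "enode://<node-id>@<peer-ip>:30303"
```

This checks only the format. It does not confirm that the node ID matches the key a peer is actually using.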

---

### Q: RPC endpoint not accessible

**Symptoms**: Cannot connect to RPC on port 8545

**Solutions**:
```bash
# Check if RPC is enabled (validators typically don't have RPC)
pct exec <vmid> -- grep -i "rpc-http-enabled" /etc/besu/config-*.toml

# Check if RPC port is listening
pct exec <vmid> -- netstat -tuln | grep 8545

# Check firewall
pct exec <vmid> -- iptables -L -n | grep 8545

# Test from container
pct exec <vmid> -- curl -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    http://localhost:8545

# Check host allowlist in config
pct exec <vmid> -- grep -i "host-allowlist\|rpc-http-host" /etc/besu/config-*.toml
```
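
`eth_blockNumber` returns the block height as a hex quantity (for example `"result":"0x1a"`), which is awkward to compare by eye. The sketch below extracts and converts it. Both helper names are hypothetical, and the `sed` extraction assumes a compact single-line JSON response.

```bash
# Hypothetical helper: pull the "result" hex string out of a JSON-RPC response on stdin.
# Assumes single-line JSON; use a real JSON parser for anything more complex.
rpc_result_hex() {
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Hypothetical helper: convert a 0x-prefixed hex quantity to decimal.
hex_to_dec() {
    local hex="$1"
    [[ "$hex" =~ ^0x[0-9a-fA-F]+$ ]] || { echo "not a hex quantity: $hex" >&2; return 1; }
    echo $(( 16#${hex#0x} ))
}

# Usage (on the Proxmox host):
#   hex_to_dec "$(pct exec <vmid> -- curl -s -X POST -H 'Content-Type: application/json' \
#       -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
#       http://localhost:8545 | rpc_result_hex)"
```

Running the usage line twice a few seconds apart shows whether the block height is advancing.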

---

## Consensus Issues

### Q: No blocks being produced

**Symptoms**: Block height not increasing, "No blocks" in logs

**Solutions**:
```bash
# Check validator service is running
pct exec <vmid> -- systemctl status besu-validator

# Check validator keys
pct exec <vmid> -- ls -la /keys/validators/

# Check consensus logs
pct exec <vmid> -- journalctl -u besu-validator | grep -i "consensus\|qbft\|proposing"

# Verify validators are in genesis (if static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | grep -A 20 "qbft"

# Check peer connectivity
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
    http://localhost:8545
```

**Common Causes**:
- Validator keys missing or incorrect
- Not enough validators online
- Network connectivity issues
- Consensus configuration errors
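
To judge whether "not enough validators online" applies, the arithmetic below may help. It assumes the usual BFT sizing rule that QBFT is generally described as following: a quorum of ceil(2n/3) validators, tolerating floor((n-1)/3) failed validators. The function names are hypothetical; confirm the rule against the Besu documentation for your version.

```bash
# Hypothetical helpers: BFT sizing arithmetic for a validator set of size n.
# Assumption: quorum = ceil(2n/3), tolerated failures = floor((n-1)/3).
qbft_quorum() {
    local n="$1"
    echo $(( (2 * n + 2) / 3 ))   # integer form of ceil(2n/3)
}

qbft_max_faulty() {
    local n="$1"
    echo $(( (n - 1) / 3 ))       # integer division gives floor((n-1)/3)
}

for n in 4 5 7; do
    echo "validators=$n quorum=$(qbft_quorum "$n") tolerated_failures=$(qbft_max_faulty "$n")"
done
```

Under that assumption, a five-validator set needs four validators online, so losing two at once is enough to stop block production.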

---

### Q: Validator not participating in consensus

**Symptoms**: Validator running but not producing blocks

**Solutions**:
```bash
# Verify validator address
pct exec <vmid> -- cat /keys/validators/validator-*/address.txt

# Check if address is in validator contract (for dynamic validators)
# Or check genesis.json (for static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | python3 -m json.tool | grep -A 10 "qbft"

# Verify validator keys are loaded
pct exec <vmid> -- journalctl -u besu-validator | grep -i "validator.*key"

# Check for permission errors
pct exec <vmid> -- journalctl -u besu-validator | grep -i "permission\|denied"
```

---

## Configuration Issues

### Q: Configuration file not found

**Symptoms**: "File not found" errors, service won't start

**Solutions**:
```bash
# List all config files
pct exec <vmid> -- ls -la /etc/besu/

# Verify required files exist
pct exec <vmid> -- test -f /etc/besu/genesis.json && echo "genesis.json OK" || echo "genesis.json MISSING"
pct exec <vmid> -- test -f /etc/besu/config-validator.toml && echo "config OK" || echo "config MISSING"

# Copy missing files
# (Use copy-besu-config.sh script)
./scripts/copy-besu-config.sh /path/to/smom-dbis-138
```

---

### Q: Invalid configuration syntax

**Symptoms**: "Invalid option" or syntax errors in logs

**Solutions**:
```bash
# Validate TOML syntax (tomllib requires Python 3.11+)
pct exec <vmid> -- python3 -c "import tomllib; tomllib.load(open('/etc/besu/config-validator.toml', 'rb'))" 2>&1

# Validate JSON syntax
pct exec <vmid> -- python3 -m json.tool /etc/besu/genesis.json > /dev/null

# Check for deprecated options
pct exec <vmid> -- journalctl -u besu-validator | grep -i "deprecated\|unknown option"

# Review Besu documentation for current options
```

---

### Q: Path errors in configuration

**Symptoms**: "File not found" errors with paths like "/config/genesis.json"

**Solutions**:
```bash
# Check configuration file paths
pct exec <vmid> -- grep -E "genesis-file|data-path" /etc/besu/config-validator.toml

# Correct paths should be:
# genesis-file="/etc/besu/genesis.json"
# data-path="/data/besu"

# Fix paths if needed
pct exec <vmid> -- sed -i 's|/config/|/etc/besu/|g' /etc/besu/config-validator.toml
```

---

## Performance Issues

### Q: High CPU usage

**Symptoms**: Container CPU usage > 80% consistently

**Solutions**:
```bash
# Check CPU usage
pct exec <vmid> -- top -bn1 | head -20

# Check JVM GC activity
pct exec <vmid> -- journalctl -u besu-validator | grep -i "gc\|pause"

# Adjust JVM settings if needed
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS and JAVA_OPTS

# Consider allocating more CPU cores
pct set <vmid> --cores 4
```

---

### Q: High memory usage

**Symptoms**: Container running out of memory, OOM kills

**Solutions**:
```bash
# Check memory usage
pct exec <vmid> -- free -h

# Check JVM heap settings
pct exec <vmid> -- ps aux | grep besu | grep -oP 'Xm[xs]\K[0-9]+[gm]'

# Reduce heap size if too large
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS="-Xmx4g" to appropriate size

# Or increase container memory
pct set <vmid> --memory 8192
```
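
The JVM needs memory beyond the heap (metaspace, thread stacks, native buffers), so it is worth checking what share of the container's memory the heap takes. A rough sketch with hypothetical helper names, using the `-Xmx4g` and `--memory 8192` values from the example above:

```bash
# Hypothetical helper: convert a JVM heap size such as 4g or 512m to MiB.
heap_to_mib() {
    local value="${1,,}"          # lowercase, so 4G and 4g both work
    case "$value" in
        *g) echo $(( ${value%g} * 1024 )) ;;
        *m) echo "${value%m}" ;;
        *)  echo "unsupported heap size: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: heap as a whole-number percentage of container memory (both in MiB).
heap_percent() {
    local heap_mib="$1" container_mib="$2"
    echo $(( heap_mib * 100 / container_mib ))
}

echo "heap share: $(heap_percent "$(heap_to_mib 4g)" 8192)%"
```

The right share depends on the workload. The point of the calculation is only to make sure the heap leaves headroom for non-heap memory rather than filling the container.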

---

### Q: Slow sync or block processing

**Symptoms**: Blocks processing slowly, falling behind

**Solutions**:
```bash
# Check database size and health
pct exec <vmid> -- du -sh /data/besu/database/

# Check disk I/O
pct exec <vmid> -- iostat -x 1 5

# Consider using SSD storage
# Check network latency
pct exec <vmid> -- ping -c 10 <peer-ip>

# Verify sufficient peers
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
    http://localhost:8545 | python3 -c "import sys, json; print(len(json.load(sys.stdin).get('result', [])))"
```

---

## General Troubleshooting Commands

```bash
# View all container statuses
for vmid in 1000 1001 1002 1003 1004 1500 1501 1502 1503 2500 2501 2502; do
    echo "=== Container $vmid ==="
    pct status $vmid
done

# Check all service statuses
for vmid in 1000 1001 1002 1003 1004; do
    echo "=== Service status for container $vmid ==="
    pct exec $vmid -- systemctl status besu-validator --no-pager -l | head -10
done

# View recent logs from all nodes
for vmid in 1000 1001 1002 1003 1004; do
    echo "=== Logs for container $vmid ==="
    pct exec $vmid -- journalctl -u besu-validator -n 20 --no-pager
done

# Check network connectivity between nodes
pct exec 1000 -- ping -c 3 192.168.11.14  # validator to validator

# Verify RPC endpoint (RPC nodes only)
pct exec 2500 -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    http://localhost:8545 | python3 -m json.tool
```

---

## Getting Help

If issues persist:

1. **Collect Information**:
   - Service logs: `pct exec <vmid> -- journalctl -u besu-validator -n 100`
   - Container status: `pct status <vmid>`
   - Configuration: `pct exec <vmid> -- cat /etc/besu/config-validator.toml`
   - Network: `pct exec <vmid> -- ip addr show`

2. **Check Documentation**:
   - [Besu Nodes File Reference](../06-besu/BESU_NODES_FILE_REFERENCE.md)
   - [Deployment Guide](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)
   - [Besu Documentation](https://besu.hyperledger.org/)

3. **Validate Configuration**:
   - Run prerequisites check: `./scripts/validation/check-prerequisites.sh`
   - Validate validators: `./scripts/validation/validate-validator-set.sh`

4. **Review Logs**:
   - Check deployment logs: `logs/deploy-validated-set-*.log`
   - Check service logs in containers
   - Check Proxmox host logs

---

## Additional Common Questions

### Q: How do I add a new VMID?

**Answer:**
1. Check available VMID ranges in [VMID_ALLOCATION_FINAL.md](../02-architecture/VMID_ALLOCATION_FINAL.md)
2. Select an appropriate VMID from the designated range for your service
3. Verify the VMID is not already in use: `pct list | grep <vmid>` or `qm list | grep <vmid>`
4. Document the assignment in VMID_ALLOCATION_FINAL.md
5. Use the VMID when creating containers/VMs

**Example:**
```bash
# Check if VMID 2503 is available
pct list | grep 2503
qm list | grep 2503

# If available, create container with VMID 2503
pct create 2503 ...
```
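
A plain `grep <vmid>` also matches substrings (for example, `2503` matches `25030`, or a number that appears in a container name). The sketch below does an exact match on the VMID column instead. The helper name `vmid_in_use` is hypothetical.

```bash
# Hypothetical helper: reads `pct list` or `qm list` output on stdin and
# succeeds only if the first column of some row equals the given VMID.
vmid_in_use() {
    local vmid="$1"
    awk -v id="$vmid" 'NR > 1 && $1 == id { found = 1 } END { exit !found }'
}

# Usage (on the Proxmox host):
#   if pct list | vmid_in_use 2503 || qm list | vmid_in_use 2503; then
#       echo "VMID 2503 is already in use"
#   else
#       echo "VMID 2503 is available"
#   fi
```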

**Related Documentation:**
- [VMID Allocation Registry](../02-architecture/VMID_ALLOCATION_FINAL.md) ⭐⭐⭐
- [Quick Reference Cards](../12-quick-reference/QUICK_REFERENCE_CARDS.md) (VMID and network) ⭐⭐⭐

---

### Q: What's the difference between public and private RPC?

**Answer:**

| Feature | Public RPC | Private RPC |
|---------|-----------|-------------|
| **Discovery** | Enabled | Disabled |
| **Permissioning** | Disabled | Enabled |
| **Access** | Public (CORS: *) | Restricted (internal only) |
| **APIs** | ETH, NET, WEB3 (read-only) | ETH, NET, WEB3, ADMIN, DEBUG (full) |
| **Use Case** | dApps, external users | Internal services, admin |
| **ChainID** | 0x8a (138) or 0x1 (wallet compatibility) | 0x8a (138) |
| **Domain** | rpc-http-pub.d-bis.org | rpc-http-prv.d-bis.org |

**Public RPC:**
- Accessible from the internet
- Used by dApps and external tools
- Read-only APIs for security
- May report chainID 0x1 for MetaMask compatibility

**Private RPC:**
- Internal network only
- Used by internal services and administration
- Full API access including ADMIN and DEBUG
- Strict permissioning and access control

**Related Documentation:**
- [RPC Node Types Architecture](../05-network/RPC_NODE_TYPES_ARCHITECTURE.md) ⭐⭐
- [RPC Template Types](../05-network/RPC_TEMPLATE_TYPES.md) ⭐

---

### Q: How do I troubleshoot Cloudflare tunnel issues?

**Answer:**

**Step 1: Check Tunnel Status**
```bash
# Check cloudflared container status
pct status 102

# Check tunnel logs (assumes cloudflared runs as a systemd unit named cloudflared)
pct exec 102 -- journalctl -u cloudflared -n 50 --no-pager

# Verify tunnel is running
pct exec 102 -- ps aux | grep cloudflared
```

**Step 2: Verify Configuration**
```bash
# Check tunnel configuration
pct exec 102 -- cat /etc/cloudflared/config.yaml

# Verify credentials file exists
pct exec 102 -- ls -la /etc/cloudflared/*.json
```

**Step 3: Test Connectivity**
```bash
# Test from internal network
curl -I http://192.168.11.21:80

# Test from external (through Cloudflare)
curl -I https://explorer.d-bis.org
```

**Step 4: Check Cloudflare Dashboard**
- Verify tunnel is healthy in Cloudflare Zero Trust dashboard
- Check ingress rules are configured correctly
- Verify DNS records point to tunnel

**Common Issues:**
- Tunnel not running → Restart: `pct reboot 102`
- Configuration error → Check YAML syntax
- Credentials invalid → Regenerate tunnel token
- DNS not resolving → Check Cloudflare DNS settings

**Related Documentation:**
- [Cloudflare Tunnel Routing Architecture](../05-network/CLOUDFLARE_TUNNEL_ROUTING_ARCHITECTURE.md) ⭐⭐⭐
- [Cloudflare Routing Master Reference](../05-network/CLOUDFLARE_ROUTING_MASTER.md) ⭐⭐⭐
- [Troubleshooting Quick Reference](../12-quick-reference/TROUBLESHOOTING_QUICK_REFERENCE.md) ⭐⭐⭐

---

### Q: What's the recommended storage configuration?

**Answer:**

**For R630 Compute Nodes:**
- **Boot drives (2×600GB):** ZFS mirror (recommended) or hardware RAID1
- **Data SSDs (6×250GB):** ZFS pool with one of:
  - Striped mirrors (if pairs available)
  - RAIDZ1 (single parity, 5 drives usable)
  - RAIDZ2 (double parity, 4 drives usable)
- **High-write workloads:** Dedicated dataset with quotas
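
The trade-off between the three data-pool layouts can be put in numbers. This is a rough sketch that ignores ZFS metadata, padding, and reserved space; the function name `zfs_usable_gb` is hypothetical.

```bash
# Hypothetical helper: approximate usable capacity in GB for a pool of equal-size drives.
# Ignores ZFS metadata and padding overhead, so real figures will be somewhat lower.
zfs_usable_gb() {
    local layout="$1" drives="$2" size_gb="$3"
    case "$layout" in
        mirror) echo $(( drives / 2 * size_gb )) ;;     # striped two-way mirrors
        raidz1) echo $(( (drives - 1) * size_gb )) ;;   # one drive of parity
        raidz2) echo $(( (drives - 2) * size_gb )) ;;   # two drives of parity
        *)      echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

for layout in mirror raidz1 raidz2; do
    echo "$layout: $(zfs_usable_gb "$layout" 6 250) GB"
done
```

For the 6×250GB data SSDs this gives roughly 750 GB for striped mirrors, 1250 GB for RAIDZ1, and 1000 GB for RAIDZ2.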

**For ML110 Management Node:**
- Standard Proxmox storage configuration
- Sufficient space for templates and backups

**Storage Best Practices:**
- Use ZFS for data integrity and snapshots
- Enable compression for space efficiency
- Set quotas for containers to prevent disk exhaustion
- Regular backups to external storage

**Related Documentation:**
- [Network Architecture - Storage Orchestration](../02-architecture/NETWORK_ARCHITECTURE.md#53-storage-orchestration-r630) ⭐⭐⭐
- [Backup and Restore](../03-deployment/BACKUP_AND_RESTORE.md) ⭐⭐

---

### Q: How do I migrate from flat LAN to VLANs?

**Answer:**

**Phase 1: Preparation**
1. Review VLAN plan in [NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)
2. Document current IP assignments
3. Plan IP address migration for each service
4. Create rollback plan

**Phase 2: Network Configuration**
1. Configure ES216G switches with VLAN trunks
2. Enable VLAN-aware bridge on Proxmox hosts
3. Create VLAN interfaces on ER605 router
4. Test VLAN connectivity

**Phase 3: Service Migration**
1. Migrate services one VLAN at a time
2. Start with non-critical services
3. Update container/VM network configuration
4. Verify connectivity after each migration

**Phase 4: Validation**
1. Test all services on new VLANs
2. Verify routing between VLANs
3. Test egress NAT pools
4. Document final configuration

**Migration Order (Recommended):**
1. Management services (VLAN 11) - Already active
2. Monitoring/observability (VLAN 120, 121)
3. Besu network (VLANs 110, 111, 112)
4. CCIP network (VLANs 130, 132, 133, 134)
5. Service layer (VLAN 160)
6. Sovereign tenants (VLANs 200-203)

**Related Documentation:**
- [Network Architecture - VLAN Orchestration](../02-architecture/NETWORK_ARCHITECTURE.md#3-layer-2--vlan-orchestration-plan) ⭐⭐⭐
- [Orchestration Deployment Guide - VLAN Enablement](../02-architecture/ORCHESTRATION_DEPLOYMENT_GUIDE.md#phase-1--vlan-enablement) ⭐⭐⭐

---

## Additional Common Questions (Expanded)

### Q: How do I find which VMID uses a given IP?

**Answer:** See [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) for IP ranges by service type and VMID. On the Proxmox host, use `pct list` or `qm list` to list containers/VMs, then `pct config <vmid>` or `qm config <vmid>` to inspect a guest's network settings (a container's static IP appears in its `net0` line).
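
For containers with a static address, the lookup can be scripted by reading the `net0` line of each container's config. This is a sketch: both function names are hypothetical, it covers LXC containers only (VM configs do not carry the guest IP in `net0`), and it skips containers that use DHCP.

```bash
# Hypothetical helper: extract the IPv4 address from the net0 line of
# `pct config <vmid>` output on stdin. Prints nothing for ip=dhcp.
net0_ip() {
    sed -n 's/^net0:.*[,[:space:]]ip=\([0-9.][0-9.]*\).*/\1/p'
}

# Hypothetical helper: print every container VMID whose net0 address equals the target IP.
# Must run on the Proxmox host, where pct is available.
find_vmid_by_ip() {
    local target="$1" vmid
    for vmid in $(pct list | awk 'NR > 1 { print $1 }'); do
        if [ "$(pct config "$vmid" | net0_ip)" = "$target" ]; then
            echo "$vmid"
        fi
    done
}

# Usage:
#   find_vmid_by_ip <ip-address>
```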

### Q: What's the difference between public and private RPC?

**Answer:** **Public RPC** (e.g. rpc-http-pub.d-bis.org) is exposed for external clients; may have rate limits and JWT. **Private RPC** (e.g. rpc-http-prv.d-bis.org) is for internal or trusted clients. See [05-network/CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md) for domain → backend mapping.

### Q: Cloudflare tunnel not connecting – where do I start?

**Answer:** 1) Check cloudflared service on the tunnel host (VMID 102 or NPMplus). 2) Verify credentials and tunnel ID. 3) Check [04-configuration/cloudflare/CLOUDFLARE_TUNNEL_CONFIGURATION_GUIDE.md](../04-configuration/cloudflare/CLOUDFLARE_TUNNEL_CONFIGURATION_GUIDE.md) and [05-network/CLOUDFLARE_ROUTING_MASTER.md](../05-network/CLOUDFLARE_ROUTING_MASTER.md). 4) Confirm NPMplus (192.168.11.167) is reachable from UDM Pro port forward.

### Q: Recommended storage configuration for RPC nodes?

**Answer:** Use SSD for Besu data directory; avoid NFS for Besu unless tested. See [02-architecture/NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md) and deployment guides for node layout. Run `scripts/audit-proxmox-rpc-storage.sh` to check restrictions.

---

## Related Documentation

### Operational Procedures
- **[OPERATIONAL_RUNBOOKS.md](../03-deployment/OPERATIONAL_RUNBOOKS.md)** - Complete operational runbooks
- **[QBFT_TROUBLESHOOTING.md](QBFT_TROUBLESHOOTING.md)** - QBFT consensus troubleshooting
- **[BESU_ALLOWLIST_QUICK_START.md](../06-besu/BESU_ALLOWLIST_QUICK_START.md)** - Allowlist troubleshooting

### Deployment & Configuration
- **[DEPLOYMENT_STATUS_CONSOLIDATED.md](../03-deployment/DEPLOYMENT_STATUS_CONSOLIDATED.md)** - Current deployment status
- **[NETWORK_ARCHITECTURE.md](../02-architecture/NETWORK_ARCHITECTURE.md)** - Network architecture reference
- **[VALIDATED_SET_DEPLOYMENT_GUIDE.md](../03-deployment/VALIDATED_SET_DEPLOYMENT_GUIDE.md)** - Deployment guide

### Monitoring
- **[MONITORING_SUMMARY.md](../08-monitoring/MONITORING_SUMMARY.md)** - Monitoring setup
- **[BLOCK_PRODUCTION_MONITORING.md](../08-monitoring/BLOCK_PRODUCTION_MONITORING.md)** - Block production monitoring

### Reference
- **[MASTER_INDEX.md](../MASTER_INDEX.md)** - Complete documentation index
