Troubleshooting FAQ

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation


Common issues and solutions for Besu validated set deployment.

Table of Contents

Estimated Reading Time: 30 minutes
Progress: Check off sections as you read

  1. Container Issues - Container troubleshooting
  2. Service Issues - Service troubleshooting
  3. Network Issues - Network troubleshooting
  4. Consensus Issues - Consensus troubleshooting
  5. Configuration Issues - Configuration troubleshooting
  6. Performance Issues - Performance troubleshooting
  7. Additional Common Questions - More FAQs

Troubleshooting Flow (Decision Tree)

  1. Is the service/container down? → Check logs (journalctl -u pve-container@<vmid>, systemctl status), then Container Issues or Service Issues.
  2. Network/connectivity issue? → Check ping, curl, DNS, firewall; see Network Issues.
  3. Consensus / QBFT? → See QBFT_TROUBLESHOOTING.md and Consensus Issues.
  4. Configuration or performance? → See Configuration Issues, Performance Issues, or Additional Common Questions.
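The decision tree above can be sketched as a small shell function. This is only an illustration: `triage_section` is a hypothetical helper, and the tests it uses (zero peers means a network problem, a stalled block height means a consensus problem) are simplifying assumptions, not rules stated by this FAQ.

```shell
# Hypothetical helper: map observed symptoms to the FAQ section to read first.
# Nothing is probed here; the operator supplies all four values.
#   $1  container state, the word after "status:" in `pct status <vmid>`
#   $2  service state as printed by `systemctl is-active besu-validator`
#   $3  peer count (integer)
#   $4  "yes" if the block height is increasing, anything else otherwise
triage_section() {
    if [ "$1" != "running" ]; then
        echo "Container Issues"
    elif [ "$2" != "active" ]; then
        echo "Service Issues"
    elif [ "$3" -eq 0 ]; then
        # Assumption: no peers at all points at connectivity first
        echo "Network Issues"
    elif [ "$4" != "yes" ]; then
        # Assumption: peers present but no new blocks points at consensus
        echo "Consensus Issues"
    else
        echo "Configuration Issues, Performance Issues, or Additional Common Questions"
    fi
}
```

For example, `triage_section running failed 4 yes` prints `Service Issues`.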

Container Issues

Q: Container won't start

Symptoms: pct status <vmid> shows "stopped" or errors during startup

Solutions:

# Check container status
pct status <vmid>

# View container console
pct console <vmid>

# Check logs
journalctl -u pve-container@<vmid>

# Check container configuration
pct config <vmid>

# Try starting manually
pct start <vmid>

Common Causes:

  • Insufficient resources (RAM, disk)
  • Network configuration errors
  • Invalid container configuration
  • OS template issues

Advanced Diagnostics:

# Check container resource limits (cores, memory, swap, root disk)
pct config <vmid> | grep -E "^(cores|memory|swap|rootfs)"

# Check Proxmox host resources
free -h
df -h

# Check container logs in detail
journalctl -u pve-container@<vmid> -n 100 --no-pager

# Verify container template (replace <storage> with the template storage, e.g. local)
pveam list <storage> | grep <template-name>

Q: Container runs out of disk space

Symptoms: Services fail, "No space left on device" errors

Solutions:

# Check disk usage
pct exec <vmid> -- df -h

# Check Besu database size
pct exec <vmid> -- du -sh /data/besu/database/

# Clean up old logs
pct exec <vmid> -- journalctl --vacuum-time=7d

# Increase disk size (if using LVM)
pct resize <vmid> rootfs +10G
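To spot nearly full filesystems before they cause failures, the `df` output can be filtered against a threshold. The sketch below is a minimal example; `disk_over_threshold` is a hypothetical helper name, and the 90% default is an arbitrary assumption to adjust for your environment.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above a threshold (default 90 percent).
# Assumes device names and mount points contain no spaces.
disk_over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "94%" -> "94"
            if (use + 0 >= limit + 0) print $6 " " $5
        }'
}
```

It could be fed from the host with `pct exec <vmid> -- df -P | disk_over_threshold 85`.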

Q: Container network issues

Symptoms: Cannot ping, cannot connect to services

Solutions:

# Check network configuration
pct config <vmid> | grep net0

# Check if container has IP
pct exec <vmid> -- ip addr show

# Check routing
pct exec <vmid> -- ip route

# Restart container networking
pct stop <vmid>
pct start <vmid>

Service Issues

Q: Besu service won't start

Symptoms: systemctl status besu-validator shows failed

Solutions:

# Check service status
pct exec <vmid> -- systemctl status besu-validator

# View service logs
pct exec <vmid> -- journalctl -u besu-validator -n 100

# Check for configuration errors
pct exec <vmid> -- besu --config-file=/etc/besu/config-validator.toml --help

# Review the configuration file contents
pct exec <vmid> -- cat /etc/besu/config-validator.toml

Common Causes:

  • Missing configuration files
  • Invalid configuration syntax
  • Missing validator keys
  • Port conflicts
  • Insufficient resources

Q: Service starts but crashes

Symptoms: Service starts then stops, high restart count

Solutions:

# Check crash logs
pct exec <vmid> -- journalctl -u besu-validator --since "10 minutes ago"

# Check for out of memory
pct exec <vmid> -- dmesg | grep -i "out of memory"

# Check system resources
pct exec <vmid> -- free -h
pct exec <vmid> -- df -h

# Check JVM heap settings
pct exec <vmid> -- cat /etc/systemd/system/besu-validator.service | grep BESU_OPTS

Q: Service shows as active but not responding

Symptoms: Service status shows "active" but RPC/P2P not responding

Solutions:

# Check if process is actually running
pct exec <vmid> -- ps aux | grep besu

# Check if ports are listening
pct exec <vmid> -- netstat -tuln | grep -E "30303|8545|9545"

# Check firewall rules
pct exec <vmid> -- iptables -L -n

# Test connectivity
pct exec <vmid> -- curl -s http://localhost:8545
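A plain `grep 8545` also matches ports such as 18545, so the listening-port check can give a false positive. The following sketch matches each port exactly; `missing_ports` is a hypothetical helper, and the port numbers are the ones used in the commands above.

```shell
# Hypothetical helper: read `netstat -tuln` (or `ss -tuln`) output on stdin
# and print every expected port that has no listener.
missing_ports() {
    local listing port
    listing="$(cat)"
    for port in "$@"; do
        # ":<port>" followed by whitespace, so 8545 does not match 18545 or 85450
        if ! printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]"; then
            echo "$port"
        fi
    done
}
```

A possible invocation is `pct exec <vmid> -- netstat -tuln | missing_ports 30303 8545 9545`; empty output means all three ports have a listener.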

Network Issues

Q: Nodes cannot connect to peers

Symptoms: Low or zero peer count, "No peers" in logs

Solutions:

# Check static-nodes.json
pct exec <vmid> -- cat /etc/besu/static-nodes.json

# Check permissions-nodes.toml
pct exec <vmid> -- cat /etc/besu/permissions-nodes.toml

# Verify enode URLs are correct
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode

# Check P2P port is open
pct exec <vmid> -- netstat -tuln | grep 30303

# Test connectivity to peer
pct exec <vmid> -- ping -c 3 <peer-ip>

Common Causes:

  • Incorrect enode URLs in static-nodes.json
  • Firewall blocking P2P port (30303)
  • Nodes not in permissions-nodes.toml
  • Network connectivity issues

Q: Invalid enode URL errors

Symptoms: "Invalid enode URL syntax" or "Invalid node ID" in logs

Solutions:

# Check node ID length (must be 128 hex chars)
pct exec <vmid> -- besu public-key export --node-private-key-file=/data/besu/nodekey --format=enode | \
    sed 's|^enode://||' | cut -d'@' -f1 | wc -c

# Should output 129 (128 chars + newline)

# Fix node IDs using allowlist scripts
./scripts/besu-collect-all-enodes.sh
./scripts/besu-generate-allowlist.sh
./scripts/besu-deploy-allowlist.sh
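The length check above counts characters but does not confirm they are all hexadecimal. A stricter check might look like the sketch below; `valid_enode_id` is a hypothetical helper, not one of the allowlist scripts.

```shell
# Hypothetical helper: succeed only if the argument looks like
# enode://<node-id>@<host:port> and the node ID is exactly 128 hex characters.
valid_enode_id() {
    local id
    case "$1" in
        enode://*@*) ;;               # required overall shape
        *) return 1 ;;
    esac
    id="${1#enode://}"                # drop the scheme
    id="${id%%@*}"                    # drop "@host:port"
    [ "${#id}" -eq 128 ] || return 1  # length check
    case "$id" in
        *[!0-9a-fA-F]*) return 1 ;;   # any non-hex character fails
    esac
    return 0
}
```

Each entry from static-nodes.json could be passed through it, for example `valid_enode_id "$url" || echo "bad enode: $url"`.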

Q: RPC endpoint not accessible

Symptoms: Cannot connect to RPC on port 8545

Solutions:

# Check if RPC is enabled (validators typically don't have RPC)
pct exec <vmid> -- grep -i "rpc-http-enabled" /etc/besu/config-*.toml

# Check if RPC port is listening
pct exec <vmid> -- netstat -tuln | grep 8545

# Check firewall
pct exec <vmid> -- iptables -L -n | grep 8545

# Test from container
pct exec <vmid> -- curl -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    http://localhost:8545

# Check host allowlist in config
pct exec <vmid> -- grep -i "host-allowlist\|rpc-http-host" /etc/besu/config-*.toml

Consensus Issues

Q: No blocks being produced

Symptoms: Block height not increasing, "No blocks" in logs

Solutions:

# Check validator service is running
pct exec <vmid> -- systemctl status besu-validator

# Check validator keys
pct exec <vmid> -- ls -la /keys/validators/

# Check consensus logs
pct exec <vmid> -- journalctl -u besu-validator | grep -i "consensus\|qbft\|proposing"

# Verify validators are in genesis (if static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | grep -A 20 "qbft"

# Check peer connectivity
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
    http://localhost:8545

Common Causes:

  • Validator keys missing or incorrect
  • Not enough validators online
  • Network connectivity issues
  • Consensus configuration errors
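To confirm that block production has actually stalled, two `eth_blockNumber` samples can be compared. The result is a hex quantity such as `0x8a`, so it must be converted first. The helpers below are a hypothetical sketch and assume RPC is enabled on the node being queried.

```shell
# Convert a hex quantity such as "0x8a" (as returned by eth_blockNumber) to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

# Succeed if the second sample is strictly higher than the first.
height_advanced() {
    [ "$(hex_to_dec "$2")" -gt "$(hex_to_dec "$1")" ]
}
```

Take the two samples further apart than the configured block period, then run `height_advanced "$first" "$second" || echo "chain is not advancing"`.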

Q: Validator not participating in consensus

Symptoms: Validator running but not producing blocks

Solutions:

# Verify validator address
pct exec <vmid> -- cat /keys/validators/validator-*/address.txt

# Check if address is in validator contract (for dynamic validators)
# Or check genesis.json (for static validators)
pct exec <vmid> -- cat /etc/besu/genesis.json | python3 -m json.tool | grep -A 10 "qbft"

# Verify validator keys are loaded
pct exec <vmid> -- journalctl -u besu-validator | grep -i "validator.*key"

# Check for permission errors
pct exec <vmid> -- journalctl -u besu-validator | grep -i "permission\|denied"

Configuration Issues

Q: Configuration file not found

Symptoms: "File not found" errors, service won't start

Solutions:

# List all config files
pct exec <vmid> -- ls -la /etc/besu/

# Verify required files exist
pct exec <vmid> -- test -f /etc/besu/genesis.json && echo "genesis.json OK" || echo "genesis.json MISSING"
pct exec <vmid> -- test -f /etc/besu/config-validator.toml && echo "config OK" || echo "config MISSING"

# Copy missing files
# (Use copy-besu-config.sh script)
./scripts/copy-besu-config.sh /path/to/smom-dbis-138

Q: Invalid configuration syntax

Symptoms: "Invalid option" or syntax errors in logs

Solutions:

# Validate TOML syntax (tomllib requires Python 3.11+)
pct exec <vmid> -- python3 -c "import tomllib; tomllib.load(open('/etc/besu/config-validator.toml', 'rb'))" 2>&1

# Validate JSON syntax
pct exec <vmid> -- python3 -m json.tool /etc/besu/genesis.json > /dev/null

# Check for deprecated options
pct exec <vmid> -- journalctl -u besu-validator | grep -i "deprecated\|unknown option"

# Review Besu documentation for current options

Q: Path errors in configuration

Symptoms: "File not found" errors with paths like "/config/genesis.json"

Solutions:

# Check configuration file paths
pct exec <vmid> -- grep -E "genesis-file|data-path" /etc/besu/config-validator.toml

# Correct paths should be:
# genesis-file="/etc/besu/genesis.json"
# data-path="/data/besu"

# Fix paths if needed
pct exec <vmid> -- sed -i 's|/config/|/etc/besu/|g' /etc/besu/config-validator.toml
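The in-place `sed` rewrites every occurrence of `/config/`, including any path that merely contains that substring, so previewing the change first is safer. The sketch below works on a local copy of the file; `preview_path_fix` is a hypothetical helper.

```shell
# Hypothetical helper: show what the /config/ -> /etc/besu/ rewrite would
# change in a file, without modifying the file.
preview_path_fix() {
    # diff exits 1 when the two versions differ, which is the expected case
    # here, so "|| true" keeps that from being reported as a failure.
    sed 's|/config/|/etc/besu/|g' "$1" | diff "$1" - || true
}
```

One way to use it is to copy the file out with `pct pull <vmid> /etc/besu/config-validator.toml /tmp/config-validator.toml`, run `preview_path_fix /tmp/config-validator.toml`, and apply the in-place edit only if the diff shows the intended lines.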

Performance Issues

Q: High CPU usage

Symptoms: Container CPU usage > 80% consistently

Solutions:

# Check CPU usage
pct exec <vmid> -- top -bn1 | head -20

# Check JVM GC activity
pct exec <vmid> -- journalctl -u besu-validator | grep -i "gc\|pause"

# Adjust JVM settings if needed
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS and JAVA_OPTS

# Consider allocating more CPU cores
pct set <vmid> --cores 4

Q: High memory usage

Symptoms: Container running out of memory, OOM kills

Solutions:

# Check memory usage
pct exec <vmid> -- free -h

# Check JVM heap settings
pct exec <vmid> -- ps aux | grep besu | grep -oP 'Xm[xs]\K[0-9]+[gm]'

# Reduce heap size if too large
# Edit /etc/systemd/system/besu-validator.service
# Adjust BESU_OPTS="-Xmx4g" to appropriate size

# Or increase container memory
pct set <vmid> --memory 8192
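When choosing between a smaller heap and more container memory, it helps to compare the two in the same unit. The JVM also needs memory outside the heap, so the heap should be strictly smaller than the container limit. The helpers below are a hypothetical sketch; how much headroom to leave is not specified here and remains a judgment call.

```shell
# Convert a JVM heap size such as "4g" or "512m" to MiB.
heap_to_mib() {
    local number="${1%[gGmM]}"        # strip the unit suffix
    case "$number" in
        ''|*[!0-9]*) echo "unsupported heap size: $1" >&2; return 1 ;;
    esac
    case "$1" in
        *[gG]) echo $(( number * 1024 )) ;;
        *[mM]) echo "$number" ;;
        *)     echo "unsupported heap size: $1" >&2; return 1 ;;
    esac
}

# Succeed if the heap ($1, e.g. "4g") is smaller than the container memory ($2, in MiB).
heap_fits() {
    local heap_mib
    heap_mib="$(heap_to_mib "$1")" || return 1
    [ "$heap_mib" -lt "$2" ]
}
```

For example, `heap_fits 4g 8192` succeeds, while `heap_fits 8g 8192` fails because the heap alone would consume the whole container limit.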

Q: Slow sync or block processing

Symptoms: Blocks processing slowly, falling behind

Solutions:

# Check database size and health
pct exec <vmid> -- du -sh /data/besu/database/

# Check disk I/O
pct exec <vmid> -- iostat -x 1 5

# Consider using SSD storage
# Check network latency
pct exec <vmid> -- ping -c 10 <peer-ip>

# Verify sufficient peers
pct exec <vmid> -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"admin_peers","params":[],"id":1}' \
    http://localhost:8545 | python3 -c "import sys, json; print(len(json.load(sys.stdin).get('result', [])))"

General Troubleshooting Commands

# View all container statuses
for vmid in 1000 1001 1002 1003 1004 1500 1501 1502 1503 2500 2501 2502; do
    echo "=== Container $vmid ==="
    pct status $vmid
done

# Check all service statuses
for vmid in 1000 1001 1002 1003 1004; do
    pct exec $vmid -- systemctl status besu-validator --no-pager -l | head -10
done

# View recent logs from all nodes
for vmid in 1000 1001 1002 1003 1004; do
    echo "=== Logs for container $vmid ==="
    pct exec $vmid -- journalctl -u besu-validator -n 20 --no-pager
done

# Check network connectivity between nodes
pct exec 1000 -- ping -c 3 192.168.11.14  # validator to validator

# Verify RPC endpoint (RPC nodes only)
pct exec 2500 -- curl -s -X POST -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    http://localhost:8545 | python3 -m json.tool

Getting Help

If issues persist:

  1. Collect Information:

    • Service logs: journalctl -u besu-validator -n 100
    • Container status: pct status <vmid>
    • Configuration: pct exec <vmid> -- cat /etc/besu/config-validator.toml
    • Network: pct exec <vmid> -- ip addr show
  2. Check Documentation:

  3. Validate Configuration:

    • Run prerequisites check: ./scripts/validation/check-prerequisites.sh
    • Validate validators: ./scripts/validation/validate-validator-set.sh
  4. Review Logs:

    • Check deployment logs: logs/deploy-validated-set-*.log
    • Check service logs in containers
    • Check Proxmox host logs

Additional Common Questions

Q: How do I add a new VMID?

Answer:

  1. Check available VMID ranges in VMID_ALLOCATION_FINAL.md
  2. Select an appropriate VMID from the designated range for your service
  3. Verify the VMID is not already in use: pct list | grep <vmid> or qm list | grep <vmid>
  4. Document the assignment in VMID_ALLOCATION_FINAL.md
  5. Use the VMID when creating containers/VMs

Example:

# Check if VMID 2503 is available
pct list | grep 2503
qm list | grep 2503

# If available, create container with VMID 2503
pct create 2503 ...
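Note that `grep 2503` also matches VMIDs such as 12503 or 25030. An exact match on the first column avoids that; the sketch below uses a hypothetical helper named `vmid_in_use`.

```shell
# Hypothetical helper: read `pct list` or `qm list` output on stdin and
# succeed only if the first column equals the given VMID exactly.
vmid_in_use() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit found ? 0 : 1 }'
}
```

A possible check before creating the container: `pct list | vmid_in_use 2503 && echo "2503 is taken"`, repeated with `qm list`.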

Related Documentation:


Q: What's the difference between public and private RPC?

Answer:

| Feature | Public RPC | Private RPC |
|---|---|---|
| Discovery | Enabled | Disabled |
| Permissioning | Disabled | Enabled |
| Access | Public (CORS: *) | Restricted (internal only) |
| APIs | ETH, NET, WEB3 (read-only) | ETH, NET, WEB3, ADMIN, DEBUG (full) |
| Use Case | dApps, external users | Internal services, admin |
| ChainID | 0x8a (138) or 0x1 (wallet compatibility) | 0x8a (138) |
| Domain | rpc-http-pub.d-bis.org | rpc-http-prv.d-bis.org |

Public RPC:

  • Accessible from the internet
  • Used by dApps and external tools
  • Read-only APIs for security
  • May report chainID 0x1 for MetaMask compatibility

Private RPC:

  • Internal network only
  • Used by internal services and administration
  • Full API access including ADMIN and DEBUG
  • Strict permissioning and access control
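One way to audit a public node against the API split described above is to list any enabled API outside ETH, NET, and WEB3. The sketch below is hypothetical: `non_public_apis` is an illustrative helper, and it expects a plain comma-separated list (brackets and quotes from the TOML value removed), which is an assumption about how you extract the setting.

```shell
# Hypothetical helper: given a comma-separated API list, print (comma-separated)
# every entry outside the read-only set ETH, NET, WEB3 listed for public RPC.
non_public_apis() {
    printf '%s\n' "$1" |
        tr ',' '\n' |                     # one API per line
        tr -d ' ' |                       # drop stray spaces
        tr '[:lower:]' '[:upper:]' |      # compare case-insensitively
        grep -vxE '(ETH|NET|WEB3)?' |     # drop allowed entries and empty lines
        paste -sd, -                      # join what is left with commas
}
```

For example, `non_public_apis "ETH,NET,WEB3,ADMIN,DEBUG"` prints `ADMIN,DEBUG`, which would be unexpected on a public endpoint.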

Related Documentation:


Q: How do I troubleshoot Cloudflare tunnel issues?

Answer:

Step 1: Check Tunnel Status

# Check cloudflared container status
pct status 102

# Check tunnel logs (assumes cloudflared runs as the systemd unit "cloudflared")
pct exec 102 -- journalctl -u cloudflared -n 50 --no-pager

# Verify tunnel is running
pct exec 102 -- ps aux | grep cloudflared

Step 2: Verify Configuration

# Check tunnel configuration
pct exec 102 -- cat /etc/cloudflared/config.yaml

# Verify credentials file exists
pct exec 102 -- ls -la /etc/cloudflared/*.json

Step 3: Test Connectivity

# Test from internal network
curl -I http://192.168.11.21:80

# Test from external (through Cloudflare)
curl -I https://explorer.d-bis.org

Step 4: Check Cloudflare Dashboard

  • Verify tunnel is healthy in Cloudflare Zero Trust dashboard
  • Check ingress rules are configured correctly
  • Verify DNS records point to tunnel

Common Issues:

  • Tunnel not running → Restart: pct reboot 102
  • Configuration error → Check YAML syntax
  • Credentials invalid → Regenerate tunnel token
  • DNS not resolving → Check Cloudflare DNS settings

Related Documentation:


Q: What storage configuration is recommended?

Answer:

For R630 Compute Nodes:

  • Boot drives (2×600GB): ZFS mirror (recommended) or hardware RAID1
  • Data SSDs (6×250GB): ZFS pool with one of:
    • Striped mirrors (if pairs available)
    • RAIDZ1 (single parity, 5 drives usable)
    • RAIDZ2 (double parity, 4 drives usable)
  • High-write workloads: Dedicated dataset with quotas

For ML110 Management Node:

  • Standard Proxmox storage configuration
  • Sufficient space for templates and backups

Storage Best Practices:

  • Use ZFS for data integrity and snapshots
  • Enable compression for space efficiency
  • Set quotas for containers to prevent disk exhaustion
  • Regular backups to external storage

Related Documentation:


Q: How do I migrate from flat LAN to VLANs?

Answer:

Phase 1: Preparation

  1. Review VLAN plan in NETWORK_ARCHITECTURE.md
  2. Document current IP assignments
  3. Plan IP address migration for each service
  4. Create rollback plan

Phase 2: Network Configuration

  1. Configure ES216G switches with VLAN trunks
  2. Enable VLAN-aware bridge on Proxmox hosts
  3. Create VLAN interfaces on ER605 router
  4. Test VLAN connectivity

Phase 3: Service Migration

  1. Migrate services one VLAN at a time
  2. Start with non-critical services
  3. Update container/VM network configuration
  4. Verify connectivity after each migration

Phase 4: Validation

  1. Test all services on new VLANs
  2. Verify routing between VLANs
  3. Test egress NAT pools
  4. Document final configuration

Migration Order (Recommended):

  1. Management services (VLAN 11) - Already active
  2. Monitoring/observability (VLAN 120, 121)
  3. Besu network (VLANs 110, 111, 112)
  4. CCIP network (VLANs 130, 132, 133, 134)
  5. Service layer (VLAN 160)
  6. Sovereign tenants (VLANs 200-203)

Related Documentation:


Additional Common Questions (Expanded)

Q: How do I find which VMID uses a given IP?

Answer: See NETWORK_CONFIGURATION_MASTER.md for IP ranges by service type and VMID. On the Proxmox host, use pct list or qm list to enumerate containers/VMs, then pct config <vmid> or qm config <vmid> to inspect the network settings (including any statically assigned IP).

Q: What's the difference between public and private RPC?

Answer: Public RPC (e.g. rpc-http-pub.d-bis.org) is exposed to external clients and may enforce rate limits and JWT authentication. Private RPC (e.g. rpc-http-prv.d-bis.org) is for internal or trusted clients. See 05-network/CLOUDFLARE_ROUTING_MASTER.md for the domain → backend mapping.

Q: Cloudflare tunnel not connecting: where do I start?

Answer: 1) Check the cloudflared service on the tunnel host (VMID 102 or NPMplus). 2) Verify the credentials and tunnel ID. 3) Check 04-configuration/cloudflare/CLOUDFLARE_TUNNEL_CONFIGURATION_GUIDE.md and 05-network/CLOUDFLARE_ROUTING_MASTER.md. 4) Confirm that NPMplus (192.168.11.167) is reachable through the UDM Pro port forward.

Q: What storage should Besu nodes use?

Answer: Use SSD storage for the Besu data directory; avoid NFS for Besu unless tested. See 02-architecture/NETWORK_ARCHITECTURE.md and the deployment guides for node layout. Run scripts/audit-proxmox-rpc-storage.sh to check restrictions.


Operational Procedures

Deployment & Configuration

Monitoring

Reference

