Cloudflare Tunnel Investigation Report

Date: 2026-01-05
Status: Investigation Complete
Priority: High


Investigation Summary

Investigated Cloudflare tunnel issues causing a 40-60% failure rate on the public RPC endpoint. The tunnel logs show timeout errors and connection problems.


Current Status

Cloudflared Service Status

  • Service: cloudflared.service
  • Status: Active (running)
  • Uptime: 15+ hours
  • Location: VMID 2400
  • Memory: 20.8M
  • CPU: 3min 25.004s

Current Success Rate

  • Test Results: 60% success rate (6/10 requests), i.e. a 40% failure rate in this sample
  • Pattern: Intermittent failures, not time-based
  • Error: Failed requests return "502 Bad Gateway" from Cloudflare
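
The measurement above can be repeated with a short probe script. This is a minimal sketch, assuming the endpoint speaks JSON-RPC and answers `eth_blockNumber`; the URL `https://rpc.example.com` and the helper names `probe`, `success_rate`, and `run` are hypothetical placeholders, not values from this investigation.

```python
import json
import urllib.error
import urllib.request

# Hypothetical placeholder; substitute the real public RPC URL.
RPC_URL = "https://rpc.example.com"


def probe(url, timeout=10.0):
    """Send one JSON-RPC request and return the HTTP status (0 on network error)."""
    body = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 502 returned by the Cloudflare edge
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def success_rate(statuses):
    """Fraction of probes that returned HTTP 200."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == 200) / len(statuses)


def run(url=RPC_URL, attempts=10):
    """Probe the endpoint `attempts` times and return the success rate."""
    return success_rate([probe(url) for _ in range(attempts)])
```

Calling `run()` against the real endpoint gives a figure comparable to the 6/10 result above; a larger `attempts` value would give a more stable estimate than ten requests.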

Findings

Service Status

Service Running: Cloudflared is active and running

Error Patterns Identified

Critical Errors Found:

  1. Timeout Errors:

    • timeout: no recent network activity
    • failed to accept QUIC stream: timeout: no recent network activity
    • datagram manager encountered a failure while serving
  2. Connection Issues:

    • Connection terminations and retries
    • Multiple connection indices (connIndex=2, connIndex=3)
    • Retrying connections in up to 1s
  3. Pattern:

    • Errors occur intermittently
    • Connections are being retried automatically
    • Multiple tunnel connections registered (lax01, lax05 locations)
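
To see how often each error pattern occurs, the tunnel log can be tallied by category. The sketch below matches only the message fragments quoted in this report; the category names and the `summarize` helper are illustrative assumptions, and the exact log line layout may differ between cloudflared versions.

```python
import re
from collections import Counter

# Fragments taken from the log messages quoted in this report.
# Order matters: the first matching category wins.
PATTERNS = {
    "quic_stream_timeout": re.compile(r"failed to accept QUIC stream"),
    "idle_timeout": re.compile(r"timeout: no recent network activity"),
    "datagram_failure": re.compile(r"datagram manager encountered a failure"),
    "retry": re.compile(r"Retrying connection"),
}
CONN_INDEX = re.compile(r"connIndex=(\d+)")


def summarize(lines):
    """Count error categories and the connection indices they mention."""
    categories = Counter()
    conn_indices = Counter()
    for line in lines:
        matched = next(
            (name for name, pat in PATTERNS.items() if pat.search(line)), None
        )
        if matched is None:
            continue  # not one of the tracked patterns
        categories[matched] += 1
        m = CONN_INDEX.search(line)
        if m:
            conn_indices[int(m.group(1))] += 1
    return categories, conn_indices
```

One way to feed it is to pipe `journalctl -u cloudflared.service --no-pager` into a small wrapper that passes `sys.stdin` to `summarize`. If the errors concentrate on one `connIndex`, that points at a single tunnel connection rather than at the whole tunnel.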

Configuration Analysis

Cloudflared Service Configuration:

[Service]
TimeoutStartSec=15
Type=notify
ExecStart=/usr/bin/cloudflared --no-autoupdate tunnel run --token ...
Restart=on-failure
RestartSec=5s

Nginx Proxy Timeouts:

  • proxy_connect_timeout: 300s (good)
  • proxy_send_timeout: 300s (good)
  • proxy_read_timeout: 300s (good)

Issues Identified:

  1. No explicit tunnel connection pool configuration
  2. No tunnel timeout settings visible in service file
  3. Timeout errors suggest that tunnel connections go idle or stop receiving packets
  4. Multiple connections are registered, but some of them fail

Root Cause Analysis

Primary Issues

  1. Network Activity Timeouts: Tunnel connections are closed after an idle timeout because no packets arrive from the peer ("timeout: no recent network activity")
  2. QUIC Stream Failures: cloudflared fails to accept new QUIC streams on the affected connections
  3. Connection Pool Exhaustion: Possible connection pool issues (the pool is not explicitly configured)

Contributing Factors

  1. No Keep-Alive Configuration: Tunnel may need keep-alive settings
  2. No Connection Pool Limits: Default pool size may be insufficient
  3. Network Latency: Possible latency between Cloudflare edge and origin
  4. Tunnel Token Configuration: Using token-based auth (may have limitations)

Recommendations

Immediate Actions (High Priority)

  1. Configure Tunnel Keep-Alive

    • Add --heartbeat-count and --heartbeat-interval flags
    • Ensure connections stay alive
  2. Increase Connection Pool

    • Configure multiple tunnel connections
    • Add --protocol quic explicitly
    • Consider --retries configuration
  3. Add Tunnel Metrics

    • Enable metrics endpoint
    • Monitor connection health
    • Track timeout patterns
  4. Review Cloudflare Dashboard

    • Check tunnel status in Cloudflare dashboard
    • Review tunnel metrics and errors
    • Check for rate limiting or throttling
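
For the metrics item above, a small parser for the Prometheus text format is enough to start tracking counters over time. This is a sketch under assumptions: it presumes cloudflared was started with its `--metrics` option bound to `127.0.0.1:2000` and that the endpoint is served at `/metrics`; the address and the helper names are illustrative, and no specific cloudflared metric names are assumed.

```python
import re
import urllib.request

# Assumed address; it depends on the value passed to --metrics.
METRICS_URL = "http://127.0.0.1:2000/metrics"

# name{optional labels} value [optional timestamp]
SAMPLE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$"
)


def parse_metrics(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE.match(line)
        if not m:
            continue
        try:
            value = float(m.group(3))
        except ValueError:
            continue
        totals[m.group(1)] = totals.get(m.group(1), 0.0) + value
    return totals


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Download and parse the metrics page."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8", "replace"))
```

Polling `fetch_metrics()` at a fixed interval and diffing the totals shows whether error counters rise in step with the 502 responses.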

Short-term Improvements

  1. Implement Client-Side Retry Logic (Workaround)

    • Add exponential backoff for 502 errors
    • Retry up to 3 times
    • This will improve user experience immediately
  2. Monitor Tunnel Health

    • Set up alerts for tunnel errors
    • Track timeout frequency
    • Monitor connection pool usage
  3. Optimize Nginx Configuration

    • Add keep-alive settings
    • Configure connection pooling
    • Optimize proxy settings
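
The client-side retry workaround from item 1 could look like the following. It is a minimal sketch, assuming that only HTTP 502 is treated as retryable and that a request may be safely repeated; `call_with_retry`, `backoff_delays`, and `make_sender` are hypothetical helper names, and the delay values are illustrative.

```python
import random
import time
import urllib.error
import urllib.request

RETRYABLE = {502}  # only the status observed in this investigation


def backoff_delays(retries=3, base=0.5, cap=8.0):
    """Exponential delays: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]


def call_with_retry(send, retries=3, base=0.5, cap=8.0, sleep=time.sleep):
    """Call send() -> (status, body); on a 502, wait and retry up to `retries` times."""
    status, body = send()
    for delay in backoff_delays(retries, base, cap):
        if status not in RETRYABLE:
            break
        # Jitter spreads out clients that failed at the same moment.
        sleep(delay + random.uniform(0, delay / 2))
        status, body = send()
    return status, body


def make_sender(url, payload, timeout=10.0):
    """Build a send() callable that POSTs a JSON payload (bytes) to `url`."""
    def send():
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status, resp.read()
        except urllib.error.HTTPError as exc:
            return exc.code, exc.read()
    return send
```

With a 40% per-request failure rate and independent failures, one initial attempt plus three retries would fail only about 0.4^4 ≈ 2.6% of the time, which is why this helps immediately even though it does not address the tunnel itself.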

Long-term Solutions

  1. Multiple Tunnel Endpoints

    • Set up secondary tunnel
    • Load balance between tunnels
    • Automatic failover
  2. Direct Connection Option

    • Provide direct IP access for critical clients
    • Bypass Cloudflare for trusted clients
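
Until load balancing is in place on the Cloudflare side, failover between a primary and a secondary endpoint can be approximated in the client. The sketch below assumes a caller-supplied health check; `pick_endpoint` and the example URLs are hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health check passes, or None."""
    for url in endpoints:
        try:
            if is_healthy(url):
                return url
        except Exception:
            continue  # a check that raises counts as unhealthy
    return None


# Example ordering: primary tunnel first, secondary tunnel as fallback.
ENDPOINTS = [
    "https://rpc.example.com",         # hypothetical primary
    "https://rpc-backup.example.com",  # hypothetical secondary
]
```

The order of `ENDPOINTS` sets the preference, so the secondary is only used when the primary check fails.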

Next Steps

  1. Review Cloudflare dashboard for tunnel errors (Manual - requires dashboard access)
  2. ⚠️ Configure tunnel keep-alive settings
  3. ⚠️ Add connection pool configuration
  4. ⚠️ Implement client-side retry logic (immediate workaround)
  5. ⚠️ Set up tunnel health monitoring
  6. ⚠️ Review Cloudflare tunnel metrics in dashboard

Configuration Changes Needed

Cloudflared Service Update

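[Service]
# --heartbeat-interval: idle time before a heartbeat is sent.
# --heartbeat-count: unacknowledged heartbeats allowed before the connection
# is closed; a non-zero value leaves room for a missed heartbeat.
ExecStart=/usr/bin/cloudflared --no-autoupdate \
  --protocol quic \
  --heartbeat-count 5 \
  --heartbeat-interval 5s \
  tunnel run --token ...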

Nginx Keep-Alive (if needed)

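# Reuse upstream connections: needs HTTP/1.1, a cleared Connection header,
# and a keepalive directive in the matching upstream block.
proxy_http_version 1.1;
proxy_set_header Connection "";
# Client-facing keep-alive limits.
keepalive_timeout 65;
keepalive_requests 100;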

Status: Investigation complete. Root causes identified. Recommendations provided.