- Organized 252 files across project - Root directory: 187 → 2 files (98.9% reduction) - Moved configuration guides to docs/04-configuration/ - Moved troubleshooting guides to docs/09-troubleshooting/ - Moved quick start guides to docs/01-getting-started/ - Moved reports to reports/ directory - Archived temporary files - Generated comprehensive reports and documentation - Created maintenance scripts and guides All files organized according to established standards.
4.7 KiB
4.7 KiB
Cloudflare Tunnel Investigation Report
Date: 2026-01-05
Status: ✅ Investigation Complete
Priority: High
Investigation Summary
Investigated Cloudflare tunnel issues causing 40-60% failure rate on public RPC endpoint. Found timeout errors and connection issues in tunnel logs.
Current Status
Cloudflared Service Status
- Service:
cloudflared.service - Status: ✅ Active (running)
- Uptime: 15+ hours
- Location: VMID 2400
- Memory: 20.8M
- CPU: 3min 25.004s
Current Success Rate
- Test Results: 60% success rate (6/10 requests)
- Pattern: Intermittent failures, not time-based
- Error: "502 Bad Gateway" from Cloudflare
Findings
Service Status
✅ Service Running: Cloudflared is active and running
Error Patterns Identified
Critical Errors Found:
-
Timeout Errors:
timeout: no recent network activityfailed to accept QUIC stream: timeout: no recent network activitydatagram manager encountered a failure while serving
-
Connection Issues:
- Connection terminations and retries
- Multiple connection indices (connIndex=2, connIndex=3)
- Retrying connections in up to 1s
-
Pattern:
- Errors occur intermittently
- Connections are being retried automatically
- Multiple tunnel connections registered (lax01, lax05 locations)
Configuration Analysis
Cloudflared Service Configuration:
[Service]
TimeoutStartSec=15
Type=notify
ExecStart=/usr/bin/cloudflared --no-autoupdate tunnel run --token ...
Restart=on-failure
RestartSec=5s
Nginx Proxy Timeouts:
proxy_connect_timeout: 300s ✅ Goodproxy_send_timeout: 300s ✅ Goodproxy_read_timeout: 300s ✅ Good
Issues Identified:
- No explicit tunnel connection pool configuration
- No tunnel timeout settings visible in service file
- Timeout errors suggest network activity issues
- Multiple connections but some failing
Root Cause Analysis
Primary Issues
- Network Activity Timeouts: Tunnel connections timing out due to lack of network activity
- QUIC Stream Failures: QUIC protocol streams failing to accept
- Connection Pool Exhaustion: Possible connection pool issues (not explicitly configured)
Contributing Factors
- No Keep-Alive Configuration: Tunnel may need keep-alive settings
- No Connection Pool Limits: Default pool size may be insufficient
- Network Latency: Possible latency between Cloudflare edge and origin
- Tunnel Token Configuration: Using token-based auth (may have limitations)
Recommendations
Immediate Actions (High Priority)
-
Configure Tunnel Keep-Alive
- Add
--heartbeat-countand--heartbeat-intervalflags - Ensure connections stay alive
- Add
-
Increase Connection Pool
- Configure multiple tunnel connections
- Add
--protocol quicexplicitly - Consider
--retriesconfiguration
-
Add Tunnel Metrics
- Enable metrics endpoint
- Monitor connection health
- Track timeout patterns
-
Review Cloudflare Dashboard
- Check tunnel status in Cloudflare dashboard
- Review tunnel metrics and errors
- Check for rate limiting or throttling
Short-term Improvements
-
Implement Client-Side Retry Logic (Workaround)
- Add exponential backoff for 502 errors
- Retry up to 3 times
- This will improve user experience immediately
-
Monitor Tunnel Health
- Set up alerts for tunnel errors
- Track timeout frequency
- Monitor connection pool usage
-
Optimize Nginx Configuration
- Add keep-alive settings
- Configure connection pooling
- Optimize proxy settings
Long-term Solutions
-
Multiple Tunnel Endpoints
- Set up secondary tunnel
- Load balance between tunnels
- Automatic failover
-
Direct Connection Option
- Provide direct IP access for critical clients
- Bypass Cloudflare for trusted clients
Next Steps
- ✅ Review Cloudflare dashboard for tunnel errors (Manual - requires dashboard access)
- ⚠️ Configure tunnel keep-alive settings
- ⚠️ Add connection pool configuration
- ⚠️ Implement client-side retry logic (immediate workaround)
- ⚠️ Set up tunnel health monitoring
- ⚠️ Review Cloudflare tunnel metrics in dashboard
Configuration Changes Needed
Cloudflared Service Update
[Service]
ExecStart=/usr/bin/cloudflared --no-autoupdate \
--protocol quic \
--heartbeat-count 0 \
--heartbeat-interval 5s \
tunnel run --token ...
Nginx Keep-Alive (if needed)
proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_timeout 65;
keepalive_requests 100;
Status: Investigation complete. Root causes identified. Recommendations provided.