proxmox/scripts/cloudflare-tunnels/docs/MONITORING_GUIDE.md
defiQUG cb47cce074 Complete markdown files cleanup and organization
2026-01-06 01:46:25 -08:00

# Monitoring Guide
Complete guide for monitoring Cloudflare tunnels.
## Overview
Monitoring ensures your tunnels are healthy and alerts you to issues before they impact users.
## Monitoring Components
1. **Health Checks** - Verify tunnels are running
2. **Connectivity Tests** - Verify DNS and HTTPS work
3. **Log Monitoring** - Watch for errors
4. **Alerting** - Notify on failures
## Quick Start
### One-Time Health Check
```bash
./scripts/check-tunnel-health.sh
```
### Continuous Monitoring
```bash
# Foreground (see output)
./scripts/monitor-tunnels.sh
# Background (daemon mode)
./scripts/monitor-tunnels.sh --daemon
```
## Health Check Script
The `check-tunnel-health.sh` script performs comprehensive checks:
### Checks Performed
1. **Service Status** - Is the systemd service running?
2. **Log Errors** - Are there recent errors in logs?
3. **DNS Resolution** - Does DNS resolve correctly?
4. **HTTPS Connectivity** - Can we connect via HTTPS?
5. **Internal Connectivity** - Can VMID 102 reach Proxmox hosts?
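The first four checks above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `check-tunnel-health.sh`: the function names (`run_check`, `check_service`, `check_dns`, `check_https`, `check_tunnel`) are hypothetical, and it assumes each tunnel runs as a systemd unit named `cloudflared-<name>`. The log check and the internal connectivity check are left out for brevity.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the checks; the real script may differ.

# Run a check command and print a result line in the style of the example output.
run_check() {  # usage: run_check <label> <command> [args...]
  local label=$1
  shift
  if "$@"; then
    printf '[✓] %s: OK\n' "$label"
  else
    printf '[✗] %s: FAILED\n' "$label"
    return 1
  fi
}

# Assumption: each tunnel runs as a systemd unit named cloudflared-<name>.
check_service() { systemctl is-active --quiet "cloudflared-$1"; }

# DNS passes if dig prints at least one line that is not empty or a ';' diagnostic.
check_dns() { dig +short "$1" 2>/dev/null | grep -qvE '^(;|$)'; }

# HTTPS passes if curl completes a request; any HTTP status counts as reachable.
check_https() { curl -sS -o /dev/null --max-time 10 "https://$1" 2>/dev/null; }

# Run all checks for one tunnel; returns 1 if any check failed.
check_tunnel() {
  local name=$1 host=$2 failed=0
  run_check "Service status" check_service "$name" || failed=1
  run_check "DNS resolution" check_dns "$host" || failed=1
  run_check "HTTPS connectivity" check_https "$host" || failed=1
  return "$failed"
}
```

With these definitions, `check_tunnel ml110 ml110-01.d-bis.org` prints one result line per check and returns non-zero if any of them failed, which makes it usable from cron or a systemd unit.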
### Usage
```bash
# Run health check
./scripts/check-tunnel-health.sh
# Output shows:
# - Service status for each tunnel
# - DNS resolution status
# - HTTPS connectivity
# - Internal connectivity
# - Recent errors
```
### Example Output
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tunnel: ml110 (ml110-01.d-bis.org)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[✓] Service is running
[✓] No recent errors in logs
[✓] DNS resolution: OK
→ 104.16.132.229
[✓] HTTPS connectivity: OK
[✓] Internal connectivity to 192.168.11.10:8006: OK
```
## Monitoring Script
The `monitor-tunnels.sh` script provides continuous monitoring:
### Features
- ✅ Continuous health checks
- ✅ Automatic restart on failure
- ✅ Alerting on failures
- ✅ Logging to file
- ✅ Daemon mode support
### Usage
```bash
# Foreground mode (see output)
./scripts/monitor-tunnels.sh
# Daemon mode (background)
./scripts/monitor-tunnels.sh --daemon
# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
pgrep -af monitor-tunnels
# Stop daemon
kill "$(cat /tmp/cloudflared-monitor.pid)"
```
### Configuration
Edit the script to customize:
```bash
CHECK_INTERVAL=60 # Check every 60 seconds
LOG_FILE="/var/log/cloudflared-monitor.log"
ALERT_SCRIPT="./scripts/alert-tunnel-failure.sh"
```
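To show how these three settings might fit together, here is a minimal sketch of a monitoring loop. It is not the actual `monitor-tunnels.sh`: the function names (`log`, `check_and_recover`, `monitor_loop`) are hypothetical, the `cloudflared-<name>` unit naming is an assumption, and the `${VAR:-default}` form is used here so the settings can also be overridden from the environment.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a monitoring loop; the real monitor-tunnels.sh may differ.

CHECK_INTERVAL=${CHECK_INTERVAL:-60}
LOG_FILE=${LOG_FILE:-/var/log/cloudflared-monitor.log}
ALERT_SCRIPT=${ALERT_SCRIPT:-./scripts/alert-tunnel-failure.sh}

# Append a timestamped line to the log file.
log() { printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_FILE"; }

# Check one tunnel; on failure, log, alert, and attempt a restart.
# Assumption: units are named cloudflared-<name>.
check_and_recover() {
  local name=$1 unit="cloudflared-$1"
  if systemctl is-active --quiet "$unit"; then
    return 0
  fi
  log "$unit is not active, restarting"
  "$ALERT_SCRIPT" "$name" service_down || log "alert script failed for $name"
  systemctl restart "$unit" || log "restart of $unit failed"
  return 1
}

# Loop forever over the given tunnel names, sleeping between rounds.
monitor_loop() {
  while true; do
    for name in "$@"; do
      check_and_recover "$name" || true
    done
    sleep "$CHECK_INTERVAL"
  done
}
```

A foreground run would then be `monitor_loop ml110 r630-01 r630-02`. Alerting before restarting is a design choice in this sketch: the notification goes out even if the restart hangs.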
## Alerting
### Email Alerts
Configure email alerts in `alert-tunnel-failure.sh`:
```bash
# Set email address
export ALERT_EMAIL="admin@yourdomain.com"
# Ensure the mail command is installed (Debian/Ubuntu)
sudo apt-get install -y mailutils
```
### Webhook Alerts
Configure webhook alerts (Slack, Discord, etc.):
```bash
# Set webhook URL
export ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
### Test Alerts
```bash
# Test alert script
./scripts/alert-tunnel-failure.sh ml110 service_down
```
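As a rough illustration of how an alert script could use `ALERT_EMAIL`, `ALERT_WEBHOOK`, and the two arguments shown in the test command, consider the sketch below. It is not the actual `alert-tunnel-failure.sh`: `build_payload` and `send_alert` are hypothetical names, and the payload uses a Slack-style `text` field, so other webhook services may expect a different field name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an alert script; the real alert-tunnel-failure.sh may differ.

# Build a JSON payload of the form {"text":"..."}.
# Backslashes and double quotes are escaped; newlines in the message are not handled.
build_payload() {
  local escaped
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}' "$escaped"
}

# Send an alert through every channel that is configured.
send_alert() {  # usage: send_alert <tunnel> <failure-type>
  local tunnel=$1 failure=$2
  local msg="Cloudflare tunnel $tunnel: $failure on $(hostname)"
  if [ -n "${ALERT_EMAIL:-}" ]; then
    printf '%s\n' "$msg" | mail -s "Tunnel alert: $tunnel" "$ALERT_EMAIL"
  fi
  if [ -n "${ALERT_WEBHOOK:-}" ]; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "$(build_payload "$msg")" "$ALERT_WEBHOOK" > /dev/null
  fi
}
```

Calling `send_alert ml110 service_down` mirrors the test command above. Channels whose variable is unset are skipped silently.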
## Log Monitoring
### View Logs
```bash
# All tunnels
journalctl -u 'cloudflared-*' -f
# Specific tunnel
journalctl -u cloudflared-ml110 -f
# Last 100 lines
journalctl -u cloudflared-ml110 -n 100
# Since specific time
journalctl -u cloudflared-ml110 --since "1 hour ago"
```
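The same `journalctl` options can feed a simple "recent errors" check. The sketch below counts log lines that look like errors. The function name `count_recent_errors` is hypothetical, and matching on the whole words `ERR` or `error` (case-insensitive) is an assumption about the log format that you may need to adjust.

```bash
#!/usr/bin/env bash
# Count recent log lines that look like errors for one systemd unit.
# Assumption: an error line contains the whole word "ERR" or "error" (any case).
count_recent_errors() {  # usage: count_recent_errors <unit> [since]
  local unit=$1 since="${2:-5 minutes ago}"
  # grep -c prints the count but exits 1 when it is zero, hence "|| true".
  journalctl -u "$unit" --since "$since" --no-pager -o cat 2>/dev/null \
    | grep -ciwE 'err|error' || true
}
```

For example, `count_recent_errors cloudflared-ml110 "1 hour ago"` prints the number of matching lines from the last hour.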
### Log Rotation
The systemd journal is rotated automatically by journald. If `cloudflared` also writes log files under `/var/log/cloudflared/`, rotate them with logrotate:
```bash
# Edit logrotate config
sudo nano /etc/logrotate.d/cloudflared
# Add:
/var/log/cloudflared/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```
## Metrics
### Cloudflare Dashboard
View tunnel metrics in the Cloudflare dashboard:
1. **Go to:** Zero Trust → Networks → Tunnels
2. **Click on tunnel** to view:
   - Connection status
   - Uptime
   - Traffic statistics
   - Error rates
### Local Metrics
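Each tunnel serves Prometheus-format metrics over HTTP when `cloudflared` is started with a metrics listen address (the `--metrics` flag). The ports below assume one address per tunnel: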
```bash
# ml110 tunnel metrics
curl http://127.0.0.1:9091/metrics
# r630-01 tunnel metrics
curl http://127.0.0.1:9092/metrics
# r630-02 tunnel metrics
curl http://127.0.0.1:9093/metrics
```
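Because these endpoints return Prometheus text format, a single value can be pulled out with `awk`. The sketch below is illustrative only: `metric_value` and `tunnel_metric` are hypothetical helper names, and `cloudflared_tunnel_ha_connections` in the usage note is an assumed metric name, so check your endpoint's output for the names it actually exposes.

```bash
#!/usr/bin/env bash
# Print the value of one unlabelled metric from Prometheus text-format input on stdin.
# Exits non-zero if the metric is not present.
metric_value() {  # usage: metric_value <metric-name>
  awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}

# Fetch a local metrics endpoint and extract one metric from it.
tunnel_metric() {  # usage: tunnel_metric <port> <metric-name>
  curl -fsS --max-time 5 "http://127.0.0.1:$1/metrics" | metric_value "$2"
}
```

For example, `tunnel_metric 9091 cloudflared_tunnel_ha_connections` would print the current value for the ml110 tunnel, assuming that metric exists on your version.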
## Automated Monitoring Setup
### Systemd Timer (Recommended)
Create a systemd timer for automated health checks:
```bash
# Create timer unit
sudo nano /etc/systemd/system/cloudflared-healthcheck.timer
# Add:
[Unit]
Description=Cloudflare Tunnel Health Check Timer
Requires=cloudflared-healthcheck.service
[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Unit=cloudflared-healthcheck.service
[Install]
WantedBy=timers.target
```
```bash
# Create service unit
sudo nano /etc/systemd/system/cloudflared-healthcheck.service
# Add:
[Unit]
Description=Cloudflare Tunnel Health Check
After=network.target
[Service]
Type=oneshot
ExecStart=/path/to/scripts/check-tunnel-health.sh
StandardOutput=journal
StandardError=journal
```
```bash
# Reload unit files, then enable and start the timer
sudo systemctl daemon-reload
sudo systemctl enable --now cloudflared-healthcheck.timer
```
### Cron Job (Alternative)
```bash
# Edit crontab
crontab -e
# Add (check every 5 minutes):
*/5 * * * * /path/to/scripts/check-tunnel-health.sh >> /var/log/tunnel-health.log 2>&1
```
## Monitoring Best Practices
1. **Run health checks regularly** - At least every 5 minutes
2. **Monitor logs** - Watch for errors
3. **Set up alerts** - Get notified immediately on failures
4. **Review metrics** - Track trends over time
5. **Test alerts** - Verify alerting works
6. **Document incidents** - Keep track of issues
## Integration with Monitoring Systems
### Prometheus
If using Prometheus, you can scrape tunnel metrics:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'cloudflared'
    static_configs:
      - targets: ['127.0.0.1:9091', '127.0.0.1:9092', '127.0.0.1:9093']
```
### Grafana
Create dashboards in Grafana:
- Tunnel uptime
- Connection status
- Error rates
- Response times
### Nagios/Icinga
Create service checks:
```bash
# Check service status
check_nrpe -H localhost -c check_cloudflared_ml110
# Check connectivity
check_http -H ml110-01.d-bis.org -S
```
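The `check_cloudflared_ml110` NRPE command above needs a plugin behind it. A minimal sketch of one is shown below. The function name `check_cloudflared` and the `cloudflared-<name>` unit naming are assumptions; the exit codes follow the standard plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```bash
#!/usr/bin/env bash
# Hypothetical Nagios/Icinga plugin sketch for a cloudflared systemd unit.
# Exit codes follow the plugin convention: 0 OK, 2 CRITICAL, 3 UNKNOWN.
check_cloudflared() {  # usage: check_cloudflared <tunnel-name>
  if [ -z "${1:-}" ]; then
    echo "UNKNOWN - usage: check_cloudflared <tunnel-name>"
    return 3
  fi
  local unit="cloudflared-$1" state
  # is-active prints the state and exits non-zero unless it is "active".
  state=$(systemctl is-active "$unit" 2>/dev/null) || true
  if [ "$state" = "active" ]; then
    echo "OK - $unit is active"
    return 0
  fi
  echo "CRITICAL - $unit is ${state:-unknown}"
  return 2
}
```

To use it as a plugin, save it as a script that ends by calling `check_cloudflared "$1"` and exiting with its status, then point the NRPE command definition at that script.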
## Troubleshooting Monitoring
### Health Check Fails
```bash
# Run manually with verbose output
bash -x ./scripts/check-tunnel-health.sh
# Check individual components
systemctl status cloudflared-ml110
dig ml110-01.d-bis.org
curl -I https://ml110-01.d-bis.org
```
### Monitor Script Not Working
```bash
# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
pgrep -af monitor-tunnels
# Check log file
tail -f /var/log/cloudflared-monitor.log
# Run in foreground to see errors
./scripts/monitor-tunnels.sh
```
### Alerts Not Sending
```bash
# Test alert script
./scripts/alert-tunnel-failure.sh ml110 service_down
# Check email configuration
echo "Test" | mail -s "Test" admin@yourdomain.com
# Check webhook
curl -X POST -H "Content-Type: application/json" \
  -d '{"text":"test"}' "$ALERT_WEBHOOK"
```
## Next Steps
After setting up monitoring:
1. ✅ Verify health checks run successfully
2. ✅ Test alerting (trigger a test failure)
3. ✅ Set up log aggregation (if needed)
4. ✅ Create dashboards (if using Grafana)
5. ✅ Document monitoring procedures
## Support
For monitoring issues:
1. Check [Troubleshooting Guide](TROUBLESHOOTING.md)
2. Review script logs
3. Test components individually
4. Check systemd service status