
# Monitoring Guide

Complete guide for monitoring Cloudflare tunnels.

## Overview

Monitoring verifies that your tunnels are healthy and alerts you to problems, ideally before they affect users.

## Monitoring Components

1. **Health Checks** - Verify tunnels are running
2. **Connectivity Tests** - Verify DNS and HTTPS work
3. **Log Monitoring** - Watch for errors
4. **Alerting** - Notify on failures

## Quick Start

### One-Time Health Check

```bash
./scripts/check-tunnel-health.sh
```

### Continuous Monitoring

```bash
# Foreground (see output)
./scripts/monitor-tunnels.sh

# Background (daemon mode)
./scripts/monitor-tunnels.sh --daemon
```

## Health Check Script

The `check-tunnel-health.sh` script checks each tunnel at the service, log, DNS, HTTPS, and internal-network level.

### Checks Performed

1. **Service Status** - Is the systemd service running?
2. **Log Errors** - Are there recent errors in logs?
3. **DNS Resolution** - Does DNS resolve correctly?
4. **HTTPS Connectivity** - Can we connect via HTTPS?
5. **Internal Connectivity** - Can VMID 102 reach Proxmox hosts?
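
As an illustration of the HTTPS connectivity check, here is a minimal sketch. The function names `classify_http_code` and `check_https` are hypothetical and not taken from `check-tunnel-health.sh`; treating any 2xx or 3xx response as healthy is also an assumption that may not suit every setup.

```bash
# Hypothetical helpers; the real check-tunnel-health.sh may work differently.

# Map a curl-reported HTTP status code to a coarse health state.
classify_http_code() {
    case "$1" in
        2??|3??) echo "ok" ;;           # success or redirect
        000)     echo "unreachable" ;;  # curl received no HTTP response at all
        *)       echo "error" ;;        # 4xx/5xx or anything unexpected
    esac
}

# Probe one hostname over HTTPS and print "<host> <state>".
check_https() {
    local host="$1" code
    code="$(curl --silent --output /dev/null --max-time 10 \
        --write-out '%{http_code}' "https://${host}" || true)"
    echo "${host} $(classify_http_code "${code}")"
}
```

Calling `check_https ml110-01.d-bis.org` prints the hostname followed by one of the three states, which a wrapper script can then turn into a `[✓]` or failure line.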

### Usage

```bash
# Run health check
./scripts/check-tunnel-health.sh

# Output shows:
# - Service status for each tunnel
# - DNS resolution status
# - HTTPS connectivity
# - Internal connectivity
# - Recent errors
```

### Example Output

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tunnel: ml110 (ml110-01.d-bis.org)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[✓] Service is running
[✓] No recent errors in logs
[✓] DNS resolution: OK
  → 104.16.132.229
[✓] HTTPS connectivity: OK
[✓] Internal connectivity to 192.168.11.10:8006: OK
```

## Monitoring Script

The `monitor-tunnels.sh` script provides continuous monitoring:

### Features

- ✅ Continuous health checks
- ✅ Automatic restart on failure
- ✅ Alerting on failures
- ✅ Logging to file
- ✅ Daemon mode support
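
The check, restart, and alert features can be pictured as one pass per tunnel. The sketch below is an assumption about the general shape, not the actual contents of `monitor-tunnels.sh`. The function `monitor_pass` and the variables `IS_ACTIVE_CMD`, `RESTART_CMD`, and `ALERT_CMD` are hypothetical names, introduced so the logic can be dry-run without touching systemd.

```bash
# Hypothetical sketch of one monitoring pass; names are illustrative only.
# The three commands are overridable so the flow can be tested without systemd.
IS_ACTIVE_CMD="${IS_ACTIVE_CMD:-systemctl is-active --quiet}"
RESTART_CMD="${RESTART_CMD:-systemctl restart}"
ALERT_CMD="${ALERT_CMD:-./scripts/alert-tunnel-failure.sh}"

monitor_pass() {
    local tunnel="$1" unit="cloudflared-$1"

    # Healthy: nothing else to do.
    if $IS_ACTIVE_CMD "$unit"; then
        echo "ok $tunnel"
        return 0
    fi

    # Unhealthy: try one restart, then re-check.
    echo "down $tunnel"
    if $RESTART_CMD "$unit" && $IS_ACTIVE_CMD "$unit"; then
        echo "restarted $tunnel"
        return 0
    fi

    # Restart did not help: raise an alert and report failure.
    echo "alert $tunnel"
    $ALERT_CMD "$tunnel" service_down || echo "alert failed for $tunnel" >&2
    return 1
}
```

A continuous monitor would call `monitor_pass` for each tunnel and then sleep for the check interval before repeating.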

### Usage

```bash
# Foreground mode (see output)
./scripts/monitor-tunnels.sh

# Daemon mode (background)
./scripts/monitor-tunnels.sh --daemon

# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
pgrep -af monitor-tunnels

# Stop daemon
kill "$(cat /tmp/cloudflared-monitor.pid)"
```

### Configuration

Edit the script to customize:

```bash
CHECK_INTERVAL=60        # Check every 60 seconds
LOG_FILE="/var/log/cloudflared-monitor.log"
ALERT_SCRIPT="./scripts/alert-tunnel-failure.sh"
```

## Alerting

### Email Alerts

Configure email alerts in `alert-tunnel-failure.sh`:

```bash
# Set email address
export ALERT_EMAIL="admin@yourdomain.com"

# Ensure the mail command is installed (Debian/Ubuntu)
sudo apt-get install -y mailutils
```

### Webhook Alerts

Configure webhook alerts (Slack, Discord, etc.):

```bash
# Set webhook URL
export ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
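
To show how a webhook alert might be sent, here is a minimal sketch. The functions `build_alert_payload` and `send_webhook_alert` are hypothetical and may not match what `alert-tunnel-failure.sh` does. The `{"text": ...}` body is the shape Slack incoming webhooks accept; other services may expect a different field name.

```bash
# Hypothetical helpers; the real alert-tunnel-failure.sh may differ.

# Build a JSON body of the form {"text":"..."} for a tunnel failure.
# Only backslashes and double quotes are escaped; newlines are not handled.
build_alert_payload() {
    local msg="Tunnel $1 failed: $2"
    msg=${msg//\\/\\\\}   # escape backslashes first
    msg=${msg//\"/\\\"}   # then escape double quotes
    printf '{"text":"%s"}\n' "$msg"
}

# POST the payload to the URL in ALERT_WEBHOOK.
# --fail makes curl exit non-zero on HTTP error responses (4xx/5xx).
send_webhook_alert() {
    if [ -z "${ALERT_WEBHOOK:-}" ]; then
        echo "ALERT_WEBHOOK is not set" >&2
        return 1
    fi
    curl --fail --silent --show-error --max-time 10 \
        -X POST -H 'Content-Type: application/json' \
        -d "$(build_alert_payload "$1" "$2")" "$ALERT_WEBHOOK"
}
```

With `ALERT_WEBHOOK` exported, `send_webhook_alert ml110 service_down` posts a single message and returns curl's exit status.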

### Test Alerts

```bash
# Test alert script
./scripts/alert-tunnel-failure.sh ml110 service_down
```

## Log Monitoring

### View Logs

```bash
# All tunnels (quote the pattern so the shell does not expand it)
journalctl -u 'cloudflared-*' -f

# Specific tunnel
journalctl -u cloudflared-ml110 -f

# Last 100 lines
journalctl -u cloudflared-ml110 -n 100

# Since specific time
journalctl -u cloudflared-ml110 --since "1 hour ago"
```
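
For a quick error count over a time window, the journal output can be piped through a small filter. This is a sketch only: `count_log_errors` is a hypothetical helper, and it assumes that error lines contain the word `ERR`, `error`, or `fatal`, which may not cover every message your version of cloudflared emits.

```bash
# Hypothetical helper: count log lines on stdin that mention an error level.
count_log_errors() {
    # -w matches whole words, -i ignores case, -c prints the number of
    # matching lines. grep exits non-zero when nothing matches, so
    # "|| true" keeps a zero count from looking like a failure.
    grep -ciwE 'err|error|fatal' || true
}
```

For example, `journalctl -u cloudflared-ml110 --since "5 minutes ago" --no-pager | count_log_errors` prints the number of matching lines from the last five minutes.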

### Log Rotation

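journald rotates the systemd journal automatically, so the logs viewed with `journalctl` need no extra setup. File-based logs under `/var/log/cloudflared/` are not covered by journald; to rotate those, add a logrotate rule: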

```bash
# Edit logrotate config
sudo nano /etc/logrotate.d/cloudflared

# Add:
/var/log/cloudflared/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

## Metrics

### Cloudflare Dashboard

View tunnel metrics in Cloudflare dashboard:

1. **Go to:** Zero Trust → Networks → Tunnels
2. **Click on tunnel** to view:
   - Connection status
   - Uptime
   - Traffic statistics
   - Error rates

### Local Metrics

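A tunnel exposes a Prometheus-format metrics endpoint only if it is started with a metrics address (for example, cloudflared's `--metrics` option). The ports below assume one port per tunnel: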

```bash
# ml110 tunnel metrics
curl http://127.0.0.1:9091/metrics

# r630-01 tunnel metrics
curl http://127.0.0.1:9092/metrics

# r630-02 tunnel metrics
curl http://127.0.0.1:9093/metrics
```
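
To pull a single value out of the metrics output, a short filter is enough. The sketch below assumes the Prometheus text format and a metric without labels; `metric_value` is a hypothetical helper, and the metric name in the usage line is an assumption, so check the endpoint output for the names your cloudflared version exposes.

```bash
# Hypothetical helper: print the value of an unlabelled metric read from
# stdin in Prometheus text format. Exits 1 if the metric is not present.
metric_value() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1; exit }  # first exact match wins
        END { exit !found }                       # non-zero if never seen
    '
}
```

Usage might look like `curl --silent http://127.0.0.1:9091/metrics | metric_value cloudflared_tunnel_ha_connections`.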

## Automated Monitoring Setup

### Systemd Timer (Recommended)

Create a systemd timer for automated health checks:

```bash
# Create timer unit
sudo nano /etc/systemd/system/cloudflared-healthcheck.timer

# Add:
[Unit]
Description=Cloudflare Tunnel Health Check Timer
Requires=cloudflared-healthcheck.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
Unit=cloudflared-healthcheck.service

[Install]
WantedBy=timers.target
```

```bash
# Create service unit
sudo nano /etc/systemd/system/cloudflared-healthcheck.service

# Add:
[Unit]
Description=Cloudflare Tunnel Health Check
After=network.target

[Service]
Type=oneshot
ExecStart=/path/to/scripts/check-tunnel-health.sh
StandardOutput=journal
StandardError=journal
```

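```bash
# Reload unit files so systemd sees the new timer and service
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable cloudflared-healthcheck.timer
sudo systemctl start cloudflared-healthcheck.timer

# Confirm the timer is scheduled
systemctl list-timers cloudflared-healthcheck.timer
```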

### Cron Job (Alternative)

```bash
# Edit crontab
crontab -e

# Add (check every 5 minutes):
*/5 * * * * /path/to/scripts/check-tunnel-health.sh >> /var/log/tunnel-health.log 2>&1
```

## Monitoring Best Practices

1. ✅ **Run health checks regularly** - At least every 5 minutes
2. ✅ **Monitor logs** - Watch for errors
3. ✅ **Set up alerts** - Get notified immediately on failures
4. ✅ **Review metrics** - Track trends over time
5. ✅ **Test alerts** - Verify alerting works
6. ✅ **Document incidents** - Keep track of issues

## Integration with Monitoring Systems

### Prometheus

If using Prometheus, you can scrape tunnel metrics:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'cloudflared'
    static_configs:
      - targets: ['127.0.0.1:9091', '127.0.0.1:9092', '127.0.0.1:9093']
```

### Grafana

Create dashboards in Grafana:
- Tunnel uptime
- Connection status
- Error rates
- Response times

### Nagios/Icinga

Create service checks:

```bash
# Check service status
check_nrpe -H localhost -c check_cloudflared_ml110

# Check connectivity
check_http -H ml110-01.d-bis.org -S
```
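
The `check_cloudflared_ml110` command above has to be defined on the monitored host. A minimal sketch of such a check follows, using the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The function `nagios_check_tunnel` and the overridable `IS_ACTIVE_CMD` variable are hypothetical names, not part of the existing scripts.

```bash
# Hypothetical Nagios-style check; names are illustrative only.
# IS_ACTIVE_CMD is overridable so the check can be dry-run without systemd.
IS_ACTIVE_CMD="${IS_ACTIVE_CMD:-systemctl is-active --quiet}"

nagios_check_tunnel() {
    local unit="cloudflared-$1"
    if $IS_ACTIVE_CMD "$unit"; then
        echo "OK - $unit is active"
        return 0   # Nagios OK
    fi
    echo "CRITICAL - $unit is not active"
    return 2       # Nagios CRITICAL
}
```

A wrapper script that ends with `nagios_check_tunnel ml110; exit $?` could then be registered as the NRPE command.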

## Troubleshooting Monitoring

### Health Check Fails

```bash
# Run manually with verbose output
bash -x ./scripts/check-tunnel-health.sh

# Check individual components
systemctl status cloudflared-ml110
dig ml110-01.d-bis.org
curl -I https://ml110-01.d-bis.org
```

### Monitor Script Not Working

```bash
# Check if daemon is running (pgrep does not match itself, unlike ps | grep)
pgrep -af monitor-tunnels

# Check log file
tail -f /var/log/cloudflared-monitor.log

# Run in foreground to see errors
./scripts/monitor-tunnels.sh
```

### Alerts Not Sending

```bash
# Test alert script
./scripts/alert-tunnel-failure.sh ml110 service_down

# Check email configuration
echo "Test" | mail -s "Test" admin@yourdomain.com

# Check webhook (quote the URL so special characters are not interpreted by the shell)
curl -X POST -H "Content-Type: application/json" \
  -d '{"text":"test"}' "$ALERT_WEBHOOK"
```

## Next Steps

After setting up monitoring:

1. ✅ Verify health checks run successfully
2. ✅ Test alerting (trigger a test failure)
3. ✅ Set up log aggregation (if needed)
4. ✅ Create dashboards (if using Grafana)
5. ✅ Document monitoring procedures

## Support

For monitoring issues:
1. Check [Troubleshooting Guide](TROUBLESHOOTING.md)
2. Review script logs
3. Test components individually
4. Check systemd service status