# Monitoring Setup Guide
**Last Updated**: 2025-01-27
**Status**: Active

This guide explains how to set up and configure the monitoring stack for the DeFi Oracle Meta Mainnet.
## Table of Contents
- [Overview](#overview)
- [Monitoring Stack](#monitoring-stack)
- [Setup Instructions](#setup-instructions)
- [Dashboards](#dashboards)
- [Alerts](#alerts)
- [Troubleshooting](#troubleshooting)
## Overview
The monitoring stack consists of:
- **Prometheus** - Metrics collection
- **Grafana** - Visualization and dashboards
- **Loki** - Log aggregation
- **Alertmanager** - Alert routing and notification
- **Jaeger** - Distributed tracing
- **OpenTelemetry** - Observability framework
## Monitoring Stack
### Prometheus
**Purpose**: Metrics collection and storage
**Features**:
- Scrapes metrics from all Besu nodes
- Custom metrics for oracle updates
- Alert rules for node health
### Grafana
**Purpose**: Visualization and dashboards
**Dashboards**:
- Besu node health
- Block production metrics
- RPC performance metrics
- Oracle feed status
- CCIP monitoring
### Loki
**Purpose**: Log aggregation
**Features**:
- Centralized log collection
- Structured logging
- Log retention policies
### Alertmanager
**Purpose**: Alert routing and notification
**Features**:
- Alert routing
- Notification channels (email, Slack, PagerDuty)
- Alert inhibition rules
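
As an illustration of an inhibition rule, the fragment below is a minimal sketch that mutes warning-level notifications for an instance while a critical alert is already firing for that same instance. It assumes every alert carries a `severity` label (as the routing example in this guide does) and the `instance` label that Prometheus attaches to scraped targets; adjust both to match the labels your rules actually set.

```yaml
# alertmanager-config.yaml (fragment, illustrative only)
inhibit_rules:
  # While any critical alert is firing...
  - source_match:
      severity: critical
    # ...suppress notifications for warning alerts...
    target_match:
      severity: warning
    # ...but only when both alerts share the same instance label.
    equal: ['instance']
```

Newer Alertmanager releases also accept `source_matchers` and `target_matchers` in place of `source_match` and `target_match`; the older form is used here to stay consistent with the `match` syntax in the routing example.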
## Setup Instructions
### 1. Deploy Prometheus
```bash
# Deploy Prometheus
kubectl apply -f monitoring/k8s/prometheus.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=prometheus
```
### 2. Deploy Grafana
```bash
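# Add the Grafana Helm repository (needed once, before the first install)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy Grafana using Helm
helm install grafana grafana/grafana -n monitoring
# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode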
```
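
Grafana needs data sources configured before any dashboard can show data. The values file below is a sketch of provisioning Prometheus and Loki through the Grafana Helm chart's `datasources` value. The file name `grafana-values.yaml` and both service URLs are assumptions; substitute the service names and ports that your manifests actually create.

```yaml
# grafana-values.yaml (hypothetical file name)
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        # Assumed in-cluster service name; 9090 is the default Prometheus port
        url: http://prometheus.monitoring.svc.cluster.local:9090
        isDefault: true
      - name: Loki
        type: loki
        access: proxy
        # Assumed in-cluster service name; 3100 is the default Loki port
        url: http://loki.monitoring.svc.cluster.local:3100
```

If this approach fits your setup, the file can be passed at install time with `helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml`.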
### 3. Deploy Loki
```bash
# Deploy Loki
kubectl apply -f monitoring/k8s/loki.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=loki
```
### 4. Deploy Alertmanager
```bash
# Deploy Alertmanager
kubectl apply -f monitoring/k8s/alertmanager.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=alertmanager
```
### 5. Configure Service Discovery
Prometheus needs to discover Besu nodes:
```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
```
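
With `role: pod`, Prometheus discovers every pod in the listed namespace, including pods that expose no metrics. A common convention, sketched below, is to keep only pods that opt in through annotations. The `prometheus.io/scrape` and `prometheus.io/port` annotations are assumptions about how the Besu pods are annotated, not something this guide's manifests are known to set.

```yaml
# prometheus-config.yaml (illustrative variant with relabeling)
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (assumed annotation)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the port given in prometheus.io/port (assumed annotation)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Record the pod name as a label on every scraped series
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
```

Pods without the scrape annotation are dropped by the `keep` rule, which is a frequent cause of "no metrics" when the annotation is missing or misspelled.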
## Dashboards
### Besu Node Dashboard
**Metrics**:
- Block production rate
- Transaction throughput
- Gas usage
- Peer connections
- Sync status
**Access**: Grafana → Dashboards → Besu Node Health
### RPC Performance Dashboard
**Metrics**:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Method distribution
**Access**: Grafana → Dashboards → RPC Performance
### Oracle Dashboard
**Metrics**:
- Update frequency
- Round completion time
- Deviation from sources
- Transmitter status
**Access**: Grafana → Dashboards → Oracle Status
### CCIP Dashboard
**Metrics**:
- Message throughput
- Cross-chain latency
- Fee accumulation
- Error rate
**Access**: Grafana → Dashboards → CCIP Monitoring
## Alerts
### Critical Alerts
- **Node Down**: Besu node not responding
- **Block Production Stopped**: No blocks produced in 30 seconds
- **High Error Rate**: Error rate > 5%
- **Oracle Down**: Oracle not updating
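
Two of the critical alerts above could be expressed as Prometheus alerting rules along the following lines. This is a sketch, not the project's actual rule file: the file name, the `for` duration, and the metric name `ethereum_blockchain_height` are assumptions, so confirm the metric against the output of a node's metrics endpoint before relying on it. The `job="besu-nodes"` selector matches the job name used in the service discovery example in this guide.

```yaml
# besu-critical-rules.yaml (hypothetical file name)
groups:
  - name: besu-critical
    rules:
      - alert: NodeDown
        # `up` is generated by Prometheus itself; 0 means the last scrape failed
        expr: up{job="besu-nodes"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Besu node {{ $labels.instance }} is not responding"
      - alert: BlockProductionStopped
        # ASSUMED metric name. Fires when the chain height did not change
        # within the last 30 seconds. Needs a scrape interval of 15s or less
        # so the window holds at least two samples.
        expr: changes(ethereum_blockchain_height{job="besu-nodes"}[30s]) == 0
        labels:
          severity: critical
        annotations:
          summary: "No new blocks seen by {{ $labels.instance }} in 30 seconds"
```

A rule file like this is loaded by listing its path under `rule_files` in the Prometheus configuration. The `severity: critical` label is what the Alertmanager route uses to pick the receiver.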
### Warning Alerts
- **High Latency**: P95 latency > 300ms
- **Low Throughput**: Throughput < 50% of normal
- **High Gas Usage**: Gas usage > 80% of limit
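
The High Latency threshold could be written as a rule over a latency histogram, as sketched below. The metric `rpc_request_duration_seconds_bucket` is a hypothetical name used only for illustration; replace it with whatever histogram your RPC layer actually exports. The file name and the `for` duration are likewise assumptions.

```yaml
# besu-warning-rules.yaml (hypothetical file name)
groups:
  - name: besu-warning
    rules:
      - alert: HighLatency
        # HYPOTHETICAL histogram name; values are assumed to be in seconds,
        # so 0.3 corresponds to the 300ms threshold.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(rpc_request_duration_seconds_bucket[5m]))
          ) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RPC P95 latency is above 300ms"
```

Aggregating with `sum by (le)` before `histogram_quantile` yields one P95 across all nodes; adding `instance` to the `by` clause would give a per-node value instead.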
### Alert Configuration
```yaml
# alertmanager-config.yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
```
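
The route refers to two receivers, `default` and `critical-alerts`, which must be defined in the same file or Alertmanager will reject the configuration. The fragment below is one possible sketch that sends default alerts to Slack and critical alerts to PagerDuty. The webhook URL, channel name, and integration key are placeholders, and the choice of channel per receiver is an assumption.

```yaml
# alertmanager-config.yaml (fragment, illustrative only)
receivers:
  - name: 'default'
    slack_configs:
      # Placeholder webhook URL; use your own Slack incoming webhook
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#monitoring'   # hypothetical channel name
        send_resolved: true
  - name: 'critical-alerts'
    pagerduty_configs:
      # Placeholder integration key for the PagerDuty Events API v2
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'
        send_resolved: true
```

Keep real webhook URLs and keys out of version control, for example by mounting them from a Kubernetes Secret.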
## Troubleshooting
### Prometheus Not Scraping
**Symptoms**: No metrics in Prometheus
**Solution**:
1. Check service discovery configuration
2. Verify that the namespace and labels of the Besu pods match what the scrape configuration selects
3. Check network connectivity
4. Review Prometheus logs
### Grafana Not Showing Data
**Symptoms**: Dashboards show "No data"
**Solution**:
1. Verify Prometheus data source
2. Check query syntax
3. Verify time range
4. Check metric names
### Alerts Not Firing
**Symptoms**: Conditions met but no alerts
**Solution**:
1. Check alert rule syntax
2. Verify Alertmanager configuration
3. Check notification channels
4. Review Alertmanager logs
## Related Documentation
- [Architecture Documentation](../architecture/ARCHITECTURE.md)
- [Deployment Guide](../deployment/DEPLOYMENT.md)
- [Troubleshooting Guide](../guides/TROUBLESHOOTING.md)