# Monitoring Setup Guide

**Last Updated**: 2025-01-27
**Status**: Active

This guide explains how to set up and configure the monitoring stack for the DeFi Oracle Meta Mainnet.

## Table of Contents

- [Overview](#overview)
- [Monitoring Stack](#monitoring-stack)
- [Setup Instructions](#setup-instructions)
- [Dashboards](#dashboards)
- [Alerts](#alerts)
- [Troubleshooting](#troubleshooting)

## Overview

The monitoring stack consists of:
- **Prometheus** - Metrics collection
- **Grafana** - Visualization and dashboards
- **Loki** - Log aggregation
- **Alertmanager** - Alert routing and notification
- **Jaeger** - Distributed tracing
- **OpenTelemetry** - Observability framework

## Monitoring Stack

### Prometheus

**Purpose**: Metrics collection and storage

**Features**:
- Scrapes metrics from all Besu nodes
- Custom metrics for oracle updates
- Alert rules for node health
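
As a hedged illustration of the node-health signal listed above, the sketch below asks the Prometheus HTTP API how many Besu scrape targets are currently down. It assumes Prometheus is reachable at `http://localhost:9090` (for example through `kubectl port-forward`; the service name in the comment is a guess, not something this guide specifies) and that the scrape job is called `besu-nodes`, as in the service discovery step.

```bash
# Assumption: Prometheus is reachable at this URL, e.g. after
#   kubectl port-forward -n monitoring svc/prometheus 9090:9090
# (the service name "prometheus" is hypothetical; adjust to your manifests).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Count targets in the besu-nodes job whose `up` value is 0.
# `up` is written by Prometheus itself: 1 if the last scrape succeeded, else 0.
count_besu_nodes_down() {
  prom_query 'up{job="besu-nodes"} == 0' | grep -o '"value":\[' | wc -l | tr -d ' '
}
```

With Prometheus reachable, `count_besu_nodes_down` prints `0` when every discovered node answered its last scrape. Counting matches with `grep` keeps the sketch free of extra dependencies; `jq` would be more robust where it is installed.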

### Grafana

**Purpose**: Visualization and dashboards

**Dashboards**:
- Besu node health
- Block production metrics
- RPC performance metrics
- Oracle feed status
- CCIP monitoring

### Loki

**Purpose**: Log aggregation

**Features**:
- Centralized log collection
- Structured logging
- Log retention policies
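
To make centralized log collection concrete, here is a minimal sketch that pulls recent log lines through Loki's HTTP API. It assumes Loki is reachable on its default port `3100` at `localhost` (for example via a port-forward), and the `namespace` label in the usage example is an assumption that depends on how the log shipper labels streams.

```bash
# Assumption: Loki is reachable at this URL (3100 is Loki's default HTTP port).
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Fetch recent log lines matching the LogQL expression in $1.
# $2 caps the number of returned entries (default 20). When no start/end
# is given, Loki's query_range endpoint covers the last hour.
loki_query_range() {
  curl -sS -G "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}
```

For example, `loki_query_range '{namespace="besu-network"} |= "error"' 50` would return up to 50 lines containing "error", provided streams carry a `namespace` label.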

### Alertmanager

**Purpose**: Alert routing and notification

**Features**:
- Alert routing
- Notification channels (email, Slack, PagerDuty)
- Alert inhibition rules

## Setup Instructions

### 1. Deploy Prometheus

```bash
# Deploy Prometheus
kubectl apply -f monitoring/k8s/prometheus.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=prometheus
```

### 2. Deploy Grafana

```bash
# Add the Grafana chart repository (needed once before the first install)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy Grafana using Helm
helm install grafana grafana/grafana -n monitoring

# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
```

### 3. Deploy Loki

```bash
# Deploy Loki
kubectl apply -f monitoring/k8s/loki.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=loki
```

### 4. Deploy Alertmanager

```bash
# Deploy Alertmanager
kubectl apply -f monitoring/k8s/alertmanager.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=alertmanager
```

### 5. Configure Service Discovery

Prometheus uses Kubernetes service discovery to find the Besu pods in the `besu-network` namespace:

```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
```
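
One way to confirm that discovery works is to ask Prometheus which targets it found. The sketch below is a rough check, assuming Prometheus is reachable at `http://localhost:9090` (for example through a port-forward, which this guide does not describe) and using `grep` rather than a JSON parser to stay dependency-free.

```bash
# Assumption: Prometheus is reachable at this URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the scrape pool (job name) of every active target, one per line.
active_scrape_pools() {
  curl -sS "${PROM_URL}/api/v1/targets?state=active" \
    | grep -o '"scrapePool":"[^"]*"' \
    | cut -d'"' -f4
}

# Count active targets belonging to the job named in $1.
# A result of 0 suggests that discovery matched no pods.
count_targets_for_job() {
  active_scrape_pools | grep -c -x -F -- "$1"
}
```

Running `count_targets_for_job besu-nodes` should print a non-zero number once discovery works. With `role: pod`, Prometheus creates one target per declared container port, so the count can be higher than the number of Besu pods.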

## Dashboards

### Besu Node Dashboard

**Metrics**:
- Block production rate
- Transaction throughput
- Gas usage
- Peer connections
- Sync status

**Access**: Grafana → Dashboards → Besu Node Health

### RPC Performance Dashboard

**Metrics**:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Method distribution

**Access**: Grafana → Dashboards → RPC Performance

### Oracle Dashboard

**Metrics**:
- Update frequency
- Round completion time
- Deviation from sources
- Transmitter status

**Access**: Grafana → Dashboards → Oracle Status

### CCIP Dashboard

**Metrics**:
- Message throughput
- Cross-chain latency
- Fee accumulation
- Error rate

**Access**: Grafana → Dashboards → CCIP Monitoring

## Alerts

### Critical Alerts

- **Node Down**: Besu node not responding
- **Block Production Stopped**: No blocks produced in 30 seconds
- **High Error Rate**: Error rate > 5%
- **Oracle Down**: Oracle not updating
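
The first two critical alerts could be expressed as Prometheus alerting rules along the following lines. This is a sketch under stated assumptions, not the project's actual rule file: the file name, the `for: 1m` delay, and the use of Besu's `ethereum_blockchain_height` metric are illustrative choices. The error-rate and oracle alerts are left out because this guide does not name the metrics behind them.

```bash
# Assumptions: hypothetical file name, job label "besu-nodes", and the
# Besu metric ethereum_blockchain_height (verify it on your /metrics endpoint).
RULES_FILE="${RULES_FILE:-besu-critical-rules.example.yaml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: besu-critical
    rules:
      - alert: BesuNodeDown
        # `up` is 0 when Prometheus failed to scrape the target.
        expr: up{job="besu-nodes"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Besu node {{ $labels.instance }} is not responding"
      - alert: BlockProductionStopped
        # Chain height unchanged over the window. The window must span at
        # least two scrapes, so 30s assumes a scrape interval of 15s or less.
        expr: changes(ethereum_blockchain_height{job="besu-nodes"}[30s]) == 0
        labels:
          severity: critical
        annotations:
          summary: "No new block seen on {{ $labels.instance }} in 30 seconds"
EOF

# Validate the file when promtool is installed; skip otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE"
fi
```

The `severity: critical` label is what a route matching on `severity: critical` in Alertmanager would pick up.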

### Warning Alerts

- **High Latency**: P95 latency > 300ms
- **Low Throughput**: Throughput < 50% of normal
- **High Gas Usage**: Gas usage > 80% of limit
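
As an example of a warning-level rule, the gas usage threshold might be written as below. The metric names `besu_blockchain_chain_head_gas_used` and `besu_blockchain_chain_head_gas_limit` are assumptions about what the Besu nodes expose, and the file name and `for: 5m` delay are illustrative. The latency and throughput alerts depend on metric names and baselines this guide does not specify, so they are not sketched here.

```bash
# Assumptions: hypothetical file name; the two gas metrics below should be
# checked against the Besu /metrics output before relying on this rule.
WARN_RULES_FILE="${WARN_RULES_FILE:-besu-warning-rules.example.yaml}"

cat > "$WARN_RULES_FILE" <<'EOF'
groups:
  - name: besu-warning
    rules:
      - alert: HighGasUsage
        # Ratio of gas used to gas limit at the chain head, above 80%.
        expr: >
          besu_blockchain_chain_head_gas_used
            / besu_blockchain_chain_head_gas_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gas usage above 80% of the block gas limit on {{ $labels.instance }}"
EOF

# Validate the file when promtool is installed; skip otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$WARN_RULES_FILE"
fi
```

The `for` clause keeps a single full block from raising the alert; the ratio has to stay above the threshold for the whole period.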

### Alert Configuration

```yaml
# alertmanager-config.yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
```
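
The route above refers to two receivers, `default` and `critical-alerts`, which must be defined in the same file for Alertmanager to accept the configuration. The sketch below writes a complete example file and validates it with `amtool` if that tool is installed. The file name and the webhook URL are hypothetical placeholders, not values from this project.

```bash
# Assumption: hypothetical file name, chosen so no real config is overwritten.
AM_CONFIG="${AM_CONFIG:-alertmanager-config.example.yaml}"

cat > "$AM_CONFIG" <<'EOF'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
receivers:
  # A receiver without integrations is valid; its alerts notify nobody.
  - name: 'default'
  - name: 'critical-alerts'
    webhook_configs:
      # Hypothetical endpoint; replace with the real email, Slack,
      # or PagerDuty integration.
      - url: 'http://alert-webhook.example.internal/critical'
EOF

# Validate the file when amtool is installed; skip otherwise.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$AM_CONFIG"
fi
```

Every receiver named in a route has to appear under `receivers`; otherwise the configuration fails to load.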

## Troubleshooting

### Prometheus Not Scraping

**Symptoms**: No metrics in Prometheus

**Solution**:
1. Check service discovery configuration
2. Verify node labels match
3. Check network connectivity
4. Review Prometheus logs
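
The checks above might be run with helpers like the following. This is a sketch that assumes Prometheus is reachable at `http://localhost:9090` and reuses the `besu-network` namespace and `app=prometheus` label from the setup steps; the function names are illustrative.

```bash
# Assumption: Prometheus is reachable at this URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Steps 1 and 3: summarize the last scrape error of each failing target.
# Targets with an empty lastError (healthy ones) are skipped.
unhealthy_target_errors() {
  curl -sS "${PROM_URL}/api/v1/targets?state=active" \
    | grep -oE '"lastError":"[^"]+"' \
    | sort | uniq -c
}

# Step 2: show the labels on the pods Prometheus is supposed to discover.
besu_pod_labels() {
  kubectl get pods -n besu-network --show-labels
}

# Step 4: print the last N lines (default 100) of the Prometheus logs.
prometheus_logs() {
  kubectl logs -n monitoring -l app=prometheus --tail="${1:-100}"
}
```

Errors such as "connection refused" or "context deadline exceeded" from `unhealthy_target_errors` usually point to connectivity or a wrong port, while an empty target list points back to service discovery.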

### Grafana Not Showing Data

**Symptoms**: Dashboards show "No data"

**Solution**:
1. Verify Prometheus data source
2. Check query syntax
3. Verify time range
4. Check metric names
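
Steps 1 and 4 can be checked from the command line. The sketch below assumes Grafana and Prometheus are reachable on `localhost` ports `3000` and `9090` (for example through port-forwards), that the Grafana admin user is `admin`, and that `GRAFANA_PASSWORD` holds the password retrieved during setup; the function names are illustrative.

```bash
# Assumptions: both services are reachable at these URLs.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Step 1: list the data sources configured in Grafana (JSON output).
# Check that a Prometheus entry exists and that its URL is correct.
grafana_datasources() {
  curl -sS -u "admin:${GRAFANA_PASSWORD:?set GRAFANA_PASSWORD first}" \
    "${GRAFANA_URL}/api/datasources"
}

# Step 4: succeed if Prometheus has a metric named exactly $1.
# Caveat: this greps the raw JSON, so the envelope words "status",
# "success", and "data" would also match.
metric_exists() {
  curl -sS "${PROM_URL}/api/v1/label/__name__/values" | grep -qF "\"$1\""
}
```

For example, `metric_exists up && echo found` confirms the name used in a dashboard query before debugging the query itself.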

### Alerts Not Firing

**Symptoms**: Conditions met but no alerts

**Solution**:
1. Check alert rule syntax
2. Verify Alertmanager configuration
3. Check notification channels
4. Review Alertmanager logs
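
To narrow down where an alert gets lost, it helps to compare what Prometheus evaluates with what Alertmanager has received. The sketch below assumes both are reachable on `localhost` at their default ports `9090` and `9093`, and reuses the `app=alertmanager` label from the setup steps; the function names are illustrative.

```bash
# Assumptions: both services are reachable at these URLs.
PROM_URL="${PROM_URL:-http://localhost:9090}"
AM_URL="${AM_URL:-http://localhost:9093}"

# Count alerts per state (pending or firing) as seen by Prometheus.
prometheus_alert_states() {
  curl -sS "${PROM_URL}/api/v1/alerts" \
    | grep -oE '"state":"(pending|firing)"' \
    | sort | uniq -c
}

# Count alerts that Alertmanager currently holds.
alertmanager_alert_count() {
  curl -sS "${AM_URL}/api/v2/alerts" | grep -o '"fingerprint"' | wc -l | tr -d ' '
}

# Print the last N lines (default 100) of the Alertmanager logs.
alertmanager_logs() {
  kubectl logs -n monitoring -l app=alertmanager --tail="${1:-100}"
}
```

If Prometheus reports firing alerts but Alertmanager holds none, the link between the two is the likely culprit. If Alertmanager holds them but nothing is delivered, look at routing and the notification channels. Rule and configuration syntax can be checked separately with `promtool check rules` and `amtool check-config`.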

## Related Documentation

- [Architecture Documentation](../architecture/ARCHITECTURE.md)
- [Deployment Guide](../deployment/DEPLOYMENT.md)
- [Troubleshooting Guide](../guides/TROUBLESHOOTING.md)