Monitoring Setup Guide
Last Updated: 2025-01-27
Status: Active
This guide explains how to set up and configure the monitoring stack for the DeFi Oracle Meta Mainnet.
Overview
The monitoring stack consists of:
- Prometheus - Metrics collection
- Grafana - Visualization and dashboards
- Loki - Log aggregation
- Alertmanager - Alert routing and notification
- Jaeger - Distributed tracing
- OpenTelemetry - Observability framework
Monitoring Stack
Prometheus
Purpose: Metrics collection and storage
Features:
- Scrapes metrics from all Besu nodes
- Custom metrics for oracle updates
- Alert rules for node health
Grafana
Purpose: Visualization and dashboards
Dashboards:
- Besu node health
- Block production metrics
- RPC performance metrics
- Oracle feed status
- CCIP monitoring
Loki
Purpose: Log aggregation
Features:
- Centralized log collection
- Structured logging
- Log retention policies
Alertmanager
Purpose: Alert routing and notification
Features:
- Alert routing
- Notification channels (email, Slack, PagerDuty)
- Alert inhibition rules
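To illustrate inhibition, the fragment below is a minimal sketch of an Alertmanager inhibit rule that mutes warning-level alerts while a critical alert is firing for the same target. The `severity` values follow the labels used elsewhere in this guide; matching on the `instance` label is an assumption about how the alerts are labeled. The `source_matchers`/`target_matchers` syntax applies to recent Alertmanager releases; older versions use `source_match`/`target_match` instead.

```yaml
# alertmanager-config.yaml (fragment, illustrative)
inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'
    target_matchers:
      - 'severity = "warning"'
    # Inhibit only when both alerts carry the same instance label (assumed label)
    equal: ['instance']
```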
Setup Instructions
1. Deploy Prometheus
# Deploy Prometheus
kubectl apply -f monitoring/k8s/prometheus.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=prometheus
2. Deploy Grafana
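# Add the Grafana chart repository (skip if it is already configured)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Deploy Grafana using Helm
helm install grafana grafana/grafana -n monitoring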
# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
3. Deploy Loki
# Deploy Loki
kubectl apply -f monitoring/k8s/loki.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=loki
4. Deploy Alertmanager
# Deploy Alertmanager
kubectl apply -f monitoring/k8s/alertmanager.yaml
# Verify deployment
kubectl get pods -n monitoring -l app=alertmanager
5. Configure Service Discovery
Prometheus discovers Besu nodes through Kubernetes pod service discovery, scoped to the besu-network namespace:
# prometheus-config.yaml
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
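The job above scrapes every pod in the `besu-network` namespace. A common refinement is to keep only annotated pods and rewrite the scrape address to the annotated port. The sketch below assumes the Besu pods carry the conventional `prometheus.io/scrape` and `prometheus.io/port` annotations, which this guide does not confirm.

```yaml
# prometheus-config.yaml (illustrative extension of the besu-nodes job)
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true" (assumed annotation)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the port named in prometheus.io/port (assumed annotation)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Expose the pod name as a label for dashboards and alerts
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Pods without the annotation are dropped from the target list, so a missing annotation is one possible cause of an empty Targets page.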
Dashboards
Besu Node Dashboard
Metrics:
- Block production rate
- Transaction throughput
- Gas usage
- Peer connections
- Sync status
Access: Grafana → Dashboards → Besu Node Health
RPC Performance Dashboard
Metrics:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Method distribution
Access: Grafana → Dashboards → RPC Performance
Oracle Dashboard
Metrics:
- Update frequency
- Round completion time
- Deviation from sources
- Transmitter status
Access: Grafana → Dashboards → Oracle Status
CCIP Dashboard
Metrics:
- Message throughput
- Cross-chain latency
- Fee accumulation
- Error rate
Access: Grafana → Dashboards → CCIP Monitoring
Alerts
Critical Alerts
- Node Down: Besu node not responding
- Block Production Stopped: No blocks produced in 30 seconds
- High Error Rate: Error rate > 5%
- Oracle Down: Oracle not updating
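The critical conditions above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not the project's actual rule file. `up` is a built-in Prometheus series and `besu-nodes` is the job name used in this guide. The names `ethereum_blockchain_height`, `rpc_requests_total`, and `oracle_last_update_timestamp_seconds` are assumed metric names. The `for` durations and the 10-minute oracle staleness window are placeholders, since the guide does not define them.

```yaml
# critical-rules.yaml (illustrative; metric names other than `up` are assumptions)
groups:
  - name: critical
    rules:
      - alert: NodeDown
        expr: up{job="besu-nodes"} == 0
        for: 1m  # placeholder duration
        labels:
          severity: critical
        annotations:
          summary: "Besu node {{ $labels.instance }} is not responding"
      - alert: BlockProductionStopped
        # Chain height unchanged for 30 seconds; the scrape interval must be
        # well below 30s so the window contains several samples
        expr: changes(ethereum_blockchain_height[30s]) == 0
        labels:
          severity: critical
      - alert: HighErrorRate
        # Share of failed RPC requests above 5% over the last 5 minutes
        expr: |
          sum(rate(rpc_requests_total{status="error"}[5m]))
            / sum(rate(rpc_requests_total[5m])) > 0.05
        for: 5m  # placeholder duration
        labels:
          severity: critical
      - alert: OracleDown
        # No oracle update for more than 10 minutes (placeholder threshold)
        expr: time() - oracle_last_update_timestamp_seconds > 600
        labels:
          severity: critical
```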
Warning Alerts
- High Latency: P95 latency > 300ms
- Low Throughput: Throughput < 50% of normal
- High Gas Usage: Gas usage > 80% of limit
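The warning thresholds above might translate into rules like the following sketch. All metric names here (`rpc_request_duration_seconds_bucket`, `besu_transactions_total`, `besu_block_gas_used`, `besu_block_gas_limit`) are hypothetical. Treating "normal" throughput as the one-day average of the same series is also an assumption, not something the guide specifies.

```yaml
# warning-rules.yaml (illustrative; all metric names are hypothetical)
groups:
  - name: warning
    rules:
      - alert: HighLatency
        # P95 request latency above 300 ms, computed from a histogram
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(rpc_request_duration_seconds_bucket[5m]))) > 0.3
        for: 5m  # placeholder duration
        labels:
          severity: warning
      - alert: LowThroughput
        # Current rate below 50% of the one-day average (assumed baseline)
        expr: |
          sum(rate(besu_transactions_total[5m]))
            < 0.5 * sum(rate(besu_transactions_total[1d]))
        labels:
          severity: warning
      - alert: HighGasUsage
        # Gas used exceeds 80% of the block gas limit
        expr: besu_block_gas_used / besu_block_gas_limit > 0.8
        labels:
          severity: warning
```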
Alert Configuration
# alertmanager-config.yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
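The routing tree above names two receivers, `default` and `critical-alerts`. Both must be defined in the same file, or Alertmanager will refuse to load the configuration. The fragment below is a minimal sketch using the notification channels listed earlier in this guide. Every address, URL, and key is a placeholder, and email delivery additionally needs SMTP settings (for example `smtp_smarthost` and `smtp_from` under `global`).

```yaml
# alertmanager-config.yaml (fragment, illustrative; all targets are placeholders)
receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com'
  - name: 'critical-alerts'
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<placeholder>'
        channel: '#alerts-critical'
```

Alerts labeled `severity: critical` go to `critical-alerts`; everything else falls through to `default`.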
Troubleshooting
Prometheus Not Scraping
Symptoms: No metrics in Prometheus
Solution:
- Check service discovery configuration
- Verify node labels match
- Check network connectivity
- Review Prometheus logs
Grafana Not Showing Data
Symptoms: Dashboards show "No data"
Solution:
- Verify Prometheus data source
- Check query syntax
- Verify time range
- Check metric names
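When verifying the data source, it can help to compare it against a provisioning file. The following is a minimal sketch of Grafana's data source provisioning format. The URL assumes Prometheus is exposed as a Service named `prometheus` on port 9090 in the `monitoring` namespace, which this guide does not confirm.

```yaml
# datasources.yaml (Grafana provisioning, illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumed in-cluster Service address; adjust to the actual Service name
    url: http://prometheus.monitoring.svc.cluster.local:9090
    isDefault: true
```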
Alerts Not Firing
Symptoms: Conditions met but no alerts
Solution:
- Check alert rule syntax
- Verify Alertmanager configuration
- Check notification channels
- Review Alertmanager logs