Monitoring Setup Guide

Last Updated: 2025-01-27
Status: Active

This guide explains how to set up and configure the monitoring stack for the DeFi Oracle Meta Mainnet.

Overview

The monitoring stack consists of:

  • Prometheus - Metrics collection
  • Grafana - Visualization and dashboards
  • Loki - Log aggregation
  • Alertmanager - Alert routing and notification
  • Jaeger - Distributed tracing
  • OpenTelemetry - Observability framework

Monitoring Stack

Prometheus

Purpose: Metrics collection and storage

Features:

  • Scrapes metrics from all Besu nodes
  • Custom metrics for oracle updates
  • Alert rules for node health

Grafana

Purpose: Visualization and dashboards

Dashboards:

  • Besu node health
  • Block production metrics
  • RPC performance metrics
  • Oracle feed status
  • CCIP monitoring

Loki

Purpose: Log aggregation

Features:

  • Centralized log collection
  • Structured logging
  • Log retention policies

Alertmanager

Purpose: Alert routing and notification

Features:

  • Alert routing
  • Notification channels (email, Slack, PagerDuty)
  • Alert inhibition rules

Setup Instructions

1. Deploy Prometheus

# Deploy Prometheus
kubectl apply -f monitoring/k8s/prometheus.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=prometheus

2. Deploy Grafana

# Deploy Grafana using Helm
helm install grafana grafana/grafana -n monitoring

# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
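
The `grafana/grafana` chart reference above resolves only once the Grafana chart repository is registered with Helm. A minimal sketch of that prerequisite follows; the function name is illustrative, and it assumes the public Grafana chart repository is reachable from the host running Helm.

```shell
# Register the Grafana chart repository so that `grafana/grafana` resolves,
# then refresh the local chart index.
# Assumption: the public Grafana Helm repository is reachable from this host.
add_grafana_repo() {
  helm repo add grafana https://grafana.github.io/helm-charts &&
    helm repo update
}
```

Calling `add_grafana_repo` once before `helm install` should be enough; `helm repo add` with the same name and URL is safe to repeat.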

3. Deploy Loki

# Deploy Loki
kubectl apply -f monitoring/k8s/loki.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=loki

4. Deploy Alertmanager

# Deploy Alertmanager
kubectl apply -f monitoring/k8s/alertmanager.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=alertmanager
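
The `kubectl get pods` checks above show a point-in-time snapshot. As a sketch, the helpers below block until the pods are actually Ready, which is handy in scripts. The function names and the 120s default timeout are illustrative; the `app=` labels are the ones used in the verification commands above.

```shell
# Block until every pod labelled app=<name> in the monitoring namespace
# reports Ready, or fail once the timeout (default 120s) expires.
# kubectl wait also fails if no pod matches the selector.
wait_for_component() {
  kubectl wait --for=condition=Ready pod \
    -l "app=$1" -n monitoring --timeout="${2:-120s}"
}

# Check the three manifest-deployed components in turn and stop at the
# first one that does not become Ready.
wait_for_stack() {
  for component in prometheus loki alertmanager; do
    wait_for_component "$component" || return 1
  done
}
```

Grafana is left out because the labels set by its Helm chart are not shown in this guide.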

5. Configure Service Discovery

Prometheus discovers Besu pods through Kubernetes service discovery, restricted here to the besu-network namespace:

# prometheus-config.yaml
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
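
With `role: pod`, Prometheus considers every pod in the namespace a candidate target. One common way to narrow that down is a relabel rule that keeps only annotated pods. The sketch below writes such a variant to an example file. It assumes the Besu pods carry the `prometheus.io/scrape: "true"` annotation, which is a widespread convention and not something this guide's manifests are known to set; the output file name is also illustrative.

```shell
# Write a scrape config that extends the besu-nodes job with a relabel rule
# keeping only pods annotated prometheus.io/scrape: "true".
# Assumption: the Besu pods carry that annotation.
SCRAPE_FILE="${SCRAPE_FILE:-prometheus-config.example.yaml}"

cat > "$SCRAPE_FILE" <<'EOF'
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
EOF
```

If `promtool` is installed, `promtool check config prometheus-config.example.yaml` can validate the result before it is loaded into Prometheus.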

Dashboards

Besu Node Dashboard

Metrics:

  • Block production rate
  • Transaction throughput
  • Gas usage
  • Peer connections
  • Sync status

Access: Grafana → Dashboards → Besu Node Health

RPC Performance Dashboard

Metrics:

  • Request rate
  • Response time (p50, p95, p99)
  • Error rate
  • Method distribution

Access: Grafana → Dashboards → RPC Performance

Oracle Dashboard

Metrics:

  • Update frequency
  • Round completion time
  • Deviation from sources
  • Transmitter status

Access: Grafana → Dashboards → Oracle Status

CCIP Dashboard

Metrics:

  • Message throughput
  • Cross-chain latency
  • Fee accumulation
  • Error rate

Access: Grafana → Dashboards → CCIP Monitoring

Alerts

Critical Alerts

  • Node Down: Besu node not responding
  • Block Production Stopped: No blocks produced in 30 seconds
  • High Error Rate: Error rate > 5%
  • Oracle Down: Oracle not updating
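
As a sketch, the first two critical alerts could be expressed as Prometheus alerting rules like the ones written below. Several details are assumptions rather than part of this guide: the rule and file names, the 1m hold time on the node alert, the `ethereum_blockchain_height` metric (a gauge Besu is expected to expose), and a scrape interval well under 30 seconds so that the 30s window contains more than one sample.

```shell
# Write a Prometheus rule file covering the Node Down and
# Block Production Stopped alerts.
# Assumptions: the scrape job is named besu-nodes, Besu exposes the
# ethereum_blockchain_height gauge, and the scrape interval is well under 30s.
RULES_FILE="${RULES_FILE:-besu-critical-rules.example.yaml}"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: besu-critical
    rules:
      - alert: BesuNodeDown
        expr: up{job="besu-nodes"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Besu node {{ $labels.instance }} is not responding"
      - alert: BlockProductionStopped
        expr: changes(ethereum_blockchain_height{job="besu-nodes"}[30s]) == 0
        labels:
          severity: critical
        annotations:
          summary: "No new block seen by {{ $labels.instance }} in 30 seconds"
EOF
```

The `severity: critical` label is what a route matching on severity would pick up. `promtool check rules besu-critical-rules.example.yaml` can validate the syntax; how the file is mounted into Prometheus depends on monitoring/k8s/prometheus.yaml, which is not shown here.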

Warning Alerts

  • High Latency: P95 latency > 300ms
  • Low Throughput: Throughput < 50% of normal
  • High Gas Usage: Gas usage > 80% of limit

Alert Configuration

# alertmanager-config.yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
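
The route above names two receivers, `default` and `critical-alerts`, and Alertmanager rejects a configuration whose route refers to a receiver that is not defined. The sketch below writes a complete example file with both. The choice of Slack for critical alerts, the channel name, the webhook URL, and the output file name are all placeholders.

```shell
# Write a complete Alertmanager config: the route from this guide plus the
# two receivers it references.
# Assumptions: Slack is the critical channel; the channel name and webhook
# URL are placeholders that must be replaced.
AM_CONFIG="${AM_CONFIG:-alertmanager-config.example.yaml}"

cat > "$AM_CONFIG" <<'EOF'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
receivers:
  - name: 'default'
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#critical-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
EOF
```

A receiver with only a name, like `default` here, accepts alerts and sends nothing. If `amtool` is available, `amtool check-config alertmanager-config.example.yaml` can validate the file.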

Troubleshooting

Prometheus Not Scraping

Symptoms: No metrics in Prometheus

Solution:

  1. Check service discovery configuration
  2. Verify that pod labels match what the scrape configuration expects
  3. Check network connectivity
  4. Review Prometheus logs
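
A quick way to work through the first three checks is to ask Prometheus which targets it has discovered and whether they are healthy. The helper below is a sketch: the function name is illustrative, and the usage lines assume a Service named `prometheus` on port 9090, which may differ from what monitoring/k8s/prometheus.yaml defines.

```shell
# Count scrape targets in a given health state ("up" or "down") from the
# JSON returned by the Prometheus /api/v1/targets endpoint, read on stdin.
# Assumption: the response is compact JSON with no space after the colon;
# use a real JSON parser for anything beyond a quick check.
count_targets() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Usage (the Service name and port are assumptions):
#   kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
#   curl -s http://localhost:9090/api/v1/targets | count_targets up
#   curl -s http://localhost:9090/api/v1/targets | count_targets down
```

Zero targets in both states points at service discovery or labels; targets that exist but are down point at network connectivity or the metrics endpoint on the node.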

Grafana Not Showing Data

Symptoms: Dashboards show "No data"

Solution:

  1. Verify that the Prometheus data source is configured in Grafana and passes its connection test
  2. Check query syntax
  3. Verify time range
  4. Check metric names
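
To tell a Grafana problem from a Prometheus problem, it can help to run the same query against Prometheus directly. The sketch below wraps the instant-query endpoint of the Prometheus HTTP API; the function name and the `PROM_URL` variable are illustrative, and Prometheus is assumed to be reachable at that address, for example through `kubectl port-forward`.

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print the
# raw JSON response. --data-urlencode takes care of characters such as
# braces and quotes in the query.
# Assumption: Prometheus is reachable at PROM_URL.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'up{job="besu-nodes"}'
```

If the response contains an empty `"result":[]`, the metric is missing in Prometheus itself and the dashboard is not at fault; if data comes back, the data source, query, or time range in Grafana is the more likely cause.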

Alerts Not Firing

Symptoms: Conditions met but no alerts

Solution:

  1. Check alert rule syntax
  2. Verify Alertmanager configuration
  3. Check notification channels
  4. Review Alertmanager logs
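
Steps 2 and 3 can be exercised without waiting for a real incident by posting a synthetic alert straight to Alertmanager. The sketch below uses the Alertmanager v2 API; the function names, the alert name, and the `AM_URL` variable are illustrative, and Alertmanager is assumed to be reachable at that address, for example through `kubectl port-forward` to port 9093.

```shell
# Build the JSON body for a synthetic alert with the given severity label.
# The Alertmanager v2 API accepts a JSON array of alerts on POST /api/v2/alerts.
build_test_alert() {
  printf '[{"labels":{"alertname":"MonitoringTestAlert","severity":"%s"},"annotations":{"summary":"Synthetic alert to verify routing"}}]' "$1"
}

# Post the synthetic alert.
# Assumption: Alertmanager is reachable at AM_URL.
send_test_alert() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(build_test_alert "$1")" \
    "${AM_URL:-http://localhost:9093}/api/v2/alerts"
}

# Usage:
#   send_test_alert critical
```

If the synthetic alert reaches the expected channel, routing and notifications work and the problem lies in the Prometheus rules or in the connection from Prometheus to Alertmanager. If it does not, the Alertmanager configuration and logs are the place to look.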
