# Monitoring Setup Guide

**Last Updated**: 2025-01-27  
**Status**: Active

This guide explains how to set up and configure the monitoring stack for the DeFi Oracle Meta Mainnet.

## Table of Contents

- [Overview](#overview)
- [Monitoring Stack](#monitoring-stack)
- [Setup Instructions](#setup-instructions)
- [Dashboards](#dashboards)
- [Alerts](#alerts)
- [Troubleshooting](#troubleshooting)
- [Related Documentation](#related-documentation)

## Overview

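The monitoring stack consists of the components below. The setup steps in this guide cover Prometheus, Grafana, Loki, and Alertmanager; Jaeger and OpenTelemetry are part of the stack but are not configured here: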
- **Prometheus** - Metrics collection
- **Grafana** - Visualization and dashboards
- **Loki** - Log aggregation
- **Alertmanager** - Alert routing and notification
- **Jaeger** - Distributed tracing
- **OpenTelemetry** - Observability framework

## Monitoring Stack

### Prometheus

**Purpose**: Metrics collection and storage

**Features**:
- Scrapes metrics from all Besu nodes
- Custom metrics for oracle updates
- Alert rules for node health

### Grafana

**Purpose**: Visualization and dashboards

**Dashboards**:
- Besu node health
- Block production metrics
- RPC performance metrics
- Oracle feed status
- CCIP monitoring

### Loki

**Purpose**: Log aggregation

**Features**:
- Centralized log collection
- Structured logging
- Log retention policies

### Alertmanager

**Purpose**: Alert routing and notification

**Features**:
- Alert routing
- Notification channels (email, Slack, PagerDuty)
- Alert inhibition rules

## Setup Instructions

### 1. Deploy Prometheus

```bash
# Deploy Prometheus
kubectl apply -f monitoring/k8s/prometheus.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=prometheus
```

### 2. Deploy Grafana

```bash
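# Add the Grafana chart repository (needed once before the first install)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy Grafana using Helm
helm install grafana grafana/grafana -n monitoring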

# Get admin password
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode
```
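
Once Grafana is running, it needs data sources for Prometheus and Loki before any dashboard can show data. One way to define them is a Grafana provisioning file. The sketch below assumes in-cluster services named `prometheus` and `loki` in the `monitoring` namespace on their default ports (9090 and 3100); the file name and service names are assumptions and may differ from what the manifests in this repository create.

```yaml
# grafana-datasources.yaml (hypothetical file name)
# Grafana data source provisioning; both URLs are assumed service addresses.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy          # Grafana's backend proxies queries to the data source
    url: http://prometheus.monitoring.svc:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100
```

With the Grafana Helm chart, content like this is typically supplied through the chart's `datasources` value rather than mounted by hand.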

### 3. Deploy Loki

```bash
# Deploy Loki
kubectl apply -f monitoring/k8s/loki.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=loki
```

### 4. Deploy Alertmanager

```bash
# Deploy Alertmanager
kubectl apply -f monitoring/k8s/alertmanager.yaml

# Verify deployment
kubectl get pods -n monitoring -l app=alertmanager
```
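
Prometheus only delivers alerts if its own configuration loads rule files and points at Alertmanager. A minimal sketch of that wiring, assuming an `alertmanager` service in the `monitoring` namespace on the default port 9093 and a hypothetical rules directory:

```yaml
# prometheus-config.yaml (fragment)
rule_files:
  - /etc/prometheus/rules/*.yaml              # hypothetical mount path for alert rule files
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.monitoring.svc:9093   # assumed service name and port
```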

### 5. Configure Service Discovery

Prometheus discovers Besu nodes through Kubernetes pod service discovery, restricted here to the `besu-network` namespace:

```yaml
# prometheus-config.yaml
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
```
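
With only `role: pod` and a namespace filter, Prometheus creates a target for every declared container port of every pod in `besu-network`. A common refinement is to keep only pods that opt in through annotations and to rewrite the scrape address to the metrics port. The sketch below assumes the pods carry `prometheus.io/scrape` and `prometheus.io/port` annotations (a widespread convention, not something this guide confirms) and that Besu runs with metrics enabled; Besu's default metrics port is 9545.

```yaml
# prometheus-config.yaml (extended sketch of the job above)
scrape_configs:
  - job_name: 'besu-nodes'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - besu-network
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true" (assumed convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the target address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Expose the pod name as a regular label on every scraped series
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

If a pod has no port annotation, the second rule does not match and the discovered address is left unchanged.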

## Dashboards

### Besu Node Dashboard

**Metrics**:
- Block production rate
- Transaction throughput
- Gas usage
- Peer connections
- Sync status

**Access**: Grafana → Dashboards → Besu Node Health

### RPC Performance Dashboard

**Metrics**:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Method distribution

**Access**: Grafana → Dashboards → RPC Performance

### Oracle Dashboard

**Metrics**:
- Update frequency
- Round completion time
- Deviation from sources
- Transmitter status

**Access**: Grafana → Dashboards → Oracle Status

### CCIP Dashboard

**Metrics**:
- Message throughput
- Cross-chain latency
- Fee accumulation
- Error rate

**Access**: Grafana → Dashboards → CCIP Monitoring

## Alerts

### Critical Alerts

- **Node Down**: Besu node not responding
- **Block Production Stopped**: No blocks produced in 30 seconds
- **High Error Rate**: Error rate > 5%
- **Oracle Down**: Oracle not updating
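
The first three critical conditions could be expressed as Prometheus alerting rules along these lines. This is a sketch under assumptions: `ethereum_blockchain_height` is assumed to be the chain-height gauge exposed by Besu, `rpc_requests_total` with a `status` label is a hypothetical metric name, and the `for` duration is a placeholder. The 30-second window also assumes a scrape interval well below 30 seconds, so that each window holds several samples.

```yaml
# besu-critical-rules.yaml (hypothetical file name)
groups:
  - name: besu-critical
    rules:
      - alert: NodeDown
        expr: up{job="besu-nodes"} == 0
        for: 1m                     # assumed grace period; not specified in this guide
        labels:
          severity: critical
        annotations:
          summary: "Besu node {{ $labels.instance }} is not responding"
      - alert: BlockProductionStopped
        # Fires when the chain height did not change within the last 30 seconds
        expr: changes(ethereum_blockchain_height[30s]) == 0
        labels:
          severity: critical
        annotations:
          summary: "No new blocks seen by {{ $labels.instance }} in 30 seconds"
      - alert: HighErrorRate
        # rpc_requests_total and its status label are hypothetical
        expr: |
          sum(rate(rpc_requests_total{status="error"}[5m]))
            / sum(rate(rpc_requests_total[5m])) > 0.05
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
```

The `severity: critical` label is what the Alertmanager route in the Alert Configuration section matches on.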

### Warning Alerts

- **High Latency**: p95 latency > 300ms
- **Low Throughput**: Throughput < 50% of normal
- **High Gas Usage**: Gas usage > 80% of limit
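
As an illustration, the latency threshold could be written as the rule below. It is a sketch only: `rpc_request_duration_seconds` is a hypothetical histogram metric (in seconds), and the rate window and `for` duration are placeholders.

```yaml
# besu-warning-rules.yaml (hypothetical file name)
groups:
  - name: besu-warning
    rules:
      - alert: HighLatency
        # p95 over all RPC requests; 0.3 seconds corresponds to the 300ms threshold
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(rpc_request_duration_seconds_bucket[5m]))
          ) > 0.3
        for: 5m                     # assumed; avoids alerting on short spikes
        labels:
          severity: warning
        annotations:
          summary: "p95 RPC latency above 300ms"
```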

### Alert Configuration

```yaml
# alertmanager-config.yaml
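route:
  group_by: ['alertname']   # batch alerts that share a name into one notification
  group_wait: 10s           # delay before the first notification for a new group
  group_interval: 10s       # delay before notifying about alerts added to a group
  repeat_interval: 12h      # re-send a still-firing alert after this long
  receiver: 'default'       # fallback receiver; must be defined under `receivers`
  routes:
    - matchers:             # `matchers` replaces the deprecated `match` (Alertmanager 0.22+)
        - severity="critical"
      receiver: 'critical-alerts'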
```
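
The route above refers to two receivers, `default` and `critical-alerts`, which must also be defined in the same file. A minimal sketch follows, assuming Slack for routine alerts and PagerDuty for critical ones. The webhook URL, channel, and routing key are placeholders, and the inhibition rule is one possible policy rather than the project's confirmed setup.

```yaml
# alertmanager-config.yaml (continued; all endpoints and keys are placeholders)
receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'
        channel: '#monitoring'
  - name: 'critical-alerts'
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_PAGERDUTY_KEY'

inhibit_rules:
  # Mute warning alerts while a critical alert with the same
  # alertname and instance labels is firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
```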

## Troubleshooting

### Prometheus Not Scraping

**Symptoms**: No metrics in Prometheus

**Solution**:
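1. Check the service discovery configuration (`kubernetes_sd_configs`) and confirm that the `besu-network` namespace is listed
2. Verify that the labels and annotations on the Besu pods match what the scrape job expects
3. Check network connectivity between Prometheus and the Besu metrics endpoints
4. Review the Prometheus logs and the Status → Targets page for scrape errors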

### Grafana Not Showing Data

**Symptoms**: Dashboards show "No data"

**Solution**:
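1. Verify that the Prometheus data source is configured in Grafana and reachable
2. Check the query syntax of the affected panels
3. Verify that the dashboard time range covers a period with data
4. Check that the metric names in the queries match the names Prometheus actually stores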

### Alerts Not Firing

**Symptoms**: Conditions met but no alerts

**Solution**:
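1. Check the alert rule syntax, for example with `promtool check rules <rules-file>`
2. Verify the Alertmanager configuration, for example with `amtool check-config <config-file>`, and confirm that Prometheus is configured to send alerts to Alertmanager
3. Check that each notification channel (email, Slack, PagerDuty) has a matching receiver with valid credentials
4. Review the Alertmanager logs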

## Related Documentation

- [Architecture Documentation](../architecture/ARCHITECTURE.md)
- [Deployment Guide](../deployment/DEPLOYMENT.md)
- [Troubleshooting Guide](../guides/TROUBLESHOOTING.md)
