# Operational Runbook

## Table of Contents

1. [System Overview](#system-overview)
2. [Monitoring & Alerts](#monitoring--alerts)
3. [Common Operations](#common-operations)
4. [Troubleshooting](#troubleshooting)
5. [Disaster Recovery](#disaster-recovery)
6. [Emergency Contacts](#emergency-contacts)
7. [Change Management](#change-management)

## System Overview

### Architecture

- **Application**: Node.js/TypeScript Express server
- **Database**: PostgreSQL 14+
- **Cache/Sessions**: Redis (optional)
- **Metrics**: Prometheus format on `/metrics`
- **Health Check**: `/health` endpoint

### Key Endpoints

- API Base: `/api/v1`
- Terminal UI: `/`
- Health: `/health`
- Metrics: `/metrics`
- API Docs: `/api-docs`

## Monitoring & Alerts

### Key Metrics to Monitor

#### Payment Metrics

- `payments_initiated_total` - Total payments initiated
- `payments_approved_total` - Total payments approved
- `payments_completed_total` - Total payments completed
- `payments_failed_total` - Total payments failed
- `payment_processing_duration_seconds` - Payment processing latency

#### TLS Metrics

- `tls_connections_active` - Active TLS connections
- `tls_connection_errors_total` - TLS connection errors
- `tls_acks_received_total` - ACKs received
- `tls_nacks_received_total` - NACKs received

#### System Metrics

- `http_request_duration_seconds` - HTTP request latency
- `process_cpu_user_seconds_total` - Cumulative user CPU time, in seconds
- `process_resident_memory_bytes` - Resident memory size, in bytes

### Alert Thresholds

**Critical Alerts:**

- Payment failure rate > 5% over a 5-minute window
- TLS connection errors > 10 within 1 minute
- Database connection pool exhaustion
- Health check failing

**Warning Alerts:**

- Payment processing latency p95 > 30s
- Unmatched reconciliation items > 10
- TLS circuit breaker in OPEN state

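The payment failure-rate threshold can be checked by hand when the alerting stack itself is in doubt. The sketch below is one way to do it, not the alerting rule actually deployed: it assumes the rate is defined as failed payments divided by initiated payments over the window (the runbook does not define the denominator), and the function name `failure_rate_exceeded` is hypothetical. The inputs are the increases in `payments_failed_total` and `payments_initiated_total` between two readings taken five minutes apart.

```bash
# Return 0 (alert) when failed/initiated over the window exceeds the threshold.
# Arguments: increase in failed payments, increase in initiated payments,
# threshold in percent (defaults to 5, the critical threshold listed above).
# ASSUMPTION: failure rate = failed / initiated; adjust if your definition differs.
failure_rate_exceeded() {
  awk -v failed="$1" -v initiated="$2" -v limit="${3:-5}" 'BEGIN {
    if (initiated <= 0) exit 1                       # no traffic in the window
    # Cross-multiplied form of (failed / initiated) * 100 > limit,
    # which avoids floating-point division.
    exit !(failed * 100 > limit * initiated)
  }'
}
```

For example, `failure_rate_exceeded 6 100 && echo "CRITICAL"` prints `CRITICAL`, while 5 failures out of 100 does not trigger because the threshold is strictly greater than 5%.
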
## Common Operations

### Start System

```bash
# Using npm
npm start

# Using Docker Compose
docker-compose up -d

# Verify health
curl http://localhost:3000/health
```

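A single `curl` right after startup can fail simply because the server is still booting. A minimal sketch of a polling helper follows; the function name `wait_for_health` and the default of 30 attempts at 2-second intervals are illustrative choices, not values taken from the application.

```bash
# Poll the health endpoint until it responds without an HTTP error,
# or give up after a fixed number of attempts.
# Arguments: URL (default http://localhost:3000/health),
#            attempts (default 30), delay in seconds (default 2).
wait_for_health() {
  local url="${1:-http://localhost:3000/health}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: exit non-zero on HTTP 4xx/5xx; -sS: no progress meter, keep errors
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

Typical use would be `docker-compose up -d && wait_for_health`, so that a script stops with a non-zero exit status if the service never comes up.
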
### Stop System

```bash
# Graceful shutdown
docker-compose down

# Or send SIGTERM to process
kill -TERM <pid>
```

### Check System Status

```bash
# Health check
curl http://localhost:3000/health

# Metrics
curl http://localhost:3000/metrics

# Database connection (quote the URL so special characters survive the shell)
psql "$DATABASE_URL" -c "SELECT 1"
```

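The raw `/metrics` output is long, so it helps to pull out a single counter. The helper below is a sketch under two assumptions: the endpoint emits the Prometheus text format without trailing timestamps (the last field on each line is read as the value), and summing across label sets is what you want. The name `metric_value` is hypothetical.

```bash
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Label sets are ignored, so series that differ only by label are added together.
# Exits non-zero if the metric name does not appear at all.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)            # drop the {label="..."} part, if any
      if (metric == name) { sum += $NF; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}
```

For example, `curl -s http://localhost:3000/metrics | metric_value payments_failed_total` prints the current total of failed payments.
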
### View Logs

```bash
# Application logs
tail -f logs/application-*.log

# Docker logs
docker-compose logs -f app

# Audit logs (database)
psql "$DATABASE_URL" -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"
```

### Run Reconciliation

```bash
# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
  -H "Authorization: Bearer <token>"

# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
  -H "Authorization: Bearer <token>"
```

### Database Operations

```bash
# Run migrations
npm run migrate

# Rollback last migration
npm run migrate:rollback

# Seed operators
npm run seed

# Backup database
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d).sql"

# Restore database
psql "$DATABASE_URL" < backup_20240101.sql
```

## Troubleshooting

### Payment Stuck in Processing

**Symptoms:**

- Payment status is `APPROVED` but not progressing
- No ledger posting or message generation

**Diagnosis:**

```sql
SELECT id, status, created_at, updated_at
FROM payments
WHERE status = 'APPROVED'
  AND updated_at < NOW() - INTERVAL '5 minutes';
```

**Resolution:**

1. Check application logs for errors
2. Verify compliance screening status
3. Check ledger adapter connectivity
4. Manually trigger processing if needed

### TLS Connection Issues

**Symptoms:**

- `tls_connection_errors_total` increasing
- Circuit breaker in OPEN state
- Messages not transmitting

**Diagnosis:**

```bash
# Check TLS pool stats (-s suppresses curl's progress meter in the pipe)
curl -s http://localhost:3000/metrics | grep tls

# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com
```

**Resolution:**

1. Verify receiver IP/port configuration
2. Check certificate validity
3. Verify network connectivity
4. Review TLS pool logs
5. Reset circuit breaker if needed

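The certificate validity check in the resolution steps can be scripted with `openssl x509 -checkend`, which exits non-zero when a certificate expires within the given number of seconds. The wrapper below is a sketch; the function name `cert_valid_for_days` is hypothetical, and it expects the certificate to be saved as a PEM file first.

```bash
# Return 0 when the PEM certificate in file $1 will still be valid $2 days from now.
# Returns non-zero if it expires sooner (or has already expired).
cert_valid_for_days() {
  local pem_file="$1"
  local days="$2"
  # -checkend takes a number of seconds; 86400 seconds per day
  openssl x509 -noout -checkend "$(( days * 86400 ))" -in "$pem_file" >/dev/null
}
```

To check the receiver used in the diagnosis step, save its certificate with `openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com </dev/null 2>/dev/null | openssl x509 > receiver.pem`, then run `cert_valid_for_days receiver.pem 14 || echo "certificate expires within 14 days"`. The 14-day margin is an arbitrary example.
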
### Database Connection Issues

**Symptoms:**

- Health check shows database error
- High connection pool usage
- Query timeouts

**Diagnosis:**

```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check database-wide statistics (numbackends, commits, rollbacks, deadlocks)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';
```

**Resolution:**

1. Increase connection pool size in config
2. Check for long-running queries
3. Restart database if needed
4. Review connection pool settings

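The check for long-running queries can be done against `pg_stat_activity`. The helper below is a sketch that only generates the SQL, so it can be reviewed before it is run; the name `long_running_sql` and the 80-character query preview are illustrative choices.

```bash
# Print SQL that lists non-idle statements running longer than N minutes.
# The argument must be a plain non-negative integer; anything else is rejected
# so that arbitrary text is never spliced into the query.
long_running_sql() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: long_running_sql <minutes>" >&2; return 2 ;;
  esac
  cat <<SQL
SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - INTERVAL '$1 minutes'
ORDER BY runtime DESC;
SQL
}
```

Run it with `long_running_sql 5 | psql "$DATABASE_URL"`. The `pid` column identifies the backend to investigate before deciding whether to cancel the statement.
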
### Reconciliation Exceptions

**Symptoms:**

- High number of unmatched payments
- Aging items accumulating

**Resolution:**

1. Review reconciliation report
2. Check exception queue
3. Manually reconcile exceptions
4. Investigate root cause (missing ACK, ledger mismatch, etc.)

## Disaster Recovery

### Backup Procedures

**Daily Backups:**

```bash
# Database backup
pg_dump "$DATABASE_URL" | gzip > "backups/dbis_core_$(date +%Y%m%d).sql.gz"

# Audit logs export (for compliance)
psql "$DATABASE_URL" -c "\copy audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
```

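One caveat with piping `pg_dump` straight into `gzip`: the pipeline's exit status is that of `gzip`, so a failed dump can still leave a small but valid-looking archive behind. The sketch below dumps to a temporary file first and only compresses on success. The function name `backup_database` is hypothetical; the output path follows the naming used in the daily backup command.

```bash
# Write a gzip-compressed dump to <dir>/dbis_core_YYYYMMDD.sql.gz and print its path.
# Fails without creating the archive if pg_dump errors out or produces no output.
# Argument: backup directory (default: backups). Requires DATABASE_URL to be set.
backup_database() {
  local dir="${1:-backups}"
  local out tmp
  out="$dir/dbis_core_$(date +%Y%m%d).sql.gz"
  mkdir -p "$dir" || return 1
  tmp=$(mktemp "$dir/.dump.XXXXXX") || return 1
  if pg_dump "$DATABASE_URL" > "$tmp" && [ -s "$tmp" ]; then
    # Compress only after the dump is known to be complete
    gzip -c "$tmp" > "$out" && rm -f "$tmp" && echo "$out"
  else
    rm -f "$tmp"
    echo "backup failed: pg_dump returned an error or produced no output" >&2
    return 1
  fi
}
```

A scheduled job can then run `backup_database backups || exit 1` so that a failed dump raises an error in the scheduler instead of passing silently.
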
### Recovery Procedures

**Database Recovery:**

```bash
# Stop application
docker-compose stop app

# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql "$DATABASE_URL"

# Run migrations
npm run migrate

# Restart application
docker-compose start app
```

### Data Retention

- **Audit Logs**: 7-10 years (configurable)
- **Payment Records**: Indefinite (archived after 7 years)
- **Application Logs**: 30 days

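The runbook does not say how the 30-day application log retention is enforced. If no log rotation tool already handles it, a cleanup along the following lines is one option. This is a sketch: the function name `prune_old_logs` is hypothetical, and the `application-*.log` pattern is taken from the log path used in the View Logs commands.

```bash
# Delete application log files in a directory that were last modified
# more than N days ago, printing each file as it is removed.
# Arguments: log directory (default: logs), age in days (default: 30).
prune_old_logs() {
  local dir="${1:-logs}"
  local days="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'application-*.log' \
    -mtime +"$days" -print -delete
}
```

Running `prune_old_logs logs 30` from the application directory removes only files matching the pattern; other files in the directory are left alone. Drop `-delete` for a dry run that just lists what would be removed.
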
### Failover Procedures

1. **Application Failover:**
   - Deploy to secondary server
   - Update load balancer
   - Verify health checks

2. **Database Failover:**
   - Promote replica to primary
   - Update `DATABASE_URL`
   - Restart application

## Emergency Contacts

- **System Administrator**: [Contact]
- **Database Administrator**: [Contact]
- **Security Team**: [Contact]
- **On-Call Engineer**: [Contact]

## Change Management

All changes to production must:

1. Be tested in a staging environment
2. Have a documented rollback plan
3. Be approved by the technical lead
4. Be performed during a maintenance window
5. Be monitored post-deployment