dbis_core-lite/docs/operations/runbook.md
2026-02-09 21:51:45 -08:00

Operational Runbook

Table of Contents

  1. System Overview
  2. Monitoring & Alerts
  3. Common Operations
  4. Troubleshooting
  5. Disaster Recovery

System Overview

Architecture

  • Application: Node.js/TypeScript Express server
  • Database: PostgreSQL 14+
  • Cache/Sessions: Redis (optional)
  • Metrics: Prometheus text format, exposed at /metrics
  • Health Check: /health endpoint

Key Endpoints

  • API Base: /api/v1
  • Terminal UI: /
  • Health: /health
  • Metrics: /metrics
  • API Docs: /api-docs

Monitoring & Alerts

Key Metrics to Monitor

Payment Metrics

  • payments_initiated_total - Total payments initiated
  • payments_approved_total - Total payments approved
  • payments_completed_total - Total payments completed
  • payments_failed_total - Total payments failed
  • payment_processing_duration_seconds - Processing latency

TLS Metrics

  • tls_connections_active - Active TLS connections
  • tls_connection_errors_total - TLS connection errors
  • tls_acks_received_total - ACKs received
  • tls_nacks_received_total - NACKs received

System Metrics

  • http_request_duration_seconds - HTTP request latency
  • process_cpu_user_seconds_total - Cumulative user CPU time in seconds (take the rate for utilization)
  • process_resident_memory_bytes - Resident memory in bytes

Alert Thresholds

Critical Alerts:

  • Payment failure rate > 5% in 5 minutes
  • TLS connection errors > 10 in 1 minute
  • Database connection pool exhaustion
  • Health check failing

Warning Alerts:

  • Payment processing latency p95 > 30s
  • Unmatched reconciliation items > 10
  • TLS circuit breaker OPEN state
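The payment failure-rate threshold above can be checked by hand from two counter readings. The sketch below is illustrative only: failure_rate_exceeds is a hypothetical helper, and using the increase of payments_initiated_total as the denominator is an assumption, since the runbook does not define how the rate is computed.

```shell
# Hypothetical helper, not part of the application.
# failure_rate_exceeds FAILED TOTAL [THRESHOLD_PCT]
# FAILED and TOTAL are counter increases over the alert window
# (for example the last 5 minutes). Exits 0 when FAILED/TOTAL is
# strictly greater than THRESHOLD_PCT percent (default 5).
failure_rate_exceeds() {
  failed=$1 total=$2 threshold_pct=${3:-5}
  awk -v f="$failed" -v t="$total" -v p="$threshold_pct" 'BEGIN {
    if (t <= 0) exit 1          # no traffic in the window: nothing to alert on
    exit !(f * 100 > p * t)     # cross-multiplied to avoid a division
  }'
}
```

For example, failure_rate_exceeds 6 100 succeeds (6% is above 5%), while failure_rate_exceeds 5 100 does not, because the comparison is strict.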

Common Operations

Start System

# Using npm
npm start

# Using Docker Compose
docker-compose up -d

# Verify health
curl http://localhost:3000/health
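The service may take a few seconds to accept requests after start, so a single health check can fail spuriously. A minimal sketch of a retry wrapper follows; retry is a hypothetical helper, and the attempt count and delay are assumptions to tune per environment.

```shell
# Hypothetical helper, not part of the application.
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Exits 0 on the first success, 1 if every attempt failed.
retry() {
  attempts=$1 delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Usage: retry 30 2 curl -fsS -o /dev/null http://localhost:3000/health waits up to about a minute for the health endpoint to return a success status.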

Stop System

# Graceful shutdown
docker-compose down

# Or send SIGTERM to process
kill -TERM <pid>

Check System Status

# Health check
curl http://localhost:3000/health

# Metrics
curl http://localhost:3000/metrics

# Database connection
psql "$DATABASE_URL" -c "SELECT 1"

View Logs

# Application logs
tail -f logs/application-*.log

# Docker logs
docker-compose logs -f app

# Audit logs (database)
psql "$DATABASE_URL" -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"

Run Reconciliation

# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
  -H "Authorization: Bearer <token>"

# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
  -H "Authorization: Bearer <token>"

Database Operations

# Run migrations
npm run migrate

# Rollback last migration
npm run migrate:rollback

# Seed operators
npm run seed

# Backup database
pg_dump "$DATABASE_URL" > backup_$(date +%Y%m%d).sql

# Restore database
psql "$DATABASE_URL" < backup_20240101.sql

Troubleshooting

Payment Stuck in Processing

Symptoms:

  • Payment status is APPROVED but not progressing
  • No ledger posting or message generation

Diagnosis:

SELECT id, status, created_at, updated_at 
FROM payments 
WHERE status = 'APPROVED' 
  AND updated_at < NOW() - INTERVAL '5 minutes';

Resolution:

  1. Check application logs for errors
  2. Verify compliance screening status
  3. Check ledger adapter connectivity
  4. Manually trigger processing if needed

TLS Connection Issues

Symptoms:

  • tls_connection_errors_total increasing
  • Circuit breaker in OPEN state
  • Messages not transmitting

Diagnosis:

# Check TLS pool stats
curl http://localhost:3000/metrics | grep tls

# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com

Resolution:

  1. Verify receiver IP/port configuration
  2. Check certificate validity
  3. Verify network connectivity
  4. Review TLS pool logs
  5. Reset circuit breaker if needed
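When watching the TLS counters during an incident, it can help to reduce the /metrics output to a single number per metric. The sketch below sums every sample of one metric across its label sets. metric_sum is a hypothetical helper, and it assumes samples carry no trailing timestamp (the value is taken from the last field).

```shell
# Hypothetical helper, not part of the application.
# metric_sum NAME
# Reads Prometheus text exposition on stdin and prints the sum of all
# samples named NAME, across label sets. Prints 0 if NAME is absent.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP and TYPE comment lines
    {
      metric = $1
      i = index(metric, "{")               # strip the label set, if present
      if (i > 0) metric = substr(metric, 1, i - 1)
      if (metric == name) sum += $NF       # value is the last field
    }
    END { print sum + 0 }
  '
}
```

Usage: curl -s http://localhost:3000/metrics | metric_sum tls_connection_errors_total. Running it twice a minute apart gives the increase to compare against the alert threshold.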

Database Connection Issues

Symptoms:

  • Health check shows database error
  • High connection pool usage
  • Query timeouts

Diagnosis:

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check database-level statistics (numbackends = current connections)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';

Resolution:

  1. Increase connection pool size in config
  2. Check for long-running queries
  3. Restart database if needed
  4. Review connection pool settings
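For the long-running query check, a query against pg_stat_activity is usually enough. The sketch below only builds the SQL text so it can be reviewed before it is run. long_running_sql is a hypothetical helper, and the 5-minute default is an assumption.

```shell
# Hypothetical helper, not part of the application.
# long_running_sql [MINUTES]
# Prints a query that lists non-idle sessions whose current statement
# started more than MINUTES ago (default 5), longest first.
long_running_sql() {
  minutes=${1:-5}
  case $minutes in
    ''|*[!0-9]*)
      echo "minutes must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat <<EOF
SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '$minutes minutes'
ORDER BY duration DESC;
EOF
}
```

Usage: psql "$DATABASE_URL" -c "$(long_running_sql 5)". The argument is restricted to digits so that it cannot alter the generated statement.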

Reconciliation Exceptions

Symptoms:

  • High number of unmatched payments
  • Aging items accumulating

Resolution:

  1. Review reconciliation report
  2. Check exception queue
  3. Manually reconcile exceptions
  4. Investigate root cause (missing ACK, ledger mismatch, etc.)

Disaster Recovery

Backup Procedures

Daily Backups:

# Database backup
pg_dump "$DATABASE_URL" | gzip > backups/dbis_core_$(date +%Y%m%d).sql.gz

# Audit logs export (for compliance)
psql "$DATABASE_URL" -c "\copy audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
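A backup that was truncated mid-write is easy to miss until a restore is attempted. A minimal sketch of a post-backup check follows. verify_backup is a hypothetical helper, and it only confirms that the file is non-empty and that the gzip stream is intact, not that the dump restores cleanly.

```shell
# Hypothetical helper, not part of the application.
# verify_backup FILE
# Exits 0 when FILE exists, is non-empty, and passes gzip's integrity test.
verify_backup() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip stream: $file" >&2
    return 1
  fi
  return 0
}
```

Usage: verify_backup backups/dbis_core_$(date +%Y%m%d).sql.gz right after the daily dump. A periodic test restore into a scratch database remains the only full verification.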

Recovery Procedures

Database Recovery:

# Stop application
docker-compose stop app

# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql "$DATABASE_URL"

# Run migrations
npm run migrate

# Restart application
docker-compose start app

Data Retention

  • Audit Logs: 7-10 years (configurable)
  • Payment Records: Indefinite (archived after 7 years)
  • Application Logs: 30 days
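The 30-day application log retention can be enforced with a scheduled cleanup. The sketch below assumes the logs/application-*.log naming shown under View Logs and that no log rotation tool already handles pruning. prune_logs is a hypothetical helper.

```shell
# Hypothetical helper, not part of the application.
# prune_logs [DIR] [DAYS]
# Deletes application-*.log files under DIR (default: logs) whose
# modification time is more than DAYS full days old (default: 30),
# printing each path it removes. Other files are left untouched.
prune_logs() {
  dir=${1:-logs} days=${2:-30}
  find "$dir" -type f -name 'application-*.log' -mtime +"$days" \
    -print -exec rm -f {} +
}
```

Usage: prune_logs logs 30 from a daily cron job. Audit logs and payment records have much longer retention periods and must not be handled this way.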

Failover Procedures

  1. Application Failover:

    • Deploy to secondary server
    • Update load balancer
    • Verify health checks
  2. Database Failover:

    • Promote replica to primary
    • Update DATABASE_URL
    • Restart application

Emergency Contacts

  • System Administrator: [Contact]
  • Database Administrator: [Contact]
  • Security Team: [Contact]
  • On-Call Engineer: [Contact]

Change Management

All changes to production must:

  1. Be tested in a staging environment
  2. Have a documented rollback plan
  3. Be approved by the technical lead
  4. Be performed during a maintenance window
  5. Be monitored post-deployment