# Troubleshooting Guide

Common issues and solutions for Sankofa Phoenix.

## Table of Contents

1. [API Issues](#api-issues)
2. [Database Issues](#database-issues)
3. [Authentication Issues](#authentication-issues)
4. [Resource Provisioning](#resource-provisioning)
5. [Billing Issues](#billing-issues)
6. [Performance Issues](#performance-issues)
7. [Deployment Issues](#deployment-issues)
8. [Getting Help](#getting-help)
9. [Common Error Messages](#common-error-messages)

## API Issues

### API Not Responding

**Symptoms:**
- 503 Service Unavailable
- Connection timeout
- Health check fails

**Diagnosis:**
```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Check service
kubectl get svc -n api api
```

**Solutions:**
1. Restart API deployment:
   ```bash
   kubectl rollout restart deployment/api -n api
   ```

2. Check resource limits:
   ```bash
   kubectl describe pod -n api -l app=api
   ```

3. Verify database connection:
   ```bash
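   # Run psql through a shell inside the pod so that $DATABASE_URL is
   # expanded there, not by your local shell
   kubectl exec -n api deployment/api -- \
     sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'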
   ```
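
After a restart it helps to confirm that the API has actually recovered. The sketch below polls a health URL until it answers successfully. The helper name `wait_for_health` and the public `https://api.sankofa.nexus/health` URL in the usage comment are assumptions (this guide only shows `/health` on the port-forwarded service), so adjust them to your environment.

```bash
# Hypothetical helper: poll a health endpoint until it succeeds.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_for_health() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses such as 503.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage (the URL is an assumption):
#   kubectl rollout restart deployment/api -n api
#   kubectl rollout status deployment/api -n api --timeout=120s
#   wait_for_health https://api.sankofa.nexus/health
```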

### GraphQL Query Errors

**Symptoms:**
- GraphQL errors in response
- "Internal server error"
- Query timeouts

**Diagnosis:**
```bash
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error

# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ health { status } }"}'
```

**Solutions:**
1. Check query syntax
2. Verify authentication token
3. Check database query performance
4. Review resolver logs
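
A GraphQL server can return HTTP 200 while still reporting failures in an `errors` array, so checking the status code alone is not enough. The sketch below sends an authenticated query and flags responses that contain an `errors` key. Both helper names are hypothetical, and the `Authorization: Bearer` header is an assumption about how this API expects the Keycloak token.

```bash
# Send a query with a bearer token and print the raw response body.
# Assumption: the API accepts "Authorization: Bearer <token>".
# The query text must not contain double quotes (it is embedded in JSON as-is).
graphql_query() {
  local query="$1" token="$2"
  curl -sS -X POST https://api.sankofa.nexus/graphql \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $token" \
    -d "{\"query\": \"$query\"}"
}

# Succeed when the response body on stdin contains an "errors" key.
# This is a plain text match, not a JSON parser; use jq if it is available.
graphql_has_errors() {
  grep -q '"errors"[[:space:]]*:'
}

# Usage:
#   if graphql_query '{ health { status } }' "$TOKEN" | graphql_has_errors; then
#     echo "response contains GraphQL errors"
#   fi
```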

### Rate Limiting

**Symptoms:**
- 429 Too Many Requests
- Rate limit headers present

**Solutions:**
1. Implement request batching
2. Use subscriptions for real-time updates
3. Ask an administrator to raise the rate limit
4. Implement client-side caching
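
Alongside the measures above, clients usually need to back off when they receive a 429 instead of retrying immediately. The following is a minimal sketch of exponential backoff; the helper names, the base delay of 1 second, and the 60 second cap are illustrative assumptions, not values taken from the Sankofa Phoenix configuration.

```bash
# Delay in seconds before retry number <attempt> (0-based): base * 2^attempt, capped.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-60}" delay
  delay=$(( base * (1 << attempt) ))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

# Repeat a GET while the server answers 429; prints the final HTTP status code.
request_with_backoff() {
  local url="$1" max="${2:-5}" attempt=0 status
  while [ "$attempt" -lt "$max" ]; do
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  echo "429"
  return 1
}
```

If the API sends a `Retry-After` header, preferring that value over the computed delay is generally the better choice.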

## Database Issues

### Connection Pool Exhausted

**Symptoms:**
- "Too many connections" errors
- Slow query responses
- Database connection timeouts

**Diagnosis:**
```bash
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"

# Check connection pool metrics
curl -s https://api.sankofa.nexus/metrics | grep db_connections
```

**Solutions:**
1. Increase connection pool size:
   ```yaml
   env:
     - name: DB_POOL_SIZE
       value: "30"
   ```

2. Close idle connections:
   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
   ```

3. Restart API to reset connections
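
Before raising the pool size, it is worth seeing how the existing connections are distributed, since a pool full of idle sessions points at a leak rather than at real load. The sketch below groups sessions by state and reports the idle share. The helper names are hypothetical; the namespace, deployment, and user are the ones used elsewhere in this guide.

```bash
# Print "<state>|<count>" per session state (psql -At uses '|' as separator).
connections_by_state() {
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -At -c \
    "SELECT coalesce(state, 'background'), count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC"
}

# Read "<state>|<count>" lines on stdin and print the percentage that is idle.
idle_share() {
  awk -F'|' '{ total += $2; if ($1 == "idle") idle += $2 }
             END { if (total > 0) printf "%d\n", idle * 100 / total }'
}

# Usage:
#   connections_by_state
#   connections_by_state | idle_share
```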

### Slow Queries

**Symptoms:**
- High query latency
- Timeout errors
- Database CPU high

**Diagnosis:**
```sql
-- Find slow queries (requires the pg_stat_statements extension;
-- the column is named mean_time before PostgreSQL 13)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check table sizes
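-- format('%I.%I', ...) quotes identifiers so mixed-case names resolve correctly
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;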
```

**Solutions:**
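1. Add indexes on frequently filtered columns. On a live table, `CONCURRENTLY` avoids blocking writes, but it cannot run inside a transaction block:
   ```sql
   CREATE INDEX CONCURRENTLY idx_resources_tenant_id ON resources(tenant_id);
   CREATE INDEX CONCURRENTLY idx_resources_status ON resources(status);
   ```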

2. Analyze tables:
   ```sql
   ANALYZE resources;
   ```

3. Optimize queries
4. Consider read replicas for heavy read workloads

### Database Lock Issues

**Symptoms:**
- Queries hanging
- "Lock timeout" errors
- Deadlock errors

**Solutions:**
1. Check for long-running transactions:
   ```sql
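   -- Any session with an open transaction, including 'idle in transaction'
   -- sessions, which commonly hold locks while doing nothing
   SELECT pid, state, query, now() - xact_start AS duration
   FROM pg_stat_activity
   WHERE xact_start IS NOT NULL
   ORDER BY duration DESC;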
   ```

2. Terminate blocking sessions with `pg_terminate_backend(pid)` once you have confirmed it is safe
3. Review transaction isolation levels
4. Break up large transactions
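
To decide which session to terminate, you first need to know who is blocking whom. The sketch below prints a query built on `pg_blocking_pids()` (available in PostgreSQL 9.6 and later) and pipes it into `psql` inside the database pod. The helper name is hypothetical; the namespace, deployment, and user match the other examples in this guide.

```bash
# Print a query that pairs each blocked session with the sessions blocking it.
blocking_sessions_sql() {
  cat <<'SQL'
SELECT blocked.pid           AS blocked_pid,
       blocking.pid          AS blocking_pid,
       blocking.state        AS blocking_state,
       left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocked.pid;
SQL
}

# Usage: psql reads the query from stdin when no -c or -f is given.
#   blocking_sessions_sql | kubectl exec -i -n api deployment/postgres -- psql -U sankofa
```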

## Authentication Issues

### Token Expired

**Symptoms:**
- 401 Unauthorized
- "Token expired" error
- Keycloak errors

**Solutions:**
1. Refresh token via Keycloak
2. Re-authenticate
3. Check token expiration settings in Keycloak
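
To confirm that a 401 really is an expiry problem, you can decode the token locally and compare its `exp` claim with the current time. This is a minimal sketch that assumes the access token is a standard three-part JWT, as Keycloak normally issues. The helper names are hypothetical, `base64 -d` is the GNU coreutils spelling, and no signature verification is performed.

```bash
# Print the decoded payload (second segment) of a JWT. No signature check.
jwt_payload() {
  local payload
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the trailing '=' padding; restore it before decoding.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d
}

# Print the numeric "exp" claim (seconds since the Unix epoch).
jwt_exp() {
  jwt_payload "$1" \
    | grep -o '"exp"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# Succeed when exp is at or before "now" (second argument, default: current time).
jwt_is_expired() {
  local exp now="${2:-$(date +%s)}"
  exp=$(jwt_exp "$1") || return 2   # 2 = no exp claim found
  [ "$exp" -le "$now" ]
}

# Usage:
#   if jwt_is_expired "$TOKEN"; then echo "token expired, refresh it"; fi
```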

### Invalid Token

**Symptoms:**
- 401 Unauthorized
- "Invalid token" error

**Diagnosis:**
```bash
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
```

**Solutions:**
1. Verify token format
2. Check Keycloak client configuration
3. Verify token signature
4. Check clock synchronization

### Permission Denied

**Symptoms:**
- 403 Forbidden
- "Access denied" error

**Solutions:**
1. Verify user role in Keycloak
2. Check tenant context
3. Review RBAC policies
4. Verify resource ownership

## Resource Provisioning

### VM Creation Fails

**Symptoms:**
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors

**Diagnosis:**
```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default <vm-name>

# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

**Solutions:**
1. Verify Proxmox credentials
2. Check Proxmox node availability
3. Verify resource quotas
4. Check Crossplane provider logs
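
A resource stuck in PENDING usually has a status condition explaining why. The sketch below lists the conditions of a `ProxmoxVM` and keeps only those that are not `True`. It assumes the provider follows the usual Crossplane convention of reporting `Ready` and `Synced` conditions; the helper names are hypothetical.

```bash
# Print "<type>=<status> <reason>" for each condition on a ProxmoxVM.
vm_conditions() {
  local name="$1" ns="${2:-default}"
  kubectl get proxmoxvm "$name" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
}

# Keep only the conditions whose status is not True.
failing_conditions() {
  awk -F'[= ]' 'NF > 0 && $2 != "True"'
}

# Usage:
#   vm_conditions <vm-name> | failing_conditions
```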

### Resource Update Fails

**Symptoms:**
- Update mutation fails
- Resource not updating
- Status mismatch

**Solutions:**
1. Check resource state
2. Verify update permissions
3. Review resource constraints
4. Check for conflicting updates

## Billing Issues

### Incorrect Costs

**Symptoms:**
- Unexpected charges
- Missing usage records
- Cost discrepancies

**Diagnosis:**
```sql
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;

-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
```

**Solutions:**
1. Review usage records
2. Verify pricing configuration
3. Check for duplicate records
4. Recalculate costs if needed
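
Duplicate usage records are a common cause of inflated costs. One way to spot them, sketched below, is to export the columns that identify a usage event and look for repeated rows. The helper name is hypothetical, and the column list in the usage comment is a placeholder because this guide does not document the `usage_records` schema.

```bash
# Read exported rows on stdin and print each row that occurs more than once.
duplicate_rows() {
  sort | uniq -d
}

# Usage (replace the placeholder with the columns that define one usage event;
# do not include a surrogate primary key, or no two rows will ever match):
#   kubectl exec -n api deployment/postgres -- \
#     psql -U sankofa -At -c \
#     "SELECT <identifying columns> FROM usage_records WHERE tenant_id = 'tenant-id'" \
#     | duplicate_rows
```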

### Budget Alerts Not Triggering

**Symptoms:**
- Budget exceeded but no alert
- Alerts not sent

**Diagnosis:**
```sql
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';

-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;
```

**Solutions:**
1. Verify alert configuration
2. Check alert evaluation schedule
3. Review notification channels
4. Test alert manually

### Invoice Generation Fails

**Symptoms:**
- Invoice creation error
- Missing line items
- PDF generation fails

**Solutions:**
1. Check usage records exist
2. Verify billing period
3. Check PDF service
4. Review invoice template

## Performance Issues

### High Latency

**Symptoms:**
- Slow API responses
- Timeout errors
- High P95 latency

**Diagnosis:**
```bash
# Check API metrics
curl -s https://api.sankofa.nexus/metrics | grep request_duration

# Check database performance
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
```

**Solutions:**
1. Add caching layer
2. Optimize database queries
3. Scale API horizontally
4. Review N+1 query problems
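
When the metrics endpoint is not conclusive, a quick client-side measurement can confirm whether P95 latency is actually high. The sketch below times repeated requests with curl and computes a nearest-rank percentile. Both helper names are hypothetical, and the percentile argument is assumed to be a whole number between 1 and 100.

```bash
# Print the total request time in seconds for <n> sequential requests.
sample_latency() {
  local url="$1" n="${2:-20}" i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# Read one number per line on stdin and print the p-th percentile (nearest rank).
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Usage:
#   sample_latency https://api.sankofa.nexus/graphql 50 | percentile 95
```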

### High Memory Usage

**Symptoms:**
- OOM kills
- Pod restarts
- Memory warnings

**Solutions:**
1. Increase memory limits
2. Investigate possible memory leaks
3. Optimize data structures
4. Implement pagination
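
Before raising limits, confirm that the restarts are in fact OOM kills. The sketch below lists pods whose last container termination reason was `OOMKilled`, using the standard `lastState.terminated.reason` field of the pod status. The helper name is hypothetical and the namespace defaults to `api` as elsewhere in this guide.

```bash
# Print the names of pods in a namespace with a container last terminated by the OOM killer.
oom_killed_pods() {
  local ns="${1:-api}"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk '/OOMKilled/ { print $1 }'
}

# Usage:
#   oom_killed_pods api
```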

### High CPU Usage

**Symptoms:**
- Slow responses
- CPU throttling
- Pod evictions

**Solutions:**
1. Scale horizontally
2. Optimize algorithms
3. Add caching
4. Review expensive operations

## Deployment Issues

### Pods Not Starting

**Symptoms:**
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures

**Diagnosis:**
```bash
# Check pod status
kubectl describe pod -n api <pod-name>

# Check events
kubectl get events -n api --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n api <pod-name>
```

**Solutions:**
1. Check image availability
2. Verify resource requests/limits
3. Check node resources
4. Review init container logs
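
The container waiting reason usually tells you which of the solutions above applies. The sketch below reads that reason for every pod and maps a few common Kubernetes values to a next step. The helper names and hint texts are illustrative. Pods that are unschedulable have no container status yet, so they do not appear here; use `kubectl describe pod` for those.

```bash
# Print "<pod> <waiting reason(s)>" for every pod in the namespace.
waiting_reasons() {
  kubectl get pods -n "${1:-api}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Map a waiting reason to a suggested next step.
hint_for_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "check the image name, tag, and pull secrets" ;;
    CrashLoopBackOff)              echo "review logs with kubectl logs --previous" ;;
    CreateContainerConfigError)    echo "check referenced Secrets and ConfigMaps" ;;
    *)                             echo "inspect with kubectl describe pod" ;;
  esac
}

# Print "<pod>: <reason> -> <hint>" for every pod with a waiting container.
diagnose_pods() {
  waiting_reasons "$1" | while read -r pod reason _; do
    if [ -n "$reason" ]; then
      echo "$pod: $reason -> $(hint_for_reason "$reason")"
    fi
  done
}
```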

### Service Not Accessible

**Symptoms:**
- Service unreachable
- DNS resolution fails
- Ingress errors

**Diagnosis:**
```bash
# Check service
kubectl get svc -n api

# Check ingress
kubectl describe ingress -n api api

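# Test service directly (port-forward blocks, so run it in the background
# or run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80 &
PF_PID=$!
sleep 2
curl http://localhost:8080/health
kill "$PF_PID"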
```

**Solutions:**
1. Verify service selector matches pods
2. Check ingress configuration
3. Verify DNS records
4. Check network policies
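
A selector mismatch shows up as a Service with no endpoints. The sketch below reads the ready endpoint addresses for the `api` Service and reports when there are none. The helper name is hypothetical; the Service name and namespace default to the ones used in this guide.

```bash
# Report whether a Service has any ready endpoint addresses.
check_service_endpoints() {
  local svc="${1:-api}" ns="${2:-api}" ips
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
          -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$ips" ]; then
    echo "no ready endpoints: compare the Service selector with the pod labels"
    return 1
  fi
  echo "ready endpoints: $ips"
}

# Usage:
#   check_service_endpoints api api
#   kubectl get svc api -n api -o jsonpath='{.spec.selector}'
#   kubectl get pods -n api --show-labels
```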

### Configuration Issues

**Symptoms:**
- Wrong environment variables
- Missing secrets
- ConfigMap errors

**Solutions:**
1. Verify environment variables:
   ```bash
   kubectl exec -n api deployment/api -- env | grep -E "DB_|DATABASE_|KEYCLOAK_"
   ```

2. Check secrets:
   ```bash
   kubectl get secrets -n api
   ```

3. Review ConfigMaps:
   ```bash
   kubectl get configmaps -n api
   ```
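
Listing secrets only shows that they exist, not what they contain. To check a single value, you can decode it from the Secret, as sketched below. The helper name is hypothetical, and the secret and key names in the usage comment are placeholders. The output is sensitive, so avoid running this in shared terminals or CI logs.

```bash
# Print the decoded value of one key in a Secret.
# Keys containing dots need escaping in the jsonpath expression (for example tls\.crt).
secret_value() {
  local name="$1" key="$2" ns="${3:-api}"
  kubectl get secret "$name" -n "$ns" -o "jsonpath={.data.$key}" | base64 -d
}

# Usage:
#   secret_value <secret-name> <key> api
```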

## Getting Help

### Logs

```bash
# API logs
kubectl logs -n api deployment/api --tail=100 -f

# Database logs
kubectl logs -n api deployment/postgres --tail=100

# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
```
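
When escalating an issue, it is often easier to attach all component logs at once. The sketch below saves the logs of the four deployments listed above into one directory. The helper name, the default output directory, and the 1000-line tail are assumptions.

```bash
# Save recent logs from each component into <outdir>/<namespace>-<deployment>.log.
collect_logs() {
  local outdir="${1:-./sankofa-logs}" target ns deploy
  mkdir -p "$outdir"
  for target in api/api api/postgres keycloak/keycloak \
                crossplane-system/crossplane-provider-proxmox; do
    ns=${target%%/*}      # text before the slash
    deploy=${target#*/}   # text after the slash
    kubectl logs -n "$ns" "deployment/$deploy" --tail=1000 \
      > "$outdir/$ns-$deploy.log" 2>&1 \
      || echo "could not fetch logs for $target" >&2
  done
  echo "$outdir"
}

# Usage:
#   collect_logs /tmp/sankofa-logs
```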

### Metrics

```bash
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'

# Grafana dashboards
# Access: https://grafana.sankofa.nexus
```

### Support

- **Documentation**: See `docs/` directory
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
- **API Documentation**: `docs/API_DOCUMENTATION.md`

## Common Error Messages

### "Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies

### "Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible

### "Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase

### "Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions

### "Internal server error"
- Check application logs
- Review error details
- Check system resources