Sankofa/docs/TROUBLESHOOTING_GUIDE.md
defiQUG 9daf1fd378 Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements
- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
2025-12-12 18:01:35 -08:00


# Troubleshooting Guide
Common issues and solutions for Sankofa Phoenix.
## Table of Contents
1. [API Issues](#api-issues)
2. [Database Issues](#database-issues)
3. [Authentication Issues](#authentication-issues)
4. [Resource Provisioning](#resource-provisioning)
5. [Billing Issues](#billing-issues)
6. [Performance Issues](#performance-issues)
7. [Deployment Issues](#deployment-issues)
## API Issues
### API Not Responding
**Symptoms:**
- 503 Service Unavailable
- Connection timeout
- Health check fails
**Diagnosis:**
```bash
# Check pod status
kubectl get pods -n api
# Check logs
kubectl logs -n api deployment/api --tail=100
# Check service
kubectl get svc -n api api
```
**Solutions:**
1. Restart API deployment:
```bash
kubectl rollout restart deployment/api -n api
```
2. Check resource limits:
```bash
kubectl describe pod -n api -l app=api
```
3. Verify database connection:
```bash
# Quote the command so $DATABASE_URL expands inside the pod, not in your local shell
kubectl exec -it -n api deployment/api -- \
sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```
### GraphQL Query Errors
**Symptoms:**
- GraphQL errors in response
- "Internal server error"
- Query timeouts
**Diagnosis:**
```bash
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error
# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ health { status } }"}'
```
**Solutions:**
1. Check query syntax
2. Verify authentication token
3. Check database query performance
4. Review resolver logs
### Rate Limiting
**Symptoms:**
- 429 Too Many Requests
- Rate limit headers present
**Solutions:**
1. Implement request batching
2. Use subscriptions for real-time updates
3. Ask an administrator to raise the rate limit
4. Implement client-side caching
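
Alongside the measures above, a client that receives 429 usually needs to wait before retrying. The following is a minimal sketch of exponential backoff around the GraphQL endpoint shown earlier. The function names, the attempt limit, and the delay values are illustrative assumptions, not part of the Sankofa Phoenix API, and the sketch does not read the rate limit headers because their names are not documented here.

```bash
#!/usr/bin/env bash
# Sketch: retry a GraphQL request with exponential backoff on HTTP 429.
# Function names and default values are hypothetical.

# Delay in seconds for a zero-based attempt: base * 2^attempt, capped at max.
backoff_delay() {
  local attempt=$1 base=${2:-1} max=${3:-60}
  local delay=$(( base * (1 << attempt) ))
  if (( delay > max )); then
    delay=$max
  fi
  echo "$delay"
}

# POST a JSON body to the GraphQL endpoint, retrying while the API answers 429.
graphql_with_retry() {
  local body=$1 max_attempts=${2:-5} attempt status tmp
  tmp=$(mktemp)
  for (( attempt = 0; attempt < max_attempts; attempt++ )); do
    status=$(curl -s -o "$tmp" -w '%{http_code}' \
      -X POST https://api.sankofa.nexus/graphql \
      -H "Content-Type: application/json" \
      -d "$body")
    if [[ $status != 429 ]]; then
      cat "$tmp"
      rm -f "$tmp"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
  done
  rm -f "$tmp"
  echo "still rate limited after $max_attempts attempts" >&2
  return 1
}
```

Example call: `graphql_with_retry '{"query": "{ health { status } }"}'`.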
## Database Issues
### Connection Pool Exhausted
**Symptoms:**
- "Too many connections" errors
- Slow query responses
- Database connection timeouts
**Diagnosis:**
```bash
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"
# Check connection pool metrics
curl https://api.sankofa.nexus/metrics | grep db_connections
```
**Solutions:**
1. Increase connection pool size:
```yaml
env:
- name: DB_POOL_SIZE
value: "30"
```
2. Close idle connections:
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
```
3. Restart API to reset connections
### Slow Queries
**Symptoms:**
- High query latency
- Timeout errors
- Database CPU high
**Diagnosis:**
```sql
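-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the column is mean_time, not mean_exec_time)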
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
**Solutions:**
1. Add database indexes:
```sql
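-- CONCURRENTLY avoids blocking writes while the index builds;
-- it cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_resources_tenant_id ON resources(tenant_id);
CREATE INDEX CONCURRENTLY idx_resources_status ON resources(status);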
```
2. Analyze tables:
```sql
ANALYZE resources;
```
3. Optimize queries (inspect plans with `EXPLAIN ANALYZE`)
4. Consider read replicas for heavy read workloads
### Database Lock Issues
**Symptoms:**
- Queries hanging
- "Lock timeout" errors
- Deadlock errors
**Solutions:**
1. Check for long-running transactions:
```sql
SELECT pid, state, query, now() - xact_start AS duration
FROM pg_stat_activity
WHERE xact_start IS NOT NULL AND state IN ('active', 'idle in transaction')
ORDER BY duration DESC;
```
2. Terminate blocking queries (if safe)
3. Review transaction isolation levels
4. Break up large transactions
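
To decide which query is safe to terminate, it helps to see who is blocking whom. This sketch wraps two psql calls in shell functions, reusing the `deployment/postgres` target and `sankofa` user from the commands above. The function names are hypothetical, and `pg_blocking_pids()` requires PostgreSQL 9.6 or later.

```bash
#!/usr/bin/env bash
# Sketch: inspect and resolve lock contention. Function names are hypothetical.

# Print each blocked session together with the PIDs that block it.
show_blocking_sessions() {
  kubectl exec -i -n api deployment/postgres -- psql -U sankofa <<'SQL'
SELECT blocked.pid                   AS blocked_pid,
       pg_blocking_pids(blocked.pid) AS blocking_pids,
       now() - blocked.query_start   AS waiting_for,
       left(blocked.query, 80)       AS blocked_query
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
SQL
}

# Terminate one backend by PID. Run only after confirming it is safe.
terminate_backend() {
  local pid=$1
  # Reject anything that is not a plain number before building the SQL.
  [[ $pid =~ ^[0-9]+$ ]] || { echo "pid must be numeric" >&2; return 2; }
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -c "SELECT pg_terminate_backend($pid)"
}
```

Run `show_blocking_sessions` first, then pass a single PID from `blocking_pids` to `terminate_backend`.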
## Authentication Issues
### Token Expired
**Symptoms:**
- 401 Unauthorized
- "Token expired" error
- Keycloak errors
**Solutions:**
1. Refresh token via Keycloak
2. Re-authenticate
3. Check token expiration settings in Keycloak
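
Before changing Keycloak settings, it can help to confirm that the token really has expired. The sketch below reads the `exp` claim from a JWT access token locally. It assumes the token is a standard three-part JWT and that GNU `base64` is available; the function names are hypothetical. It only decodes the payload and does not verify the signature.

```bash
#!/usr/bin/env bash
# Sketch: read the expiry of a JWT without verifying it.
# Function names are hypothetical.

# Decode one base64url segment (JWT header or payload) to stdout.
b64url_decode() {
  local s
  s=$(printf '%s' "$1" | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  while (( ${#s} % 4 != 0 )); do s+='='; done
  printf '%s' "$s" | base64 -d
}

# Print the numeric "exp" claim (epoch seconds) from the token payload.
jwt_exp() {
  local payload
  payload=$(printf '%s' "$1" | cut -d. -f2)
  b64url_decode "$payload" \
    | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print seconds until expiry; a negative value means the token has expired.
# The optional second argument overrides the current time (epoch seconds).
jwt_seconds_left() {
  local exp now=${2:-$(date +%s)}
  exp=$(jwt_exp "$1")
  [[ -n $exp ]] || { echo "no exp claim found" >&2; return 1; }
  echo $(( exp - now ))
}
```

Example call: `jwt_seconds_left "$TOKEN"`, where `TOKEN` holds the bearer token sent to the API.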
### Invalid Token
**Symptoms:**
- 401 Unauthorized
- "Invalid token" error
**Diagnosis:**
```bash
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
```
**Solutions:**
1. Verify token format
2. Check Keycloak client configuration
3. Verify token signature
4. Check clock synchronization
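
For the clock synchronization check, one rough approach is to compare the local clock with the `Date` header returned by Keycloak. This is a sketch under two assumptions: GNU `date` is available (for `-d` parsing), and the server's own clock is correct. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: estimate clock skew against a server's HTTP Date header.
# Function names are hypothetical; requires GNU date.

# Convert an HTTP Date header value to epoch seconds.
http_date_to_epoch() {
  date -u -d "$1" +%s
}

# Print local time minus server time, in seconds, for the given URL.
clock_skew() {
  local url=$1 server_date
  server_date=$(curl -sI "$url" | tr -d '\r' \
    | sed -n 's/^[Dd]ate: //p' | head -n 1)
  [[ -n $server_date ]] || { echo "no Date header from $url" >&2; return 1; }
  echo $(( $(date +%s) - $(http_date_to_epoch "$server_date") ))
}
```

Example call: `clock_skew https://keycloak.sankofa.nexus/health`. A result of more than a few seconds in either direction suggests checking NTP on the affected host.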
### Permission Denied
**Symptoms:**
- 403 Forbidden
- "Access denied" error
**Solutions:**
1. Verify user role in Keycloak
2. Check tenant context
3. Review RBAC policies
4. Verify resource ownership
## Resource Provisioning
### VM Creation Fails
**Symptoms:**
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors
**Diagnosis:**
```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox
# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm
# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
curl https://proxmox-endpoint:8006/api2/json/version
```
**Solutions:**
1. Verify Proxmox credentials
2. Check Proxmox node availability
3. Verify resource quotas
4. Check Crossplane provider logs
### Resource Update Fails
**Symptoms:**
- Update mutation fails
- Resource not updating
- Status mismatch
**Solutions:**
1. Check resource state
2. Verify update permissions
3. Review resource constraints
4. Check for conflicting updates
## Billing Issues
### Incorrect Costs
**Symptoms:**
- Unexpected charges
- Missing usage records
- Cost discrepancies
**Diagnosis:**
```sql
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;
-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
```
**Solutions:**
1. Review usage records
2. Verify pricing configuration
3. Check for duplicate records
4. Recalculate costs if needed
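
The duplicate check can be sketched as a grouped query. The example assumes that two rows in `usage_records` with the same `tenant_id` and `timestamp` count as duplicates; the real uniqueness key may include other columns, so adjust the `GROUP BY` to match the schema. The function name is hypothetical. The tenant ID is passed as a psql variable rather than pasted into the SQL.

```bash
#!/usr/bin/env bash
# Sketch: list suspected duplicate usage records for one tenant.
# The duplicate criterion (tenant_id + timestamp) is an assumption.

find_duplicate_usage() {
  local tenant_id=$1
  # SQL is fed on stdin so that psql interpolates the :'tenant' variable.
  kubectl exec -i -n api deployment/postgres -- \
    psql -U sankofa -v tenant="$tenant_id" <<'SQL'
SELECT tenant_id, timestamp, count(*) AS copies
FROM usage_records
WHERE tenant_id = :'tenant'
GROUP BY tenant_id, timestamp
HAVING count(*) > 1
ORDER BY copies DESC;
SQL
}
```

Example call: `find_duplicate_usage tenant-id`.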
### Budget Alerts Not Triggering
**Symptoms:**
- Budget exceeded but no alert
- Alerts not sent
**Diagnosis:**
```sql
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';
-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;
```
**Solutions:**
1. Verify alert configuration
2. Check alert evaluation schedule
3. Review notification channels
4. Test alert manually
### Invoice Generation Fails
**Symptoms:**
- Invoice creation error
- Missing line items
- PDF generation fails
**Solutions:**
1. Check usage records exist
2. Verify billing period
3. Check PDF service
4. Review invoice template
## Performance Issues
### High Latency
**Symptoms:**
- Slow API responses
- Timeout errors
- High P95 latency
**Diagnosis:**
```bash
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep request_duration
# Check database performance
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
```
**Solutions:**
1. Add caching layer
2. Optimize database queries
3. Scale API horizontally
4. Review N+1 query problems
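
To check whether a change actually moved P95 latency, you can sample the endpoint directly. This sketch times repeated requests with curl and computes a nearest-rank percentile. The function names and the sample count are illustrative, and the URL in the usage line assumes a local port-forward to the API service.

```bash
#!/usr/bin/env bash
# Sketch: measure request latency and compute a percentile.
# Function names are hypothetical.

# Nearest-rank percentile of numbers read from stdin, one per line.
# The percentile argument must be an integer from 1 to 100.
percentile() {
  local p=$1
  LC_ALL=C sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Print the total time in seconds of n sequential requests to a URL.
sample_latency() {
  local url=$1 n=${2:-20} i
  for (( i = 0; i < n; i++ )); do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
  done
}
```

Example call: `sample_latency http://localhost:8080/health 50 | percentile 95`.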
### High Memory Usage
**Symptoms:**
- OOM kills
- Pod restarts
- Memory warnings
**Solutions:**
1. Increase memory limits
2. Review memory leaks
3. Optimize data structures
4. Implement pagination
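
Before raising limits, it is worth confirming that the restarts are really OOM kills. The sketch below lists API pods whose last container termination reason was `OOMKilled`, using the `app=api` label from the commands above. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: find API pods that were recently OOM-killed.
# The function name is hypothetical.

list_oom_killed() {
  # Emit "<pod name><TAB><last termination reasons>" per pod, then filter.
  kubectl get pods -n api -l app=api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

If the cluster runs metrics-server, `kubectl top pod -n api --containers` shows current usage to compare against the configured limits.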
### High CPU Usage
**Symptoms:**
- Slow responses
- CPU throttling
- Pod evictions
**Solutions:**
1. Scale horizontally
2. Optimize algorithms
3. Add caching
4. Review expensive operations
## Deployment Issues
### Pods Not Starting
**Symptoms:**
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures
**Diagnosis:**
```bash
# Check pod status
kubectl describe pod -n api <pod-name>
# Check events
kubectl get events -n api --sort-by='.lastTimestamp'
# Check logs
kubectl logs -n api <pod-name>
```
**Solutions:**
1. Check image availability
2. Verify resource requests/limits
3. Check node resources
4. Review init container logs
### Service Not Accessible
**Symptoms:**
- Service unreachable
- DNS resolution fails
- Ingress errors
**Diagnosis:**
```bash
# Check service
kubectl get svc -n api
# Check ingress
kubectl describe ingress -n api api
# Test service directly (port-forward runs in the foreground, so run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health
```
**Solutions:**
1. Verify service selector matches pods
2. Check ingress configuration
3. Verify DNS records
4. Check network policies
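
A quick way to test the selector is to check whether the Service has any endpoints. If it has none, no ready pod matches the selector. This is a minimal sketch; the function name is hypothetical, and the `api` namespace and service name come from the commands above.

```bash
#!/usr/bin/env bash
# Sketch: check whether a Service currently has endpoint addresses.
# The function name is hypothetical.

# Succeeds (exit 0) when the Service has at least one ready address.
service_has_endpoints() {
  local ns=$1 svc=$2 addrs
  addrs=$(kubectl get endpoints -n "$ns" "$svc" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  [[ -n $addrs ]]
}
```

Example call: `service_has_endpoints api api || kubectl get pods -n api --show-labels`, then compare the pod labels with the output of `kubectl get svc -n api api -o jsonpath='{.spec.selector}'`.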
### Configuration Issues
**Symptoms:**
- Wrong environment variables
- Missing secrets
- ConfigMap errors
**Solutions:**
1. Verify environment variables:
```bash
kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
```
2. Check secrets:
```bash
kubectl get secrets -n api
```
3. Review ConfigMaps:
```bash
kubectl get configmaps -n api
```
## Getting Help
### Logs
```bash
# API logs
kubectl logs -n api deployment/api --tail=100 -f
# Database logs
kubectl logs -n api deployment/postgres --tail=100
# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
```
### Metrics
```bash
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'
# Grafana dashboards
# Access: https://grafana.sankofa.nexus
```
### Support
- **Documentation**: See `docs/` directory
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
- **API Documentation**: `docs/API_DOCUMENTATION.md`
## Common Error Messages
### "Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies
### "Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible
### "Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase
### "Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions
### "Internal server error"
- Check application logs
- Review error details
- Check system resources