# Troubleshooting Guide

Common issues and solutions for Sankofa Phoenix.

## Table of Contents

1. [API Issues](#api-issues)
2. [Database Issues](#database-issues)
3. [Authentication Issues](#authentication-issues)
4. [Resource Provisioning](#resource-provisioning)
5. [Billing Issues](#billing-issues)
6. [Performance Issues](#performance-issues)
7. [Deployment Issues](#deployment-issues)
8. [Getting Help](#getting-help)
9. [Common Error Messages](#common-error-messages)

## API Issues

### API Not Responding

**Symptoms:**
- 503 Service Unavailable
- Connection timeout
- Health check fails

**Diagnosis:**
```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Check service
kubectl get svc -n api api
```

**Solutions:**
1. Restart API deployment:
   ```bash
   kubectl rollout restart deployment/api -n api
   ```

2. Check resource limits:
   ```bash
   kubectl describe pod -n api -l app=api
   ```

3. Verify database connection:
   ```bash
   # Single quotes make $DATABASE_URL expand inside the pod, not in the local shell
   kubectl exec -it -n api deployment/api -- \
     sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
   ```

### GraphQL Query Errors

**Symptoms:**
- GraphQL errors in response
- "Internal server error"
- Query timeouts

**Diagnosis:**
```bash
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error

# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ health { status } }"}'
```

**Solutions:**
1. Check query syntax
2. Verify authentication token
3. Check database query performance
4. Review resolver logs

### Rate Limiting

**Symptoms:**
- 429 Too Many Requests
- Rate limit headers present

**Solutions:**
1. Implement request batching
2. Use subscriptions for real-time updates
3. Request a rate limit increase from an administrator
4. Implement client-side caching

## Database Issues

### Connection Pool Exhausted

**Symptoms:**
- "Too many connections" errors
- Slow query responses
- Database connection timeouts

**Diagnosis:**
```bash
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"

# Check connection pool metrics
curl https://api.sankofa.nexus/metrics | grep db_connections
```

**Solutions:**
1. Increase connection pool size:
   ```yaml
   env:
     - name: DB_POOL_SIZE
       value: "30"
   ```

2. Close idle connections:
   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle'
     AND state_change < NOW() - INTERVAL '5 minutes';
   ```

3. Restart API to reset connections

### Slow Queries

**Symptoms:**
- High query latency
- Timeout errors
- Database CPU high

**Diagnosis:**
```sql
-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check table sizes
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Solutions:**
1. Add database indexes:
   ```sql
   CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
   CREATE INDEX idx_resources_status ON resources(status);
   ```

2. Analyze tables:
   ```sql
   ANALYZE resources;
   ```

3. Optimize queries
4. Consider read replicas for heavy read workloads

### Database Lock Issues

**Symptoms:**
- Queries hanging
- "Lock timeout" errors
- Deadlock errors

**Solutions:**
1. Check for long-running transactions:
   ```sql
   SELECT pid, state, query, now() - xact_start AS duration
   FROM pg_stat_activity
   WHERE state = 'active'
     AND xact_start IS NOT NULL
   ORDER BY duration DESC;
   ```

2. Terminate blocking queries (if safe)
3. Review transaction isolation levels
4. Break up large transactions

## Authentication Issues

### Token Expired

**Symptoms:**
- 401 Unauthorized
- "Token expired" error
- Keycloak errors

**Solutions:**
1. Refresh token via Keycloak
2. Re-authenticate
3. Check token expiration settings in Keycloak

### Invalid Token

**Symptoms:**
- 401 Unauthorized
- "Invalid token" error

**Diagnosis:**
```bash
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
```

**Solutions:**
1. Verify token format
2. Check Keycloak client configuration
3. Verify token signature
4. Check clock synchronization

### Permission Denied

**Symptoms:**
- 403 Forbidden
- "Access denied" error

**Solutions:**
1. Verify user role in Keycloak
2. Check tenant context
3. Review RBAC policies
4. Verify resource ownership

## Resource Provisioning

### VM Creation Fails

**Symptoms:**
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors

**Diagnosis:**
```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm

# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

**Solutions:**
1. Verify Proxmox credentials
2. Check Proxmox node availability
3. Verify resource quotas
4. Check Crossplane provider logs

### Resource Update Fails

**Symptoms:**
- Update mutation fails
- Resource not updating
- Status mismatch

**Solutions:**
1. Check resource state
2. Verify update permissions
3. Review resource constraints
4. Check for conflicting updates

## Billing Issues

### Incorrect Costs

**Symptoms:**
- Unexpected charges
- Missing usage records
- Cost discrepancies

**Diagnosis:**
```sql
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;

-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
```

**Solutions:**
1. Review usage records
2. Verify pricing configuration
3. Check for duplicate records
4. Recalculate costs if needed

### Budget Alerts Not Triggering

**Symptoms:**
- Budget exceeded but no alert
- Alerts not sent

**Diagnosis:**
```sql
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';

-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id'
  AND enabled = true;
```

**Solutions:**
1. Verify alert configuration
2. Check alert evaluation schedule
3. Review notification channels
4. Test alert manually

### Invoice Generation Fails

**Symptoms:**
- Invoice creation error
- Missing line items
- PDF generation fails

**Solutions:**
1. Check usage records exist
2. Verify billing period
3. Check PDF service
4. Review invoice template

## Performance Issues

### High Latency

**Symptoms:**
- Slow API responses
- Timeout errors
- High P95 latency

**Diagnosis:**
```bash
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep request_duration

# Check database performance
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
```

**Solutions:**
1. Add caching layer
2. Optimize database queries
3. Scale API horizontally
4. Look for N+1 query problems

### High Memory Usage

**Symptoms:**
- OOM kills
- Pod restarts
- Memory warnings

**Solutions:**
1. Increase memory limits
2. Investigate memory leaks
3. Optimize data structures
4. Implement pagination

### High CPU Usage

**Symptoms:**
- Slow responses
- CPU throttling
- Pod evictions

**Solutions:**
1. Scale horizontally
2. Optimize algorithms
3. Add caching
4. Review expensive operations

## Deployment Issues

### Pods Not Starting

**Symptoms:**
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures

**Diagnosis:**
```bash
# Check pod status
kubectl describe pod -n api

# Check events
kubectl get events -n api --sort-by='.lastTimestamp'

# Check logs (kubectl logs needs a target; add --previous for a crashed container)
kubectl logs -n api deployment/api --tail=100
```

**Solutions:**
1. Check image availability
2. Verify resource requests/limits
3. Check node resources
4. Review init container logs

### Service Not Accessible

**Symptoms:**
- Service unreachable
- DNS resolution fails
- Ingress errors

**Diagnosis:**
```bash
# Check service
kubectl get svc -n api

# Check ingress
kubectl describe ingress -n api api

# Test service directly (port-forward blocks; run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health
```

**Solutions:**
1. Verify service selector matches pods
2. Check ingress configuration
3. Verify DNS records
4. Check network policies

### Configuration Issues

**Symptoms:**
- Wrong environment variables
- Missing secrets
- ConfigMap errors

**Solutions:**
1. Verify environment variables:
   ```bash
   kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
   ```

2. Check secrets:
   ```bash
   kubectl get secrets -n api
   ```

3. Review ConfigMaps:
   ```bash
   kubectl get configmaps -n api
   ```

## Getting Help

### Logs

```bash
# API logs
kubectl logs -n api deployment/api --tail=100 -f

# Database logs
kubectl logs -n api deployment/postgres --tail=100

# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
```

### Metrics

```bash
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'

# Grafana dashboards
# Access: https://grafana.sankofa.nexus
```

### Support

- **Documentation**: See `docs/` directory
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
- **API Documentation**: `docs/API_DOCUMENTATION.md`

## Common Error Messages

### "Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies

### "Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible

### "Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase

### "Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions

### "Internal server error"
- Check application logs
- Review error details
- Check system resources
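
When an error such as "Internal server error" does not point at a single component, a log snapshot across all services is often the quickest starting point. The following is a minimal sketch that wraps the log commands from the Logs section in one helper. The function name `collect_logs`, its arguments, and the output file layout are illustrative assumptions, not part of the Sankofa Phoenix tooling; the namespaces and deployment names are the ones used elsewhere in this guide.

```bash
#!/usr/bin/env bash
# Sketch only: collect_logs is a hypothetical helper, not shipped with the project.
# Usage: collect_logs [output_dir] [tail_lines]

collect_logs() {
  local out_dir="${1:-./diagnostics}"   # where the .log files are written
  local tail_lines="${2:-100}"          # how many recent lines to keep per component
  local failed=0
  local target ns deploy

  mkdir -p "$out_dir" || return 1

  # namespace/deployment pairs taken from the Logs section of this guide
  for target in \
    "api/api" \
    "api/postgres" \
    "keycloak/keycloak" \
    "crossplane-system/crossplane-provider-proxmox"; do
    ns="${target%%/*}"
    deploy="${target##*/}"

    # Keep going if one component is unreachable, but remember the failure
    if ! kubectl logs -n "$ns" "deployment/$deploy" --tail="$tail_lines" \
        > "$out_dir/$deploy.log" 2>&1; then
      echo "warning: could not fetch logs for $ns/$deploy" >&2
      failed=1
    fi
  done

  # Non-zero exit status if any component could not be read
  return "$failed"
}
```

After sourcing the file, `collect_logs ./diag 200` writes one `.log` file per deployment into `./diag`. Any error output from `kubectl` is captured in the same file, so a failed fetch still leaves a record of why it failed.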