# Troubleshooting Guide

Common issues and solutions for Sankofa Phoenix.

## Table of Contents

1. [API Issues](#api-issues)
2. [Database Issues](#database-issues)
3. [Authentication Issues](#authentication-issues)
4. [Resource Provisioning](#resource-provisioning)
5. [Billing Issues](#billing-issues)
6. [Performance Issues](#performance-issues)
7. [Deployment Issues](#deployment-issues)
8. [Getting Help](#getting-help)
9. [Common Error Messages](#common-error-messages)

## API Issues

### API Not Responding

**Symptoms:**
- 503 Service Unavailable
- Connection timeout
- Health check fails

**Diagnosis:**
```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Check service
kubectl get svc -n api api
```

**Solutions:**
1. Restart API deployment:
   ```bash
   kubectl rollout restart deployment/api -n api
   ```

2. Check resource limits:
   ```bash
   kubectl describe pod -n api -l app=api
   ```

3. Verify database connection:
   ```bash
   # Single quotes make $DATABASE_URL expand inside the container, not in the local shell.
   # Assumes psql is installed in the API image.
   kubectl exec -it -n api deployment/api -- \
     sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
   ```

### GraphQL Query Errors

**Symptoms:**
- GraphQL errors in response
- "Internal server error"
- Query timeouts

**Diagnosis:**
```bash
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error

# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ health { status } }"}'
```

**Solutions:**
1. Check query syntax
2. Verify authentication token
3. Check database query performance
4. Review resolver logs

### Rate Limiting

**Symptoms:**
- 429 Too Many Requests
- Rate limit headers present

**Solutions:**
1. Implement request batching
2. Use subscriptions for real-time updates
3. Request rate limit increase (admin)
4. Implement client-side caching

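Until one of the solutions above is in place, a client that receives a 429 can retry with exponential backoff instead of resending immediately. The following is a minimal sketch and not part of the Sankofa Phoenix tooling: `backoff_delay` and `request_with_backoff` are hypothetical helper names, and the health query is only a placeholder request.

```bash
#!/usr/bin/env bash

# Exponential backoff delay in seconds: base * 2^attempt, capped at `cap`.
# Hypothetical helper; defaults (base 1s, cap 60s) are illustrative.
backoff_delay() {
  local attempt=$1 base=${2:-1} cap=${3:-60}
  local delay=$(( base * (1 << attempt) ))
  if (( delay > cap )); then
    delay=$cap
  fi
  echo "$delay"
}

# POST a placeholder GraphQL query and retry only while the API answers 429.
# Prints the final HTTP status code; returns 1 if still rate limited.
request_with_backoff() {
  local url=$1 max_attempts=${2:-5} attempt=0 status
  while (( attempt < max_attempts )); do
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      -X POST "$url" \
      -H "Content-Type: application/json" \
      -d '{"query": "{ health { status } }"}')
    if [[ $status != 429 ]]; then
      echo "$status"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
  echo "429"
  return 1
}
```

A call such as `request_with_backoff https://api.sankofa.nexus/graphql` would wait 1, 2, 4, 8 and 16 seconds between attempts with the defaults shown. If the API returns a `Retry-After` header, honoring that value is generally preferable to a computed delay.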
## Database Issues

### Connection Pool Exhausted

**Symptoms:**
- "Too many connections" errors
- Slow query responses
- Database connection timeouts

**Diagnosis:**
```bash
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"

# Check connection pool metrics
curl https://api.sankofa.nexus/metrics | grep db_connections
```

**Solutions:**
1. Increase connection pool size:
   ```yaml
   env:
     - name: DB_POOL_SIZE
       value: "30"
   ```

2. Close idle connections:
   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
   ```

3. Restart API to reset connections

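A raw connection count is easier to interpret next to the server's `max_connections` setting. The sketch below is an illustration under assumptions: `pg_scalar` and `pool_usage_percent` are hypothetical helper names, and the namespace, deployment and user are taken from the diagnosis commands above.

```bash
#!/usr/bin/env bash

# Run a query that returns a single value and print it bare.
# -A (unaligned) and -t (tuples only) strip psql's table decoration.
pg_scalar() {
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -At -c "$1"
}

# Integer percentage of the connection limit currently in use.
pool_usage_percent() {
  local active=$1 max=$2
  if (( max <= 0 )); then
    echo "max must be positive" >&2
    return 1
  fi
  echo $(( active * 100 / max ))
}

# Example usage (requires cluster access):
#   active=$(pg_scalar "SELECT count(*) FROM pg_stat_activity")
#   max=$(pg_scalar "SHOW max_connections")
#   echo "Connections in use: $(pool_usage_percent "$active" "$max")%"
```

A value that stays close to 100 suggests the pool size or the server limit needs attention before terminating idle sessions becomes a routine task.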
### Slow Queries

**Symptoms:**
- High query latency
- Timeout errors
- High database CPU usage

**Diagnosis:**
```sql
-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the column is mean_time)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**Solutions:**
1. Add database indexes:
   ```sql
   -- On a busy table, CREATE INDEX CONCURRENTLY avoids blocking writes
   CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
   CREATE INDEX idx_resources_status ON resources(status);
   ```

2. Analyze tables:
   ```sql
   ANALYZE resources;
   ```

3. Optimize queries
4. Consider read replicas for heavy read workloads

### Database Lock Issues

**Symptoms:**
- Queries hanging
- "Lock timeout" errors
- Deadlock errors

**Solutions:**
1. Check for long-running transactions:
   ```sql
   -- Sessions that are 'idle in transaction' still hold their locks
   SELECT pid, state, query, now() - xact_start AS duration
   FROM pg_stat_activity
   WHERE xact_start IS NOT NULL
     AND state IN ('active', 'idle in transaction')
   ORDER BY duration DESC;
   ```

2. Terminate blocking queries (if safe): `pg_blocking_pids(pid)` lists the sessions blocking a given backend, and `pg_terminate_backend(pid)` ends one
3. Review transaction isolation levels
4. Break up large transactions

## Authentication Issues

### Token Expired

**Symptoms:**
- 401 Unauthorized
- "Token expired" error
- Keycloak errors

**Solutions:**
1. Refresh token via Keycloak
2. Re-authenticate
3. Check token expiration settings in Keycloak

### Invalid Token

**Symptoms:**
- 401 Unauthorized
- "Invalid token" error

**Diagnosis:**
```bash
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
```

**Solutions:**
1. Verify token format
2. Check Keycloak client configuration
3. Verify token signature
4. Check clock synchronization

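Token format and expiry can be inspected locally before involving Keycloak. The sketch below decodes the payload segment of a JWT and compares its `exp` claim with the local clock. It assumes a standard three-part JWT and a `base64` that accepts `-d`; the function names are hypothetical, and the signature is not verified here.

```bash
#!/usr/bin/env bash

# Print the decoded JSON payload (second dot-separated segment) of a JWT.
jwt_payload() {
  local payload
  # base64url uses '-' and '_' where standard base64 uses '+' and '/'.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url omits.
  while (( ${#payload} % 4 != 0 )); do
    payload+="="
  done
  printf '%s' "$payload" | base64 -d
}

# Print the numeric "exp" claim (seconds since the Unix epoch).
jwt_exp() {
  jwt_payload "$1" \
    | grep -o '"exp"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# Return 0 if the token is expired, 1 if still valid, 2 if "exp" is missing.
# The optional second argument overrides "now" (epoch seconds).
jwt_is_expired() {
  local exp now=${2:-$(date +%s)}
  exp=$(jwt_exp "$1") || return 2
  (( now >= exp ))
}
```

For example, `jwt_is_expired "$TOKEN" && echo "expired"` reports expiry according to the local clock. If a freshly issued token already appears expired, clock skew between the client and Keycloak is a likely cause.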
### Permission Denied

**Symptoms:**
- 403 Forbidden
- "Access denied" error

**Solutions:**
1. Verify user role in Keycloak
2. Check tenant context
3. Review RBAC policies
4. Verify resource ownership

## Resource Provisioning

### VM Creation Fails

**Symptoms:**
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors

**Diagnosis:**
```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm

# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

**Solutions:**
1. Verify Proxmox credentials
2. Check Proxmox node availability
3. Verify resource quotas
4. Check Crossplane provider logs

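To tell a slow provision from a stuck one, it can help to poll the resource for a bounded time after applying a fix. This is a sketch under assumptions: `poll_until` and `vm_is_ready` are hypothetical helpers, and `vm_is_ready` assumes the ProxmoxVM resource exposes a `Ready` status condition, as Crossplane managed resources conventionally do.

```bash
#!/usr/bin/env bash

# Re-run a command every `interval` seconds until it succeeds or
# roughly `timeout` seconds have been spent waiting.
# Usage: poll_until <timeout> <interval> <command> [args...]
poll_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    if (( waited >= timeout )); then
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Succeed when the named ProxmoxVM reports Ready=True.
# Assumption: the provider sets a condition of type "Ready".
vm_is_ready() {
  local status
  status=$(kubectl get proxmoxvm "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [[ $status == True ]]
}

# Example usage (requires cluster access):
#   poll_until 300 10 vm_is_ready test-vm default || echo "still not ready"
```

If the poll times out, the events at the end of the `kubectl describe proxmoxvm` output usually indicate which of the solutions above applies.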
### Resource Update Fails

**Symptoms:**
- Update mutation fails
- Resource not updating
- Status mismatch

**Solutions:**
1. Check resource state
2. Verify update permissions
3. Review resource constraints
4. Check for conflicting updates

## Billing Issues

### Incorrect Costs

**Symptoms:**
- Unexpected charges
- Missing usage records
- Cost discrepancies

**Diagnosis:**
```sql
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;

-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
```

**Solutions:**
1. Review usage records
2. Verify pricing configuration
3. Check for duplicate records
4. Recalculate costs if needed

### Budget Alerts Not Triggering

**Symptoms:**
- Budget exceeded but no alert
- Alerts not sent

**Diagnosis:**
```sql
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';

-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;
```

**Solutions:**
1. Verify alert configuration
2. Check alert evaluation schedule
3. Review notification channels
4. Test alert manually

### Invoice Generation Fails

**Symptoms:**
- Invoice creation error
- Missing line items
- PDF generation fails

**Solutions:**
1. Check usage records exist
2. Verify billing period
3. Check PDF service
4. Review invoice template

## Performance Issues

### High Latency

**Symptoms:**
- Slow API responses
- Timeout errors
- High P95 latency

**Diagnosis:**
```bash
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep request_duration

# Check database performance
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
```

**Solutions:**
1. Add caching layer
2. Optimize database queries
3. Scale API horizontally
4. Review N+1 query problems

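Server-side metrics can be cross-checked with a quick client-side measurement. The sketch below samples total request time with `curl` and computes P95 using the nearest-rank method. `sample_latency` and `p95` are hypothetical helper names, and the `/health` path in the usage example is an assumption borrowed from the port-forward check elsewhere in this guide.

```bash
#!/usr/bin/env bash

# Print one total request time (in seconds) per line for n requests.
sample_latency() {
  local url=$1 n=${2:-20} i
  for (( i = 0; i < n; i++ )); do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
  done
}

# Read one number per line on stdin and print the 95th percentile
# (nearest-rank: the value at position ceil(0.95 * count) after sorting).
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Integer form of ceil(0.95 * NR), avoiding floating-point rounding.
      print v[int((NR * 95 + 99) / 100)]
    }'
}

# Example usage (requires network access):
#   sample_latency https://api.sankofa.nexus/health 50 | p95
```

Client-side numbers include network and TLS time, so they are expected to sit above the `request_duration` values reported by the API itself. A large gap points at the network path or ingress rather than the application.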
### High Memory Usage

**Symptoms:**
- OOM kills
- Pod restarts
- Memory warnings

**Solutions:**
1. Increase memory limits
2. Investigate memory leaks
3. Optimize data structures
4. Implement pagination

### High CPU Usage

**Symptoms:**
- Slow responses
- CPU throttling
- Pod evictions

**Solutions:**
1. Scale horizontally
2. Optimize algorithms
3. Add caching
4. Review expensive operations

## Deployment Issues

### Pods Not Starting

**Symptoms:**
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures

**Diagnosis:**
```bash
# Check pod status
kubectl describe pod -n api <pod-name>

# Check events
kubectl get events -n api --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n api <pod-name>

# For CrashLoopBackOff, check logs from the previous (crashed) container
kubectl logs -n api <pod-name> --previous
```

**Solutions:**
1. Check image availability
2. Verify resource requests/limits
3. Check node resources
4. Review init container logs

### Service Not Accessible

**Symptoms:**
- Service unreachable
- DNS resolution fails
- Ingress errors

**Diagnosis:**
```bash
# Check service
kubectl get svc -n api

# Check ingress
kubectl describe ingress -n api api

# Test service directly
# (port-forward runs in the foreground; run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health
```

**Solutions:**
1. Verify service selector matches pods
2. Check ingress configuration
3. Verify DNS records
4. Check network policies

### Configuration Issues

**Symptoms:**
- Wrong environment variables
- Missing secrets
- ConfigMap errors

**Solutions:**
1. Verify environment variables:
   ```bash
   # Output may include secret values; avoid pasting it into tickets or chat
   kubectl exec -n api deployment/api -- env | grep -E "DB_|DATABASE_|KEYCLOAK_"
   ```

2. Check secrets:
   ```bash
   kubectl get secrets -n api
   ```

3. Review ConfigMaps:
   ```bash
   kubectl get configmaps -n api
   ```

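Scanning `env` output by eye makes it easy to miss a variable that is absent altogether. The sketch below lists which expected variables are missing from a container's environment. `missing_vars` is a hypothetical helper, and the variable names in the usage example are the ones that appear elsewhere in this guide, not a complete list of required settings.

```bash
#!/usr/bin/env bash

# Read `env` output on stdin and print every name given as an argument
# that has no NAME=... line. Prints nothing when all names are present.
missing_vars() {
  local env_dump name
  env_dump=$(cat)
  for name in "$@"; do
    grep -q "^${name}=" <<<"$env_dump" || echo "$name"
  done
}

# Example usage (requires cluster access):
#   kubectl exec -n api deployment/api -- env \
#     | missing_vars DATABASE_URL DB_POOL_SIZE
```

Because only names are printed, the result is safe to share, unlike raw `env` output. A variable that is reported missing usually traces back to a Secret or ConfigMap key that the deployment does not reference.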
## Getting Help

### Logs

```bash
# API logs
kubectl logs -n api deployment/api --tail=100 -f

# Database logs
kubectl logs -n api deployment/postgres --tail=100

# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
```

### Metrics

```bash
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'

# Grafana dashboards
# Access: https://grafana.sankofa.nexus
```

### Support

- **Documentation**: See `docs/` directory
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
- **API Documentation**: `docs/API_DOCUMENTATION.md`

## Common Error Messages

### "Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies

### "Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible

### "Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase

### "Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions

### "Internal server error"
- Check application logs
- Review error details
- Check system resources