loc_az_hci/docs/troubleshooting/common-issues.md
Last commit: c39465c2bd by defiQUG, 2026-02-08 09:04:46 -08:00

Common Issues and Solutions

This document covers frequently encountered problems and their solutions.

Proxmox Issues

Cannot Connect to Proxmox Web UI

Symptoms:

  • Browser shows connection error
  • SSL certificate warning

Solutions:

  1. Verify IP address and port (default: 8006)
  2. Accept self-signed certificate in browser
  3. Check firewall rules: iptables -L -n
  4. Verify Proxmox service: systemctl status pveproxy
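
The first three checks can be combined into one probe run from the machine that cannot reach the UI. This is a minimal sketch, not part of the project's tooling: the function names and the PVE_PORT override are hypothetical, and it assumes curl is installed.

```shell
# Probe the Proxmox web UI and print the HTTP status code.
# -k accepts the self-signed certificate; --max-time bounds a hung connection.
# PVE_PORT is a hypothetical override; 8006 is the default port.
probe_pve_ui() {
    host=$1
    port=${PVE_PORT:-8006}
    curl -k -s -o /dev/null -w '%{http_code}' --max-time 5 "https://${host}:${port}/"
}

# Turn the status code into a hint. curl reports 000 when no HTTP response arrived.
explain_pve_status() {
    case $1 in
        000) echo "no response: check IP, port, firewall, and pveproxy" ;;
        2??|3??) echo "web UI is answering" ;;
        *) echo "unexpected HTTP status $1: check pveproxy logs" ;;
    esac
}
```

Example use, with a placeholder address: explain_pve_status "$(probe_pve_ui 192.0.2.10)"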

VM Won't Start

Symptoms:

  • VM shows as stopped
  • Error messages in logs

Solutions:

  1. Check VM configuration: qm config <vmid>
  2. Verify storage availability: pvesm status
  3. Check resource limits: pvesh get /nodes/<node>/status
  4. Review VM start errors: run qm start <vmid> from a shell to see the error directly, then search the journal: journalctl -b | grep <vmid>
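
When reporting a VM that will not start, it helps to capture the status, configuration, and storage state in one pass. The helper below is a hypothetical sketch (vm_diagnostics is not a Proxmox command); it only wraps the qm and pvesm calls listed above and rejects a non-numeric VM ID.

```shell
# Collect the basic facts about a VM that will not start.
vm_diagnostics() {
    vmid=$1
    # Proxmox VM IDs are numeric; refuse anything else before calling qm.
    case $vmid in
        ''|*[!0-9]*) echo "usage: vm_diagnostics <numeric-vmid>" >&2; return 2 ;;
    esac
    echo "== status =="
    qm status "$vmid"
    echo "== config =="
    qm config "$vmid"
    echo "== storage =="
    pvesm status
}
```

Example use on a Proxmox node: vm_diagnostics 100 > vm-100-diagnostics.txt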

Cluster Issues

Symptoms:

  • Nodes not showing in cluster
  • Quorum errors

Solutions:

  1. Check cluster status: pvecm status
  2. Verify network connectivity between nodes
  3. Check cluster configuration: cat /etc/pve/corosync.conf
  4. Restart cluster services: systemctl restart pve-cluster
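
For scripted health checks, the quorum state can be read from the pvecm status output. A minimal sketch, assuming the output contains a "Quorate: Yes" or "Quorate: No" line in its quorum information section; quorum_state is a hypothetical helper name.

```shell
# Read `pvecm status` output on stdin and report whether the node sees quorum.
quorum_state() {
    if grep -Eq '^[[:space:]]*Quorate:[[:space:]]*Yes'; then
        echo "quorate"
    else
        echo "not quorate"
    fi
}
```

Example use on a cluster node: pvecm status | quorum_state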

Azure Arc Issues

Agent Not Connecting

Symptoms:

  • Machine not appearing in Azure Portal
  • Connection errors in logs

Solutions:

  1. Check agent status: azcmagent status
  2. Verify network connectivity to Azure: curl -v https://management.azure.com
  3. Check agent logs: journalctl -u himdsd -f
  4. Re-register agent: azcmagent connect --resource-group <rg> --tenant-id <tenant> --subscription-id <subscription> --location <region>
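
To use the agent state in a script, the status field can be extracted from the agent's property listing. This sketch assumes azcmagent show prints a line of the form "Agent Status : Connected" (the spacing may differ between agent versions); arc_agent_status is a hypothetical helper name.

```shell
# Read `azcmagent show` output on stdin and print the value of the
# "Agent Status" field (for example Connected or Disconnected).
arc_agent_status() {
    sed -n 's/^Agent Status[[:space:]]*:[[:space:]]*//p' | head -n 1
}
```

Example use: azcmagent show | arc_agent_status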

Policy Not Applying

Symptoms:

  • Policies not showing as compliant
  • Assignment errors

Solutions:

  1. Verify agent is connected: azcmagent status
  2. Check policy assignment in Azure Portal
  3. Review agent details and service status: azcmagent show
  4. Re-assign policies if needed

Kubernetes Issues

Pods Not Starting

Symptoms:

  • Pods in Pending or CrashLoopBackOff state
  • Resource errors

Solutions:

  1. Check pod status: kubectl describe pod <pod-name>
  2. Check node resources: kubectl top nodes
  3. Review pod logs: kubectl logs <pod-name>
  4. Check events: kubectl get events --sort-by='.lastTimestamp'
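
To find which pods need attention before describing them one by one, the pod list can be filtered by its STATUS column. A minimal sketch, assuming kubectl is configured for the cluster; unhealthy_pods is a hypothetical helper name.

```shell
# List pods whose STATUS column is neither Running nor Completed.
# With -A and --no-headers the columns are:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
    kubectl get pods -A --no-headers |
        awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

This does not flag a pod that is Running but not yet ready. For a pod in CrashLoopBackOff, kubectl logs <pod-name> --previous shows the output of the container instance that crashed.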

Services Not Accessible

Symptoms:

  • Cannot reach service endpoints
  • Connection timeouts

Solutions:

  1. Check service configuration: kubectl get svc <service-name> -o yaml
  2. Verify endpoints: kubectl get endpoints <service-name>
  3. Check ingress configuration: kubectl get ingress
  4. Test from within cluster: kubectl run test --image=busybox --rm -it --restart=Never -- wget -O- <service-url>
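
A Service with no endpoints is a common cause of timeouts, usually because its selector matches no ready pods. The sketch below makes that case explicit; service_backends is a hypothetical helper, and the namespace defaults to default.

```shell
# Print the pod IPs backing a Service, or a hint when there are none.
service_backends() {
    svc=$1
    ns=${2:-default}
    ips=$(kubectl get endpoints "$svc" -n "$ns" \
        -o jsonpath='{.subsets[*].addresses[*].ip}')
    if [ -n "$ips" ]; then
        echo "$ips"
    else
        echo "no ready endpoints for $ns/$svc"
    fi
}
```

Example use: service_backends <service-name> <namespace>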

Network Issues

VLAN Not Working

Symptoms:

  • VMs cannot communicate on VLAN
  • Network isolation not working

Solutions:

  1. Verify VLAN configuration: cat /etc/network/interfaces
  2. Check bridge configuration: ip link show
  3. Verify VLAN tagging: qm config <vmid> | grep net
  4. Test VLAN connectivity: ping <vlan-ip>
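
The VLAN tag check in step 3 can be made easier to read by listing each NIC with its tag. A minimal sketch, assuming the usual netN: model=MAC,bridge=...,tag=N line format of qm config; vm_vlan_tags is a hypothetical helper name.

```shell
# Read `qm config <vmid>` output on stdin and print each NIC with its VLAN tag.
# NICs without a tag= option are reported as "untagged".
vm_vlan_tags() {
    awk -F': ' '/^net[0-9]+: / {
        tag = "untagged"
        n = split($2, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^tag=/) tag = substr(opts[i], 5)
        print $1, tag
    }'
}
```

Example use: qm config <vmid> | vm_vlan_tags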

DNS Resolution Issues

Symptoms:

  • Cannot resolve hostnames
  • Service discovery not working

Solutions:

  1. Check DNS configuration: cat /etc/resolv.conf
  2. Test DNS resolution: nslookup <hostname>
  3. Verify CoreDNS in Kubernetes: kubectl get pods -n kube-system | grep coredns
  4. Check DNS service: kubectl get svc kube-dns -n kube-system
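
When host resolution fails, it is worth confirming which resolvers the host is actually using. The sketch below lists them from a resolv.conf-style file; list_nameservers is a hypothetical helper name.

```shell
# Print the nameservers configured in a resolv.conf-style file.
# Defaults to /etc/resolv.conf when no path is given.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

For the Kubernetes side, a throwaway pod can test in-cluster resolution: kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default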

Storage Issues

Storage Not Available

Symptoms:

  • Cannot create VMs
  • Storage errors

Solutions:

  1. Check storage status: pvesm status
  2. Verify storage mounts: df -h
  3. Check storage permissions: ls -la /var/lib/vz/
  4. Review storage logs: journalctl -u pvestatd
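
On nodes with many storages, filtering the status table shows the problem entries at a glance. A minimal sketch, assuming the usual pvesm status column order (Name, Type, Status, Total, Used, Available, %); inactive_storages is a hypothetical helper name.

```shell
# List storages that `pvesm status` does not report as active.
# NR > 1 skips the header row; column 3 is the Status field.
inactive_storages() {
    pvesm status | awk 'NR > 1 && $3 != "active" { print $1 " (" $3 ")" }'
}
```

No output means every storage is reported as active.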

Performance Issues

Symptoms:

  • Slow VM performance
  • High I/O wait

Solutions:

  1. Check disk I/O: iostat -x 1
  2. Verify storage type (SSD vs HDD)
  3. Check for disk errors: dmesg | grep -i error
  4. Consider storage optimization settings
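
If iostat (from the sysstat package) is not installed, a rough I/O wait figure can be derived from /proc/stat. Note that this swaps in a different measurement: the value is an average since boot, not the per-interval view that iostat -x 1 gives. iowait_percent is a hypothetical helper name.

```shell
# Print the share of CPU time spent in iowait since boot, as a percentage.
# Reads /proc/stat-format text on stdin and uses the aggregate "cpu" line:
#   cpu user nice system idle iowait irq softirq steal ...
iowait_percent() {
    awk '$1 == "cpu" {
        total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
        if (total > 0) printf "%.1f\n", 100 * $6 / total
    }'
}
```

Example use: iowait_percent < /proc/stat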

Cloudflare Tunnel Issues

Tunnel Not Connecting

Symptoms:

  • Services not accessible externally
  • Tunnel errors in logs

Solutions:

  1. Check tunnel status: cloudflared tunnel info <tunnel-name>
  2. Verify tunnel token: echo $CLOUDFLARE_TUNNEL_TOKEN (this prints the secret, so avoid shared terminals and screen recordings)
  3. Check tunnel logs: journalctl -u cloudflared -f
  4. Test tunnel connection: cloudflared tunnel run <tunnel-name>
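
To confirm the token is present without displaying it, the check can report only whether the variable is set and how long it is. A minimal sketch; tunnel_token_state is a hypothetical helper name, and CLOUDFLARE_TUNNEL_TOKEN is the variable name used in this guide.

```shell
# Report whether the tunnel token is set without echoing the secret itself.
tunnel_token_state() {
    if [ -n "${CLOUDFLARE_TUNNEL_TOKEN:-}" ]; then
        echo "set (${#CLOUDFLARE_TUNNEL_TOKEN} characters)"
    else
        echo "missing"
    fi
}
```

The variable must be set in the same shell or service environment that starts cloudflared; a token exported in an interactive session is not visible to a systemd unit.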

Zero Trust Not Working

Symptoms:

  • Access policies not applying
  • SSO not working

Solutions:

  1. Verify Zero Trust configuration in Cloudflare Dashboard
  2. Check policy rules and conditions
  3. Review access logs in Cloudflare Dashboard
  4. Test with different user accounts

General Troubleshooting Steps

  1. Check Logs: Always review relevant logs first
  2. Verify Configuration: Ensure all configuration files are correct
  3. Test Connectivity: Verify network connectivity between components
  4. Check Resources: Ensure sufficient CPU, memory, and storage
  5. Review Documentation: Check relevant documentation and runbooks
  6. Search Issues: Look for similar issues in logs or documentation

Getting Help

If you cannot resolve an issue:

  1. Review the relevant runbook in docs/operations/runbooks/
  2. Check the troubleshooting guide for your specific component
  3. Review logs and error messages carefully
  4. Document the issue with steps to reproduce
  5. Check for known issues in the project repository

Additional Resources