Top AKS Troubleshooting Steps for Common Issues

AKS Troubleshooting Decision Tree

AKS ISSUE
├── 1. Is the app unreachable?
│   │
│   ├── External access issue
│   │   ├── Check DNS
│   │   ├── Check Public/Internal Load Balancer
│   │   ├── Check Ingress Controller
│   │   ├── Check Service
│   │   └── Check Pods
│   │
│   └── Internal access issue
│       ├── Check Service name
│       ├── Check CoreDNS
│       ├── Check Network Policy
│       └── Check Pod-to-Pod connectivity
├── 2. Are pods not running?
│   │
│   ├── Pending
│   │   ├── Not enough CPU/memory
│   │   ├── Taints/tolerations issue
│   │   ├── Node selector/affinity issue
│   │   ├── Cluster autoscaler maxed out
│   │   └── Subnet IP exhaustion
│   │
│   ├── ImagePullBackOff
│   │   ├── Wrong image name/tag
│   │   ├── ACR permission missing
│   │   └── Network/DNS issue to registry
│   │
│   └── CrashLoopBackOff
│       ├── App bug
│       ├── Missing secret/config map
│       ├── Bad env variable
│       └── Probe misconfigured
├── 3. Are nodes unhealthy?
│   │
│   ├── Node NotReady
│   │   ├── VMSS health
│   │   ├── Kubelet issue
│   │   ├── Disk pressure
│   │   ├── Memory pressure
│   │   └── Network issue
│   │
│   └── Node pool issue
│       ├── Upgrade failed
│       ├── Scale operation failed
│       └── Quota/capacity issue
├── 4. Is it a networking issue?
│   │
│   ├── DNS failure
│   │   ├── CoreDNS
│   │   ├── Private DNS zone
│   │   └── Custom DNS forwarders
│   │
│   ├── Routing failure
│   │   ├── UDR
│   │   ├── Azure Firewall
│   │   ├── NSG
│   │   └── Route table association
│   │
│   └── Private Endpoint failure
│       ├── DNS resolves to private IP?
│       ├── VNet peering working?
│       ├── NSG allows traffic?
│       └── Private endpoint approved?
├── 5. Is it identity/security?
│   │
│   ├── Azure resource access failing
│   │   ├── Managed identity assigned?
│   │   ├── RBAC role correct?
│   │   └── Workload Identity configured?
│   │
│   └── Kubernetes access failing
│       ├── Azure AD login?
│       ├── Kubernetes RBAC?
│       └── Namespace permissions?
└── 6. Is it platform-wide?
    ├── Multiple services affected
    │   ├── Check Azure Service Health
    │   ├── Check regional outage
    │   └── Check dependency outage
    └── Only one app affected
        ├── Check recent deployment
        ├── Rollback if needed
        └── Compare config/secrets
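
Branch 2 of the tree ("Are pods not running?") can be sketched as a small lookup in Python. This is a minimal illustration, not a complete diagnostic tool. The names POD_STATE_CAUSES, pod_state, and candidate_causes are hypothetical. The sketch assumes the pod is a dict shaped like the output of `kubectl get pod <pod> -o json`. It also assumes that ErrImagePull (the status that usually precedes ImagePullBackOff) belongs in the same branch, which the tree itself does not state.

```python
from typing import Dict, List

# Branch 2 of the decision tree, keyed by the pod state shown by kubectl.
IMAGE_PULL_CAUSES = [
    "Wrong image name/tag",
    "ACR permission missing",
    "Network/DNS issue to registry",
]

POD_STATE_CAUSES: Dict[str, List[str]] = {
    "Pending": [
        "Not enough CPU/memory",
        "Taints/tolerations issue",
        "Node selector/affinity issue",
        "Cluster autoscaler maxed out",
        "Subnet IP exhaustion",
    ],
    "ImagePullBackOff": IMAGE_PULL_CAUSES,
    # Assumption: ErrImagePull is treated as the same branch.
    "ErrImagePull": IMAGE_PULL_CAUSES,
    "CrashLoopBackOff": [
        "App bug",
        "Missing secret/config map",
        "Bad env variable",
        "Probe misconfigured",
    ],
}


def pod_state(pod: dict) -> str:
    """Return a container's waiting reason if present, else the pod phase.

    The waiting reason is checked first because a pod stuck in
    ImagePullBackOff still reports the phase "Pending".
    """
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            return waiting["reason"]
    return status.get("phase", "Unknown")


def candidate_causes(pod: dict) -> List[str]:
    """Return the causes the tree lists for this pod's state (may be empty)."""
    return POD_STATE_CAUSES.get(pod_state(pod), [])


if __name__ == "__main__":
    # Hypothetical pod, trimmed to the fields the sketch reads.
    sample = {
        "status": {
            "phase": "Pending",
            "containerStatuses": [
                {"state": {"waiting": {"reason": "ImagePullBackOff"}}}
            ],
        }
    }
    for cause in candidate_causes(sample):
        print(cause)
```

An empty list means the pod's state is outside branch 2, so the next step is another branch of the tree (for example, node health or networking).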

Memorize This Shortcut

DNS → LB/Ingress → Service → Pod → Node → Network → Identity → Azure
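
The shortcut is an ordered walk: check each layer in turn and stop at the first one that is not confirmed healthy. A minimal Python sketch of that idea follows. The names LAYERS, FIRST_LOOK, and next_layer_to_investigate are hypothetical, and the commands in FIRST_LOOK are only illustrative starting points chosen for this sketch, not steps prescribed by the shortcut. Values in angle brackets are placeholders.

```python
from typing import Dict, Optional

# The shortcut, in order.
LAYERS = ["DNS", "LB/Ingress", "Service", "Pod", "Node", "Network", "Identity", "Azure"]

# Assumption: one illustrative first command per layer.
FIRST_LOOK: Dict[str, str] = {
    "DNS": "nslookup <hostname>",
    "LB/Ingress": "kubectl get ingress,svc -A",
    "Service": "kubectl get endpoints <service> -n <namespace>",
    "Pod": "kubectl describe pod <pod> -n <namespace>",
    "Node": "kubectl describe node <node>",
    "Network": "kubectl get networkpolicy -A",
    "Identity": "kubectl auth can-i --list -n <namespace>",
    "Azure": "az aks show --resource-group <rg> --name <cluster>",
}


def next_layer_to_investigate(results: Dict[str, bool]) -> Optional[str]:
    """Return the first layer, in shortcut order, not confirmed healthy.

    `results` maps a layer name to True (checked and healthy) or False
    (checked and failing). A layer missing from `results` counts as not
    yet checked, so the walk stops there too. Returns None when every
    layer is confirmed healthy.
    """
    for layer in LAYERS:
        if results.get(layer) is not True:
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical findings: DNS and the load balancer look fine,
    # but the Service has a problem.
    findings = {"DNS": True, "LB/Ingress": True, "Service": False}
    layer = next_layer_to_investigate(findings)
    if layer is not None:
        print(layer, "->", FIRST_LOOK[layer])
```

Stopping at the first unconfirmed layer keeps the investigation from skipping ahead: a failing Service is examined before anyone looks at nodes or Azure networking.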

Interview Line

“I troubleshoot AKS layer by layer: first the access path, then Kubernetes objects, then node health, then Azure networking, identity, and platform dependencies.”
