AKS Troubleshooting Decision Tree
AKS ISSUE
│
├── 1. Is the app unreachable?
│   │
│   ├── External access issue
│   │   ├── Check DNS
│   │   ├── Check Public/Internal Load Balancer
│   │   ├── Check Ingress Controller
│   │   ├── Check Service
│   │   └── Check Pods
│   │
│   └── Internal access issue
│       ├── Check Service name
│       ├── Check CoreDNS
│       ├── Check Network Policy
│       └── Check Pod-to-Pod connectivity
│
├── 2. Are pods not running?
│   │
│   ├── Pending
│   │   ├── Not enough CPU/memory
│   │   ├── Taints/tolerations issue
│   │   ├── Node selector/affinity issue
│   │   ├── Cluster autoscaler maxed out
│   │   └── Subnet IP exhaustion
│   │
│   ├── ImagePullBackOff
│   │   ├── Wrong image name/tag
│   │   ├── ACR permission missing
│   │   └── Network/DNS issue to registry
│   │
│   └── CrashLoopBackOff
│       ├── App bug
│       ├── Missing secret/config map
│       ├── Bad env variable
│       └── Probe misconfigured
│
├── 3. Are nodes unhealthy?
│   │
│   ├── Node NotReady
│   │   ├── VMSS health
│   │   ├── Kubelet issue
│   │   ├── Disk pressure
│   │   ├── Memory pressure
│   │   └── Network issue
│   │
│   └── Node pool issue
│       ├── Upgrade failed
│       ├── Scale operation failed
│       └── Quota/capacity issue
│
├── 4. Is it a networking issue?
│   │
│   ├── DNS failure
│   │   ├── CoreDNS
│   │   ├── Private DNS zone
│   │   └── Custom DNS forwarders
│   │
│   ├── Routing failure
│   │   ├── UDR
│   │   ├── Azure Firewall
│   │   ├── NSG
│   │   └── Route table association
│   │
│   └── Private Endpoint failure
│       ├── DNS resolves to private IP?
│       ├── VNet peering working?
│       ├── NSG allows traffic?
│       └── Private endpoint approved?
│
├── 5. Is it identity/security?
│   │
│   ├── Azure resource access failing
│   │   ├── Managed identity assigned?
│   │   ├── RBAC role correct?
│   │   └── Workload Identity configured?
│   │
│   └── Kubernetes access failing
│       ├── Azure AD login?
│       ├── Kubernetes RBAC?
│       └── Namespace permissions?
│
└── 6. Is it platform-wide?
    │
    ├── Multiple services affected
    │   ├── Check Azure Service Health
    │   ├── Check regional outage
    │   └── Check dependency outage
    │
    └── Only one app affected
        ├── Check recent deployment
        ├── Rollback if needed
        └── Compare config/secrets
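Branch 2 of the tree maps a pod state to a short list of likely causes, which is easy to hold as a lookup table. The following is a minimal sketch of that idea, not a diagnostic tool: the names `POD_STATE_CAUSES` and `causes_for` are hypothetical, and the entries are copied from the tree rather than read from a live cluster.

```python
# Hypothetical lookup table built from branch 2 ("Are pods not running?") of the tree.
# Keys are the pod states named in the tree; values are the causes listed under each.
POD_STATE_CAUSES = {
    "Pending": [
        "Not enough CPU/memory",
        "Taints/tolerations issue",
        "Node selector/affinity issue",
        "Cluster autoscaler maxed out",
        "Subnet IP exhaustion",
    ],
    "ImagePullBackOff": [
        "Wrong image name/tag",
        "ACR permission missing",
        "Network/DNS issue to registry",
    ],
    "CrashLoopBackOff": [
        "App bug",
        "Missing secret/config map",
        "Bad env variable",
        "Probe misconfigured",
    ],
}


def causes_for(pod_state: str) -> list[str]:
    """Return the candidate causes for a pod state.

    Returns an empty list when the tree has no branch for that state.
    """
    # Copy the list so callers cannot mutate the table by accident.
    return list(POD_STATE_CAUSES.get(pod_state, []))
```

Calling `causes_for("ImagePullBackOff")` returns the three registry-related causes in the order the tree lists them; a state the tree does not cover, such as `"Running"`, returns an empty list.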
Memorize This Shortcut
DNS → LB/Ingress → Service → Pod → Node → Network → Identity → Azure
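The shortcut is an ordered checklist: walk the layers left to right and stop at the first one that fails. A small sketch of that ordering follows. It assumes the health of each layer has already been determined by some other means; `LAYERS` and `first_failing_layer` are illustrative names, not part of any AKS or Kubernetes tooling.

```python
from typing import Mapping, Optional

# Layers in the order given by the shortcut.
LAYERS = ["DNS", "LB/Ingress", "Service", "Pod", "Node", "Network", "Identity", "Azure"]


def first_failing_layer(results: Mapping[str, bool]) -> Optional[str]:
    """Return the first layer, in shortcut order, whose check failed.

    `results` maps a layer name to True (healthy) or False (failed).
    Layers missing from `results` are treated as not yet checked and skipped.
    Returns None if no checked layer failed.
    """
    for layer in LAYERS:
        if results.get(layer) is False:
            return layer
    return None
```

For example, `first_failing_layer({"DNS": True, "Service": False, "Node": False})` returns `"Service"`, because Service comes before Node in the shortcut, so that is where investigation would start.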
Interview Line
“I troubleshoot AKS layer by layer: first access path, then Kubernetes objects, then node health, then Azure networking, identity, and platform dependencies.”