OpenShift (OCP) interview

For an OpenShift (OCP) interview in 2026, you should expect questions that move beyond basic Kubernetes concepts and focus on enterprise operations, automation (Operators), and security.

Here is a curated list of high-value interview questions categorized by role and complexity.


1. Architectural Concepts

  • What is the role of the Cluster Version Operator (CVO)?
    • Answer: The CVO is the heart of OCP 4.x upgrades. It monitors the “desired state” of the cluster’s operators (the “payload”) and ensures the cluster is updated in a safe, coordinated manner across all components.
  • Explain the difference between an Infrastructure Node and a Worker Node.
    • Answer: Infrastructure nodes are used to host “cluster-level” services like the Router (Ingress), Monitoring (Prometheus/Grafana), and Registry. By labeling nodes as infra, companies can often save on Red Hat subscription costs, as these nodes typically don’t require the same licensing as nodes running application workloads.
  • What is the “Etcd Quorum” and why is it important in OCP?
    • Answer: etcd requires a majority (quorum) of members — (n/2)+1 — to accept writes, which is why OpenShift runs an odd number of control plane nodes (usually 3). With three masters you can tolerate losing one; lose two and etcd stops accepting writes to prevent data corruption and split-brain.

2. Networking & Traffic (The Gateway API Era)

  • Explain Ingress vs. Route vs. Gateway API.
    • Key Focus: Interviewers want to know if you understand that Routes are OCP-native, Ingress is K8s-standard, and Gateway API is the future standard for advanced traffic management (canary, mirroring, etc.).
  • How does “Service Serving Certificate Secrets” work in OCP?
    • Answer: OCP can automatically generate a TLS certificate for a Service. You annotate a Service with service.beta.openshift.io/serving-cert-secret-name. OCP then creates a secret containing a cert/key signed by the internal Cluster CA, allowing for easy end-to-end encryption.
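A minimal sketch of the annotation in practice (the Service name, secret name, and port are illustrative):

```yaml
# Hypothetical Service; OCP populates the "my-app-tls" secret with a
# cert/key pair signed by the internal service CA.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: my-app-tls
spec:
  selector:
    app: my-app
  ports:
    - port: 8443
      targetPort: 8443
```

Pods can then mount the my-app-tls secret and serve TLS with a certificate that in-cluster clients already trust via the service CA bundle.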

3. Security (The “Hardest” Category)

  • Scenario: A developer says their pod won’t start because of a “Security Context” error. What do you check?
    • Answer: I would check the Security Context Constraints (SCC). By default, OCP runs pods with the restricted-v2 SCC, which prevents running as root. If the pod requires root or host access, I’d check if the ServiceAccount has been granted a more permissive SCC like anyuid or privileged.
  • What are NetworkPolicies vs. EgressFirewalls?
    • Answer: NetworkPolicies control traffic between pods inside the cluster (East-West). EgressFirewalls (part of OCP’s OVN-Kubernetes) control traffic leaving the cluster to external IPs or CIDR blocks (North-South).
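As an illustrative sketch (the namespace and CIDRs are made up), an OVN-Kubernetes EgressFirewall looks like this; note that the object must be named default and is namespace-scoped:

```yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default          # OVN-Kubernetes requires this exact name
  namespace: my-project  # illustrative namespace
spec:
  egress:
    - type: Allow
      to:
        cidrSelector: 203.0.113.0/24   # example external CIDR to permit
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0        # block all other egress
```

Rules are evaluated in order, so the broad Deny goes last.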

4. Troubleshooting & Operations

  • How do you recover a cluster if the Control Plane certificates have expired?
    • Answer: OCP 4.x automatically rotates most control plane certificates and can recover from expiry by regenerating them on startup; the admin’s main task is to approve the pending node CSRs (Certificate Signing Requests) with oc adm certificate approve so kubelets can rejoin the cluster. Significant clock drift can break this automatic recovery, so verify NTP/chrony on the nodes.
  • Describe the Source-to-Image (S2I) workflow.
    • Answer: S2I takes source code from Git, injects it into a “builder image” (like Node.js or Java), and outputs a ready-to-run container image. It simplifies the CI/CD process for developers who don’t want to write Dockerfiles.

5. Advanced / 2026 Trends

  • What is OpenShift Virtualization (KubeVirt)?
    • Answer: It allows you to run legacy Virtual Machines (VMs) as pods on OpenShift. This is critical for “modernizing” apps where one part is a container and the other is a legacy Windows or Linux VM that can’t be containerized yet.
  • How does Red Hat Advanced Cluster Management (RHACM) help in a multi-cluster setup?
    • Answer: RHACM provides a single pane of glass to manage security policies, application placement, and cluster lifecycle (creation/deletion) across multiple OCP clusters on AWS, Azure, and on-prem.

Quick Tip for the Interview

Whenever you answer, use the phrase “Operator-led design.” OpenShift 4 is built entirely on Operators. If the interviewer asks, “How do I fix the registry?” the best answer starts with, “I would check the status of the Image Registry Operator using oc get clusteroperator.” This shows you understand the fundamental architecture of the platform.

As an OpenShift Administrator, your interview will focus heavily on cluster stability, lifecycle management (upgrades), security enforcement, and the “Day 2” operations that keep an enterprise cluster running.

Here are the top admin-focused interview questions for 2026, divided by functional area.


1. Cluster Lifecycle & Maintenance

  • How does the Cluster Version Operator (CVO) manage upgrades, and what do you check if an upgrade hangs at 57%?
    • Answer: The CVO coordinates with all other cluster operators to reach a specific “desired version.” If it hangs, I check oc get clusteroperators to see which specific operator is degraded. Usually, it’s the Machine Config Operator (MCO) waiting for nodes to drain or the Authentication Operator having issues with etcd.
  • What is the “Must-Gather” tool, and when would you use it?
    • Answer: oc adm must-gather is the primary diagnostic tool. It launches a pod that collects logs, CRD states, and operating system debugging info. I use it before opening a Red Hat support ticket or when a complex issue involves multiple operators.
  • Explain how to back up and restore the etcd database.
    • Answer: I run the /usr/local/bin/cluster-backup.sh script on a control plane node, which saves an etcd snapshot plus the static pod resources. For restoration, I stop the static pods for the API server and etcd, then run cluster-restore.sh with the backup on a single control plane node first to re-establish a quorum before the other members rejoin.

2. Node & Infrastructure Management

  • What is a MachineConfigPool (MCP), and why would you pause it?
    • Answer: An MCP groups nodes (like master or worker) so the MCO can apply configurations to them. I would pause an MCP during a sensitive maintenance window or when troubleshooting a configuration change that I don’t want to roll out to all nodes at once.
  • How do you add a custom SSH key or a CronJob to the underlying RHCOS nodes?
    • Answer: You don’t log into the nodes manually. You create a MachineConfig YAML. The MCO then detects this, reboots the nodes (if necessary), and applies the change to the immutable filesystem.
  • What happens if a node enters a NotReady state?
    • Answer: First, I check node pressure (CPU/Memory/Disk). Then I check the kubelet and crio services on the node using oc debug node/<node-name>. I also check for network reachability between the node and the Control Plane.
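The MachineConfig approach from the SSH-key question above can be sketched like this (the key value and name are placeholders):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-ssh                                 # 99- prefix applies after defaults
  labels:
    machineconfiguration.openshift.io/role: worker    # targets the worker MCP
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
        - name: core                                  # the RHCOS default user
          sshAuthorizedKeys:
            - ssh-ed25519 AAAA...placeholder user@example
```

The MCO rolls this out node by node through the worker MachineConfigPool, cordoning and rebooting as needed.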

3. Networking & Security

  • What is the benefit of OVN-Kubernetes over the legacy OpenShift SDN?
    • Answer: OVN-K is the default in 4.x. It supports modern features like IPsec encryption for pod-to-pod traffic, smarter load balancing, and Egress IPs that let specific projects exit the cluster via a fixed IP address for firewall allow-listing.
  • A user is complaining they can’t reach a service in another project. What do you check?
    • Answer:
      1. NetworkPolicies: Is there a policy blocking “Cross-Namespace” traffic?
      2. Service/Endpoints: Does the Service have active Endpoints (oc get endpoints)?
      3. Namespace labels: If using a high-isolation network plugin, do the namespaces have the correct labels to “talk” to each other?
  • How do you restrict a specific group of users from creating LoadBalancer type services?
    • Answer: RBAC alone can’t do this cleanly, because it operates on resources and verbs, not on field values like type: LoadBalancer. The standard approach is admission control: a validating admission policy or a policy engine such as OPA Gatekeeper or Kyverno that denies Service objects of type LoadBalancer for that group.

4. Storage & Capacity Planning

  • How do you handle “Volume Expansion” if a database runs out of space?
    • Answer: If the underlying StorageClass supports allowVolumeExpansion: true, I simply edit the PersistentVolumeClaim (PVC) and increase the storage value. OpenShift and the CSI driver handle the resizing of the file system on the fly.
  • What is the difference between ReadWriteOnce (RWO) and ReadWriteMany (RWX)?
    • Answer: RWO allows only one node to mount the volume (good for databases). RWX allows multiple nodes/pods to mount it simultaneously (required for shared file storage like NFS or ODF).
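The volume-expansion flow from the previous question can be sketched as follows; the StorageClass name and provisioner are illustrative:

```yaml
# The StorageClass must opt in to expansion
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd              # illustrative name
provisioner: csi.example.com  # placeholder CSI driver
allowVolumeExpansion: true
---
# Then simply raise the PVC request; the CSI driver resizes the
# underlying volume and filesystem online
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi          # was 50Gi; editing this triggers expansion
```

Note that PVC storage requests can only be increased, never shrunk.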

5. Scenario-Based: “The Midnight Call”

  • Scenario: The Web Console is down, and oc commands are timing out. Where do you start?
    • Answer: This sounds like an API Server or etcd failure. I would:
      1. Log into a Control Plane node directly via SSH.
      2. Check the status of static pods in /etc/kubernetes/manifests.
      3. Run crictl ps to see if the kube-apiserver or etcd containers are crashing.
      4. Check the node’s disk space (etcd often fails if the disk is 100% full).

Pro-Tip for Admin Interviews:

In 2026, emphasize GitOps. Mention that you prefer managing cluster configurations (like HTPasswd providers or Quota objects) via ArgoCD rather than manual oc apply commands. This shows you are an admin who values Idempotency and Disaster Recovery.

Flux (or FluxCD)

Flux (or FluxCD) is a GitOps continuous delivery tool for Kubernetes. Here’s a concise breakdown:


What it does

Flux is an operator that runs in your Kubernetes cluster, constantly comparing the cluster’s live state to the state defined in your Git repo. If they differ, Flux automatically makes changes to the cluster to match the repo. In other words, Git is the single source of truth — you push a change to Git, Flux detects it and applies it to the cluster automatically, with no manual kubectl apply needed.


How it works — core components

Core components of FluxCD (the GitOps Toolkit) include the Source Controller, Kustomize Controller, Helm Controller, and Notification Controller. Each is a separate Kubernetes controller responsible for one concern:

  • Source Controller — watches Git repos, Helm repos, OCI registries, and S3 buckets for changes
  • Kustomize Controller — applies raw YAML and Kustomize overlays to the cluster
  • Helm Controller — manages HelmRelease objects (declarative Helm chart deployments)
  • Notification Controller — sends alerts to Slack, Teams, etc. when syncs succeed or fail
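A minimal sketch of the Source and Kustomize controllers working together (the repository URL and paths are illustrative):

```yaml
# Watched by the source-controller
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  interval: 1m                      # how often to poll Git
  url: https://example.com/org/repo.git
  ref:
    branch: main
---
# Reconciled by the kustomize-controller
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-apps
  namespace: flux-system
spec:
  interval: 10m                     # reconciliation loop interval
  prune: true                       # delete resources removed from Git
  path: ./deploy
  sourceRef:
    kind: GitRepository
    name: my-repo
```

The source-controller fetches and packages the repo contents; the kustomize-controller applies them and reverts drift.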

Key characteristics

  • Pull-based model: Flux enables pure pull-based GitOps application deployments — no access to clusters is needed by the source repo or by any other cluster. This is more secure than push-based pipelines where your CI system needs cluster credentials.
  • Drift detection: If your live cluster diverges from Git (e.g., due to manual edits), Flux will detect the drift and revert it, ensuring deterministic deployments.
  • Kubernetes-native: Flux v2 is built from the ground up to use Kubernetes’ API extension system. Everything is a CRD — GitRepository, Kustomization, HelmRelease, etc.
  • Security-first: Flux uses true Kubernetes RBAC via impersonation and supports multiple Git repositories. It follows a pull-based, least-privilege model, adheres to Kubernetes security policies, and integrates tightly with security tooling.
  • Multi-cluster: Flux can use one Kubernetes cluster to manage apps in either the same or other clusters, spin up additional clusters, and manage cluster fleets.

CNCF standing & adoption

Flux is a Cloud Native Computing Foundation (CNCF) graduated project, used in production by various organisations and cloud providers. Notable users include Deutsche Telekom (managing 200+ clusters with just 10 engineers), the US Department of Defense, and Microsoft Azure (which uses Flux natively in AKS and Azure Arc).


Flux vs. Argo CD (the main alternative)

Flux CD is highly composable — use only the controllers you need. It’s preferred by teams who already think in CRDs and reconciliation loops, and is excellent for infrastructure-as-code and complex dependency handling. The main trade-off is that Flux has some drawbacks such as lack of a native UI and a steep learning curve. Argo CD is the better choice if your team wants a rich visual dashboard out of the box.


Relation to OCP

Flux is commonly used with OpenShift as the GitOps engine for managing cluster configuration and application deployments. Red Hat also ships OpenShift GitOps (based on Argo CD) as an official operator, so in OCP environments you’ll encounter both — Flux tends to be chosen by platform engineering teams who want tighter Kubernetes-native control, while OpenShift GitOps is the supported out-of-the-box option from Red Hat.

Here’s a thorough breakdown of how Flux integrates with OCP:


Installation — two options

Option 1: Flux Operator via OperatorHub (recommended)

Flux can be installed on a Red Hat OpenShift cluster directly from OperatorHub using the Flux Operator, an open-source project in the Flux ecosystem that provides a declarative API for lifecycle management of the Flux controllers on OpenShift.

Once installed, you declare a FluxInstance CR with cluster.type: openshift:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: "ghcr.io/fluxcd"
  cluster:
    type: openshift # ← tells Flux it's on OCP
    multitenant: true
    networkPolicy: true
  sync:
    kind: GitRepository
    url: "https://my-git-server.com/my-org/my-fleet.git"
    ref: "refs/heads/main"
    path: "clusters/my-cluster"

Option 2: flux bootstrap CLI

To install via CLI, use the flux bootstrap command, which works with GitHub, GitLab, and generic Git providers. You need cluster-admin privileges to install Flux on OpenShift.


The OCP-specific challenge: SCCs

OCP’s default restricted-v2 SCC blocks containers from running as root — and Flux controllers, like many Kubernetes tools, need specific adjustments to run cleanly. The official integration handles this by:

  • Shipping a scc.yaml manifest that grants Flux controllers the correct non-root SCC permissions
  • Patching the Kustomization to remove the default SecComp profile and enforce the correct UID expected by Flux images, preventing OCP from altering the container user

The cluster.type: openshift flag in the FluxInstance spec automatically applies these adjustments — no manual SCC patching needed when using the Flux Operator.


What the integration looks like end-to-end

┌─────────────────────────────────────────────────────┐
│ Git Repository │
│ clusters/my-cluster/ │
│ ├── flux-system/ (Flux bootstrap manifests) │
│ ├── namespaces/ (OCP Projects) │
│ ├── rbac/ (Roles, RoleBindings, SCCs) │
│ └── apps/ (Deployments, Routes, etc.) │
└────────────────────┬────────────────────────────────┘
│ pull (every ~1 min)
┌─────────────────────────────────────────────────────┐
│ OCP Cluster (flux-system ns) │
│ source-controller → watches Git/OCI/Helm repos │
│ kustomize-controller→ applies YAML/Kustomize │
│ helm-controller → manages HelmReleases │
│ notification-ctrl → sends alerts to Slack etc. │
└─────────────────────────────────────────────────────┘

Multi-tenancy on OCP

When multitenant: true is set, Flux uses true Kubernetes RBAC via impersonation — meaning each tenant’s Kustomization runs under its own service account, scoped to its own namespace. This maps naturally to OCP Projects, where each team or app gets an isolated namespace with its own SCC and RBAC policies.

The pattern looks like this in Git:

# tenants/team-a/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-apps
  namespace: flux-system
spec:
  interval: 10m                         # required reconciliation interval
  prune: true
  serviceAccountName: team-a-reconciler # impersonates this SA
  targetNamespace: team-a               # deploys into this OCP Project
  path: ./tenants/team-a/apps
  sourceRef:
    kind: GitRepository
    name: fleet-repo

Each team-a-reconciler service account only has permissions within team-a's namespace — enforced by both RBAC and the namespace's SCC policies.


Key considerations for OCP + Flux

| Topic | Detail |
| --- | --- |
| Testing | Flux v2.3 was the first release end-to-end tested on OpenShift. |
| Operator lifecycle | When a subscription is applied, OpenShift’s Operator Lifecycle Manager (OLM) automatically handles upgrading Flux. |
| Enterprise support | Backwards compatibility with older versions of Kubernetes and OpenShift is offered by vendors such as ControlPlane that provide enterprise support for Flux. |
| vs. OpenShift GitOps | Red Hat ships its own GitOps operator (based on Argo CD) as the officially supported option. Flux on OCP is community/third-party supported, preferred by teams who want a more Kubernetes-native, CLI-driven approach. |
| NetworkPolicy | Setting networkPolicy: true in the FluxInstance spec automatically creates NetworkPolicies for the flux-system namespace, restricting controller-to-controller traffic. |

OCP (OpenShift Container Platform) Security Best Practices


Identity & Access Control

  • RBAC & Least Privilege: Every user, service account, and process should possess only the absolute minimum permissions needed. Isolate workloads using distinct service accounts, each bound to Roles containing relevant permissions, and avoid attaching sensitive permissions directly to user accounts.
  • Strong Authentication: Implement robust authentication mechanisms such as multi-factor authentication (MFA) or integrate with existing identity management systems to prevent unauthorized access.
  • Audit Regularly: Regularly audit Roles, ClusterRoles, RoleBindings, and SCC usage to ensure they remain aligned with the principle of least privilege and current needs.
  • Avoid kubeadmin: Don’t use the default kubeadmin superuser account in production — integrate with an enterprise identity provider instead.

Cluster & Node Hardening

  • Use RHCOS for nodes: Run the most recent Red Hat Enterprise Linux CoreOS (RHCOS) on all OCP cluster nodes. RHCOS is designed to be as immutable as possible, and any changes to a node should be made through the Machine Config Operator rather than by direct user access.
  • Control plane HA: Configure a minimum of three control-plane nodes so the cluster remains available if a node fails.
  • Network isolation: Strict network isolation prevents unauthorized external ingress to OpenShift cluster API endpoints, nodes, or pod containers. The DNS, Ingress Controller, and API server can be set to private after installation.

Container Image Security

  • Scan images continuously: Use image scanning tools to detect vulnerabilities and malware within container images. Use trusted container images from reputable sources and regularly update them to include the latest security patches.
  • Policy enforcement: Define and enforce security policies for container images, ensuring that only images meeting specific criteria — such as being signed by trusted sources or containing no known vulnerabilities — are deployed.
  • No root containers: OpenShift has stricter security policies than vanilla Kubernetes — running a container as root is forbidden by default.

Security Context Constraints (SCCs)

OpenShift uses Security Context Constraints (SCCs) to give your cluster a strong security baseline. By default, OpenShift prevents containers from accessing protected Linux features such as shared file systems, root access, and certain Linux capabilities like KILL. Always use the most restrictive SCC that still allows your workload to function.


Network Security

  • Zero-trust networking: Apply granular access controls between individual pods, namespaces, and services in Kubernetes clusters and external resources, including databases, internal applications, and third-party cloud APIs.
  • Use NetworkPolicies to restrict east-west traffic between namespaces and pods by default.
  • Egress control: Use Egress Gateways or policies to control outbound traffic from pods.
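A common starting point for the default-deny posture described above is a policy that selects every pod in a namespace but allows nothing:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-project   # illustrative namespace
spec:
  podSelector: {}         # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress              # no rules listed, so all traffic is denied
```

From there, you add narrow allow policies for only the flows each app actually needs.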

Compliance & Monitoring

  • Compliance Operator: The OpenShift Compliance Operator supports profiles for standards including PCI-DSS versions 3.2.1 and 4.0, enabling automated compliance scanning across the cluster.
  • Continuous monitoring: Use robust logging and monitoring solutions to gain visibility into container behavior, network flows, and resource utilization. Set up alerts for abnormalities like unusually high memory or CPU usage that could indicate compromise.
  • Track CVEs proactively: Security, bug fix, and enhancement updates for OCP are released as asynchronous errata through the Red Hat Network. Registry images should be scanned upon notification and patched if affected by new vulnerabilities.

Namespace & Project Isolation

Using projects and namespaces simplifies management and enhances security by limiting the potential impact of compromised applications, segregating resources based on application/team/environment, and ensuring users can only access the resources they are authorized to use.


Key tools to leverage: Advanced Cluster Security (ACS/StackRox), Compliance Operator, OpenShift built-in image registry with scanning, and NetworkPolicy/Calico for zero-trust networking.

SCCs (Security Context Constraints) are OpenShift’s pod-level security gate — separate from RBAC. The golden rules are: always start from restricted-v2, never modify built-in SCCs, create custom ones when needed, assign them to dedicated service accounts (not users), and never grant anyuid or privileged to app workloads.

RBAC controls what users and service accounts can do via the API. The key principle is deny-by-default — bind roles to groups rather than individuals, keep bindings namespace-scoped unless cross-namespace is genuinely needed, audit regularly with oc auth can-i and oc policy who-can, and never touch default system ClusterRoles.

Network Policies implement microsegmentation at the pod level. The pattern is always: default-deny first, then explicitly open only what’s needed — ingress from the router, traffic from the same namespace, and specific app-to-app flows. For egress, use EgressFirewall (OVN-Kubernetes) to allow-list specific CIDRs or DNS names and block everything else.
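The "ingress from the router" exception is typically expressed with the policy-group label that OpenShift sets on its ingress namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
spec:
  podSelector: {}                   # applies to all pods in this namespace
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress   # the router's namespace
  policyTypes:
    - Ingress
```

Applied alongside a default-deny policy, this re-admits only traffic arriving through the OpenShift router.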

All three layers work together: RBAC controls the API plane, SCCs control the node plane, and NetworkPolicies control the network plane. A strong OCP security posture needs all three.

AKS – Security Best Practice

For a brand-new microservices project in 2026, security isn’t just a “layer” you add at the end—it’s baked into the infrastructure. AKS has introduced several “secure-by-default” features that simplify this.

Here are the essential security best practices for your new setup:


1. Identity over Secrets (Zero Trust)

In 2026, storing connection strings or client secrets in Kubernetes “Secrets” is considered an anti-pattern.

  • Best Practice: Use Microsoft Entra Workload ID.
  • Why: Instead of your app having a password to access a database, your Pod is assigned a “Managed Identity.” Azure confirms the Pod’s identity via a signed token, granting it access without any static secrets that could be leaked.
  • New in 2026: Enable Conditional Access for Workload Identities to ensure a microservice can only connect to your database if it’s running inside your specific VNet.

2. Harden the Host (Azure Linux 3.0)

The operating system running your nodes is part of your attack surface.

  • Best Practice: Standardize on Azure Linux 3.0 (CBL-Mariner).
  • Why: It is a “distroless-adjacent” host OS. It contains ~500 packages compared to the thousands in Ubuntu, drastically reducing the number of vulnerabilities (CVEs) you have to patch.
  • Advanced Isolation: For sensitive services (like payment processing), enable Pod Sandboxing. This uses Kata Containers to run the service in a dedicated hardware-isolated micro-VM, preventing “container breakout” attacks where a hacker could jump from your app to the node.

3. Network “Blast Radius” Control

If one microservice is compromised, you don’t want the attacker to move laterally through your entire cluster.

  • Best Practice: Use Cilium for Network Policy.
  • Why: As of 2026, Cilium is the gold standard for AKS networking. It uses eBPF technology to filter traffic at the kernel level.
  • Strategy: Implement a Default Deny policy. By default, no service should be able to talk to any other service unless you explicitly write a rule allowing it.

4. API Server Protection

The Kubernetes API server is the “front door” to your cluster. If someone gets in here, they own everything.

  • Best Practice: Use API Server VNet Integration (Private Clusters).
  • Why: This ensures your cluster’s management endpoint is not reachable from the public internet. It exists only inside your private network.
  • Access Control: Use Microsoft Entra RBAC (Role-Based Access Control). Never use the “cluster-admin” local account. Link permissions to your team’s Entra ID groups so that when an employee leaves the company, their cluster access is revoked instantly.

5. Continuous Supply Chain Security

Security starts before the code even reaches AKS.

  • Best Practice: Enable Defender for Cloud and Binary Authorization.
  • Why: Defender for Cloud scans your images in the Azure Container Registry (ACR) for malware and secrets.
  • Enforcement: Use Azure Policy for Kubernetes to block any deployment that hasn’t been scanned or isn’t coming from your trusted registry.

Summary Security Checklist

| Area | 2026 Standard |
| --- | --- |
| Identity | Microsoft Entra Workload ID (No Secrets) |
| OS | Azure Linux 3.0 with OS Guard |
| Network | Cilium with mTLS (Service Mesh) |
| Access | Private Cluster + Entra RBAC |
| Governance | Azure Policy “Restricted” Baseline |

Pro-Tip: Check your Secure Score in Microsoft Defender for Cloud weekly. It will give you a prioritized list of “quick fixes” for your specific AKS cluster based on real-time threats.

With a Service Mesh (specifically the Istio-based add-on for AKS), you are moving toward a “Zero Trust” network architecture. In this setup, the network is no longer trusted by default; every connection must be verified and encrypted.

Here is the 2026 security blueprint for running microservices with Istio on AKS.


1. Automated mTLS (Encryption in Transit)

By default, traffic between Kubernetes Pods is unencrypted. With Istio, you can enforce Strict Mutual TLS (mTLS) without changing a single line of application code.

  • The Best Practice: Apply a PeerAuthentication policy at the namespace level set to STRICT.
  • The Result: Any service that tries to connect via plain text will be instantly rejected by the sidecar proxy. This ensures that even if an attacker gains access to your internal network, they cannot “sniff” sensitive data (like headers or tokens) passing between services.
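The namespace-level STRICT policy can be sketched as (the namespace is illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod      # applies to every workload in this namespace
spec:
  mtls:
    mode: STRICT       # reject any plaintext connection
```

Applying the same policy in the Istio root namespace would make STRICT the mesh-wide default.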

2. Identity-Based Authorization

IP addresses are ephemeral in Kubernetes and shouldn’t be used for security. Istio uses SPIFFE identities based on the service’s Kubernetes Service Account.

  • The Best Practice: Use AuthorizationPolicy to define “Who can talk to Whom.”
  • Example: You can create a rule that says the Email Service can only receive requests from the Orders Service, and only if the request is a POST to the /send-receipt endpoint. Everything else is blocked at the source.
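That example rule might look like this (the namespaces, labels, and service account name are assumptions for illustration):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: email-allow-orders
  namespace: email               # protects the Email Service
spec:
  selector:
    matchLabels:
      app: email
  action: ALLOW
  rules:
    - from:
        - source:
            # SPIFFE identity of the Orders Service's service account
            principals: ["cluster.local/ns/orders/sa/orders-sa"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/send-receipt"]
```

With an ALLOW policy in place, any request not matching a rule is denied.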

3. Secure the “Front Door” (Ingress Gateway)

In 2026, the Kubernetes Gateway API has reached full GA (General Availability) for the AKS Istio add-on.

  • The Best Practice: Use the Gateway and HTTPRoute resources instead of the older Ingress objects.
  • Security Benefit: It allows for better separation of concerns. Your platform team can manage the physical load balancer (the Gateway), while your developers manage the routing rules (HTTPRoute) for their specific microservices.
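A sketch of that split: the platform team owns the Gateway, the app team owns the HTTPRoute (all names and hostnames are illustrative):

```yaml
# Owned by the platform team
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: istio-ingress
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        certificateRefs:
          - name: example-com-tls     # placeholder TLS secret
---
# Owned by the app team
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-route
  namespace: orders
spec:
  parentRefs:
    - name: main-gateway
      namespace: istio-ingress
  hostnames:
    - "orders.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: orders
          port: 8080
```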

4. Dapr + Istio: The “Power Couple”

Since you are building microservices, you might also use Dapr for state and messaging. In 2026, these two work together seamlessly but require one key configuration:

  • The Best Practice: If both are present, let Istio handle the mTLS and Observability, and disable mTLS in Dapr.
  • Why: Having two layers of encryption (“double wrapping” packets) adds significant latency and makes debugging network issues a nightmare.

5. Visualizing the “Blast Radius”

The biggest security risk in microservices is lateral movement.

  • The Best Practice: Use the Kiali dashboard (integrated with AKS) to view your service graph in real-time.
  • The Security Win: If you see a weird line of communication between your Public Web Frontend and your Internal Payment Database that shouldn’t exist, you’ve found a security hole or a misconfiguration before it becomes a breach.

Summary Security Checklist for Istio on AKS

| Task | 2026 Recommended Tool |
| --- | --- |
| Transport Security | PeerAuthentication (mode: STRICT) |
| Service Permissions | Istio AuthorizationPolicy |
| External Traffic | Kubernetes Gateway API (Managed Istio Ingress) |
| Egress (Outgoing) | ServiceEntry (block all traffic to external sites except specific approved domains) |
| Auditing | Azure Monitor for Containers + Istio Access Logs |

Warning for 2026: Ensure your worker nodes have enough “headroom.” Istio sidecars (Envoy proxies) consume roughly 0.5 to 1.0 vCPU and several hundred MBs of RAM per Pod. For a project with many small microservices, this “sidecar tax” can add up quickly.

AKS

At its core, Azure Kubernetes Service (AKS) is Microsoft’s managed version of Kubernetes. It’s designed to take the “scary” parts of managing a container orchestration system—like setting up the brain of the cluster, patching servers, and handling scaling— and offload them to Azure so you can focus on your code.

Think of it as Kubernetes with a personal assistant.


1. How it Works (The Architecture)

AKS splits a cluster into two distinct parts:

  • The Control Plane (Managed by Azure): This is the “brain.” It manages the API server, the scheduler, and the cluster’s state. In AKS, Microsoft manages this for you for free (or for a small fee if you want a guaranteed Uptime SLA). You don’t have to worry about its health or security patching.
  • The Data Plane (Managed by You): These are the “worker nodes” (Virtual Machines) where your applications actually run. While you pay for these VMs, AKS makes it easy to add, remove, or update them with a single click or command.

2. Key Features (2026 Standards)

As of 2026, AKS has evolved into an “AI-ready” platform. Here are the standout features:

  • AKS Automatic: A newer “Zero-Ops” tier where Azure handles almost everything—node configuration, security hardening, and even choosing the right VM sizes based on your app’s needs.
  • Smart Scaling: It uses the Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler to grow or shrink your infrastructure based on real-time traffic, saving you money during quiet hours.
  • AI & GPU Support: Native integration with the latest NVIDIA GPUs (like the NCv6 series) makes it a go-to for training LLMs or running AI inference.
  • Enterprise Security: It integrates directly with Microsoft Entra ID (formerly Azure AD), so you can control who accesses your cluster using the same identity they use for email.
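The Horizontal Pod Autoscaler mentioned under Smart Scaling is standard Kubernetes; a minimal sketch (the Deployment name and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                        # placeholder Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out above 70% average CPU
```

The HPA adds or removes pods; the Cluster Autoscaler then adds or removes nodes when pods no longer fit.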

3. Why Use It?

| Benefit | What it means for you |
| --- | --- |
| Reduced Overhead | You don’t have to be a Kubernetes expert to keep the cluster running. |
| Cost Efficiency | You only pay for the worker nodes you use. Features like Spot VMs can cut costs by up to 80% for non-critical tasks. |
| Fast Updates | Upgrading to a new Kubernetes version is a “push-button” experience rather than a manual nightmare. |
| Deep Integration | It plays perfectly with Azure Monitor, Azure Storage, and GitHub Actions for a smooth “code-to-cloud” pipeline. |

4. Common Use Cases

  • Microservices: Breaking down a giant app into smaller, independent pieces.
  • AI/ML Workflows: Running data-heavy training jobs and deploying models.
  • DevOps Pipelines: Using containers to ensure that “it works on my machine” also means “it works in production.”

Pro Tip: If you’re just starting out in 2026, look into Azure Linux 3.0 as your node operating system. It’s a lightweight, secure OS built by Microsoft specifically for AKS that boots faster and has a smaller attack surface than traditional Linux distros.

Starting a brand-new project in 2026 is an exciting time for AKS because the platform has moved from “manual infrastructure” to “intelligent automation.”

Here is the recommended blueprint for launching a fresh project today.


1. The Big Choice: Automatic vs. Standard

The first decision you’ll face in the Azure Portal is which “tier” to use.

| Feature | AKS Automatic (Recommended) | AKS Standard |
| --- | --- | --- |
| Philosophy | “Just run my code.” | “Give me all the knobs.” |
| Management | Azure manages nodes, scaling, and security. | You manage node pools and VM sizes. |
| Best For | New startups, rapid dev, and “Zero-Ops” teams. | Large enterprises with strict custom networking. |
| Security | Hardened by default (Azure Policy, Cilium). | Configurable (you must set the guardrails). |

Advice: For a brand-new project, start with AKS Automatic. It enforces modern best practices (like the Cilium network data plane) out of the box, saving you from “Day 2” configuration headaches.

An AKS Automatic cluster manages these elements for you:

  • Networking and security: Azure CNI Overlay powered by Cilium
  • Resource provisioning: automated node provisioning and scaling
  • On-demand scaling: the right scaling tools (KEDA, HPA, and VPA) enabled by default
  • Kubernetes version upgrades: automatic updates for enhanced stability

2. Setting Up Your Foundation (The 2026 Stack)

When configuring your new cluster, stick to these current standards:

  • The OS: Choose Azure Linux 3.0. It’s Microsoft’s own cloud-optimized distro. It’s faster and more secure than Ubuntu because it contains only the bare essentials needed to run containers.
  • Networking: Use Azure CNI Overlay. It allows you to scale to thousands of Pods without burning through your private IP address space—a common pitfall in older projects.
  • Identity: Enable Microsoft Entra Workload ID. Never use “Service Principals” or hardcoded secrets to let your app talk to a database. Workload ID gives your Pod its own managed identity.

3. Integrating AI (KAITO)

If your new project involves AI (and most do in 2026), you’ll want to enable the Kubernetes AI Toolchain Operator (KAITO).

  • What it does: It simplifies running Large Language Models (LLMs) like Phi-4 or Falcon on your cluster.
  • Why use it: Instead of manually figuring out which GPU driver matches which model, KAITO automates the provisioning of GPU nodes and sets up the inference endpoint for you.

4. Cost Control for New Projects

Don’t let a “Hello World” app cost you hundreds.

  1. Use the “Dev/Test” Preset: This scales down redundancy to save money while you’re still building.
  2. Spot Instances: For background tasks or non-critical processing, use Spot node pools to get up to 80-90% off standard VM pricing.
  3. Vertical Pod Autoscaler (VPA): In 2026, the VPA is much smarter. It can watch your app and suggest (or automatically apply) the exact amount of CPU and RAM needed, so you aren’t paying for “idle” capacity.

5. Your First Steps

To get moving, I recommend the following workflow:

  1. Install the Azure CLI (ensure it’s version 2.82+ for 2026 features).
  2. Create an ACR (Azure Container Registry): This is where you’ll store your app images.
  3. Deploy via “Automated Deployments”: Use the “Automated Deployments” tab in the AKS portal. You point it at your GitHub repo, and it will generate the GitHub Actions and Kubernetes YAML files for you.

Building a microservices architecture on AKS is a “textbook” use case, but in 2026, the strategy has shifted from just “managing containers” to “managing a distributed system.”

Since each service in your project will have its own lifecycle, database, and scaling needs, here is how you should structure your new AKS environment.


1. The Connectivity Layer: Service Mesh

In a microservices setup, the network becomes your biggest challenge. You shouldn’t have your services talking to each other using hardcoded internal URLs.

  • The 2026 Choice: Istio-based Service Mesh add-on for AKS.
  • Why: It provides “Mutual TLS” (mTLS) by default, meaning all communication between your microservices is encrypted automatically. It also gives you a “Service Graph,” allowing you to visualize exactly how traffic flows from Service A to Service B.

2. Global Traffic Routing (Ingress)

You need a “front door” to route external users to the correct microservice (e.g., api.myapp.com/orders goes to the Order Service).

  • Application Gateway for Containers (ALB): This is the modern evolution of the standard Ingress Controller. It’s a managed service that sits outside your cluster, handling SSL termination and Web Application Firewall (WAF) duties so your worker nodes don’t have to waste CPU on security overhead.

3. Data Persistence & State

The golden rule of microservices is one database per service.

  • Don’t run DBs inside AKS: While you can run SQL or MongoDB as a container, it’s a headache to manage.
  • The 2026 Way: Use Azure Cosmos DB or Azure SQL and connect them to your microservices using Service Connector. Service Connector handles the networking and authentication (via Workload ID) automatically, so your code doesn’t need to store connection strings or passwords.

4. Microservices Design Pattern (Dapr)

For a brand-new project, I highly recommend using Dapr (Distributed Application Runtime), which is an integrated extension in AKS.

Dapr provides “building blocks” as sidecars to your code:

  • Pub/Sub: Easily send messages between services (e.g., the “Order” service tells the “Email” service to send a receipt).
  • State Management: A simple API to save data without writing complex database drivers.
  • Resiliency: Automatically handles retries if one microservice is temporarily down.
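
Dapr configures resiliency declaratively, but the behavior it provides is easy to picture as a retry wrapper around a flaky service call. A framework-free Python sketch of the idea (this is an illustration of the pattern, not Dapr's actual API):

```python
import time

def call_with_retries(fn, attempts: int = 3, backoff_s: float = 0.0):
    """Retry a flaky call a fixed number of times, illustrating the kind of
    resiliency policy Dapr applies transparently between microservices."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError as err:  # treat transient network faults as retryable
            last_error = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between tries
    raise last_error

# Simulate a service that fails twice, then succeeds on the third call
calls = {"n": 0}
def flaky_order_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("order service temporarily down")
    return "receipt-sent"

print(call_with_retries(flaky_order_service))  # -> receipt-sent
```

With Dapr, this logic moves out of your code and into the sidecar, so every service gets it for free.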

5. Observability (The “Where is the Bug?” Problem)

With 10+ microservices, finding an error is like finding a needle in a haystack. You need a unified view.

  • Managed Prometheus & Grafana: AKS has a “one-click” onboarding for these. Prometheus collects metrics (CPU/RAM/Request counts), and Grafana gives you the dashboard.
  • Application Insights: Use this for “Distributed Tracing.” It allows you to follow a single user’s request as it travels through five different microservices, showing you exactly where it slowed down or failed.

Summary Checklist for Your New Project

  1. Cluster: Create an AKS Automatic cluster with the Azure Linux 3.0 OS.
  2. Identity: Use Workload ID instead of secrets.
  3. Communication: Enable the Istio add-on and Dapr extension.
  4. Database: Use Cosmos DB for high-scale microservices.
  5. CI/CD: Use GitHub Actions with the “Draft” tool to generate your Dockerfiles and manifests automatically.

Azure Storage

Azure Storage is a highly durable, scalable, and secure cloud storage solution. In 2026, it has evolved significantly into an AI-ready foundational layer, optimized not just for simple files, but for the massive datasets required for training AI models and serving AI agents.

The platform is divided into several specialized “data services” depending on the type of data you are storing.


1. The Core Data Services

| Service | Data Type | Best For |
| --- | --- | --- |
| Blob Storage | Unstructured (objects) | Images, videos, backups, and AI training data lakes. |
| Azure Files | File shares (SMB/NFS) | Replacing on-premises file servers; “lift and shift” for legacy apps. |
| Azure Disks | Block storage | Persistent storage for Virtual Machines (OS and data disks). |
| Azure Tables | NoSQL key-value | Large-scale, schema-less structured data (e.g., user profiles). |
| Azure Queues | Messaging | Reliable messaging between different parts of an application. |

2. Modern Tiers (Cost vs. Speed)

You don’t pay the same price for data you use every second versus data you keep for 10 years. You choose an Access Tier to optimize your bill:

  • Premium: SSD-backed. Ultra-low latency for high-performance apps and AI inference.
  • Hot: For data you access frequently. Lower access cost, higher storage cost.
  • Cool (30-day minimum): For data like short-term backups. Lower storage cost, higher access cost.
  • Cold (90-day minimum): For rarely accessed data that still needs to be available immediately.
  • Archive (180-day minimum): Lowest cost, but data is “offline.” Rehydrating it takes hours.
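
The tier rules above reduce to a small decision function. The day thresholds below mirror the minimum-retention periods just listed; the function itself is an illustrative sketch, not an Azure API:

```python
def choose_access_tier(days_between_reads: float, needs_instant_access: bool = True) -> str:
    """Pick a Blob access tier from expected read frequency (illustrative only)."""
    if days_between_reads < 30:
        return "Hot"
    if days_between_reads < 90:
        return "Cool"
    if needs_instant_access:
        return "Cold"        # rarely read, but must stay online
    return "Archive"         # offline; rehydration takes hours

print(choose_access_tier(7))                                 # -> Hot
print(choose_access_tier(45))                                # -> Cool
print(choose_access_tier(200, needs_instant_access=False))   # -> Archive
```

In practice, Blob lifecycle management policies apply exactly this kind of rule automatically based on last-access time.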

3. Redundancy: Protecting Your Data

Azure keeps multiple copies of your data to protect against hardware failure or entire data center disasters.

  • LRS (Locally Redundant): 3 copies in a single data center. Protects against disk failure.
  • ZRS (Zone-Redundant): 3 copies across 3 different data centers in one region. Protects against a data center fire/outage.
  • GRS (Geo-Redundant): Copies data to a secondary region hundreds of miles away. Protects against a major regional disaster.

🚀 What’s New in 2026?

  • Agentic Scale: Blob storage can now handle millions of small objects per second to support AI Agents that need to “remember” long-term context or retrieve data from RAG (Retrieval-Augmented Generation) stores.
  • Elastic SAN: A fully managed Storage Area Network (SAN) service that lets you pool block storage for different workloads, making it easier to migrate heavy SQL or Oracle databases to the cloud.
  • AI Integration: Native “hooks” for frameworks like LangChain and Microsoft Foundry, allowing AI models to read directly from your storage with minimal setup.

Choosing the right storage in 2026 depends on two main factors: the structure of your data and how your application (or AI agent) needs to access it.

Here is the 2026 decision framework to help you choose the right tool for the job.


🚦 The “Quick Decision” Tree

| If your project needs to… | The winner is… |
| --- | --- |
| Store millions of files for AI training or Data Lakes. | Blob Storage (Data Lake Gen2) |
| Replace an on-premises file server (SMB/NFS). | Azure Files |
| Provide high-speed block storage for Virtual Machines. | Managed Disks |
| Pool storage across many VMs/containers like a cloud SAN. | Elastic SAN |
| Send messages between different microservices. | Queue Storage |
| Store simple key-value data (user profiles, logs). | Table Storage |

🟦 1. Blob Storage: The AI & Big Data King

In 2026, Blob storage is no longer just for “backups.” It is the central engine for Agentic Scale—supporting AI agents that need to read massive amounts of context quickly.

  • Best For: Unstructured data (PDFs, Images, Parquet files).
  • Key Feature: Data Lake Storage Gen2. This adds a “Hierarchical Namespace” (real folders) to your blobs, which makes big data analytics and AI processing 10x faster.
  • 2026 Strategy: Use Cold Tier for data you only touch once a quarter but need available instantly for AI “Reasoning” tasks.

📂 2. Azure Files: The “Lift-and-Shift” Hero

If you have an existing application that expects a “Drive Letter” (like Z:\), use Azure Files.

  • Best For: Shared folders across multiple VMs or local office computers.
  • New in 2026: Elastic ZRS (Zone Redundant Storage). This provides ultra-high availability for mission-critical file shares without the complexity of managing your own cluster.
  • Performance: Use Premium Files if you are running active databases or high-transaction apps; use Standard for simple office document sharing.

💽 3. Managed Disks vs. Elastic SAN

This is the “local” vs “network” storage debate for your servers.

  • Managed Disks (The Individual): Use Premium SSD v2. It’s the modern standard because it allows you to scale IOPS and Throughput separately, so you don’t have to buy a “huge disk” just to get “high speed.”
  • Elastic SAN (The Pool): If you are migrating a massive environment from an on-premises SAN (like Dell EMC or NetApp), Elastic SAN lets you buy one large “pool” of performance and share it across all your VMs and Kubernetes clusters.

🔍 4. Specialized: Tables & Queues

These are “developer” storage types.

  • Azure Tables: Use this if Cosmos DB is too expensive for your needs. It’s a “no-frills” NoSQL database for billions of small, structured rows.
  • Azure Queues: Use this to decouple your app. If a user uploads a photo, put a message in the Queue. A “Worker” then sees that message and processes the photo. This prevents your app from crashing under heavy load.
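
The photo-upload pattern described above — enqueue a message, let a worker drain it later — can be sketched in-process with Python's standard queue module. Azure Queue Storage provides the same decoupling, but across machines and with durable messages:

```python
import queue

def upload_photo(work_queue: queue.Queue, photo_name: str) -> None:
    # The web front end only enqueues a message; it never does the heavy work.
    work_queue.put({"action": "process-photo", "photo": photo_name})

def run_worker(work_queue: queue.Queue) -> list:
    # The worker drains messages at its own pace, absorbing load spikes.
    processed = []
    while not work_queue.empty():
        msg = work_queue.get()
        processed.append(f"thumbnail:{msg['photo']}")
        work_queue.task_done()
    return processed

q = queue.Queue()
for name in ("a.jpg", "b.jpg", "c.jpg"):
    upload_photo(q, name)
print(run_worker(q))  # -> ['thumbnail:a.jpg', 'thumbnail:b.jpg', 'thumbnail:c.jpg']
```

Because the producer and worker only share the queue, either side can crash, restart, or scale independently without losing the other.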

💡 Pro-Tip for 2026: The “Cost First” Choice

If you are still unsure, start with Blob Storage. It is the most flexible, has the best API support for modern AI frameworks (like LangChain), and offers the most aggressive cost-saving tiers (Cool, Cold, and Archive).

What is the “main action” your application will take? (e.g., “Scanning documents for AI,” “Hosting a website,” or “Running a heavy SQL database?”) I can give you the specific configuration for that.

Cost optimization in Azure is no longer just about “turning things off.” In 2026, it is a continuous lifecycle known as FinOps, focusing on three distinct phases: Inform (Visibility), Optimize (Rightsizing & Rates), and Operate (Governance).

Here is the strategic blueprint for optimizing your Azure spend today.


1. Inform: Get Full Visibility

You cannot optimize what you cannot see.

  • Tagging Enforcement: Use Azure Policy to require tags like Environment, Owner, and CostCenter. This allows you to group costs by department or project in Azure Cost Management.
  • Budget Alerts: Set thresholds at 50%, 80%, and 100% of your predicted monthly spend.
  • Azure Advisor Score: Check your “Cost Score” in Azure Advisor. It provides a “to-do list” of unused resources, such as unattached Managed Disks or idle ExpressRoute circuits.
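
The 50%/80%/100% budget thresholds above are plain percentage checks; a small helper makes the alerting logic concrete (illustrative only — Azure Cost Management evaluates these thresholds server-side):

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the alert thresholds (as percentages) that current spend has crossed."""
    return [int(t * 100) for t in thresholds if spend >= t * budget]

# $850 spent against a $1,000 monthly budget trips the 50% and 80% alerts
print(crossed_thresholds(spend=850, budget=1000))  # -> [50, 80]
```
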

2. Optimize: The Two-Pronged Approach

Optimization is divided into Usage (buying less) and Rate (paying less for what you use).

A. Usage Optimization (Rightsizing)

  • Shut Down Idle Resources: Azure Advisor flags VMs with <3% CPU usage. For Dev/Test environments, use Auto-shutdown or Azure Automation to turn VMs off at 7:00 PM and on at 7:00 AM.
  • Storage Tiering: Move data that hasn’t been touched in 30 days to the Cool tier, and data older than 180 days to the Archive tier. This can save up to 90% on storage costs.
  • B-Series VMs: For workloads with low average CPU but occasional spikes (like small web servers), use the B-Series (Burstable) instances to save significantly.

B. Rate Optimization (Commitment Discounts)

In 2026, you choose your discount based on how much flexibility you need.

| Discount Type | Savings | Best For… |
| --- | --- | --- |
| Reserved Instances (RI) | Up to 72% | Static workloads. You commit to a specific VM type in a specific region for 1 or 3 years. |
| Savings Plan for Compute | Up to 65% | Dynamic workloads. A flexible $/hour commitment that applies across VM families and regions. |
| Azure Hybrid Benefit | Up to 85% | Using your existing Windows/SQL licenses in the cloud so you don’t pay for them twice. |
| Spot Instances | Up to 90% | Interruptible workloads like batch processing or AI model training. |
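
To compare these options, it helps to turn the "up to" percentages into monthly dollar figures. A quick sketch using the table's headline discounts as assumed inputs (real rates depend on region, term, and VM family, so treat the $0.20/hour rate as a placeholder):

```python
def monthly_cost(pay_as_you_go_per_hour: float, discount_pct: float,
                 hours_per_month: float = 730) -> float:
    """Effective monthly cost after a commitment discount (illustrative rates)."""
    return pay_as_you_go_per_hour * hours_per_month * (1 - discount_pct / 100)

payg_rate = 0.20  # assumed $/hour for an example VM
for name, pct in [("Pay-as-you-go", 0), ("Savings Plan", 65),
                  ("Reserved Instance", 72), ("Spot", 90)]:
    print(f"{name}: ${monthly_cost(payg_rate, pct):.2f}/month")
```

Running the numbers this way makes the trade-off visible: the bigger the discount, the less flexibility (or reliability, in Spot's case) you keep.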

3. Operate: Modern 2026 Strategies

  • AI Cost Governance: With the rise of Generative AI, monitor your Azure OpenAI and AI Agent token usage. Use Rate Limiting on your APIs to prevent a runaway AI bot from draining your budget in a single night.
  • FinOps Automation: Use Azure Resource Graph to find “orphaned” resources (like Public IPs not attached to anything) and delete them automatically via Logic Apps.
  • Sustainability & Carbon Optimization: Use the Azure Carbon Optimization tool. Often, the most “green” resource (lowest carbon footprint) is also the most cost-efficient one.

✅ The “Quick Wins” Checklist

  1. [ ] Delete Unattached Disks: When you delete a VM, the disk often stays behind and keeps billing you.
  2. [ ] Switch to Savings Plans: If your RIs are expiring, move to a Savings Plan for easier management.
  3. [ ] Check for “Zombies”: Idle Load Balancers, VPN Gateways, and App Service Plans with zero apps.
  4. [ ] Rightsize your SQL: Switch from “DTU” to the vCore model for more granular scaling and Hybrid Benefit savings.

Pro Tip: Never buy a Reserved Instance (RI) for a server that hasn’t been rightsized first. If you buy a 3-year reservation for an oversized 16-core VM, you are “locking in” waste for 36 months!

To find the “low-hanging fruit” in your Azure environment, you can use Azure Resource Graph Explorer and Log Analytics.

Here are the specific KQL (Kusto Query Language) scripts to identify common waste areas.


1. Identify Orphaned Resources (Quickest Savings)

These resources are costing you money every hour but aren’t attached to anything. Run these in the Azure Resource Graph Explorer.

A. Unattached Managed Disks

Code snippet

Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where isempty(managedBy) and diskState == "Unattached"
| project name, resourceGroup, subscriptionId, location, diskSizeGB = toint(properties.diskSizeGB)
| order by diskSizeGB desc

B. Unattached Public IPs

Code snippet

Resources
| where type == "microsoft.network/publicipaddresses"
| where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
| project name, resourceGroup, subscriptionId, location, ipAddress = tostring(properties.ipAddress)

2. Identify Underutilized VMs (Rightsizing)

To run this, your VMs must be sending performance data to a Log Analytics Workspace. Use this to find VMs that are consistently running below 5% CPU.

KQL for Underutilized VMs (Last 7 Days):

Code snippet

Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer, _ResourceId
| where AvgCPU < 5
| order by AvgCPU asc
  • Action: If MaxCPU is also low, consider “Downsizing” the VM (e.g., from a D4 to a D2) or switching it to a B-series.

3. Find Idle App Service Plans

App Service Plans cost money even if they have zero apps running on them. Run this in Resource Graph Explorer.

Code snippet

resources
| where type =~ "microsoft.web/serverfarms"
| where properties.numberOfSites == 0
| project name, resourceGroup, subscriptionId, Sku = sku.name, Tier = sku.tier

4. Search for “Zombie” Network Interfaces

These don’t cost money directly, but they clutter your environment and use up IP addresses in your subnets.

Code snippet

Resources
| where type =~ 'microsoft.network/networkinterfaces'
| where isnull(properties.virtualMachine)
| project name, resourceGroup, subscriptionId, location

💡 How to Automate This in 2026

Instead of running these manually, use Azure Workbooks.

  1. Search for “Workbooks” in the Azure Portal.
  2. Click Add > New.
  3. Add a “Query” cell and paste any of the KQL scripts above.
  4. Save the Workbook as “Monthly Cost Cleanup.”

Now, you can open this dashboard once a month and see exactly what needs to be deleted!

Setting up an automated alert for “Unattached Disks” is a brilliant move for cost governance. In Azure, this is handled by monitoring the Activity Log for a specific event: the “Delete Virtual Machine” action (which leaves the disk behind) or the “Detach Disk” action.

Here is the 2026 step-by-step guide to setting this up.


Step 1: Create an Action Group (The “Who” to notify)

Before you create the alert, you need to tell Azure how to contact you.

  1. Search for Monitor in the Azure Portal.
  2. Click Alerts > Action groups > + Create.
  3. Basics: Give it a name like CostAlertTeam.
  4. Notifications: Select Email/SMS message/Push/Voice.
  5. Enter your email address and name the notification EmailDevOps.
  6. Click Review + create.

Step 2: Create the Activity Log Alert (The “When”)

Now, we create the trigger that watches for disks being left alone.

  1. In Monitor, click Alerts > + Create > Alert rule.
  2. Scope: Select your Subscription.
  3. Condition: This is the most important part. Click + Add condition and search for:
    • Signal Name: Detach Disk (Microsoft.Compute/disks)
    • Alternative: You can also alert on Delete Virtual Machine, but “Detach Disk” is more accurate for catching orphaned resources.
  4. Refine the Logic: Under “Event initiated by,” you can leave it as “Any” or specify a specific automation service principal if you only want to catch manual detaches.

Step 3: Connect and Save

  1. Actions: Click Select action groups and choose the CostAlertTeam group you created in Step 1.
  2. Details: Name the rule Alert-Disk-Unattached.
  3. Severity: Set it to Informational (Sev 4) or Warning (Sev 3).
  4. Click Review + create.

💡 The “Pro” Way (2026 Strategy): Use Log Analytics

The method above tells you when a disk is detached, but it won’t tell you about disks that are already unattached. To catch those, use a Log Search Alert with a KQL query.

The KQL Query:

Code snippet

// Run this every 24 hours to find any disk with no managing VM
resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where isempty(managedBy) and diskState == "Unattached"
| project name, resourceGroup, subscriptionId

Why this is better:

  • Activity Log Alerts are “reactive” (they fire only at the moment of the event).
  • Log Search Alerts are “proactive” (they scan your environment every morning and email you a list of every unattached disk, even if it was detached months ago).

✅ Summary of the Workflow

  1. A Detach/Delete event happens on a VM or disk.
  2. Activity Log captures the event.
  3. Azure Monitor sees the event matches your rule.
  4. Action Group sends you an email immediately.

While an immediate alert is great for a “fire-drill” response, a Weekly Summary Report is the gold standard for long-term cost governance. It keeps your inbox clean and ensures your team stays accountable for “disk hygiene.”

In 2026, the best way to do this without writing custom code is using Azure Logic Apps.


🛠️ The Architecture: “The Monday Morning Cleanup”

We will build a simple 3-step workflow that runs every Monday at 9:00 AM, queries for unattached disks, and sends you a formatted HTML table.

Step 1: Create the Logic App (Recurrence)

  1. Search for Logic Apps and create a new one (select Consumption plan for lowest cost).
  2. Open the Logic App Designer and select the Recurrence trigger.
  3. Set it to:
    • Interval: 1
    • Frequency: Week
    • On these days: Monday
    • At these hours: 9

Step 2: Run the KQL Query

  1. Add a new step and search for the Azure Monitor Logs connector.
  2. Select the action: Run query and visualize results.
  3. Configure the connection:
    • Subscription/Resource Group: Select your primary management group.
    • Resource Type: Log Analytics Workspace.
  4. The Query: Paste the “Orphaned Disk” query from earlier:

Code snippet

Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where isempty(managedBy) and diskState == "Unattached"
| project DiskName = name, ResourceGroup = resourceGroup, SizeGB = properties.diskSizeGB, Location = location
  5. Chart Type: Select HTML Table.

Step 3: Send the Email

  1. Add a final step: Office 365 Outlook – Send an email (V2).
  2. To: Your team’s email.
  3. Subject: ⚠️ Weekly Action: Unattached Azure Disks found
  4. Body:
    • Type some text like: “The following disks are currently unattached and costing money. Please delete them if they are no longer needed.”
    • From the Dynamic Content list, select Attachment Content (this is the HTML table from Step 2).

📊 Why this is the “Pro” Move

  • Zero Maintenance: Once it’s running, you never have to check the portal manually.
  • Low Cost: A Logic App running once a week costs roughly $0.02 per month.
  • Formatted for Humans: Instead of a raw JSON blob, you get a clean table that you can forward to project owners.

✅ Bonus: Add a “Delete” Link

If you want to be a 2026 power user, you can modify the KQL to include a “Deep Link” directly to each disk in the Azure Portal:

Code snippet

| extend PortalLink = strcat("https://portal.azure.com/#@yourtenant.onmicrosoft.com/resource", id)
| project DiskName, SizeGB, PortalLink

Now, you can click the link in your email and delete the disk in seconds.

Combining the different “zombie” resources into one report is the most efficient way to manage your Azure hygiene.

By using the union operator in KQL, we can create a single list of various resource types that are currently costing you money without providing value.


1. The “Ultimate Zombie” KQL Query

Copy and paste this query into your Logic App or Azure Resource Graph Explorer. It looks for unattached disks, unassociated IPs, and empty App Service Plans all at once.

Code snippet

// Query for Orphaned Disks
Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where isempty(managedBy) and diskState == "Unattached"
| project Name = name, Type = "Orphaned Disk", Detail = strcat(properties.diskSizeGB, " GB"), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
| union (
    // Query for Unassociated Public IPs
    Resources
    | where type == "microsoft.network/publicipaddresses"
    | where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
    | project Name = name, Type = "Unattached IP", Detail = tostring(properties.ipAddress), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| union (
    // Query for Empty App Service Plans (costly!)
    Resources
    | where type =~ "microsoft.web/serverfarms"
    | where properties.numberOfSites == 0
    | project Name = name, Type = "Empty App Service Plan", Detail = strcat(tostring(sku.tier), " - ", tostring(sku.name)), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| union (
    // Query for Idle Load Balancers (no backend pool members)
    Resources
    | where type == "microsoft.network/loadbalancers"
    | where array_length(properties.backendAddressPools) == 0
    | project Name = name, Type = "Idle Load Balancer", Detail = "No Backend Pools", ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| order by Type asc

2. Updating Your Logic App Report

To make this work in your weekly email:

  1. Open your Logic App and update the “Run query” step with the new combined KQL above.
  2. Update the HTML Table: Since the new query uses consistent column names (Name, Type, Detail), your HTML table will now neatly list the different types of waste side-by-side.

3. Advanced 2026 Tip: Add “Potential Savings”

If you want to get your manager’s attention, you can add an “Estimated Monthly Waste” column. While KQL doesn’t know your exact billing, you can hardcode estimates:

Code snippet

| extend MonthlyWaste = case(
Type == "Orphaned Disk", 5.00, // Estimate $5 per month
Type == "Unattached IP", 4.00, // Estimate $4 per month
Type == "Empty App Service Plan", 50.00, // Estimate $50+ for Standard+
0.00)
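
The same estimate logic as the KQL case() can be mirrored in Python to sanity-check the total your report will show. The dollar figures are the same rough placeholders as above, not real billing data:

```python
# Rough per-resource monthly estimates, matching the KQL case() above (placeholders).
WASTE_ESTIMATES = {
    "Orphaned Disk": 5.00,
    "Unattached IP": 4.00,
    "Empty App Service Plan": 50.00,
}

def total_monthly_waste(zombies: list) -> float:
    """Sum the estimated monthly waste for a list of zombie resource types."""
    return sum(WASTE_ESTIMATES.get(z, 0.0) for z in zombies)

report = ["Orphaned Disk", "Orphaned Disk", "Unattached IP", "Empty App Service Plan"]
print(f"${total_monthly_waste(report):.2f}/month")  # -> $64.00/month
```
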

✅ Your “Monday Morning” Checklist

When you receive this email every Monday, follow this triage:

  • Disks: Delete immediately unless you specifically kept it as a “manual backup” (though you should use Azure Backup for that).
  • Public IPs: Delete. Unused Public IPs are charged by the hour in Azure.
  • App Service Plans: If you aren’t using them, scale them to the Free (F1) tier or delete them. These are often the biggest hidden costs.

To turn this report into a powerful leadership tool, we need to calculate the “Total Potential Monthly Savings.” This changes the conversation from “We have a few loose disks” to “We can save $800/month by clicking these buttons.”

Here is how to update your Logic App and KQL to include a summary total.


1. Updated “Master Zombie” Query (With Estimated Costs)

We will add a hidden cost value to every “zombie” found, then summarize the total at the very end.

Code snippet

let RawData = Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where isempty(managedBy) and diskState == "Unattached"
| project Name = name, Type = "Orphaned Disk", Detail = strcat(properties.diskSizeGB, " GB"), MonthlyWaste = 10.00, ResourceGroup = resourceGroup
| union (
    Resources
    | where type == "microsoft.network/publicipaddresses"
    | where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
    | project Name = name, Type = "Unattached IP", Detail = tostring(properties.ipAddress), MonthlyWaste = 4.00, ResourceGroup = resourceGroup
)
| union (
    Resources
    | where type =~ "microsoft.web/serverfarms"
    | where properties.numberOfSites == 0
    | project Name = name, Type = "Empty App Service Plan", Detail = strcat(tostring(sku.tier), " - ", tostring(sku.name)), MonthlyWaste = 55.00, ResourceGroup = resourceGroup
);
// This part appends a TOTAL row to the final list
RawData
| order by Type asc
| union (
    RawData
    | summarize MonthlyWaste = sum(MonthlyWaste)
    | extend Name = "TOTAL POTENTIAL SAVINGS", Type = "---", Detail = "---", ResourceGroup = "---"
)

2. Formatting the Logic App Email

Since KQL doesn’t easily format currency, we’ll use Logic App data operations to make the final total stand out in your email.

  1. Run the Query: Use the Run query and list results action in Logic Apps with the KQL above.
  2. Build the Table: Add a Data Operations - Create HTML table action over the query results.
  3. Extract the Total: Add a Data Operations - Compose action (name it Total_Waste_Sum) that pulls the MonthlyWaste value from the “TOTAL POTENTIAL SAVINGS” row.
  4. The HTML Body: Use this template in the email body to make it look professional:

HTML

<h3>Azure Monthly Hygiene Report</h3>
<p>The following resources are identified as waste.
Cleaning these up will result in the estimated savings below.</p>
@{body('Create_HTML_table')}
<br>
<div style="background-color: #e1f5fe; padding: 15px; border-radius: 5px; border: 1px solid #01579b;">
<strong>💡 Quick Win Tip:</strong> Deleting these resources today
will save your department approx <strong>$@{outputs('Total_Waste_Sum')}</strong> per month.
</div>

3. Why This Works in 2026

  • The “Nudge” Effect: By showing the total dollar amount at the bottom, you create a psychological incentive for resource owners to clean up.
  • Customizable Pricing: You can adjust the MonthlyWaste numbers in the KQL to match your specific Enterprise Agreement (EA) pricing.
  • Single Pane of Glass: You now have one query that covers Compute, Network, and Web services.

✅ Final Triage Steps

  • Review: If you see a “TOTAL POTENTIAL SAVINGS” of $0.00, congratulations! Your environment is clean.
  • Action: For the “Empty App Service Plans,” check the tier first: a Free (F1) plan costs nothing but will still show up as “Empty,” while Basic and higher tiers keep billing even with zero apps.

Azure 3-tier app: enterprise landing zone version

Redraw-from-memory diagram

                              Users / Internet
                                     |
                           Azure Front Door + WAF
                                     |
                     =====================================
                     |                                  |
                  Region A                           Region B
                  Primary                            Secondary
                     |                                  |
               App Gateway/WAF                    App Gateway/WAF
                     |                                  |
          -------------------------         -------------------------
          |       Spoke: App      |         |       Spoke: App      |
          | Web / API / AKS       |         | Web / API / AKS       |
          | Managed Identity      |         | Managed Identity      |
          -------------------------         -------------------------
                     |                                  |
          -------------------------         -------------------------
          |      Spoke: Data      |         |      Spoke: Data      |
          | SQL / Storage / KV    |         | SQL / Storage / KV    |
          | Private Endpoints     |         | Private Endpoints     |
          -------------------------         -------------------------

                  \_________________ Hub VNet __________________/
                   Firewall | Bastion | Private DNS | Resolver
                   Monitoring | Shared Services | Connectivity

          On-prem / Branches
                 |
        ExpressRoute / VPN
                 |
        Global connectivity to hubs / spokes



What makes this an Azure Landing Zone design

Azure landing zones are the platform foundation for subscriptions, identity, networking, governance, security, and platform automation. Microsoft’s landing zone guidance explicitly frames these as design areas, not just one network diagram. (Microsoft Learn)

So in an interview, say this first:

“This isn’t just a 3-tier app. I’m placing the app inside an enterprise landing zone, where networking, identity, governance, and shared services are standardized at the platform layer.” (Microsoft Learn)

How to explain the architecture

Traffic enters through Azure Front Door with WAF, which is the global entry point and can distribute requests across multiple regional deployments for higher availability. Microsoft’s guidance calls out Front Door as the global load balancer in multiregion designs. (Microsoft Learn)

Each region has its own application stamp in a spoke VNet. The app tier runs in the spoke, stays mostly stateless, and uses Managed Identity to access downstream services securely without storing secrets. The data tier sits behind Private Endpoints, so services like Key Vault, SQL, and Storage are not exposed publicly. A private endpoint gives the service a private IP from the VNet. (Microsoft Learn)

Shared controls live in the hub VNet: Azure Firewall, Bastion, DNS, monitoring, and sometimes DNS Private Resolver for hybrid name resolution. Hub-and-spoke is the standard pattern for centralizing shared network services while isolating workloads in spokes. (Microsoft Learn)

The key enterprise networking points

Use hub-and-spoke so shared controls are centralized and workloads are isolated. Microsoft’s hub-spoke guidance specifically notes shared DNS and cross-premises routing as common hub responsibilities. (Microsoft Learn)

For Private Endpoint DNS, use centralized private DNS zones and link them to every VNet that needs to resolve those names. This is one of the most important details interviewers look for, because private endpoint failures are often DNS failures. (Microsoft Learn)
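To make the DNS point concrete, here is a small sketch (my own illustration, not from Microsoft's docs) of the check an engineer runs from a jumpbox inside the VNet: resolve the service FQDN and confirm it lands on a private RFC 1918 address rather than a public one.

```shell
# Hypothetical sanity check for private endpoint DNS.
# From inside the VNet, an FQDN like myvault.vault.azure.net should resolve
# (via the privatelink zone) to a private IP; a public IP means the private
# DNS zone is missing or not linked to this VNet.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                          return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*)  return 0 ;;
    *)                                       return 1 ;;
  esac
}

# On a jumpbox you would capture the real answer, e.g.:
#   resolved=$(dig +short myvault.vault.azure.net | tail -n 1)
resolved="10.1.2.4"   # stand-in for the dig output in this sketch
if is_private_ip "$resolved"; then
  echo "private endpoint DNS looks correct"
else
  echo "public IP returned: check private DNS zone links"
fi
```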

For multi-region, either peer regional hubs or use Azure Virtual WAN when the estate is large and needs simpler any-to-any connectivity across regions and on-premises. (Microsoft Learn)

Key phrases to hit:

  • “Only the front door is public.”
  • “App and data tiers stay private.”
  • “Private Endpoints are used for PaaS services.”
  • “Managed Identity removes stored credentials.”
  • “Policies and guardrails are applied at the landing zone level.”
  • “Shared inspection and egress control sit in the hub.”

That lines up with landing zone governance, security, and platform automation guidance. (Microsoft Learn)

2-minute interview answer

“I’d place the 3-tier application inside an Azure landing zone using a hub-and-spoke, multi-region design. Azure Front Door with WAF would be the global ingress layer and route traffic to regional application stacks. In each region, the web and app tiers would live in a spoke VNet, while shared services like Firewall, Bastion, private DNS, and monitoring would live in the hub. The data tier would use services like Azure SQL, Storage, and Key Vault behind Private Endpoints, with centralized private DNS linked to all VNets that need resolution. The application would use Managed Identity for secure access without secrets. For resilience, I’d deploy a secondary region and let Front Door handle failover. For larger estates or more complex connectivity, I’d consider Virtual WAN to simplify cross-region and hybrid networking.” (Microsoft Learn)

Memory trick

Remember it as:

Global edge → Regional spokes → Private data → Shared hub controls

Or even shorter:

Front Door, Spokes, Private Link, Hub

Here is a one-page Azure interview cheat sheet you can revise quickly before interviews 👇


Azure Architecture Cheat Sheet (Landing Zone + Networking + Identity)


1. Core Architecture

👉
Hub-and-spoke, multi-region, with centralized security and private backend services in Microsoft Azure.


2. Mental Diagram

Internet
|
Front Door (WAF)
|
Region A / Region B
|
Spoke VNet (App)
|
Private Endpoint
|
Data (SQL / Storage / Key Vault)
+ Hub VNet
Firewall | DNS | Bastion

3. Security Principles

  • “Only ingress is public”
  • “Everything else is private”
  • “Use Private Endpoints for PaaS”
  • “Use Managed Identity—no secrets”
  • “Enforce with policies and RBAC via Microsoft Entra ID”

4. Identity (VERY IMPORTANT)

  • Most secure → Managed Identity
  • Types:
    • User
    • Service Principal
    • Managed Identity

👉 Rule:

  • Inside Azure → Managed Identity
  • Outside Azure → Federated Identity / Service Principal

5. Networking (What to Remember)

Private Endpoint

  • Uses private IP
  • Needs Private DNS
  • ❗ Most common issue = DNS

Public Endpoint

  • Needs:
    • NAT Gateway or Public IP
    • Route to internet

👉 Rule:

  • Private = DNS problem
  • Public = Routing problem

6. Troubleshooting Framework

👉 Always say:

“What → When → Who → Why → Fix”

Step   Tool
What   Cost Mgmt / Metrics
When   Logs (Azure Monitor)
Who    Activity Log
Why    Correlation
Fix    Scale / Secure / Block

7. Defender Alert Triage

👉
“100 alerts = 1 root cause”

Steps:

  1. Go to Microsoft Defender for Cloud (not emails)
  2. Group by resource/type
  3. Find pattern (VM? same alert?)
  4. Check:
    • NSG (open ports?)
    • Identity (who triggered?)
  5. Contain + prevent

8. Cost Spike Debug

  1. Cost Management → find resource
  2. Metrics → confirm usage
  3. Activity Log → who created/changed
  4. Check:
    • Autoscale
    • CI/CD
    • Compromise

9. Resource Graph (Quick Wins)

Use Azure Resource Graph for:

  • Orphaned disks
  • Unused IPs
  • Recent resources

10. 3-Tier Design (Quick Version)

WAF → Web → App → Data
Private Endpoints

11. Power Phrases

Say these to stand out:

  • “Zero trust architecture”
  • “Least privilege access”
  • “Identity-first security”
  • “Private over public endpoints”
  • “Centralized governance via landing zone”
  • “Eliminate secrets using Managed Identity”

Final Memory Trick

👉
“Front Door → Spoke → Private Link → Hub → Identity”


30-Second Killer Answer

I design Azure environments using a landing zone with hub-and-spoke networking and multi-region resilience. Traffic enters through Front Door with WAF, workloads run in spoke VNets, and backend services are secured using private endpoints. I use managed identities for authentication to eliminate secrets, and enforce governance through policies and RBAC. This ensures a secure, scalable, and enterprise-ready architecture.


Azure 3-tier app

A clean Azure 3-tier app design is:

  1. Web tier for user traffic
  2. App tier for business logic and APIs
  3. Data tier for storage and databases

That matches Azure’s n-tier guidance, where logical layers are separated and can be deployed to distinct tiers for security, scale, and manageability. (Microsoft Learn)

Simple Azure design

Users
|
Azure Front Door / WAF
|
Web Tier
(App Service or VMSS)
|
App Tier
(App Service / AKS / VMSS)
|
Data Tier
(Azure SQL / Storage / Cache)

Better interview-ready version

Internet
|
Front Door + WAF
|
Application Gateway
|
---------------- Web Subnet ----------------
Web Tier
(App Service or VM Scale Set)
|
----------- App / API Private Subnet -------
App Tier
(App Service with VNet Integration / AKS / VMSS)
|
----------- Data Private Subnet ------------
Azure SQL / Storage / Redis / Key Vault
(Private Endpoints)

What I’d choose in Azure

For a modern Azure-native design, I’d usually use:

  • Front Door + WAF for global entry and protection
  • App Service for the web tier
  • App Service or AKS for the app/API tier
  • Azure SQL for the database
  • Key Vault for secrets
  • Private Endpoints for Key Vault and database access
  • VNet integration so the app tier can reach private resources inside the virtual network

Azure App Service supports VNet integration for reaching resources in or through a VNet, and Azure supports private endpoints for services like Key Vault. (Microsoft Learn)

Security design

A strong answer should include:

  • Put the web tier behind WAF
  • Keep the app tier private
  • Put the data tier behind Private Endpoints
  • Use Managed Identity from app tier to Key Vault and database where supported
  • Use NSGs and subnet separation
  • Disable public access on back-end services when possible

Azure’s secure n-tier App Service guidance specifically uses VNet integration and private endpoints to isolate traffic within the virtual network. (Microsoft Learn)
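As a quick self-check on that last point, a Resource Graph query along these lines can surface Key Vaults that still allow public network access (a sketch; verify the `publicNetworkAccess` property against the API version of your resources):

```kusto
Resources
| where type == "microsoft.keyvault/vaults"
| where properties.publicNetworkAccess =~ "Enabled"
| project name, resourceGroup, subscriptionId
```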

High availability and scaling

For resilience, I’d make the web and app tiers stateless, enable autoscaling, and run across multiple availability zones or multiple instances. Azure’s web app and Well-Architected guidance emphasizes designing for reliability, scalability, and secure operation. (Microsoft Learn)

2-minute interview answer

“I’d design the 3-tier app with a web tier, app tier, and data tier. User traffic would enter through Azure Front Door with WAF, then go to the web tier, typically App Service or VM Scale Sets. The web tier would call a private app tier that hosts the business logic. The app tier would connect to the data tier, such as Azure SQL, Storage, Redis, and Key Vault. I’d use VNet integration and private endpoints so the back-end services are not publicly exposed. For security, I’d separate tiers into subnets, apply NSGs, use Managed Identity for secret and database access, and store secrets in Key Vault. For reliability, I’d keep the web and app tiers stateless and scale them horizontally.” (Microsoft Learn)

Easy memory trick

Remember it as:

Ingress → Web → Logic → Data
and
Public only in front, private everywhere else


🧠 🧱 3-Tier Azure Diagram

✍️ Draw This on a Whiteboard

                 🌍 Internet
                      |
             Azure Front Door / WAF
                      |
              Application Gateway
                      |
        --------------------------------
        |        Web Tier (Public)     |
        |  App Service / VM Scale Set  |
        --------------------------------
                      |
        --------------------------------
        |        App Tier (Private)    |
        |  API / Backend / AKS         |
        --------------------------------
                      |
        --------------------------------
        |        Data Tier (Private)   |
        |  Azure SQL / Storage         |
        |  + Key Vault                 |
        --------------------------------

        (Private Endpoints + VNet Integration)



🎤 What to Say While Drawing

🟢 Step 1 — Entry Point

“This is a 3-tier architecture in Microsoft Azure. Traffic enters through Front Door with WAF for global routing and security.”


🟢 Step 2 — Web Tier

“The web tier handles user requests. It’s the only layer exposed publicly and is typically built using App Service or VM Scale Sets.”


🟢 Step 3 — App Tier

“The app tier contains business logic and APIs. It’s private and only accessible from the web tier.”


🟢 Step 4 — Data Tier

“The data tier includes services like Azure SQL, Storage, and Key Vault, all accessed via Private Endpoints so they are not exposed to the internet.”


🟢 Step 5 — Security

I use VNet integration and Private Endpoints so all backend communication stays inside Azure. I also use Managed Identity for secure access to Key Vault and databases, eliminating secrets.



🔐 Add These Details

Mention these to stand out:

  • NSGs between tiers
  • Private DNS for Private Endpoints
  • No public access on DB / Key Vault
  • Use Azure Key Vault for secrets
  • Identity via Microsoft Entra ID

⚡ Ultra-Simple Memory Trick

👉 Draw 3 boxes vertically:

Web (Public)
App (Private)
Data (Private)

Then add:

  • WAF on top
  • Private Endpoints at bottom

💬 30-Second Version

“I’d design a 3-tier app with a web tier, app tier, and data tier. Traffic enters through Front Door with WAF, hits the web tier, then flows to a private app tier and finally to a private data tier. I’d secure backend services using Private Endpoints and use Managed Identity for authentication, ensuring no secrets are stored and no backend services are publicly exposed.”


🧠 Why This Works in Interviews

You just demonstrated:

  • ✅ Architecture design
  • ✅ Security best practices
  • ✅ Networking (private endpoints, VNets)
  • ✅ Identity (Managed Identity)

Azure WAF and Front Door


Azure Front Door

Azure Front Door is a global, scalable entry point for your web applications. Think of it as a smart traffic cop sitting at the edge of Microsoft’s global network that routes users to the fastest, most available backend.

Key capabilities:

  • Global load balancing — distributes traffic across regions, routing users to the nearest or healthiest backend
  • SSL/TLS termination — handles HTTPS offloading at the edge, reducing backend load
  • URL-based routing — routes /api/* to one backend and /images/* to another
  • Caching — caches static content at edge locations (POPs) to reduce latency
  • Health probes — automatically detects unhealthy backends and reroutes traffic
  • Session affinity — sticky sessions to keep a user on the same backend

Front Door operates at Layer 7 (HTTP/HTTPS) and uses Microsoft’s global private WAN backbone, so traffic travels faster than the public internet.


Azure WAF (Web Application Firewall)

Azure WAF is a security layer that inspects and filters HTTP/S traffic to protect web apps from common exploits and vulnerabilities.

What it protects against:

  • SQL injection
  • Cross-site scripting (XSS)
  • OWASP Top 10 threats
  • Bot attacks and scraping
  • Rate limiting / DDoS at Layer 7
  • Custom rule-based threats (e.g. block specific IPs, countries, headers)

Two modes:

  • Detection mode — logs threats but doesn’t block (good for tuning)
  • Prevention mode — actively blocks malicious requests

How They Work Together

WAF is a feature/policy that runs on top of Front Door (and also on Application Gateway). You attach a WAF policy to your Front Door profile, and it inspects all incoming traffic before it reaches your backends.

User Request
      |
┌─────────────────────────────┐
│      Azure Front Door       │ ← Global routing, caching, SSL termination
│  ┌───────────────────────┐  │
│  │      WAF Policy       │  │ ← Inspect & filter malicious traffic
│  └───────────────────────┘  │
└─────────────────────────────┘
      |
Your Backend (App Service, AKS, VM, etc.)

Front Door Tiers

Feature                    Standard            Premium
CDN + load balancing       ✅                   ✅
WAF                        Basic rules only    ✅ Full (managed + custom rules)
Bot protection             —                   ✅
Private Link to backends   —                   ✅

When to Use What

Scenario                              Use
Global traffic routing + failover     Front Door alone
Protect a single-region app           Application Gateway + WAF
Protect a global app                  Front Door + WAF (Premium)
Edge caching + security               Front Door + WAF

In short: Front Door gets traffic to the right place fast; WAF makes sure that traffic is safe.

Azure Resource Graph – find orphaned resource


What is Azure Resource Graph?

Azure Resource Graph lets you query all your resources across subscriptions using KQL (Kusto Query Language)—fast and at scale.

👉 Perfect for finding:

  • Orphaned disks
  • Unattached NICs
  • Unused public IPs
  • Resources missing relationships

What is an “Orphaned Resource”?

An orphaned resource is:

  • Not attached to anything
  • Still costing money or creating risk

Examples:

  • Disk not attached to any VM
  • Public IP not associated
  • NIC not connected
  • NSG not applied

Common Queries to Find Orphaned Resources


1. Unattached Managed Disks

Resources
| where type == "microsoft.compute/disks"
| where properties.diskState == "Unattached"
| project name, resourceGroup, location, diskSizeGB = properties.diskSizeGB

👉 Finds disks not connected to any VM


2. Unused Public IP Addresses

Resources
| where type == "microsoft.network/publicipaddresses"
| where isnull(properties.ipConfiguration)
| project name, resourceGroup, location, sku

👉 These are exposed but unused → security + cost risk


3. Unattached Network Interfaces (NICs)

Resources
| where type == "microsoft.network/networkinterfaces"
| where isnull(properties.virtualMachine)
| project name, resourceGroup, location

4. Unused Network Security Groups (NSGs)

Resources
| where type == "microsoft.network/networksecuritygroups"
| where isnull(properties.networkInterfaces)
    and isnull(properties.subnets)
| project name, resourceGroup, location

5. Empty Resource Groups (Bonus)

ResourceContainers
| where type == "microsoft.resources/subscriptions/resourcegroups"
| join kind=leftouter (
    Resources
    | summarize count() by resourceGroup
) on resourceGroup
| where isnull(count_) or count_ == 0
| project resourceGroup

How to Run These Queries

You can run them in:

  • Azure Portal → Resource Graph Explorer
  • CLI: az graph query -q "<query>"
  • PowerShell: Search-AzGraph -Query "<query>"

Pro Tip (Senior-Level Insight)

👉 Don’t just find orphaned resources—automate cleanup

  • Schedule queries using:
    • Azure Automation
    • Logic Apps
  • Trigger:
    • Alerts
    • Cleanup workflows

Interview Answer

I use Azure Resource Graph with KQL queries to identify orphaned resources at scale across subscriptions. For example, I can query for managed disks whose disk state is unattached, or public IPs without an associated configuration. Similarly, I check for NICs not linked to VMs and NSGs not applied to subnets or interfaces.

Beyond detection, I typically integrate these queries into automated governance workflows—using alerts or scheduled jobs to either notify teams or trigger cleanup—so we continuously reduce cost and improve security posture.


One-Liner to Remember

👉
“Resource Graph + KQL = fast, cross-subscription visibility for orphaned resources.”


Here’s a solid production-ready pattern, plus a script approach you can talk through in an interview.

Production cleanup strategy

Use Azure Resource Graph for detection, then use Azure Automation with Managed Identity for controlled remediation. Resource Graph is built for cross-subscription inventory queries at scale, and its query language is based on KQL. You can run the same queries in the portal, Azure CLI with az graph query, or PowerShell with Search-AzGraph. (Microsoft Learn)

Safe workflow

Phase 1: Detect
Run queries for likely orphaned resources such as unattached disks, unused public IPs, unattached NICs, and unused NSGs. Azure documents advanced query samples and the CLI quickstart for running them. (Microsoft Learn)

Phase 2: Classify
Do not delete immediately. First separate findings into:

  • definitely orphaned
  • likely orphaned
  • needs human review

A good rule is to require at least one of these before cleanup:

  • older than X days
  • no keep tag
  • no recent change window
  • not in protected subscriptions or resource groups

You can also use Resource Graph change history to review whether a resource was recently modified before acting. (Microsoft Learn)
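A minimal sketch of that classification step, using jq over an exported findings file. The `ageDays` and `tags` fields here are illustrative, not a real Resource Graph schema.

```shell
# Classify findings: "definitely orphaned" = older than 90 days AND no keep tag.
# The input format below is invented for the example.
cat > findings.json <<'EOF'
[
  {"name": "disk-old",  "ageDays": 120, "tags": {}},
  {"name": "disk-new",  "ageDays": 3,   "tags": {}},
  {"name": "disk-keep", "ageDays": 200, "tags": {"keep": "true"}}
]
EOF

# Select only entries that clear both criteria.
jq -r '.[] | select(.ageDays > 90 and (.tags.keep // "") != "true") | .name' findings.json
```

Here only disk-old survives the filter; disk-new is too young and disk-keep carries a keep tag.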

Phase 3: Notify
Send a report to the owning team or central platform team. Include:

  • resource ID
  • resource group
  • subscription
  • resource age or last change
  • proposed action
  • deadline for objection

Phase 4: Quarantine before delete
For risky resource types, first tag them with something like:

  • cleanupCandidate=true
  • cleanupMarkedDate=2026-04-13
  • cleanupOwner=platform

Then wait 7 to 30 days depending on the environment.
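The waiting-window check can be sketched in shell (GNU `date` assumed; the tag name follows the cleanupMarkedDate example above):

```shell
# Has a resource tagged cleanupMarkedDate=<YYYY-MM-DD> aged past the window?
quarantine_expired() {
  local marked_date="$1"   # value of the cleanupMarkedDate tag
  local wait_days="$2"     # quarantine window, e.g. 7 or 30
  local marked_epoch now_epoch
  marked_epoch=$(date -d "$marked_date" +%s)   # GNU date syntax
  now_epoch=$(date +%s)
  [ $(( (now_epoch - marked_epoch) / 86400 )) -ge "$wait_days" ]
}

if quarantine_expired "2020-01-01" 30; then
  echo "past the quarantine window: eligible for the delete phase"
fi
```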

Phase 5: Delete with guardrails
Only auto-delete low-risk items such as clearly unattached disks or unused public IPs after the waiting window. Keep production subscriptions on approval-based cleanup unless the criteria are extremely strict.

Good governance rules

A mature setup usually includes:

  • exclusion tags like doNotDelete=true
  • separate policy for prod vs non-prod
  • allowlist of critical subscriptions
  • dry-run mode by default
  • centralized logs of all cleanup actions
  • approval gate for medium-risk deletions

This aligns well with Azure’s broader security and operations guidance, and Azure Automation supports managed identities so runbooks can access Azure without stored secrets. (Microsoft Learn)
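The dry-run-by-default rule might look like this in a cleanup script. `DRY_RUN` and `delete_resource` are names I made up for the sketch, and the real az call is left commented out.

```shell
# Dry-run by default: destructive actions require an explicit opt-in.
DRY_RUN="${DRY_RUN:-true}"

delete_resource() {
  local id="$1"
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] would delete: $id"
  else
    echo "deleting: $id"
    # az resource delete --ids "$id"   # real call, disabled in this sketch
  fi
}

delete_resource "/subscriptions/sub-a/resourceGroups/rg1/providers/Microsoft.Compute/disks/disk1"
```

Running with `DRY_RUN=false` flips the script into destructive mode, which keeps accidental deletions behind a deliberate switch.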

Example architecture

Azure Resource Graph
|
v
Scheduled Automation Runbook
(with Managed Identity)
|
+--> Query orphaned resources
+--> Filter by tags / age / subscription
+--> Write report to Storage / Log Analytics
+--> Notify owners
+--> Optional approval step
+--> Delete approved resources

Example: Azure CLI script

This is a simple version for unattached managed disks. Start in report-only mode.

#!/usr/bin/env bash
set -euo pipefail

QUERY="
Resources
| where type =~ 'microsoft.compute/disks'
| where properties.diskState =~ 'Unattached'
| project id, name, resourceGroup, subscriptionId, location, tags
"

echo "Finding unattached managed disks..."
az graph query -q "$QUERY" --first 1000 -o json > orphaned-disks.json
echo "Report saved to orphaned-disks.json"

# Print a tab-separated summary of the findings
jq -r '.data[] | [.subscriptionId, .resourceGroup, .name, .id] | @tsv' orphaned-disks.json

Azure CLI supports az graph query (via the resource-graph extension) for Resource Graph queries. (Microsoft Learn)
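As a follow-up sketch: once the report file exists, jq can group it per subscription for the notification step. The sample data below is fabricated in the same `{"data": [...]}` shape the script above writes.

```shell
# Fabricated sample in the {"data":[...]} shape az graph query emits.
cat > orphaned-disks.json <<'EOF'
{"data": [
  {"subscriptionId": "sub-a", "resourceGroup": "rg1", "name": "disk1"},
  {"subscriptionId": "sub-a", "resourceGroup": "rg2", "name": "disk2"},
  {"subscriptionId": "sub-b", "resourceGroup": "rg3", "name": "disk3"}
]}
EOF

# Count findings per subscription (group_by sorts by the grouping key).
jq -r '.data | group_by(.subscriptionId)[] | "\(.[0].subscriptionId): \(length) unattached disk(s)"' orphaned-disks.json
```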

Example: safer delete flow in Bash

This version only deletes disks that:

  • are unattached
  • are not tagged doNotDelete=true

#!/usr/bin/env bash
set -euo pipefail

QUERY="
Resources
| where type =~ 'microsoft.compute/disks'
| where properties.diskState =~ 'Unattached'
| extend doNotDelete = tostring(tags.doNotDelete)
| where doNotDelete !~ 'true'
| project id, name, resourceGroup, subscriptionId, location
"

RESULTS=$(az graph query -q "$QUERY" --first 1000 -o json)

echo "$RESULTS" | jq -c '.data[]' | while read -r row; do
  ID=$(echo "$row" | jq -r '.id')
  NAME=$(echo "$row" | jq -r '.name')
  echo "Deleting unattached disk: $NAME"
  az resource delete --ids "$ID"
done

For production, add:

  • dry-run flag
  • approval list
  • deletion logging
  • retry handling
  • resource locks check
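The retry-handling item could be a small wrapper like this (a generic shell sketch, not an Azure CLI feature):

```shell
# Retry a command up to N attempts with linear backoff.
retry() {
  local attempts="$1"; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    sleep "$n"   # linear backoff; prefer exponential + jitter in real runbooks
    n=$((n + 1))
  done
}

# Example: wrap the delete step
retry 3 echo "deleting disk1"
```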

Example: PowerShell runbook pattern

This is closer to what many platform teams use in Azure Automation.

Disable-AzContextAutosave -Scope Process
Connect-AzAccount -Identity

$query = @"
Resources
| where type =~ 'microsoft.network/publicipaddresses'
| where isnull(properties.ipConfiguration)
| extend doNotDelete = tostring(tags.doNotDelete)
| where doNotDelete !~ 'true'
| project id, name, resourceGroup, subscriptionId, location
"@

$results = Search-AzGraph -Query $query

foreach ($item in $results) {
    Write-Output "Cleanup candidate: $($item.name) [$($item.id)]"
    # Dry run by default
    # Remove-AzResource -ResourceId $item.id -Force
}

Search-AzGraph is the PowerShell command for Resource Graph, and Azure Automation supports system-assigned or user-assigned managed identities for authenticating runbooks securely. (Microsoft Learn)

What to say in an interview

A strong answer would sound like this:

I’d use Azure Resource Graph to detect orphaned resources across subscriptions, then feed those results into an Azure Automation runbook running under Managed Identity. I would never delete immediately. Instead, I’d apply filters like age, tags, subscription scope, and recent change history, then notify owners or mark resources for cleanup first. For low-risk resources in non-production, I might automate deletion after a quarantine period. For production, I’d usually keep an approval gate. That gives you cost control without creating operational risk. (Microsoft Learn)

Best resource types to target first

Start with the safest, highest-confidence cleanup candidates:

  • unattached managed disks
  • public IPs with no association
  • NICs not attached to VMs
  • NSGs not attached to subnets or NICs (Microsoft Learn)