Upgrade OCP cluster

Upgrading OpenShift is the ultimate “Day 2” test for an administrator. Because OCP 4.x is Operator-managed, the upgrade is not just a software update; it is a coordinated orchestration across the entire stack—from the Operating System (RHCOS) to the Control Plane and your worker nodes.

Here are the critical “interview-ready” concepts you need to know for OCP upgrades.


1. The Upgrade Flow (The Order Matters)

When you trigger an upgrade via the Web Console or oc adm upgrade, the cluster follows a strict sequence to ensure stability:

  1. Cluster Version Operator (CVO): First, the CVO updates itself. It is the “brain” that knows what the new version of every other operator should be.
  2. Control Plane Operators: The operators for the API server, Controller Manager, and Scheduler are updated.
  3. Etcd: The database is updated (usually one node at a time to maintain quorum).
  4. Control Plane Nodes: The Machine Config Operator (MCO) drains, updates the OS (RHCOS), and reboots the control plane nodes one by one.
  5. Worker Nodes: Finally, the MCO begins rolling updates through your worker node pools.
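While this sequence runs, you can follow each stage from the CLI. A minimal monitoring sketch (standard oc commands; run against a live cluster):

```shell
# Overall upgrade state and target version
oc get clusterversion

# Which operator is currently progressing or degraded
oc get clusteroperators

# MCO progress as nodes are drained and rebooted
oc get machineconfigpools
oc get nodes
```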

2. Update Channels

You must choose a “channel” that dictates how fast you receive updates:

  • Stable: Validated updates that have been out for a while.
  • Fast: Updates that are technically ready but might still be gaining “field experience.”
  • Candidate: Early access for testing.
  • EUS (Extended Update Support): Specific even-numbered versions (e.g., 4.14, 4.16, 4.18) that allow you to skip a minor version during upgrades (e.g., 4.14 → 4.16) to reduce the number of reboots.
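The channel is set on the ClusterVersion resource. A sketch of inspecting and switching it (the channel name here is illustrative):

```shell
# Show the current channel and available update targets
oc adm upgrade

# Switch channels, e.g. to the stable channel for 4.16
oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.16"}}'
```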

3. The “Canary” Strategy (Custom MCPs)

In a large production cluster, you don’t want all 100 worker nodes to start rebooting at once.

  • MachineConfigPool (MCP) Pausing: You can “pause” a pool of nodes. This allows the Control Plane to upgrade, but keeps the Workers on the old version until you are ready.
  • Canary Testing: You can create a small “canary” MCP with only 2–3 nodes. Unpause this pool first, verify your apps work on the new version, and then unpause the rest of the cluster.

4. Critical Troubleshooting Questions

An interviewer will likely give you these scenarios:

  • “The upgrade is stuck at 57%.” What do you do?
    • Check ClusterOperators: Run oc get co. Look for any operator where AVAILABLE=False or PROGRESSING=True.
    • Check Node Status: Run oc get nodes. If a node is SchedulingDisabled, the MCO might be struggling to drain a pod (e.g., a pod without a PDB or a local volume).
  • “Can you roll back an OpenShift upgrade?”
    • NO. This is a trick question. OpenShift does not support rollbacks. Because the etcd database schema changes during upgrades, you can only “roll forward” by fixing the issue or, in a total disaster, by restoring the cluster from an etcd backup taken before the upgrade.
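Because "roll forward or restore" is the only safety net, take an etcd backup before every upgrade. A sketch using the backup script Red Hat ships on control plane nodes (paths follow the documented procedure):

```shell
# Run on a control plane node (via "oc debug node/<master>" + "chroot /host", or SSH)
sudo /usr/local/bin/cluster-backup.sh /home/core/assets/backup

# Produces an etcd snapshot (snapshot_<timestamp>.db) and an archive of
# the static pod resources (static_kuberesources_<timestamp>.tar.gz)
ls /home/core/assets/backup
```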

5. Best Practices for Admins

  • Check the Update Graph: Always use the Red Hat OpenShift Update Graph tool to ensure there is a supported path between your current version and your target.
  • Review Alerts: Clear all critical alerts before starting. If the cluster isn’t healthy before the upgrade, it definitely won’t be healthy after.
  • Pod Disruption Budgets (PDB): Ensure developers have set up PDBs so the upgrade doesn’t accidentally take down all replicas of a critical service at once.
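A minimal PDB sketch (names are illustrative) that keeps at least two replicas of a service running while the MCO drains nodes:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-critical-app-pdb
  namespace: my-app          # illustrative namespace
spec:
  minAvailable: 2            # never evict below 2 running replicas
  selector:
    matchLabels:
      app: my-critical-app   # must match the Deployment's pod labels
```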

The Canary Update strategy allows you to test an OpenShift upgrade on a small subset of nodes before rolling it out to the entire cluster. This is the gold standard for high-availability environments.

Here is the exact administrative workflow and commands you would use.


Step 1: Create a “Canary” MachineConfigPool (MCP)

First, you need a pool that targets only the nodes you want to test.

  1. Label your canary nodes:

Bash

oc label node <node-name> node-role.kubernetes.io/worker-canary=""

  2. Create the MCP. Save this as canary-mcp.yaml and run oc create -f canary-mcp.yaml.

YAML

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-canary]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-canary: ""

Step 2: Pause the Remaining Worker Pools

Before triggering the cluster upgrade, you must “pause” the main worker pool. This tells the Machine Config Operator (MCO): “Update the Control Plane, but do NOT touch these worker nodes yet.”

Bash

# Pause the standard worker pool
oc patch mcp/worker --type='merge' -p '{"spec":{"paused":true}}'

Step 3: Trigger the Upgrade

Now, start the cluster upgrade as usual (via Console or CLI).

Bash

oc adm upgrade --to=4.16.x

What happens now?

  • The Control Plane upgrades and reboots.
  • The Worker-Canary pool (which is NOT paused) updates and reboots.
  • The Worker pool (which IS paused) stays on the old version.

Step 4: Verify and Complete the Rollout

Once the Canary nodes are successfully updated and your applications are verified, you can roll out the update to the rest of the cluster by unpausing the main pool.

  1. Check status:

Bash

oc get mcp

You should see worker-canary is UPDATED, but worker shows UPDATED=False.

  2. Unpause the main pool:

Bash

oc patch mcp/worker --type='merge' -p '{"spec":{"paused":false}}'

The MCO will now begin the rolling update of the remaining worker nodes.

Critical Interview Warning: The “Pause” Alert

If an interviewer asks: “Is it safe to leave an MCP paused indefinitely?”

  • Answer: No. Since OCP 4.11, a critical alert fires if a pool is paused for more than 1 hour during an update.
  • Reason: Pausing an MCP prevents Certificate Rotation. If you leave it paused too long (usually >24 hours during an upgrade cycle), the nodes’ Kubelet certificates may expire, and the nodes will go NotReady, potentially breaking the cluster.

In OpenShift, Operators are the software managers that keep your cluster healthy. When an operator fails, it shows up as Degraded. As an admin, your job is to find the “who, why, and how” of the failure.

Here is the professional troubleshooting sequence for an OCP Operator failure.

1. Identify the Failing Operator

The first step is always to find which operator is complaining.

Bash

# Get the status of all cluster operators
# Get the status of all cluster operators
oc get clusteroperators   # short form: oc get co

What to look for: Look for DEGRADED=True or AVAILABLE=False. Common ones that fail are authentication, console, image-registry, and machine-config.


2. The Investigation Sequence

Once you identify the degraded operator (e.g., authentication), follow this 4-step drill:

A. Describe the ClusterOperator

This gives you the “high-level” reason for the failure (often a specific error message from the operator itself).

Bash

oc describe clusteroperator authentication

B. Check the Operator’s Namespace

Every operator has its own namespace (usually starting with openshift-).

Bash

# Find the namespace and pods
oc get pods -A | grep authentication

C. Inspect the Pod Logs

The operator is just a pod. If it’s failing, it will tell you why in its logs.

Bash

oc logs -n openshift-authentication-operator deployment/authentication-operator

D. Check Events

Sometimes the problem isn’t the code, but the infrastructure (e.g., “Failed to pull image” or “Insufficient CPU”).

Bash

oc get events -n openshift-authentication-operator --sort-by='.lastTimestamp'

3. Common “Admin-Level” Failure Scenarios

In an interview, you can shine by mentioning these specific, real-world failures:

Failing operator → typical reason → the fix:

  • machine-config: Node can’t drain because of a Pod Disruption Budget (PDB). Fix: manually move the pod or adjust the PDB temporarily.
  • authentication: Etcd is slow or the internal OAuth secret is out of sync. Fix: check etcd health; sometimes deleting the operator pod to force a restart helps.
  • image-registry: The backend storage (S3, Azure Blob, NFS) is full or disconnected. Fix: check the configs.imageregistry.operator.openshift.io resource and the storage backend.
  • ingress: Port 80/443 is blocked on the LoadBalancer, or the Router deployment is scaling. Fix: check the IngressController custom resource and the cloud provider LB status.
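For the machine-config drain case, a hedged sketch of unblocking a stuck node (use --force with care; it evicts pods not managed by a controller):

```shell
# See why the drain is blocked (MCO controller logs)
oc logs -n openshift-machine-config-operator deploy/machine-config-controller | grep -i drain

# Is a PDB preventing eviction?
oc get pdb -A

# Manually finish the drain the MCO could not complete
oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
```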

4. The “Nuclear” Option: Must-Gather

If the API is behaving so poorly that you can’t even run these commands, or if you need to open a Red Hat Support ticket, use Must-Gather.

Bash

oc adm must-gather

Must-Gather is an admin’s best friend. It creates a local directory with every log, resource definition, and config file from the cluster (sensitive Secret contents are not collected). You can then use grep or ag locally to find the needle in the haystack.


5. Node-Level Debugging (When the API is down)

If the operator is failing because the node itself is unresponsive, you must go under the hood:

Bash

# Access the node via a debug pod (preferred)
oc debug node/<node-name>
# Once inside the debug pod, switch to host binaries
chroot /host
# Check the container runtime (CRI-O)
crictl ps
crictl logs <container_id>

OpenShift (OCP) interview

For an OpenShift (OCP) interview in 2026, you should expect questions that move beyond basic Kubernetes concepts and focus on enterprise operations, automation (Operators), and security.

Here is a curated list of high-value interview questions categorized by role and complexity.


1. Architectural Concepts

  • What is the role of the Cluster Version Operator (CVO)?
    • Answer: The CVO is the heart of OCP 4.x upgrades. It monitors the “desired state” of the cluster’s operators (the “payload”) and ensures the cluster is updated in a safe, coordinated manner across all components.
  • Explain the difference between an Infrastructure Node and a Worker Node.
    • Answer: Infrastructure nodes are used to host “cluster-level” services like the Router (Ingress), Monitoring (Prometheus/Grafana), and Registry. By labeling nodes as infra, companies can often save on Red Hat subscription costs, as these nodes typically don’t require the same licensing as nodes running application workloads.
  • What is the “Etcd Quorum” and why is it important in OCP?
    • Answer: OpenShift typically requires an odd number of Control Plane nodes (usually 3) to maintain a quorum in the etcd database. If you lose more than half of your masters, the cluster becomes read-only to prevent data corruption.
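The infra-node pattern above is usually implemented with a label plus a taint, so only workloads that tolerate it land there. A sketch (node name illustrative; infra components such as the router get a matching toleration):

```shell
# Mark the node as infra
oc label node <node-name> node-role.kubernetes.io/infra=""

# Keep ordinary application pods off the node
oc adm taint node <node-name> node-role.kubernetes.io/infra=reserved:NoSchedule
```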

2. Networking & Traffic (The Gateway API Era)

  • Explain Ingress vs. Route vs. Gateway API. (See previous discussion)
    • Key Focus: Interviewers want to know if you understand that Routes are OCP-native, Ingress is K8s-standard, and Gateway API is the future standard for advanced traffic management (canary, mirroring, etc.).
  • How does “Service Serving Certificate Secrets” work in OCP?
    • Answer: OCP can automatically generate a TLS certificate for a Service. You annotate a Service with service.beta.openshift.io/serving-cert-secret-name. OCP then creates a secret containing a cert/key signed by the internal Cluster CA, allowing for easy end-to-end encryption.
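A sketch of the annotation in use (service name illustrative). OCP then creates the my-svc-tls Secret with a cert/key signed by the internal service CA:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-svc
  annotations:
    # Ask the service CA to generate a signed cert/key into this Secret
    service.beta.openshift.io/serving-cert-secret-name: my-svc-tls
spec:
  selector:
    app: my-app
  ports:
    - port: 8443
      targetPort: 8443
```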

3. Security (The “Hardest” Category)

  • Scenario: A developer says their pod won’t start because of a “Security Context” error. What do you check?
    • Answer: I would check the Security Context Constraints (SCC). By default, OCP runs pods with the restricted-v2 SCC, which prevents running as root. If the pod requires root or host access, I’d check if the ServiceAccount has been granted a more permissive SCC like anyuid or privileged.
  • What are NetworkPolicies vs. EgressFirewalls?
    • Answer: NetworkPolicies control traffic between pods inside the cluster (East-West). EgressFirewalls (part of OCP’s OVN-Kubernetes) control traffic leaving the cluster to external IPs or CIDR blocks (North-South).
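A minimal EgressFirewall sketch (OVN-Kubernetes; the object must be named default and applies per namespace; CIDRs are illustrative):

```yaml
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default                        # OVN-K requires this exact name
  namespace: my-app
spec:
  egress:
    - type: Allow
      to:
        cidrSelector: 192.168.10.0/24  # allow traffic to this external range
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0        # deny everything else leaving the cluster
```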

4. Troubleshooting & Operations

  • How do you recover a cluster if the Control Plane certificates have expired?
    • Answer: This usually involves approving the pending CSRs (Certificate Signing Requests) with oc adm certificate approve once the kubelets regenerate them. OCP 4.x generally tries to auto-renew these certificates, but a long cluster shutdown or clock drift can break the automation.
  • Describe the Source-to-Image (S2I) workflow.
    • Answer: S2I takes source code from Git, injects it into a “builder image” (like Node.js or Java), and outputs a ready-to-run container image. It simplifies the CI/CD process for developers who don’t want to write Dockerfiles.

5. Advanced / 2026 Trends

  • What is OpenShift Virtualization (KubeVirt)?
    • Answer: It allows you to run legacy Virtual Machines (VMs) as pods on OpenShift. This is critical for “modernizing” apps where one part is a container and the other is a legacy Windows or Linux VM that can’t be containerized yet.
  • How does Red Hat Advanced Cluster Management (RHACM) help in a multi-cluster setup?
    • Answer: RHACM provides a single pane of glass to manage security policies, application placement, and cluster lifecycle (creation/deletion) across multiple OCP clusters on AWS, Azure, and on-prem.

Quick Tip for the Interview

Whenever you answer, use the phrase “Operator-led design.” OpenShift 4 is built entirely on Operators. If the interviewer asks, “How do I fix the registry?” the best answer starts with, “I would check the status of the Image Registry Operator using oc get clusteroperator.” This shows you understand the fundamental architecture of the platform.

As an OpenShift Administrator, your interview will focus heavily on cluster stability, lifecycle management (upgrades), security enforcement, and the “Day 2” operations that keep an enterprise cluster running.

Here are the top admin-focused interview questions for 2026, divided by functional area.


1. Cluster Lifecycle & Maintenance

  • How does the Cluster Version Operator (CVO) manage upgrades, and what do you check if an upgrade hangs at 57%?
    • Answer: The CVO coordinates with all other cluster operators to reach a specific “desired version.” If it hangs, I check oc get clusteroperators to see which specific operator is degraded. Usually, it’s the Machine Config Operator (MCO) waiting for nodes to drain or the Authentication Operator having issues with etcd.
  • What is the “Must-Gather” tool, and when would you use it?
    • Answer: oc adm must-gather is the primary diagnostic tool. It launches a pod that collects logs, CRD states, and operating system debugging info. I use it before opening a Red Hat support ticket or when a complex issue involves multiple operators.
  • Explain how to back up and restore the etcd database.
    • Answer: I use the cluster-backup.sh script (/usr/local/bin/cluster-backup.sh) provided on the control plane nodes. For restoration, I must stop the static pods for the API server and etcd, then use the backup to restore the data directory. It’s critical to do this on a single control plane node first to re-establish a quorum.

2. Node & Infrastructure Management

  • What is a MachineConfigPool (MCP), and why would you pause it?
    • Answer: An MCP groups nodes (like master or worker) so the MCO can apply configurations to them. I would pause an MCP during a sensitive maintenance window or when troubleshooting a configuration change that I don’t want to roll out to all nodes at once.
  • How do you add a custom SSH key or a CronJob to the underlying RHCOS nodes?
    • Answer: You don’t log into the nodes manually. You create a MachineConfig YAML. The MCO then detects this, reboots the nodes (if necessary), and applies the change to the immutable filesystem.
  • What happens if a node enters a NotReady state?
    • Answer: First, I check node pressure (CPU/Memory/Disk). Then I check the kubelet and crio services on the node using oc debug node/<node-name>. I also check for network reachability between the node and the Control Plane.
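A sketch of such a MachineConfig adding an SSH key for the core user on workers (the key value is illustrative; the MCO rolls it out node by node):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-ssh
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker pool
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
        - name: core
          sshAuthorizedKeys:
            - ssh-ed25519 AAAA... admin@example.com   # illustrative key
```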

3. Networking & Security

  • What is the benefit of OVN-Kubernetes over the legacy OpenShift SDN?
    • Answer: OVN-K is the default in 4.x. It supports modern features like IPsec encryption for pod-to-pod traffic, smarter load balancing, and Egress IPs for specific projects to exit the cluster via a fixed IP address for firewall white-listing.
  • A user is complaining they can’t reach a service in another project. What do you check?
    • Answer:
      1. NetworkPolicies: Is there a policy blocking “Cross-Namespace” traffic?
      2. Service/Endpoints: Does the Service have active Endpoints (oc get endpoints)?
      3. Namespace labels: If using a high-isolation network plugin, do the namespaces have the correct labels to “talk” to each other?
  • How do you restrict a specific group of users from creating LoadBalancer type services?
    • Answer: The simplest option is a ResourceQuota with services.loadbalancers: 0 in their namespaces. More flexibly, I would use a Policy Engine like Gatekeeper/OPA (or an admission webhook) to deny Service objects of type LoadBalancer for that group.
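For the cross-namespace scenario above, a sketch of a NetworkPolicy that admits traffic from a labeled peer namespace (names and labels illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend-ns
  namespace: backend          # the project hosting the service
spec:
  podSelector: {}             # applies to all pods in this namespace
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # the calling project
```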

4. Storage & Capacity Planning

  • How do you handle “Volume Expansion” if a database runs out of space?
    • Answer: If the underlying StorageClass supports allowVolumeExpansion: true, I simply edit the PersistentVolumeClaim (PVC) and increase the storage value. OpenShift and the CSI driver handle the resizing of the file system on the fly.
  • What is the difference between ReadWriteOnce (RWO) and ReadWriteMany (RWX)?
    • Answer: RWO allows only one node to mount the volume (good for databases). RWX allows multiple nodes/pods to mount it simultaneously (required for shared file storage like NFS or ODF).
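The volume-expansion flow in commands (StorageClass, PVC name, and sizes are illustrative):

```shell
# Confirm the StorageClass permits expansion
oc get storageclass my-sc -o jsonpath='{.allowVolumeExpansion}'

# Bump the request; the CSI driver resizes the volume and filesystem
oc patch pvc my-db-pvc -n my-app --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'

# Watch the resize conditions
oc describe pvc my-db-pvc -n my-app
```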

5. Scenario-Based: “The Midnight Call”

  • Scenario: The Web Console is down, and oc commands are timing out. Where do you start?
    • Answer: This sounds like an API Server or etcd failure. I would:
      1. Log into a Control Plane node directly via SSH.
      2. Check the status of static pods in /etc/kubernetes/manifests.
      3. Run crictl ps to see if the kube-apiserver or etcd containers are crashing.
      4. Check the node’s disk space (etcd often fails if the disk is 100% full).

Pro-Tip for Admin Interviews:

In 2026, emphasize GitOps. Mention that you prefer managing cluster configurations (like HTPasswd providers or Quota objects) via ArgoCD rather than manual oc apply commands. This shows you are an admin who values Idempotency and Disaster Recovery.

cert-manager in Kubernetes

cert-manager in Kubernetes is an add-on that automates TLS certificate management.

It helps you:

  • request certificates
  • renew them before they expire
  • store them as Kubernetes Secrets
  • connect them to Ingresses, gateways, and apps

Common uses:

  • Let’s Encrypt certificates for public apps
  • internal PKI certificates for cluster services
  • webhook, mTLS, and service cert automation

Core pieces:

  • Issuer / ClusterIssuer: tells cert-manager where certs come from
    • Issuer = namespaced
    • ClusterIssuer = cluster-wide
  • Certificate: declares the cert you want
  • Secret: where the cert and private key get stored
  • Challenge / Order: used mainly with ACME providers like Let’s Encrypt

Typical flow:

  1. You install cert-manager.
  2. Create an Issuer or ClusterIssuer.
  3. Create a Certificate resource or annotate an Ingress.
  4. cert-manager gets the cert and saves it into a Secret.
  5. Your Ingress/controller uses that Secret for HTTPS.

Very simple example with Let’s Encrypt:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx

Then a certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myapp-cert
  namespace: default
spec:
  secretName: myapp-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com

Why people use it:

  • avoids manual certificate renewal
  • reduces downtime from expired certs
  • works natively with Kubernetes resources
  • supports public and private certificate authorities

Alright — let’s walk through cert-manager in Kubernetes step by step, focusing on the three things you asked for:


1️⃣ How to install cert-manager

The easiest and most reliable way is using Helm.

Add the Helm repo

helm repo add jetstack https://charts.jetstack.io
helm repo update

Install cert-manager (with CRDs)

helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true

Verify installation

kubectl get pods -n cert-manager

You should see pods like:

  • cert-manager
  • cert-manager-webhook
  • cert-manager-cainjector

2️⃣ How cert-manager works with Ingress

This is the most common real-world setup.

🔁 Flow (important concept)

  1. You create an Ingress
  2. Add a special annotation → tells cert-manager to issue a cert
  3. cert-manager:
    • requests cert from Let’s Encrypt
    • completes challenge (HTTP-01 or DNS-01)
    • stores cert in a Secret
  4. Ingress uses that Secret → HTTPS works automatically

🧩 Example Ingress with cert-manager

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
    - hosts:
        - example.com
      secretName: myapp-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-service
                port:
                  number: 80

👉 Key things:

  • annotation → tells cert-manager which issuer to use
  • tls.secretName → where the cert will be stored
  • cert-manager fills that Secret automatically

3️⃣ Issuer vs ClusterIssuer vs Certificate

This is where people usually get confused — here’s the clean breakdown:


Issuer

  • Namespace-scoped
  • Only works inside one namespace

kind: Issuer
metadata:
  name: my-issuer
  namespace: default

Use when:

  • small apps
  • strict namespace isolation

ClusterIssuer

  • Cluster-wide
  • Can be used from any namespace

kind: ClusterIssuer
metadata:
  name: letsencrypt-prod

Use when:

  • shared infrastructure
  • most production setups ✅

🟣 Certificate

  • Declares what cert you want
  • cert-manager turns this into a real cert

kind: Certificate
metadata:
  name: my-cert
spec:
  secretName: my-tls
  dnsNames:
    - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

How they connect together

Think of it like this:

  • Issuer / ClusterIssuer → “Who gives me the cert?”
  • Certificate → “I want a cert for this domain”
  • Secret → “Here’s your actual cert + key”
  • Ingress → “Use this cert for HTTPS”

Typical production setup

Most teams use:

  • ClusterIssuer (Let’s Encrypt)
  • ✅ Ingress annotation (auto certs)
  • ❌ No manual Certificate objects (optional)

Bonus: HTTP-01 vs DNS-01

HTTP-01 (most common)

  • cert-manager creates a temporary endpoint
  • Let’s Encrypt verifies via HTTP
  • works with Ingress

DNS-01

  • adds a DNS record to prove ownership
  • needed for:
    • wildcard certs (*.example.com)
    • internal services

Great question — this is where cert-manager becomes really powerful.

At a high level:

👉 cert-manager = certificate lifecycle automation
👉 Service mesh (Istio / Linkerd) = uses certificates for mTLS between services

So cert-manager can act as the certificate authority (or CA manager) for your mesh.


🧠 Big picture: how they fit together

cert-manager → issues certificates
service mesh → uses them for mTLS
result → secure pod-to-pod communication

🔐 What mTLS in a service mesh actually means

In both Istio and Linkerd:

  • Every pod gets a certificate + private key
  • Pods authenticate each other using certs
  • Traffic is:
    • encrypted ✅
    • authenticated ✅
    • tamper-proof ✅

⚙️ Option 1: Built-in CA (default behavior)

Istio / Linkerd by default:

  • run their own internal CA
  • automatically issue certs to pods
  • rotate certs

👉 This works out-of-the-box and is easiest.


🧩 Option 2: Using cert-manager as the CA

This is where integration happens.

Instead of mesh managing certs itself:

👉 cert-manager becomes the source of truth for certificates


🧱 Architecture with cert-manager

cert-manager (Issuer / ClusterIssuer)
        ↓
Mesh control plane (Istio / Linkerd)
        ↓
Sidecars / proxies in pods

🔵 Istio + cert-manager

Default Istio:

  • uses istiod as CA

With cert-manager:

  • you replace Istio’s CA with:
    • cert-manager + external CA (Vault, Let’s Encrypt, internal PKI)

Common approach: Istio + cert-manager + external CA

cert-manager:

  • manages root/intermediate certs

Istio:

  • requests workload certs from that CA

Why do this?

  • centralized certificate management
  • enterprise PKI integration (e.g. HashiCorp Vault)
  • compliance requirements

Linkerd + cert-manager

Linkerd has cleaner native integration.

👉 Linkerd actually recommends using cert-manager.


How it works:

  • cert-manager issues:
    • trust anchor (root cert)
    • issuer cert
  • Linkerd uses those to:
    • issue certs to proxies
    • rotate automatically

Example flow:

  1. Create a ClusterIssuer (e.g. self-signed or Vault)
  2. cert-manager generates:
    • root cert
    • intermediate cert
  3. Linkerd control plane uses them
  4. Sidecars get short-lived certs

🔁 Certificate lifecycle in mesh (with cert-manager)

  1. cert-manager creates CA certs
  2. mesh control plane uses them
  3. sidecars request short-lived certs
  4. certs rotate automatically

When to use cert-manager with a mesh

✅ Use cert-manager if:

  • you need custom CA / PKI
  • you want centralized certificate control
  • you’re integrating with:
    • Vault
    • enterprise PKI
  • compliance/security requirements

❌ Skip it if:

  • you just want simple mTLS
  • default mesh CA is enough

Important distinction

👉 cert-manager does NOT handle:

  • traffic encryption itself
  • service-to-service routing

👉 service mesh does NOT handle:

  • external certificate issuance (well)
  • complex PKI integrations (alone)

Simple mental model

  • cert-manager = certificate factory
  • Istio / Linkerd = security + traffic engine

Interview-style summary

If you need a sharp answer:

“cert-manager integrates with service meshes by acting as an external certificate authority. While Istio and Linkerd can issue certificates internally, cert-manager enables centralized PKI management, supports external CAs like Vault, and provides automated rotation, making it useful for production-grade mTLS setups.”


Here’s a real-world debugging checklist for cert-manager + service mesh / mTLS, organized in the order that usually finds the issue fastest.

1. Start with the symptom, not the YAML

First sort the failure into one of these buckets:

  • Certificate issuance problem: Secrets are missing, Certificate is not Ready, ACME challenges fail, or issuer/webhook errors appear. cert-manager’s troubleshooting flow centers on the Certificate, CertificateRequest, Order, and Challenge resources. (cert-manager)
  • Mesh identity / mTLS problem: certificates exist, but workloads still fail handshakes, sidecars can’t get identities, or mesh health checks fail. Istio and Linkerd both separate certificate management from runtime identity distribution. (Istio)

That split matters because cert-manager can be healthy while the mesh is broken, and vice versa. (cert-manager)

2. Confirm the control planes are healthy

Check the obvious first:

kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n linkerd

For cert-manager, the important core components are the controller, webhook, and cainjector; webhook issues are a documented source of certificate failures. (cert-manager)

For Linkerd, run:

linkerd check

Linkerd’s official troubleshooting starts with linkerd check, and many identity and certificate problems show up there directly. (Linkerd)

For Istio, check control-plane health and then inspect config relevant to CA integration if you are using istio-csr or another external CA path. Istio’s cert-manager integration for workload certificates requires specific CA-server changes. (cert-manager)

3. Check the certificate objects before the Secrets

If cert-manager is involved, do this before anything else:

kubectl get certificate -A
kubectl describe certificate <name> -n <ns>
kubectl get certificaterequest -A
kubectl describe certificaterequest <name> -n <ns>

cert-manager’s own troubleshooting guidance points to these resources first because they expose the reason issuance or renewal failed. (cert-manager)

What you’re looking for:

  • Ready=False
  • issuer not found
  • permission denied
  • webhook validation errors
  • failed renewals
  • pending requests that never progress

If you’re using ACME, continue with:

kubectl get order,challenge -A
kubectl describe order <name> -n <ns>
kubectl describe challenge <name> -n <ns>

ACME failures are usually visible at the Order / Challenge level. (cert-manager)

4. Verify the issuer chain and secret contents

Typical failure pattern: the Secret exists, but it is the wrong Secret, wrong namespace, missing keys, or signed by the wrong CA.

Check:

kubectl get issuer,clusterissuer -A
kubectl describe issuer <name> -n <ns>
kubectl describe clusterissuer <name>
kubectl get secret <secret-name> -n <ns> -o yaml

For mesh-related certs, validate:

  • the Secret name matches what the mesh expects
  • the Secret is in the namespace the mesh component actually reads
  • the chain is correct
  • the certificate has not expired
  • the issuer/trust anchor relationship is the intended one

In Linkerd specifically, the trust anchor and issuer certificate are distinct, and Linkerd documents that workload certs rotate automatically but the control-plane issuer/trust-anchor credentials do not unless you set up rotation. (Linkerd)

5. Check expiration and rotation next

A lot of “random” mesh outages are just expired identity material.

For Linkerd, verify:

  • trust anchor validity
  • issuer certificate validity
  • whether rotation was automated or done manually

Linkerd’s docs are explicit that proxy workload certs rotate automatically, but issuer and trust anchor rotation require separate handling; expired root or issuer certs are a known failure mode. (Linkerd)

For Istio, if using a custom CA or Kubernetes CSR integration, verify the configured CA path and signing certs are still valid and match the active mesh configuration. (cert-manager)

6. If this is Istio, verify whether the mesh is using its built-in CA or an external one

This is a very common confusion point.

If you use cert-manager with Istio workloads, you are typically not just “adding cert-manager”; you are replacing or redirecting the CA flow, often through istio-csr or Kubernetes CSR integration. cert-manager’s Istio integration docs call out changes like disabling the built-in CA server and setting the CA address. (cert-manager)

So check:

  • Is istiod acting as CA, or is an external CA path configured?
  • Is caAddress pointing to the expected service?
  • If istio-csr is used, is it healthy and reachable?
  • Are workload cert requests actually reaching the intended signer?

If that split-brain exists, pods may get no certs or certs from the wrong signer. That is an inference from how Istio’s custom CA flow is wired. (cert-manager)

7. If this is Linkerd, run the identity checks early

For Linkerd, do not guess. Run:

linkerd check
linkerd check --proxy

The Linkerd troubleshooting docs center on linkerd check, and certificate / identity issues often surface there more quickly than raw Kubernetes inspection. (Linkerd)

Then look for:

  • identity component failures
  • issuer/trust-anchor mismatch
  • certificate expiration warnings
  • injected proxies missing identity

If linkerd check mentions expired identity material, go straight to issuer/trust-anchor rotation docs. (Linkerd)

8. Verify sidecar or proxy injection happened

If the pod is not meshed, mTLS debugging is a distraction.

Check:

kubectl get pod <pod> -n <ns> -o yaml

Look for the expected sidecar/proxy containers and mesh annotations. If they are absent, the issue is injection or policy, not certificate issuance. Istio and Linkerd both rely on the dataplane proxy to actually use workload identities for mTLS. (Istio)

9. Check policy mismatches after identities are confirmed

Once certificates and proxies look correct, inspect whether the traffic policy demands mTLS where the peer does not support it.

For Istio, check authentication policy objects such as PeerAuthentication and any destination-side expectations. Istio’s authentication docs cover how mTLS policy is applied. (Istio)

Classic symptom:

  • one side is strict mTLS
  • the other side is plaintext, outside mesh, or not injected

That usually produces handshake/reset errors even when cert-manager is completely fine. This is an inference from Istio’s mTLS policy model. (Istio)
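Before reasoning about mismatches, inventory the policy surface. A sketch, assuming kubectl access and the standard Istio CRDs:

```shell
# All mTLS modes in force (STRICT vs PERMISSIVE, per namespace/workload)
kubectl get peerauthentication -A

# DestinationRules can override TLS behavior on the client side; scan for tls blocks
kubectl get destinationrule -A -o yaml | grep -B2 -A3 "tls:"
```

A STRICT PeerAuthentication on the server plus a DISABLE tls setting in a client DestinationRule is exactly the one-side-strict, one-side-plaintext pattern described above.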

10. Read the logs in this order

When the issue is still unclear, the best signal usually comes from logs in this order:

  1. cert-manager controller
  2. cert-manager webhook
  3. mesh identity/CA component (istiod, istio-csr, or Linkerd identity)
  4. the source and destination proxy containers

Use:

kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
kubectl logs -n istio-system deploy/istiod
kubectl logs -n <istio-csr-namespace> deploy/istio-csr
kubectl logs -n linkerd deploy/linkerd-identity
kubectl logs <pod> -n <ns> -c <proxy-container>

cert-manager specifically documents webhook and issuance troubleshooting as core paths. Linkerd and Istio docs likewise center on their identity components for mesh cert issues. (cert-manager)

11. For ingress or gateway TLS, separate north-south from east-west

A lot of teams mix up:

  • ingress/gateway TLS
  • service-to-service mTLS

With Istio, cert-manager integration for gateways is straightforward and separate from workload identity. Istio’s docs show cert-manager managing gateway TLS credentials, while workload certificate management is handled through different CA mechanisms. (Istio)

So ask:

  • Is the failure only at ingress/gateway?
  • Or only pod-to-pod?
  • Or both?

If only ingress is broken, inspect the gateway Secret and gateway config, not mesh identity. (Istio)
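For the ingress-only case, inspecting the gateway’s certificate directly is often fastest. A sketch: the secret name and namespace are placeholders for wherever your gateway TLS credential lives.

```shell
# Decode the gateway TLS secret and check subject, issuer, and expiry
kubectl get secret my-gateway-cert -n istio-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | \
  openssl x509 -noout -subject -issuer -enddate
```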

12. Fast triage map

Use this shortcut:

  • Certificate not Ready → inspect CertificateRequest, Order, Challenge, issuer, webhook. (cert-manager)
  • Secret exists but mesh still fails → inspect trust chain, expiry, namespace, and mesh CA configuration. (cert-manager)
  • Linkerd only → run linkerd check, then inspect issuer/trust anchor status. (Linkerd)
  • Istio + cert-manager for workloads → verify external CA wiring, especially CA server disablement and caAddress. (cert-manager)
  • Handshake failures with healthy certs → inspect mesh policy and whether both endpoints are actually meshed. (Istio)

13. The three most common root causes

In practice, the big ones are:

  1. Expired or non-rotated issuer / trust anchor, especially in Linkerd. (Linkerd)
  2. Istio external CA miswiring, especially when using cert-manager for workloads rather than just gateway TLS. (cert-manager)
  3. Policy/injection mismatch, where strict mTLS is enabled but one side is not part of the mesh. (Istio)

14. Minimal command pack to keep handy

kubectl get certificate,certificaterequest,issuer,clusterissuer -A
kubectl describe certificate <name> -n <ns>
kubectl get order,challenge -A
kubectl logs -n cert-manager deploy/cert-manager
kubectl logs -n cert-manager deploy/cert-manager-webhook
linkerd check
linkerd check --proxy
kubectl logs -n istio-system deploy/istiod
kubectl get pods -A -o wide
kubectl get secret -A

Flux (or FluxCD)

Flux (or FluxCD) is a GitOps continuous delivery tool for Kubernetes. Here’s a concise breakdown:


What it does

Flux is an operator that runs in your Kubernetes cluster, constantly comparing the cluster’s live state to the state defined in your Git repo. If they differ, Flux automatically makes changes to the cluster to match the repo. In other words, Git is the single source of truth — you push a change to Git, Flux detects it and applies it to the cluster automatically, with no manual kubectl apply needed.


How it works — core components

Core components of FluxCD (the GitOps Toolkit) include the Source Controller, Kustomize Controller, Helm Controller, and Notification Controller. Each is a separate Kubernetes controller responsible for one concern:

  • Source Controller — watches Git repos, Helm repos, OCI registries, and S3 buckets for changes
  • Kustomize Controller — applies raw YAML and Kustomize overlays to the cluster
  • Helm Controller — manages HelmRelease objects (declarative Helm chart deployments)
  • Notification Controller — sends alerts to Slack, Teams, etc. when syncs succeed or fail

Key characteristics

  • Pull-based model: Flux pulls desired state from Git into the cluster, so neither the Git provider nor any external system needs credentials for the cluster. This is more secure than push-based pipelines where your CI system needs cluster credentials.
  • Drift detection: If your live cluster diverges from Git (e.g., due to manual edits), Flux will detect the drift and revert it, ensuring deterministic deployments.
  • Kubernetes-native: Flux v2 is built from the ground up to use Kubernetes’ API extension system. Everything is a CRD — GitRepository, Kustomization, HelmRelease, etc.
  • Security-first: Flux uses true Kubernetes RBAC via impersonation and supports multiple Git repositories. It follows a pull vs. push model, least amount of privileges, and adheres to Kubernetes security policies with tight integration with security tools.
  • Multi-cluster: Flux can use one Kubernetes cluster to manage apps in either the same or other clusters, spin up additional clusters, and manage cluster fleets.
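The CRD model above can be sketched minimally: a GitRepository defines where to pull from, and a Kustomization defines what to apply. Names, URL, and paths below are placeholders.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 1m                # how often to poll Git for changes
  url: https://github.com/my-org/my-app
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m               # how often to reconcile against the cluster
  sourceRef:
    kind: GitRepository
    name: my-app
  path: ./deploy
  prune: true                 # garbage-collect objects removed from Git
```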

CNCF standing & adoption

Flux is a Cloud Native Computing Foundation (CNCF) graduated project, used in production by various organisations and cloud providers. Notable users include Deutsche Telekom (managing 200+ clusters with just 10 engineers), the US Department of Defense, and Microsoft Azure (which uses Flux natively in AKS and Azure Arc).


Flux vs. Argo CD (the main alternative)

Flux CD is highly composable — use only the controllers you need. It’s preferred by teams who already think in CRDs and reconciliation loops, and is excellent for infrastructure-as-code and complex dependency handling. The main trade-off is that Flux has some drawbacks such as lack of a native UI and a steep learning curve. Argo CD is the better choice if your team wants a rich visual dashboard out of the box.


Relation to OCP

Flux is commonly used with OpenShift as the GitOps engine for managing cluster configuration and application deployments. Red Hat also ships OpenShift GitOps (based on Argo CD) as an official operator, so in OCP environments you’ll encounter both — Flux tends to be chosen by platform engineering teams who want tighter Kubernetes-native control, while OpenShift GitOps is the supported out-of-the-box option from Red Hat.

Here’s a thorough breakdown of how Flux integrates with OCP:


Installation — two options

Option 1: Flux Operator via OperatorHub (recommended)

Flux can be installed on a Red Hat OpenShift cluster directly from OperatorHub using the Flux Operator, an open-source project in the Flux ecosystem that provides a declarative API for lifecycle management of the Flux controllers on OpenShift.

Once installed, you declare a FluxInstance CR with cluster.type: openshift:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"
    registry: "ghcr.io/fluxcd"
  cluster:
    type: openshift # ← tells Flux it's on OCP
    multitenant: true
    networkPolicy: true
  sync:
    kind: GitRepository
    url: "https://my-git-server.com/my-org/my-fleet.git"
    ref: "refs/heads/main"
    path: "clusters/my-cluster"

Option 2: flux bootstrap CLI

To install Flux on OpenShift via the CLI, use the flux bootstrap command. It works with GitHub, GitLab, and generic Git providers. You need cluster-admin privileges to install Flux on OpenShift.
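A typical bootstrap invocation looks like this (owner, repository, and path are placeholders; the token needs repo scope):

```shell
export GITHUB_TOKEN=<token>
flux bootstrap github \
  --owner=my-org \
  --repository=my-fleet \
  --branch=main \
  --path=clusters/my-cluster
```

Bootstrap commits the Flux manifests into the repository at the given path and then installs the controllers, so the Flux installation itself is managed from Git.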


The OCP-specific challenge: SCCs

OCP’s default restricted-v2 SCC blocks containers from running as root — and Flux controllers, like many Kubernetes tools, need specific adjustments to run cleanly. The official integration handles this by:

  • Shipping a scc.yaml manifest that grants Flux controllers the correct non-root SCC permissions
  • Patching the Kustomization to remove the default seccomp profile and enforce the correct UID expected by Flux images, preventing OCP from altering the container user

The cluster.type: openshift flag in the FluxInstance spec automatically applies these adjustments — no manual SCC patching needed when using the Flux Operator.


What the integration looks like end-to-end

┌─────────────────────────────────────────────────────┐
│ Git Repository                                      │
│   clusters/my-cluster/                              │
│   ├── flux-system/   (Flux bootstrap manifests)     │
│   ├── namespaces/    (OCP Projects)                 │
│   ├── rbac/          (Roles, RoleBindings, SCCs)    │
│   └── apps/          (Deployments, Routes, etc.)    │
└────────────────────┬────────────────────────────────┘
                     │ pull (every ~1 min)
┌────────────────────┴────────────────────────────────┐
│ OCP Cluster (flux-system ns)                        │
│   source-controller    → watches Git/OCI/Helm repos │
│   kustomize-controller → applies YAML/Kustomize     │
│   helm-controller      → manages HelmReleases       │
│   notification-ctrl    → sends alerts to Slack etc. │
└─────────────────────────────────────────────────────┘

Multi-tenancy on OCP

When multitenant: true is set, Flux uses true Kubernetes RBAC via impersonation — meaning each tenant’s Kustomization runs under its own service account, scoped to its own namespace. This maps naturally to OCP Projects, where each team or app gets an isolated namespace with its own SCC and RBAC policies.

The pattern looks like this in Git:

# tenants/team-a/kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-apps
  namespace: flux-system
spec:
  serviceAccountName: team-a-reconciler  # impersonates this SA
  targetNamespace: team-a                # deploys into this OCP Project
  path: ./tenants/team-a/apps
  sourceRef:
    kind: GitRepository
    name: fleet-repo

Each team-a-reconciler service account only has permissions within team-a’s namespace — enforced by both RBAC and the namespace’s SCC policies.


Key considerations for OCP + Flux

  • Testing: Flux v2.3 was the first release end-to-end tested on OpenShift.
  • Operator lifecycle: When a subscription is applied, OpenShift’s Operator Lifecycle Manager (OLM) automatically handles upgrading Flux.
  • Enterprise support: Vendors such as ControlPlane provide enterprise support for Flux, including backwards compatibility with older versions of Kubernetes and OpenShift.
  • vs. OpenShift GitOps: Red Hat ships its own GitOps operator (based on Argo CD) as the officially supported option. Flux on OCP is community/third-party supported, preferred by teams who want a more Kubernetes-native, CLI-driven approach.
  • NetworkPolicy: Setting networkPolicy: true in the FluxInstance spec automatically creates NetworkPolicies for the flux-system namespace, restricting controller-to-controller traffic.

OCP (OpenShift Container Platform) Security Best Practices


Identity & Access Control

  • RBAC & Least Privilege: Every user, service account, and process should possess only the absolute minimum permissions needed. Isolate workloads using distinct service accounts, each bound to Roles containing relevant permissions, and avoid attaching sensitive permissions directly to user accounts.
  • Strong Authentication: Implement robust authentication mechanisms such as multi-factor authentication (MFA) or integrate with existing identity management systems to prevent unauthorized access.
  • Audit Regularly: Regularly audit Roles, ClusterRoles, RoleBindings, and SCC usage to ensure they remain aligned with the principle of least privilege and current needs.
  • Avoid kubeadmin: Don’t use the default kubeadmin superuser account in production — integrate with an enterprise identity provider instead.
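A few audit commands worth keeping at hand (namespace, service-account, and resource names are placeholders):

```shell
# Can this service account create pods in this namespace?
oc auth can-i create pods -n my-ns \
  --as=system:serviceaccount:my-ns:my-sa

# Who can delete deployments here?
oc adm policy who-can delete deployments -n my-ns

# Scan cluster-wide bindings for overly broad grants
oc get clusterrolebindings -o wide | grep cluster-admin
```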

Cluster & Node Hardening

  • Use RHCOS for nodes: Run the most recent Red Hat Enterprise Linux CoreOS (RHCOS) on all OCP cluster nodes. RHCOS is designed to be as immutable as possible, and any change to a node must go through the Machine Config Operator, with no direct user access needed.
  • Control plane HA: Configure a minimum of three control-plane nodes so the cluster (and etcd quorum) remains available if a node fails.
  • Network isolation: Strict network isolation prevents unauthorized external ingress to OpenShift cluster API endpoints, nodes, or pod containers. The DNS, Ingress Controller, and API server can be set to private after installation.

Container Image Security

  • Scan images continuously: Use image scanning tools to detect vulnerabilities and malware within container images. Use trusted container images from reputable sources and regularly update them to include the latest security patches.
  • Policy enforcement: Define and enforce security policies for container images, ensuring that only images meeting specific criteria — such as being signed by trusted sources or containing no known vulnerabilities — are deployed.
  • No root containers: OpenShift has stricter security policies than vanilla Kubernetes — running a container as root is forbidden by default.

Security Context Constraints (SCCs)

OpenShift uses Security Context Constraints (SCCs) to give your cluster a strong security baseline. By default, OpenShift prevents cluster containers from accessing protected Linux features such as shared file systems, root access, and certain kernel capabilities such as KILL. Always use the most restrictive SCC that still allows your workload to function.
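A quick sketch of working with SCCs (namespace and SA names are placeholders; nonroot-v2 is one of the built-in SCCs):

```shell
# List available SCCs, and see which SCC a running pod was admitted under
oc get scc
oc get pod <pod> -n <ns> \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}'

# Grant an SCC to a dedicated service account (never to a user directly)
oc adm policy add-scc-to-user nonroot-v2 -z my-sa -n my-ns
```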


Network Security

  • Zero-trust networking: Apply granular access controls between individual pods, namespaces, and services in Kubernetes clusters and external resources, including databases, internal applications, and third-party cloud APIs.
  • Use NetworkPolicies to restrict east-west traffic between namespaces and pods by default.
  • Egress control: Use Egress Gateways or policies to control outbound traffic from pods.
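A minimal default-deny pattern, as a sketch (the namespace name is a placeholder):

```yaml
# Deny all ingress to every pod in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly re-open only what is needed, e.g. same-namespace traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: my-app
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
```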

Compliance & Monitoring

  • Compliance Operator: The OpenShift Compliance Operator supports profiles for standards including PCI-DSS versions 3.2.1 and 4.0, enabling automated compliance scanning across the cluster.
  • Continuous monitoring: Use robust logging and monitoring solutions to gain visibility into container behavior, network flows, and resource utilization. Set up alerts for abnormalities like unusually high memory or CPU usage that could indicate compromise.
  • Track CVEs proactively: Security, bug fix, and enhancement updates for OCP are released as asynchronous errata through the Red Hat Network. Registry images should be scanned upon notification and patched if affected by new vulnerabilities.

Namespace & Project Isolation

Using projects and namespaces simplifies management and enhances security by limiting the potential impact of compromised applications, segregating resources based on application/team/environment, and ensuring users can only access the resources they are authorized to use.


Key tools to leverage: Advanced Cluster Security (ACS/StackRox), Compliance Operator, OpenShift built-in image registry with scanning, and NetworkPolicy/Calico for zero-trust networking.

SCCs (Security Context Constraints) are OpenShift’s pod-level security gate — separate from RBAC. The golden rules are: always start from restricted-v2, never modify built-in SCCs, create custom ones when needed, assign them to dedicated service accounts (not users), and never grant anyuid or privileged to app workloads.

RBAC controls what users and service accounts can do via the API. The key principle is deny-by-default — bind roles to groups rather than individuals, keep bindings namespace-scoped unless cross-namespace is genuinely needed, audit regularly with oc auth can-i and oc policy who-can, and never touch default system ClusterRoles.

Network Policies implement microsegmentation at the pod level. The pattern is always: default-deny first, then explicitly open only what’s needed — ingress from the router, traffic from the same namespace, and specific app-to-app flows. For egress, use EgressNetworkPolicy to whitelist specific CIDRs or domains and block everything else.

All three layers work together: RBAC controls the API plane, SCCs control the node plane, and NetworkPolicies control the network plane. A strong OCP security posture needs all three.

Azure Network Watcher

Azure Network Watcher is Azure’s built-in network monitoring and diagnostics service for IaaS resources. It helps you monitor, troubleshoot, and visualize networking for things like VMs, VNets, load balancers, application gateways, and traffic paths in Azure. It is not meant for PaaS monitoring or web/mobile analytics. (Microsoft Learn)

For interviews, the clean way to explain it is:

“Network Watcher is the tool I use when I need to see how traffic is flowing in Azure, why connectivity is failing, or what route/security rule is affecting a VM. It gives me diagnostics like topology, next hop, IP flow verify, connection troubleshooting, packet capture, and flow logs.” (Microsoft Learn)

The most important features to remember are:

  • Topology: visual map of network resources and relationships. (Microsoft Learn)
  • IP flow verify: checks whether a packet to/from a VM would be allowed or denied by NSG rules. (Microsoft Learn)
  • Next hop: tells you where traffic to a destination IP will go, such as Internet, Virtual Appliance, VNet peering, gateway, or None. Very useful for UDR and routing issues. (Microsoft Learn)
  • Connection troubleshoot / Connection Monitor: tests reachability and latency between endpoints and shows path health over time. (Microsoft Learn)
  • Packet capture: captures packets on a VM or VM scale set for deep troubleshooting. (Microsoft Learn)
  • Flow logs / traffic analytics: records IP traffic flow data and helps analyze traffic patterns. (Microsoft Learn)
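From the CLI, the two most-used checks look roughly like this (resource group, VM name, and IPs are placeholders; Network Watcher must be enabled in the region):

```shell
# Would an outbound packet from this VM be allowed or denied by its NSGs?
az network watcher test-ip-flow \
  --resource-group my-rg --vm my-vm \
  --direction Outbound --protocol TCP \
  --local 10.0.0.4:60000 --remote 10.0.1.4:443

# Where does traffic to that destination actually go (Internet, appliance, peering, None)?
az network watcher show-next-hop \
  --resource-group my-rg --vm my-vm \
  --source-ip 10.0.0.4 --dest-ip 10.0.1.4
```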

A strong interview answer for when to use it:

“I use Network Watcher when a VM cannot reach a private endpoint, an app cannot talk to another subnet, routing seems wrong, NSGs may be blocking traffic, or I need packet-level proof. I usually check NSG/IP Flow Verify first, then Next Hop, then Connection Troubleshoot, and if needed packet capture and flow logs.” That workflow maps directly to the capabilities Microsoft documents. (Microsoft Learn)

A simple example:
If a VM cannot reach a private endpoint, I would check:

  1. DNS resolution for the private endpoint name.
  2. IP flow verify for NSG allow/deny.
  3. Next hop to confirm the route is correct.
  4. Connection troubleshoot / Connection Monitor for end-to-end reachability and latency.
  5. Packet capture if I need proof of SYN drops, resets, or missing responses. (Microsoft Learn)
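The DNS step, which is the most common culprit, can be checked in seconds. A healthy private-endpoint lookup chains through the privatelink zone to a private IP (the storage account name and IP below are placeholders):

```shell
nslookup mystorage.blob.core.windows.net
# Expected shape of a healthy answer:
#   mystorage.blob.core.windows.net  CNAME  mystorage.privatelink.blob.core.windows.net
#   mystorage.privatelink.blob.core.windows.net  A  10.0.2.5   <- private IP
# A public IP here means the Private DNS zone or its VNet link is misconfigured.
```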

One interview caution:
Network Watcher is mainly for Azure IaaS network diagnosis, not your general observability platform for app performance. Azure Monitor is broader, and Network Watcher plugs into that platform for network health and diagnostics. (Microsoft Learn)

Here are clean, interview-ready answers you can memorize and adapt depending on how deep the interviewer goes 👇


30-Second Answer

“Azure Network Watcher is a network diagnostics and monitoring service for Azure IaaS. I use it to troubleshoot connectivity issues between resources like VMs, VNets, and private endpoints. Key tools I rely on are IP Flow Verify to check NSG rules, Next Hop for routing issues, and Connection Troubleshoot for end-to-end connectivity. If needed, I go deeper with packet capture and flow logs.”


1–2 Minute Answer (More Detailed, Still Smooth)

“Azure Network Watcher is a native Azure service that helps monitor, diagnose, and troubleshoot network issues in IaaS environments. It’s especially useful when dealing with VMs, VNets, NSGs, and routing.

For example, if a VM cannot connect to another resource, I follow a structured approach:

  • First, I use IP Flow Verify to confirm whether NSG rules are allowing or denying traffic
  • Then I check Next Hop to validate routing and identify if traffic is going to a firewall, gateway, or nowhere
  • I use Connection Troubleshoot or Connection Monitor to test actual connectivity and latency
  • If the issue is still unclear, I use packet capture to inspect traffic at the packet level
  • And for ongoing visibility, I enable NSG flow logs and traffic analytics

So overall, Network Watcher gives me both real-time troubleshooting and historical visibility into network traffic and behavior.”


Scenario Answer

Scenario: VM cannot connect to a Private Endpoint

“If a VM cannot connect to a private endpoint, I’d troubleshoot using Network Watcher step-by-step:

  1. DNS first – confirm the private endpoint resolves to a private IP
  2. Use IP Flow Verify to check if NSG rules allow traffic
  3. Use Next Hop to confirm routing isn’t misconfigured (like a bad UDR sending traffic to a firewall)
  4. Run Connection Troubleshoot to validate end-to-end connectivity
  5. If needed, run packet capture to see if packets are dropped or not returning

This structured approach helps isolate whether the issue is DNS, NSG, routing, or the application itself.”


Bonus: One-Liner

“Network Watcher is my go-to tool for debugging Azure networking — it helps me validate NSGs, routing, and connectivity quickly.”


This is one of those topics where interviewers are testing if you truly understand Azure networking flow, not just definitions. Let’s make this practical and interview-ready.


Private Endpoint vs Service Endpoint

Private Endpoint (Private Link)

  • Assigns a private IP inside your VNet
  • Traffic stays entirely on Microsoft backbone
  • Uses Private DNS
  • Most secure (no public exposure)

Service Endpoint

  • Extends your VNet identity to Azure services
  • Still uses the public endpoint
  • Secured by VNet-based access rules
  • No private IP assigned to the service

Key Differences

Feature                         Private Endpoint      Service Endpoint
IP Address                      Private IP in VNet    Public IP
Traffic Path                    Fully private         Public endpoint (Azure backbone)
DNS Required                    ✅ Yes (critical)     ❌ No
Security Level                  Highest               Medium
Data Exfiltration Protection    Strong                Limited

Troubleshooting Approach (THIS is what matters)

Scenario 1: Private Endpoint NOT Working

👉 This is where most candidates fail — DNS is the #1 issue.

Step-by-step:

1. DNS Resolution (MOST IMPORTANT)

  • Does the FQDN resolve to a private IP?
  • If not → DNS misconfiguration

👉 Common issue:

  • Missing Private DNS Zone (e.g., privatelink.blob.core.windows.net)
  • VNet not linked to DNS zone

2. NSG Check

  • Use Network Watcher IP Flow Verify
  • Ensure traffic is allowed

3. Routing (UDR / Firewall)

  • Use Next Hop
  • Check if traffic is being forced through a firewall incorrectly

4. Private Endpoint State

  • Approved?
  • Connected?

5. Connection Troubleshoot

  • Validate actual reachability

Scenario 2: Service Endpoint NOT Working

👉 Easier than Private Endpoint, but different failure points.

Step-by-step:

1. Subnet Configuration

  • Is Service Endpoint enabled on the subnet?

2. Resource Firewall

  • Example: Storage Account → “Selected networks”
  • Is your subnet allowed?

3. NSG Rules

  • Still applies → allow outbound

4. Route Table

  • If forced tunneling is enabled → traffic may NOT reach Azure service properly

5. Public Endpoint Access

  • Ensure the service allows public endpoint traffic (since Service Endpoint uses it)

Side-by-Side Troubleshooting Mindset

Problem Area                Private Endpoint     Service Endpoint
DNS                         🔴 Critical          🟢 Not needed
Subnet config               🟡 Minimal           🔴 Must enable endpoint
Firewall rules (resource)   🟢 Private access    🔴 Must allow subnet
Routing issues              🔴 Common            🟡 Sometimes
Complexity                  High                 Medium

🧩 Interview Scenario Answer (Perfect Response)

“If a connection to an Azure service fails, I first determine whether it’s using Private Endpoint or Service Endpoint because the troubleshooting path differs.

  • For Private Endpoint, I start with DNS — ensuring the service resolves to a private IP via Private DNS. Then I check NSGs, routing using Next Hop, and validate connectivity using Network Watcher tools.
  • For Service Endpoint, I verify the subnet has the endpoint enabled, ensure the Azure resource firewall allows that subnet, and confirm routing isn’t forcing traffic through a path that breaks connectivity.

The key difference is that Private Endpoint issues are usually DNS-related, while Service Endpoint issues are typically configuration or access control related.”


Pro Tip

Say this line:

“Private Endpoint failures are usually DNS problems. Service Endpoint failures are usually access configuration problems.”


Here’s a clean mental model and diagram. It ties together DNS → Routing → NSG → Destination in the exact order Azure evaluates traffic.


The Core Flow

DNS → Routing (Next Hop) → NSG (IP Flow Verify) → Destination

That’s your anchor. Every troubleshooting answer should follow this flow.


Visual Memorization Diagram

🧩 End-to-End Flow (Private Endpoint example)

VM → DNS (resolves to private IP?) → Next Hop (route correct?) → NSG (port allowed?) → Private Endpoint (approved?) → Azure service

Step-by-Step Mental Model

1. DNS (FIRST — always)

👉 Question:
“Where is this name resolving to?”

  • Private Endpoint → should resolve to private IP
  • Service Endpoint → resolves to public IP

If DNS is wrong → NOTHING else matters


2. Routing (Next Hop)

👉 Question:
“Where is the traffic going?”

  • Internet?
  • Virtual Appliance (Firewall)?
  • VNet Peering?
  • None (blackhole)?

Use:

  • Network Watcher → Next Hop

🔴 If routing is wrong → traffic never reaches destination


3. NSG (Security Filtering)

👉 Question:
“Is traffic allowed or denied?”

  • Check:
    • Source IP
    • Destination IP
    • Port
    • Protocol

Use:

  • Network Watcher → IP Flow Verify

🔴 If denied → traffic is dropped


4. Destination (Final Check)

👉 Question:
“Is the service itself allowing traffic?”

  • Private Endpoint → connection approved?
  • Service Endpoint → firewall allows subnet?
  • App listening on port?

The Interview Cheat Code

“When debugging Azure networking, I always follow a layered approach: first DNS resolution, then routing using Next Hop, then NSG validation with IP Flow Verify, and finally I check the destination service configuration.”


Example Walkthrough

VM cannot reach Storage Account (Private Endpoint)

👉 You say:

  1. DNS – does it resolve to private IP?
  2. Routing – is traffic going to correct subnet or firewall?
  3. NSG – is port 443 allowed outbound?
  4. Destination – is private endpoint approved?

Ultra-Simple Memory Trick

Think of it like a package delivery 📦:

  • DNS = Address lookup (where am I going?)
  • Routing = Road path (how do I get there?)
  • NSG = Security gate (am I allowed through?)
  • Destination = Door (is it open?)

Bonus

“Azure evaluates routing before NSG for outbound traffic decisions, so even if NSG allows traffic, incorrect routing can still break connectivity.”


AKS – Security Best Practice

For a brand-new microservices project in 2026, security isn’t just a “layer” you add at the end—it’s baked into the infrastructure. AKS has introduced several “secure-by-default” features that simplify this.

Here are the essential security best practices for your new setup:


1. Identity over Secrets (Zero Trust)

In 2026, storing connection strings or client secrets in Kubernetes “Secrets” is considered an anti-pattern.

  • Best Practice: Use Microsoft Entra Workload ID.
  • Why: Instead of your app having a password to access a database, your Pod is assigned a “Managed Identity.” Azure confirms the Pod’s identity via a signed token, granting it access without any static secrets that could be leaked.
  • New in 2026: Enable Conditional Access for Workload Identities to ensure a microservice can only connect to your database if it’s running inside your specific VNet.
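A minimal sketch of the wiring (the client-id, names, and image are placeholders): annotate the service account with the managed identity, and label the pod to opt in so the federated token is projected automatically.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: my-ns
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-ns
  labels:
    azure.workload.identity/use: "true"   # opts the pod into token projection
spec:
  serviceAccountName: my-app
  containers:
    - name: app
      image: myregistry.azurecr.io/my-app:latest
```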

2. Harden the Host (Azure Linux 3.0)

The operating system running your nodes is part of your attack surface.

  • Best Practice: Standardize on Azure Linux 3.0 (CBL-Mariner).
  • Why: It is a “distroless-adjacent” host OS. It contains ~500 packages compared to the thousands in Ubuntu, drastically reducing the number of vulnerabilities (CVEs) you have to patch.
  • Advanced Isolation: For sensitive services (like payment processing), enable Pod Sandboxing. This uses Kata Containers to run the service in a dedicated hardware-isolated micro-VM, preventing “container breakout” attacks where a hacker could jump from your app to the node.

3. Network “Blast Radius” Control

If one microservice is compromised, you don’t want the attacker to move laterally through your entire cluster.

  • Best Practice: Use Cilium for Network Policy.
  • Why: As of 2026, Cilium is the gold standard for AKS networking. It uses eBPF technology to filter traffic at the kernel level.
  • Strategy: Implement a Default Deny policy. By default, no service should be able to talk to any other service unless you explicitly write a rule allowing it.

4. API Server Protection

The Kubernetes API server is the “front door” to your cluster. If someone gets in here, they own everything.

  • Best Practice: Use API Server VNet Integration (Private Clusters).
  • Why: This ensures your cluster’s management endpoint is not reachable from the public internet. It exists only inside your private network.
  • Access Control: Use Microsoft Entra RBAC (Role-Based Access Control). Never use the “cluster-admin” local account. Link permissions to your team’s Entra ID groups so that when an employee leaves the company, their cluster access is revoked instantly.

5. Continuous Supply Chain Security

Security starts before the code even reaches AKS.

  • Best Practice: Enable Defender for Cloud and Binary Authorization.
  • Why: Defender for Cloud scans your images in the Azure Container Registry (ACR) for malware and secrets.
  • Enforcement: Use Azure Policy for Kubernetes to block any deployment that hasn’t been scanned or isn’t coming from your trusted registry.

Summary Security Checklist

Area          2026 Standard
Identity      Microsoft Entra Workload ID (No Secrets)
OS            Azure Linux 3.0 with OS Guard
Network       Cilium with mTLS (Service Mesh)
Access        Private Cluster + Entra RBAC
Governance    Azure Policy “Restricted” Baseline

Pro-Tip: Check your Secure Score in Microsoft Defender for Cloud weekly. It will give you a prioritized list of “quick fixes” for your specific AKS cluster based on real-time threats.

With a Service Mesh (specifically the Istio-based add-on for AKS), you are moving toward a “Zero Trust” network architecture. In this setup, the network is no longer trusted by default; every connection must be verified and encrypted.

Here is the 2026 security blueprint for running microservices with Istio on AKS.


1. Automated mTLS (Encryption in Transit)

By default, traffic between Kubernetes Pods is unencrypted. With Istio, you can enforce Strict Mutual TLS (mTLS) without changing a single line of application code.

  • The Best Practice: Apply a PeerAuthentication policy at the namespace level set to STRICT.
  • The Result: Any service that tries to connect via plain text will be instantly rejected by the sidecar proxy. This ensures that even if an attacker gains access to your internal network, they cannot “sniff” sensitive data (like headers or tokens) passing between services.
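The namespace-wide STRICT policy is a very short manifest (the namespace is a placeholder):

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-ns     # applies to every workload in this namespace
spec:
  mtls:
    mode: STRICT       # sidecars reject any plaintext connection
```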

2. Identity-Based Authorization

IP addresses are ephemeral in Kubernetes and shouldn’t be used for security. Istio uses SPIFFE identities based on the service’s Kubernetes Service Account.

  • The Best Practice: Use AuthorizationPolicy to define “Who can talk to Whom.”
  • Example: You can create a rule that says the Email Service can only receive requests from the Orders Service, and only if the request is a POST to the /send-receipt endpoint. Everything else is blocked at the source.
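That example rule sketches out roughly like this (namespace, labels, and service-account name are placeholders):

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: email-allow-orders
  namespace: my-ns
spec:
  selector:
    matchLabels:
      app: email                               # applies to the Email Service pods
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/my-ns/sa/orders   # Orders Service SPIFFE identity
      to:
        - operation:
            methods: ["POST"]
            paths: ["/send-receipt"]
```

Because an ALLOW policy on a workload implicitly denies every request that matches no rule, everything other than this one call is blocked.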

3. Secure the “Front Door” (Ingress Gateway)

In 2026, the Kubernetes Gateway API has reached full GA (General Availability) for the AKS Istio add-on.

  • The Best Practice: Use the Gateway and HTTPRoute resources instead of the older Ingress objects.
  • Security Benefit: It allows for better separation of concerns. Your platform team can manage the physical load balancer (the Gateway), while your developers manage the routing rules (HTTPRoute) for their specific microservices.

4. Dapr + Istio: The “Power Couple”

Since you are building microservices, you might also use Dapr for state and messaging. In 2026, these two work together seamlessly but require one key configuration:

  • The Best Practice: If both are present, let Istio handle the mTLS and Observability, and disable mTLS in Dapr.
  • Why: Having two layers of encryption (“double wrapping” packets) adds significant latency and makes debugging network issues a nightmare.

5. Visualizing the “Blast Radius”

The biggest security risk in microservices is lateral movement.

  • The Best Practice: Use the Kiali dashboard (integrated with AKS) to view your service graph in real-time.
  • The Security Win: If you see a weird line of communication between your Public Web Frontend and your Internal Payment Database that shouldn’t exist, you’ve found a security hole or a misconfiguration before it becomes a breach.

Summary Security Checklist for Istio on AKS

| Task | 2026 Recommended Tool |
| --- | --- |
| Transport Security | PeerAuthentication (mode: STRICT) |
| Service Permissions | Istio AuthorizationPolicy |
| External Traffic | Kubernetes Gateway API (Managed Istio Ingress) |
| Egress (Outgoing) | Service Entry (block all traffic to external sites except specific approved domains) |
| Auditing | Azure Monitor for Containers + Istio Access Logs |

Warning for 2026: Ensure your worker nodes have enough “headroom.” Istio sidecars (Envoy proxies) consume roughly 0.5 to 1.0 vCPU and several hundred MBs of RAM per Pod. For a project with many small microservices, this “sidecar tax” can add up quickly.

AKS

At its core, Azure Kubernetes Service (AKS) is Microsoft’s managed version of Kubernetes. It’s designed to take the “scary” parts of managing a container orchestration system—like setting up the brain of the cluster, patching servers, and handling scaling—and offload them to Azure so you can focus on your code.

Think of it as Kubernetes with a personal assistant.


1. How it Works (The Architecture)

AKS splits a cluster into two distinct parts:

  • The Control Plane (Managed by Azure): This is the “brain.” It manages the API server, the scheduler, and the cluster’s state. In AKS, Microsoft manages this for you for free (or for a small fee if you want a guaranteed Uptime SLA). You don’t have to worry about its health or security patching.
  • The Data Plane (Managed by You): These are the “worker nodes” (Virtual Machines) where your applications actually run. While you pay for these VMs, AKS makes it easy to add, remove, or update them with a single click or command.

2. Key Features (2026 Standards)

As of 2026, AKS has evolved into an “AI-ready” platform. Here are the standout features:

  • AKS Automatic: A newer “Zero-Ops” tier where Azure handles almost everything—node configuration, security hardening, and even choosing the right VM sizes based on your app’s needs.
  • Smart Scaling: It uses the Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler to grow or shrink your infrastructure based on real-time traffic, saving you money during quiet hours.
  • AI & GPU Support: Native integration with the latest NVIDIA GPUs (like the NCv6 series) makes it a go-to for training LLMs or running AI inference.
  • Enterprise Security: It integrates directly with Microsoft Entra ID (formerly Azure AD), so you can control who accesses your cluster using the same identity they use for email.

3. Why Use It?

| Benefit | What it means for you |
| --- | --- |
| Reduced Overhead | You don’t have to be a Kubernetes expert to keep the cluster running. |
| Cost Efficiency | You only pay for the worker nodes you use. Features like Spot VMs can cut costs by up to 80% for non-critical tasks. |
| Fast Updates | Upgrading to a new Kubernetes version is a “push-button” experience rather than a manual nightmare. |
| Deep Integration | It plays perfectly with Azure Monitor, Azure Storage, and GitHub Actions for a smooth “code-to-cloud” pipeline. |

4. Common Use Cases

  • Microservices: Breaking down a giant app into smaller, independent pieces.
  • AI/ML Workflows: Running data-heavy training jobs and deploying models.
  • DevOps Pipelines: Using containers to ensure that “it works on my machine” also means “it works in production.”

Pro Tip: If you’re just starting out in 2026, look into Azure Linux 3.0 as your node operating system. It’s a lightweight, secure OS built by Microsoft specifically for AKS that boots faster and has a smaller attack surface than traditional Linux distros.

Starting a brand-new project in 2026 is an exciting time for AKS because the platform has moved from “manual infrastructure” to “intelligent automation.”

Here is the recommended blueprint for launching a fresh project today.


1. The Big Choice: Automatic vs. Standard

The first decision you’ll face in the Azure Portal is which “tier” to use.

| Feature | AKS Automatic (Recommended) | AKS Standard |
| --- | --- | --- |
| Philosophy | “Just run my code.” | “Give me all the knobs.” |
| Management | Azure manages nodes, scaling, and security. | You manage node pools and VM sizes. |
| Best For | New startups, rapid dev, and “Zero-Ops” teams. | Large enterprises with strict custom networking. |
| Security | Hardened by default (Azure Policy, Cilium). | Configurable (you must set the guardrails). |

Advice: For a brand-new project, start with AKS Automatic. It enforces modern best practices (like the Cilium network data plane) out of the box, saving you from “Day 2” configuration headaches.

An AKS Automatic cluster manages these elements for you:

  • Networking and security: Azure CNI Overlay powered by Cilium
  • Resource provisioning: automated node provisioning and scaling
  • On-demand scaling: optimal scaling tools like KEDA, HPA, and VPA
  • Kubernetes version upgrades: automatic updates for enhanced stability

2. Setting Up Your Foundation (The 2026 Stack)

When configuring your new cluster, stick to these current standards:

  • The OS: Choose Azure Linux 3.0. It’s Microsoft’s own cloud-optimized distro. It’s faster and more secure than Ubuntu because it contains only the bare essentials needed to run containers.
  • Networking: Use Azure CNI Overlay. It allows you to scale to thousands of Pods without burning through your private IP address space—a common pitfall in older projects.
  • Identity: Enable Microsoft Entra Workload ID. Never use “Service Principals” or hardcoded secrets to let your app talk to a database. Workload ID gives your Pod its own managed identity.

3. Integrating AI (KAITO)

If your new project involves AI (and most do in 2026), you’ll want to enable the Kubernetes AI Toolchain Operator (KAITO).

  • What it does: It simplifies running Large Language Models (LLMs) like Phi-4 or Falcon on your cluster.
  • Why use it: Instead of manually figuring out which GPU driver matches which model, KAITO automates the provisioning of GPU nodes and sets up the inference endpoint for you.

4. Cost Control for New Projects

Don’t let a “Hello World” app cost you hundreds.

  1. Use the “Dev/Test” Preset: This scales down redundancy to save money while you’re still building.
  2. Spot Instances: For background tasks or non-critical processing, use Spot node pools to get up to 80-90% off standard VM pricing.
  3. Vertical Pod Autoscaler (VPA): In 2026, the VPA is much smarter. It can watch your app and suggest (or automatically apply) the exact amount of CPU and RAM needed, so you aren’t paying for “idle” capacity.

5. Your First Steps

To get moving, I recommend the following workflow:

  1. Install the Azure CLI (ensure it’s version 2.82+ for 2026 features).
  2. Create an ACR (Azure Container Registry): This is where you’ll store your app images.
  3. Deploy via “Automated Deployments”: Use the “Automated Deployments” tab in the AKS portal. You point it at your GitHub repo, and it will generate the GitHub Actions and Kubernetes YAML files for you.

Building a microservices architecture on AKS is a “textbook” use case, but in 2026, the strategy has shifted from just “managing containers” to “managing a distributed system.”

Since each service in your project will have its own lifecycle, database, and scaling needs, here is how you should structure your new AKS environment.


1. The Connectivity Layer: Service Mesh

In a microservices setup, the network becomes your biggest challenge. You shouldn’t have your services talking to each other using hardcoded internal URLs.

  • The 2026 Choice: Istio-based Service Mesh add-on for AKS.
  • Why: It provides “Mutual TLS” (mTLS) by default, meaning all communication between your microservices is encrypted automatically. It also gives you a “Service Graph,” allowing you to visualize exactly how traffic flows from Service A to Service B.

2. Global Traffic Routing (Ingress)

You need a “front door” to route external users to the correct microservice (e.g., api.myapp.com/orders goes to the Order Service).

  • Application Gateway for Containers (ALB): This is the modern evolution of the standard Ingress Controller. It’s a managed service that sits outside your cluster, handling SSL termination and Web Application Firewall (WAF) duties so your worker nodes don’t have to waste CPU on security overhead.

3. Data Persistence & State

The golden rule of microservices is one database per service.

  • Don’t run DBs inside AKS: While you can run SQL or MongoDB as a container, it’s a headache to manage.
  • The 2026 Way: Use Azure Cosmos DB or Azure SQL and connect them to your microservices using Service Connector. Service Connector handles the networking and authentication (via Workload ID) automatically, so your code doesn’t need to store connection strings or passwords.

4. Microservices Design Pattern (Dapr)

For a brand-new project, I highly recommend using Dapr (Distributed Application Runtime), which is an integrated extension in AKS.

Dapr provides “building blocks” as sidecars to your code:

  • Pub/Sub: Easily send messages between services (e.g., the “Order” service tells the “Email” service to send a receipt).
  • State Management: A simple API to save data without writing complex database drivers.
  • Resiliency: Automatically handles retries if one microservice is temporarily down.

5. Observability (The “Where is the Bug?” Problem)

With 10+ microservices, finding an error is like finding a needle in a haystack. You need a unified view.

  • Managed Prometheus & Grafana: AKS has a “one-click” onboarding for these. Prometheus collects metrics (CPU/RAM/Request counts), and Grafana gives you the dashboard.
  • Application Insights: Use this for “Distributed Tracing.” It allows you to follow a single user’s request as it travels through five different microservices, showing you exactly where it slowed down or failed.

Summary Checklist for Your New Project

  1. Cluster: Create an AKS Automatic cluster with the Azure Linux 3.0 OS.
  2. Identity: Use Workload ID instead of secrets.
  3. Communication: Enable the Istio add-on and Dapr extension.
  4. Database: Use Cosmos DB for high-scale microservices.
  5. CI/CD: Use GitHub Actions with the “Draft” tool to generate your Dockerfiles and manifests automatically.

Azure Storage

Azure Storage is a highly durable, scalable, and secure cloud storage solution. In 2026, it has evolved significantly into an AI-ready foundational layer, optimized not just for simple files, but for the massive datasets required for training AI models and serving AI agents.

The platform is divided into several specialized “data services” depending on the type of data you are storing.


1. The Core Data Services

| Service | Data Type | Best For |
| --- | --- | --- |
| Blob Storage | Unstructured (Objects) | Images, videos, backups, and AI training data lakes. |
| Azure Files | File Shares (SMB/NFS) | Replacing on-premise file servers; “Lift and Shift” for legacy apps. |
| Azure Disks | Block Storage | Persistent storage for Virtual Machines (OS and data disks). |
| Azure Tables | NoSQL Key-Value | Large scale, schema-less structured data (e.g., user profiles). |
| Azure Queues | Messaging | Reliable messaging between different parts of an application. |

2. Modern Tiers (Cost vs. Speed)

You don’t pay the same price for data you use every second versus data you keep for 10 years. You choose an Access Tier to optimize your bill:

  • Premium: SSD-backed. Ultra-low latency for high-performance apps and AI inference.
  • Hot: For data you access frequently. Lower access cost, higher storage cost.
  • Cool (30 days): For data like short-term backups. Lower storage cost, higher access cost.
  • Cold (90 days): New in the 2020s—optimized for “infrequent” but immediate access.
  • Archive (180 days): Lowest cost, but data is “offline.” Rehydrating it takes hours.
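
The tier choice boils down to how often you read the data and whether you can wait for rehydration. A minimal decision sketch, where the thresholds are the minimum-retention windows listed above (real pricing and early-deletion fees vary by region):

```python
# Minimal sketch of the access-tier decision implied by the list above.
# The 30/90/180-day thresholds match the minimum-retention windows.
def pick_blob_tier(days_between_reads: float, needs_instant_access: bool) -> str:
    if days_between_reads < 30:
        return "Hot"
    if days_between_reads < 90:
        return "Cool"       # 30-day minimum retention
    if needs_instant_access:
        return "Cold"       # 90-day minimum retention, still "online"
    return "Archive"        # 180-day minimum retention, hours to rehydrate

print(pick_blob_tier(7, True))     # read weekly -> Hot
print(pick_blob_tier(200, False))  # read rarely, latency-tolerant -> Archive
```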

3. Redundancy: Protecting Your Data

Azure keeps multiple copies of your data to protect against hardware failure or entire data center disasters.

  • LRS (Locally Redundant): 3 copies in a single data center. Protects against disk failure.
  • ZRS (Zone-Redundant): 3 copies across 3 different data centers in one region. Protects against a data center fire/outage.
  • GRS (Geo-Redundant): Copies data to a secondary region hundreds of miles away. Protects against a major regional disaster.

🚀 What’s New in 2026?

  • Agentic Scale: Blob storage can now handle millions of small objects per second to support AI Agents that need to “remember” long-term context or retrieve data from RAG (Retrieval-Augmented Generation) stores.
  • Elastic SAN: A fully managed Storage Area Network (SAN) service that lets you pool block storage for different workloads, making it easier to migrate heavy SQL or Oracle databases to the cloud.
  • AI Integration: Native “hooks” for frameworks like LangChain and Microsoft Foundry, allowing AI models to read directly from your storage with minimal setup.

Choosing the right storage in 2026 depends on two main factors: the structure of your data and how your application (or AI agent) needs to access it.

Here is the 2026 decision framework to help you choose the right tool for the job.


🚦 The “Quick Decision” Tree

| If your project needs to… | The Winner is… |
| --- | --- |
| Store millions of files for AI training or Data Lakes. | Blob Storage (Data Lake Gen2) |
| Replace an on-premise file server (SMB/NFS). | Azure Files |
| Provide high-speed block storage for Virtual Machines. | Managed Disks |
| Pool storage across many VMs/Containers like a Cloud SAN. | Elastic SAN |
| Send messages between different microservices. | Queue Storage |
| Store simple Key-Value data (User profiles, logs). | Table Storage |

🟦 1. Blob Storage: The AI & Big Data King

In 2026, Blob storage is no longer just for “backups.” It is the central engine for Agentic Scale—supporting AI agents that need to read massive amounts of context quickly.

  • Best For: Unstructured data (PDFs, Images, Parquet files).
  • Key Feature: Data Lake Storage Gen2. This adds a “Hierarchical Namespace” (real folders) to your blobs, which makes big data analytics and AI processing 10x faster.
  • 2026 Strategy: Use Cold Tier for data you only touch once a quarter but need available instantly for AI “Reasoning” tasks.

📂 2. Azure Files: The “Lift-and-Shift” Hero

If you have an existing application that expects a “Drive Letter” (like Z:\), use Azure Files.

  • Best For: Shared folders across multiple VMs or local office computers.
  • New in 2026: Elastic ZRS (Zone Redundant Storage). This provides ultra-high availability for mission-critical file shares without the complexity of managing your own cluster.
  • Performance: Use Premium Files if you are running active databases or high-transaction apps; use Standard for simple office document sharing.

💽 3. Managed Disks vs. Elastic SAN

This is the “local” vs “network” storage debate for your servers.

  • Managed Disks (The Individual): Use Premium SSD v2. It’s the modern standard because it allows you to scale IOPS and Throughput separately, so you don’t have to buy a “huge disk” just to get “high speed.”
  • Elastic SAN (The Pool): If you are migrating a massive environment from an on-premise SAN (like Dell EMC or NetApp), Elastic SAN lets you buy one large “pool” of performance and share it across all your VMs and Kubernetes clusters.

🔍 4. Specialized: Tables & Queues

These are “developer” storage types.

  • Azure Tables: Use this if Cosmos DB is too expensive for your needs. It’s a “no-frills” NoSQL database for billions of small, structured rows.
  • Azure Queues: Use this to decouple your app. If a user uploads a photo, put a message in the Queue. A “Worker” then sees that message and processes the photo. This prevents your app from crashing under heavy load.
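
The decoupling pattern is easiest to see in code. This sketch uses Python’s stdlib queue to stand in for Azure Queue Storage (a real implementation would use the azure-storage-queue SDK); the point is that the front end only enqueues and never blocks on the slow work:

```python
import queue
import threading

# The queue stands in for Azure Queue Storage in this local simulation.
jobs: "queue.Queue[str]" = queue.Queue()
processed = []

def worker():
    # The "worker" role: pulls messages and does the slow processing.
    while True:
        photo = jobs.get()      # blocks until a message arrives
        if photo is None:       # sentinel: shut down
            break
        processed.append(f"thumbnail:{photo}")
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# The web front end just enqueues and returns immediately; a traffic
# spike only grows the queue instead of crashing the app.
for name in ["beach.jpg", "dog.jpg"]:
    jobs.put(name)

jobs.join()     # wait until the worker has drained the queue
jobs.put(None)  # tell the worker to stop
t.join()
print(processed)
```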

💡 Pro-Tip for 2026: The “Cost First” Choice

If you are still unsure, start with Blob Storage. It is the most flexible, has the best API support for modern AI frameworks (like LangChain), and offers the most aggressive cost-saving tiers (Cool, Cold, and Archive).


Cost optimization in Azure is no longer just about “turning things off.” In 2026, it is a continuous lifecycle known as FinOps, focusing on three distinct phases: Inform (Visibility), Optimize (Rightsizing & Rates), and Operate (Governance).

Here is the strategic blueprint for optimizing your Azure spend today.


1. Inform: Get Full Visibility

You cannot optimize what you cannot see.

  • Tagging Enforcement: Use Azure Policy to require tags like Environment, Owner, and CostCenter. This allows you to group costs by department or project in Azure Cost Management.
  • Budget Alerts: Set thresholds at 50%, 80%, and 100% of your predicted monthly spend.
  • Azure Advisor Score: Check your “Cost Score” in Azure Advisor. It provides a “to-do list” of unused resources, such as unattached Managed Disks or idle ExpressRoute circuits.

2. Optimize: The Two-Pronged Approach

Optimization is divided into Usage (buying less) and Rate (paying less for what you use).

A. Usage Optimization (Rightsizing)

  • Shut Down Idle Resources: Azure Advisor flags VMs with <3% CPU usage. For Dev/Test environments, use Auto-shutdown or Azure Automation to turn VMs off at 7:00 PM and on at 7:00 AM.
  • Storage Tiering: Move data that hasn’t been touched in 30 days to the Cool tier, and data older than 180 days to the Archive tier. This can save up to 90% on storage costs.
  • B-Series VMs: For workloads with low average CPU but occasional spikes (like small web servers), use the B-Series (Burstable) instances to save significantly.

B. Rate Optimization (Commitment Discounts)

In 2026, you choose your discount based on how much flexibility you need.

| Discount Type | Savings | Best For… |
| --- | --- | --- |
| Reserved Instances (RI) | Up to 72% | Static workloads. You commit to a specific VM type in a specific region for 1 or 3 years. |
| Savings Plan for Compute | Up to 65% | Dynamic workloads. A flexible $/hour commitment that applies across VM families and regions. |
| Azure Hybrid Benefit | Up to 85% | Using your existing Windows/SQL licenses in the cloud so you don’t pay for them twice. |
| Spot Instances | Up to 90% | Interruptible workloads like batch processing or AI model training. |
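
To see what those headline percentages mean in dollars, here is a back-of-the-envelope comparison. The $300/month pay-as-you-go base rate is a made-up example figure; actual discounts depend on region, SKU, and commitment term:

```python
# Hypothetical pay-as-you-go cost of one VM, USD/month (illustrative only).
PAYG_MONTHLY = 300.00

# Maximum advertised discounts from the table above.
max_discount = {
    "Reserved Instance (3yr)": 0.72,
    "Savings Plan for Compute": 0.65,
    "Azure Hybrid Benefit": 0.85,
    "Spot Instance": 0.90,
}

for option, discount in max_discount.items():
    # Effective monthly cost after applying the best-case discount.
    print(f"{option}: ${PAYG_MONTHLY * (1 - discount):.2f}/month")
```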

3. Operate: Modern 2026 Strategies

  • AI Cost Governance: With the rise of Generative AI, monitor your Azure OpenAI and AI Agent token usage. Use Rate Limiting on your APIs to prevent a runaway AI bot from draining your budget in a single night.
  • FinOps Automation: Use Azure Resource Graph to find “orphaned” resources (like Public IPs not attached to anything) and delete them automatically via Logic Apps.
  • Sustainability & Carbon Optimization: Use the Azure Carbon Optimization tool. Often, the most “green” resource (lowest carbon footprint) is also the most cost-efficient one.

✅ The “Quick Wins” Checklist

  1. [ ] Delete Unattached Disks: When you delete a VM, the disk often stays behind and keeps billing you.
  2. [ ] Switch to Savings Plans: If your RIs are expiring, move to a Savings Plan for easier management.
  3. [ ] Check for “Zombies”: Idle Load Balancers, VPN Gateways, and App Service Plans with zero apps.
  4. [ ] Rightsize your SQL: Switch from “DTU” to the vCore model for more granular scaling and Hybrid Benefit savings.

Pro Tip: Never buy a Reserved Instance (RI) for a server that hasn’t been rightsized first. If you buy a 3-year reservation for an oversized 16-core VM, you are “locking in” waste for 36 months!

To find the “low-hanging fruit” in your Azure environment, you can use Azure Resource Graph Explorer and Log Analytics.

Here are the specific KQL (Kusto Query Language) scripts to identify common waste areas.


1. Identify Orphaned Resources (Quickest Savings)

These resources are costing you money every hour but aren’t attached to anything. Run these in the Azure Resource Graph Explorer.

A. Unattached Managed Disks

Code snippet

Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where managedBy == "" and diskState == "Unattached"
| project name, resourceGroup, subscriptionId, location, diskSizeGB = tolong(properties.diskSizeGB)
| order by diskSizeGB desc

B. Unattached Public IPs

Code snippet

Resources
| where type == "microsoft.network/publicipaddresses"
| where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
| project name, resourceGroup, subscriptionId, location, ipAddress = tostring(properties.ipAddress)

2. Identify Underutilized VMs (Rightsizing)

To run this, your VMs must be sending performance data to a Log Analytics Workspace. Use this to find VMs that are consistently running below 5% CPU.

KQL for Underutilized VMs (Last 7 Days):

Code snippet

Perf
| where TimeGenerated > ago(7d)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer, _ResourceId
| where AvgCPU < 5
| order by AvgCPU asc

  • Action: If MaxCPU is also low, consider “Downsizing” the VM (e.g., from a D4 to a D2) or switching it to a B-series.
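
That triage rule can be captured in a few lines. The 5% and 20% thresholds below are illustrative, not official guidance:

```python
# Sketch of the rightsizing triage: average CPU says the VM is idle,
# max CPU says whether it ever bursts. Thresholds are illustrative.
def rightsize_action(avg_cpu: float, max_cpu: float) -> str:
    if avg_cpu >= 5:
        return "leave as-is"
    if max_cpu < 20:
        return "downsize (e.g. D4 -> D2)"
    return "switch to B-series (idle on average, but bursts)"

print(rightsize_action(2.1, 11.0))  # idle and never bursts -> downsize
print(rightsize_action(3.0, 85.0))  # idle on average, big spikes -> B-series
```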

3. Find Idle App Service Plans

App Service Plans cost money even if they have zero apps running on them. Run this in Resource Graph Explorer.

Code snippet

resources
| where type =~ "microsoft.web/serverfarms"
| where properties.numberOfSites == 0
| project name, resourceGroup, subscriptionId, Sku = sku.name, Tier = sku.tier

4. Search for “Zombie” Network Interfaces

These don’t cost money directly, but they clutter your environment and use up IP addresses in your subnets.

Code snippet

Resources
| where type =~ 'microsoft.network/networkinterfaces'
| where isnull(properties.virtualMachine)
| project name, resourceGroup, subscriptionId, location

💡 How to Automate This in 2026

Instead of running these manually, use Azure Workbooks.

  1. Search for “Workbooks” in the Azure Portal.
  2. Click Add > New.
  3. Add a “Query” cell and paste any of the KQL scripts above.
  4. Save the Workbook as “Monthly Cost Cleanup.”

Now, you can open this dashboard once a month and see exactly what needs to be deleted!

Setting up an automated alert for “Unattached Disks” is a brilliant move for cost governance. In Azure, this is handled by monitoring the Activity Log for a specific event: the “Delete Virtual Machine” action (which leaves the disk behind) or the “Detach Disk” action.

Here is the 2026 step-by-step guide to setting this up.


Step 1: Create an Action Group (The “Who” to notify)

Before you create the alert, you need to tell Azure how to contact you.

  1. Search for Monitor in the Azure Portal.
  2. Click Alerts > Action groups > + Create.
  3. Basics: Give it a name like CostAlertTeam.
  4. Notifications: Select Email/SMS message/Push/Voice.
  5. Enter your email address and name the notification EmailDevOps.
  6. Click Review + create.

Step 2: Create the Activity Log Alert (The “When”)

Now, we create the trigger that watches for disks being left alone.

  1. In Monitor, click Alerts > + Create > Alert rule.
  2. Scope: Select your Subscription.
  3. Condition: This is the most important part. Click + Add condition and search for:
    • Signal Name: Detach Disk (Microsoft.Compute/disks)
    • Alternative: You can also alert on Delete Virtual Machine, but “Detach Disk” is more accurate for catching orphaned resources.
  4. Refine the Logic: Under “Event initiated by,” you can leave it as “Any” or specify a specific automation service principal if you only want to catch manual detaches.

Step 3: Connect and Save

  1. Actions: Click Select action groups and choose the CostAlertTeam group you created in Step 1.
  2. Details: Name the rule Alert-Disk-Unattached.
  3. Severity: Set it to Informational (Sev 4) or Warning (Sev 3).
  4. Click Review + create.

💡 The “Pro” Way (2026 Strategy): Use Log Analytics

The method above tells you when a disk is detached, but it won’t tell you about disks that are already unattached. To catch those, use a Log Search Alert with a KQL query.

The KQL Query:

Code snippet

// Run this every 24 hours to find any disk with an empty "managedBy" state.
// From a Log Analytics workspace, Azure Resource Graph tables are reached
// through the arg("") cross-service scope.
arg("").resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where managedBy == "" and diskState == "Unattached"
| project name, resourceGroup, subscriptionId

Why this is better:

  • Activity Log Alerts are “reactive” (they fire only at the moment of the event).
  • Log Search Alerts are “proactive” (they scan your environment every morning and email you a list of every unattached disk, even if it was detached months ago).

✅ Summary of the Workflow

  1. Detach/Delete Event happens in your subscription.
  2. Activity Log captures the event.
  3. Azure Monitor sees the event matches your rule.
  4. Action Group sends you an email immediately.

While an immediate alert is great for a “fire-drill” response, a Weekly Summary Report is the gold standard for long-term cost governance. It keeps your inbox clean and ensures your team stays accountable for “disk hygiene.”

In 2026, the best way to do this without writing custom code is using Azure Logic Apps.


🛠️ The Architecture: “The Monday Morning Cleanup”

We will build a simple 3-step workflow that runs every Monday at 9:00 AM, queries for unattached disks, and sends you a formatted HTML table.

Step 1: Create the Logic App (Recurrence)

  1. Search for Logic Apps and create a new one (select Consumption plan for lowest cost).
  2. Open the Logic App Designer and select the Recurrence trigger.
  3. Set it to:
    • Interval: 1
    • Frequency: Week
    • On these days: Monday
    • At these hours: 9

Step 2: Run the KQL Query

  1. Add a new step and search for the Azure Monitor Logs connector.
  2. Select the action: Run query and visualize results.
  3. Configure the connection:
    • Subscription/Resource Group: Select your primary management group.
    • Resource Type: Log Analytics Workspace.
  4. The Query: Paste the “Orphaned Disk” query from earlier:

Code snippet

Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where managedBy == "" and diskState == "Unattached"
| project DiskName = name, ResourceGroup = resourceGroup, SizeGB = properties.diskSizeGB, Location = location

  5. Chart Type: Select HTML Table.

Step 3: Send the Email

  1. Add a final step: Office 365 Outlook – Send an email (V2).
  2. To: Your team’s email.
  3. Subject: ⚠️ Weekly Action: Unattached Azure Disks found
  4. Body:
    • Type some text like: “The following disks are currently unattached and costing money. Please delete them if they are no longer needed.”
    • From the Dynamic Content list, select Body (this is the HTML table produced by the query step in Step 2).

📊 Why this is the “Pro” Move

  • Zero Maintenance: Once it’s running, you never have to check the portal manually.
  • Low Cost: A Logic App running once a week costs roughly $0.02 per month.
  • Formatted for Humans: Instead of a raw JSON blob, you get a clean table that you can forward to project owners.

✅ Bonus: Add a “Delete” Link

If you want to be a 2026 power user, you can modify the KQL to include a “Deep Link” directly to each disk in the Azure Portal:

Code snippet

| extend PortalLink = strcat("https://portal.azure.com/#@yourtenant.onmicrosoft.com/resource", id)
| project DiskName, SizeGB, PortalLink

Now, you can click the link in your email and delete the disk in seconds.

Combining the different “zombie” resources into one report is the most efficient way to manage your Azure hygiene.

By using the union operator in KQL, we can create a single list of various resource types that are currently costing you money without providing value.


1. The “Ultimate Zombie” KQL Query

Copy and paste this query into your Logic App or Azure Resource Graph Explorer. It looks for unattached disks, unassociated IPs, and empty App Service Plans all at once.

Code snippet

// Query for Orphaned Disks
Resources
| where type has "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where managedBy == "" and diskState == "Unattached"
| project Name = name, Type = "Orphaned Disk", Detail = strcat(tostring(properties.diskSizeGB), " GB"), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
| union (
// Query for Unassociated Public IPs
Resources
| where type == "microsoft.network/publicipaddresses"
| where isnull(properties.ipConfiguration) and isnull(properties.natGateway)
| project Name = name, Type = "Unattached IP", Detail = tostring(properties.ipAddress), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| union (
// Query for Empty App Service Plans (Costly!)
resources
| where type =~ "microsoft.web/serverfarms"
| where properties.numberOfSites == 0
| project Name = name, Type = "Empty App Service Plan", Detail = strcat(tostring(sku.tier), " - ", tostring(sku.name)), ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| union (
// Query for Idle Load Balancers (No Backend Pool members)
resources
| where type == "microsoft.network/loadbalancers"
| where coalesce(array_length(properties.backendAddressPools), 0) == 0
| project Name = name, Type = "Idle Load Balancer", Detail = "No Backend Pools", ResourceGroup = resourceGroup, SubscriptionId = subscriptionId
)
| order by Type asc

2. Updating Your Logic App Report

To make this work in your weekly email:

  1. Open your Logic App and update the “Run query” step with the new combined KQL above.
  2. Update the HTML Table: Since the new query uses consistent column names (Name, Type, Detail), your HTML table will now neatly list the different types of waste side-by-side.

3. Advanced 2026 Tip: Add “Potential Savings”

If you want to get your manager’s attention, you can add a “Estimated Monthly Waste” column. While KQL doesn’t know your exact billing, you can hardcode estimates:

Code snippet

| extend MonthlyWaste = case(
Type == "Orphaned Disk", 5.00, // Estimate $5 per month
Type == "Unattached IP", 4.00, // Estimate $4 per month
Type == "Empty App Service Plan", 50.00, // Estimate $50+ for Standard+
0.00)
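
A Python mirror of that case() mapping is handy for sanity-checking the totals your report produces; the dollar figures are the same rough estimates:

```python
# Rough per-resource monthly waste estimates, mirroring the KQL case() above.
WASTE_ESTIMATES = {
    "Orphaned Disk": 5.00,
    "Unattached IP": 4.00,
    "Empty App Service Plan": 50.00,
}

def monthly_waste(zombies: list) -> float:
    # Unknown resource types count as $0, matching the KQL default branch.
    return sum(WASTE_ESTIMATES.get(z, 0.00) for z in zombies)

print(monthly_waste(["Orphaned Disk", "Orphaned Disk", "Empty App Service Plan"]))
```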

✅ Your “Monday Morning” Checklist

When you receive this email every Monday, follow this triage:

  • Disks: Delete immediately unless you specifically kept it as a “manual backup” (though you should use Azure Backup for that).
  • Public IPs: Delete. Unused Public IPs are charged by the hour in Azure.
  • App Service Plans: If you aren’t using them, scale them to the Free (F1) tier or delete them. These are often the biggest hidden costs.

To turn this report into a powerful leadership tool, we need to calculate the “Total Potential Monthly Savings.” This changes the conversation from “We have a few loose disks” to “We can save $800/month by clicking these buttons.”

Here is how to update your Logic App and KQL to include a summary total.


1. Updated “Master Zombie” Query (With Estimated Costs)

We will add a hidden cost value to every “zombie” found, then summarize the total at the very end.

Code snippet

let RawData = Resources
| where type =~ "microsoft.compute/disks"
| extend diskState = tostring(properties.diskState)
| where managedBy == "" and diskState == "Unattached"
| project Name = name, Type = "Orphaned Disk", Detail = strcat(properties.diskSizeGB, " GB"), MonthlyWaste = 10.00, ResourceGroup = resourceGroup
| union (
    Resources
    | where type =~ "microsoft.network/publicipaddresses"
    | where properties.ipConfiguration == "" and properties.natGateway == ""
    | project Name = name, Type = "Unattached IP", Detail = tostring(properties.ipAddress), MonthlyWaste = 4.00, ResourceGroup = resourceGroup
)
| union (
    Resources
    | where type =~ "microsoft.web/serverfarms"
    | where properties.numberOfSites == 0
    | project Name = name, Type = "Empty App Service Plan", Detail = strcat(sku.tier, " - ", sku.name), MonthlyWaste = 55.00, ResourceGroup = resourceGroup
);
// This part creates the final list
RawData
| order by Type asc
| union (
    RawData
    | summarize MonthlyWaste = sum(MonthlyWaste)
    | extend Name = "TOTAL POTENTIAL SAVINGS", Type = "---", Detail = "---", ResourceGroup = "---"
)
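To sanity-check what the final summarize row should contain, the same arithmetic can be reproduced offline. A minimal Python sketch over illustrative rows (sample data, not real resources):

```python
# Mirror the query's logic: sum MonthlyWaste over every zombie row,
# then append the TOTAL row that the final union adds.
rows = [  # illustrative sample data
    {"Name": "disk-a", "Type": "Orphaned Disk", "MonthlyWaste": 10.00},
    {"Name": "pip-b", "Type": "Unattached IP", "MonthlyWaste": 4.00},
    {"Name": "plan-c", "Type": "Empty App Service Plan", "MonthlyWaste": 55.00},
]

total = sum(r["MonthlyWaste"] for r in rows)
rows.append({"Name": "TOTAL POTENTIAL SAVINGS", "Type": "---", "MonthlyWaste": total})
print(f"Estimated monthly waste: ${total:.2f}")  # → Estimated monthly waste: $69.00
```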

2. Formatting the Logic App Email

Since KQL doesn’t easily format currency, we’ll use the Logic App “Compose” action to make the final total stand out in your email.

  1. Run the Query: Use the Run query and list results action in Logic Apps with the KQL above.
  2. Add a “Compose” Step: Between the Query and the Email, add a Data Operations - Compose action and rename it Total_Waste_Sum so the email template can reference outputs('Total_Waste_Sum').
  3. The HTML Body: Use this template in the email body to make it look professional:

HTML

<h3>Azure Monthly Hygiene Report</h3>
<p>The following resources are identified as waste.
Cleaning these up will result in the estimated savings below.</p>
@{body('Create_HTML_table')}
<br>
<div style="background-color: #e1f5fe; padding: 15px; border-radius: 5px; border: 1px solid #01579b;">
<strong>💡 Quick Win Tip:</strong> Deleting these resources today
will save your department approx <strong>$@{outputs('Total_Waste_Sum')}</strong> per month.
</div>

3. Why This Works in 2026

  • The “Nudge” Effect: By showing the total dollar amount at the bottom, you create a psychological incentive for resource owners to clean up.
  • Customizable Pricing: You can adjust the MonthlyWaste numbers in the KQL to match your specific Enterprise Agreement (EA) pricing.
  • Single Pane of Glass: You now have one query that covers Compute, Network, and Web services.

✅ Final Triage Steps

  • Review: If you see a “TOTAL POTENTIAL SAVINGS” of $0.00, congratulations! Your environment is clean.
  • Action: For the “Empty App Service Plans,” check if they are in a Free (F1) or Shared (D1) tier first—those don’t cost money, but they will still show up as “Empty.”

Azure 3-tier app: enterprise landing zone version

Redraw-from-memory diagram

                              Users / Internet
                                     |
                           Azure Front Door + WAF
                                     |
                     =====================================
                     |                                  |
                  Region A                           Region B
                  Primary                            Secondary
                     |                                  |
               App Gateway/WAF                    App Gateway/WAF
                     |                                  |
          -------------------------         -------------------------
          |       Spoke: App      |         |       Spoke: App      |
          | Web / API / AKS       |         | Web / API / AKS       |
          | Managed Identity      |         | Managed Identity      |
          -------------------------         -------------------------
                     |                                  |
          -------------------------         -------------------------
          |      Spoke: Data      |         |      Spoke: Data      |
          | SQL / Storage / KV    |         | SQL / Storage / KV    |
          | Private Endpoints     |         | Private Endpoints     |
          -------------------------         -------------------------

                  \_________________ Hub VNet __________________/
                   Firewall | Bastion | Private DNS | Resolver
                   Monitoring | Shared Services | Connectivity

          On-prem / Branches
                 |
        ExpressRoute / VPN
                 |
        Global connectivity to hubs / spokes



What makes this an Azure Landing Zone design

Azure landing zones are the platform foundation for subscriptions, identity, networking, governance, security, and platform automation. Microsoft’s landing zone guidance explicitly frames these as design areas, not just one network diagram. (Microsoft Learn)

So in an interview, say this first:

“This isn’t just a 3-tier app. I’m placing the app inside an enterprise landing zone, where networking, identity, governance, and shared services are standardized at the platform layer.” (Microsoft Learn)

How to explain the architecture

Traffic enters through Azure Front Door with WAF, which is the global entry point and can distribute requests across multiple regional deployments for higher availability. Microsoft’s guidance calls out Front Door as the global load balancer in multiregion designs. (Microsoft Learn)

Each region has its own application stamp in a spoke VNet. The app tier runs in the spoke, stays mostly stateless, and uses Managed Identity to access downstream services securely without storing secrets. The data tier sits behind Private Endpoints, so services like Key Vault, SQL, and Storage are not exposed publicly. A private endpoint gives the service a private IP from the VNet. (Microsoft Learn)

Shared controls live in the hub VNet: Azure Firewall, Bastion, DNS, monitoring, and sometimes DNS Private Resolver for hybrid name resolution. Hub-and-spoke is the standard pattern for centralizing shared network services while isolating workloads in spokes. (Microsoft Learn)

The key enterprise networking points

Use hub-and-spoke so shared controls are centralized and workloads are isolated. Microsoft’s hub-spoke guidance specifically notes shared DNS and cross-premises routing as common hub responsibilities. (Microsoft Learn)

For Private Endpoint DNS, use centralized private DNS zones and link them to every VNet that needs to resolve those names. This is one of the most important details interviewers look for, because private endpoint failures are often DNS failures. (Microsoft Learn)
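Because those failures usually surface as name resolution rather than connectivity errors, a quick programmatic check is to resolve the service’s privatelink name and confirm it returns a private IP. A minimal Python sketch (the storage account hostname in the comment is a hypothetical example):

```python
import ipaddress
import socket

def resolves_to_private_ip(hostname: str) -> bool:
    """True if every address the hostname resolves to is private --
    the expected result inside a VNet linked to the private DNS zone."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False  # no answer at all: the zone link is probably missing
    return all(ipaddress.ip_address(addr[4][0]).is_private for addr in infos)

# Hypothetical account name: inside a linked VNet this should be True;
# from outside the VNet the public IP comes back and it is False.
# resolves_to_private_ip("mystorageacct.privatelink.blob.core.windows.net")
```

Running this from a jumpbox in the spoke versus your laptop is a fast way to localize whether the problem is the zone, the VNet link, or the client’s DNS path.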

For multi-region, either peer regional hubs or use Azure Virtual WAN when the estate is large and needs simpler any-to-any connectivity across regions and on-premises. (Microsoft Learn)

The security posture, in the phrases interviewers listen for:

  • “Only the front door is public.”
  • “App and data tiers stay private.”
  • “Private Endpoints are used for PaaS services.”
  • “Managed Identity removes stored credentials.”
  • “Policies and guardrails are applied at the landing zone level.”
  • “Shared inspection and egress control sit in the hub.”

That lines up with landing zone governance, security, and platform automation guidance. (Microsoft Learn)

2-minute interview answer

“I’d place the 3-tier application inside an Azure landing zone using a hub-and-spoke, multi-region design. Azure Front Door with WAF would be the global ingress layer and route traffic to regional application stacks. In each region, the web and app tiers would live in a spoke VNet, while shared services like Firewall, Bastion, private DNS, and monitoring would live in the hub. The data tier would use services like Azure SQL, Storage, and Key Vault behind Private Endpoints, with centralized private DNS linked to all VNets that need resolution. The application would use Managed Identity for secure access without secrets. For resilience, I’d deploy a secondary region and let Front Door handle failover. For larger estates or more complex connectivity, I’d consider Virtual WAN to simplify cross-region and hybrid networking.” (Microsoft Learn)

Memory trick

Remember it as:

Global edge → Regional spokes → Private data → Shared hub controls

Or even shorter:

Front Door, Spokes, Private Link, Hub

Here’s a one-page Azure interview cheat sheet you can quickly revise before interviews 👇


Azure Architecture Cheat Sheet (Landing Zone + Networking + Identity)


1. Core Architecture

👉 Hub-and-spoke, multi-region, with centralized security and private backend services in Microsoft Azure.


2. Mental Diagram

Internet
|
Front Door (WAF)
|
Region A / Region B
|
Spoke VNet (App)
|
Private Endpoint
|
Data (SQL / Storage / Key Vault)
+ Hub VNet
Firewall | DNS | Bastion

3. Security Principles

  • “Only ingress is public”
  • “Everything else is private”
  • “Use Private Endpoints for PaaS”
  • “Use Managed Identity—no secrets”
  • “Enforce with policies and RBAC via Microsoft Entra ID”

4. Identity (VERY IMPORTANT)

  • Most secure → Managed Identity
  • Types:
    • User
    • Service Principal
    • Managed Identity

👉 Rule:

  • Inside Azure → Managed Identity
  • Outside Azure → Federated Identity / Service Principal

5. Networking (What to Remember)

Private Endpoint

  • Uses private IP
  • Needs Private DNS
  • ❗ Most common issue = DNS

Public Endpoint

  • Needs:
    • NAT Gateway or Public IP
    • Route to internet

👉 Rule:

  • Private = DNS problem
  • Public = Routing problem

6. Troubleshooting Framework

👉 Always say:

“What → When → Who → Why → Fix”

Step | Tool
-----|---------------------------
What | Cost Management / Metrics
When | Logs (Azure Monitor)
Who  | Activity Log
Why  | Correlation
Fix  | Scale / Secure / Block

7. Defender Alert Triage

👉
“100 alerts = 1 root cause”

Steps:

  1. Go to Microsoft Defender for Cloud (not emails)
  2. Group by resource/type
  3. Find pattern (VM? same alert?)
  4. Check:
    • NSG (open ports?)
    • Identity (who triggered?)
  5. Contain + prevent

8. Cost Spike Debug

  1. Cost Management → find resource
  2. Metrics → confirm usage
  3. Activity Log → who created/changed
  4. Check:
    • Autoscale
    • CI/CD
    • Compromise
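Step 3 above (“who created/changed”) can also be driven from the CLI. A hedged sketch that prints, rather than runs, an `az monitor activity-log list` command; the resource group and time window are placeholders:

```shell
# Placeholders -- adjust to the resource group and window under investigation.
RG="rg-example"
START="2026-01-01T00:00:00Z"

# Print the lookup command for review; the caller field answers the "Who" step.
echo "az monitor activity-log list --resource-group $RG --start-time $START --query '[].{who:caller, op:operationName.localizedValue, when:eventTimestamp}' -o table"
```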

9. Resource Graph (Quick Wins)

Use Azure Resource Graph for:

  • Orphaned disks
  • Unused IPs
  • Recent resources
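These quick wins can be run straight from the CLI via the resource-graph extension (`az extension add --name resource-graph`). A sketch that prints the command for review; the query text mirrors the orphaned-disk check used in the weekly report:

```shell
# KQL mirroring the report's orphaned-disk check.
QUERY='Resources | where type =~ "microsoft.compute/disks" | where managedBy == "" | project name, resourceGroup'

# Print for review; run directly once the extension is installed.
echo "az graph query -q \"$QUERY\" --first 100"
```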

10. 3-Tier Design (Quick Version)

WAF → Web → App → Data
Private Endpoints

11. Power Phrases

Say these to stand out:

  • “Zero trust architecture”
  • “Least privilege access”
  • “Identity-first security”
  • “Private over public endpoints”
  • “Centralized governance via landing zone”
  • “Eliminate secrets using Managed Identity”

Final Memory Trick

👉
“Front Door → Spoke → Private Link → Hub → Identity”


30-Second Killer Answer

I design Azure environments using a landing zone with hub-and-spoke networking and multi-region resilience. Traffic enters through Front Door with WAF, workloads run in spoke VNets, and backend services are secured using private endpoints. I use managed identities for authentication to eliminate secrets, and enforce governance through policies and RBAC. This ensures a secure, scalable, and enterprise-ready architecture.