Understanding ARO’s Kubernetes API Operations

Kubernetes API Operations Through the ARO Private Endpoint

Every interaction with an ARO cluster, whether from a human, a tool, or an automated controller, flows through TLS connections to port 6443 on the API server private endpoint. Each client opens its own connections, but they all terminate at that single address. The API server is the centre of gravity for all cluster operations.


Every Operation Is a REST Call

The Kubernetes API server exposes a RESTful HTTP/2 API over TLS. Every tool — kubectl, oc, operators, kubelet — translates its work into one of five HTTP verbs against a resource path:

GET     /api/v1/namespaces/payments/pods          list pods
GET     /api/v1/namespaces/payments/pods/web-1    get single pod
POST    /api/v1/namespaces/payments/pods          create pod
PUT     /api/v1/namespaces/payments/pods/web-1    replace pod
PATCH   /api/v1/namespaces/payments/pods/web-1    partial update
DELETE  /api/v1/namespaces/payments/pods/web-1    delete pod
GET     /api/v1/namespaces/payments/pods?watch=1  watch stream

Every one of these travels as TLS-encrypted HTTP/2 to 10.1.0.8:6443.
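
To make the mapping from resource to URL concrete, here is a minimal Python sketch. The helper name resource_path is hypothetical (it is not part of kubectl or any client library); it only reproduces the path convention shown above, where core resources live under /api/v1 and grouped resources under /apis/<group>/<version>.

```python
from typing import Optional


def resource_path(resource: str, namespace: Optional[str] = None,
                  name: Optional[str] = None, group: str = "",
                  version: str = "v1") -> str:
    """Build the REST path for a Kubernetes resource (illustrative helper)."""
    # The core group ("") is served under /api, every other group under /apis.
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace is not None:
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name is not None:
        parts.append(name)
    return "/".join(parts)


print(resource_path("pods", namespace="payments"))
print(resource_path("pods", namespace="payments", name="web-1"))
print(resource_path("deployments", namespace="payments", group="apps"))
```

The third call shows why grouped resources such as deployments sit under /apis/apps/v1 rather than /api/v1.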


Category 1 — Human CLI Operations (kubectl + oc)

kubectl — standard Kubernetes operations

# Every one of these becomes a REST call through the private endpoint
# LIST pods → GET /api/v1/namespaces/payments/pods
kubectl get pods -n payments
# APPLY deployment → GET, then POST (create) or PATCH (update) under
#   /apis/apps/v1/namespaces/payments/deployments
kubectl apply -f deployment.yaml
# EXEC into pod → POST .../pods/web-1/exec, upgraded to SPDY/WebSocket
kubectl exec -it web-1 -n payments -- /bin/bash
# PORT-FORWARD → POST .../portforward, upgraded to a streaming tunnel
kubectl port-forward svc/my-app 8080:80 -n payments
# LOGS → GET /api/v1/namespaces/payments/pods/web-1/log?follow=true
kubectl logs web-1 -n payments --follow
# WATCH resources → GET with ?watch=1 (long-lived streaming connection)
kubectl get pods -n payments --watch

oc CLI — OpenShift-specific additions

oc wraps kubectl completely and adds calls to OpenShift-specific API groups:

# OpenShift Route → POST /apis/route.openshift.io/v1/namespaces/.../routes
oc expose svc/my-app
# Project (OpenShift namespace wrapper)
# → POST /apis/project.openshift.io/v1/projectrequests
oc new-project my-team
# ImageStream → GET /apis/image.openshift.io/v1/namespaces/.../imagestreams
oc get imagestreams
# BuildConfig → POST /apis/build.openshift.io/v1/namespaces/.../buildconfigs/my-app/instantiate
oc start-build my-app
# DeploymentConfig (legacy OpenShift resource)
# → POST /apis/apps.openshift.io/v1/namespaces/.../deploymentconfigs/my-app/instantiate
oc rollout latest dc/my-app
# SCC inspection → GET /apis/security.openshift.io/v1/securitycontextconstraints
oc get scc

Category 2 — Operators and Controllers

Operators are long-running processes inside the cluster that maintain perpetual watch connections to the API server — the busiest category of API consumers by connection count.

The watch loop — how operators work

// Every operator runs this pattern against the API server
// Connection: persistent HTTP/2 stream to 10.1.0.8:6443
// 1. LIST: get current state (at startup, and again after a "410 Gone")
GET /apis/apps/v1/namespaces/payments/deployments
  → Returns: all deployments + resourceVersion: 48291
// 2. WATCH: subscribe to changes (long-lived streaming response)
GET /apis/apps/v1/namespaces/payments/deployments?watch=1&resourceVersion=48291
  → Server holds the stream open until its request timeout; the client then re-watches
  → Pushes events as they occur:
    {"type":"MODIFIED","object":{"metadata":{"name":"web"},...}}
    {"type":"ADDED","object":{"metadata":{"name":"worker"},...}}
    {"type":"DELETED","object":{"metadata":{"name":"old"},...}}
// 3. RECONCILE: when an event arrives, move actual state toward desired state
PATCH /apis/apps/v1/namespaces/payments/replicasets/web-abc
  → ReplicaSet controller then creates/deletes pods to match desired replicas
// 4. STATUS UPDATE: write observed state back
PATCH /apis/apps/v1/namespaces/payments/deployments/web/status
  → {"observedGeneration": 5, "availableReplicas": 3}
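
A minimal Python sketch of the client side of this loop follows. It is not how any particular operator is implemented: the event lines are invented sample data in the watch wire format shown above, and apply_events is a hypothetical helper that keeps a local cache in step with the stream and remembers the last resourceVersion so a dropped watch can resume from it.

```python
import json

# Sample watch stream: one JSON event per line, as the API server sends it.
# The objects are made up for illustration.
STREAM = [
    '{"type":"ADDED","object":{"metadata":{"name":"worker","resourceVersion":"48292"}}}',
    '{"type":"MODIFIED","object":{"metadata":{"name":"web","resourceVersion":"48293"}}}',
    '{"type":"DELETED","object":{"metadata":{"name":"old","resourceVersion":"48294"}}}',
]


def apply_events(cache, lines, resource_version):
    """Apply watch events to a local cache and return the latest resourceVersion."""
    for line in lines:
        event = json.loads(line)
        meta = event["object"]["metadata"]
        if event["type"] == "DELETED":
            cache.pop(meta["name"], None)
        else:  # ADDED or MODIFIED
            cache[meta["name"]] = event["object"]
        # Remember the position so a dropped watch can resume from here.
        resource_version = meta["resourceVersion"]
    return resource_version


# State returned by the initial LIST (step 1).
cache = {"web": {"metadata": {"name": "web"}}, "old": {"metadata": {"name": "old"}}}
last_seen = apply_events(cache, STREAM, "48291")
print(sorted(cache), last_seen)
```

A real controller would reconcile after each event instead of only updating the cache, and would fall back to a fresh LIST if the server rejects the stored resourceVersion as too old.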

Built-in OpenShift operators that run this loop continuously

Operator                      What it watches                              What it does
openshift-apiserver-operator  apiservers.config.openshift.io               Manages API server config and certs
cluster-version-operator      clusterversions.config.openshift.io          Drives cluster upgrades
machine-config-operator       machineconfigs, machineconfigpools           Applies RHCOS config to nodes
ingress-operator              ingresses.config.openshift.io                Manages router deployments
dns-operator                  dnses.config.openshift.io                    Manages CoreDNS config
network-operator              networks.config.openshift.io                 Manages OVN-Kubernetes
image-registry-operator       configs.imageregistry.operator.openshift.io  Manages internal registry
authentication-operator       authentications.config.openshift.io          Manages OAuth server

Every one of these has persistent watch connections open to the API server at all times — a healthy ARO cluster typically has 40–80 active watch streams running 24/7.


Category 3 — Kubelet (Node Agent)

Every worker node runs a kubelet process that maintains its own connection to the API server — reporting node health and receiving pod assignments:

Worker node kubelet → 10.1.0.8:6443
Outbound (kubelet → API server):
  PUT   /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/worker-1  every 10 seconds, node heartbeat (Lease renewal)
  PATCH /api/v1/nodes/worker-1/status              when node conditions change, and periodically
  PATCH /api/v1/namespaces/app/pods/web-1/status   when pod state changes
  POST  /api/v1/namespaces/app/events              kubelet events (OOM, image pull)
Inbound (API server → kubelet port 10250):
  GET https://worker-1:10250/exec/...              kubectl exec forwarding
  GET https://worker-1:10250/containerLogs/...     kubectl logs forwarding
  GET https://worker-1:10250/metrics               Prometheus scraping

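If the API server receives no heartbeat from a kubelet for longer than the node-monitor-grace-period (default 40 seconds), the node controller marks the node NotReady. Pods are not evicted immediately: by default they tolerate the not-ready and unreachable taints for 300 seconds before eviction begins.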


Category 4 — CI/CD Pipelines

Self-hosted CI/CD runners inside the VNet authenticate to the API server using a service account token:

# Service account for CI/CD, scoped to a specific namespace
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cicd-deployer
  namespace: payments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    # "watch" is needed by `oc rollout status`
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cicd-deployer-binding
  namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io   # required field
  kind: Role
  name: deployer
subjects:
  - kind: ServiceAccount
    name: cicd-deployer
    namespace: payments

GitHub Actions pipeline using this service account:

- name: Deploy to ARO
  run: |
    # Authenticate with service account token; all traffic goes to 10.1.0.8:6443
    oc login "${{ secrets.ARO_API_URL }}" \
      --token="${{ secrets.CICD_SA_TOKEN }}"
    # Each command = REST call through private endpoint
    oc set image deployment/web \
      web=acrprod.azurecr.io/my-app:${{ github.sha }} \
      -n payments
    oc rollout status deployment/web -n payments

Category 5 — Admission Webhooks

Admission webhooks add an external hop during the API server request pipeline — the API server calls out to your webhook service before persisting any object:

kubectl apply -f pod.yaml
  API server receives POST /api/v1/namespaces/payments/pods
  Authn + RBAC pass
  Mutating admission webhook:
    API server → POST https://gatekeeper-webhook.gatekeeper-system.svc:443/mutate
    Webhook adds labels, sets resource limits, injects sidecars
    → Returns mutated pod spec
  Validating admission webhook:
    API server → POST https://gatekeeper-webhook.gatekeeper-system.svc:443/validate
    Checks policy: must have resource limits, no root, valid image registry
    → Returns: allowed: true (or denied with reason)
  Persist to etcd → notify watchers → return 201 Created
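
What a mutating webhook sends back can be sketched in Python. This is an illustration under stated assumptions, not Gatekeeper's implementation: mutate_response is a hypothetical function, and the team=payments label is invented. The envelope itself (an admission.k8s.io/v1 AdmissionReview whose response echoes the request uid and carries a base64-encoded JSONPatch) is the format the API server expects.

```python
import base64
import json


def mutate_response(review: dict) -> dict:
    """Return an AdmissionReview response that adds a label via JSONPatch."""
    request = review["request"]
    labels = request["object"]["metadata"].get("labels")
    if labels is None:
        # No labels map yet: create it in one operation.
        patch = [{"op": "add", "path": "/metadata/labels",
                  "value": {"team": "payments"}}]
    else:
        patch = [{"op": "add", "path": "/metadata/labels/team",
                  "value": "payments"}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the uid of the request
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


sample = {"request": {"uid": "abc-123",
                      "object": {"metadata": {"name": "web-1"}}}}
print(json.dumps(mutate_response(sample)["response"], indent=2))
```

A validating webhook returns the same envelope without patch and patchType, setting allowed to false plus a status message when it rejects the object.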

Common admission webhooks in ARO:

Webhook                         Purpose
OPA Gatekeeper                  Policy enforcement: block non-compliant resources
Kyverno                         Policy as code: mutate, validate, generate
Istio / OpenShift Service Mesh  Inject Envoy sidecar into pods automatically
Red Hat ACM                     Multi-cluster governance policies
Cert-manager                    Inject TLS certificates into resources

Category 6 — Monitoring and Observability

# Prometheus scrapes API server metrics via the API endpoint
GET https://10.1.0.8:6443/metrics
# Returns: apiserver_request_total, apiserver_request_duration_seconds,
# etcd_request_duration_seconds, workqueue_depth, ...
# Health endpoints checked by Azure ARO service monitor
GET https://10.1.0.8:6443/healthz → "ok"
GET https://10.1.0.8:6443/readyz → "ok"
GET https://10.1.0.8:6443/livez → "ok"
# OpenShift console reads cluster state continuously
GET /apis/config.openshift.io/v1/clusterversions/version
GET /api/v1/namespaces?limit=500
GET /apis/project.openshift.io/v1/projects

The Request Pipeline — What Happens Inside

Every write request through the private endpoint traverses this pipeline inside kube-apiserver. Read requests (GET, LIST, WATCH) skip stages 3 to 6 and are served once authorization passes:

TLS handshake on 10.1.0.8:6443
1. AUTHENTICATION: who are you?
   • OIDC token (Entra ID) → extract user + groups
   • x509 client cert → extract CN as username
   • Bearer token → look up service account
   • Failure → 401 Unauthorized
2. AUTHORIZATION (RBAC): are you allowed?
   • Check: user + groups + verb + resource + namespace
   • ClusterRoleBinding / RoleBinding lookup
   • Failure → 403 Forbidden
3. ADMISSION CONTROL: is this allowed by policy?
   • Mutating webhooks (modify the object)
   • Built-in admission plugins (ResourceQuota, LimitRanger)
   • OpenShift SCC admission for pods
   • Validating webhooks (accept or reject; these run after the schema validation in stage 4)
   • Failure → 400/403 with reason
4. VALIDATION: is the object schema correct?
   • OpenAPI schema validation
   • CRD schema validation
   • Field immutability checks
   • Failure → 422 Unprocessable Entity
5. PERSIST TO etcd
   • Serialise to protobuf
   • Encrypt at rest when etcd encryption is enabled (AES-CBC or AES-GCM)
   • Write to etcd with optimistic concurrency (resourceVersion)
   • Failure → 409 Conflict (resourceVersion mismatch)
6. NOTIFY WATCHERS
   • Push event to all active watch streams matching the resource
   • Controllers, operators, scheduler, kubelet all receive notification
7. RETURN RESPONSE
   • 200 OK (GET)
   • 201 Created (POST)
   • 200 OK with updated object (PATCH/PUT)
   • 404 Not Found
   • Streaming response for watch/exec/logs/port-forward
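
The optimistic concurrency check in stage 5 is the least intuitive part of the pipeline, so here is a small in-memory Python model of it. The Store class and Conflict exception are hypothetical stand-ins for etcd and an HTTP 409; only the rule itself (an update must carry the resourceVersion it was read at) reflects the API server's behaviour.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""


class Store:
    """Toy key-value store with a global revision counter, loosely like etcd."""

    def __init__(self):
        self._objects = {}   # name -> (resource_version, spec)
        self._revision = 0

    def create(self, name, spec):
        self._revision += 1
        self._objects[name] = (self._revision, spec)
        return self._revision

    def update(self, name, spec, resource_version):
        current, _ = self._objects[name]
        if resource_version != current:
            # The caller read an older copy: reject instead of overwriting.
            raise Conflict(f"{name}: have {current}, got {resource_version}")
        self._revision += 1
        self._objects[name] = (self._revision, spec)
        return self._revision


store = Store()
rv = store.create("web", {"replicas": 3})
store.update("web", {"replicas": 5}, rv)          # first writer succeeds
try:
    store.update("web", {"replicas": 2}, rv)      # second writer is stale
except Conflict as err:
    print("409 Conflict:", err)
```

Clients recover from the conflict by re-reading the object, reapplying their change to the fresh copy, and retrying.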

API Groups — Kubernetes vs OpenShift

The API server serves two parallel API surfaces — Kubernetes core APIs and OpenShift extension APIs — all through the same 10.1.0.8:6443 endpoint:

Kubernetes core APIs:
  /api/v1/                          pods, services, configmaps, secrets, nodes, persistentvolumes
  /apis/apps/v1/                    deployments, replicasets, statefulsets, daemonsets
  /apis/batch/v1/                   jobs, cronjobs
  /apis/rbac.authorization.k8s.io/  clusterroles, rolebindings
  /apis/storage.k8s.io/             storageclasses, volumeattachments
  /apis/networking.k8s.io/          ingresses, networkpolicies
OpenShift extension APIs:
  /apis/route.openshift.io/         routes (OpenShift ingress primitive)
  /apis/project.openshift.io/       projects (namespace + RBAC wrapper)
  /apis/build.openshift.io/         buildconfigs, builds
  /apis/image.openshift.io/         imagestreams, imagestreamtags
  /apis/apps.openshift.io/          deploymentconfigs (legacy)
  /apis/security.openshift.io/      securitycontextconstraints
  /apis/config.openshift.io/        cluster-wide config (DNS, network, auth)
  /apis/operator.openshift.io/      operator configuration resources
  /apis/machine.openshift.io/       machines, machinesets (MachineAPI)

Key Takeaway

The ARO API server private endpoint at 10.1.0.8:6443 is not just the entry point for human CLI commands; it is the nervous system of the entire cluster. Every automated process flows through this single TLS endpoint: the 40+ built-in OpenShift operators maintaining cluster state, every kubelet heartbeating from every worker node every 10 seconds, every CI/CD deployment, every request that triggers admission webhooks, every Prometheus health check. Making it private removes the API server from direct internet exposure, while the seven-stage request pipeline inside the API server ensures every write is authenticated, authorised, policy-checked, validated, and durably persisted before a response is returned.

Best Practices for OpenShift on Azure: ARO Guide

OpenShift Container Platform on Azure — ARO Best Practices

Azure Red Hat OpenShift (ARO) is a fully managed OpenShift 4 service jointly operated by Microsoft and Red Hat — both companies share responsibility for the control plane, infrastructure, and SLA (99.95%).


1. Networking Best Practices

Always deploy a private cluster

A private ARO cluster hides the Kubernetes API server behind a private endpoint — no public IP, unreachable from the internet:

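# --apiserver-visibility Private keeps the API server off the internet
# --ingress-visibility Private does the same for the default ingress
# (comments cannot follow a trailing backslash, so they sit up here)
az aro create \
  --resource-group rg-aro \
  --name aro-prod \
  --vnet aro-spoke-vnet \
  --master-subnet master-subnet \
  --worker-subnet worker-subnet \
  --apiserver-visibility Private \
  --ingress-visibility Private \
  --pull-secret @pull-secret.txt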

Access to the private API server is then through Azure Bastion → jump host, or over ExpressRoute/VPN from on-premises.


Subnet sizing — get this right before deployment (cannot resize after)

ARO consumes subnet addresses per node, not per pod: nodes take their IPs from the master and worker subnets, while pods draw from the separate pod CIDR (10.128.0.0/14 by default). Size the subnets for node count, load balancer front ends, and upgrade headroom:

Subnet             Minimum  Recommended  Notes
Master subnet      /27      /24          Fixed 3 masters; needs room for Azure infra IPs
Worker subnet      /27      /23 or /22   One IP per worker node plus internal LB IPs; size generously
Ingress subnet     /28      /27          For LB / App Gateway front-end IPs
Private endpoints  /28      /27          One IP per private endpoint

Worker subnet sizing example:
  /23 = 512 addresses
  Azure reserves 5
  Available: 507
  Max pods per node: 250 (OpenShift default; pod IPs come from the pod CIDR, not this subnet)
  Nodes supportable: up to ~500 by address count, less any internal load balancer IPs
  Plan for: 3× current need for growth headroom
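
The address arithmetic in the example can be reproduced with Python's standard ipaddress module. The CIDR ranges below are made-up examples, and usable_addresses is a hypothetical helper; the five reserved addresses per subnet are Azure's standard reservation.

```python
import ipaddress

AZURE_RESERVED = 5  # Azure reserves five addresses in every subnet


def usable_addresses(cidr: str) -> int:
    """Addresses left for workloads after Azure's per-subnet reservation."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED


for cidr in ("10.1.2.0/27", "10.1.2.0/23", "10.1.0.0/22"):
    print(cidr, usable_addresses(cidr))
```

Running this for candidate prefixes before deployment is cheap insurance, since the subnet cannot be resized afterwards.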


Egress lockdown via Azure Firewall

ARO requires outbound internet access for Red Hat update servers, telemetry, and registry.redhat.io. Lock this down with Azure Firewall application rules rather than allowing all outbound:

Azure Firewall Application Rules for ARO egress:
┌─────────────────────────────────────────────────────────┐
│ Name                  Target FQDN                       │
├─────────────────────────────────────────────────────────┤
│ aro-rh-registry       registry.redhat.io                │
│                       registry.access.redhat.com        │
│                       quay.io                           │
│                       cdn.quay.io                       │
├─────────────────────────────────────────────────────────┤
│ aro-azure-services    *.blob.core.windows.net           │
│                       *.servicebus.windows.net          │
│                       *.table.core.windows.net          │
├─────────────────────────────────────────────────────────┤
│ aro-monitoring        *.ods.opinsights.azure.com        │
│                       *.oms.opinsights.azure.com        │
├─────────────────────────────────────────────────────────┤
│ aro-rh-telemetry      cert-api.access.redhat.com        │
│                       api.access.redhat.com             │
└─────────────────────────────────────────────────────────┘

Apply a UDR on the master and worker subnets pointing 0.0.0.0/0 to the Azure Firewall private IP — same hub and spoke pattern as any spoke workload.


Use a custom DNS server

Point the ARO VNet DNS to your hub DNS Private Resolver so cluster nodes can resolve private endpoints and internal domains:

az network vnet update \
  --resource-group rg-aro-network \
  --name aro-spoke-vnet \
  --dns-servers 10.0.5.4    # DNS Private Resolver inbound endpoint IP


2. Availability and Resilience Best Practices

Spread across all three Availability Zones

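ARO deploys 3 master nodes, one per AZ, automatically. In zone-enabled regions the installer also creates one worker MachineSet per zone; any worker pool you add yourself must be spread explicitly with one MachineSet per zone: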

# MachineSet for AZ1 — replicate for AZ2, AZ3
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  replicas: 3
  template:
    spec:
      providerSpec:
        value:
          zone: "1"                         # AZ1
          vmSize: Standard_D16s_v3
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS

Create three MachineSets — one per zone — with equal replica counts. This ensures workloads survive a full AZ failure.
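
As a quick way to reason about replica counts, the following Python sketch splits a worker total across zones and reports what survives the loss of the fullest zone. Both function names are hypothetical, and the zone labels simply mirror the "1", "2", "3" values used in the MachineSet above.

```python
def spread(total_workers, zones=("1", "2", "3")):
    """Distribute workers across zones as evenly as possible."""
    base, extra = divmod(total_workers, len(zones))
    return {zone: base + (1 if index < extra else 0)
            for index, zone in enumerate(zones)}


def capacity_after_zone_loss(plan):
    """Workers remaining if the most populated zone goes down."""
    return sum(plan.values()) - max(plan.values())


plan = spread(9)
print(plan, capacity_after_zone_loss(plan))
```

If the surviving count cannot carry the workload, raise the per-zone replicas until it can rather than relying on the autoscaler to react during the outage.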


Enable cluster autoscaler

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 24
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: aro-prod-worker-eastus-1


Use zone-redundant storage for persistent volumes

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS       # Zone-redundant storage — survives AZ failure
  cachingMode: ReadOnly
reclaimPolicy: Retain        # Retain on PVC delete — prevents data loss
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Use Premium_ZRS instead of Premium_LRS for stateful workloads — ZRS replicates the disk synchronously across three AZs so a pod can reschedule to another zone without losing its data.


3. Security Best Practices

Use Workload Identity (pod-level Azure RBAC)

Never put Azure credentials in pods. Use Workload Identity to give individual pods a Microsoft Entra ID (formerly Azure AD) identity with scoped RBAC permissions:

# Enable workload identity on the ARO cluster
# (flag support varies by az CLI and ARO version; confirm with `az aro update --help`)
az aro update \
  --resource-group rg-aro \
  --name aro-prod \
  --enable-managed-identity

# Create a managed identity for a specific workload
az identity create \
  --resource-group rg-aro-workloads \
  --name id-payment-service

# Grant it only what it needs
az role assignment create \
  --assignee <identity-client-id> \
  --role "Key Vault Secrets User" \
  --scope /subscriptions/.../vaults/kv-prod

# Annotate the service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: payments
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"


Integrate Azure Key Vault for secrets via CSI driver

Never store secrets in OpenShift Secrets (base64 is not encryption). Use the Secrets Store CSI driver to mount Key Vault secrets directly into pods:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-secrets
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<managed-identity-client-id>"
    keyvaultName: kv-prod
    tenantID: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
        - |
          objectName: api-key
          objectType: secret


Integrate with Azure Container Registry via private endpoint

# Create ACR with private endpoint — no public access
az acr create \
  --resource-group rg-aro \
  --name acrprodaro \
  --sku Premium \
  --public-network-enabled false

# Private endpoint in ARO spoke
az network private-endpoint create \
  --name pe-acr-prod \
  --resource-group rg-aro-network \
  --vnet-name aro-spoke-vnet \
  --subnet private-endpoint-subnet \
  --private-connection-resource-id $(az acr show --name acrprodaro --query id -o tsv) \
  --group-id registry \
  --connection-name pe-acr-conn

# Grant ARO pull access
az role assignment create \
  --assignee <aro-kubelet-identity> \
  --role AcrPull \
  --scope $(az acr show --name acrprodaro --query id -o tsv)


Apply OpenShift Security Context Constraints (SCC)

Never run pods as root. Use the restricted-v2 SCC (default in OpenShift 4.11+):

apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    # No runAsUser here: restricted-v2 assigns a UID from the namespace's
    # allocated range and rejects UIDs outside it
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        readOnlyRootFilesystem: true


Enable Microsoft Defender for Containers

az security pricing create \
  --name Containers \
  --tier Standard

Defender for Containers provides runtime threat detection, vulnerability scanning for images in ACR, and Kubernetes audit log analysis — all surfaced in Microsoft Defender for Cloud.


4. Observability Best Practices

Forward logs to Azure Monitor / Log Analytics

# Deploy the Container Insights agent via Helm
# (confirm this chart against current Microsoft guidance for ARO before use)
helm repo add microsoft https://microsoft.github.io/charts/repo
helm install azuremonitor-containers \
  microsoft/azuremonitor-containers \
  --set omsagent.secret.wsid=<workspace-id> \
  --set omsagent.secret.key=<workspace-key> \
  --namespace kube-system


Use Azure Monitor alerts for cluster health

Alert                 Metric                 Threshold
Node CPU pressure     cpuUsageNanoCores      > 85% for 5 min
Node memory pressure  memoryWorkingSetBytes  > 80% of capacity
Pod restart loop      restartCount           > 5 in 10 min
PVC near full         pvUsedBytes            > 85% of capacity
Node not ready        nodeCondition          NotReady > 2 min

5. Day-2 Operations Best Practices

Cluster upgrade strategy

ARO manages the control plane upgrade automatically — you control timing for worker nodes:

# Check available upgrade versions
az aro get-upgrade-versions \
  --resource-group rg-aro \
  --name aro-prod

# Schedule upgrade in maintenance window
az aro update \
  --resource-group rg-aro \
  --name aro-prod \
  --version 4.14.12

Use the EUS (Extended Update Support) channel, which applies to even-numbered OpenShift minor releases, for production clusters. It allows staying on a minor version for up to 18 months while still receiving security patches, avoiding the churn of frequent minor version upgrades.


Worker node upgrade — use surge capacity

# MachineConfigPool rolling upgrade setting
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  maxUnavailable: 1          # Upgrade one node at a time

Upgrade workers one node at a time to maintain application availability — pods are gracefully drained before each node reboots into the new RHCOS version.


Summary — ARO Best Practice Checklist

Category       Practice
Network        Private cluster: no public API or ingress
Network        Egress via Azure Firewall with FQDN allow-list
Network        DNS Private Resolver for private endpoint resolution
Network        Worker subnet /22 or larger; never resize after
Availability   Workers spread across AZs via 3 MachineSets
Availability   Cluster autoscaler min 3 per zone
Availability   Premium_ZRS disks for stateful workloads
Availability   Zone-redundant Azure Load Balancer
Security       Workload Identity: no credentials in pods
Security       Key Vault + CSI driver: no base64 secrets
Security       ACR via private endpoint: no public pull
Security       SCC restricted-v2: no root containers
Security       Defender for Containers enabled
Observability  Container Insights → Log Analytics
Observability  Azure Monitor alerts on node and pod health
Operations     EUS channel for production stability
Operations     maxUnavailable: 1 for worker upgrades

MCP Operations Server: AI-Enabled Managed Ops Explained

To bridge your local Python code to a production-ready AKS environment, you need a Dockerfile that doesn’t just run the code, but does so securely and efficiently.

For production, MCP servers generally move away from STDIO (local command line) to an HTTP-based transport such as SSE (Server-Sent Events); newer revisions of the MCP specification also define Streamable HTTP as its successor. Either way, your AI agents can talk to the server over a network.

1. The Production Dockerfile

This Dockerfile uses a “non-root” user (security best practice) and installs the necessary drivers to talk to the Docker socket or Kubernetes API.

# Use a lightweight Python base image
FROM python:3.12-slim
# Install system dependencies (curl for health checks)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Create a non-root user for security
RUN groupadd -r mcpuser && useradd -r -g mcpuser mcpuser
# Copy requirements and install
# Note: requirements.txt pulls in the MCP server library (fastmcp)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy server code
COPY server.py .
# Give our non-root user access to the app folder (-R must be uppercase)
RUN chown -R mcpuser:mcpuser /app
USER mcpuser
# Expose the port for SSE/HTTP transport
EXPOSE 8000
# Start the server; server.py is assumed to parse --transport and --port
CMD ["python", "server.py", "--transport", "sse", "--port", "8000"]

2. The requirements.txt

You’ll need these specific libraries:

fastmcp>=1.0.0
docker>=7.0.0
kubernetes>=30.0.0
uvicorn # Required for high-performance HTTP transport

3. Deploying to AKS (The “Support” Strategy)

When you deploy this to your client’s AKS cluster, you’ll use a standard Kubernetes Deployment.

Why this is better for your role:

  • Scaling: If the dev team grows, you can scale the MCP server to 3 replicas so the AI assistant never lags.
  • Security: Instead of sharing your personal kubeconfig, the MCP server uses a ServiceAccount with “View Only” permissions. This means the AI can see the logs but can’t accidentally delete the production database.

4. How to Pitch the “AI Operations” Tier

You can now offer a new support tier called “AI-Enabled Managed Ops”:

“I’ve built a custom MCP Operations Server for our cluster. It allows our internal AI agents to perform health checks, retrieve logs, and analyze container stats using natural language. This doesn’t replace me; it allows me to respond to your requests 10x faster because the AI is doing the ‘data gathering’ for me inside our secure perimeter.”

One final piece of the puzzle

To finish the MCP server integration on AKS, the pod needs permission to “talk” to the Kubernetes API and “see” the other pods.

If you don’t grant this, the AI will be “blind”: it will try to list pods and get a 403 Forbidden error.


1. The RBAC Strategy

We will use three Kubernetes objects:

  • ServiceAccount: The identity for your MCP pod.
  • ClusterRole: A set of rules that allow “Viewing” (reading pods, logs, and events).
  • ClusterRoleBinding: The “glue” that attaches the Role to the ServiceAccount.

2. The RBAC YAML (mcp-rbac.yaml)

# 1. The Identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-server-sa
  namespace: default
---
# 2. The Permissions (Read-Only/Viewer)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcp-pod-viewer
rules:
  - apiGroups: [""]
    # Accessing 'pods' for list/get, and 'pods/log' specifically for tracing
    resources: ["pods", "pods/log", "pods/status", "events", "nodes", "services"]
    verbs: ["get", "list", "watch"]
---
# 3. The Connection
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mcp-server-binding
subjects:
  - kind: ServiceAccount
    name: mcp-server-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: mcp-pod-viewer
  apiGroup: rbac.authorization.k8s.io

3. Updating your Deployment

Finally, ensure your MCP server deployment uses this serviceAccountName:

spec:
  template:
    spec:
      serviceAccountName: mcp-server-sa
      containers:
        - name: mcp-server
          image: your-mcp-image:latest
          # ... other config ...

4. Why this is “Safe” for your Client

When you explain this to the company, emphasize these three points:

  • Namespace Scoping: Even though it’s a ClusterRole, you can swap it for a Role if you only want the AI to see specific namespaces (e.g., only production-frontend).
  • No Secrets Access: Notice that secrets is not in the list of resources. The AI literally cannot see the database passwords, even if it tries.
  • Auditability: Every time the AI agent asks the MCP server for logs, the Kubernetes audit log (when audit logging is enabled for the cluster) records that action under the mcp-server-sa identity, giving you an audit trail.
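
The “No Secrets Access” point can be demonstrated with a simplified Python model of RBAC rule matching. The rule list copies the ClusterRole above; is_allowed is a hypothetical helper that ignores wildcards, resourceNames, and non-resource URLs, so treat it as a sketch of the idea rather than the authorizer's real logic.

```python
RULES = [
    {
        "apiGroups": [""],
        "resources": ["pods", "pods/log", "pods/status",
                      "events", "nodes", "services"],
        "verbs": ["get", "list", "watch"],
    },
]


def is_allowed(rules, verb, resource, api_group=""):
    """True if any rule grants the verb on the resource in that API group."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )


for verb, resource in [("list", "pods"), ("get", "pods/log"),
                       ("get", "secrets"), ("delete", "pods")]:
    print(verb, resource, is_allowed(RULES, verb, resource))
```

On a live cluster the same question can be asked with kubectl auth can-i, impersonating the service account.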

Putting it all together

You now have the Terraform for infrastructure, the Python for the server, the Docker for the container, and the RBAC for security.

You’re ready to pitch this as a “Self-Healing AI Operations Layer.”

Top Azure Kubernetes Service Interview Questions

Here are high-impact Azure Kubernetes Service (AKS) interview questions—the kind that actually get asked in real interviews—plus what interviewers are looking for in your answers.


1. AKS Fundamentals

What is Azure Kubernetes Service (AKS)?
  • Managed Kubernetes cluster on Azure
  • Azure manages control plane (free), you manage node pools
  • Integrates with Azure networking, identity, and security services

Interviewer wants:

  • You understand shared responsibility
  • You know why AKS vs self-managed Kubernetes

Difference between AKS and Kubernetes?
  • Kubernetes = open-source container orchestrator
  • AKS = managed implementation of Kubernetes in Azure

Bonus:

  • Mention upgrades, scaling, monitoring handled by Azure

2. Architecture & Components

What are the main components of AKS?

  • Control plane (API server, scheduler, etcd)
  • Node pools (VMs running pods)
  • Pods, deployments, services

Strong answer:

  • Mention system node pool vs user node pool

What is a node pool?
  • Group of nodes with same configuration
  • Used for:
    • Scaling
    • Workload isolation (e.g., GPU vs general compute)

System node pool vs user node pool?
  • System pool → runs critical pods (CoreDNS, metrics-server)
  • User pool → runs your apps

Interview tip: mention taints/tolerations


3. Networking (VERY IMPORTANT)

How does networking work in AKS?
  • Two main models:
    • Kubenet
    • Azure CNI

Kubenet vs Azure CNI?
Feature        Kubenet         Azure CNI
IP assignment  NAT             Real VNet IP
Scalability    Better          Limited by subnet
Complexity     Lower           Higher
Use case       Small clusters  Enterprise

Strong answer:

  • Azure CNI = required for private endpoints / enterprise networking

What is a private AKS cluster?
  • API server is exposed via private IP
  • No public access

Mention:

  • Uses Private Endpoint + Private DNS

How do you expose applications?
  • LoadBalancer service
  • Ingress Controller (e.g., NGINX, AGIC)

Bonus:

  • Mention Application Gateway Ingress Controller (AGIC)

4. Identity & Security

How does AKS handle identity?
  • Uses Microsoft Entra ID (formerly Azure Active Directory)
  • Managed Identity for cluster
  • RBAC for authorization

What is pod identity?
  • Allows pods to access Azure resources securely

Mention:

  • Workload Identity (modern replacement)

How do you secure AKS?
  • Network policies
  • RBAC
  • Private clusters
  • Secrets via Key Vault
  • Defender for Containers

Strong answer = layered security


5. Scaling & Availability

How do you scale AKS?
  • Horizontal Pod Autoscaler (HPA)
  • Cluster Autoscaler

👉 Explain:

  • HPA = pods
  • Cluster autoscaler = nodes
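
If the interviewer asks how the HPA actually picks a replica count, the core formula is desired = ceil(current × currentMetric / targetMetric). The Python sketch below applies it with the default 10% tolerance; the function name and the min/max bounds are illustrative assumptions, and the real controller also handles missing metrics and stabilization windows that are left out here.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count the HPA formula would ask for, clamped to the bounds."""
    ratio = current_metric / target_metric
    # Within the default 10% tolerance the HPA leaves the count alone.
    if abs(ratio - 1.0) <= 0.1:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, 90, 60))   # CPU well above target: scale out
print(desired_replicas(4, 62, 60))   # inside tolerance: no change
print(desired_replicas(4, 20, 60))   # well below target: scale in
```

When the new pods do not fit on existing nodes they stay Pending, which is the signal the cluster autoscaler uses to add nodes.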

How do you ensure high availability?
  • Multiple node pools
  • Availability zones
  • Replica sets

6. Storage

How does storage work in AKS?
  • Persistent Volumes (PV)
  • Persistent Volume Claims (PVC)
  • Azure Disks / Azure Files

Azure Disk vs Azure File?
Feature      Disk                         File
Access       Single node (ReadWriteOnce)  Multiple nodes (ReadWriteMany)
Performance  High                         Moderate

7. CI/CD & Deployment

How do you deploy apps to AKS?
  • kubectl
  • Helm
  • GitHub Actions / Azure DevOps

What is Helm?
  • Kubernetes package manager

Think:

  • “apt-get for Kubernetes”

8. Monitoring & Troubleshooting

How do you monitor AKS?
  • Azure Monitor
  • Log Analytics
  • Container Insights

Pod is not starting—what do you check?

👉 Interview GOLD answer:

  1. kubectl describe pod
  2. kubectl logs
  3. Check events
  4. Image pull issues?
  5. Resource limits?

Node is not ready—what do you check?
  • VM health
  • kubelet status
  • networking issues
  • disk pressure

9. Real-World Scenario Questions

Your app is not reachable externally—what do you check?
  • Service type (LoadBalancer?)
  • NSG rules
  • Ingress config
  • DNS resolution

Pods cannot reach database in Azure SQL
  • VNet integration
  • Private endpoint DNS
  • NSG / firewall rules

Cluster ran out of IPs

This is a BIG one:

  • Happens in Azure CNI
  • Fix:
    • Increase subnet size
    • Use multiple node pools
    • Switch to overlay (advanced)
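
A short calculation makes the exhaustion problem tangible. With traditional (non-overlay) Azure CNI, each node reserves one address for itself plus one per potential pod, and an upgrade needs room for at least one surge node. The Python sketch below encodes that rule; the function names are hypothetical, the 30 pods-per-node default and single surge node are assumptions matching common AKS defaults, and the CIDRs are examples.

```python
import ipaddress

AZURE_RESERVED = 5  # addresses Azure reserves in every subnet


def required_ips(nodes: int, max_pods: int = 30, surge_nodes: int = 1) -> int:
    """Addresses a node-subnet Azure CNI cluster needs, including upgrade surge."""
    return (nodes + surge_nodes) * (max_pods + 1)


def fits(cidr: str, nodes: int, max_pods: int = 30) -> bool:
    """True if the subnet can hold the cluster at the given node count."""
    usable = ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED
    return required_ips(nodes, max_pods) <= usable


print(required_ips(50))
print(fits("10.240.0.0/22", 50), fits("10.240.0.0/21", 50))
```

Quoting numbers like these in an interview shows why subnet sizing has to happen before the cluster is created.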

10. Advanced / Senior-Level Questions

How do you design AKS for enterprise?
  • Hub-spoke networking
  • Private cluster
  • Azure Firewall
  • Azure Policy
  • Multiple environments (dev/test/prod)

What are common AKS pitfalls?
  • IP exhaustion
  • Misconfigured DNS
  • Over-permissioned identities
  • No autoscaling

How do you upgrade AKS safely?
  • Upgrade node pools one at a time
  • Use surge upgrade
  • Test in staging

If You Want to Stand Out

Say things like:

  • “I’d default to private AKS + Azure CNI in enterprise”
  • “I always plan subnet sizing early to avoid IP exhaustion”
  • “I separate system and user node pools for reliability”

The following are real interview-style troubleshooting drills, worked step by step the way an interviewer would push you.


Drill 1: “Pod is stuck in Pending”

Scenario

You deploy an app, but the pod never starts.


How you should think (out loud)

Step 1 — Describe the pod

kubectl describe pod <pod-name>

Look for:

  • Insufficient CPU/Memory
  • node affinity
  • taints not tolerated

Common Root Causes

1. Not enough resources

  • Node pool too small
  • No autoscaler

2. Taints / tolerations mismatch

  • Pod can’t be scheduled

3. No available nodes

  • Cluster autoscaler disabled or maxed out

Strong interview answer

“I’d start with kubectl describe pod to check scheduling events. Most Pending issues are either resource constraints, taints, or node availability. Then I’d verify node pool capacity and autoscaler behavior.”
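
As a rough illustration of how the scheduling events map onto those three root causes, here is a small Python sketch that classifies the message text of a FailedScheduling event. The function name and the cause labels are hypothetical, and the keyword matching is a simplification of what you would read by eye in the kubectl describe pod output.

```python
def classify_pending(message: str) -> list[str]:
    """Map a FailedScheduling event message to likely root causes."""
    text = message.lower()
    causes = []
    if "insufficient cpu" in text or "insufficient memory" in text:
        causes.append("resources")   # node pool too small, autoscaler off or maxed out
    if "taint" in text:
        causes.append("taints")      # pod lacks a matching toleration
    if "affinity" in text or "node selector" in text:
        causes.append("affinity")    # nodeSelector / affinity matches no node
    return causes or ["unknown"]


if __name__ == "__main__":
    event = ("0/3 nodes are available: 1 node(s) had untolerated taint "
             "{CriticalAddonsOnly: true}, 2 Insufficient cpu.")
    print(classify_pending(event))
```

A single message can point at more than one cause, which is why the function returns a list rather than one label.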


Drill 2: “Pod is crashing (CrashLoopBackOff)”

Scenario

Pod starts but keeps restarting.


Steps

Step 1 — Check logs

kubectl logs <pod-name>
kubectl logs <pod-name> --previous   # logs from the container instance that just crashed

Step 2 — Describe pod

kubectl describe pod <pod-name>

Common Causes
  • App crash (bad config, env vars)
  • Liveness probe killing container
  • Missing secret/config map

Pro answer

“I’d first check container logs, then validate probes and configuration dependencies like secrets. CrashLoopBackOff is usually application or probe-related.”


Drill 3: “App not accessible externally”

Scenario

App deployed but browser can’t reach it.


Debug flow

Step-by-step
  1. Check service
kubectl get svc
  • Is it LoadBalancer?

  2. Check external IP
  • Assigned or stuck in <pending>?

  3. Check ingress
kubectl get ingress

  4. Check NSG / firewall
  • Port 80/443 open?

  5. DNS resolution
  • Is domain pointing correctly?

Common Causes
  • Service is ClusterIP only
  • NSG blocking traffic
  • Ingress misconfigured
  • Backend pods not healthy

Strong answer

“I’d trace from outside in: DNS → Load Balancer → Ingress → Service → Pod. That quickly isolates where traffic is breaking.”
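
The outside-in trace can be sketched as an ordered list of checks that stops at the first failure. This is only an illustration: the layer names follow the answer above, while the function name and the idea of passing one callable per layer are assumptions. In practice each callable would wrap a real probe such as a DNS lookup or a kubectl query.

```python
from typing import Callable, Optional

# Order matters: start where the user is and walk inward.
LAYERS = ["DNS", "Load Balancer", "Ingress", "Service", "Pod"]


def first_failing_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the checks from the outside in; return the first layer that fails."""
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None  # every supplied check passed


if __name__ == "__main__":
    results = {
        "DNS": lambda: True,
        "Load Balancer": lambda: True,
        "Ingress": lambda: False,   # e.g. wrong backend service name
        "Service": lambda: True,
        "Pod": lambda: True,
    }
    print(first_failing_layer(results))
```

Stopping at the first failing layer keeps the investigation focused: anything behind a broken ingress cannot be judged from outside until the ingress is fixed.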


Drill 4: “Pods cannot reach Azure SQL / external service”

Scenario

App runs but can’t connect to DB.


Think networking first

Steps
  1. Test from inside pod
kubectl exec -it <pod> -- curl <endpoint>

  2. Check DNS resolution (also from inside the pod)
kubectl exec -it <pod> -- nslookup <db-name>

  3. Check networking
  • VNet integration
  • Private endpoint?

  4. Check NSG rules
  • Outbound allowed?

  5. Check Azure SQL firewall

Common Causes
  • Private endpoint DNS not configured
  • NSG blocking outbound
  • Wrong connection string

Pro answer

“I’d validate connectivity from inside the pod, then check DNS resolution for private endpoints, and finally NSG and firewall rules.”


Drill 5: “Cluster ran out of IPs” (VERY COMMON)

Scenario

Pods stop scheduling, errors appear.


What’s happening?
  • Using Azure CNI → each pod gets real VNet IP
  • Subnet is exhausted

Symptoms
  • Pods stuck in Pending
  • Errors about IP allocation

Fixes
  • Expand subnet
  • Add new node pool with bigger subnet
  • Use Azure CNI Overlay (advanced)

Strong answer

“This is a classic Azure CNI limitation. I’d check subnet utilization and either expand it or redesign with better IP planning.”
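
The IP planning itself is simple arithmetic. With traditional Azure CNI every node reserves one address for itself plus one per potential pod, Azure keeps five addresses per subnet for its own use, and an upgrade temporarily adds surge nodes. A minimal Python sketch of that calculation follows. The function names are hypothetical, and the defaults (30 pods per node, one surge node) are assumptions you should replace with your cluster's actual settings.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves five addresses in every subnet


def required_ips(nodes: int, max_pods: int = 30, surge_nodes: int = 1) -> int:
    """IPs needed when each node pre-allocates one IP per potential pod."""
    total_nodes = nodes + surge_nodes       # leave room for upgrade surge
    return total_nodes * (1 + max_pods)     # node IP + pod IPs


def subnet_fits(cidr: str, nodes: int, max_pods: int = 30,
                surge_nodes: int = 1) -> bool:
    """Check whether the subnet can hold the cluster at the given size."""
    usable = ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_IPS
    return required_ips(nodes, max_pods, surge_nodes) <= usable


if __name__ == "__main__":
    for node_count in (5, 8):
        print(node_count, required_ips(node_count),
              subnet_fits("10.0.1.0/24", node_count))
```

Running the numbers before the cluster exists is the cheap way to avoid this drill in production, since resizing a subnet that already has nodes in it is much harder.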


Drill 6: “Node shows NotReady”

Scenario

One or more nodes go unhealthy.


Steps
kubectl get nodes
kubectl describe node <node>

Check for:
  • Disk pressure
  • Memory pressure
  • kubelet stopped
  • Network issues

Azure-specific checks
  • VM status in Azure Portal
  • Underlying VMSS health

Strong answer

“I’d check node conditions via kubectl describe, then validate VM health in Azure and kubelet status.”


Drill 7: “Deployment succeeded but no pods created”

Scenario

You applied YAML, nothing runs.


Steps

kubectl get deployments
kubectl describe deployment <name>

Causes

  • replicas set to 0
  • Image pull error (pods are created but stay in ImagePullBackOff)
  • Invalid YAML (kubectl apply rejects the manifest)


MASTER TROUBLESHOOTING FRAMEWORK (Memorize This)

When stuck, always go:

Flow

1. Pod

  • Status? Logs?

2. Node

  • Capacity? Healthy?

3. Network

  • Service? DNS? NSG?

4. Azure layer

  • VNet / Subnet / Private endpoint?

How to Sound Senior in Interviews

Say this:

“I follow a layered troubleshooting approach:
Kubernetes layer (pods, services),
then node health,
then networking,
and finally Azure infrastructure like VNets and NSGs.”


Understanding Azure RBAC vs Kubernetes RBAC

When explaining this to a client, it is helpful to describe it as the difference between who can touch the physical server (the building) versus who can edit the files on the computer inside (the office).

In 2026, a common recommendation is to manage both through Azure RBAC, but they still operate on two distinct “control planes.”


1. The Two Control Planes

In AKS, access is split into two layers:

  • The Azure Control Plane (Azure RBAC): This governs the “outside” of the cluster. It’s about the Kubernetes resource itself as it exists in your Azure portal.
  • The Kubernetes Control Plane (Kubernetes RBAC): This governs the “inside” of the cluster. It’s about the pods, namespaces, and deployments running on the nodes.

2. Side-by-Side Comparison

Feature | Azure RBAC | Kubernetes RBAC
Scope | Subscription / Resource Group / AKS Resource | Cluster / Namespace / Specific Pods
Managed Via | Azure Portal, CLI, Terraform | kubectl, YAML manifests, Helm
Typical Actions | Scaling nodes, Upgrading K8s version, Deleting the cluster. | Creating a Pod, Editing a Service, Viewing logs in a Namespace.
Identity Source | Microsoft Entra ID (Azure AD) | Service Accounts (or Entra ID via integration)

3. The “Hybrid” Option (Azure RBAC for K8s Authorization)

This is the most confusing part for beginners, but the most important for you to propose to your client.

You can now use Azure RBAC to manage internal Kubernetes permissions. Instead of writing complex RoleBinding YAML files for every user, you assign them a built-in Azure role that Kubernetes understands.

Key Built-in Roles (2026 Standards):

  • Azure Kubernetes Service RBAC Reader: Can see resources in a namespace but can’t see secrets or change anything.
  • Azure Kubernetes Service RBAC Writer: Can deploy apps and edit resources.
  • Azure Kubernetes Service RBAC Admin: Full control over a namespace.
  • Azure Kubernetes Service RBAC Cluster Admin: The “God Mode” for the entire cluster.
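
These roles are assigned like any other Azure role, and the scope decides whether the grant covers the whole cluster or a single namespace (the cluster's resource ID with /namespaces/<name> appended). In Azure the role names begin with "Azure Kubernetes Service RBAC". The Python sketch below only builds the scope string and the az role assignment create argument list. The helper names and all IDs are hypothetical placeholders, and nothing is executed.

```python
from typing import Optional


def aks_scope(subscription_id: str, resource_group: str, cluster: str,
              namespace: Optional[str] = None) -> str:
    """Build the Azure scope for a whole cluster or for one namespace in it."""
    scope = (f"/subscriptions/{subscription_id}"
             f"/resourceGroups/{resource_group}"
             f"/providers/Microsoft.ContainerService/managedClusters/{cluster}")
    if namespace:
        scope += f"/namespaces/{namespace}"   # namespace-level grant
    return scope


def role_assignment_command(assignee: str, role: str, scope: str) -> list[str]:
    """Argument list for 'az role assignment create' (not run here)."""
    return ["az", "role", "assignment", "create",
            "--assignee", assignee, "--role", role, "--scope", scope]


if __name__ == "__main__":
    # All identifiers below are made-up placeholders.
    scope = aks_scope("00000000-0000-0000-0000-000000000000",
                      "rg-prod", "aks-01", namespace="dev")
    print(" ".join(role_assignment_command(
        "dev-team-group-object-id",
        "Azure Kubernetes Service RBAC Writer", scope)))
```

Assigning to an Entra ID group rather than to individual users keeps the number of role assignments small and makes offboarding a group-membership change.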

4. How to Explain the Workflow to Your Manager

“Think of it like a bank:

  1. Azure RBAC is the security guard at the front door. He checks your ID (Entra ID) and decides if you’re even allowed in the building. He also decides who can add more teller windows (Scale nodes) or renovate the lobby (Upgrade cluster).
  2. Kubernetes RBAC is the permissions on the safe. Once you’re inside, it decides if you can open Drawer A (Namespace ‘Dev’) or Drawer B (Namespace ‘Prod’).”

Pro-Tip: Recommendation

If you want to provide “Gold Standard” support, propose disabling local accounts and moving entirely to Azure RBAC for Kubernetes Authorization.

Why? Because when an employee leaves the company and their Entra ID (Azure AD) account is deleted, their access to the Kubernetes cluster is revoked with it, as soon as any token they already hold expires. No orphaned RoleBindings to worry about.

Understanding how a developer goes from their laptop to a running container in a secured AKS environment is the best way to prove the value of your setup.

Here is the step-by-step lifecycle of a developer’s request in a Zero-Trust AKS environment.


The Access Lifecycle (Step-by-Step)

1. Authentication (The Gatekeeper)

The developer doesn’t have a “Kubernetes password.” Instead, they run:

Bash

az login
az aks get-credentials --resource-group rg-prod --name aks-01

At this moment, Azure RBAC checks if their Entra ID account has permission to even download the cluster configuration.

2. Authorization (The Office Door)

The developer tries to deploy a new microservice:

Bash

kubectl apply -f my-app.yaml

The AKS API Server intercepts this. Since we are using Azure RBAC for Kubernetes Authorization, it checks the user’s Azure role assignments: “Does this user have the ‘Azure Kubernetes Service RBAC Writer’ role for the ‘Production’ namespace?”

  • If Yes: The request proceeds.
  • If No: The request is blocked with a 403 Forbidden error.

3. Policy Validation (The Safety Inspector)

Before the pod is admitted to the cluster, Azure Policy (which runs as a Gatekeeper admission controller) evaluates the objects defined in my-app.yaml.

  • It checks: “Is this container trying to run as root? Does it have CPU limits?”
  • If the YAML is “lazy” (insecure) and the policy effect is set to deny, Azure Policy rejects it immediately, even though the developer has “Writer” permissions.
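
To show what such a check amounts to, here is a small Python sketch that inspects a pod spec (as a plain dictionary) for the two questions above. It is not how Azure Policy is implemented, since the real rules are Gatekeeper constraints, and the function name and violation messages are hypothetical.

```python
def policy_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a pod spec dictionary."""
    violations = []
    pod_ctx = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        # The container-level setting overrides the pod-level one.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if not non_root:
            violations.append(f"{name}: may run as root")
        limits = container.get("resources", {}).get("limits", {})
        if "cpu" not in limits:
            violations.append(f"{name}: no CPU limit")
    return violations


if __name__ == "__main__":
    lazy = {"containers": [{"name": "web", "image": "nginx"}]}
    for problem in policy_violations(lazy):
        print(problem)
```

An empty list means the spec would pass these two checks. Any entry corresponds to a request that a deny-effect policy would reject regardless of the user's RBAC role.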

4. Identity & Secrets (The Secure Handshake)

Once the pod starts, it needs to talk to the database.

  • The pod presents its Workload Identity (a managed identity) to the Azure Key Vault.
  • Key Vault verifies the pod’s identity and hands over the database string via the CSI Driver.
  • The password is never written into the manifest, the container image, or the Git repository; the CSI driver mounts it into the running pod as a volume.

Summary Table for Your Proposal

To wrap this up for your client, you can present this “Success Path” to show them exactly what they are paying for:

Stage | Security Layer | Purpose
Login | Entra ID | Ensures only active employees can connect.
Action | Azure RBAC | Limits what a developer can do (e.g., Read vs. Write).
Deploy | Azure Policy | Forces best practices (No root, resource limits).
Connect | Workload Identity | Eliminates hardcoded passwords in the code.

Pro-Tip: The “Audit” Hook

Tell your client: “With this setup, we can generate a report at any time showing exactly who accessed the production cluster and what they changed. This makes SOC2 or ISO27001 audits a breeze.”

Deploy AKS Clusters with Terraform: Best Practices

To deploy a production-ready AKS cluster using Terraform, it is best practice to separate your Network (VNet/Subnet) from the AKS Cluster resource. This ensures that if you ever need to destroy the cluster, your networking infrastructure remains intact.

Here is a clean, modular example using the AzureRM provider.

1. The Provider Configuration

First, create a main.tf to define your requirements.

Terraform

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # Or 4.x if using the latest 2026 releases
    }
  }
}
provider "azurerm" {
  features {}
}

2. Networking Resources

AKS needs a dedicated subnet. We’ll use Azure CNI (Advanced Networking) as it’s the standard for enterprise security.

Terraform

resource "azurerm_resource_group" "aks_rg" {
  name     = "rg-production-aks"
  location = "East US"
}
resource "azurerm_virtual_network" "aks_vnet" {
  name                = "vnet-aks-prod"
  location            = azurerm_resource_group.aks_rg.location
  resource_group_name = azurerm_resource_group.aks_rg.name
  address_space       = ["10.0.0.0/16"]
}
resource "azurerm_subnet" "aks_subnet" {
  name                 = "snet-aks-nodes"
  resource_group_name  = azurerm_resource_group.aks_rg.name
  virtual_network_name = azurerm_virtual_network.aks_vnet.name
  address_prefixes     = ["10.0.1.0/24"]
}

3. The AKS Cluster Resource

This block includes the security features we discussed: System Assigned Identity, Azure RBAC, and Azure Linux as the OS.

Terraform

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-prod-01"
  location            = azurerm_resource_group.aks_rg.location
  resource_group_name = azurerm_resource_group.aks_rg.name
  dns_prefix          = "aksprod"

  # Enforce cluster policies through the Azure Policy add-on
  azure_policy_enabled = true

  # Disable local accounts so every login goes through Entra ID
  local_account_disabled = true

  # Enable Entra ID integration with Azure RBAC for Kubernetes authorization
  # (azurerm 3.x syntax; 4.x drops the "managed" argument)
  azure_active_directory_role_based_access_control {
    managed            = true
    azure_rbac_enabled = true
  }

  default_node_pool {
    name           = "systempool"
    node_count     = 3
    vm_size        = "Standard_DS2_v2"
    vnet_subnet_id = azurerm_subnet.aks_subnet.id
    # Use Azure Linux for better security/performance
    os_sku = "AzureLinux"
    # Enable auto-scaling for production
    # (renamed to auto_scaling_enabled in azurerm 4.x)
    enable_auto_scaling = true
    min_count           = 3
    max_count           = 5
  }
  identity {
    type = "SystemAssigned"
  }
  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = "standard"
    network_policy    = "azure" # Enables Kubernetes Network Policies
  }
  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

4. Essential Outputs

You’ll need the cluster configuration to connect via kubectl.

Terraform

output "client_certificate" {
  value     = azurerm_kubernetes_cluster.aks.kube_config[0].client_certificate
  sensitive = true
}
output "kube_config" {
  value     = azurerm_kubernetes_cluster.aks.kube_config_raw
  sensitive = true
}

Key Implementation Steps

  1. Initialize: Run terraform init to download the Azure provider.
  2. Plan: Run terraform plan -out=main.tfplan to preview the 4 resources being created.
  3. Apply: Run terraform apply "main.tfplan".
  4. Connect: Once finished, use the Azure CLI to get your credentials:
az aks get-credentials --resource-group rg-production-aks --name aks-prod-01

Why this is a “Support Pro” Move

By delivering this in Terraform, you are telling the company: “I don’t just click buttons in the portal. I provide Infrastructure as Code that is version-controlled, repeatable, and documented.” This makes it much easier to propose a “Disaster Recovery” service later on.

Integrating the Azure Key Vault (AKV) Secrets Store CSI Driver into your Terraform code is the final step in removing sensitive data (like database passwords or API keys) from your Kubernetes manifests.

Here is the additional code to enable the driver and set up the necessary permissions.


1. Enable the CSI Driver in AKS

In your azurerm_kubernetes_cluster resource block (from the previous code), you need to add the key_vault_secrets_provider block:

Terraform

resource "azurerm_kubernetes_cluster" "aks" {
  # ... existing config ...
  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }
}

2. Create the Key Vault

You need a vault to actually store the secrets.

Terraform

resource "azurerm_key_vault" "kv" {
  name                        = "kv-prod-aks-01"
  location                    = azurerm_resource_group.aks_rg.location
  resource_group_name         = azurerm_resource_group.aks_rg.name
  enabled_for_disk_encryption = true
  tenant_id                   = data.azurerm_client_config.current.tenant_id
  sku_name                    = "standard"
  # Best practice: Don't use access policies, use RBAC
  enable_rbac_authorization = true
}
data "azurerm_client_config" "current" {}

3. Link AKS to Key Vault (The “Magic” Link)

When you enable the CSI driver add-on, AKS creates a user-assigned managed identity for it. You must give that identity permission to read secrets from the Key Vault.

Terraform

# Grant the add-on's managed identity read access to secrets in the vault
resource "azurerm_role_assignment" "aks_kv_reader" {
  scope                = azurerm_key_vault.kv.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_kubernetes_cluster.aks.key_vault_secrets_provider[0].secret_identity[0].object_id
}

4. Usage: The SecretProviderClass (K8s Manifest)

Terraform sets up the infrastructure, but you still need a small Kubernetes object to tell the pod which secrets to pull. You can apply this via kubectl or a Terraform kubernetes_manifest resource:

YAML

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-provider
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "<AKS_CSI_CLIENT_ID>" # Output this from Terraform
    keyvaultName: "kv-prod-aks-01"
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
    tenantId: "<YOUR_TENANT_ID>"
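
On the application side, the secret then shows up as a file named after objectName inside whatever directory the pod mounts the CSI volume at. The Python sketch below shows one way an app might read it. The mount path /mnt/secrets-store is an assumption (it is whatever mountPath the pod spec uses), and the helper name is hypothetical.

```python
from pathlib import Path


def read_secret(name: str, mount_dir: str = "/mnt/secrets-store") -> str:
    """Read one secret from the CSI-mounted volume.

    The file is read on every call instead of being cached, so a value
    rotated by the driver is picked up without restarting the pod.
    """
    return Path(mount_dir, name).read_text(encoding="utf-8").strip()


if __name__ == "__main__":
    try:
        password = read_secret("db-password")
        print("secret loaded, length:", len(password))
    except FileNotFoundError:
        print("secret volume not mounted (expected outside the cluster)")
```

Reading the file at connection time rather than once at startup is what lets the application benefit from rotation.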

Why this is a “Gold Standard” Setup

By using this approach, your Linux servers and Docker microservices become significantly more secure:

  • No “Cleartext” Secrets: Developers never see the production password.
  • Auto-Rotation: If you change the password in the Azure Portal/Key Vault, the CSI driver updates the mounted file inside the running container on its next poll (every 2 minutes with the rotation interval configured above).
  • Audit Trail: With Key Vault diagnostic logging enabled, every secret access is recorded and can be queried in Azure Monitor.

Pro-Tip for your Proposal

When talking to the client, use this phrasing:

“I am implementing a Zero-Trust Secret Architecture. This ensures that sensitive credentials never touch our code repository or container images, and they are rotated automatically to prevent long-term credential leakage.”

Understanding OADP: A Guide to OpenShift API for Data Protection

Here’s a thorough explanation of OADP — what it is, how it works, and how to use it.


What OADP is

The OpenShift API for Data Protection (OADP) product safeguards customer applications on OpenShift Container Platform. It offers comprehensive disaster recovery protection, covering OpenShift Container Platform applications, application-related cluster resources, persistent volumes, and internal images. OADP is also capable of backing up both containerized applications and virtual machines. However, OADP does not serve as a disaster recovery solution for etcd or OpenShift Operators.

In plain terms: OADP is the application-layer backup tool for OCP. Where etcd backup protects the cluster skeleton (all resource definitions), OADP protects what’s running inside namespaces — the actual workloads and their data.

OADP is the OpenShift API for Data Protection operator. This open source operator sets up and installs Velero on the OpenShift platform, allowing users to backup and restore applications.


OADP provides a comprehensive solution for backing up and restoring applications, persistent volumes, and custom resources across various environments.

In short: OADP = Velero + OpenShift-specific plugins + OLM lifecycle management. Everything is driven by Kubernetes CRs.


What OADP protects

Data that can be protected with OADP includes Kubernetes resource objects, persistent volumes, and internal images. More specifically:

  • Kubernetes objects — all resources in selected namespaces: Deployments, Services, ConfigMaps, Secrets, Routes, PVCs, RoleBindings, etc.
  • Internal container images — images stored in the OCP internal registry (built by S2I/Tekton and not pushed externally)
  • Persistent volume data — via CSI snapshots, cloud-native snapshots, or file-system backup (Kopia)
  • OpenShift Virtualization VMs — OADP can quiesce VMs, snapshot their disks, and restore them fully

What it does NOT protect: OADP does not serve as a disaster recovery solution for etcd or OpenShift Operators. OADP support is applicable to customer workload namespaces and cluster scope resources. Full cluster backup and restore are not supported.


Core components

Component | Role
OADP Operator | Installs/manages Velero and all CRDs via OLM. Runs in openshift-adp
Velero | The backup engine: serialises K8s resources, coordinates PV backup
Node agent (Kopia) | DaemonSet on every node; handles file-level PV backup
openshift plugin | OCP-specific handling for Routes, SCCs, internal registry images
csi plugin | Integrates with CSI VolumeSnapshot API for fast PV snapshots

Step 1 — Install

Install from OperatorHub or via CLI:

# Assumes the openshift-adp namespace and an OperatorGroup for it already exist
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: redhat-oadp-operator
  namespace: openshift-adp
spec:
  channel: stable-1.5
  name: redhat-oadp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
EOF

Step 2 — Configure via DataProtectionApplication CR

The DataProtectionApplication (DPA) is the master config CR. It tells OADP where to store backups, which plugins to load, and how to handle PV backup:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-cluster
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift # required; handles Routes, SCCs, internal images
        - aws # swap for gcp or azure as needed
        - csi # enables CSI volume snapshots
    nodeAgent:
      enable: true
      uploaderType: kopia # preferred over restic since OADP 1.3
  backupLocations:
    - name: default
      velero:
        provider: aws
        default: true
        credential:
          name: cloud-credentials
          key: cloud
        objectStorage:
          bucket: my-ocp-backups
          prefix: cluster-prod
        config:
          region: ca-central-1

For on-prem with ODF/NooBaa, use provider: aws with a custom s3Url pointing to the NooBaa S3 Route — no cloud account required.


Step 3 — Take backups

# One-time backup with a pre-hook to quiesce PostgreSQL
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-backup
  namespace: openshift-adp
spec:
  includedNamespaces: [my-app, my-app-db]
  excludedResources: [events, events.events.k8s.io]
  defaultVolumesToFsBackup: true # Kopia for PVs
  storageLocation: default
  ttl: 720h0m0s # 30-day retention
  hooks:
    resources:
      - name: quiesce-db
        includedNamespaces: [my-app-db]
        labelSelector:
          matchLabels:
            app: postgresql
        pre:
          - exec:
              container: postgresql
              command: ["/bin/bash", "-c", "psql -c 'CHECKPOINT'"]
              timeout: 30s

You can schedule backups at specified intervals. You can use hooks to run commands in a container on a pod, for example fsfreeze to freeze a file system. You can configure a hook to run before or after a backup or restore.

# Scheduled daily backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: openshift-adp
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces: ["*"]
    excludedNamespaces: [openshift-*, kube-*, openshift-adp]
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h0m0s # 7-day retention
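
The schedule and the ttl together determine roughly how many backups sit in the bucket at any time. A short Python sketch of that arithmetic follows. The helper names are hypothetical, the parser only handles the h/m/s form used in these manifests, and the result is a steady-state estimate because expired backups are removed by periodic garbage collection rather than instantly.

```python
import re
from datetime import timedelta

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_ttl(ttl: str) -> timedelta:
    """Parse a duration such as '168h0m0s' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", ttl)
    # Reject anything that is not made up purely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != ttl:
        raise ValueError(f"unsupported ttl: {ttl!r}")
    return timedelta(**{_UNITS[unit]: int(number) for number, unit in parts})


def backups_retained(ttl: str, interval: timedelta) -> int:
    """Approximate number of unexpired backups for a fixed schedule interval."""
    return parse_ttl(ttl) // interval


if __name__ == "__main__":
    daily = timedelta(days=1)
    print(backups_retained("168h0m0s", daily))
    print(backups_retained("720h0m0s", daily))
```

Multiplying this count by the typical backup size gives a first estimate of the object storage the schedule will consume.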

Step 4 — Restore

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: my-app-restore
  namespace: openshift-adp
spec:
  backupName: my-app-backup
  restorePVs: true
  existingResourcePolicy: none # skip resources that already exist

For cross-cluster disaster recovery, point the destination cluster’s DPA at the same S3 bucket with accessMode: ReadOnly, then create the Restore CR. OADP auto-creates the target namespace — don’t pre-create it, as that causes SCC conflicts.


PV backup — three strategies

The underlying mechanism within OADP that allows the backup and restore of persistent volumes is either Restic, Kopia, CSI snapshots, or CSI dataMover. Backups are incremental by default.

Strategy | Speed | Works on-prem | How
CSI snapshots | Fastest | Yes (Ceph RBD/FS) | Label a VolumeSnapshotClass with velero.io/csi-volumesnapshot-class: "true"
Native cloud snapshots | Fast | No | Configure snapshotLocations in DPA
Kopia (file-system backup) | Slower, incremental | Yes (any PV) | Set defaultVolumesToFsBackup: true in Backup CR

OADP 1.3 includes a built-in Data Mover that uses Kopia as the uploader mechanism to read snapshot data and write to a Unified Repository, allowing you to restore stateful applications from a remote object store if a failure or cluster corruption occurs.


Key limits and best practices

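  • Exclude events from backups; they are transient and recreated automatically. Pods and replicasets are recreated by their controllers too, but keep pods in the backup when you rely on file-system backup (Kopia) or hooks, because both operate through the pod
  • Test restores monthly: an untested backup is not a backup
  • Pair with etcd backup: OADP covers application data, etcd covers the cluster skeleton; both are needed for full DR
  • Use hooks for stateful apps (databases, message queues) to get application-consistent rather than merely crash-consistent backups
  • Monitor Velero’s Prometheus metrics at /metrics on the Velero pod and alert on failed or partially failed backups, as well as on any Backup whose status.phase is not Completed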
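
As a rough illustration of the monitoring point, the Python sketch below scans Prometheus text-format output for a failure counter and reports any series with a non-zero value. The metric name velero_backup_failure_total is an assumption to verify against the /metrics output of your own Velero version, and the function name is hypothetical. In production this logic would normally live in a Prometheus alert rule rather than a script.

```python
def failed_backup_counts(metrics_text: str,
                         metric: str = "velero_backup_failure_total") -> dict[str, float]:
    """Return the series of the given counter whose value is above zero."""
    counts = {}
    for line in metrics_text.splitlines():
        # '# HELP' and '# TYPE' lines start with '#', so they are skipped here.
        if not line.startswith(metric):
            continue
        series, _, value = line.rpartition(" ")
        counts[series] = float(value)
    return {series: value for series, value in counts.items() if value > 0}


if __name__ == "__main__":
    sample = "\n".join([
        "# TYPE velero_backup_failure_total counter",
        'velero_backup_failure_total{schedule=""} 0',
        'velero_backup_failure_total{schedule="daily-backup"} 2',
    ])
    print(failed_backup_counts(sample))
```

Because the value is a counter, an alert would usually fire on its increase over a time window rather than on the absolute number.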

Azure DR Test: Restore with OpenShift & OADP

Here’s a realistic Azure-specific DR test using OpenShift Container Platform + OpenShift API for Data Protection (OADP).

We’ll simulate a namespace + data loss and walk through a full restore using Azure Blob + Disk snapshots.


Scenario (Azure DR test)

❌ my-app namespace deleted
❌ PVC + data gone
❌ Need full recovery from backup

Environment:

  • OADP configured with Azure Blob
  • CSI snapshots enabled (Azure Disk)

What we’re restoring

  • Kubernetes resources (deployments, services, routes)
  • Persistent volumes (via Azure snapshots)
  • Application data

Flow overview

Backup (Blob + Disk Snapshot)
Namespace deleted ❌
Velero Restore triggered
Resources recreated
PVC restored from snapshot
App back online ✅

Step-by-step restore


Step 1: Confirm failure

oc get ns my-app

Should show:

Error from server (NotFound): namespaces "my-app" not found

Step 2: List available backups

oc get backup -n openshift-adp

Example:

azure-backup Completed

Step 3: Create restore

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-my-app
  namespace: openshift-adp
spec:
  backupName: azure-backup
  includedNamespaces:
    - my-app

Apply:

oc apply -f restore.yaml

Step 4: Watch restore progress

oc get restore -n openshift-adp

Detailed:

oc describe restore restore-my-app -n openshift-adp
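
In a scripted DR test you would typically poll the restore until it reaches a final phase instead of re-running the command by hand. A minimal Python sketch of such a loop is shown below. The function name and the injected get_phase callable are assumptions made so the logic can be exercised without a cluster; in practice get_phase could wrap oc get restore restore-my-app -n openshift-adp -o jsonpath={.status.phase}.

```python
import time
from typing import Callable

# Phases after which the Restore will not change any further.
TERMINAL_PHASES = {"Completed", "PartiallyFailed", "Failed", "FailedValidation"}


def wait_for_restore(get_phase: Callable[[], str], timeout_s: float = 1800,
                     interval_s: float = 10,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the restore reaches a terminal phase or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"restore still in phase {phase!r}")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated phase sequence; no cluster is contacted here.
    phases = iter(["New", "InProgress", "InProgress", "Completed"])
    print(wait_for_restore(lambda: next(phases), sleep=lambda s: None))
```

Treat anything other than Completed as a failed test, since PartiallyFailed usually means some resources or volumes did not come back.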

Step 5: Verify namespace restored

oc get ns my-app

Then:

oc get pods -n my-app

Step 6: Verify PVC restoration

oc get pvc -n my-app

Check:

  • Status = Bound

Step 7: Verify Azure disk restore

In Azure:

az disk list --resource-group <rg>

You should see:

  • restored disk from snapshot

Step 8: Check application

oc get routes -n my-app

Test:

curl http://<route>

What just happened

  1. OADP pulled metadata from Azure Blob
  2. Recreated Kubernetes objects
  3. Triggered Azure disk snapshot restore
  4. Reattached volumes to pods

Full app recovery


Real-world variations


Case 1: Partial restore

Restore only one resource:

includedResources:
- deployments

Case 2: Restore to different namespace

namespaceMapping:
  my-app: my-app-restore

Case 3: Restore without volumes

restorePVs: false

Azure-specific pitfalls

1. Missing snapshot permissions

→ restore fails silently or PVC stuck


2. Storage class mismatch

→ PVC stays Pending


3. Region mismatch

→ snapshot cannot attach


4. Private cluster networking

→ cannot reach Blob storage


Troubleshooting


Check restore logs

oc logs -n openshift-adp deployment/velero

Check events

oc get events -n my-app

Check PVC issues

oc describe pvc <pvc-name> -n my-app

Pro DR test (recommended)

Simulate:

  1. Backup app
  2. Delete namespace
  3. Restore
  4. Validate data integrity

Do this quarterly
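
For the data-integrity step, one simple approach is to record a checksum of every file on the volume before the backup and compare it after the restore. The Python sketch below illustrates the idea. The function names are hypothetical, and it assumes you can run it against the data directory (for example from a debug pod that mounts the PVC) both before and after.

```python
import hashlib
from pathlib import Path


def checksum_manifest(root: str) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files that are missing, new, or different between two manifests."""
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))


if __name__ == "__main__":
    # Compare a directory with itself as a trivial smoke test.
    snapshot = checksum_manifest(".")
    print(changed_files(snapshot, snapshot))
```

An empty result means every file came back byte-for-byte identical. For databases, an application-level check such as row counts is a useful complement, because files may legitimately differ after recovery.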


Advanced Azure DR test

Try:

  • Restore to new cluster in different region
  • Reconnect DNS
  • Validate external integrations

Key takeaway

  • Azure DR = Blob (metadata) + Disk snapshot (data)
  • OADP restores both together
  • Works for full or partial recovery

Step-by-Step Guide to Install OADP on OpenShift

Here’s a practical step-by-step OADP install for OpenShift, using AWS S3 as the backup location. This is the most common pattern and maps to Red Hat’s current OADP flow: install the OADP Operator, create the default credentials secret, then create a DataProtectionApplication (DPA). OADP is the supported OpenShift path for application backup/restore, and for PV snapshots your provider must support native snapshots or CSI snapshots. (Red Hat Documentation)

1. Prereqs

You need:

  • cluster-admin access
  • an S3 bucket
  • AWS credentials with access to the bucket
  • snapshot support if you want PV snapshots
  • oc logged into the cluster

OADP also requires a default credentials secret during installation. (Red Hat Documentation)
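Most of these prerequisites can be checked from a terminal before you start. The sketch below is one possible pre-flight check. check_prereqs is a hypothetical name, and it assumes the aws CLI on your workstation uses the same credentials you plan to give OADP, which may not be the case.

```shell
# Hypothetical pre-flight check. Assumes the oc and aws CLIs are installed
# and already authenticated; the bucket name is the first argument.
check_prereqs() {
  local bucket="$1" rc=0
  oc whoami >/dev/null 2>&1 \
    || { echo "oc: not logged in"; rc=1; }
  [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ] \
    || { echo "oc: current user lacks cluster-wide admin rights"; rc=1; }
  aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1 \
    || { echo "aws: cannot access bucket $bucket"; rc=1; }
  return "$rc"
}

# Usage:
#   check_prereqs YOUR_S3_BUCKET && echo "prerequisites look fine"
```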

2. Create the OADP namespace

oc create namespace openshift-adp

Red Hat’s OADP examples use openshift-adp as the namespace. (Red Hat Documentation)

3. Install the OADP Operator

In the OpenShift web console:

  • go to Operators → OperatorHub
  • search for OADP
  • open OpenShift API for Data Protection
  • click Install
  • install it into openshift-adp

Wait for the operator pod to be running:

oc get pods -n openshift-adp

The Red Hat flow is to install the OADP Operator first, then configure credentials and the DPA. (Red Hat Documentation)
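Besides watching the pod, you can confirm the install through the operator's ClusterServiceVersion. This sketch assumes the CSV name starts with oadp-operator, which you should verify on your cluster; oadp_csv_ready is a hypothetical helper.

```shell
# Hypothetical helper: read "name<TAB>phase" lines on stdin and succeed only
# if a CSV whose name starts with oadp-operator has phase Succeeded.
oadp_csv_ready() {
  awk -F '\t' '$1 ~ /^oadp-operator/ && $2 == "Succeeded" { found = 1 }
               END { exit !found }'
}

# Usage:
#   oc get csv -n openshift-adp \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#     | oadp_csv_ready && echo "operator installed"
```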

4. Create the AWS credentials file

Create a local file named credentials-velero:

cat <<'EOF' > credentials-velero
[default]
aws_access_key_id=YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key=YOUR_AWS_SECRET_ACCESS_KEY
EOF

Red Hat documents this credentials-velero pattern for AWS-backed OADP installs. (Red Hat Documentation)

5. Create the default OADP secret

Create the required secret in openshift-adp:

oc create secret generic cloud-credentials \
  -n openshift-adp \
  --from-file cloud=./credentials-velero

For AWS, the default secret name is cloud-credentials. Red Hat notes that the DPA install expects a default secret; otherwise installation fails. (Red Hat Documentation)
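A malformed credentials file is hard to diagnose once it is inside a secret, so it may be worth a quick sanity check before running oc create secret. The following is a minimal sketch that only checks the file's shape, not whether the keys are valid; validate_credentials_file is a hypothetical helper.

```shell
# Hypothetical helper: check that a credentials file has a [default] profile
# and non-empty aws_access_key_id and aws_secret_access_key entries.
validate_credentials_file() {
  local file="$1" key
  [ -r "$file" ] || { echo "cannot read $file"; return 1; }
  grep -q '^\[default\]' "$file" \
    || { echo "missing [default] profile"; return 1; }
  for key in aws_access_key_id aws_secret_access_key; do
    grep -Eq "^${key}[[:space:]]*=[[:space:]]*[^[:space:]]+" "$file" \
      || { echo "missing $key"; return 1; }
  done
}

# Usage:
#   validate_credentials_file ./credentials-velero
# After creating the secret, confirm it carries the expected "cloud" key:
#   oc get secret cloud-credentials -n openshift-adp -o jsonpath='{.data.cloud}' | base64 -d | head -n 1
```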

6. Create the DataProtectionApplication

Apply a DPA like this:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa
  namespace: openshift-adp
spec:
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: YOUR_S3_BUCKET
          prefix: ocp-backups
        config:
          region: us-east-1
  snapshotLocations:
    - velero:
        provider: aws
        config:
          region: us-east-1
  configuration:
    velero:
      defaultPlugins:
        - openshift
        - aws
        - csi

Apply it:

oc apply -f dpa.yaml

The DPA is the main OADP custom resource that wires backup storage and snapshot locations, and current OpenShift docs describe these OADP objects as the supported app backup path. (Red Hat Documentation)

7. Wait for OADP to become ready

Check the DPA and pods:

oc get dpa -n openshift-adp
oc get pods -n openshift-adp

You want the DPA to move to a ready state before creating backups. Red Hat’s backup flow requires the DataProtectionApplication to be Ready before backup CRs are used. (Red Hat Documentation)
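To check readiness from a script rather than by eye, one approach is to read the DPA's status conditions. This sketch assumes the DPA reports a condition of type Reconciled with status True once it has been processed, and that a healthy backup storage location then shows phase Available. Confirm both against your OADP version. dpa_reconciled is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the named DPA (default: dpa) reports
# a Reconciled condition with status True.
dpa_reconciled() {
  local name="${1:-dpa}" status
  status=$(oc get dpa "$name" -n openshift-adp \
    -o jsonpath='{.status.conditions[?(@.type=="Reconciled")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}

# Usage:
#   dpa_reconciled dpa && oc get backupstoragelocations -n openshift-adp
```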

8. Create your first backup

Once OADP is ready, back up a namespace:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app-backup
  namespace: openshift-adp
spec:
  includedNamespaces:
    - my-app
  snapshotVolumes: true
  ttl: 720h

Apply it:

oc apply -f backup.yaml

OADP uses Velero backup CRs for application backup and supports filtering by namespace, labels, or resource type. (Red Hat Documentation)

9. Check backup status

oc get backup -n openshift-adp
oc describe backup app-backup -n openshift-adp

This confirms whether the backup finished and whether volume snapshots were taken.
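For a one-line view of the same information, the status fields can be pulled out directly. This is a sketch only: backup_summary is a hypothetical helper, and the snapshot counters shown here cover native volume snapshots, so CSI snapshots may be reported in other status fields depending on your Velero version.

```shell
# Hypothetical helper: print phase, error and warning counts, and native
# snapshot counters for a Backup in openshift-adp.
backup_summary() {
  oc get backup "$1" -n openshift-adp \
    -o jsonpath='phase={.status.phase} errors={.status.errors} warnings={.status.warnings} snapshots={.status.volumeSnapshotsCompleted}/{.status.volumeSnapshotsAttempted}{"\n"}'
}

# Usage:
#   backup_summary app-backup
```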

10. Optional: schedule automatic backups

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: openshift-adp
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - my-app
    snapshotVolumes: true
    ttl: 720h

Apply it:

oc apply -f schedule.yaml

OADP supports scheduled Velero backups through Schedule objects. (Red Hat Documentation)
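Once the schedule has fired at least once, you can confirm it is producing backups by selecting on the label Velero attaches to scheduled backups. A small sketch follows, with schedule_backups as a hypothetical helper name.

```shell
# Hypothetical helper: list Backups created by a given Schedule, using the
# velero.io/schedule-name label that Velero sets on scheduled backups.
schedule_backups() {
  oc get backup -n openshift-adp -l "velero.io/schedule-name=$1"
}

# Usage:
#   oc get schedule -n openshift-adp
#   schedule_backups daily-backup
```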

11. Common mistakes

  • no default cloud-credentials secret
  • wrong bucket region
  • no snapshot support for your storage class
  • assuming OADP backs up etcd; it does not
  • installing into a namespace with an overly long name, which can cause secret-labeling issues in some OADP cases (Red Hat Documentation)

12. Minimal install checklist

oc create namespace openshift-adp
# install OADP Operator from OperatorHub
oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=./credentials-velero
oc apply -f dpa.yaml
oc get dpa -n openshift-adp
oc apply -f backup.yaml
oc get backup -n openshift-adp