Best Practices for OpenShift on Azure: ARO Guide


Azure Red Hat OpenShift (ARO) is a fully managed OpenShift 4 service jointly operated by Microsoft and Red Hat — both companies share responsibility for the control plane and infrastructure, and jointly back a 99.95% SLA.


1. Networking Best Practices

Always deploy a private cluster

A private ARO cluster hides the Kubernetes API server behind a private endpoint — no public IP, unreachable from the internet:

az aro create \
  --resource-group rg-aro \
  --name aro-prod \
  --vnet aro-spoke-vnet \
  --master-subnet master-subnet \
  --worker-subnet worker-subnet \
  --pull-secret @pull-secret.txt \
  --apiserver-visibility Private \
  --ingress-visibility Private    # ← both API server and default ingress private

Access to the private API server is then through Azure Bastion → jump host, or over ExpressRoute/VPN from on-premises.
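For day-to-day access, that hop can be scripted. A sketch, assuming hypothetical resource names (bastion-hub, rg-hub-network, rg-hub, vm-jump) — adjust to your environment:

```shell
# Placeholder names — substitute your own bastion, jump VM, and resource groups
az network bastion ssh \
  --name bastion-hub \
  --resource-group rg-hub-network \
  --target-resource-id "$(az vm show -g rg-hub -n vm-jump --query id -o tsv)" \
  --auth-type AAD

# From the jump host: fetch credentials and log in to the private API server
API_URL=$(az aro show -g rg-aro -n aro-prod --query apiserverProfile.url -o tsv)
KUBEADMIN_PW=$(az aro list-credentials -g rg-aro -n aro-prod --query kubeadminPassword -o tsv)
oc login "$API_URL" -u kubeadmin -p "$KUBEADMIN_PW"
```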


Subnet sizing — get this right before deployment (cannot resize after)

ARO consumes VNet IP addresses steadily — with the default overlay SDN (OVN-Kubernetes), pod IPs come from the cluster's pod CIDR rather than the node subnet, but nodes, load balancers, and private endpoints all draw from the subnets, and subnets cannot be resized after deployment:

Subnet             Minimum  Recommended  Notes
Master subnet      /27      /24          Fixed 3 masters — needs room for Azure infra IPs
Worker subnet      /27      /23 or /22   Size generously — cannot grow later
Ingress subnet     /28      /27          For LB / App Gateway front-end IPs
Private endpoints  /28      /27          One IP per private endpoint
Worker subnet sizing example:
  /23 = 512 addresses
  Azure reserves 5 per subnet → 507 usable
  One VNet IP per worker node (pod IPs come from the overlay pod CIDR)
  Default max pods per node: 250
  Plan for 3× the current node count for growth headroom
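The subnet arithmetic is easy to sanity-check in a shell:

```shell
# Usable IPs in a /23 subnet: 2^(32-23) minus the 5 addresses Azure reserves
prefix=23
total=$(( 1 << (32 - prefix) ))
usable=$(( total - 5 ))
echo "/$prefix: $total addresses, $usable usable"   # → /23: 512 addresses, 507 usable
```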

Egress lockdown via Azure Firewall

ARO requires outbound internet access for Red Hat update servers, telemetry, and image pulls from registry.redhat.io. Lock this down with Azure Firewall application rules rather than allowing all outbound traffic:

Azure Firewall Application Rules for ARO egress:
┌────────────────────┬─────────────────────────────┐
│ Name               │ Target FQDN                 │
├────────────────────┼─────────────────────────────┤
│ aro-rh-registry    │ registry.redhat.io          │
│                    │ registry.access.redhat.com  │
│                    │ quay.io                     │
│                    │ cdn.quay.io                 │
├────────────────────┼─────────────────────────────┤
│ aro-azure-services │ *.blob.core.windows.net     │
│                    │ *.servicebus.windows.net    │
│                    │ *.table.core.windows.net    │
├────────────────────┼─────────────────────────────┤
│ aro-monitoring     │ *.ods.opinsights.azure.com  │
│                    │ *.oms.opinsights.azure.com  │
├────────────────────┼─────────────────────────────┤
│ aro-rh-telemetry   │ cert-api.access.redhat.com  │
│                    │ api.access.redhat.com       │
└────────────────────┴─────────────────────────────┘

Apply a UDR on the master and worker subnets pointing 0.0.0.0/0 to the Azure Firewall's private IP — the same hub-and-spoke pattern as any other spoke workload.
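A sketch of that UDR setup — the route-table name and the firewall private IP (10.0.1.4) are placeholders for your hub:

```shell
# Create the route table and a default route via the firewall (IP is an assumption)
az network route-table create \
  --resource-group rg-aro-network \
  --name rt-aro-egress
az network route-table route create \
  --resource-group rg-aro-network \
  --route-table-name rt-aro-egress \
  --name default-via-firewall \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.4

# Associate the route table with both ARO subnets
for subnet in master-subnet worker-subnet; do
  az network vnet subnet update \
    --resource-group rg-aro-network \
    --vnet-name aro-spoke-vnet \
    --name "$subnet" \
    --route-table rt-aro-egress
done
```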


Use a custom DNS server

Point the ARO VNet DNS to your hub DNS Private Resolver so cluster nodes can resolve private endpoints and internal domains:

az network vnet update \
--resource-group rg-aro-network \
--name aro-spoke-vnet \
--dns-servers 10.0.5.4 # DNS Private Resolver inbound endpoint IP
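To confirm resolution works from inside the cluster, a throwaway debug pod is enough — the FQDN below is illustrative:

```shell
# Resolve a private-endpoint FQDN from a pod on the cluster network
oc run dns-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi -- \
  getent hosts acrprodaro.azurecr.io
```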

2. Availability and Resilience Best Practices

Spread across all three Availability Zones

ARO deploys 3 master nodes automatically — one per AZ. Workers must be explicitly spread across zones via MachineSets:

# MachineSet for AZ1 — replicate for AZ2, AZ3
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  replicas: 3
  template:
    spec:
      providerSpec:
        value:
          zone: "1"                  # AZ1
          vmSize: Standard_D16s_v3
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS

Create three MachineSets — one per zone — with equal replica counts. This ensures workloads survive a full AZ failure.
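Once the three MachineSets exist, verify the spread — machine names and zones come from your own cluster:

```shell
# Confirm replica counts per MachineSet, then check the ZONE column per machine
oc get machinesets -n openshift-machine-api
oc get machines -n openshift-machine-api -o wide
```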


Enable cluster autoscaler

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 24
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 8
  scaleTargetRef:
    kind: MachineSet
    name: aro-prod-worker-eastus-1
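After applying, both resources can be checked (names as in the example above — create one MachineAutoscaler per zone's MachineSet):

```shell
oc get clusterautoscaler default
oc get machineautoscaler -n openshift-machine-api
```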

Use zone-redundant storage for persistent volumes

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS                     # Zone-redundant storage — survives AZ failure
  cachingMode: ReadOnly
reclaimPolicy: Retain                      # Retain on PVC delete — prevents data loss
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Use Premium_ZRS instead of Premium_LRS for stateful workloads — ZRS replicates the disk synchronously across three AZs so a pod can reschedule to another zone without losing its data.
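A stateful workload then simply requests the class. A sketch — the PVC name, namespace, and size are illustrative:

```shell
# Hypothetical PVC bound to the zone-redundant class defined above
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data
  namespace: payments
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-premium-zrs
  resources:
    requests:
      storage: 64Gi
EOF
```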


3. Security Best Practices

Use Workload Identity (pod-level Azure RBAC)

Never put Azure credentials in pods. Use Workload Identity to give individual pods an Azure AD identity with scoped RBAC permissions:

# Enable workload identity on ARO cluster
az aro update \
--resource-group rg-aro \
--name aro-prod \
--enable-managed-identity
# Create a managed identity for a specific workload
az identity create \
--resource-group rg-aro-workloads \
--name id-payment-service
# Grant it only what it needs
az role assignment create \
--assignee <identity-client-id> \
--role "Key Vault Secrets User" \
--scope /subscriptions/.../vaults/kv-prod
# Annotate the service account (YAML)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: payments
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
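A pod opts in by using that service account together with the workload identity label, which tells the Azure Workload Identity webhook to inject the federated token. The pod name and image are illustrative:

```shell
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
  labels:
    azure.workload.identity/use: "true"   # webhook injects the federated token
spec:
  serviceAccountName: payment-service-sa
  containers:
  - name: app
    image: acrprodaro.azurecr.io/payment-service:latest
EOF
```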

Integrate Azure Key Vault for secrets via CSI driver

Never rely on OpenShift Secrets alone for sensitive values (base64 is encoding, not encryption). Use the Secrets Store CSI driver to mount Key Vault secrets directly into pods:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-secrets
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<managed-identity-client-id>"
    keyvaultName: kv-prod
    tenantID: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
        - |
          objectName: api-key
          objectType: secret
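Pods consume the SecretProviderClass through a CSI volume. A sketch — pod name, image, and mount path are illustrative:

```shell
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  containers:
  - name: app
    image: acrprodaro.azurecr.io/payments-api:latest
    volumeMounts:
    - name: kv-secrets
      mountPath: /mnt/secrets   # secrets appear here as files
      readOnly: true
  volumes:
  - name: kv-secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: azure-kv-secrets
EOF
```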

Integrate with Azure Container Registry via private endpoint

# Create ACR with private endpoint — no public access
az acr create \
--resource-group rg-aro \
--name acrprodaro \
--sku Premium \
--public-network-enabled false
# Private endpoint in ARO spoke
az network private-endpoint create \
--name pe-acr-prod \
--resource-group rg-aro-network \
--vnet-name aro-spoke-vnet \
--subnet private-endpoint-subnet \
--private-connection-resource-id $(az acr show --name acrprodaro --query id -o tsv) \
--group-id registry \
--connection-name pe-acr-conn
# Grant ARO pull access
az role assignment create \
--assignee <aro-kubelet-identity> \
--role AcrPull \
--scope $(az acr show --name acrprodaro --query id -o tsv)
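The private endpoint only works if privatelink.azurecr.io resolves inside the VNet — link the private DNS zone. The zone name is fixed by Azure; the resource-group and VNet names follow the example above:

```shell
# Create the ACR private DNS zone and link it to the ARO spoke VNet
az network private-dns zone create \
  --resource-group rg-aro-network \
  --name privatelink.azurecr.io
az network private-dns link vnet create \
  --resource-group rg-aro-network \
  --zone-name privatelink.azurecr.io \
  --name link-aro-spoke \
  --virtual-network aro-spoke-vnet \
  --registration-enabled false

# Auto-register the private endpoint's records in the zone
az network private-endpoint dns-zone-group create \
  --resource-group rg-aro-network \
  --endpoint-name pe-acr-prod \
  --name acr-zone-group \
  --private-dns-zone privatelink.azurecr.io \
  --zone-name registry
```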

Apply OpenShift Security Context Constraints (SCC)

Never run pods as root. Use the restricted-v2 SCC (default in OpenShift 4.11+):

apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    # no fixed runAsUser — restricted-v2 assigns a UID from the namespace's allocated range
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: true

Enable Microsoft Defender for Containers

az security pricing create \
--name Containers \
--tier Standard

Defender for Containers provides runtime threat detection, vulnerability scanning for images in ACR, and Kubernetes audit log analysis — all surfaced in Microsoft Defender for Cloud.


4. Observability Best Practices

Forward logs to Azure Monitor / Log Analytics

# Deploy the monitoring add-on via Helm
helm repo add microsoft https://microsoft.github.io/charts/repo
helm install azuremonitor-containers \
microsoft/azuremonitor-containers \
--set omsagent.secret.wsid=<workspace-id> \
--set omsagent.secret.key=<workspace-key> \
--namespace kube-system

Use Azure Monitor alerts for cluster health

Alert                 Metric                 Threshold
Node CPU pressure     cpuUsageNanoCores      > 85% for 5 min
Node memory pressure  memoryWorkingSetBytes  > 80% of capacity
Pod restart loop      restartCount           > 5 in 10 min
PVC near full         pvUsedBytes            > 85% of capacity
Node not ready        nodeCondition          NotReady > 2 min

5. Day-2 Operations Best Practices

Cluster upgrade strategy

ARO manages the control plane upgrade automatically — you control timing for worker nodes:

# Check the current version and available updates
oc adm upgrade
# Start the upgrade to a specific version during a maintenance window
oc adm upgrade --to=4.14.12

Use the EUS (Extended Update Support) channel for production clusters — even-numbered minor versions receive extended security patching, letting you stay on one minor version far longer and avoid the churn of frequent minor-version upgrades.


Worker node upgrades — limit concurrent disruption

# MachineConfigPool rolling-upgrade setting
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  maxUnavailable: 1   # Upgrade one worker node at a time

Upgrade workers one node at a time to maintain application availability — pods are gracefully drained before each node reboots into the new RHCOS version.
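Progress of both the cluster version and the worker pool can be watched during the window:

```shell
# Overall upgrade state, then per-pool rollout (UPDATING / DEGRADED columns)
oc get clusterversion
oc get machineconfigpool worker
oc get nodes   # nodes cycle through SchedulingDisabled as they drain and reboot
```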


Summary — ARO Best Practice Checklist

Category       Practice
Network        Private cluster — no public API or ingress
Network        Egress via Azure Firewall with FQDN allow-list
Network        DNS Private Resolver for private endpoint resolution
Network        Worker subnet /22 or larger — never resize after deployment
Availability   Workers spread across AZs via 3 MachineSets
Availability   Cluster autoscaler — min 3 nodes per zone
Availability   Premium_ZRS disks for stateful workloads
Availability   Zone-redundant Azure Load Balancer
Security       Workload Identity — no credentials in pods
Security       Key Vault + CSI driver — no base64 secrets
Security       ACR via private endpoint — no public pull
Security       SCC restricted-v2 — no root containers
Security       Defender for Containers enabled
Observability  Container Insights → Log Analytics
Observability  Azure Monitor alerts on node and pod health
Operations     EUS channel for production stability
Operations     maxUnavailable: 1 for worker upgrades
