Best Practices for OpenShift on Azure: ARO Guide


Azure Red Hat OpenShift (ARO) is a fully managed OpenShift 4 service jointly operated by Microsoft and Red Hat — both companies share responsibility for the control plane and infrastructure, and jointly back a 99.95% SLA.


1. Networking Best Practices

Always deploy a private cluster

A private ARO cluster hides the Kubernetes API server behind a private endpoint — no public IP, unreachable from the internet:

az aro create \
  --resource-group rg-aro \
  --name aro-prod \
  --vnet aro-spoke-vnet \
  --master-subnet master-subnet \
  --worker-subnet worker-subnet \
  --pull-secret @pull-secret.txt \
  --apiserver-visibility Private \
  --ingress-visibility Private    # ← both API server and default ingress private

Access to the private API server is then through Azure Bastion → jump host, or over ExpressRoute/VPN from on-premises.
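For day-to-day access, that hop can be scripted. A sketch, assuming hypothetical resource names (bastion-hub, rg-hub-network, rg-hub, vm-jump) — adjust to your environment:

```shell
# Placeholder names — substitute your own bastion, jump VM, and resource groups
az network bastion ssh \
  --name bastion-hub \
  --resource-group rg-hub-network \
  --target-resource-id "$(az vm show -g rg-hub -n vm-jump --query id -o tsv)" \
  --auth-type AAD

# From the jump host: fetch credentials and log in to the private API server
API_URL=$(az aro show -g rg-aro -n aro-prod --query apiserverProfile.url -o tsv)
KUBEADMIN_PW=$(az aro list-credentials -g rg-aro -n aro-prod --query kubeadminPassword -o tsv)
oc login "$API_URL" -u kubeadmin -p "$KUBEADMIN_PW"
```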


Subnet sizing — get this right before deployment (cannot resize after)

ARO consumes VNet IP addresses steadily — with the default overlay SDN (OVN-Kubernetes), pod IPs come from the cluster's pod CIDR rather than the node subnet, but nodes, load balancers, and private endpoints all draw from the subnets, and subnets cannot be resized after deployment:

Subnet             Minimum  Recommended  Notes
Master subnet      /27      /24          Fixed 3 masters — needs room for Azure infra IPs
Worker subnet      /27      /23 or /22   Size generously — cannot grow later
Ingress subnet     /28      /27          For LB / App Gateway front-end IPs
Private endpoints  /28      /27          One IP per private endpoint
Worker subnet sizing example:
  /23 = 512 addresses
  Azure reserves 5 per subnet → 507 usable
  One VNet IP per worker node (pod IPs come from the overlay pod CIDR)
  Default max pods per node: 250
  Plan for 3× the current node count for growth headroom
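The subnet arithmetic is easy to sanity-check in a shell:

```shell
# Usable IPs in a /23 subnet: 2^(32-23) minus the 5 addresses Azure reserves
prefix=23
total=$(( 1 << (32 - prefix) ))
usable=$(( total - 5 ))
echo "/$prefix: $total addresses, $usable usable"   # → /23: 512 addresses, 507 usable
```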

Egress lockdown via Azure Firewall

ARO requires outbound internet access for Red Hat update servers, telemetry, and image pulls from registry.redhat.io. Lock this down with Azure Firewall application rules rather than allowing all outbound traffic:

Azure Firewall Application Rules for ARO egress:
┌────────────────────┬─────────────────────────────┐
│ Name               │ Target FQDN                 │
├────────────────────┼─────────────────────────────┤
│ aro-rh-registry    │ registry.redhat.io          │
│                    │ registry.access.redhat.com  │
│                    │ quay.io                     │
│                    │ cdn.quay.io                 │
├────────────────────┼─────────────────────────────┤
│ aro-azure-services │ *.blob.core.windows.net     │
│                    │ *.servicebus.windows.net    │
│                    │ *.table.core.windows.net    │
├────────────────────┼─────────────────────────────┤
│ aro-monitoring     │ *.ods.opinsights.azure.com  │
│                    │ *.oms.opinsights.azure.com  │
├────────────────────┼─────────────────────────────┤
│ aro-rh-telemetry   │ cert-api.access.redhat.com  │
│                    │ api.access.redhat.com       │
└────────────────────┴─────────────────────────────┘

Apply a UDR on the master and worker subnets pointing 0.0.0.0/0 to the Azure Firewall's private IP — the same hub-and-spoke pattern as any other spoke workload.
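A sketch of that UDR setup — the route-table name and the firewall private IP (10.0.1.4) are placeholders for your hub:

```shell
# Create the route table and a default route via the firewall (IP is an assumption)
az network route-table create \
  --resource-group rg-aro-network \
  --name rt-aro-egress
az network route-table route create \
  --resource-group rg-aro-network \
  --route-table-name rt-aro-egress \
  --name default-via-firewall \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.4

# Associate the route table with both ARO subnets
for subnet in master-subnet worker-subnet; do
  az network vnet subnet update \
    --resource-group rg-aro-network \
    --vnet-name aro-spoke-vnet \
    --name "$subnet" \
    --route-table rt-aro-egress
done
```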


Use a custom DNS server

Point the ARO VNet DNS to your hub DNS Private Resolver so cluster nodes can resolve private endpoints and internal domains:

az network vnet update \
--resource-group rg-aro-network \
--name aro-spoke-vnet \
--dns-servers 10.0.5.4 # DNS Private Resolver inbound endpoint IP
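To confirm resolution works from inside the cluster, a throwaway debug pod is enough — the FQDN below is illustrative:

```shell
# Resolve a private-endpoint FQDN from a pod on the cluster network
oc run dns-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi9/ubi -- \
  getent hosts acrprodaro.azurecr.io
```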

2. Availability and Resilience Best Practices

Spread across all three Availability Zones

ARO deploys 3 master nodes automatically — one per AZ. Workers must be explicitly spread across zones via MachineSets:

# MachineSet for AZ1 — replicate for AZ2, AZ3
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  replicas: 3
  template:
    spec:
      providerSpec:
        value:
          zone: "1"                  # AZ1
          vmSize: Standard_D16s_v3
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS

Create three MachineSets — one per zone — with equal replica counts. This ensures workloads survive a full AZ failure.
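Once the three MachineSets exist, verify the spread — machine names and zones come from your own cluster:

```shell
# Confirm replica counts per MachineSet, then check the ZONE column per machine
oc get machinesets -n openshift-machine-api
oc get machines -n openshift-machine-api -o wide
```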


Enable cluster autoscaler

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    maxNodesTotal: 24
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: aro-prod-worker-eastus-1
  namespace: openshift-machine-api
spec:
  minReplicas: 3
  maxReplicas: 8
  scaleTargetRef:
    kind: MachineSet
    name: aro-prod-worker-eastus-1
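After applying, both resources can be checked (names as in the example above — create one MachineAutoscaler per zone's MachineSet):

```shell
oc get clusterautoscaler default
oc get machineautoscaler -n openshift-machine-api
```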

Use zone-redundant storage for persistent volumes

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS                     # Zone-redundant storage — survives AZ failure
  cachingMode: ReadOnly
reclaimPolicy: Retain                      # Retain on PVC delete — prevents data loss
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Use Premium_ZRS instead of Premium_LRS for stateful workloads — ZRS replicates the disk synchronously across three AZs so a pod can reschedule to another zone without losing its data.
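A stateful workload then simply requests the class. A sketch — the PVC name, namespace, and size are illustrative:

```shell
# Hypothetical PVC bound to the zone-redundant class defined above
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data
  namespace: payments
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-premium-zrs
  resources:
    requests:
      storage: 64Gi
EOF
```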


3. Security Best Practices

Use Workload Identity (pod-level Azure RBAC)

Never put Azure credentials in pods. Use Workload Identity to give individual pods an Azure AD identity with scoped RBAC permissions:

# Enable workload identity on ARO cluster
az aro update \
--resource-group rg-aro \
--name aro-prod \
--enable-managed-identity
# Create a managed identity for a specific workload
az identity create \
--resource-group rg-aro-workloads \
--name id-payment-service
# Grant it only what it needs
az role assignment create \
--assignee <identity-client-id> \
--role "Key Vault Secrets User" \
--scope /subscriptions/.../vaults/kv-prod
# Annotate the service account (YAML)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: payments
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
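A pod opts in by using that service account together with the workload identity label, which tells the Azure Workload Identity webhook to inject the federated token. The pod name and image are illustrative:

```shell
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
  labels:
    azure.workload.identity/use: "true"   # webhook injects the federated token
spec:
  serviceAccountName: payment-service-sa
  containers:
  - name: app
    image: acrprodaro.azurecr.io/payment-service:latest
EOF
```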

Integrate Azure Key Vault for secrets via CSI driver

Never rely on OpenShift Secrets alone for sensitive values (base64 is encoding, not encryption). Use the Secrets Store CSI driver to mount Key Vault secrets directly into pods:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-secrets
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<managed-identity-client-id>"
    keyvaultName: kv-prod
    tenantID: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
        - |
          objectName: api-key
          objectType: secret
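Pods consume the SecretProviderClass through a CSI volume. A sketch — pod name, image, and mount path are illustrative:

```shell
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  namespace: payments
spec:
  containers:
  - name: app
    image: acrprodaro.azurecr.io/payments-api:latest
    volumeMounts:
    - name: kv-secrets
      mountPath: /mnt/secrets   # secrets appear here as files
      readOnly: true
  volumes:
  - name: kv-secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: azure-kv-secrets
EOF
```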

Integrate with Azure Container Registry via private endpoint

# Create ACR with private endpoint — no public access
az acr create \
--resource-group rg-aro \
--name acrprodaro \
--sku Premium \
--public-network-enabled false
# Private endpoint in ARO spoke
az network private-endpoint create \
--name pe-acr-prod \
--resource-group rg-aro-network \
--vnet-name aro-spoke-vnet \
--subnet private-endpoint-subnet \
--private-connection-resource-id $(az acr show --name acrprodaro --query id -o tsv) \
--group-id registry \
--connection-name pe-acr-conn
# Grant ARO pull access
az role assignment create \
--assignee <aro-kubelet-identity> \
--role AcrPull \
--scope $(az acr show --name acrprodaro --query id -o tsv)
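The private endpoint only works if privatelink.azurecr.io resolves inside the VNet — link the private DNS zone. The zone name is fixed by Azure; the resource-group and VNet names follow the example above:

```shell
# Create the ACR private DNS zone and link it to the ARO spoke VNet
az network private-dns zone create \
  --resource-group rg-aro-network \
  --name privatelink.azurecr.io
az network private-dns link vnet create \
  --resource-group rg-aro-network \
  --zone-name privatelink.azurecr.io \
  --name link-aro-spoke \
  --virtual-network aro-spoke-vnet \
  --registration-enabled false

# Auto-register the private endpoint's records in the zone
az network private-endpoint dns-zone-group create \
  --resource-group rg-aro-network \
  --endpoint-name pe-acr-prod \
  --name acr-zone-group \
  --private-dns-zone privatelink.azurecr.io \
  --zone-name registry
```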

Apply OpenShift Security Context Constraints (SCC)

Never run pods as root. Use the restricted-v2 SCC (default in OpenShift 4.11+):

apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    # no fixed runAsUser — restricted-v2 assigns a UID from the namespace's allocated range
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      readOnlyRootFilesystem: true

Enable Microsoft Defender for Containers

az security pricing create \
--name Containers \
--tier Standard

Defender for Containers provides runtime threat detection, vulnerability scanning for images in ACR, and Kubernetes audit log analysis — all surfaced in Microsoft Defender for Cloud.


4. Observability Best Practices

Forward logs to Azure Monitor / Log Analytics

# Deploy the monitoring add-on via Helm
helm repo add microsoft https://microsoft.github.io/charts/repo
helm install azuremonitor-containers \
microsoft/azuremonitor-containers \
--set omsagent.secret.wsid=<workspace-id> \
--set omsagent.secret.key=<workspace-key> \
--namespace kube-system

Use Azure Monitor alerts for cluster health

Alert                 Metric                 Threshold
Node CPU pressure     cpuUsageNanoCores      > 85% for 5 min
Node memory pressure  memoryWorkingSetBytes  > 80% of capacity
Pod restart loop      restartCount           > 5 in 10 min
PVC near full         pvUsedBytes            > 85% of capacity
Node not ready        nodeCondition          NotReady > 2 min

5. Day-2 Operations Best Practices

Cluster upgrade strategy

ARO manages the control plane upgrade automatically — you control timing for worker nodes:

# Check the current version and available updates
oc adm upgrade
# Start the upgrade to a specific version during a maintenance window
oc adm upgrade --to=4.14.12

Use the EUS (Extended Update Support) channel for production clusters — even-numbered minor versions receive extended security patching, letting you stay on one minor version far longer and avoid the churn of frequent minor-version upgrades.


Worker node upgrades — limit concurrent disruption

# MachineConfigPool rolling-upgrade setting
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  maxUnavailable: 1   # Upgrade one worker node at a time

Upgrade workers one node at a time to maintain application availability — pods are gracefully drained before each node reboots into the new RHCOS version.
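Progress of both the cluster version and the worker pool can be watched during the window:

```shell
# Overall upgrade state, then per-pool rollout (UPDATING / DEGRADED columns)
oc get clusterversion
oc get machineconfigpool worker
oc get nodes   # nodes cycle through SchedulingDisabled as they drain and reboot
```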


Summary — ARO Best Practice Checklist

Category       Practice
Network        Private cluster — no public API or ingress
Network        Egress via Azure Firewall with FQDN allow-list
Network        DNS Private Resolver for private endpoint resolution
Network        Worker subnet /22 or larger — never resize after deployment
Availability   Workers spread across AZs via 3 MachineSets
Availability   Cluster autoscaler — min 3 nodes per zone
Availability   Premium_ZRS disks for stateful workloads
Availability   Zone-redundant Azure Load Balancer
Security       Workload Identity — no credentials in pods
Security       Key Vault + CSI driver — no base64 secrets
Security       ACR via private endpoint — no public pull
Security       SCC restricted-v2 — no root containers
Security       Defender for Containers enabled
Observability  Container Insights → Log Analytics
Observability  Azure Monitor alerts on node and pod health
Operations     EUS channel for production stability
Operations     maxUnavailable: 1 for worker upgrades
