Mastering OpenShift on VMware and Bare Metal: Key Insights

Administering OpenShift on VMware vSphere or Bare Metal is significantly more complex than cloud environments because you are responsible for the “underlay” (the physical or virtual infrastructure) as well as the “overlay” (OpenShift).

In a 2026 interview, expect a focus on automation, connectivity in restricted environments, and hardware lifecycle.


1. Installation & Provisioning (The Foundation)

Q1: Compare IPI vs. UPI in the context of VMware vSphere.

  • IPI (Installer-Provisioned Infrastructure): The installer has the vCenter credentials. It automatically creates the Folder, Virtual Machines, and Resource Pools. It also handles the VIPs (Virtual IPs) for the API and Ingress via Keepalived.
  • UPI (User-Provisioned Infrastructure): You manually create the VMs, set up the Load Balancers (F5, HAProxy), and configure DNS.
  • Interview Tip: Mention that IPI is preferred for speed and “automated scaling,” but UPI is often mandatory in “Brownfield” environments where the networking team won’t give the installer full control over the VLANs.
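For reference, the vSphere-specific part of an IPI install lives in install-config.yaml. A minimal sketch — field names vary by OCP version (newer releases use vcenters/failureDomains), and every value here is a placeholder:

```yaml
# Sketch of the vSphere platform stanza for an IPI install.
# All hostnames, names, and IPs are examples, not real values.
apiVersion: v1
baseDomain: example.com
metadata:
  name: ocp-prod
platform:
  vsphere:
    vCenter: vcenter.example.com
    username: administrator@vsphere.local
    password: <redacted>
    datacenter: DC1
    defaultDatastore: datastore1
    cluster: Cluster1
    network: VM Network
    apiVIPs:
      - 10.0.0.5      # floating IP for the API (port 6443)
    ingressVIPs:
      - 10.0.0.6      # floating IP for application routes (80/443)
pullSecret: '<your-pull-secret>'
```

The two VIPs are exactly what Keepalived floats between nodes, which is why IPI asks for them up front.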

Q2: How does OpenShift interact with physical hardware for Bare Metal?

Answer: It uses the Metal3 project and the Bare Metal Operator (BMO).

  • The admin provides the BMC (Baseboard Management Controller) details—like IPMI, iDRAC (Dell), or iLO (HP)—to OpenShift.
  • OpenShift uses these to remotely power on the server, PXE boot it, and install RHCOS (Red Hat Enterprise Linux CoreOS).
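The steps above are driven by a BareMetalHost resource from the Metal3 API. A sketch with hypothetical names, an example iDRAC address, and an example MAC:

```yaml
# Secret holding the BMC login for one physical server (example credentials)
apiVersion: v1
kind: Secret
metadata:
  name: worker-3-bmc-secret
  namespace: openshift-machine-api
type: Opaque
stringData:
  username: root
  password: calvin
---
# The BareMetalHost tells the Bare Metal Operator how to reach the server
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-3
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: "52:54:00:ab:cd:ef"
  bmc:
    address: idrac-virtualmedia://10.0.10.23/redfish/v1/Systems/System.Embedded.1
    credentialsName: worker-3-bmc-secret
```

Once this object exists, the BMO powers the machine on through the BMC and provisions RHCOS onto it.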

2. Infrastructure Operations

Q3: What is a “Disconnected” (Air-Gapped) Installation?

Answer: Common in on-prem data centers with high security.

  • The Problem: OpenShift usually pulls images from quay.io.
  • The Solution: You must set up a Local Mirror Registry (like Red Hat Quay or Sonatype Nexus).
  • Process: You use the oc-mirror plugin to download all required images to a portable disk, move it inside the secure zone, and push them to your local registry. You then configure the cluster with an ImageContentSourcePolicy (or, in newer releases, an ImageDigestMirrorSet) to redirect image pulls to your local registry.
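The oc-mirror plugin is driven by an ImageSetConfiguration file. A sketch — the channel, versions, registry hostname, and operator list are illustrative, and the command flags are abbreviated (see the oc-mirror documentation for the full invocation):

```yaml
# Sketch of an oc-mirror input for a disconnected install.
#
# Mirror to disk (on a connected host), roughly:
#   oc mirror --config=imageset-config.yaml file://./mirror-archive
# Then, inside the secure zone:
#   oc mirror --from=./mirror-archive docker://registry.internal:5000
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: ./mirror-metadata
mirror:
  platform:
    channels:
      - name: stable-4.14
        minVersion: 4.14.2
        maxVersion: 4.14.2
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: local-storage-operator
```

oc-mirror also generates the ImageContentSourcePolicy/ImageDigestMirrorSet manifests for you as part of the push step.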

Q4: How do you handle storage on VMware vs. Bare Metal?

  • VMware: Use the vSphere CSI Driver. This allows OpenShift to talk to vCenter and dynamically provision .vmdk files as Persistent Volumes (PVs).
  • Bare Metal: You typically use the Local Storage Operator or LVM Storage for fast local SSDs, or OpenShift Data Foundation (ODF) (based on Ceph). ODF is the de facto on-prem standard because it provides S3-compatible object, Block, and File storage within the cluster itself.
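On VMware, dynamic provisioning is wired up through a StorageClass that points at the vSphere CSI driver. A sketch — the SPBM policy name is an example, and recent clusters already ship a default thin-csi class:

```yaml
# Example StorageClass backed by the vSphere CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-fast
provisioner: csi.vsphere.vmware.com
parameters:
  StoragePolicyName: "gold-storage-policy"   # vCenter SPBM policy; name is a placeholder
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Any PVC that names this class triggers the CSI driver to create a VMDK on a datastore matching the policy.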

3. High Availability & Networking

Q5: On Bare Metal, how do you handle Load Balancing for the API and Ingress?

Answer: Since there is no “AWS ELB” on-prem, you have two choices:

  1. External: Use a physical appliance like an F5 Big-IP or a pair of HAProxy nodes managed by your team.
  2. Internal (MetalLB): Use the MetalLB Operator. You give it a range of IPs from your corporate network; Services of type LoadBalancer then receive real, routable IPs, so the cluster can behave like a cloud load balancer.
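A minimal MetalLB configuration could look like this — the address range is a placeholder from your data center VLAN:

```yaml
# Pool of real IPs MetalLB may hand out to LoadBalancer Services
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: datacenter-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.200-192.168.10.220
---
# Announce those IPs via ARP (Layer 2 mode)
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - datacenter-pool
```

With this in place, creating a Service of type LoadBalancer gets it an external IP from the pool automatically.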

Q6: What happens if a Master (Control Plane) node dies in a Bare Metal cluster?

Answer:

  • Quorum: You must have 3 Masters to maintain an etcd quorum. If one dies, the cluster survives. If two die, etcd loses quorum: writes stop and the API becomes unavailable.
  • Recovery: On Bare Metal, recovery is manual. You must reinstall the OS, use etcdctl (driven by the cluster-etcd-operator) to remove the old member, and then approve the new node’s CSRs so it can rejoin the etcd ring.
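The quorum arithmetic behind this answer can be checked directly:

```shell
# Quorum rule: a cluster of n etcd members needs floor(n/2) + 1 healthy
# members to accept writes, and therefore tolerates floor((n-1)/2) failures.
for n in 1 3 5; do
  echo "members=$n quorum=$(( n / 2 + 1 )) tolerated_failures=$(( (n - 1) / 2 ))"
done
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

This is why 3 masters survive one failure but not two, and why control planes always use odd member counts.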

4. Advanced Troubleshooting

Q7: A worker node is “NotReady” on VMware. What is your first check?

Answer: Beyond the logs, I check the VMware Tools status and Time Sync.

  • If the ESXi host and the VM have a clock drift (common if NTP is misconfigured), the certificates for the Kubelet will fail to validate, and the node will go NotReady.
  • I would also check the MachineConfigPool (MCP). If the node is stuck in “Updating,” it might be failing to pull an OS image from the internal registry.

Q8: What is “Assisted Installer”?

Answer: It’s the modern way to install OpenShift on-prem. It provides a web-based GUI that generates a “Discovery ISO.” You boot your physical servers with this ISO; they “check in” to the portal, and you can then click “Install” to deploy the whole cluster without writing complex YAML files.


Technical “Buzzwords” for 2026:

  • OVN-Kubernetes: The default network plugin (replaces OpenShift SDN).
  • LVM Storage: Used for high-performance databases on bare metal.
  • Red Hat Advanced Cluster Management (RHACM): If the company has multiple on-prem clusters, they will use this to manage them all from one dashboard.

Debugging etcd is the highest level of OpenShift administration. If etcd is healthy, the cluster is healthy; if etcd is failing, the API will be sluggish or completely unresponsive.

Here is the technical deep-dive on how to diagnose and fix etcd on-premise.


1. Checking the High-Level Status

Before diving into logs, check if the Etcd Operator is happy. If the operator is degraded, it usually means it’s struggling to manage the quorum.

Bash

# Check the status of the etcd cluster operator
oc get clusteroperator etcd
# Check the status of the individual etcd pods
oc get pods -n openshift-etcd -l app=etcd

2. Testing Quorum and Health (The etcdctl way)

In OpenShift 4.x, etcd runs as Static Pods on the master nodes. To run diagnostic commands, you must use a helper script or exec into the container.

The “Is it alive?” check:

Bash

# Get a list of etcd members and their health
oc rsh -n openshift-etcd etcd-master-0 etcdctl endpoint health --cluster -w table

The Performance check (Disk Latency):

On-premise (especially VMware), Disk I/O latency is the #1 killer of etcd. If your storage is slow, etcd will lose quorum.

Bash

# Check the sync duration
oc rsh -n openshift-etcd etcd-master-0 etcdctl check perf

Interview Pro-Tip: Mention that etcd requires fsync latency of less than 10ms. If it’s higher, your underlying VMware datastore or Bare Metal disks are too slow for an enterprise cluster.


3. Investigating Logs

If a pod is crashing, check the logs specifically for “leader” issues or “wal” (Write Ahead Log) errors.

Bash

# View the last 100 lines of logs from a specific member
oc logs -n openshift-etcd etcd-master-0 -c etcd --tail=100

What to look for:

  • "lost leader": Indicates network instability between master nodes.
  • "apply entries took too long": Indicates slow disk or high CPU pressure on the master node.
  • "database space exceeded": The 8GB quota has been reached (requires a defrag).
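To triage quickly, you can match all three signatures in one pass. A sketch against a fabricated sample log — in a real cluster the input would come from the oc logs command above:

```shell
# Fabricated sample log, for demonstration only
cat <<'EOF' > /tmp/etcd-sample.log
raft.node: 6f2c lost leader 8a1b at term 42
apply entries took too long [120ms] to execute
mvcc: database space exceeded
compaction completed
EOF

# One pass over the log, matching the three classic failure signatures
grep -E 'lost leader|took too long|database space exceeded' /tmp/etcd-sample.log
```

Only the first three lines match; routine messages like the compaction notice are filtered out.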

4. Critical Recovery: The “Master Node Replacement”

If a master node (e.g., master-1) hardware fails permanently on Bare Metal, you must follow these steps to restore the cluster health:

  1. Remove the ghost member: tell etcd to stop looking for the dead node.

Bash

# List the members, then remove the dead one by its ID
oc rsh -n openshift-etcd etcd-master-0 etcdctl member list
oc rsh -n openshift-etcd etcd-master-0 etcdctl member remove <dead-member-id>

  2. Clean up the Node object: oc delete node master-1
  3. Re-provision: Boot the new hardware with the RHCOS ISO. If using IPI, the Machine API might do this for you. If UPI, you must manually trigger the CSR (Certificate Signing Request) approval.
  4. Approve CSRs: the new master won’t join until you approve its certificates:

Bash

oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve

5. Compaction and Defragmentation

Over time, etcd keeps versions of objects. If the database grows too large, the cluster will stop accepting writes (Error: mvcc: database space exceeded).

The Fix:

OpenShift normally handles this automatically, but as an admin, you might need to force it:

Bash

# Defragment the endpoint
oc rsh -n openshift-etcd etcd-master-0 etcdctl defrag --cluster

The “Final Boss” Interview Question:

“We lost 2 out of 3 master nodes. The API is down. How do you recover?”

The Answer:

  1. Since quorum is lost (a cluster of n members needs floor(n/2) + 1 healthy ones), you must perform a Single Master Recovery.
  2. Stop the etcd service on the remaining healthy master.
  3. Run the cluster-restore.sh script (shipped with OpenShift in /usr/local/bin) using a previous backup.
  4. This forces the remaining master to become a “New Cluster” of one.
  5. Once the API is back up, you re-join the other two nodes as brand-new members.

Since OpenShift 4.12+, OVN-Kubernetes has become the default network provider, replacing the older OpenShift SDN. For an on-premise administrator, understanding this is vital because it changes how traffic flows from your physical switches into your pods.


1. OVN-Kubernetes Architecture

Unlike the old SDN which used Open vSwitch (OVS) in a basic way, OVN (Open Virtual Network) brings a distributed logical router and switch to every node.

  • Geneve Encap: OVN uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN to tunnel traffic between nodes. It’s more flexible and allows for more metadata.
  • The Gateway: Every node has a “Gateway” that handles traffic entering and exiting the cluster. On-premise, this is where your physical network interface (e.g., eno1 or ens192) meets the virtual world.

2. On-Premise Networking Challenges

Q1: How does OpenShift handle “External” IPs on-prem?

In the cloud, you have a LoadBalancer service. On-prem, you don’t.

The Admin Solution: MetalLB.

As an admin, you configure a MetalLB Operator with an IP address pool from your actual data center VLAN. When a developer creates a Service of type LoadBalancer, MetalLB uses ARP (Layer 2) or BGP (Layer 3) to announce that IP address to your physical routers.

Q2: What is the “Ingress VIP” and “API VIP”?

During a VMware/Bare Metal IPI install, you are asked for two IPs:

  1. API VIP: The floating IP used to talk to the control plane (Port 6443).
  2. Ingress VIP: The floating IP for all application traffic (Ports 80/443).

Mechanism: OpenShift uses Keepalived and HAProxy internally to float these IPs between the master nodes (for API) or worker nodes (for Ingress). If the node holding the IP fails, it “floats” to another node in seconds.

3. Troubleshooting the Network

If pods can’t talk to each other, follow this “inside-out” path:

Step 1: Check the Cluster Network Operator (CNO)

If the CNO is degraded, the entire network is unstable.

Bash

oc get clusteroperator network

Step 2: Check the built-in connectivity probes

OpenShift continuously runs pod-to-pod and pod-to-API connectivity checks and records the results as PodNetworkConnectivityCheck objects:

Bash

oc get podnetworkconnectivitycheck -n openshift-network-diagnostics

Step 3: Inspect the OVN Database

Since OVN stores the network state in a database (Northbound and Southbound DBs), you can check if the logical flows are actually created.

Bash

# Get the logs of the OVN control plane
# (ovnkube-master on older releases; ovnkube-control-plane on 4.14+)
oc logs -n openshift-ovn-kubernetes -l app=ovnkube-master

4. Key Concepts for Interview Scenarios

Scenario: “Applications are slow only when talking to external databases.”

  • Likely Culprit: MTU Mismatch.
  • Explanation: Geneve encapsulation adds roughly 100 bytes of overhead to every packet. If your physical network is set to the standard MTU (1500) and OpenShift also sends 1500-byte packets, they get fragmented, causing a massive performance hit.
  • The Fix: Ensure the cluster MTU is set to 1400 (1500 – 100) or enable Jumbo Frames (9000) on your physical switches.
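The MTU arithmetic is worth being able to do on the spot:

```shell
# Cluster (overlay) MTU = physical MTU minus the Geneve overhead (~100 bytes)
physical_mtu=1500
geneve_overhead=100
echo "cluster MTU: $(( physical_mtu - geneve_overhead ))"
# prints: cluster MTU: 1400   (with jumbo frames: 9000 - 100 = 8900)
```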

Scenario: “How do you isolate traffic between two departments on the same cluster?”

  • The Answer: NetworkPolicies. OVN-Kubernetes supports standard Kubernetes NetworkPolicy objects. By default, all pods can talk to all pods. I would implement a “Deny-All” default policy and then explicitly allow traffic only between required microservices.
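A default deny policy is a short manifest; the namespace name here is an example:

```yaml
# Deny all ingress and egress for every pod in the namespace.
# Allowed traffic must then be opened with explicit policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a
spec:
  podSelector: {}      # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

The empty podSelector is the key detail interviewers look for: it applies the policy to every pod in the namespace.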

Summary for Administrator Interview

Feature            OpenShift SDN (Old)    OVN-Kubernetes (New/Standard)
Encapsulation      VXLAN                  Geneve
Network Policy     Limited                Fully Featured (Egress/Ingress)
Hybrid Cloud       Hard to implement      Designed for it (IPsec support)
Windows Support    No                     Yes

Essential OpenShift Q&A: Architecture, Security & Workflow

In an OpenShift interview, the questions typically fall into three categories: Architecture/Concepts, Security (SCCs/RBAC), and Developer Workflow (S2I/Builds).

Here is a curated list of the most common and high-impact questions for 2026.


1. Core Architecture & Concepts

Q1: What is the fundamental difference between OpenShift and Kubernetes?

Answer: While Kubernetes is an open-source orchestration engine, OpenShift is a downstream, enterprise-grade distribution of Kubernetes by Red Hat.

  • The “Plus” Factor: OpenShift includes everything in Kubernetes but adds a built-in container registry, integrated CI/CD pipelines (Tekton), a developer-friendly web console, and enhanced security defaults.
  • Security: By default, OpenShift forbids containers from running as root, whereas vanilla Kubernetes is “open” by default.

Q2: What is an OpenShift “Project” vs. a Kubernetes “Namespace”?

Answer: A Project is simply an abstraction on top of a Kubernetes Namespace.

  • It adds metadata and facilitates Self-Service: users can request projects via the CLI (oc new-project) or Web Console.
  • It automatically applies default Resource Quotas and Limit Ranges to the namespace to prevent a single user from crashing the cluster.

Q3: Explain the role of the Router (HAProxy) in OpenShift.

Answer: In vanilla Kubernetes, you typically install an Ingress Controller (like NGINX). In OpenShift, the Router (based on HAProxy) is a core component. It provides the external entry point for traffic, mapping an external URL (a Route) to an internal Service.


2. Developer & Build Workflow

Q4: What is Source-to-Image (S2I) and why is it used?

Answer: S2I is a toolkit that allows developers to provide only their source code (via a Git URL). OpenShift then:

  1. Detects the language (Java, Python, Node, etc.).
  2. Injects the code into a “Builder Image.”
  3. Assembles the final application image.

Benefit: Developers don’t need to know how to write a Dockerfile or manage base images, ensuring consistent security patches at the base layer.

Q5: What is a BuildConfig?

Answer: A BuildConfig is the definition of the entire build process. It specifies:

  • Source: Where the code is (Git).
  • Strategy: How to build it (S2I, Docker, or Pipeline).
  • Output: Where to push the resulting image (internal registry).
  • Triggers: Events that start a build (e.g., a code commit or an update to the base image).
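Putting those four pieces together, a BuildConfig could look like this — the repository URL, builder image tag, and webhook secret are placeholders:

```yaml
# Sketch of an S2I BuildConfig covering source, strategy, output, and triggers.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app
spec:
  source:
    git:
      uri: https://github.com/user/repo        # Source
  strategy:
    sourceStrategy:                            # S2I strategy
      from:
        kind: ImageStreamTag
        name: nodejs:18-ubi8                   # builder image (example tag)
  output:
    to:
      kind: ImageStreamTag
      name: my-app:latest                      # pushed to the internal registry
  triggers:
    - type: GitHub                             # code commit via webhook
      github:
        secret: <webhook-secret>
    - type: ImageChange                        # rebuild when the base image updates
      imageChange: {}
```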

3. Security & Operations

Q6: What are Security Context Constraints (SCCs)?

Answer: SCCs are one of the most important security features in OpenShift. They control what actions a pod can perform.

  • Restricted SCC: The default. It prevents pods from running as root and limits access to the host filesystem.
  • Anyuid SCC: Often used when migrating legacy Docker images that must run as a specific user.
  • Privileged SCC: Full access (usually reserved for infra components like logging or monitoring).

Q7: How does OpenShift handle Persistent Storage?

Answer: OpenShift uses the Persistent Volume (PV) and Persistent Volume Claim (PVC) model.

  • An administrator provisions PVs (storage chunks).
  • A developer requests storage via a PVC.
  • OpenShift uses Storage Classes to dynamically provision storage on the fly (e.g., on AWS EBS or VMware vSphere) when a PVC is created.
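The developer side of that flow is just a PVC; the storageClassName (thin-csi, the vSphere CSI default class in recent releases) is an example:

```yaml
# A developer requests 10Gi; dynamic provisioning creates the PV automatically.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: thin-csi    # example class name; varies per cluster
  resources:
    requests:
      storage: 10Gi
```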

4. Scenario-Based “Pro” Question

Q8: “A pod is failing with a CrashLoopBackOff. How do you troubleshoot it in OpenShift?”

Answer: Walk through these 4 steps to show you have hands-on experience:

  1. Check Status: oc get pods to see the status.
  2. Examine Logs: oc logs <pod_name> (use --previous if the container already restarted).
  3. Inspect Events: oc describe pod <pod_name> to look for failed mounts, scheduling issues, or “Back-off” events.
  4. Debug Session: Use oc debug pod/<pod_name> to launch a terminal inside a clone of the failing pod to inspect the filesystem and environment variables.

5. Rapid-Fire Command Cheat Sheet

Task             Command
Login            oc login <api-url>
Create App       oc new-app https://github.com/user/repo
Scale App        oc scale --replicas=3 deployment/my-app
Expose Service   oc expose svc/my-service
View Resources   oc get all
Check SCCs       oc get scc

For the Administrator track, the interview will shift away from “how to deploy an app” toward Cluster Health, Lifecycle Management, and Infrastructure Stability.

In OpenShift 4.x (the modern standard), the “Operator-focused” architecture is the star of the show. Here are the heavy-hitting admin questions you should be ready for.


1. The Operator Framework

Q1: What is the “Operator Pattern” and why is it central to OpenShift 4?

Answer: In OpenShift 4, the entire cluster is managed by Operators. An Operator is a custom controller that encodes human operational knowledge into software.

  • The Loop: It constantly monitors the Actual State of a component (like the Internal Registry or Monitoring stack) and compares it to the Desired State. If they differ, the Operator automatically fixes it.
  • Cluster Version Operator (CVO): This is the “Master Operator” that manages the updates of the cluster itself, ensuring all core components are at the correct version.

Q2: How do you perform a Cluster Upgrade in OpenShift 4?

Answer: Upgrades are managed via the Cluster Version Operator (CVO).

  • Process: You typically update the “Channel” (e.g., stable-4.14) and then trigger the upgrade via the console or oc adm upgrade.
  • Mechanism: The CVO orchestrates the update of every operator in the cluster. The Machine Config Operator (MCO) handles the rolling reboot of nodes to update the underlying Red Hat Enterprise Linux CoreOS (RHCOS).

2. Infrastructure & Nodes

Q3: What is the Machine Config Operator (MCO)?

Answer: The MCO is one of the most important components for an admin. It treats the underlying nodes like “cattle, not pets.”

  • It manages the operating system (RHCOS) itself.
  • If you need to change a kernel parameter, add an SSH key, or change an NTP setting across 50 nodes, you create a MachineConfig object. The MCO then applies that change and reboots nodes in a rolling fashion so the cluster stays available.
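As a sketch, a MachineConfig that sets an NTP server on all workers might look like this — the base64 payload encodes the single line server ntp.example.com iburst, and the server name is a placeholder:

```yaml
# Drop a chrony config on every worker node; the MCO rolls it out
# with a rolling reboot of the worker pool.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker   # target the worker pool
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/chrony.conf
          mode: 420          # decimal for 0644
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,c2VydmVyIG50cC5leGFtcGxlLmNvbSBpYnVyc3QK
```

Watch the rollout with oc get mcp: the worker pool reports UPDATING until every node has rebooted onto the new config.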

Q4: Explain the difference between IPI and UPI installation.

Answer:

  • IPI (Installer-Provisioned Infrastructure): Full automation. The OpenShift installer has credentials to your cloud (AWS, Azure, etc.) and creates the VMs, VPCs, and Load Balancers for you.
  • UPI (User-Provisioned Infrastructure): The admin manually creates the infrastructure (VMs, networking, storage). You then run the installer to “bootstrap” OpenShift onto those pre-existing resources. (Common in highly regulated or air-gapped environments).

3. Storage & Networking

Q5: How do you troubleshoot a Node that is in “NotReady” status?

Answer: I follow a systematic checklist:

  1. Check Node Details: oc describe node <node_name> to look at the “Conditions” section (e.g., MemoryPressure, DiskPressure, or NetworkUnavailable).
  2. Verify Kubelet: SSH into the node (or use oc debug node) and check the kubelet logs: journalctl -u kubelet.
  3. Resource Usage: Check if the node has run out of PIDs or Disk space.
  4. CSRs: If the node was recently added/rebooted, check if there are pending Certificate Signing Requests: oc get csr and approve them if necessary.

Q6: What is the “In-tree” to CSI migration?

Answer: Older versions of OpenShift used storage drivers built directly into the Kubernetes binary (“In-tree”). Modern OpenShift is moving to CSI (Container Storage Interface) drivers. As an admin, this means storage is now handled by standalone operators, allowing for easier updates without upgrading the whole cluster.


4. Security & Etcd

Q7: Why is the etcd backup critical, and how do you perform it?

Answer: etcd is the “brain” of the cluster; it stores every configuration and state. If etcd is lost, the cluster is dead.

  • Backup: You use the script shipped with the cluster-etcd-operator. I would run it via a debug shell on a master: oc debug node/<master-node> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup.
  • Strategy: Always take a backup before a cluster upgrade.

5. Monitoring & Logging

Q8: What stack does OpenShift use for Cluster Monitoring?

Answer: OpenShift comes with a pre-configured Prometheus, Grafana, and Alertmanager stack (managed by the Monitoring Operator).

  • Note: Admins use this to monitor cluster health (CPU/Mem of nodes).
  • User Workload Monitoring: In newer versions, admins can enable “User Workload Monitoring” to allow developers to use the same Prometheus stack for their own applications without interfering with the cluster’s core monitoring.

Summary Checklist for your Interview

[!TIP]
If they ask about a problem you can’t solve: always mention “looking at the Operators.” In OpenShift 4, if something is broken, check oc get clusteroperators. If one is DEGRADED=True, that is your smoking gun.

OCP can run in different environments, such as on-premise (VMware/Bare Metal) or as a managed service (ROSA/ARO).

Understanding OCP Backup: Two Essential Layers

Here’s a comprehensive breakdown of OCP backup — covering the two distinct layers you need to protect.


The two backup layers in OCP

OCP backup is not a single thing — you need two separate strategies working together:

Layer                  What it protects                                                  Tool
Control plane (etcd)   Cluster state — all Kubernetes/OCP objects, CRDs, configs, RBAC   cluster-backup.sh / EtcdBackup CR
Application data       Namespaces, workloads, PVs/PVCs, images                           OADP (OpenShift API for Data Protection)

Use etcd backups (manual or automated) to protect and recover the cluster itself. Use OADP to protect and recover your applications and their data on top of a healthy cluster. The two are complementary, not interchangeable: OADP will not back up and restore operators or etcd.


Layer 1 — etcd backup (control plane)

etcd is the key-value store for OpenShift Container Platform, which persists the state of all resource objects. An etcd backup plays a crucial role in disaster recovery.

What the backup produces

Running cluster-backup.sh on a control plane node generates two files:

  • snapshot_<timestamp>.db — the etcd snapshot (all cluster state)
  • static_kuberesources_<timestamp>.tar.gz — static pod manifests + encryption keys (if etcd encryption is enabled)

How to take a manual backup

# SSH into any control plane node
ssh core@master-0.example.com
# Run the built-in backup script
sudo /usr/local/bin/cluster-backup.sh /home/core/backup
# Copy the backup off-cluster immediately
scp core@master-0:/home/core/backup/* /safe/offsite/location/

Automated scheduled backup (OCP 4.14+)

You can create a custom resource to define the schedule and retention of automated backups:

# 1. Create a PVC for backup storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: openshift-etcd
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
---
# 2. Schedule recurring backups
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: etcd-recurring-backup
spec:
  etcd:
    schedule: "20 4 * * *"   # Daily at 04:20 UTC
    timeZone: "UTC"
    pvcName: etcd-backup-pvc
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 15

Key rules for etcd backups

Do not take an etcd backup before the first certificate rotation completes (roughly 24 hours after installation); otherwise the backup will contain expired certificates. Also take etcd backups during non-peak usage hours, because the backup is a blocking action.

  • Backups only need to be taken from one master — there is no need to run on every master. Store backups in either an offsite location or somewhere off the server.
  • Be sure to take an etcd backup after you upgrade your cluster. When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release — for example, an OCP 4.14.2 cluster must use a backup taken from 4.14.2.

Restore procedure (high level)

# On the designated recovery control plane node:
sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
# After restore completes, force etcd redeployment:
oc edit etcd cluster
# Add under spec:
# unsupportedConfigOverrides:
# forceRedeploymentReason: recovery-2025-04-17
# Monitor etcd pods coming back up (filter out the quorum-guard pods)
oc get pods -n openshift-etcd | grep -v quorum

Layer 2 — OADP (application backup)

OADP uses Velero to back up and restore cluster resources and internal images, and it can protect persistent volumes either through file-system backup (Restic/Kopia) or through storage snapshots.

Install OADP via OperatorHub

Operators → OperatorHub → search "OADP" → Install

Configure a backup location (S3 example)

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-cluster
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
        - openshift   # Required for OCP-specific resources
        - aws
    nodeAgent:
      enable: true
      uploaderType: kopia   # Preferred over restic in OADP 1.3+
  backupLocations:
    - name: default
      velero:
        provider: aws
        default: true
        objectStorage:
          bucket: my-ocp-backups
          prefix: cluster-1
        credential:
          name: cloud-credentials
          key: cloud
  snapshotLocations:
    - name: default
      velero:
        provider: aws
        config:
          region: ca-central-1

Taking an application backup

# Backup a specific namespace
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-backup
  namespace: openshift-adp
spec:
  includedNamespaces:
    - my-app
    - my-app-db
  defaultVolumesToFsBackup: true   # Use kopia/restic for PVs
  storageLocation: default
  ttl: 720h0m0s   # 30-day retention
---
# Scheduled backup (daily at 2am)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-app-backup
  namespace: openshift-adp
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - "*"   # All namespaces
    excludedNamespaces:
      - "openshift-*"   # Exclude platform namespaces
      - "kube-*"
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h0m0s   # 7-day retention

Restoring from OADP

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: my-app-restore
  namespace: openshift-adp
spec:
  backupName: my-app-backup
  includedNamespaces:
    - my-app
  restorePVs: true

PV backup methods

Method                     How it works                                        Best for
CSI Snapshots              Point-in-time volume snapshot via storage driver    Cloud PVs (AWS EBS, Azure Disk, Ceph RBD)
Kopia/Restic (fs backup)   File-level copy streamed to object storage          Any PV; slower but universal

Supported backup storage targets

OADP supports AWS, MS Azure, GCP, Multicloud Object Gateway, and S3-compatible object storage (MinIO, NooBaa, etc.). Snapshot backups can be performed for AWS, Azure, GCP, and CSI snapshot-enabled cloud storage such as Ceph FS and Ceph RBD.


Best practices summary

Practice                      Detail
3-2-1 rule                    3 copies, 2 media types, 1 offsite — etcd snapshots must be stored outside the cluster
Test restores                 Regularly restore to a test cluster — an untested backup is not a backup
Version lock                  etcd restores must use a backup from the same OCP z-stream version
Frequency                     etcd: at minimum daily and before every upgrade; OADP: daily or per RPO requirement
Exclude platform namespaces   Don’t include openshift-* in OADP — OADP doesn’t restore operators or etcd
Encryption                    Encrypt backup storage at rest; the etcd snapshot includes encryption keys if etcd encryption is on
Monitor backup jobs           Set up alerts on failed Schedule or EtcdBackup CRs