Recovering from a lost Red Hat OpenShift Container Platform (OCP) cluster depends on the extent of the failure. In an enterprise disaster recovery (DR) strategy, scenarios are split into two categories:
- Control Plane Quorum Loss (In-Place Recovery): The underlying infrastructure (virtual machines or bare metal) is still intact, but the etcd cluster has lost quorum and cannot recover on its own.
- Total Cluster/Site Loss (Fresh Rebuild): The infrastructure is completely gone, and you must provision a brand-new OCP cluster and restore your state.
Scenario 1: Recovering Control Plane Quorum (In-Place etcd Restore)
If your control plane (master) nodes are online but etcd is corrupted beyond self-repair, you must perform an authoritative single-node restore. This forces the entire cluster to re-initialize its state from a known-good historical snapshot.
Step 1: Establish SSH to All Control Plane Nodes
Because the Kubernetes API server will be completely offline during this process, oc commands will not work. Open separate terminal windows and SSH directly into all of your master nodes as the core user.
Step 2: Select the “Recovery” Master Node
Pick one master node to be your source of truth. Copy your healthy, pre-existing etcd backup file (typically generated via the /usr/local/bin/cluster-backup.sh script) onto this node:
# Verify the backup artifacts exist in your recovery folder
ls -l /home/core/assets/backup/
# Expected: snapshot_<timestamp>.db and static_kuberesources_<timestamp>.tar.gz
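The manual listing can also be turned into a pass/fail check before you commit to a restore. The following is a minimal sketch, not part of OpenShift: `check_backup_dir` is a hypothetical helper name, and the file name patterns are an assumption chosen loosely enough to match one etcd snapshot and one static pod resources archive.

```shell
#!/usr/bin/env bash
# check_backup_dir is a hypothetical helper, not an OpenShift tool.
# It succeeds only if the directory holds at least one non-empty etcd
# snapshot (*snapshot*.db) and one non-empty static pod resources archive
# (*kuberesources*.tar.gz). The patterns are assumptions.
check_backup_dir() {
  local dir="$1" snapshots archives
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  # find prints one line per match; -size +0c skips empty files
  snapshots=$(find "$dir" -maxdepth 1 -type f -name '*snapshot*.db' -size +0c | wc -l | tr -d ' ')
  archives=$(find "$dir" -maxdepth 1 -type f -name '*kuberesources*.tar.gz' -size +0c | wc -l | tr -d ' ')
  if [ "$snapshots" -ge 1 ] && [ "$archives" -ge 1 ]; then
    echo "backup directory looks complete: $dir"
    return 0
  fi
  echo "incomplete backup in $dir (snapshots=$snapshots, archives=$archives)" >&2
  return 1
}

# Usage on the recovery node:
#   check_backup_dir /home/core/assets/backup
```

The check only confirms that both files are present and non-empty; it does not validate the snapshot contents.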
Step 3: Stop the Control Plane on the Other Master Nodes
On every master node EXCEPT your chosen recovery node, manually move the static pod manifests out of the kubelet scanning directory to stop the core services and wipe out the broken etcd data store:
# Run this on Master 2 and Master 3
sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp/
# Wait for the containers to stop completely; repeat until this prints nothing
sudo crictl ps | grep -E 'etcd|kube-apiserver'
# Clear out the corrupted database directory
# (find runs as root, so it does not depend on the unprivileged shell expanding a glob)
sudo find /var/lib/etcd -mindepth 1 -delete
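Instead of re-running the `crictl` check by hand, a polling loop can wait until the containers are gone before the data directory is cleared. This is a sketch under stated assumptions: `wait_for_control_plane_stop` is a hypothetical helper, and `LIST_CMD` is an illustrative override that defaults to `sudo crictl ps`. It reuses the same grep pattern as the manual check.

```shell
#!/usr/bin/env bash
# wait_for_control_plane_stop is a hypothetical helper.
# LIST_CMD is an assumption added for testability; by default the
# function lists running containers with "sudo crictl ps".
# Arguments: timeout in seconds (default 120), poll interval (default 5).
wait_for_control_plane_stop() {
  local timeout="${1:-120}" interval="${2:-5}" waited=0
  # Keep looping while any listed container mentions etcd or kube-apiserver
  while ${LIST_CMD:-sudo crictl ps} 2>/dev/null | grep -Eq 'etcd|kube-apiserver'; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "containers still running after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "etcd and kube-apiserver containers have stopped"
}

# Usage on each non-recovery master, after moving the manifests:
#   wait_for_control_plane_stop 180 5
```

A non-zero exit status means the timeout expired, in which case the data directory should be left alone until the containers have actually stopped.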
Step 4: Run the Recovery Script
Switch back to your Recovery Master Node and initiate the built-in restore utility, passing it the path to your backup directory:
sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup
The script stops the local control plane, wipes the local etcd data directory, restores the snapshot database, and starts an isolated, single-member etcd instance.
Step 5: Restart Kubelet and Verify
Restart the kubelet service on every control plane node so that each one picks up the restored static pod configuration:
sudo systemctl restart kubelet.service
Once the API server responds again, use the OpenShift CLI to check that the nodes and etcd pods are coming back:
oc get nodes
oc get pods -n openshift-etcd
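On a larger cluster it can help to list only the nodes that have not recovered yet. The filter below is a minimal sketch: `not_ready_nodes` is a hypothetical name, and it assumes the default column layout of `oc get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
#!/usr/bin/env bash
# not_ready_nodes is a hypothetical filter. It reads the output of
# `oc get nodes --no-headers` on stdin and prints the name of every node
# whose STATUS column does not start with "Ready". A status such as
# "Ready,SchedulingDisabled" still counts as ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster:
#   oc get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports a Ready status.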
Step 6: Force a Rollout of the Control Plane
To make the remaining master nodes pull the restored state and rejoin the etcd cluster, force a redeployment by patching the etcd cluster resource:
oc patch etcd cluster -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}' --type=merge
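Because `date --rfc-3339=ns` prints a timestamp that contains a space, the patch body is easy to break with shell quoting. One way to sidestep this, sketched here with a hypothetical helper named `redeploy_patch`, is to build the JSON separately and pass it as a single quoted argument.

```shell
#!/usr/bin/env bash
# redeploy_patch is a hypothetical helper that prints the merge patch as
# one line of JSON. The timestamp is captured inside double quotes so the
# space emitted by `date --rfc-3339=ns` (GNU date) cannot split the value.
redeploy_patch() {
  local stamp
  stamp="$(date --rfc-3339=ns)"
  printf '{"spec": {"forceRedeploymentReason": "recovery-%s"}}\n' "$stamp"
}

# Usage against a live cluster:
#   oc patch etcd cluster --type=merge -p "$(redeploy_patch)"
```

Red Hat's documented restore procedure applies the same kind of patch to the `kubeapiserver`, `kubecontrollermanager`, and `kubescheduler` cluster resources once the etcd rollout has finished, so check the documentation for your OCP version before treating the etcd patch as the final step.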
Scenario 2: Total Site/Cluster Loss (The “Clean Slate” Pattern)
If the underlying compute infrastructure or public cloud region fails catastrophically, do not try to repair individual machines. Instead, rebuild the cluster from code and restore its state in layers:
[ Infrastructure-as-Code ] ──> Provisions Fresh OCP Cluster Base
            │
[ GitOps (ArgoCD / Flux) ] ──> Re-installs Core Operators & Platform Configs
            │
[ Red Hat OADP (Velero) ]  ──> Restores Application PVs & Active Persistent States
Step 1: Re-Provision the Cluster Infrastructure
Use your automated Infrastructure-as-Code (IaC) deployment pipelines (Ansible, Terraform, or the OpenShift Assisted Installer API) to build an empty OpenShift cluster with the same configuration in your designated DR location.
Step 2: Re-Apply Platform Cluster Configurations via GitOps
Once the new cluster's API server is reachable, log in to your central GitOps controller (such as Red Hat Advanced Cluster Management for Kubernetes (RHACM) or Argo CD) and register the new cluster's API endpoint as the deployment destination.
Your GitOps engine will automatically rebuild the state of the cluster by deploying:
- Custom namespaces, ClusterRoles, and RBAC policies.
- Global operators (Service Mesh, Serverless, ACS, Logging and its collector agents).
- Network Policies and Ingress/Route definitions.
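With Argo CD, repointing a configuration set at the new cluster comes down to the `destination.server` field of an Application. The sketch below renders such a manifest from a shell function. `render_application` is a hypothetical helper, the `openshift-gitops` namespace and the `default` project are assumptions about your Argo CD installation, and every value in the usage example is a placeholder. It also assumes the new cluster has already been registered with Argo CD (for example with `argocd cluster add`).

```shell
#!/usr/bin/env bash
# render_application is a hypothetical helper that prints an Argo CD
# Application manifest. Arguments: name, Git repo URL, path in the repo,
# API endpoint of the target cluster. Namespace and project are assumptions.
render_application() {
  local name="$1" repo_url="$2" path="$3" api_server="$4"
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${name}
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: ${repo_url}
    path: ${path}
    targetRevision: HEAD
  destination:
    server: ${api_server}
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
}

# Usage (placeholder values), run against the cluster hosting Argo CD:
#   render_application platform-config https://git.example.com/platform.git \
#     clusters/dr https://api.new-cluster.example.com:6443 | oc apply -f -
```

Keeping the endpoint as a parameter means the same Git content can be replayed against any replacement cluster without editing the repository.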
Step 3: Restore Application States via OADP
While GitOps reconstructs your stateless resources, you still need to restore your persistent volume data. This is handled by the OpenShift API for Data Protection (OADP), which is built on Velero.
Configure OADP on the new cluster to point to the exact same object storage location (e.g., AWS S3, MinIO) where your previous application data backups reside. Create a Restore Custom Resource to pull the persistent volumes back down:
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: total-disaster-recovery-restore
  namespace: openshift-adp
spec:
  backupName: scheduled-daily-enterprise-backup  # The name of your healthy historical backup
  includedNamespaces:
    - core-banking-prod
    - customer-db-prod
  restorePVs: true  # Directs OADP to bind back to storage snapshots
Apply this manifest to start downloading your application state:
oc apply -f restore-manifest.yaml
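Applying the manifest only starts the restore; the outcome is reported in the Restore resource's `status.phase` field. The helper below is a minimal sketch for interpreting that field in a script. `classify_restore_phase` is a hypothetical name, and the grouping of phases into done, failed, and pending is an assumption about how you want to react to them.

```shell
#!/usr/bin/env bash
# classify_restore_phase is a hypothetical helper that maps a Velero
# Restore status.phase value to "done", "failed", or "pending".
classify_restore_phase() {
  case "$1" in
    Completed) echo "done" ;;
    PartiallyFailed|Failed|FailedValidation) echo "failed" ;;
    *) echo "pending" ;;  # New, InProgress, empty, or any other phase
  esac
}

# Usage against a live cluster (restore name taken from the manifest):
#   phase=$(oc get restores.velero.io total-disaster-recovery-restore \
#             -n openshift-adp -o jsonpath='{.status.phase}')
#   classify_restore_phase "$phase"
```

A "failed" result, including a partial failure, is worth investigating in the restore's logs and status before traffic is redirected to the new cluster.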
Step 4: DNS Redirection
Update your global traffic manager (GTM) or external corporate DNS infrastructure (e.g., F5 BIG-IP, Cloudflare, Route53) to swing your application wildcards (*.apps.old-cluster.com → *.apps.new-cluster.com) toward the new OCP router ingress load balancers.
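After the DNS change propagates, it is worth confirming that an application host name actually resolves to the new ingress load balancer. The check below is a sketch under stated assumptions: `route_resolves_to` is a hypothetical helper, `RESOLVE_CMD` is an illustrative override that defaults to `dig +short`, and the host name and IP address in the usage example are placeholders.

```shell
#!/usr/bin/env bash
# route_resolves_to is a hypothetical helper. It succeeds when the host
# name resolves to the expected IP address. RESOLVE_CMD is an assumption
# added for testability; it defaults to "dig +short", which prints one
# answer per line (CNAME targets first, then addresses).
route_resolves_to() {
  local host="$1" expected_ip="$2"
  # -x matches whole lines only, -F treats the IP as a literal string
  ${RESOLVE_CMD:-dig +short} "$host" | grep -qxF "$expected_ip"
}

# Usage (placeholder values):
#   route_resolves_to console.apps.new-cluster.com 203.0.113.10 \
#     && echo "DNS points at the new ingress"
```

Resolver caches may keep returning the old address until the record's TTL expires, so a failing check shortly after the change is not necessarily a misconfiguration.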