How to Recover Your Red Hat OpenShift Cluster After a Failure

Recovering from a lost Red Hat OpenShift Container Platform (OCP) cluster depends on the extent of the failure. In an enterprise disaster recovery (DR) strategy, scenarios are split into two categories:

  1. Control Plane Quorum Loss (In-Place Recovery): The underlying infrastructure (virtual machines or bare metal) is still intact, but the etcd cluster has lost quorum and cannot recover on its own.
  2. Total Cluster/Site Loss (Fresh Rebuild): The infrastructure is completely gone, and you must provision a brand-new OCP cluster and restore your state.
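
To make "lost quorum" concrete: etcd can only accept writes while a strict majority of its members is healthy. The sketch below computes that majority and shows why a three-member control plane survives one failure but not two. The function names are illustrative and not part of any OpenShift tooling.

```shell
# Illustrative helpers (hypothetical names, not part of any OpenShift tooling).

# quorum_size MEMBERS: smallest number of healthy members that forms a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum MEMBERS HEALTHY: succeeds if HEALTHY members are enough for a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

if has_quorum 3 1; then
  echo "3 members, 1 healthy: quorum intact"
else
  echo "3 members, 1 healthy: quorum lost"
fi
```

With three members the majority is two, so a single surviving member cannot make progress and Scenario 1 applies.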

Scenario 1: Recovering Control Plane Quorum (In-Place etcd Restore)

If your control plane (master) nodes are online but etcd is corrupted beyond self-repair, you must execute an authoritative single-node restore. This forces the entire cluster to re-initialize its state from a known, healthy snapshot.

Step 1: Establish SSH to All Control Plane Nodes

Because the Kubernetes API server will be completely offline during this process, oc commands will not work. Open separate terminal windows and SSH directly into all of your master nodes as the core user.
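
Before starting, it can help to confirm that every control plane node accepts a non-interactive SSH login. The following is a minimal sketch: the host names are placeholders, and the helper names are made up for this example.

```shell
# Hypothetical host names; replace with your own control plane nodes.
NODES="master-0.example.com master-1.example.com master-2.example.com"

# probe_ssh HOST: succeeds if a non-interactive SSH login as core works.
probe_ssh() {
  ssh -n -o BatchMode=yes -o ConnectTimeout=5 "core@$1" true
}

# report_reachability PROBE HOST...: runs PROBE against each host and prints
# one "host: ok" or "host: UNREACHABLE" line per host. Fails if any probe fails.
report_reachability() {
  rr_probe=$1; shift
  rr_status=0
  for rr_host in "$@"; do
    if "$rr_probe" "$rr_host" >/dev/null 2>&1; then
      echo "$rr_host: ok"
    else
      echo "$rr_host: UNREACHABLE"
      rr_status=1
    fi
  done
  return "$rr_status"
}

# Usage (commented out so the snippet can be sourced safely):
# report_reachability probe_ssh $NODES
```

Passing the probe as an argument keeps the loop testable without real hosts. In production you would pass probe_ssh.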

Step 2: Select the “Recovery” Master Node

Pick one master node to be your source of truth. Copy your healthy, pre-existing etcd backup (typically generated by the /usr/local/bin/cluster-backup.sh script) onto this node, then confirm that both backup artifacts are present:

Bash

# Verify the backup artifacts exist in your recovery folder
ls -l /home/core/assets/backup/
# Expected: snapshot_<timestamp>.db and static_kuberesources_<timestamp>.tar.gz
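
The manual check above can be scripted. This sketch assumes you keep exactly one backup set per directory, so there is no ambiguity about which snapshot gets restored. The function name is hypothetical, and the file name patterns are deliberately loose.

```shell
# check_backup_dir DIR: succeeds only if DIR holds exactly one etcd snapshot
# (*.db) and exactly one static resources archive (*.tar.gz).
check_backup_dir() {
  cb_dir=$1
  cb_db=$(find "$cb_dir" -maxdepth 1 -type f -name '*snapshot*.db' | wc -l)
  cb_tar=$(find "$cb_dir" -maxdepth 1 -type f -name '*kuberesources*.tar.gz' | wc -l)
  if [ "$cb_db" -ne 1 ] || [ "$cb_tar" -ne 1 ]; then
    echo "unexpected backup contents in $cb_dir: $cb_db snapshot(s), $cb_tar archive(s)" >&2
    return 1
  fi
  echo "backup directory looks complete"
}

# Usage (commented out so the snippet can be sourced safely):
# check_backup_dir /home/core/assets/backup
```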

Step 3: Stop the Control Plane on the Other Master Nodes

On every master node EXCEPT your chosen recovery node, manually move the static pod manifests out of the kubelet scanning directory to stop the core services and wipe out the broken etcd data store:

# Run this on Master 2 and Master 3
sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp/
sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp/
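# Wait for the containers to stop; repeat until this prints nothing
sudo crictl ps | grep -E 'etcd|kube-apiserver'
# Clear out the corrupted database directory. The glob is expanded inside a root
# shell because the core user usually cannot list /var/lib/etcd.
sudo sh -c 'rm -rf /var/lib/etcd/*'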
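
Rather than re-running the crictl check by hand, the wait can be turned into a small polling loop with a timeout. This is a sketch, not part of the OpenShift tooling. The function name and the attempt and delay values are arbitrary choices.

```shell
# wait_until_gone PATTERN ATTEMPTS DELAY LIST_CMD...
# Polls LIST_CMD until its output no longer matches PATTERN.
# Returns 0 when nothing matches, 1 on timeout, 2 if LIST_CMD itself fails.
wait_until_gone() {
  wg_pattern=$1 wg_attempts=$2 wg_delay=$3
  shift 3
  wg_i=0
  while [ "$wg_i" -lt "$wg_attempts" ]; do
    wg_out=$("$@") || return 2
    if ! printf '%s\n' "$wg_out" | grep -Eq "$wg_pattern"; then
      return 0
    fi
    wg_i=$((wg_i + 1))
    sleep "$wg_delay"
  done
  return 1
}

# Usage on a node being stopped (commented out so the snippet can be sourced safely):
# wait_until_gone 'etcd|kube-apiserver' 30 5 sudo crictl ps
```

A failing list command is reported separately (exit code 2), so an unreachable container runtime is not mistaken for "all containers stopped".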

Step 4: Run the Recovery Script

Switch back to your Recovery Master Node and initiate the built-in restore utility, passing it the path to your backup directory:

sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup

The script stops the local control plane, wipes the local etcd directory, restores the snapshot database, and starts an isolated, single-member etcd node.

Step 5: Restart Kubelet and Verify

Restart the kubelet service on all control plane nodes so that they pick up the restored configuration:

sudo systemctl restart kubelet.service

Once the API server responds again, log back in with the OpenShift CLI (oc) and check that the nodes and etcd pods are coming back:

oc get nodes
oc get pods -n openshift-etcd
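
To turn the node check into something a script can act on, the STATUS column can be counted rather than read by eye. This is a minimal sketch with a hypothetical helper name. It treats any status beginning with "Ready" (including "Ready,SchedulingDisabled") as ready.

```shell
# count_not_ready: reads `oc get nodes --no-headers` output on stdin and prints
# how many nodes have a STATUS column that does not start with "Ready".
count_not_ready() {
  awk '$2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Usage (commented out so the snippet can be sourced safely):
# oc get nodes --no-headers | count_not_ready
```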

Step 6: Force a Rollout of the Control Plane

To force the remaining master nodes to download the newly restored data state and rejoin the cluster fabric, update the etcd cluster definition:

oc patch etcd cluster -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}' --type=merge
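
Red Hat's restore procedure, as I understand it, follows the etcd rollout with the same kind of forced redeployment for the other control plane operators. The sketch below generalizes the patch into a loop. Treat the resource list as an assumption to verify against the documentation for your OCP version. The helper names and the OC override are made up for this example, and the timestamp uses a portable date format instead of --rfc-3339.

```shell
# redeploy_patch: prints a merge patch whose reason string changes every second.
redeploy_patch() {
  printf '{"spec": {"forceRedeploymentReason": "recovery-%s"}}' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# force_redeploy RESOURCE...: applies the patch to each cluster-scoped operator resource.
# Set OC to another command (for example OC="echo oc") for a dry run.
force_redeploy() {
  for fr_res in "$@"; do
    ${OC:-oc} patch "$fr_res" cluster --type=merge -p "$(redeploy_patch)" || return 1
  done
}

# Usage (commented out so the snippet can be sourced safely):
# force_redeploy etcd kubeapiserver kubecontrollermanager kubescheduler
```

The reason string only needs to differ from the previous value. Any change to forceRedeploymentReason is what triggers the rollout.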

Scenario 2: Total Site/Cluster Loss (The “Clean Slate” Pattern)

If the underlying compute infrastructure or public cloud region fails catastrophically, do not try to repair individual machines. Rebuild instead, in three layers: provision a fresh cluster from code, re-apply platform configuration through GitOps, then restore application data with OADP.

Plaintext

[ Infrastructure-as-Code ] ──> Provisions Fresh OCP Cluster Base
[ GitOps (ArgoCD / Flux) ] ──> Re-installs Core Operators & Platform Configs
[ Red Hat OADP (Velero) ] ──> Restores Application PVs & Active Persistent States

Step 1: Re-Provision the Cluster Infrastructure

Use your automated Infrastructure-as-Code (IaC) deployment pipelines (Ansible, Terraform, or the OpenShift Assisted Installer API) to instantiate an identical, blank OpenShift cluster platform inside your designated DR location.

Step 2: Re-Apply Platform Cluster Configurations via GitOps

Once the new cluster API server is reachable, log into your central GitOps controller (such as Red Hat Advanced Cluster Management (ACM) or Argo CD) and point its destination settings at the new cluster's API endpoint.

Your GitOps engine will automatically rebuild the state of the cluster by deploying:

  • Custom namespaces, ClusterRoles, and RBAC policies.
  • Global operators (Service Mesh, Serverless, ACS, Logging and its agents).
  • Network Policies and Ingress/Route definitions.
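
If your GitOps controller is Argo CD and you manage it with the argocd CLI, re-pointing applications at the new cluster might look like the sketch below. The application names, kubeconfig context, and API URL are placeholders. The helper name and the ARGOCD override are assumptions made for this example.

```shell
# retarget_apps NEW_API_URL APP...: points each Argo CD Application at the new
# cluster and triggers a sync. Set ARGOCD="echo argocd" for a dry run.
retarget_apps() {
  ra_url=$1; shift
  for ra_app in "$@"; do
    ${ARGOCD:-argocd} app set "$ra_app" --dest-server "$ra_url" || return 1
    ${ARGOCD:-argocd} app sync "$ra_app" || return 1
  done
}

# Usage with hypothetical names (commented out so the snippet can be sourced safely):
# argocd cluster add dr-cluster-context   # register the new cluster first
# retarget_apps https://api.new-cluster.com:6443 platform-rbac platform-operators
```

Running it once with ARGOCD="echo argocd" prints the commands without executing them, which is a cheap way to review the change before a real failover.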

Step 3: Restore Application States via OADP

While GitOps reconstructs your stateless resources, you need to restore your live persistent volume data. This is managed via the OpenShift API for Data Protection (OADP) (powered by Velero).

Configure OADP on the new cluster to point to the exact same object storage location (e.g., AWS S3, MinIO) where your previous application data backups reside. Create a Restore Custom Resource to pull the persistent volumes back down:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: total-disaster-recovery-restore
  namespace: openshift-adp
spec:
  backupName: scheduled-daily-enterprise-backup # The name of your healthy historical backup
  includedNamespaces:
    - core-banking-prod
    - customer-db-prod
  restorePVs: true # Directs OADP to bind back to storage snapshots

Apply this manifest to start downloading your application state:

oc apply -f restore-manifest.yaml
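
Applying the manifest only starts the restore; Velero records progress in the Restore object's status.phase field. The sketch below polls that field until a terminal phase is reached. The helper names and the 30-second interval are choices made for this example, and any phase not listed is simply treated as still pending.

```shell
# restore_phase_state PHASE: maps a Velero Restore phase to finished, failed, or pending.
restore_phase_state() {
  case "$1" in
    Completed) echo finished ;;
    PartiallyFailed|Failed|FailedValidation) echo failed ;;
    *) echo pending ;;
  esac
}

# wait_for_restore NAME: polls the Restore in openshift-adp every 30 seconds.
# Returns 0 on success, 1 on a failed phase, 2 if the oc call itself fails.
wait_for_restore() {
  while :; do
    wr_phase=$(oc get restore.velero.io "$1" -n openshift-adp -o jsonpath='{.status.phase}') || return 2
    case "$(restore_phase_state "$wr_phase")" in
      finished) echo "restore $1: $wr_phase"; return 0 ;;
      failed) echo "restore $1: $wr_phase" >&2; return 1 ;;
    esac
    sleep 30
  done
}

# Usage (commented out so the snippet can be sourced safely):
# wait_for_restore total-disaster-recovery-restore
```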

Step 4: DNS Redirection

Update your global traffic manager (GTM) or external corporate DNS infrastructure (e.g., F5 BIG-IP, Cloudflare, Route53) to swing your application wildcards (*.apps.old-cluster.com → *.apps.new-cluster.com) toward the new OCP router ingress load balancers.
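
After the change propagates, a quick lookup confirms that an application host name now resolves to the new ingress load balancer. This is a minimal sketch: the IP address and host name are placeholders, and the helper name is hypothetical.

```shell
# points_to EXPECTED_IP RESOLVE_CMD...: succeeds if any line printed by
# RESOLVE_CMD is exactly EXPECTED_IP.
points_to() {
  pt_ip=$1; shift
  "$@" | grep -Fxq "$pt_ip"
}

# Usage with hypothetical values (commented out so the snippet can be sourced safely):
# points_to 203.0.113.10 dig +short console-openshift-console.apps.new-cluster.com
```

The whole-line match (-x) prevents a partial hit, such as 203.0.113.100, from passing as the expected address.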
