Disaster Recovery for OpenShift: A Step-by-Step Guide

This is the worst-case disaster recovery scenario for
OpenShift Container Platform:

❌ Entire cluster is gone (infra failure, region loss, etc.)
❌ You must rebuild from scratch
❌ Then restore everything

This is what real DR plans are built around.


Goal

Rebuild cluster → restore:

  • etcd (cluster state)
  • apps (via Velero)
  • persistent data

High-level flow

New infra → Install OpenShift → Restore etcd → Restore apps → Validate

Phase 1: Rebuild infrastructure

Recreate:

  • VMs / instances
  • networking (VPC, subnets, LB)
  • DNS records

The new environment must match the original cluster as closely as possible (instance sizes, network layout, hostnames, DNS names).


Phase 2: Reinstall OpenShift

Use the installer (IPI shown here; UPI follows its own provisioning flow):

openshift-install create cluster

Important:

  • Same OpenShift version as the backup (restoring etcd across versions is unsupported)
  • Same cluster name if possible
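For reference, a minimal install-config.yaml sketch. All values below are placeholders; whenever possible, reuse the original cluster's backed-up install-config so names and networks match:

```yaml
# Hypothetical install-config.yaml sketch; substitute your own values,
# or better, restore the original file from your backups.
apiVersion: v1
baseDomain: example.com        # must match the original DNS domain
metadata:
  name: prod-cluster           # same cluster name as before, if possible
platform:
  aws:                         # example platform; use your own
    region: us-east-1
pullSecret: '...'
sshKey: '...'
```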

STOP POINT (very important)

Do NOT start workloads or make any changes yet.

You are about to overwrite this cluster's state with the etcd restore.


Phase 3: Copy backup to control plane

Copy your etcd backup to one of the control plane (master) nodes; that node becomes the recovery host:

scp -r backup core@<master>:/home/core/
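Before running the scp above, it's worth sanity-checking the backup directory. cluster-backup.sh (the standard OpenShift backup script) produces an etcd snapshot plus a static-pod resources archive; this sketch (`check_backup_dir` is a hypothetical helper) verifies both are present:

```shell
# check_backup_dir: sanity-check an etcd backup directory before copying
# it to the recovery host. cluster-backup.sh produces two files:
#   snapshot_<timestamp>.db                   (the etcd snapshot)
#   static_kuberesources_<timestamp>.tar.gz   (static pod resources)
check_backup_dir() {
  local dir="$1"
  ls "$dir"/snapshot_*.db >/dev/null 2>&1 &&
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1
}

# Usage:
#   check_backup_dir ./backup && scp -r ./backup core@<master>:/home/core/
```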

Phase 4: Restore etcd

SSH into master:

ssh core@<master>
sudo -i

Run:

/usr/local/bin/cluster-restore.sh /home/core/backup-dir

What happens now

  • The freshly installed cluster's etcd data is thrown away and replaced
  • etcd is rolled back to the backed-up state
  • The control plane restarts its static pods with the restored data

Cluster identity becomes the OLD cluster
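After cluster-restore.sh completes, the documented follow-up is to restart kubelet on every control plane node and approve any pending certificate signing requests so nodes can rejoin. `pending_csrs` below is a hypothetical helper that assumes the default `oc get csr` column layout (CONDITION last):

```shell
# pending_csrs: read `oc get csr` output on stdin and print the names of
# CSRs whose CONDITION column (the last field) is "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# On the recovered cluster:
#   sudo systemctl restart kubelet.service   # on every control plane node
#   oc get csr | pending_csrs | xargs -r -n1 oc adm certificate approve
```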


Phase 5: Wait for API recovery

oc get nodes

Then:

oc get co

Wait until all cluster operators report Available=True, Progressing=False, and Degraded=False.
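A small helper can automate the waiting. `unhealthy_cos` is a hypothetical filter assuming the default `oc get co` columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE):

```shell
# unhealthy_cos: read `oc get co` output on stdin and print operators that
# are not yet healthy (AVAILABLE != True, PROGRESSING != False, or
# DEGRADED != False).
unhealthy_cos() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage: keep polling until nothing is printed:
#   until [ -z "$(oc get co | unhealthy_cos)" ]; do sleep 30; done
```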


At this point

You now have:

  • Old cluster config
  • Old namespaces
  • Old objects

BUT:
Persistent volumes may still be missing


Phase 6: Restore persistent volumes

Depends on your storage:


Option A: Cloud snapshots

  • Reattach volumes to nodes
  • Ensure PVCs bind correctly

Option B: Velero restore

velero restore create --from-backup <backup-name>
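Whichever option you use, verify that every claim actually bound. `unbound_pvcs` is a hypothetical filter assuming the default `oc get pvc -A` column order (NAMESPACE NAME STATUS ...):

```shell
# unbound_pvcs: read `oc get pvc -A` output on stdin and print
# "namespace/name" for every claim whose STATUS column is not "Bound".
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2 }'
}

# Usage after restoring volumes:
#   oc get pvc -A | unbound_pvcs    # empty output = all claims bound
```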

Phase 7: Restore applications

If using Velero:

velero restore create full-restore \
  --from-backup <backup-name>

Or restore specific namespaces with the --include-namespaces flag.


Phase 8: Fix external dependencies

Update:

  • DNS records → point to new cluster
  • Load balancers
  • external endpoints

Phase 9: Validate everything

Check:

oc get pods -A
oc get routes
oc get pvc

Test:

  • apps
  • APIs
  • DB connections

Critical validations

  • Routes accessible?
  • Storage attached?
  • Operators healthy?
  • No pending pods?
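The pending-pods check can be scripted. `problem_pods` is a hypothetical filter over `oc get pods -A` output (default columns NAMESPACE NAME READY STATUS RESTARTS AGE):

```shell
# problem_pods: read `oc get pods -A` output on stdin and print pods whose
# STATUS is neither Running nor Completed (e.g. Pending, CrashLoopBackOff).
problem_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage:
#   oc get pods -A | problem_pods   # empty output = no stuck pods
```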

Real-world gotchas

1. Storage mismatch

  • PVs not reattached → apps broken

2. Cluster identity conflicts

  • certs or DNS mismatch

3. Missing secrets

  • if not restored properly

4. External integrations broken

  • IAM, APIs, webhooks

Key rule

etcd restore does NOT restore:

  • actual storage data (unless snapshots)
  • external systems

Full DR architecture

        Backup Side
        ├── etcd snapshots
        ├── Velero backups
        └── PV snapshots

        Disaster Happens ❌

        Recovery Side
        ├── New OpenShift cluster
        ├── etcd restore
        ├── PV restore
        └── Velero restore


DR maturity levels

Basic

  • etcd backup only

Good

  • etcd + Velero

Enterprise

  • cross-region backups
  • automated rebuild scripts
  • tested DR runbooks

Pro tips (real-world)

  • Automate cluster install (Terraform + Ansible)
  • Keep backups in separate account/region
  • Test full rebuild quarterly
  • Document EVERYTHING
  • Use same infra templates

Final takeaway

Worst-case DR requires:

  1. Rebuild cluster
  2. Restore etcd (cluster brain)
  3. Restore data (storage)
  4. Restore apps

Miss any one → recovery incomplete

