Disaster Recovery for OpenShift: A Step-by-Step Guide

This is the worst-case disaster recovery scenario for
OpenShift Container Platform:

❌ Entire cluster is gone (infra failure, region loss, etc.)
❌ You must rebuild from scratch
❌ Then restore everything

This is what real DR plans are built around.


Goal

Rebuild cluster → restore:

  • etcd (cluster state)
  • apps (via Velero)
  • persistent data

High-level flow

New infra → Install OpenShift → Restore etcd → Restore apps → Validate

Phase 1: Rebuild infrastructure

Recreate:

  • VMs / instances
  • networking (VPC, subnets, LB)
  • DNS records

The new environment must match the original cluster as closely as possible (instance sizes, network layout, hostnames, DNS names).


Phase 2: Reinstall OpenShift

Use the installer (IPI shown here; UPI follows its own provisioning flow):

openshift-install create cluster

Important:

  • Same OpenShift version as the backup (restoring etcd across versions is unsupported)
  • Same cluster name if possible
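For reference, a minimal install-config.yaml sketch. All values below are placeholders; whenever possible, reuse the original cluster's backed-up install-config so names and networks match:

```yaml
# Hypothetical install-config.yaml sketch; substitute your own values,
# or better, restore the original file from your backups.
apiVersion: v1
baseDomain: example.com        # must match the original DNS domain
metadata:
  name: prod-cluster           # same cluster name as before, if possible
platform:
  aws:                         # example platform; use your own
    region: us-east-1
pullSecret: '...'
sshKey: '...'
```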

STOP POINT (very important)

Do NOT start workloads or make any changes yet.

You are about to overwrite this cluster's state with the etcd restore.


Phase 3: Copy backup to control plane

Copy your etcd backup to one of the control plane (master) nodes; that node becomes the recovery host:

scp -r backup core@<master>:/home/core/
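Before running the scp above, it's worth sanity-checking the backup directory. cluster-backup.sh (the standard OpenShift backup script) produces an etcd snapshot plus a static-pod resources archive; this sketch (`check_backup_dir` is a hypothetical helper) verifies both are present:

```shell
# check_backup_dir: sanity-check an etcd backup directory before copying
# it to the recovery host. cluster-backup.sh produces two files:
#   snapshot_<timestamp>.db                   (the etcd snapshot)
#   static_kuberesources_<timestamp>.tar.gz   (static pod resources)
check_backup_dir() {
  local dir="$1"
  ls "$dir"/snapshot_*.db >/dev/null 2>&1 &&
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1
}

# Usage:
#   check_backup_dir ./backup && scp -r ./backup core@<master>:/home/core/
```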

Phase 4: Restore etcd

SSH into master:

ssh core@<master>
sudo -i

Run:

/usr/local/bin/cluster-restore.sh /home/core/backup-dir

What happens now

  • The freshly installed cluster's etcd data is thrown away and replaced
  • etcd is rolled back to the backed-up state
  • The control plane restarts its static pods with the restored data

Cluster identity becomes the OLD cluster
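After cluster-restore.sh completes, the documented follow-up is to restart kubelet on every control plane node and approve any pending certificate signing requests so nodes can rejoin. `pending_csrs` below is a hypothetical helper that assumes the default `oc get csr` column layout (CONDITION last):

```shell
# pending_csrs: read `oc get csr` output on stdin and print the names of
# CSRs whose CONDITION column (the last field) is "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# On the recovered cluster:
#   sudo systemctl restart kubelet.service   # on every control plane node
#   oc get csr | pending_csrs | xargs -r -n1 oc adm certificate approve
```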


Phase 5: Wait for API recovery

oc get nodes

Then:

oc get co

Wait until all cluster operators report Available=True, Progressing=False, and Degraded=False.
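A small helper can automate the waiting. `unhealthy_cos` is a hypothetical filter assuming the default `oc get co` columns (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE):

```shell
# unhealthy_cos: read `oc get co` output on stdin and print operators that
# are not yet healthy (AVAILABLE != True, PROGRESSING != False, or
# DEGRADED != False).
unhealthy_cos() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Usage: keep polling until nothing is printed:
#   until [ -z "$(oc get co | unhealthy_cos)" ]; do sleep 30; done
```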


At this point

You now have:

  • Old cluster config
  • Old namespaces
  • Old objects

BUT:
Persistent volumes may still be missing


Phase 6: Restore persistent volumes

Depends on your storage:


Option A: Cloud snapshots

  • Reattach volumes to nodes
  • Ensure PVCs bind correctly

Option B: Velero restore

velero restore create --from-backup <backup-name>
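Whichever option you use, verify that every claim actually bound. `unbound_pvcs` is a hypothetical filter assuming the default `oc get pvc -A` column order (NAMESPACE NAME STATUS ...):

```shell
# unbound_pvcs: read `oc get pvc -A` output on stdin and print
# "namespace/name" for every claim whose STATUS column is not "Bound".
unbound_pvcs() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2 }'
}

# Usage after restoring volumes:
#   oc get pvc -A | unbound_pvcs    # empty output = all claims bound
```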

Phase 7: Restore applications

If using Velero:

velero restore create full-restore \
  --from-backup <backup-name>

Or restore specific namespaces with the --include-namespaces flag.


Phase 8: Fix external dependencies

Update:

  • DNS records → point to new cluster
  • Load balancers
  • external endpoints

Phase 9: Validate everything

Check:

oc get pods -A
oc get routes
oc get pvc

Test:

  • apps
  • APIs
  • DB connections

Critical validations

  • Routes accessible?
  • Storage attached?
  • Operators healthy?
  • No pending pods?
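The pending-pods check can be scripted. `problem_pods` is a hypothetical filter over `oc get pods -A` output (default columns NAMESPACE NAME READY STATUS RESTARTS AGE):

```shell
# problem_pods: read `oc get pods -A` output on stdin and print pods whose
# STATUS is neither Running nor Completed (e.g. Pending, CrashLoopBackOff).
problem_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage:
#   oc get pods -A | problem_pods   # empty output = no stuck pods
```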

Real-world gotchas

1. Storage mismatch

  • PVs not reattached → apps broken

2. Cluster identity conflicts

  • certs or DNS mismatch

3. Missing secrets

  • if not restored properly

4. External integrations broken

  • IAM, APIs, webhooks

Key rule

etcd restore does NOT restore:

  • actual storage data (unless snapshots)
  • external systems

Full DR architecture

        Backup Side
        ├── etcd snapshots
        ├── Velero backups
        └── PV snapshots

        Disaster Happens ❌

        Recovery Side
        ├── New OpenShift cluster
        ├── etcd restore
        ├── PV restore
        └── Velero restore


DR maturity levels

Basic

  • etcd backup only

Good

  • etcd + Velero

Enterprise

  • cross-region backups
  • automated rebuild scripts
  • tested DR runbooks

Pro tips (real-world)

  • Automate cluster install (Terraform + Ansible)
  • Keep backups in separate account/region
  • Test full rebuild quarterly
  • Document EVERYTHING
  • Use same infra templates

Final takeaway

Worst-case DR requires:

  1. Rebuild cluster
  2. Restore etcd (cluster brain)
  3. Restore data (storage)
  4. Restore apps

Miss any one → recovery incomplete

