This is the worst-case disaster recovery scenario for
OpenShift Container Platform:
❌ Entire cluster is gone (infra failure, region loss, etc.)
❌ You must rebuild from scratch
❌ Then restore everything
This is what real DR plans are built around.
Goal
Rebuild cluster → restore:
- etcd (cluster state)
- apps (via Velero)
- persistent data
High-level flow
New infra → Install OpenShift → Restore etcd → Restore apps → Validate
Phase 1: Rebuild infrastructure
Recreate:
- VMs / instances
- networking (VPC, subnets, LB)
- DNS records
The new infrastructure must match the original cluster as closely as possible.
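Before reinstalling, it helps to confirm the recreated DNS records actually resolve. A minimal sketch, assuming the standard OpenShift record names (api, api-int, *.apps) and a placeholder domain:

```shell
# Sketch: verify the standard OpenShift DNS records resolve before reinstalling.
# The domain passed in is a placeholder; substitute your own base domain.
check_dns() {
  local domain="$1" rec missing=0
  for rec in "api.${domain}" "api-int.${domain}" "test.apps.${domain}"; do
    if [ -z "$(dig +short "${rec}")" ]; then
      echo "MISSING: ${rec}"
      missing=1
    else
      echo "OK: ${rec}"
    fi
  done
  return "${missing}"
}
# Example: check_dns mycluster.example.com
```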
Phase 2: Reinstall OpenShift
Use the installer (IPI or UPI):
openshift-install create cluster
Important:
- Same version (or compatible)
- Same cluster name if possible
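The reinstall step can be sketched as a small wrapper, assuming you kept the original install-config.yaml and that openshift-install is on your PATH (paths and names below are placeholders):

```shell
# Sketch: reinstall with the same (or a compatible) version, reusing the saved
# install-config from the original cluster. Paths are placeholders.
reinstall_cluster() {
  local dir="$1" config="$2"
  mkdir -p "${dir}"
  cp "${config}" "${dir}/install-config.yaml"
  openshift-install version          # confirm it matches the original version
  openshift-install create cluster --dir "${dir}"
}
# Example: reinstall_cluster ./recovery-cluster ./backups/install-config.yaml
```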
STOP POINT (very important)
Do NOT start workloads or make changes yet.
You will overwrite this cluster's state with the etcd restore.
Phase 3: Copy backup to control plane
Move your backup to a master node:
scp -r backup core@<master>:/home/core/
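Before copying, it is worth sanity-checking the backup directory. cluster-backup.sh produces an etcd snapshot plus a static-resources archive; a minimal check, assuming those standard file names:

```shell
# Sketch: sanity-check the backup directory before copying it to a master.
# cluster-backup.sh produces a snapshot_*.db and a static_kuberesources_*.tar.gz.
verify_backup_dir() {
  local dir="$1"
  ls "${dir}"/snapshot_*.db >/dev/null 2>&1 \
    || { echo "missing etcd snapshot"; return 1; }
  ls "${dir}"/static_kuberesources_*.tar.gz >/dev/null 2>&1 \
    || { echo "missing static resources archive"; return 1; }
  echo "backup dir looks complete"
}
```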
Phase 4: Restore etcd
SSH into master:
ssh core@<master>
sudo -i
Run:
/usr/local/bin/cluster-restore.sh /home/core/backup-dir
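After the restore script completes, the documented procedure typically has you restart kubelet on the control plane hosts and approve pending CSRs. A hedged sketch (check the docs for your exact OpenShift version, and review CSRs before approving anything in production):

```shell
# Sketch: typical follow-up after cluster-restore.sh. Run on each control
# plane host:
restart_kubelet() {
  sudo systemctl restart kubelet.service
}

# Then, from a workstation with oc access, approve pending CSRs.
# CAUTION: review CSRs before blanket-approving in a production cluster.
approve_pending_csrs() {
  local csr
  for csr in $(oc get csr -o name); do
    oc adm certificate approve "${csr}"
  done
}
```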
What happens now
- The new cluster's API state is replaced
- etcd is restored to the old cluster's state
- The control plane is reconfigured
Cluster identity becomes the OLD cluster's
Phase 5: Wait for API recovery
oc get nodes
Then:
oc get co
Wait until stable.
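The wait can be automated with a simple polling loop over the ClusterOperator status columns (AVAILABLE / PROGRESSING / DEGRADED); a sketch, assuming oc is logged in to the restored cluster:

```shell
# Sketch: poll until every ClusterOperator reports Available=True,
# Progressing=False, Degraded=False. Retry count and interval are examples.
wait_for_operators() {
  local tries="${1:-60}"
  while [ "${tries}" -gt 0 ]; do
    if oc get clusteroperators --no-headers \
        | awk '{ if ($3 != "True" || $4 != "False" || $5 != "False") bad=1 } END { exit bad }'; then
      echo "all cluster operators stable"
      return 0
    fi
    echo "operators not yet stable, retrying..."
    sleep 30
    tries=$((tries - 1))
  done
  return 1
}
```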
At this point
You now have:
- Old cluster config
- Old namespaces
- Old objects
BUT:
Persistent volumes may still be missing
Phase 6: Restore persistent volumes
Depends on your storage:
Option A: Cloud snapshots
- Reattach volumes to nodes
- Ensure PVCs bind correctly
Option B: Velero restore
velero restore create --from-backup <backup-name>
Phase 7: Restore applications
If using Velero:
velero restore create full-restore \
  --from-backup <backup-name>
Or restore specific namespaces.
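A namespace-scoped restore can be sketched with Velero's --include-namespaces flag (the backup, namespace, and restore names below are placeholders):

```shell
# Sketch: restore only selected namespaces from a Velero backup.
# --include-namespaces takes a comma-separated list of namespaces.
restore_namespaces() {
  local backup="$1" namespaces="$2" name="$3"
  velero restore create "${name}" \
    --from-backup "${backup}" \
    --include-namespaces "${namespaces}"
}
# Example: restore_namespaces nightly-backup my-app,my-app-db app-restore
```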
Phase 8: Fix external dependencies
Update:
- DNS records → point to new cluster
- Load balancers
- external endpoints
Phase 9: Validate everything
Check:
oc get pods -A
oc get routes
oc get pvc
Test:
- apps
- APIs
- DB connections
Critical validations
- Routes accessible?
- Storage attached?
- Operators healthy?
- No pending pods?
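The validations above can be partially scripted; a minimal sketch that counts Pending pods and flags unbound PVCs (assumes oc is logged in):

```shell
# Sketch: post-restore smoke checks. Returns non-zero if any pod is Pending.
validate_cluster() {
  local pending
  pending=$(oc get pods -A --field-selector=status.phase=Pending --no-headers 2>/dev/null | wc -l | tr -d ' ')
  echo "pending pods: ${pending}"
  # PVC STATUS is the third column; anything not Bound is suspicious.
  oc get pvc -A --no-headers | awk '$3 != "Bound" { print "unbound PVC:", $1 "/" $2 }'
  [ "${pending}" -eq 0 ]
}
```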
Real-world gotchas
1. Storage mismatch
- PVs not reattached → apps broken
2. Cluster identity conflicts
- certs or DNS mismatch
3. Missing secrets
- if not restored properly
4. External integrations broken
- IAM, APIs, webhooks
Key rule
etcd restore does NOT restore:
- actual storage data (unless you have volume snapshots)
- external systems
Full DR architecture
Backup Side
├── etcd snapshots
├── Velero backups
└── PV snapshots
Disaster Happens ❌
Recovery Side
├── New OpenShift cluster
├── etcd restore
├── PV restore
└── Velero restore
DR maturity levels
Basic
- etcd backup only
Good
- etcd + Velero
Enterprise
- cross-region backups
- automated rebuild scripts
- tested DR runbooks
Pro tips (real-world)
- Automate cluster install (Terraform + Ansible)
- Keep backups in separate account/region
- Test full rebuild quarterly
- Document EVERYTHING
- Use same infra templates
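One of the tips above, keeping backups current, can be automated with a Velero schedule; a sketch with example name, cron expression, and retention:

```shell
# Sketch: a recurring Velero backup with a 30-day (720h) retention window.
# The schedule name and cron expression are examples; adjust to your RPO.
schedule_backups() {
  velero schedule create daily-cluster-backup \
    --schedule "0 2 * * *" \
    --ttl 720h
}
```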
Final takeaway
Worst-case DR requires:
- Rebuild cluster
- Restore etcd (cluster brain)
- Restore data (storage)
- Restore apps
Miss any one → recovery incomplete