This is the real-deal disaster-recovery (DR) scenario for OpenShift Container Platform.
We’ll simulate a control plane failure and walk through a full etcd restore.
Scenario (real outage)
❌ Cluster API is down
❌ oc get pods → fails
❌ Masters are unhealthy
❌ etcd is corrupted or lost
👉 You must restore etcd from backup
What you’re restoring
- Entire cluster state:
- deployments
- services
- secrets
- configs
- routes
This is the cluster’s brain.
Before you start (critical warnings)
- This is a destructive operation
- Cluster will roll back to backup time
- New objects after backup → LOST
- Must run on control plane node
Step-by-step etcd restore
Step 1: Access a master node
SSH into any control plane node:
ssh core@<master-node>
Step 2: Become root
sudo -i
Step 3: Locate your backup
Example:
/home/core/backup/etcd-2026-04-20_120000/
Inside you should see:
snapshot.db
static_kuberesources_*.tar.gz
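As a quick sanity check before running the restore, a small sketch like this can confirm both artifacts are present (the helper name is hypothetical; paths are examples):

```shell
# Hypothetical helper: verify an etcd backup directory contains both
# required artifacts before attempting a restore.
check_backup() {
  local dir=$1
  # The snapshot may be named snapshot.db or snapshot_<timestamp>.db.
  ls "$dir"/snapshot*.db >/dev/null 2>&1 \
    || { echo "missing etcd snapshot" >&2; return 1; }
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1 \
    || { echo "missing static resources archive" >&2; return 1; }
  echo "backup looks complete"
}
# Usage: check_backup /home/core/backup/etcd-2026-04-20_120000
```

Failing fast here is much cheaper than discovering a bad backup halfway through a restore.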
Step 4: Run restore script
OpenShift provides a built-in script:
/usr/local/bin/cluster-restore.sh /path/to/backup
Example:
/usr/local/bin/cluster-restore.sh /home/core/backup/etcd-2026-04-20_120000
What this does internally
- Stops kube-apiserver
- Stops etcd
- Restores snapshot
- Rewrites static pod manifests
- Restarts control plane
Step 5: Wait for control plane recovery
Check status:
crictl ps | grep etcd
crictl ps | grep kube-apiserver
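Rather than re-running those greps by hand, you can poll in a loop. This is a hedged sketch, not part of OpenShift; the helper name, retry count, and sleep interval are assumptions:

```shell
# Hypothetical helper: poll crictl until a named container is running.
# Run this on the control plane node itself; crictl is node-local.
wait_for() {
  local name=$1 tries=${2:-60}
  local i
  for i in $(seq 1 "$tries"); do
    if crictl ps | grep -q "$name"; then
      echo "$name is running"
      return 0
    fi
    sleep 10   # default: up to 60 tries x 10s = 10 minutes
  done
  echo "$name did not come up" >&2
  return 1
}
# Usage: wait_for etcd && wait_for kube-apiserver
```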
Step 6: Exit the node + test API
From your workstation:
oc get nodes
If successful → API is back
Step 7: Verify cluster health
oc get co
Wait until:
- All operators → Available=True
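Instead of watching oc get co manually, oc’s built-in wait can block until every operator reports Available. A minimal sketch, assuming oc is logged in with cluster-admin (the wrapper name and default timeout are illustrative):

```shell
# Block until every cluster operator reports Available=True.
wait_for_operators() {
  oc wait clusteroperators --all \
    --for=condition=Available=True \
    --timeout="${1:-15m}"
}
# Usage: wait_for_operators 30m
```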
Step 8: Check workloads
oc get pods -A
Expect:
- Pods recreated from restored state
What just happened (important)
You restored:
- etcd database
- static control plane resources
The cluster is now exactly as it was at backup time.
Real-world failure patterns
Case 1: etcd corruption
- API is down
- a restore brings the control plane back
Case 2: accidental deletion
- a namespace was deleted
- restore brings it back (everything else rolls back too)
Case 3: upgrade failure
- roll back using a pre-upgrade backup
Post-restore actions (VERY important)
1. Re-sync nodes
Sometimes nodes need time:
oc get nodes
2. Check certificates
oc get csr
Approve if needed.
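Pending CSRs can be approved in a loop. This sketch approves everything currently listed, which is only reasonable immediately after a restore you performed yourself (the helper name is hypothetical):

```shell
# Hypothetical helper: approve every CSR currently listed.
# Review 'oc get csr' output first before running this blindly.
approve_pending_csrs() {
  oc get csr -o name | while read -r csr; do
    oc adm certificate approve "$csr"
  done
}
```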
3. Verify operators
oc get co
4. Validate apps
- routes working
- DB connections ok
- storage attached
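For the route check, a minimal smoke test can go a long way; this assumes curl is available, and the hostnames in the usage example are placeholders for your own routes:

```shell
# Print only the HTTP status for a route; 000 means unreachable.
check_route() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "https://$1" || true
}
# Usage:
#   for host in app1.apps.example.com app2.apps.example.com; do
#     echo "$host -> $(check_route "$host")"
#   done
```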
Common mistakes
❌ Running restore on worker node
❌ Using wrong backup directory
❌ Running the restore on more than one master (run it on a single recovery node only)
❌ Ignoring operator health after restore
❌ Not testing restore beforehand
Visual flow (mental model)
Failure → etcd corrupted
  ↓
SSH to master
  ↓
Run cluster-restore.sh
  ↓
Restore snapshot
  ↓
Control plane restarts
  ↓
API comes back
  ↓
Operators reconcile
  ↓
Cluster stable
Pro tips (from real incidents)
- Always keep multiple backups
- Store backups off-cluster
- Test restore quarterly
- Automate backup + retention
- Document exact restore steps
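The automation and retention tips can be combined into a small cron-driven job. This sketch assumes the standard /usr/local/bin/cluster-backup.sh shipped on control plane nodes; the paths, script name in the cron comment, and retention count are examples:

```shell
# Example backup + retention job. Run from root's crontab on one
# control plane node, e.g.: 0 2 * * * /usr/local/bin/etcd-backup-job.sh
BACKUP_ROOT=/home/core/backup
KEEP=7

take_backup() {
  # cluster-backup.sh writes the snapshot and the static resources
  # archive into the directory you pass it.
  local dest="$BACKUP_ROOT/etcd-$(date +%F_%H%M%S)"
  mkdir -p "$dest"
  /usr/local/bin/cluster-backup.sh "$dest"
}

prune_backups() {
  # Keep only the $KEEP newest backup directories (by mtime).
  ls -1dt "$BACKUP_ROOT"/etcd-* 2>/dev/null \
    | tail -n +"$((KEEP + 1))" \
    | xargs -r rm -rf
}
```

Remember to copy the resulting directories off-cluster; a backup that lives only on the node it protects is not a backup.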
Final takeaway
- etcd restore = full cluster rollback
- Fastest way to recover catastrophic failure
- Must be practiced before real outage