Best Practices for Restoring OpenShift etcd Backups

This is a full disaster recovery (DR) scenario for
OpenShift Container Platform.

We’ll simulate a control plane failure and walk through a full etcd restore.


Scenario (real outage)

❌ Cluster API is down
oc get pods → fails
❌ Masters are unhealthy
❌ etcd is corrupted or lost

👉 You must restore etcd from backup
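
Before touching anything, confirm the outage pattern above. Here is a minimal triage sketch — the "probe" helper is made up for this post, and the probe commands below are stand-ins; on a real cluster you would pass something like "oc get nodes --request-timeout=5s":

```shell
# Hypothetical triage helper: run each probe under a timeout and report
# pass/fail, so you can confirm the outage before deciding to restore.
probe() {
  name=$1; shift
  if timeout 5 "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
}

# Stand-in probes for the demo; replace with real oc/crictl checks.
probe "API reachable (stand-in)" true
probe "etcd healthy (stand-in)"  false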


What you’re restoring

  • Entire cluster state:
    • deployments
    • services
    • secrets
    • configs
    • routes

This is the cluster's brain.


Before you start (critical warnings)

  • This is a destructive operation
  • Cluster will roll back to backup time
  • New objects after backup → LOST
  • Must run on control plane node
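
The "must run on a control plane node" warning can be enforced with a small pre-flight check. This is a sketch, not an OpenShift tool: control plane nodes carry static pod manifests under /etc/kubernetes/manifests (including etcd-pod.yaml), so their presence plus root privileges is a reasonable gate. The manifest directory is a parameter only so the demo can run anywhere:

```shell
# Hypothetical pre-flight gate for the warnings above. On a real node you
# would call it with no argument so it checks /etc/kubernetes/manifests.
preflight() {
  manifest_dir=${1:-/etc/kubernetes/manifests}
  # Worker nodes have no etcd static pod manifest.
  if [ ! -e "$manifest_dir/etcd-pod.yaml" ]; then
    echo "no etcd static pod manifest found: this looks like a worker node"
    return 1
  fi
  # The restore script must run as root.
  if [ "$(id -u)" -ne 0 ]; then
    echo "not root: run 'sudo -i' first"
    return 1
  fi
  echo "preflight ok"
}
```
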

Step-by-step etcd restore


Step 1: Access a master node

SSH into any control plane node:

ssh core@<master-node>

Step 2: Become root

sudo -i

Step 3: Locate your backup

Example:

/home/core/backup/etcd-2026-04-20_120000/

Inside you should see:

  • snapshot.db
  • static_kuberesources_*.tar.gz
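
A quick sanity check on the backup layout can save a failed restore run. This helper is hypothetical (not part of OpenShift); it just verifies the two files listed above exist, and the demo builds a throwaway directory rather than touching a real backup:

```shell
# Sketch: verify a backup directory has the expected etcd backup layout.
check_backup() {
  dir=$1
  [ -f "$dir/snapshot.db" ] || { echo "missing snapshot.db"; return 1; }
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1 \
    || { echo "missing static_kuberesources_*.tar.gz"; return 1; }
  echo "backup layout ok: $dir"
}

# Demo against a temporary directory (not a real backup):
demo=$(mktemp -d)
touch "$demo/snapshot.db" "$demo/static_kuberesources_2026-04-20_120000.tar.gz"
check_backup "$demo"
rm -rf "$demo"
```
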

Step 4: Run restore script

OpenShift provides a built-in script:

/usr/local/bin/cluster-restore.sh /path/to/backup

Example:

/usr/local/bin/cluster-restore.sh /home/core/backup/etcd-2026-04-20_120000

What this does internally

  • Stops kube-apiserver
  • Stops etcd
  • Restores snapshot
  • Rewrites static pod manifests
  • Restarts control plane

Step 5: Wait for control plane recovery

Check status:

crictl ps | grep etcd
crictl ps | grep kube-apiserver
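
Rather than re-running crictl by hand, you can poll. The wait_for helper below is made up for this post; on the recovery node you would pass the real check as a single command, e.g. sh -c 'crictl ps | grep -q etcd'. The demo uses "true" as a stand-in check:

```shell
# Sketch: retry a check command until it succeeds or the budget runs out.
wait_for() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $((i + 1)) check(s)"; return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "gave up after $tries checks"; return 1
}

# Demo with a stand-in check that passes immediately:
wait_for 3 true
```
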

Step 6: Exit the node + test API

From your workstation:

oc get nodes

If successful → API is back


Step 7: Verify cluster health

oc get co

Wait until:

  • All operators → Available=True
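
"All operators Available=True" can be checked mechanically instead of by eyeball. Sketch only: the awk filter reads column 3 (AVAILABLE) of oc get co table output, and SAMPLE is fabricated output shaped like the real table so the logic can run without a cluster:

```shell
# Sketch: succeed only if every cluster operator reports AVAILABLE=True.
all_available() {
  # Skip the header row; column 3 of "oc get co" is AVAILABLE.
  awk 'NR > 1 && $3 != "True" { bad++ } END { exit (bad > 0) }'
}

# Fabricated sample of "oc get co" output for the demo:
SAMPLE='NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication   4.16.0    True        False         False
etcd             4.16.0    True        False         False'

if printf '%s\n' "$SAMPLE" | all_available; then
  echo "all operators available"
else
  echo "still waiting on operators"
fi
```

On a live cluster you would pipe the real command in: oc get co | all_available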

Step 8: Check workloads

oc get pods -A

Expect:

  • Pods recreated from restored state

What just happened (important)

You restored:

  • etcd database
  • static control plane resources

Cluster is now exactly as it was at backup time


Real-world failure patterns

Case 1: etcd corruption

  • API down
  • restore fixes everything

Case 2: accidental deletion

  • namespace deleted
  • restore brings it back

Case 3: upgrade failure

  • rollback using backup

Post-restore actions (VERY important)

1. Re-sync nodes

Nodes can take a few minutes to re-register and report Ready:

oc get nodes

2. Check certificates

oc get csr

Approve any pending CSRs:

oc adm certificate approve <csr_name>


3. Verify operators

oc get co

4. Validate apps

  • routes working
  • DB connections ok
  • storage attached

Common mistakes

❌ Running restore on worker node
❌ Using wrong backup directory
❌ Running the restore on more than one master (use a single recovery node)
❌ Ignoring operator health after restore
❌ Not testing restore beforehand


Visual flow (mental model)

Failure → etcd corrupted
SSH to master
Run cluster-restore.sh
Restore snapshot
Control plane restarts
API comes back
Operators reconcile
Cluster stable

Pro tips (from real incidents)

  • Always keep multiple backups
  • Store backups off-cluster
  • Test restore quarterly
  • Automate backup + retention
  • Document exact restore steps
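
The "automate backup + retention" tip can be done in a few lines. This is a sketch only: it assumes backup directory names sort chronologically, as the etcd-YYYY-MM-DD_HHMMSS names used earlier do, and the demo prunes a throwaway tree instead of real backups:

```shell
# Sketch: keep only the newest $2 backup directories under $1.
# Assumes names like etcd-2026-04-20_120000, which sort by date.
prune_backups() {
  root=$1; keep=$2
  ls -1d "$root"/etcd-* 2>/dev/null | sort | head -n -"$keep" |
  while IFS= read -r old; do
    echo "pruning $old"
    rm -rf "$old"
  done
}

# Demo against a throwaway tree:
demo=$(mktemp -d)
mkdir "$demo/etcd-2026-04-18_120000" "$demo/etcd-2026-04-19_120000" \
      "$demo/etcd-2026-04-20_120000"
prune_backups "$demo" 2
ls "$demo"
rm -rf "$demo"
```

Run it from cron or a systemd timer right after each backup job.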

Final takeaway

  • etcd restore = full cluster rollback
  • Fastest way to recover catastrophic failure
  • Must be practiced before real outage
