Best Practices for Restoring OpenShift etcd Backups

This is a full disaster recovery (DR) scenario for
OpenShift Container Platform.

We’ll simulate a control plane failure and walk through a full etcd restore.


Scenario (real outage)

❌ Cluster API is down
oc get pods → fails
❌ Masters are unhealthy
❌ etcd is corrupted or lost

👉 You must restore etcd from backup
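
Before touching anything, confirm the outage pattern above. Here is a minimal triage sketch — the "probe" helper is made up for this post, and the probe commands below are stand-ins; on a real cluster you would pass something like "oc get nodes --request-timeout=5s":

```shell
# Hypothetical triage helper: run each probe under a timeout and report
# pass/fail, so you can confirm the outage before deciding to restore.
probe() {
  name=$1; shift
  if timeout 5 "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
}

# Stand-in probes for the demo; replace with real oc/crictl checks.
probe "API reachable (stand-in)" true
probe "etcd healthy (stand-in)"  false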


What you’re restoring

  • Entire cluster state:
    • deployments
    • services
    • secrets
    • configs
    • routes

This is the cluster's brain.


Before you start (critical warnings)

  • This is a destructive operation
  • Cluster will roll back to backup time
  • New objects after backup → LOST
  • Must run on control plane node
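
The "must run on a control plane node" warning can be enforced with a small pre-flight check. This is a sketch, not an OpenShift tool: control plane nodes carry static pod manifests under /etc/kubernetes/manifests (including etcd-pod.yaml), so their presence plus root privileges is a reasonable gate. The manifest directory is a parameter only so the demo can run anywhere:

```shell
# Hypothetical pre-flight gate for the warnings above. On a real node you
# would call it with no argument so it checks /etc/kubernetes/manifests.
preflight() {
  manifest_dir=${1:-/etc/kubernetes/manifests}
  # Worker nodes have no etcd static pod manifest.
  if [ ! -e "$manifest_dir/etcd-pod.yaml" ]; then
    echo "no etcd static pod manifest found: this looks like a worker node"
    return 1
  fi
  # The restore script must run as root.
  if [ "$(id -u)" -ne 0 ]; then
    echo "not root: run 'sudo -i' first"
    return 1
  fi
  echo "preflight ok"
}
```
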

Step-by-step etcd restore


Step 1: Access a master node

SSH into any control plane node:

ssh core@<master-node>

Step 2: Become root

sudo -i

Step 3: Locate your backup

Example:

/home/core/backup/etcd-2026-04-20_120000/

Inside you should see:

  • snapshot.db
  • static_kuberesources_*.tar.gz
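
A quick sanity check on the backup layout can save a failed restore run. This helper is hypothetical (not part of OpenShift); it just verifies the two files listed above exist, and the demo builds a throwaway directory rather than touching a real backup:

```shell
# Sketch: verify a backup directory has the expected etcd backup layout.
check_backup() {
  dir=$1
  [ -f "$dir/snapshot.db" ] || { echo "missing snapshot.db"; return 1; }
  ls "$dir"/static_kuberesources_*.tar.gz >/dev/null 2>&1 \
    || { echo "missing static_kuberesources_*.tar.gz"; return 1; }
  echo "backup layout ok: $dir"
}

# Demo against a temporary directory (not a real backup):
demo=$(mktemp -d)
touch "$demo/snapshot.db" "$demo/static_kuberesources_2026-04-20_120000.tar.gz"
check_backup "$demo"
rm -rf "$demo"
```
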

Step 4: Run restore script

OpenShift provides a built-in script:

/usr/local/bin/cluster-restore.sh /path/to/backup

Example:

/usr/local/bin/cluster-restore.sh /home/core/backup/etcd-2026-04-20_120000

What this does internally

  • Stops kube-apiserver
  • Stops etcd
  • Restores snapshot
  • Rewrites static pod manifests
  • Restarts control plane

Step 5: Wait for control plane recovery

Check status:

crictl ps | grep etcd
crictl ps | grep kube-apiserver
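
Rather than re-running crictl by hand, you can poll. The wait_for helper below is made up for this post; on the recovery node you would pass the real check as a single command, e.g. sh -c 'crictl ps | grep -q etcd'. The demo uses "true" as a stand-in check:

```shell
# Sketch: retry a check command until it succeeds or the budget runs out.
wait_for() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $((i + 1)) check(s)"; return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "gave up after $tries checks"; return 1
}

# Demo with a stand-in check that passes immediately:
wait_for 3 true
```
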

Step 6: Exit the node + test API

From your workstation:

oc get nodes

If successful → API is back


Step 7: Verify cluster health

oc get co

Wait until:

  • All operators → Available=True
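
"All operators Available=True" can be checked mechanically instead of by eyeball. Sketch only: the awk filter reads column 3 (AVAILABLE) of oc get co table output, and SAMPLE is fabricated output shaped like the real table so the logic can run without a cluster:

```shell
# Sketch: succeed only if every cluster operator reports AVAILABLE=True.
all_available() {
  # Skip the header row; column 3 of "oc get co" is AVAILABLE.
  awk 'NR > 1 && $3 != "True" { bad++ } END { exit (bad > 0) }'
}

# Fabricated sample of "oc get co" output for the demo:
SAMPLE='NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED
authentication   4.16.0    True        False         False
etcd             4.16.0    True        False         False'

if printf '%s\n' "$SAMPLE" | all_available; then
  echo "all operators available"
else
  echo "still waiting on operators"
fi
```

On a live cluster you would pipe the real command in: oc get co | all_available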

Step 8: Check workloads

oc get pods -A

Expect:

  • Pods recreated from restored state

What just happened (important)

You restored:

  • etcd database
  • static control plane resources

Cluster is now exactly as it was at backup time


Real-world failure patterns

Case 1: etcd corruption

  • API down
  • restore fixes everything

Case 2: accidental deletion

  • namespace deleted
  • restore brings it back

Case 3: upgrade failure

  • rollback using backup

Post-restore actions (VERY important)

1. Re-sync nodes

Nodes can take a few minutes to re-register and report Ready:

oc get nodes

2. Check certificates

oc get csr

Approve any pending CSRs:

oc adm certificate approve <csr_name>


3. Verify operators

oc get co

4. Validate apps

  • routes working
  • DB connections ok
  • storage attached

Common mistakes

❌ Running restore on worker node
❌ Using wrong backup directory
❌ Running the restore on more than one master (use a single recovery node)
❌ Ignoring operator health after restore
❌ Not testing restore beforehand


Visual flow (mental model)

Failure → etcd corrupted
SSH to master
Run cluster-restore.sh
Restore snapshot
Control plane restarts
API comes back
Operators reconcile
Cluster stable

Pro tips (from real incidents)

  • Always keep multiple backups
  • Store backups off-cluster
  • Test restore quarterly
  • Automate backup + retention
  • Document exact restore steps
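
The "automate backup + retention" tip can be done in a few lines. This is a sketch only: it assumes backup directory names sort chronologically, as the etcd-YYYY-MM-DD_HHMMSS names used earlier do, and the demo prunes a throwaway tree instead of real backups:

```shell
# Sketch: keep only the newest $2 backup directories under $1.
# Assumes names like etcd-2026-04-20_120000, which sort by date.
prune_backups() {
  root=$1; keep=$2
  ls -1d "$root"/etcd-* 2>/dev/null | sort | head -n -"$keep" |
  while IFS= read -r old; do
    echo "pruning $old"
    rm -rf "$old"
  done
}

# Demo against a throwaway tree:
demo=$(mktemp -d)
mkdir "$demo/etcd-2026-04-18_120000" "$demo/etcd-2026-04-19_120000" \
      "$demo/etcd-2026-04-20_120000"
prune_backups "$demo" 2
ls "$demo"
rm -rf "$demo"
```

Run it from cron or a systemd timer right after each backup job.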

Final takeaway

  • etcd restore = full cluster rollback
  • Fastest way to recover catastrophic failure
  • Must be practiced before real outage
