Top Best Practices for OpenShift Backup and Recovery

When you back up OpenShift Container Platform, the goal isn't just to take backups; it's to be sure you can recover. That means combining etcd backups, application backups, a storage strategy, and regular restore testing.

Here’s a clear, real-world best-practice guide 👇

1. Understand what you MUST back up

OpenShift has two critical layers:

Cluster state (control plane)

Stored in etcd
Includes:
- API objects (pods, deployments, routes, secrets, etc.)
- cluster config

Without this → the cluster cannot be brought back to its previous state

Application data

Persistent Volumes (databases, files)
App-specific configs

Without this → apps restore empty/broken

2. Back up etcd (CRITICAL)

Use OpenShift’s built-in backup:

			
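# Open a debug shell on a control plane node
oc debug node/<master-node>
# Inside the debug shell, switch to the host filesystem
chroot /host
# Write the etcd snapshot and static pod resources to the target directory
/usr/local/bin/cluster-backup.sh /backup/location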

Best practices:

Run daily (or more frequently)
Store backups off-cluster
Encrypt backups (they contain secrets!)
Keep multiple copies (rotation)
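Rotation is the easiest of these to get wrong, so here is a minimal sketch of it. It assumes each run has already been copied off-cluster into its own directory named backup_<UTC timestamp> (a naming convention made up for this example, not something cluster-backup.sh produces), and the function name prune_backups is hypothetical too.

```shell
# prune_backups DIR KEEP
# Delete all but the KEEP newest backup_* directories in DIR.
# Assumes directory names end in a sortable UTC timestamp, so a plain
# lexical sort is also a chronological sort.
prune_backups() (
  dir="$1"
  keep="$2"
  cd "$dir" || exit 1
  # Newest first, skip the first KEEP entries, delete the rest.
  ls -1d backup_* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -rf -- "$old"
    done
)
```

For example, prune_backups /mnt/etcd-backups 7 would keep the seven most recent runs. The path and the retention count are placeholders; pick them to match your own policy.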

3. Use Velero for app-level backups

Use Velero (on OpenShift, usually installed through the OADP Operator) for:

Namespaces
Kubernetes resources
Persistent volumes

Best practices:

Back up per namespace or app rather than always backing up the full cluster
Use labels:

velero backup create app-backup --selector app=myapp

Schedule backups:

velero schedule create daily --schedule="0 1 * * *"
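To combine the per-app and scheduling advice, a schedule can be scoped to one namespace and label and given a retention period. This is only a sketch: the function name, the "-daily" suffix, and the 7-day retention (--ttl 168h) are assumptions for illustration, and on an OADP install you may also need to point the CLI at the Velero namespace.

```shell
# create_app_schedule APP NAMESPACE
# Schedule a daily 01:00 backup of one app in one namespace.
# The schedule name and the 168h retention are illustrative choices.
create_app_schedule() (
  app="$1"
  ns="$2"
  velero schedule create "${app}-daily" \
    --schedule="0 1 * * *" \
    --include-namespaces "$ns" \
    --selector "app=${app}" \
    --ttl 168h
)
```

Calling create_app_schedule myapp myapp-prod would create a schedule named myapp-daily that only picks up resources labelled app=myapp in the myapp-prod namespace.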

4. Handle persistent volumes properly

Choose one strategy:

Option A: Storage snapshots

Fast; crash-consistent unless the application is quiesced first
Cloud-native (AWS EBS, etc.)

Option B: File-level backups (Restic via Velero)

Works everywhere
Slower but portable

👉 Best practice:

Use snapshots for databases
Use Velero + Restic for portability
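The two options map to different Velero backup flags. The sketch below wraps them in a hypothetical helper so the choice is explicit per backup. It assumes a recent Velero release, where the file-level flag is --default-volumes-to-fs-backup; older releases named it --default-volumes-to-restic, so check which one your version accepts.

```shell
# backup_with_volumes NAME NAMESPACE MODE
# MODE=snapshot asks Velero for storage snapshots of the volumes.
# MODE=fs asks Velero for file-level backups of the volumes.
backup_with_volumes() (
  name="$1"
  ns="$2"
  mode="$3"
  case "$mode" in
    snapshot) vol_flag="--snapshot-volumes=true" ;;
    fs)       vol_flag="--default-volumes-to-fs-backup" ;;
    *)        echo "unknown mode: $mode" >&2; exit 2 ;;
  esac
  velero backup create "$name" --include-namespaces "$ns" "$vol_flag"
)
```

For example, backup_with_volumes db-backup mydb snapshot for a database namespace, and backup_with_volumes files-backup myapp fs where the data has to move between storage providers. The backup and namespace names are placeholders.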

5. Back up important namespaces

Focus on:

openshift-* (critical configs you have customized)
app namespaces
operators (stateful ones)

Avoid blindly backing up everything unless you need to.

6. Secure your backups

Encrypt at rest (S3, etc.)
Restrict access (IAM roles)
Never expose etcd backups publicly

Remember:
an etcd backup contains ALL of the cluster's secrets

7. Test restores regularly (MOST IMPORTANT)

A backup is useless if the restore fails.

Test:

velero restore create --from-backup <backup-name>

Also test:

restoring the full cluster state from an etcd backup
namespace restore

Do this in a staging cluster
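A namespace restore test is safer when it lands in a scratch namespace instead of on top of the original. Here is one way that might look, assuming the backup contains the namespace you want to check. The function name, the "drill-" prefix, and the "-drill" suffix are made up for this example.

```shell
# restore_drill BACKUP SRC_NS
# Restore SRC_NS from BACKUP into "<SRC_NS>-drill" so the original
# namespace is left untouched, then print the restore's details.
restore_drill() (
  backup="$1"
  src_ns="$2"
  restore="drill-${backup}"
  velero restore create "$restore" \
    --from-backup "$backup" \
    --include-namespaces "$src_ns" \
    --namespace-mappings "${src_ns}:${src_ns}-drill" \
    --wait || exit 1
  velero restore describe "$restore"
)
```

After restore_drill app-backup myapp finishes, check that the pods in myapp-drill start and that the restored volumes actually contain data, then delete the scratch namespace.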

8. Use off-cluster storage

Never store backups only inside the cluster.

Use:

S3 / object storage
external NFS
backup systems

9. Define RPO / RTO

RPO (Recovery Point Objective): how much data loss you can tolerate
RTO (Recovery Time Objective): how quickly you must be back up and running

Example (the backup interval is your worst-case data loss):

etcd backup every 6 hours
app backup every 1 hour
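An RPO only helps if something checks it. A minimal sketch of that check: compare the age of the newest backup with the RPO and fail when it is exceeded. The function name is hypothetical, and how you obtain the timestamp of the last backup is left to your tooling.

```shell
# within_rpo LAST_BACKUP_EPOCH RPO_HOURS [NOW_EPOCH]
# Exit 0 if the newest backup is no older than the RPO, non-zero otherwise.
# Timestamps are seconds since the Unix epoch; NOW defaults to the current time.
within_rpo() (
  last="$1"
  rpo_hours="$2"
  now="${3:-$(date +%s)}"
  age=$((now - last))
  [ "$age" -le $((rpo_hours * 3600)) ]
)
```

With the example above, a monitoring job could run within_rpo "$last_etcd_backup" 6 and raise an alert when it fails, where last_etcd_backup is a variable you fill in yourself.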

10. Common mistakes to avoid

❌ Only backing up etcd
❌ Not backing up PV data
❌ Never testing restore
❌ Storing backups only inside the cluster
❌ No encryption
❌ Backing up everything blindly (slow, noisy)

Recommended architecture

        OpenShift Cluster
        ├── etcd backup → secure storage (daily)
        ├── Velero backups → object storage
        └── PV snapshots → cloud storage

        External Storage
        ├── S3 bucket
        ├── encrypted + versioned
        └── lifecycle policies

Pro-level setup (what enterprises do)

Velero + S3 + IAM roles
Automated schedules (hourly/daily)
Separate backup account
Cross-region replication
Periodic DR drills

Key takeaway

etcd backup = cluster brain
Velero backup = workloads
PV backup = actual data

You need all three for real disaster recovery.

Infra Cloud Solutions
