Top Best Practices for OpenShift Backup and Recovery

When backing up OpenShift Container Platform, the goal isn’t just to “take backups”; it’s to be able to recover reliably. That means combining etcd backups, application backups, a storage strategy, and regular restore testing.

Here’s a clear, real-world best-practice guide 👇


1. Understand what you MUST back up

OpenShift has two critical layers:

Cluster state (control plane)

  • Stored in etcd
  • Includes:
    • API objects (pods, deployments, routes, secrets, etc.)
    • cluster config

Without this → cluster is gone


Application data

  • Persistent Volumes (databases, files)
  • App-specific configs

Without this → apps restore empty/broken


2. Back up etcd (CRITICAL)

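Use OpenShift’s built-in backup script, run from a debug shell on a single control plane (master) node:

# Open a debug shell on one control plane node
oc debug node/<master-node>
# Switch to the host's filesystem
chroot /host
# Write the backup to a directory on that node
/usr/local/bin/cluster-backup.sh /backup/location

The script writes two files: an etcd snapshot (snapshot_<timestamp>.db) and an archive of the static pod resources (static_kuberesources_<timestamp>.tar.gz). Keep them together, since a restore needs both.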

Best practices:

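  • Run daily (or more frequently), and before every cluster upgrade, since an etcd backup can only be restored onto the same OpenShift z-stream release
  • Store backups off-cluster
  • Encrypt backups (they contain secrets!)
  • Keep multiple copies (rotation)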
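
The rotation point above can be sketched as a small shell helper. This is a minimal sketch, assuming the backup files have been copied to an off-cluster directory and keep the timestamped names that cluster-backup.sh writes. The function name prune_etcd_backups and the default retention count of 7 are illustrative, not part of OpenShift.

```shell
# Keep only the newest N etcd snapshots in a directory, together with
# their matching static_kuberesources archives.
# Assumption: timestamps in the file names sort chronologically.
prune_etcd_backups() {
  local dir="$1" keep="${2:-7}"
  local snap stamp
  # Sort newest-first by name, skip the first $keep, delete the rest.
  find "$dir" -maxdepth 1 -type f -name 'snapshot_*.db' | sort -r |
  tail -n +"$((keep + 1))" |
  while IFS= read -r snap; do
    stamp="${snap##*/snapshot_}"   # strip path and prefix
    stamp="${stamp%.db}"           # strip extension, leaving the timestamp
    rm -f -- "$snap" "$dir/static_kuberesources_${stamp}.tar.gz"
  done
}
```

For example, prune_etcd_backups /mnt/etcd-backups 14 would keep the 14 newest backup pairs; the path is a placeholder.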

3. Use Velero for app-level backups

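On OpenShift, Velero is typically installed through the OADP (OpenShift API for Data Protection) operator. Use it for: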

  • Namespaces
  • Kubernetes resources
  • Persistent volumes

Best practices:

  • Back up per namespace or app rather than always backing up the full cluster
  • Select resources by label:
velero backup create app-backup --selector app=myapp
  • Schedule recurring backups (this cron expression runs daily at 01:00):
velero schedule create daily --schedule="0 1 * * *"
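
To combine the per-namespace and scheduling advice, a schedule can be scoped to one namespace and given a retention period. The sketch below uses the standard Velero flags --include-namespaces and --ttl; the function name create_app_schedule, the schedule naming pattern, and the 7-day default retention are assumptions made for illustration.

```shell
# Create a daily Velero schedule limited to one namespace.
# Assumptions: the velero CLI is installed and can reach the cluster;
# the namespace, cron expression, and TTL are placeholders to adapt.
create_app_schedule() {
  local ns="$1" cron="${2:-0 1 * * *}" ttl="${3:-168h0m0s}"
  velero schedule create "${ns}-daily" \
    --schedule="$cron" \
    --include-namespaces "$ns" \
    --ttl "$ttl"
}
```

Calling create_app_schedule myapp would create a schedule named myapp-daily whose backups expire after 168 hours.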

4. Handle persistent volumes properly

Choose one strategy:

Option A: Storage snapshots

  • Fast, point-in-time copies (crash-consistent unless the application is quiesced first)
  • Cloud-native (AWS EBS, etc.)

Option B: File-level backups (Restic via Velero; newer Velero releases also offer Kopia)

  • Works everywhere
  • Slower but portable

👉 Best practice:

  • Use snapshots for databases and other large or fast-changing volumes
  • Use Velero + Restic when backups must be restorable on a different storage platform or cloud
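
Both strategies can be driven from the command line. The following is a sketch under stated assumptions: backup.velero.io/backup-volumes is Velero's opt-in annotation for file-level volume backup, and --snapshot-volumes is a standard velero backup create flag. The function names, namespace, pod, and volume names are hypothetical.

```shell
# Option B: opt specific pod volumes in to Velero's file-level backup.
# volumes is a comma-separated list of volume names from the pod spec.
mark_volumes_for_fs_backup() {
  local ns="$1" pod="$2" volumes="$3"
  oc -n "$ns" annotate pod "$pod" \
    "backup.velero.io/backup-volumes=${volumes}" --overwrite
}

# Option A: back up a namespace and request snapshots of its volumes.
# Assumes a snapshot-capable storage provider is configured for Velero.
backup_with_snapshots() {
  local ns="$1" name="$2"
  velero backup create "$name" --include-namespaces "$ns" --snapshot-volumes
}
```

An annotation set directly on a pod is lost when the pod is recreated, so in practice it usually belongs in the pod template of the Deployment or StatefulSet.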

5. Back up important namespaces

Focus on:

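  • openshift-* namespaces that hold your own configuration (for example, openshift-config); most others are operator-managed and are captured by the etcd backup
  • app namespaces
  • operators (stateful ones)

Avoid blindly backing up everything unless you need to.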


6. Secure your backups

  • Encrypt at rest (S3, etc.)
  • Restrict access (IAM roles)
  • Never expose etcd backups publicly

Remember:
an etcd backup contains ALL of the cluster’s secrets, and unless etcd encryption is enabled they are stored unencrypted
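
One way to act on this is to encrypt each backup file before it leaves the node and to record a checksum alongside it. This is a minimal sketch, assuming OpenSSL 1.1.1 or later (for -pbkdf2), GNU sha256sum, and a passphrase stored in a file kept away from the backups. The function names and file paths are illustrative only.

```shell
# Encrypt a backup file with AES-256 and write a SHA-256 checksum.
# Produces <file>.enc and <file>.enc.sha256 next to the original.
# Note: the checksum file records the path exactly as given.
encrypt_backup() {
  local src="$1" passfile="$2"
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$src" -out "${src}.enc" -pass "file:${passfile}" &&
  sha256sum "${src}.enc" > "${src}.enc.sha256"
}

# Verify the checksum first, then decrypt to the requested output path.
decrypt_backup() {
  local enc="$1" passfile="$2" out="$3"
  sha256sum -c "${enc}.sha256" >/dev/null &&
  openssl enc -d -aes-256-cbc -pbkdf2 \
    -in "$enc" -out "$out" -pass "file:${passfile}"
}
```

After encrypting, only the .enc and .sha256 files need to be shipped off-cluster; the plaintext copy can then be removed from the node.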


7. Test restores regularly (MOST IMPORTANT)

A backup is useless if restore fails.

Test a Velero restore:

velero restore create --from-backup <backup-name>

Also test:

  • restoring the cluster state from an etcd backup
  • a single-namespace restore

Do this in a staging cluster, not in production.
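
A namespace restore test can be made less risky by restoring into a scratch namespace instead of over the original. The sketch below relies on the standard Velero flags --from-backup, --namespace-mappings, and --wait; the function name test_restore and the "-restore-test" and "-test" naming suffixes are assumptions.

```shell
# Restore a backup into a scratch namespace, wait for completion,
# then print the restore details for inspection.
test_restore() {
  local backup="$1" src_ns="$2" dst_ns="${3:-${2}-restore-test}"
  local restore="${backup}-test"
  velero restore create "$restore" \
    --from-backup "$backup" \
    --namespace-mappings "${src_ns}:${dst_ns}" \
    --wait &&
  velero restore describe "$restore"
}
```

After the restore completes, check that pods start and that the application data is actually present, not just that the restore reports success.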


8. Use off-cluster storage

Never store backups only inside the cluster.

Use:

  • S3 / object storage
  • external NFS
  • backup systems

9. Define RPO / RTO

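  • RPO (recovery point objective): how much data loss you can tolerate, which sets how often you back up
  • RTO (recovery time objective): how quickly service must be restored, which sets how fast and how rehearsed your restore process must be

Example:

  • etcd backup every 6 hours (worst case: up to 6 hours of cluster changes lost)
  • app backup every 1 hour (worst case: up to 1 hour of application data lost)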
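
An RPO only holds if backups keep arriving on time, so it helps to check the age of the newest backup. This is a minimal sketch, assuming GNU find and a directory that holds the backup files; the function name backup_within_rpo is hypothetical, and the RPO is given in seconds.

```shell
# Succeed (exit 0) if the newest file in a backup directory is no older
# than the RPO; fail if it is older or if there are no backups at all.
backup_within_rpo() {
  local dir="$1" rpo_seconds="$2" newest age
  # Newest modification time among the files, as epoch seconds.
  newest="$(find "$dir" -maxdepth 1 -type f -printf '%T@\n' | sort -n | tail -n 1)"
  [ -n "$newest" ] || return 1
  age=$(( $(date +%s) - ${newest%.*} ))
  [ "$age" -le "$rpo_seconds" ]
}
```

With the example intervals above, backup_within_rpo /mnt/etcd-backups 21600 checks the 6-hour target and a value of 3600 checks the 1-hour target; the path is a placeholder.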

10. Common mistakes to avoid

❌ Only backing up etcd
❌ Not backing up PV data
❌ Never testing restore
❌ Storing backups only inside the cluster
❌ No encryption
❌ Backing up everything blindly (slow, noisy)


Recommended architecture

        OpenShift Cluster
        ├── etcd backup → secure storage (daily)
        ├── Velero backups → object storage
        └── PV snapshots → cloud storage

        External Storage
        ├── S3 bucket
        ├── encrypted + versioned
        └── lifecycle policies
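
The external storage settings in the diagram could be applied with the AWS CLI roughly as follows. This is a sketch, assuming an existing S3 bucket and credentials allowed to configure it. The function name harden_backup_bucket, the rule ID, and the 30-day default are illustrative choices, and SSE-S3 (AES256) stands in for whatever encryption scheme you actually use.

```shell
# Enable versioning, default encryption, and a lifecycle rule on a bucket.
# days: how long objects and old object versions are kept before expiring.
harden_backup_bucket() {
  local bucket="$1" days="${2:-30}"
  aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled &&
  aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}' &&
  aws s3api put-bucket-lifecycle-configuration --bucket "$bucket" \
    --lifecycle-configuration \
    "{\"Rules\":[{\"ID\":\"expire-old-backups\",\"Status\":\"Enabled\",\"Filter\":{\"Prefix\":\"\"},\"Expiration\":{\"Days\":${days}},\"NoncurrentVersionExpiration\":{\"NoncurrentDays\":${days}}}]}"
}
```

If Velero writes to the same bucket, the lifecycle period should be longer than the Velero backup TTL so that the bucket never expires data Velero still expects to find.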


Pro-level setup (what enterprises do)

  • Velero + S3 + IAM roles
  • Automated schedules (hourly/daily)
  • Separate backup account
  • Cross-region replication
  • Periodic DR drills

Key takeaway

  • etcd backup = cluster brain
  • Velero backup = workloads
  • PV backup = actual data

You need all three for real disaster recovery.

