Automated DR Setup with Terraform, Velero, and OCP

A solid Terraform + Velero + OCP automated DR setup usually splits into three lanes:

  1. Terraform rebuilds the cluster infrastructure and base OCP install.
  2. OADP/Velero backs up and restores applications, namespaces, and PV data.
  3. etcd backup/restore protects the control plane state and must use a backup from the same OCP z-stream when restoring. (Red Hat Documentation)

Recommended architecture

Git / CI
├─ Terraform
│  ├─ network, subnets, DNS, LB, IAM
│  ├─ OCP install prerequisites
│  └─ optional object storage + KMS
├─ OCP bootstrap/install
├─ Post-install automation
│  ├─ OADP Operator
│  ├─ DataProtectionApplication
│  ├─ BackupStorageLocation
│  └─ VolumeSnapshotLocation
├─ Scheduled protection
│  ├─ etcd snapshots
│  ├─ Velero/OADP schedules
│  └─ CSI snapshots or file-system backup
└─ DR pipeline
   ├─ Terraform recreate infra
   ├─ reinstall OCP
   ├─ etcd restore if doing full cluster rollback
   └─ Velero restore for apps/data

That layout matches Red Hat’s split between control plane backup/restore and application backup/restore via OADP, and OADP exposes the main objects you automate: Backup, Restore, Schedule, BackupStorageLocation, and VolumeSnapshotLocation. (Red Hat Documentation)

What each piece should own

Terraform should manage

  • cloud network, subnets, routes, load balancers, DNS, IAM, object storage, encryption, and the repeatable OCP install scaffolding. This keeps rebuilds deterministic. The OCP install docs cover cluster-wide installation configuration, while backup guidance expects you to recover onto working infrastructure. (Red Hat Documentation)

OADP/Velero should manage

  • namespace-scoped app backups, cluster resources related to apps, and PV protection. Red Hat recommends OADP for application backup/restore on OpenShift, and Velero supports both CSI snapshots and file-system backup. (Red Hat Documentation)

etcd should be separate

  • use OpenShift’s control-plane backup flow for etcd. Red Hat explicitly says a restore must use an etcd backup from the same z-stream release, and OpenShift provides cluster-restore.sh and quorum-restore.sh to simplify recovery. (Red Hat Documentation)

Best-practice deployment pattern

Use Terraform for infra, then GitOps or post-install automation to apply OADP resources. I would not use Terraform to micromanage every backup object forever; it is better for bootstrap and guardrails than for day-to-day backup lifecycle.

A practical pattern is:

  • Terraform creates bucket, IAM, KMS, DNS, LB, install config, and optional cluster manifests.
  • OCP comes up.
  • A post-install job applies:
    • OADP Operator
    • cloud credentials secret
    • DataProtectionApplication
    • one BackupStorageLocation
    • one or more VolumeSnapshotLocation
    • Schedule objects per app tier.

This lines up with Red Hat’s OADP install flow and Velero’s native schedule model. (Red Hat Documentation)

Reference implementation

1) Terraform: object storage and IAM

This is the part Terraform is best at. Exact provider blocks vary by cloud, but the minimum is:

  • object storage bucket for backups
  • encryption
  • versioning / lifecycle
  • IAM role or credentials for Velero/OADP

# Bucket that stores the Velero/OADP backups.
resource "aws_s3_bucket" "velero" {
  bucket = var.velero_bucket_name
}

# Keep earlier object versions so overwritten or deleted backups stay recoverable.
resource "aws_s3_bucket_versioning" "velero" {
  bucket = aws_s3_bucket.velero.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt objects at rest with the KMS key passed in var.kms_key_arn.
resource "aws_s3_bucket_server_side_encryption_configuration" "velero" {
  bucket = aws_s3_bucket.velero.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
  }
}

2) OADP install on OpenShift

On current OpenShift, app backup/restore is done through the OADP Operator, which provides the main backup objects and integrates Velero with supported storage providers. (Red Hat Documentation)
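
As a rough sketch of what a post-install job could apply for this step, the manifests below create the namespace, an OperatorGroup, an OLM Subscription for the operator, and the credentials secret that the DataProtectionApplication picks up by default. The channel name, the approval mode, and the placeholder key values are assumptions for illustration; check which channel your catalog actually offers, and inject real credentials from your pipeline's secret store rather than committing them to Git.

```yaml
# Namespace that OADP and Velero run in (same namespace as the other manifests here).
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-adp
---
# OperatorGroup that scopes the operator to its own namespace.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-adp
  namespace: openshift-adp
spec:
  targetNamespaces:
    - openshift-adp
---
# OLM Subscription that installs the OADP Operator from the Red Hat catalog.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: redhat-oadp-operator
  namespace: openshift-adp
spec:
  name: redhat-oadp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  channel: stable-1.4              # assumption: use a channel that exists for your OCP release
  installPlanApproval: Automatic   # assumption: switch to Manual if you gate operator upgrades
---
# Credentials secret; OADP looks for the key "cloud" in "cloud-credentials" by default.
apiVersion: v1
kind: Secret
metadata:
  name: cloud-credentials
  namespace: openshift-adp
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=<REPLACE_ME>
    aws_secret_access_key=<REPLACE_ME>
```

Wait for the operator's CSV to reach the Succeeded phase before applying the DataProtectionApplication, since the CRD does not exist until the operator is installed.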

3) DataProtectionApplication

This is the core OADP object that wires backup and snapshot locations.

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa
  namespace: openshift-adp
spec:
  backupLocations:
    - velero:
        provider: aws
        default: true
        objectStorage:
          bucket: infra-cloud-velero-prod
          prefix: ocp-prod
        config:
          region: us-east-1
  snapshotLocations:
    - velero:
        provider: aws
        config:
          region: us-east-1
  configuration:
    velero:
      defaultPlugins:
        - openshift
        - aws
        - csi

OADP’s API surface includes BackupStorageLocation and VolumeSnapshotLocation, and CSI snapshot support is the preferred volume path when your storage supports it. (Red Hat Documentation)

4) Scheduled backups

Velero schedules are cron-based repeatable backup requests. (Velero)

Example for critical apps:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: apps-hourly
  namespace: openshift-adp
spec:
  schedule: "0 * * * *"
  template:
    includedNamespaces:
      - payments
      - customer-api
    snapshotVolumes: true
    ttl: 720h

Example for lower-priority namespaces:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: apps-daily
  namespace: openshift-adp
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - reporting
      - internal-tools
    snapshotVolumes: true
    ttl: 2160h

Velero also supports filtering by namespace, labels, and resource type, which is useful for separating critical workloads from everything else. (Velero)
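
To illustrate label and resource-type filtering, here is a minimal sketch of a schedule that selects workloads by label instead of by namespace. The schedule name and the backup-tier label are hypothetical; the labelSelector and excludedResources fields are part of the Velero backup spec.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: tier1-by-label            # hypothetical name
  namespace: openshift-adp
spec:
  schedule: "30 * * * *"          # every hour, at minute 30
  template:
    # Back up only resources that carry this (hypothetical) label.
    labelSelector:
      matchLabels:
        backup-tier: critical
    # Skip noisy resource types that have no value in a restore.
    excludedResources:
      - events
      - events.events.k8s.io
    snapshotVolumes: true
    ttl: 720h
```

With a label selector, only objects that carry the label are included, so the label has to be present on PVCs, Secrets, ConfigMaps, and other dependent resources, not just on the Deployments.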

5) etcd backup automation

Keep this outside Velero. OpenShift’s backup docs separate control-plane backup from OADP app backup, and Red Hat says you only need to save the etcd backup from a single control plane host. (Red Hat Documentation)

Typical automation pattern:

  • privileged automation job or external runner
  • SSH to one control plane node
  • run cluster-backup.sh
  • copy backup artifacts off-cluster to encrypted object storage
  • tag with OCP version and timestamp
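
For the "privileged automation job" variant, one possible sketch is a CronJob that lands on a control plane node, mounts the host filesystem, and runs cluster-backup.sh through chroot, much like an oc debug session would. Treat everything here as an assumption to validate in a test cluster: the namespace, the service account (which must be granted the privileged SCC), the image, and the schedule are all hypothetical, and the job only writes the backup to the node's disk. Copying the artifacts off-cluster and tagging them with the OCP version still needs a separate step.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup                # hypothetical name
  namespace: etcd-backup           # hypothetical namespace
spec:
  schedule: "0 1 * * *"            # daily at 01:00
  concurrencyPolicy: Forbid        # never run two backups at once
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          serviceAccountName: etcd-backup   # hypothetical; needs the privileged SCC
          restartPolicy: Never
          hostNetwork: true
          hostPID: true
          # Run on a single control plane node; one host's backup is enough.
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
            - key: node-role.kubernetes.io/master
              operator: Exists
              effect: NoSchedule
          containers:
            - name: backup
              # Assumption: any image that ships chroot works; verify name and tag.
              image: registry.redhat.io/openshift4/ose-cli:latest
              securityContext:
                privileged: true
                runAsUser: 0
              command:
                - chroot
                - /host
                - /usr/local/bin/cluster-backup.sh
                - /home/core/assets/backup
              volumeMounts:
                - name: host
                  mountPath: /host
          volumes:
            - name: host
              hostPath:
                path: /
```

An external runner that connects over SSH avoids granting a privileged SCC inside the cluster, at the cost of managing node access keys outside it.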

Recovery workflow

App-only DR

Use this when the cluster still exists:

  1. Reinstall missing operator/app prerequisites if needed.
  2. Run Velero/OADP restore for selected namespaces or apps.
  3. CSI-backed PV restore happens through the CSI plugin during PVC restore. (Velero)
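
A minimal sketch of step 2, restoring a single namespace from the hourly schedule shown earlier. The restore name is hypothetical; scheduleName tells Velero to use the most recent successful backup created by that schedule, and you can set backupName instead to pin a specific backup.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: payments-restore          # hypothetical name
  namespace: openshift-adp
spec:
  # Restore from the latest successful backup produced by this schedule.
  scheduleName: apps-hourly
  # Limit the restore to one namespace out of the backup.
  includedNamespaces:
    - payments
  # Recreate persistent volumes from their snapshots.
  restorePVs: true
```

Restore objects are one-shot: to run the same restore again, create a new object with a different name.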

Full-cluster DR

Use this when the cluster is gone:

  1. Terraform recreates infra.
  2. Reinstall OCP.
  3. Restore etcd from a same-z-stream backup if you are doing a full cluster rollback.
  4. Reconcile operators.
  5. Use OADP/Velero to restore app data, plus any resources that are not covered by the etcd restore or that changed after the etcd backup was taken. (Red Hat Documentation)

Practical backup policy

A good production baseline is:

  • etcd: daily plus pre-upgrade snapshot
  • tier-1 apps: hourly schedule
  • tier-2 apps: daily schedule
  • PVs: CSI snapshots where supported, file-system backup where snapshots are unavailable or portability matters. Velero documents both CSI snapshot support and file-system backup, including snapshot-data movement options. (Red Hat Documentation)
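
For volumes without snapshot support, a schedule can opt into file-system backup instead. The sketch below assumes a Velero version that uses the defaultVolumesToFsBackup field (older releases named it differently) and that the node agent is enabled in your DataProtectionApplication; the schedule name and timing are hypothetical.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: reporting-daily-fsb       # hypothetical name
  namespace: openshift-adp
spec:
  schedule: "0 3 * * *"           # daily at 03:00
  template:
    includedNamespaces:
      - reporting
    # Do not take volume snapshots for this tier.
    snapshotVolumes: false
    # Copy pod volume contents to object storage through the node agent.
    defaultVolumesToFsBackup: true
    ttl: 2160h
```

File-system backup reads every file through the node agent, so it is slower than a snapshot, but the data lands in the backup bucket and can be restored on a different storage backend.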

Guardrails to add

  • Encrypt the backup bucket.
  • Keep backups off-cluster.
  • Tag every etcd backup with OCP z-stream.
  • Separate schedules by business tier, not “back up everything hourly.”
  • Test both restore and full rebuild regularly. Red Hat’s backup docs are explicitly framed around recovering from disaster scenarios, not just creating backups. (Red Hat Documentation)

What I would automate first

If you want the highest payoff with the least complexity, automate in this order:

  1. Terraform for infra + bucket + IAM
  2. OCP install
  3. OADP Operator + DataProtectionApplication
  4. namespace-based Schedule objects
  5. etcd backup job to off-cluster storage
  6. one restore drill for app-only recovery
  7. one restore drill for full cluster rebuild

My recommendation

For OpenShift, use OADP on top of Velero rather than installing raw Velero by hand unless you have a very specific reason. That is the supported OpenShift path for application backup/restore, while etcd remains a separate control-plane backup stream. (Red Hat Documentation)
