Debugging DNS Issues in OpenShift Pods

DNS works for some pods but not others. This one is tricky because it often looks like an OVN problem, but much of the time the real cause is the DNS path, a namespace lookup mistake, or the pod's DNS configuration.

In OpenShift, the DNS Operator manages CoreDNS for pod and service name resolution, and CoreDNS runs as the dns-default daemon set in openshift-dns. Pods rely on kubelet-provided DNS settings in /etc/resolv.conf to reach those DNS servers. (Red Hat Documentation)

Scenario

Some pods can resolve service names, but others cannot.

Examples:

  • Pod A: nslookup backend-service ✅ resolves
  • Pod B: nslookup backend-service ❌ fails

That usually means one of these:

  • the failing pod has bad DNS settings,
  • the query is being made from the wrong namespace,
  • only some nodes can reach the DNS pods,
  • or the DNS pods themselves are unhealthy on part of the cluster. (Red Hat Documentation)

Diagram

                +------------------------------+
                |        failing pod           |
                |  /etc/resolv.conf            |
                |  nameserver -> DNS service   |
                +--------------+---------------+
                               |
                               v
                    +---------------------+
                    |   CoreDNS /         |
                    |   dns-default pods  |
                    |   in openshift-dns  |
                    +----------+----------+
                               |
                 resolves svc/pod names from cluster state
                               |
                               v
                    +---------------------+
                    |  Service / Pod DNS  |
                    |  records            |
                    +---------------------+

Where it breaks:
1) Pod resolv.conf is wrong
2) Pod queries wrong namespace
3) Pod/node cannot reach dns-default
4) dns-default pods unhealthy
5) Name exists, but target service/endpoints are wrong


How to debug it

1. Prove it is DNS and not general networking

From a good pod and a bad pod, test both DNS and direct IP access:

oc exec -it <good-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- curl http://<service-cluster-ip>:<port>
oc exec -it <bad-pod> -- curl http://<pod-ip>:<port>

If IP-based access works but nslookup fails, that points strongly to DNS rather than OVN datapath routing. Kubernetes service and pod discovery are meant to work through DNS records. (Kubernetes)

2. Check the failing pod’s /etc/resolv.conf

This is one of the fastest checks:

oc exec -it <bad-pod> -- cat /etc/resolv.conf

A normal pod DNS config should include a cluster DNS nameserver and search domains such as the pod namespace, svc.cluster.local, and cluster.local; Kubernetes documents options ndots:5 as typical too. If those are missing or odd, the pod DNS setup is wrong. (Kubernetes)
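To make the check concrete, here is an illustrative Python helper (not part of any OpenShift tooling; the cluster DNS IP 172.30.0.10 and the cluster.local domain are assumptions, substitute your cluster's actual values) that flags missing pieces in a pod's resolv.conf:

```python
# Hypothetical sanity check for a pod's /etc/resolv.conf contents.
# The expected nameserver and domain below are assumptions.

def check_resolv_conf(text, cluster_dns="172.30.0.10", cluster_domain="cluster.local"):
    """Return a list of findings; an empty list means the config looks normal."""
    nameservers, search, options = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "nameserver":
            nameservers.extend(parts[1:])
        elif parts[0] == "search":
            search.extend(parts[1:])
        elif parts[0] == "options":
            options.extend(parts[1:])
    findings = []
    if cluster_dns not in nameservers:
        findings.append("cluster DNS nameserver missing")
    if f"svc.{cluster_domain}" not in search:
        findings.append("svc search domain missing")
    if not any(o.startswith("ndots:") for o in options):
        findings.append("ndots option missing")
    return findings

sample = """nameserver 172.30.0.10
search frontend.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(check_resolv_conf(sample))  # [] -> looks normal
```

Anything this flags on the failing pod is worth comparing against a healthy pod's resolv.conf.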

3. Make sure the pod is querying the right namespace

A very common false alarm:

oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service.<namespace>

Kubernetes says unqualified service names are resolved relative to the pod’s own namespace. So backend-service from namespace frontend will not find a service that lives in namespace backend unless you query backend-service.backend. (Kubernetes)
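The search-list behaviour can be sketched in a few lines. This is a simplified model of glibc-style resolution, not the resolver's actual code; the search domains mirror a pod in a hypothetical frontend namespace:

```python
# Sketch of how the resolver expands an unqualified name using the pod's
# search domains. With ndots:5, any name with fewer than 5 dots tries the
# search list before being queried as-is.

def candidate_queries(name, search_domains, ndots=5):
    """Return the order of FQDNs the resolver will try for `name`."""
    if name.endswith("."):          # already fully qualified, no expansion
        return [name]
    candidates = []
    if name.count(".") < ndots:     # try the search list first
        candidates += [f"{name}.{d}." for d in search_domains]
    candidates.append(name + ".")   # finally, the name as given
    return candidates

search = ["frontend.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(candidate_queries("backend-service", search)[0])
# backend-service.frontend.svc.cluster.local.  <- pod's own namespace first
print(candidate_queries("backend-service.backend", search)[1])
# backend-service.backend.svc.cluster.local.   <- this candidate resolves
```

The first candidate for the short name lands in the pod's own namespace, which is exactly why the cross-namespace lookup fails until you qualify the name.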

4. Check whether the DNS pods are healthy

In OpenShift, look at the DNS operator and DNS pods:

oc get clusteroperator dns
oc get pods -n openshift-dns
oc get pods -n openshift-dns-operator

Red Hat documents that the DNS Operator manages CoreDNS, and that CoreDNS runs as the dns-default daemon set. If those pods are crashlooping, pending, or missing on expected nodes, pods may lose name resolution. (Red Hat Documentation)

5. Check whether only some nodes are affected

If only pods on one worker fail DNS, compare node placement:

oc get pods -A -o wide | grep <failing-node>
oc get pods -n openshift-dns -o wide

Red Hat notes DNS is available to all pods if DNS pods are running on some nodes and nodes without DNS pods still have network connectivity to nodes with DNS pods. So “only pods on node X fail DNS” often means node-to-DNS connectivity is broken rather than CoreDNS being globally broken. (Red Hat Documentation)
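A toy version of that comparison (node names are invented) is just a set difference between the nodes hosting failing pods and the nodes hosting dns-default pods:

```python
# Toy sketch of the node comparison: pods on a node with no local
# dns-default pod depend on cross-node reachability to a node that has one.
# Node names below are made up for illustration.

dns_pod_nodes = {"worker-1", "worker-3"}   # from: oc get pods -n openshift-dns -o wide
bad_pod_nodes = {"worker-2", "worker-3"}   # nodes hosting the failing pods

# Failing-pod nodes that must reach DNS over the network:
no_local_dns = bad_pod_nodes - dns_pod_nodes
print(no_local_dns)  # {'worker-2'} -> check worker-2's path to the DNS nodes
```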

6. Test from a clean debug pod

This removes app-side noise:

oc run dns-debug --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -it --rm -- sh
nslookup kubernetes.default
nslookup backend-service.<namespace>
cat /etc/resolv.conf

Kubernetes recommends creating a simple test pod and using nslookup kubernetes.default as a baseline DNS test. (Kubernetes)

7. Check DNS service reachability from the bad pod

If you know the DNS service IP from /etc/resolv.conf, test whether the pod can even reach it. If the DNS nameserver is unreachable from only some pods or nodes, the issue is likely network path to DNS, not DNS records themselves. This is an inference from the Kubernetes debug flow and OpenShift’s note about node connectivity to DNS pods. (Kubernetes)
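One way to test raw reachability without nslookup is to hand-roll a minimal DNS query and see whether anything answers on UDP/53. The packet builder below follows the standard DNS wire format; the service IP in the commented call is an assumption, read yours from the pod's resolv.conf:

```python
import socket
import struct

# Minimal raw DNS A-query: any reply at all proves the network path to the
# DNS service works, independent of whether the record itself exists.

def build_dns_query(name, txid=0x1234):
    """Build a DNS query packet for an A record (RD=1, one question)."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.strip(".").split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def probe(nameserver, name="kubernetes.default.svc.cluster.local", timeout=2):
    """Return True if the nameserver answers anything on UDP/53."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_dns_query(name), (nameserver, 53))
        s.recvfrom(512)
        return True
    except socket.timeout:
        return False   # path to the DNS service is broken, not the records
    finally:
        s.close()

# probe("172.30.0.10")  # run from inside the failing pod
```

A timeout here points at connectivity to DNS; a reply (even NXDOMAIN) points back at records or search-path issues.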

8. Check logs from the DNS pods

If the DNS pods are up but resolution still fails:

oc logs -n openshift-dns <dns-default-pod>

If you are testing a workaround, Red Hat documents that the DNS Operator can be set to Unmanaged, but they also note you cannot upgrade while it remains unmanaged. (Red Hat Documentation)

What this usually turns out to be

Most common causes:

  • Wrong namespace lookup: querying service instead of service.namespace. (Kubernetes)
  • Bad pod DNS config: strange or missing nameserver/search domains in /etc/resolv.conf. (Kubernetes)
  • DNS pods unhealthy: dns-default issues in openshift-dns. (Red Hat Documentation)
  • Node-specific connectivity issue: pods on one node cannot reach DNS pods running elsewhere. (Red Hat Documentation)
  • Service confusion: DNS resolves, but the target service or endpoints are wrong, making it look like DNS. Kubernetes DNS only gives you the name-to-record mapping; the service still has to be valid. (Kubernetes)

Fast triage sequence

oc exec -it <bad-pod> -- cat /etc/resolv.conf
oc exec -it <bad-pod> -- nslookup kubernetes.default
oc exec -it <bad-pod> -- nslookup <service>.<namespace>
oc get clusteroperator dns
oc get pods -n openshift-dns -o wide
oc logs -n openshift-dns <dns-default-pod>

Mental model

When DNS fails only for some pods:

  • if all traffic is broken, think OVN/node networking
  • if IP access works but names fail, think DNS
  • if short names fail but FQDN works, think namespace/search path
  • if only one node’s pods fail, think node-to-dns connectivity
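The same four bullets can be written as a tiny decision helper (the symptom flags and return strings are my own shorthand, not an official taxonomy):

```python
# Decision helper mirroring the mental model above.

def dns_triage(ip_works, short_name_works, fqdn_works, single_node_only):
    if not ip_works:
        return "OVN / node networking"
    if not fqdn_works:
        return "node-to-DNS connectivity" if single_node_only else "DNS service / CoreDNS"
    if not short_name_works:
        return "namespace / search path"
    return "not a DNS problem"

print(dns_triage(True, False, True, False))   # namespace / search path
print(dns_triage(True, False, False, True))   # node-to-DNS connectivity
```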

Debugging ClusterIP Issues in OVN Kubernetes



Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

  • frontend → calling backend
  • Direct call works: curl http://10.128.2.15:8080 ✅
  • Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

  • Pod networking (OVN routing) is working
  • The problem is in the service load-balancing layer inside OVN-Kubernetes


Mental model (diagram)

        +------------+    direct pod IP      +------------+
        |  frontend  | --------------------> |  backend   |   ✅ works
        |    pod     |                       |    pod     |
        +-----+------+                       +------------+
              |
              |  via backend-service (ClusterIP)
              v
        +---------------------+
        |  OVN load balancer  |   ❌ broken here
        +---------------------+

Interpretation:

  • Pod → Pod = direct routing (works)
  • Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Root cause = wrong labels

Example:

# Service selector
selector:
  app: backend

But pod has:

labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
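The matching rule behind empty endpoints is simple: a pod backs a service only if its labels contain every key/value in the service selector. A minimal sketch (names and IPs are illustrative):

```python
# Endpoint selection as a subset test on labels.

def matching_endpoints(selector, pods):
    """pods: list of (labels, ip). Return IPs whose labels satisfy the selector."""
    return [ip for labels, ip in pods if selector.items() <= labels.items()]

pods = [({"app": "api"}, "10.128.2.15")]            # pod labelled app=api
print(matching_endpoints({"app": "backend"}, pods))  # [] -> service is dead
print(matching_endpoints({"app": "api"}, pods))      # ['10.128.2.15'] -> fixed
```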


Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

  • correct port
  • correct targetPort

Common mistake:

port: 80
targetPort: 8080   # ✅ targetPort must match the container port
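A sketch of the same check: port is what clients call, targetPort must be a port the container actually listens on (values here are illustrative):

```python
# Flag a service whose targetPort hits no container port.

def port_mismatch(service, container_ports):
    """Return True when the targetPort matches no container port."""
    return service["targetPort"] not in container_ports

svc = {"port": 80, "targetPort": 8080}
print(port_mismatch(svc, container_ports={8080}))  # False -> OK
print(port_mismatch(svc, container_ports={8000}))  # True  -> connection refused
```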

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

  • ❌ fails → OVN load balancer issue
  • ✅ works → DNS issue instead

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If it fails:

→ Not OVN
→ Check:

oc get pods -n openshift-dns

Step 5: Inspect OVN load balancer

On a node:

oc debug node/<node>
chroot /host

Then:

ovn-nbctl lb-list

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service
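If you want to script the check, a hypothetical parser for the stylized mapping shown above could look like this. Note that real ovn-nbctl lb-list output is a table (UUID, LB, PROTO, VIP, IPs), so adapt the parsing to your version's columns:

```python
# Hypothetical parser for a "VIP: <vip> -> <backends>" style mapping.
# Adjust for the tabular output your ovn-nbctl version actually prints.

def find_backends(lb_output, vip):
    """Return the backend list for a VIP, or [] if it was never programmed."""
    for line in lb_output.splitlines():
        norm = line.replace("→", "->")
        if vip in norm and "->" in norm:
            return [b.strip() for b in norm.split("->")[1].split(",")]
    return []   # VIP missing: OVN never synced the service

out = "VIP: 172.30.0.10:80 -> 10.128.2.15:8080"
print(find_backends(out, "172.30.0.10:80"))  # ['10.128.2.15:8080']
print(find_backends(out, "172.30.0.99:80"))  # [] -> not programmed
```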


Step 6: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-master>

Look for:

  • load balancer sync errors
  • endpoint update failures

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN replaces kube-proxy.

So if service routing is broken:
It’s handled by OVN, not iptables


Real root causes (from production)

1. Label mismatch (MOST COMMON)

  • Service selector doesn’t match pod
    → no endpoints → service dead

2. Wrong port/targetPort

  • Service pointing to wrong container port
    → connection refused

3. OVN load balancer not programmed

  • OVN DB out of sync
    → ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

  • Pod allows direct IP but blocks service path
    (less common but happens)

5. DNS issue (misdiagnosed often)

  • Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

  1. Endpoints exist?
    • ❌ → labels problem
  2. ClusterIP works?
    • ❌ → OVN load balancing
  3. DNS works?
    • ❌ → DNS issue

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run:

nslookup backend-service
wget -qO- http://<ClusterIP>:<port>
wget -qO- http://<pod-IP>:<port>

(busybox ships wget rather than curl)

This instantly isolates:

  • DNS
  • service
  • networking

Key takeaway

  • Pod IP = routing layer (OVN switching)
  • Service IP = OVN load balancer layer
  • If one works and the other doesn’t → you know exactly where to look

Troubleshooting Node-Specific Pod Traffic Failures

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

  • frontend on worker-1 can reach backend
  • same app on worker-2 cannot

That pattern is a huge clue.


How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.


2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

  • restarts
  • readiness failures
  • DB connection errors
  • OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.


3. Check node readiness and basic health

oc get node
oc describe node <bad-node>

Look for:

  • NotReady
  • memory/disk pressure
  • network-related events

Sometimes OVN is fine and the node itself is degraded.


4. Inspect OVS on the bad node

Open a debug shell:

oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

ovs-ofctl dump-ports br-int
ovs-appctl bond/show

Red flags:

  • missing br-int
  • interfaces missing
  • counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.


5. Check the node’s host networking

Still on the node:

ip addr
ip route
ip link

Look for:

  • missing routes
  • down interfaces
  • wrong MTU

A node can have OVN running, but if the host interface or route is wrong, encapsulated traffic will still fail.


6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

  • DNS works sometimes
  • small pings work
  • larger curls/higher-volume traffic fail or hang

A quick test from a pod can help:

ping -M do -s 1400 <target-ip>

If smaller packets work and larger ones fail, suspect MTU.
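The arithmetic behind that test, as a sketch: OVN-Kubernetes encapsulates pod traffic in Geneve, which typically costs on the order of 100 bytes of headroom. Treat the exact figure as an assumption and check your cluster's configured MTU:

```python
# Rough MTU arithmetic. GENEVE_OVERHEAD is a rule-of-thumb assumption;
# the exact overhead depends on IP version and Geneve options.

HOST_MTU = 1500
GENEVE_OVERHEAD = 100
cluster_mtu = HOST_MTU - GENEVE_OVERHEAD
print(cluster_mtu)   # 1400

# `ping -M do -s N` sends N payload bytes plus 28 bytes of ICMP/IPv4 headers:
def fits(payload, mtu=cluster_mtu):
    return payload + 28 <= mtu

print(fits(1372))  # True  -> should pass
print(fits(1400))  # False -> will fail if the cluster MTU is 1400
```

This is why `-s 1400` fails on a 1400-byte cluster MTU even though the host MTU is 1500.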


7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.


8. Test service vs direct pod IP

From a failing pod:

curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>

Interpretation:

  • both fail → node/local OVN path likely broken
  • pod IP works, service fails → service/load-balancer programming problem
  • DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.


9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

iptables -S
nft list ruleset
systemctl status ovs-vswitchd
systemctl status ovn-controller

A manual host change, bad firewall rule, or failed service can break just one node.


10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

  • cordon/drain the bad node if workloads are impacted
  • restart or recover the bad node’s OVN/OVS components
  • verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.


What this usually turns out to be

Most common causes:

  • ovnkube-node unhealthy on one node
  • broken or stale OVS state on that node
  • host NIC / route / MTU mismatch
  • node-specific firewall or kernel/network issue
  • the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
systemctl status ovn-controller

That usually gets you very close.


Mental model

When only one node is broken:

  • cluster-wide policy is less likely
  • app config is less likely
  • service config is less likely
  • node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU


Here’s a realistic example:

  • pods on worker-2 cannot reach anything off-node
  • pods on worker-1 are fine
  • ovnkube-node on worker-2 shows repeated connection/programming errors
  • ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.

Debugging OVN Issues in OpenShift

Let’s walk through a realistic, production-style OVN debugging scenario in
OpenShift Container Platform using OVN-Kubernetes.


Scenario

A frontend pod cannot reach a backend service

You have:

  • frontend pod
  • backend pod
  • backend-service (ClusterIP)

And:

curl http://backend-service

fails


Step-by-step debugging (real flow)

Step 1: Check if backend pod is healthy

oc get pods -o wide

You want:

  • Backend pod = Running
  • Has an IP (e.g., 10.128.2.15)

If pod is not running → stop here (not an OVN issue)


Step 2: Test direct pod-to-pod connectivity

From frontend pod:

oc exec -it frontend -- curl http://10.128.2.15

Outcomes:

Case A: This FAILS

→ Problem is networking (OVN / policy / routing)

Case B: This WORKS

→ Networking is fine → problem is service layer


Branch A: Pod-to-pod FAILS (OVN issue)

Step 3A: Check NetworkPolicies

oc get networkpolicy -A

Look for anything like:

  • Deny all ingress
  • Missing allow rules

Quick test:
Create temporary allow-all policy

If it suddenly works → root cause = NetworkPolicy


Step 4A: Check node-level OVN

Find nodes:

oc get pods -o wide

Then:

oc get pods -n openshift-ovn-kubernetes -o wide

Check:

  • Is ovnkube-node running on both nodes?
  • Any restarts?

Step 5A: Test OVS health

oc debug node/<node>
chroot /host
ovs-vsctl show

Look for:

  • br-int bridge
  • Proper interfaces

Missing interfaces = OVN not wiring pods correctly


Step 6A: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-node>

Common errors:

  • Flow install failures
  • DB sync issues

Branch B: Pod-to-pod WORKS, Service FAILS

This is VERY common and often misunderstood.


Step 3B: Check service

oc get svc backend-service -o wide

Check:

  • ClusterIP exists
  • Correct port

Step 4B: Check endpoints

oc get endpoints backend-service

If EMPTY:

→ Service is not linked to pods

Root cause:

  • Wrong selector labels

Fix:

selector:
  app: backend

Step 5B: Test service IP directly

curl <ClusterIP>

Fails but pod IP works:

→ OVN load-balancing issue


Step 6B: Check OVN load balancer

On node:

ovn-nbctl lb-list

You should see:

  • Service IP mapped to pod IPs

If missing → OVN not programming service


Bonus: DNS check (often confused with OVN)

From frontend:

nslookup backend-service

If fails:

→ DNS issue, NOT OVN

Check:

oc get pods -n openshift-dns

Real root cause examples (from production)

Case 1: Wrong labels

  • Service selector doesn’t match pod
    → No endpoints → service fails

Case 2: NetworkPolicy blocking traffic

  • Default deny policy applied
    → Pods isolated

Case 3: OVN desync

  • Pod exists but not in OVN DB
    → No routing

Case 4: Node issue

  • Only pods on one node fail
    ovnkube-node broken there

Case 5: MTU mismatch

  • Small packets work, large fail
    → Very tricky to spot

The mental model (this is what experts use)

When debugging:

  1. Pod IP → works?
    • ❌ → OVN / policy / routing
    • ✅ → go to service layer
  2. Service endpoints exist?
    • ❌ → labels problem
    • ✅ → OVN load balancing
  3. DNS works?
    • ❌ → DNS, not OVN

Pro move (what senior engineers do)

Spin up a debug pod:

oc run debug --image=busybox -it --rm -- sh

Then test:

  • ping
  • wget (busybox has no curl)
  • nslookup

This removes app complexity completely.


Understanding OVN in OpenShift: A Networking Overview

In OpenShift Container Platform (OCP), OVN refers to Open Virtual Network, used via OVN-Kubernetes. It’s the default networking solution in modern OpenShift clusters.


What OVN is (in simple terms)

OVN is a software-defined networking (SDN) system that:

  • Creates virtual networks for pods
  • Handles routing, switching, and network policies
  • Replaces older OpenShift SDN implementations

Think of it as the “network brain” of your cluster.


How OVN works in OCP

Core components

  • OVN Northbound DB → stores high-level network config
  • OVN Southbound DB → translates config to actual rules
  • OVN Controller (on each node) → programs networking rules
  • Open vSwitch (OVS) → does the actual packet forwarding

What it actually does

1. Pod networking

Each pod gets:

  • Its own IP address
  • Ability to talk to other pods across nodes

2. Routing

Handles:

  • Pod-to-pod communication
  • Pod-to-service traffic
  • External access (ingress/egress)

3. Network policies

Controls traffic like:

  • “Only allow frontend → backend”
  • “Block everything except specific ports”

4. Load balancing

Implements Kubernetes Services internally


Traffic flow (simplified)

  1. Pod sends traffic
  2. Goes through Open vSwitch (OVS) on the node
  3. OVN rules decide:
    • Where it goes
    • Whether it’s allowed
  4. Traffic reaches destination pod/service

Why OCP uses OVN

Compared to older SDN:

  • Better performance and scalability
  • Native support for Kubernetes NetworkPolicy
  • More flexible routing and IP management
  • Cleaner architecture (decoupled control + data plane)

OVN vs OpenShift SDN

Feature           OVN-Kubernetes    OpenShift SDN (legacy)
Performance       Higher            Lower
NetworkPolicy     Full support      Limited
IPv6              Supported         Limited
Future support    ✅ Active         ❌ Deprecated

Key concepts to remember

  • OVN = control plane for networking
  • OVS = data plane (moves packets)
  • Pods communicate via virtual networking managed by OVN
  • Policies are enforced at the network layer

Real-world example

If you deploy:

  • frontend pod
  • backend pod

With OVN:

  • Both get IPs
  • OVN ensures routing between them
  • A NetworkPolicy can allow only frontend → backend traffic

Debugging OVN in OpenShift Container Platform (via OVN-Kubernetes) can feel overwhelming at first, but there’s a clear, structured way to approach it.

Below is a practical, field-tested workflow you can follow.


0. Start with the symptom

Before touching OVN internals, identify the issue type:

  • ❌ Pod can’t reach another pod
  • ❌ Pod can’t reach a service
  • ❌ External traffic not working
  • ❌ DNS failing
  • ❌ Only some nodes affected

This helps narrow the scope fast.


1. Check cluster networking health

oc get co network

  • Should be Available=True
  • If Degraded → OVN problem likely

2. Check OVN pods

oc get pods -n openshift-ovn-kubernetes

Look for:

  • CrashLoopBackOff
  • NotReady pods

Key pods:

  • ovnkube-node (runs on every node)
  • ovnkube-master

3. Check logs (most important step)

Node-level (data plane issues)

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Control plane

oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod>

Look for:

  • Flow programming errors
  • DB connection failures
  • OVS issues

4. Validate pod networking

Get pod IPs:

oc get pods -o wide

Test connectivity:

oc exec -it <pod> -- ping <other-pod-ip>

If this fails:

  • Likely OVN routing or policy issue

5. Check NetworkPolicies

oc get networkpolicy -A

Common mistake:

  • Policy blocking traffic unintentionally

Test by temporarily removing policy or creating an allow-all:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}

6. Check Open vSwitch (OVS)

SSH into a node:

oc debug node/<node-name>
chroot /host

Then:

ovs-vsctl show

Look for:

  • Bridges (like br-int)
  • Missing interfaces = problem

7. Inspect OVN DB state

From master node:

ovn-nbctl show

Check:

  • Logical switches
  • Ports for pods

If missing → OVN not programming correctly


8. Check services & kube-proxy replacement

OVN replaces kube-proxy.

Check:

oc get svc

Test:

curl <service-cluster-ip>

If service fails but pod IP works:
→ Load balancing issue in OVN


9. Check egress / external connectivity

From pod:

curl google.com

If fails:

  • Check EgressFirewall / EgressIP
  • Check node routing

10. Use must-gather (for deep issues)

oc adm must-gather -- /usr/bin/gather_network_logs

This collects:

  • OVN DB state
  • OVS config
  • Logs

Common real-world issues

1. MTU mismatch

Symptoms:

  • Intermittent connectivity
  • Large packets fail

2. NetworkPolicy blocking traffic

Very common in production


3. OVN DB not syncing

Symptoms:

  • Pods exist but no routes

4. Node-specific issues

  • Only pods on one node fail → check that node’s ovnkube-node

5. DNS issues (often misdiagnosed as OVN)

Check:

oc get pods -n openshift-dns

Debugging mindset (this is key)

Always go in this order:

  1. Is cluster networking healthy?
  2. Are OVN pods running?
  3. Is traffic blocked (policy)?
  4. Is routing broken (OVN/OVS)?
  5. Is it actually DNS or app issue?

Pro tip

Use a debug pod:

oc run test --image=busybox -it --rm -- sh

From there:

  • ping
  • nslookup
  • wget (busybox has no curl)

This isolates networking from your app.

Image Signing Guide with Tekton Chains on OCP

A complete guide to image signing with Tekton Chains on OCP — covering the concept, the setup, the pipeline integration, and verification.

What Tekton Chains does

Tekton Chains works by watching TaskRuns and PipelineRuns. Once a run is observed as completed, Chains takes a snapshot of the finished TaskRun/PipelineRun and starts its core work in order: formatting (generate the provenance JSON) → signing (sign the payload with the configured key) → uploading (push the provenance and its signature to the configured storage).

It operates entirely automatically — you don’t modify your pipeline at all. Chains watches completed runs and signs in the background.


Step 1 — Chains is already installed on OCP

The Red Hat OpenShift Pipelines Operator installs Tekton Chains by default. You can configure Tekton Chains by modifying the TektonConfig custom resource; the Operator automatically applies the changes that you make.

# Verify Chains is running
oc get pods -n openshift-pipelines | grep chains
# tekton-chains-controller-xxx Running

Step 2 — Generate a signing key pair

# Install cosign (if not already)
brew install cosign # or download binary
# Generate key pair — stores private key as K8s secret automatically
cosign generate-key-pair k8s://openshift-pipelines/signing-secrets
# This creates:
# signing-secrets (K8s Secret) — holds cosign.key + cosign.password
# cosign.pub (local file) — distribute this for verification
# Extract public key for distribution/verification
oc get secret signing-secrets -n openshift-pipelines \
-o jsonpath='{.data.cosign\.pub}' | base64 -d > cosign.pub

For production, use a KMS (AWS KMS, HashiCorp Vault, GCP KMS) instead of a file-based key:

# AWS KMS example
cosign generate-key-pair --kms awskms:///arn:aws:kms:ca-central-1:123456:key/abc-def

Step 3 — Configure Chains via TektonConfig

Cluster administrators can use Tekton Chains to sign and verify images and provenances by: creating an encrypted x509 key pair and saving it as a Kubernetes secret; setting up authentication for the OCI registry to store images, image signatures, and signed image attestations; and configuring Tekton Chains to generate and sign provenance.

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  chain:
    # Format for TaskRun attestations
    artifacts.taskrun.format: "slsa/v1"        # SLSA v1.0 provenance
    artifacts.taskrun.storage: "oci"           # store in OCI registry
    # Format for PipelineRun attestations (recommended)
    artifacts.pipelinerun.format: "slsa/v1"
    artifacts.pipelinerun.storage: "oci"
    artifacts.pipelinerun.enable-deep-inspection: "true"  # inspect child TaskRuns
    # OCI image signature format
    artifacts.oci.format: "simplesigning"
    artifacts.oci.storage: "oci"
    # Transparency log (Sigstore Rekor)
    transparency.enabled: "true"
    transparency.url: "https://rekor.sigstore.dev"  # or your internal Rekor
    # Signing key reference
    signers.cosign.key: "k8s://openshift-pipelines/signing-secrets"

Apply via oc patch if you prefer:

oc patch tektonconfig config --type=merge -p='{
  "spec": {
    "chain": {
      "artifacts.pipelinerun.format": "slsa/v1",
      "artifacts.pipelinerun.storage": "oci",
      "artifacts.oci.format": "simplesigning",
      "artifacts.oci.storage": "oci",
      "transparency.enabled": "true"
    }
  }
}'

Step 4 — Type-hint your pipeline so Chains knows what to sign

Chains discovers what the output artifact is via type hints in Task results. Your build task must emit IMAGE_URL and IMAGE_DIGEST results:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: buildah-push
spec:
  params:
  - name: IMAGE
    type: string
  results:
  # Type hints — Chains reads these to find the artifact
  - name: IMAGE_URL
    description: The image URL
  - name: IMAGE_DIGEST
    description: The image digest (sha256)
  steps:
  - name: build-and-push
    image: registry.redhat.io/rhel8/buildah
    script: |
      buildah bud -t $(params.IMAGE) .
      buildah push $(params.IMAGE) \
        --digestfile /tmp/digest
      # Emit type hints for Chains
      echo -n "$(params.IMAGE)" | tee $(results.IMAGE_URL.path)
      cat /tmp/digest | tee $(results.IMAGE_DIGEST.path)
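Conceptually, Chains' type-hint discovery boils down to pairing IMAGE_URL with IMAGE_DIGEST from a run's results. A simplified Python sketch (the structure mirrors a run's status.results; this is not Chains' actual code):

```python
# Sketch: pair the IMAGE_URL / IMAGE_DIGEST type hints to identify the
# artifact to sign. Result entries beyond these two names are ignored.

def artifact_from_results(results):
    r = {x["name"]: x["value"].strip() for x in results}
    if "IMAGE_URL" in r and "IMAGE_DIGEST" in r:
        return f'{r["IMAGE_URL"]}@{r["IMAGE_DIGEST"]}'
    return None   # no type hints: nothing for Chains to sign on this run

results = [
    {"name": "IMAGE_URL", "value": "quay.io/my-org/my-app\n"},
    {"name": "IMAGE_DIGEST", "value": "sha256:abc123"},
]
print(artifact_from_results(results))
# quay.io/my-org/my-app@sha256:abc123
```

This is also why the `echo -n` matters: a trailing newline in the result would corrupt the image reference.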

For pipeline-level provenance, also emit results at the Pipeline level:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  results:
  - name: IMAGE_URL
    value: $(tasks.build.results.IMAGE_URL)
  - name: IMAGE_DIGEST
    value: $(tasks.build.results.IMAGE_DIGEST)
  tasks:
  - name: build
    taskRef:
      name: buildah-push

Step 5 — What happens automatically after a run

Once your PipelineRun completes, Chains fires automatically. You can watch for the signed annotation:

# Watch for Chains to finish signing
oc get pipelinerun my-run -o json | jq '.metadata.annotations'
# {
# "chains.tekton.dev/signed": "true",
# "chains.tekton.dev/transparency": "https://rekor.sigstore.dev/api/v1/log/entries?logIndex=12345678"
# }
# What gets stored in the OCI registry alongside your image:
# myimage:sha256-abc123.sig ← cosign image signature
# myimage:sha256-abc123.att ← SLSA provenance attestation

Step 6 — Verify images before deployment

# Set your image reference (always use digest, not tag)
IMAGE="quay.io/my-org/my-app@sha256:abc123..."
# 1. Verify the image signature
cosign verify \
  --key cosign.pub \
  "$IMAGE"
# 2. Verify the SLSA provenance attestation
cosign verify-attestation \
  --key cosign.pub \
  --type slsaprovenance \
  "$IMAGE" | jq '.payload | @base64d | fromjson'
# 3. Check the Rekor transparency log entry
rekor-cli search --sha sha256:abc123...

The SLSA provenance JSON tells you exactly what built the image — the git commit, the pipeline name, each task step, and all input dependencies.
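As a sketch of what a policy check reads out of that attestation: the provenance travels base64-encoded in the payload field. The example below builds a synthetic payload; the builder-id location follows the v0.2-style predicate layout, so adjust the path for SLSA v1 predicates (where it sits under runDetails):

```python
import base64
import json

# Decode an in-toto style attestation payload and read the builder id.
# The payload here is synthetic, for illustration only.

def builder_id(attestation):
    body = json.loads(base64.b64decode(attestation["payload"]))
    return body["predicate"]["builder"]["id"]

provenance = {"predicate": {"builder": {"id": "https://tekton.dev/chains/v2"}}}
att = {"payload": base64.b64encode(json.dumps(provenance).encode()).decode()}
print(builder_id(att))  # https://tekton.dev/chains/v2
```

This is the same field the Kyverno policy below matches on with `{{ builder.id }}`.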


Step 7 — Enforce signatures at admission (policy gate)

Verification at deployment time is where this pays off. Use OCP’s built-in image policy or Kyverno/OPA to block unsigned images:

# Kyverno policy — block any image without a valid Chains signature
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-signed-images
spec:
validationFailureAction: Enforce
rules:
- name: check-image-signature
match:
resources:
kinds: [Pod]
verifyImages:
- imageReferences:
- "quay.io/my-org/*"
attestors:
- entries:
- keys:
publicKeys: |-
-----BEGIN PUBLIC KEY-----
<your cosign.pub contents>
-----END PUBLIC KEY-----
attestations:
- predicateType: https://slsa.dev/provenance/v1
conditions:
- all:
- key: "{{ builder.id }}"
operator: Equals
value: "https://tekton.dev/chains/v2"

SLSA levels Tekton Chains achieves

SLSA level   Requirement                                           Chains status
Level 1      Provenance exists                                     Achieved — attestation generated automatically
Level 2      Signed provenance, hosted build                       Achieved — cosign signature + Rekor log entry
Level 3      Hardened build platform, non-falsifiable provenance   Achieved with OCP’s isolated pod builds
Level 4      Two-party review, hermetic builds                     Partial — requires additional hermetic build config

The key benefit: by implementing provenance in CI/CD pipelines, you protect your supply chain from tampering and unauthorized access, streamline compliance with evolving industry and government regulations, and enhance visibility and trust throughout your software lifecycle.

Understanding Tekton: A Comprehensive CI/CD Framework for Kubernetes

Tekton is a cloud-native CI/CD framework built for Kubernetes. Here’s a full breakdown:


What it is

Tekton is a Kubernetes-native open source framework for creating continuous integration and continuous delivery (CI/CD) systems. It installs and runs as an extension on a Kubernetes cluster and comprises a set of Kubernetes Custom Resources that define the building blocks you can create and reuse for your pipelines.

Tekton standardizes CI/CD tooling and processes across vendors, languages, and deployment environments. It lets you create CI/CD systems quickly, giving you scalable, serverless, cloud-native execution out of the box.


Core building blocks

Everything in Tekton is composed of these layers:

  • Step — the most basic entity, such as running unit tests or compiling a program. Tekton performs each step with a provided container image.
  • Task — a collection of steps in a specific order. Tekton runs a task in the form of a Kubernetes pod, where each step becomes a running container in the pod.
  • Pipeline — a collection of tasks in a specific order. Tekton collects all tasks, connects them in a directed acyclic graph (DAG), and executes the graph in sequence.
  • TaskRun — a specific execution of a task.
  • PipelineRun — a specific execution of a pipeline.

Example pipeline (clone → build → deploy)

# Step 1: Define a Task
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: IMAGE
      type: string
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --destination=$(params.IMAGE)
        - --context=/workspace/source
---
# Step 2: Compose Tasks into a Pipeline
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  tasks:
    - name: clone
      taskRef:
        name: git-clone   # from Tekton Catalog
    - name: build
      runAfter: [clone]
      taskRef:
        name: build-and-push
    - name: deploy
      runAfter: [build]
      taskRef:
        name: kubectl-apply
---
# Step 3: Trigger a run
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: ci-pipeline-run-001
spec:
  pipelineRef:
    name: ci-pipeline

Major components

The Tekton ecosystem includes:

  • Pipelines — the core CRDs (Task, Pipeline, etc.)
  • Triggers — allows you to create pipelines based on event payloads, such as triggering a run every time a merge request is opened against a Git repo
  • CLI (tkn) — command-line interface to interact with Tekton from your terminal
  • Dashboard — a web-based graphical interface showing pipeline execution history
  • Catalog — a repository of high-quality, community-contributed reusable Tasks and Pipelines
  • Chains — manages supply chain security, including artifact signing and SLSA provenance

Key advantages

  • Truly Kubernetes-native — every pipeline run is a real Kubernetes pod; no external CI server needed
  • Reusable and composable — Tasks from the Tekton Hub can be dropped into any pipeline
  • Event-driven — Triggers fire pipelines automatically on Git webhooks, image pushes, etc.
  • Scalable — each step runs in its own container; pipelines scale with the cluster
  • Supply chain security — Tekton Chains can sign images and generate SLSA provenance automatically

Tekton on OpenShift

Red Hat ships Tekton as OpenShift Pipelines, the officially supported Tekton operator available directly from OperatorHub. It adds OCP-specific integrations such as the OpenShift internal image registry, S2I (Source-to-Image) tasks, and the Pipeline UI in the OpenShift console. This makes Tekton the natural CI tool to pair with Argo CD or Flux for a full GitOps workflow on OCP (Tekton handles CI/build, Argo CD or Flux handles CD/deploy).


Here’s the full picture of how Tekton (CI) and Argo CD / Flux (CD) work together on OCP: first how the two halves divide responsibility, then the complete practical reference for wiring it up.


How the two halves divide responsibility

When code changes are pushed to a Git repository, OpenShift Pipelines initiates a pipeline run. This pipeline might include tasks such as building container images, running unit tests, and generating artifacts. Once the pipeline successfully completes, Argo CD continuously monitors the Git repository for changes in application manifests. Once the new image version is committed, Argo CD synchronizes the application state to match the declared state in Git.

The key insight is that Tekton owns the source repo (code → image) and Argo CD / Flux owns the config repo (manifests → cluster). Tekton never deploys directly. It commits the new image tag to a separate GitOps manifests repo, then hands off.


Step 1 — Install both operators on OCP

# OpenShift Pipelines (Tekton) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift Pipelines" → Install
# OpenShift GitOps (Argo CD) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift GitOps" → Install
# Verify both are running
oc get pods -n openshift-pipelines
oc get pods -n openshift-gitops

Step 2 — The Tekton CI pipeline

On every push or pull-request to the source Git repository, the following steps execute within the Tekton pipeline: code is cloned and unit tests are run; the application is analyzed by SonarQube in parallel; a container image is built using S2I and pushed to the OpenShift internal registry; then Kubernetes manifests are updated in the Git repository with the image digest that was built within the pipeline.

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
  namespace: cicd
spec:
  workspaces:
    - name: source
    - name: dockerconfig
  params:
    - name: GIT_URL
      type: string
    - name: IMAGE
      type: string
    - name: GIT_MANIFEST_URL   # separate repo for k8s manifests
      type: string
  tasks:
    - name: clone
      taskRef:
        name: git-clone
        kind: ClusterTask
      workspaces:
        - name: output
          workspace: source
      params:
        - name: url
          value: $(params.GIT_URL)
    - name: unit-test
      runAfter: [clone]
      taskRef:
        name: maven
        kind: ClusterTask
      workspaces:
        - name: source
          workspace: source
    - name: build-image
      runAfter: [unit-test]
      taskRef:
        name: buildah
        kind: ClusterTask
      params:
        - name: IMAGE
          value: $(params.IMAGE)
      workspaces:
        - name: source
          workspace: source
        - name: dockerconfig
          workspace: dockerconfig
    - name: scan-image
      runAfter: [build-image]
      taskRef:
        name: trivy-scanner   # from Tekton Hub
      params:
        - name: IMAGE
          value: $(params.IMAGE)
    - name: update-manifest   # THE HANDOFF to GitOps
      runAfter: [scan-image]
      taskRef:
        name: git-cli
        kind: ClusterTask
      params:
        - name: GIT_USER_NAME
          value: tekton-bot
        - name: COMMANDS
          value: |
            git clone $(params.GIT_MANIFEST_URL) /workspace/manifest
            cd /workspace/manifest
            # Update image tag in kustomization
            kustomize edit set image myapp=$(params.IMAGE)
            git add -A
            git commit -m "ci: update image to $(params.IMAGE)"
            git push

Step 3 — Tekton Triggers (webhook → pipeline)

# EventListener — receives the GitHub/GitLab webhook
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: git-push-listener
  namespace: cicd
spec:
  serviceAccountName: pipeline
  triggers:
    - name: push-trigger
      bindings:
        - ref: github-push-binding
      template:
        ref: pipeline-trigger-template
---
# TriggerTemplate — what to create when the webhook fires
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: pipeline-trigger-template
  namespace: cicd
spec:
  params:
    - name: git-revision
    - name: git-repo-url
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
      spec:
        pipelineRef:
          name: ci-pipeline
        params:
          - name: GIT_URL
            value: $(tt.params.git-repo-url)
          - name: IMAGE
            value: image-registry.openshift-image-registry.svc:5000/myapp/app:$(tt.params.git-revision)
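
The EventListener references a github-push-binding that is not defined anywhere in this section. A minimal sketch of that TriggerBinding, assuming the standard GitHub push-event JSON payload (field paths would differ for GitLab), could look like:

```yaml
# TriggerBinding (sketch) — extracts template params from a GitHub push payload.
# The JSONPath expressions assume GitHub's push-event body shape.
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: github-push-binding
  namespace: cicd
spec:
  params:
    - name: git-revision
      value: $(body.head_commit.id)        # commit SHA of the push
    - name: git-repo-url
      value: $(body.repository.clone_url)  # HTTPS clone URL of the repo
```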

Expose the EventListener as an OCP Route so GitHub/GitLab can reach it:

oc expose svc el-git-push-listener -n cicd
# Then add the route URL as a webhook in GitHub/GitLab

Step 4 — Argo CD watches and deploys

Once the manifests repo is updated by Tekton, Argo CD detects the change. With automated.prune: true and selfHeal: true, it syncs immediately and deploys the new revision.

# Argo CD Application — dev environment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: environments/dev   # Kustomize overlay for dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-dev
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual changes to the cluster
    syncOptions:
      - CreateNamespace=true
---
# Promotion to staging requires a PR merge (no auto-deploy to prod)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-staging
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    path: environments/staging
    targetRevision: staging   # separate branch = manual promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The GitOps repo layout Tekton writes to

manifests-repo/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── environments/
    ├── dev/
    │   └── kustomization.yaml     ← Tekton updates image tag here
    ├── staging/
    │   └── kustomization.yaml     ← promoted via PR merge
    └── prod/
        └── kustomization.yaml     ← promoted via PR merge + approval
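
The dev overlay that Tekton rewrites is an ordinary Kustomize overlay. A sketch of what the kustomize edit set image command in the pipeline actually mutates (the image names and tag are illustrative):

```yaml
# environments/dev/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: myapp        # image name as referenced in base/deployment.yaml
    newName: image-registry.openshift-image-registry.svc:5000/myapp/app
    newTag: abc1234    # the field "kustomize edit set image" updates on each CI run
```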

Promotion flow (dev → staging → prod)

Once the pipeline finishes successfully, the image reference in the manifests repo is updated and automatically deployed to the dev environment by Argo CD. To promote to staging, a pull request is generated targeting the staging branch. Merging that PR triggers Argo CD to sync the staging environment. Production follows the same pattern with an additional approval gate.

The promotion task in Tekton creates a PR automatically:

- name: promote-to-staging
  runAfter: [update-manifest]
  taskRef:
    name: github-open-pr   # from Tekton Hub
  params:
    - name: REPO_FULL_NAME
      value: my-org/manifests
    - name: HEAD
      value: feature/new-image-$(params.git-revision)
    - name: BASE
      value: staging
    - name: TITLE
      value: "Promote $(params.IMAGE) to staging"

Putting it all together — the complete flow

| Step | Actor | Action |
|---|---|---|
| 1 | Developer | git push to source repo |
| 2 | GitHub/GitLab | Sends webhook to Tekton EventListener |
| 3 | Tekton | Clones, tests, builds image with Buildah/S2I |
| 4 | Tekton | Scans image with Trivy / ACS |
| 5 | Tekton | Pushes image to OCP internal registry or Quay |
| 6 | Tekton | Updates image tag in manifests repo, opens PR to staging |
| 7 | Argo CD / Flux | Detects change in manifests repo, deploys to dev automatically |
| 8 | Team | Reviews and merges PR → staging auto-deploys |
| 9 | Team | Approves prod PR → production deploys |

This pattern — Tekton handles CI, Argo CD / Flux handles CD, and Git is the only bridge between them — is the standard GitOps delivery model on OCP.

Ultimate Guide to Velero for Kubernetes Backups

Velero is an open-source tool used to back up, restore, and migrate Kubernetes cluster resources and persistent volumes.

Think of it as a safety net for your Kubernetes environment.


What Velero actually does

Velero helps you:

  • Back up cluster data (like deployments, services, configs)
  • Restore your cluster if something breaks
  • Migrate workloads between clusters or cloud providers
  • Schedule automatic backups

How it works (simple view)

Velero connects your Kubernetes cluster to external storage (like cloud object storage — e.g., AWS S3, Azure Blob, etc.) and:

  1. Takes a snapshot of cluster resources
  2. Optionally backs up persistent volumes
  3. Stores everything outside the cluster
  4. Lets you restore it later when needed
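
Under the hood, the link to external storage is itself a Kubernetes custom resource, a BackupStorageLocation. A hedged sketch for an S3 bucket (the bucket name and region are placeholders) looks roughly like:

```yaml
# BackupStorageLocation (sketch) — tells Velero where backups live outside the cluster
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws                 # which object-storage plugin to use
  objectStorage:
    bucket: my-velero-backups   # placeholder bucket name
  config:
    region: us-east-1           # placeholder region
```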

What gets backed up

  • Kubernetes resources (Pods, Deployments, Services, etc.)
  • Persistent Volume data (via snapshots or file-level backups)
  • Namespaces and metadata

Common use cases

  • Disaster recovery (cluster crash, accidental deletion)
  • Migrating apps between clusters/clouds
  • Testing environments (restore production snapshot into staging)
  • Compliance backups

Velero vs basic backups

Without Velero, you’d have to manually export configs and handle storage snapshots yourself. Velero automates and organizes all of that.


Ecosystem

Velero is often used alongside:

  • Kubernetes-native tools
  • Cloud providers (AWS, Azure, GCP)
  • Storage plugins (for volume snapshots)

Here’s a simple, practical walkthrough to install Velero and run your first backup.

I’ll show the most common setup: Kubernetes + AWS S3 (others like Azure/GCP are similar).


1. Prerequisites

Make sure you have:

  • A running Kubernetes cluster
  • kubectl configured
  • An S3 bucket (or equivalent object storage)
  • AWS credentials (access key + secret)

2. Install Velero CLI

Download and install the Velero CLI:

# Mac (Homebrew)
brew install velero
# Or via binary (release assets are versioned; substitute the latest release)
VERSION=v1.13.2
curl -L https://github.com/vmware-tanzu/velero/releases/download/${VERSION}/velero-${VERSION}-darwin-amd64.tar.gz | tar -xz
sudo mv velero-${VERSION}-darwin-amd64/velero /usr/local/bin/

Verify:

velero version

3. Create credentials file

Create a file called credentials-velero:

[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY

4. Install Velero in your cluster

Run this command (replace bucket + region):

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket YOUR_BUCKET_NAME \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

This will:

  • Deploy Velero into your cluster
  • Connect it to your S3 bucket
  • Set up volume snapshot support

5. Verify installation

kubectl get pods -n velero

You should see a running Velero pod.


6. Create your first backup

Backup entire cluster:

velero backup create my-first-backup

Backup a specific namespace:

velero backup create my-backup \
  --include-namespaces my-namespace

7. Check backup status

velero backup get

Describe it:

velero backup describe my-first-backup

8. Restore from backup

velero restore create --from-backup my-first-backup

9. (Optional) Schedule automatic backups

velero schedule create daily-backup \
  --schedule="0 2 * * *"

This runs every day at 2 AM.
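
The CLI command above is shorthand for creating a Schedule custom resource. An equivalent sketch (the namespace scope and TTL are example values, not defaults):

```yaml
# Schedule (sketch) — what "velero schedule create" produces in the cluster
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"      # cron: every day at 02:00
  template:                  # same fields as a Backup spec
    includedNamespaces:
      - my-namespace         # example scope; omit to back up everything
    ttl: 168h0m0s            # keep each backup for 7 days
```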


Tips that actually matter

  • Start with namespace backups, not full cluster
  • Use labels to target specific apps
  • Test restore early (don’t wait for disaster)
  • Monitor storage costs (snapshots + S3)

Common mistakes

  • Wrong IAM permissions → backups silently fail
  • Forgetting persistent volumes → incomplete recovery
  • Not testing restores → risky in real incidents

Understanding the Shared Responsibility Model in ROSA/ARO

When moving from on-premises to a managed service like ROSA (Red Hat OpenShift Service on AWS) or ARO (Azure Red Hat OpenShift), the interview focus shifts. The technical “heavy lifting” of managing master nodes and etcd is now handled by Red Hat and AWS/Microsoft (the SRE team).

Your role as an administrator moves from “Keeping the lights on” to “Governance, Cost Optimization, and Integration.”


1. The Shared Responsibility Model

This is the #1 question for managed services.

Q: Who is responsible for what in a ROSA/ARO environment?

  • The Provider (Red Hat/Cloud Provider): Manages the Control Plane (Masters), etcd health, patching of the underlying OS, and the core OpenShift Operators.
  • The Customer (You): Manages Worker nodes (scaling), Application lifecycle, RBAC, Network Policies, and Quotas.

Interview Tip: Mention that you no longer have cluster-admin in the traditional sense on ARO; you have a customer-admin role. You cannot SSH into master nodes or modify the etcd configuration directly.


2. Day 1: Provisioning & Connectivity

Q1: How does networking differ in ROSA/ARO compared to on-prem?

Answer: In a managed service, OpenShift is integrated into the Cloud’s Virtual Private Cloud (VPC/VNet).

  • Private vs. Public Clusters: You must decide if the API and Ingress are “Public” (accessible over the internet) or “Private” (only accessible via VPN/DirectConnect/ExpressRoute).
  • VPC Peering/Transit Gateway: You are responsible for connecting the OpenShift VPC to the rest of your cloud infrastructure (e.g., to reach a managed RDS database or Azure SQL).

Q2: What is the “Assisted Installer” vs. “Cloud CLI”?

Answer: For ROSA, you use the rosa CLI. For ARO, you use the az aro command. These tools abstract the CloudFormation or ARM templates required to spin up the infrastructure.


3. Day 2: Managed Operations

Q3: How do you handle cluster upgrades in ROSA/ARO?

Answer: You don’t just “hit update” and pray.

  • In ROSA, you can schedule upgrade windows via the OpenShift Cluster Manager (OCM).
  • The Red Hat SRE team monitors the upgrade. If it fails, they are the ones paged, not you. However, you must ensure your applications have correct Pod Disruption Budgets (PDBs) so the rolling update doesn’t take down your service.
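
A PodDisruptionBudget for that point might look like the following sketch (the namespace and label selector are hypothetical); it keeps at least two replicas running while the upgrade drains nodes:

```yaml
# PodDisruptionBudget (sketch) — protects the app during node drains/upgrades
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: myapp-prod      # hypothetical namespace
spec:
  minAvailable: 2            # never voluntarily evict below 2 running pods
  selector:
    matchLabels:
      app: myapp             # hypothetical app label
```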

Q4: How do you scale the cluster in the cloud?

Answer: You use MachineAutoscalers.

  • Unlike on-prem, where you are limited by physical hardware, in ROSA/ARO you define a MachineAutoscaler that monitors the cluster’s resource requests. If a pod can’t be scheduled due to lack of CPU, the autoscaler automatically provisions a new EC2/Azure VM and joins it to the cluster.
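
A MachineAutoscaler sketch (the MachineSet name is cluster-specific and illustrative):

```yaml
# MachineAutoscaler (sketch) — scales one MachineSet between min and max replicas
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-abc12-worker-us-east-1a   # illustrative MachineSet name
```

Note that a cluster-wide ClusterAutoscaler resource must also exist for MachineAutoscalers to take effect.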

4. Cost & Security

Q5: How do you control costs in a managed OpenShift environment?

Answer: Since you pay for every worker node, I implement:

  1. Cluster Autoscaling: Scaling down to minimum nodes at night.
  2. Resource Quotas: Preventing developers from requesting 16GB of RAM for a “Hello World” app.
  3. Spot Instances: Using AWS Spot or Azure Priority instances for non-production workloads to save up to 70% on compute costs.
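
Point 2 above is typically enforced per project with a ResourceQuota; a sketch with example limits:

```yaml
# ResourceQuota (sketch) — caps what one project can request; limits are examples
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-quota
  namespace: dev-team        # hypothetical project
spec:
  hard:
    requests.cpu: "10"       # total CPU requests across the namespace
    requests.memory: 20Gi
    limits.memory: 32Gi
    pods: "50"
```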

Q6: How do you handle Authentication?

Answer: You typically don’t use local users. You integrate OpenShift with Azure AD (Entra ID) or AWS IAM/OIDC.

  • Question: “How do pods access cloud resources (like S3 or Azure Vault)?”
  • Answer: STS (Security Token Service) or Managed Identities. This allows pods to assume a cloud role without needing to store static “Access Keys” inside a Secret.
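
On ROSA with STS, for example, pod identity is typically wired through a ServiceAccount annotation that the pod identity webhook picks up; a hedged sketch (the role ARN and namespace are placeholders):

```yaml
# ServiceAccount (sketch) — pods using it receive temporary STS credentials
# for the annotated IAM role; no static access keys in a Secret.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: myapp-prod      # hypothetical namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/myapp-s3-reader
```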

5. Summary Comparison: On-Prem vs. Managed

| Feature | On-Prem (Bare Metal/VMware) | Managed (ROSA/ARO) |
|---|---|---|
| Control Plane | You manage (3 VMs/Servers) | Managed by SRE (Hidden/Bundled) |
| Updates | Manual / High Risk | Scheduled / Automated |
| Load Balancer | MetalLB / F5 / HAProxy | AWS NLB/ALB or Azure LB |
| Storage | ODF / vSphere CSI | EBS/EFS or Azure Disk/Files |
| Failure Response | You get paged at 3 AM | Red Hat/Cloud SRE handles infra |

The “Pro” Managed Question:

“If the cluster is managed by Red Hat, why do they still need an Administrator like you?”

Winning Answer: “Because while Red Hat manages the platform, I manage the consumption. I ensure the networking between our VPCs is secure, I manage the RBAC and onboarding for our developers, I optimize costs so we aren’t over-provisioning cloud resources, and I implement the CI/CD patterns that allow our apps to run reliably on that platform.”

Mastering Day 1 and Day 2 of Cluster Management

This is a classic way for interviewers to see if you have actually managed a cluster in production. Day 1 is about getting the cluster alive; Day 2 is about keeping it from dying.

In a senior interview, they expect you to spend most of your time talking about Day 2, as that represents 99% of a cluster’s lifespan.


Day 1: Installation & Provisioning

Focus: Automation, Infrastructure, and “Getting to Green.”

| Task | On-Premise Reality Check |
|---|---|
| DNS Setup | Creating the critical records: api, api-int, and *.apps. Without these, the bootstrap will fail. |
| Load Balancing | Setting up external HAProxy or F5 (for UPI) or ensuring VIPs are reserved (for IPI). |
| Ignition Configs | Using the installer to generate .ign files and serving them via HTTP/PXE to the bare metal/VM nodes. |
| Certificate Approval | Manually running oc get csr and approving them to allow nodes to join the cluster. |
| Registry Mirroring | (If Air-gapped) Setting up the local Quay/Nexus registry and the ImageContentSourcePolicy. |

Day 2: Maintenance & Operations

Focus: Stability, Compliance, and Scaling.

1. Lifecycle Management
  • Cluster Upgrades: Navigating the “Update Graph.” Choosing between the stable and fast channels.
  • Certificate Rotation: Monitoring the expiration of the internal API and Ingress CA (though OpenShift now automates most of this, an admin must know how to fix a “stuck” rotation).
  • Node Scaling: Adding new Bare Metal workers via the Assisted Installer or expanding VMware Resource Pools.
2. Performance & Health
  • Etcd Maintenance: Performing periodic defragmentation and manual snapshots before any major change.
  • Logging Stack Management: Tuning the Elasticsearch/Fluentd (or Loki) stack. On-premise, this often means managing “PVC full” issues when logs grow too fast.
  • Pruning: Running oc adm prune to clean up old builds, images, and deployments that are cluttering the etcd database.
3. Security & Governance
  • RBAC Auditing: Ensuring developers aren’t using cluster-admin for daily tasks.
  • SCC Policy: Managing exceptions for specialized workloads (e.g., giving a monitoring agent privileged access).
  • Quota Management: Defining ResourceQuotas per Project to prevent a single “noisy neighbor” from consuming all physical RAM on your ESXi hosts.

The “Senior Admin” Bonus: Disaster Recovery (DR)

An interviewer will almost certainly ask: “What is your DR strategy for on-prem?”

A high-quality answer includes:

  1. Etcd Backups: Stored outside the cluster (e.g., on an external S3 bucket or NAS).
  2. Velero: Using the Velero operator to back up application metadata and Persistent Volumes (using CSI snapshots).
  3. Multi-Cluster: Having a second “Passive” cluster in a different data center and using Red Hat Advanced Cluster Management (RHACM) to shift traffic via DNS if the primary DC goes dark.

Final Interview Tip: The “Why”

When answering, don’t just say what you did; say why it matters for the business:

  • Wrong: “I configured the MTU to 1400.”
  • Right: “I lowered the MTU to 1400 to prevent packet fragmentation over our Geneve tunnels, which reduced our application latency by 30%.”