Debugging DNS Issues in OpenShift Pods

DNS works for some pods but not others. This one is tricky because it often looks like an OVN problem, but much of the time the real cause is the DNS path, a namespace lookup mistake, or the pod's DNS configuration.

In OpenShift, the DNS Operator manages CoreDNS for pod and service name resolution, and CoreDNS runs as the dns-default daemon set in openshift-dns. Pods rely on kubelet-provided DNS settings in /etc/resolv.conf to reach those DNS servers. (Red Hat Documentation)

Scenario

Some pods can resolve service names, but others cannot.

Examples:

  • Pod A: nslookup backend-service ✅ resolves
  • Pod B: nslookup backend-service ❌ fails

That usually means one of these:

  • the failing pod has bad DNS settings,
  • the query is being made from the wrong namespace,
  • only some nodes can reach the DNS pods,
  • or the DNS pods themselves are unhealthy on part of the cluster. (Red Hat Documentation)

Diagram

                +------------------------------+
                |        failing pod           |
                |  /etc/resolv.conf            |
                |  nameserver -> DNS service   |
                +--------------+---------------+
                               |
                               v
                    +---------------------+
                    |   CoreDNS /         |
                    |   dns-default pods  |
                    |   in openshift-dns  |
                    +----------+----------+
                               |
                 resolves svc/pod names from cluster state
                               |
                               v
                    +---------------------+
                    |  Service / Pod DNS  |
                    |  records            |
                    +---------------------+

Where it breaks:
1) Pod resolv.conf is wrong
2) Pod queries wrong namespace
3) Pod/node cannot reach dns-default
4) dns-default pods unhealthy
5) Name exists, but target service/endpoints are wrong


How to debug it

1. Prove it is DNS and not general networking

From a good pod and a bad pod, test both DNS and direct IP access:

oc exec -it <good-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- curl http://<service-cluster-ip>:<port>
oc exec -it <bad-pod> -- curl http://<pod-ip>:<port>

If IP-based access works but nslookup fails, that points strongly to DNS rather than OVN datapath routing. Kubernetes service and pod discovery are meant to work through DNS records. (Kubernetes)

2. Check the failing pod’s /etc/resolv.conf

This is one of the fastest checks:

oc exec -it <bad-pod> -- cat /etc/resolv.conf

A normal pod DNS config should include a cluster DNS nameserver and search domains such as the pod namespace, svc.cluster.local, and cluster.local; Kubernetes documents options ndots:5 as typical too. If those are missing or odd, the pod DNS setup is wrong. (Kubernetes)
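To make the check concrete, here is an illustrative Python helper (not part of any OpenShift tooling; the cluster DNS IP 172.30.0.10 and the cluster.local domain are assumptions, substitute your cluster's actual values) that flags missing pieces in a pod's resolv.conf:

```python
# Hypothetical sanity check for a pod's /etc/resolv.conf contents.
# The expected nameserver and domain below are assumptions.

def check_resolv_conf(text, cluster_dns="172.30.0.10", cluster_domain="cluster.local"):
    """Return a list of findings; an empty list means the config looks normal."""
    nameservers, search, options = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "nameserver":
            nameservers.extend(parts[1:])
        elif parts[0] == "search":
            search.extend(parts[1:])
        elif parts[0] == "options":
            options.extend(parts[1:])
    findings = []
    if cluster_dns not in nameservers:
        findings.append("cluster DNS nameserver missing")
    if f"svc.{cluster_domain}" not in search:
        findings.append("svc search domain missing")
    if not any(o.startswith("ndots:") for o in options):
        findings.append("ndots option missing")
    return findings

sample = """nameserver 172.30.0.10
search frontend.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(check_resolv_conf(sample))  # [] -> looks normal
```

Anything this flags on the failing pod is worth comparing against a healthy pod's resolv.conf.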

3. Make sure the pod is querying the right namespace

A very common false alarm:

oc exec -it <bad-pod> -- nslookup backend-service
oc exec -it <bad-pod> -- nslookup backend-service.<namespace>

Kubernetes says unqualified service names are resolved relative to the pod’s own namespace. So backend-service from namespace frontend will not find a service that lives in namespace backend unless you query backend-service.backend. (Kubernetes)
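The search-list behaviour can be sketched in a few lines. This is a simplified model of glibc-style resolution, not the resolver's actual code; the search domains mirror a pod in a hypothetical frontend namespace:

```python
# Sketch of how the resolver expands an unqualified name using the pod's
# search domains. With ndots:5, any name with fewer than 5 dots tries the
# search list before being queried as-is.

def candidate_queries(name, search_domains, ndots=5):
    """Return the order of FQDNs the resolver will try for `name`."""
    if name.endswith("."):          # already fully qualified, no expansion
        return [name]
    candidates = []
    if name.count(".") < ndots:     # try the search list first
        candidates += [f"{name}.{d}." for d in search_domains]
    candidates.append(name + ".")   # finally, the name as given
    return candidates

search = ["frontend.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(candidate_queries("backend-service", search)[0])
# backend-service.frontend.svc.cluster.local.  <- pod's own namespace first
print(candidate_queries("backend-service.backend", search)[1])
# backend-service.backend.svc.cluster.local.   <- this candidate resolves
```

The first candidate for the short name lands in the pod's own namespace, which is exactly why the cross-namespace lookup fails until you qualify the name.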

4. Check whether the DNS pods are healthy

In OpenShift, look at the DNS operator and DNS pods:

oc get clusteroperator dns
oc get pods -n openshift-dns
oc get pods -n openshift-dns-operator

Red Hat documents that the DNS Operator manages CoreDNS, and that CoreDNS runs as the dns-default daemon set. If those pods are crashlooping, pending, or missing on expected nodes, pods may lose name resolution. (Red Hat Documentation)

5. Check whether only some nodes are affected

If only pods on one worker fail DNS, compare node placement:

oc get pods -A -o wide | grep <failing-node>
oc get pods -n openshift-dns -o wide

Red Hat notes DNS is available to all pods if DNS pods are running on some nodes and nodes without DNS pods still have network connectivity to nodes with DNS pods. So “only pods on node X fail DNS” often means node-to-DNS connectivity is broken rather than CoreDNS being globally broken. (Red Hat Documentation)
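A toy version of that comparison (node names are invented) is just a set difference between the nodes hosting failing pods and the nodes hosting dns-default pods:

```python
# Toy sketch of the node comparison: pods on a node with no local
# dns-default pod depend on cross-node reachability to a node that has one.
# Node names below are made up for illustration.

dns_pod_nodes = {"worker-1", "worker-3"}   # from: oc get pods -n openshift-dns -o wide
bad_pod_nodes = {"worker-2", "worker-3"}   # nodes hosting the failing pods

# Failing-pod nodes that must reach DNS over the network:
no_local_dns = bad_pod_nodes - dns_pod_nodes
print(no_local_dns)  # {'worker-2'} -> check worker-2's path to the DNS nodes
```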

6. Test from a clean debug pod

This removes app-side noise:

oc run dns-debug --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -it --rm -- sh
nslookup kubernetes.default
nslookup backend-service.<namespace>
cat /etc/resolv.conf

Kubernetes recommends creating a simple test pod and using nslookup kubernetes.default as a baseline DNS test. (Kubernetes)

7. Check DNS service reachability from the bad pod

If you know the DNS service IP from /etc/resolv.conf, test whether the pod can even reach it. If the DNS nameserver is unreachable from only some pods or nodes, the issue is likely network path to DNS, not DNS records themselves. This is an inference from the Kubernetes debug flow and OpenShift’s note about node connectivity to DNS pods. (Kubernetes)
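One way to test raw reachability without nslookup is to hand-roll a minimal DNS query and see whether anything answers on UDP/53. The packet builder below follows the standard DNS wire format; the service IP in the commented call is an assumption, read yours from the pod's resolv.conf:

```python
import socket
import struct

# Minimal raw DNS A-query: any reply at all proves the network path to the
# DNS service works, independent of whether the record itself exists.

def build_dns_query(name, txid=0x1234):
    """Build a DNS query packet for an A record (RD=1, one question)."""
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.strip(".").split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def probe(nameserver, name="kubernetes.default.svc.cluster.local", timeout=2):
    """Return True if the nameserver answers anything on UDP/53."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_dns_query(name), (nameserver, 53))
        s.recvfrom(512)
        return True
    except socket.timeout:
        return False   # path to the DNS service is broken, not the records
    finally:
        s.close()

# probe("172.30.0.10")  # run from inside the failing pod
```

A timeout here points at connectivity to DNS; a reply (even NXDOMAIN) points back at records or search-path issues.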

8. Check logs from the DNS pods

If the DNS pods are up but resolution still fails:

oc logs -n openshift-dns <dns-default-pod>

If you are testing a workaround, Red Hat documents that the DNS Operator can be set to Unmanaged, but they also note you cannot upgrade while it remains unmanaged. (Red Hat Documentation)

What this usually turns out to be

Most common causes:

  • Wrong namespace lookup: querying service instead of service.namespace. (Kubernetes)
  • Bad pod DNS config: strange or missing nameserver/search domains in /etc/resolv.conf. (Kubernetes)
  • DNS pods unhealthy: dns-default issues in openshift-dns. (Red Hat Documentation)
  • Node-specific connectivity issue: pods on one node cannot reach DNS pods running elsewhere. (Red Hat Documentation)
  • Service confusion: DNS resolves, but the target service or endpoints are wrong, making it look like DNS. Kubernetes DNS only gives you the name-to-record mapping; the service still has to be valid. (Kubernetes)

Fast triage sequence

oc exec -it <bad-pod> -- cat /etc/resolv.conf
oc exec -it <bad-pod> -- nslookup kubernetes.default
oc exec -it <bad-pod> -- nslookup <service>.<namespace>
oc get clusteroperator dns
oc get pods -n openshift-dns -o wide
oc logs -n openshift-dns <dns-default-pod>

Mental model

When DNS fails only for some pods:

  • if all traffic is broken, think OVN/node networking
  • if IP access works but names fail, think DNS
  • if short names fail but FQDN works, think namespace/search path
  • if only one node’s pods fail, think node-to-dns connectivity
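The same four bullets can be written as a tiny decision helper (the symptom flags and return strings are my own shorthand, not an official taxonomy):

```python
# Decision helper mirroring the mental model above.

def dns_triage(ip_works, short_name_works, fqdn_works, single_node_only):
    if not ip_works:
        return "OVN / node networking"
    if not fqdn_works:
        return "node-to-DNS connectivity" if single_node_only else "DNS service / CoreDNS"
    if not short_name_works:
        return "namespace / search path"
    return "not a DNS problem"

print(dns_triage(True, False, True, False))   # namespace / search path
print(dns_triage(True, False, False, True))   # node-to-DNS connectivity
```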

Debugging ClusterIP Issues in OVN Kubernetes



Scenario

Service works via pod IP, but fails via ClusterIP (service name/IP)

Environment:

  • frontend → calling backend
  • Direct call works: curl http://10.128.2.15:8080 ✅
  • Service call fails: curl http://backend-service ❌

What this means (important insight)

If pod IP works but service fails, then:

  • Pod networking (OVN routing) is working
  • The problem is in the service load-balancing layer inside OVN-Kubernetes


Mental model (diagram)

        +------------+    direct pod IP      +------------+
        |  frontend  | --------------------> |  backend   |   ✅ works
        |    pod     |                       |    pod     |
        +-----+------+                       +------------+
              |
              |  via backend-service (ClusterIP)
              v
        +---------------------+
        |  OVN load balancer  |   ❌ broken here
        +---------------------+

Interpretation:

  • Pod → Pod = direct routing (works)
  • Pod → Service = goes through OVN load balancer (broken here)

Step-by-step debugging

Step 1: Confirm endpoints exist

oc get endpoints backend-service

If EMPTY:

Root cause = wrong labels

Example:

# Service selector
selector:
  app: backend

But pod has:

labels:
  app: api   # ❌ mismatch

Fix labels → service starts working instantly
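The matching rule behind empty endpoints is simple: a pod backs a service only if its labels contain every key/value in the service selector. A minimal sketch (names and IPs are illustrative):

```python
# Endpoint selection as a subset test on labels.

def matching_endpoints(selector, pods):
    """pods: list of (labels, ip). Return IPs whose labels satisfy the selector."""
    return [ip for labels, ip in pods if selector.items() <= labels.items()]

pods = [({"app": "api"}, "10.128.2.15")]            # pod labelled app=api
print(matching_endpoints({"app": "backend"}, pods))  # [] -> service is dead
print(matching_endpoints({"app": "api"}, pods))      # ['10.128.2.15'] -> fixed
```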


Step 2: Verify service definition

oc get svc backend-service -o yaml

Check:

  • correct port
  • correct targetPort

Common mistake:

port: 80
targetPort: 8080   # ✅ targetPort must match the container port
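A sketch of the same check: port is what clients call, targetPort must be a port the container actually listens on (values here are illustrative):

```python
# Flag a service whose targetPort hits no container port.

def port_mismatch(service, container_ports):
    """Return True when the targetPort matches no container port."""
    return service["targetPort"] not in container_ports

svc = {"port": 80, "targetPort": 8080}
print(port_mismatch(svc, container_ports={8080}))  # False -> OK
print(port_mismatch(svc, container_ports={8000}))  # True  -> connection refused
```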

Step 3: Test ClusterIP directly

curl <ClusterIP>:<port>

Results:

  • ❌ fails → OVN load balancer issue
  • ✅ works → DNS issue instead

Step 4: Check DNS (don’t skip this)

From pod:

nslookup backend-service

If it fails:

→ Not OVN
→ Check:

oc get pods -n openshift-dns

Step 5: Inspect OVN load balancer

On a node:

oc debug node/<node>
chroot /host

Then:

ovn-nbctl lb-list

You should see something like:

VIP: 172.30.0.10:80 → 10.128.2.15:8080

If missing:

OVN didn’t program the service
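If you want to script the check, a hypothetical parser for the stylized mapping shown above could look like this. Note that real ovn-nbctl lb-list output is a table (UUID, LB, PROTO, VIP, IPs), so adapt the parsing to your version's columns:

```python
# Hypothetical parser for a "VIP: <vip> -> <backends>" style mapping.
# Adjust for the tabular output your ovn-nbctl version actually prints.

def find_backends(lb_output, vip):
    """Return the backend list for a VIP, or [] if it was never programmed."""
    for line in lb_output.splitlines():
        norm = line.replace("→", "->")
        if vip in norm and "->" in norm:
            return [b.strip() for b in norm.split("->")[1].split(",")]
    return []   # VIP missing: OVN never synced the service

out = "VIP: 172.30.0.10:80 -> 10.128.2.15:8080"
print(find_backends(out, "172.30.0.10:80"))  # ['10.128.2.15:8080']
print(find_backends(out, "172.30.0.99:80"))  # [] -> not programmed
```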


Step 6: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-master>

Look for:

  • load balancer sync errors
  • endpoint update failures

Step 7: Check kube-proxy replacement

In OpenShift Container Platform, OVN replaces kube-proxy.

So if service routing is broken:
It’s handled by OVN, not iptables


Real root causes (from production)

1. Label mismatch (MOST COMMON)

  • Service selector doesn’t match pod
    → no endpoints → service dead

2. Wrong port/targetPort

  • Service pointing to wrong container port
    → connection refused

3. OVN load balancer not programmed

  • OVN DB out of sync
    → ClusterIP has no backend mapping

4. NetworkPolicy blocking service traffic

  • Pod allows direct IP but blocks service path
    (less common but happens)

5. DNS issue (misdiagnosed often)

  • Service name fails, ClusterIP works

Fast debugging logic (this is gold)

When pod IP works but service fails:

  1. Endpoints exist?
    • ❌ → labels problem
  2. ClusterIP works?
    • ❌ → OVN load balancing
  3. DNS works?
    • ❌ → DNS issue

Pro tip (what experts do fast)

From a debug pod:

oc run debug --image=busybox -it --rm -- sh

Run:

nslookup backend-service
wget -qO- http://<ClusterIP>:<port>
wget -qO- http://<pod-IP>:<port>

(busybox ships wget rather than curl)

This instantly isolates:

  • DNS
  • service
  • networking

Key takeaway

  • Pod IP = routing layer (OVN switching)
  • Service IP = OVN load balancer layer
  • If one works and the other doesn’t → you know exactly where to look

Troubleshooting Node-Specific Pod Traffic Failures

Scenario

Traffic works for pods on node A, but fails for pods on node B.

That usually points to a node-local OVN/OVS problem, not an app problem.

Example:

  • frontend on worker-1 can reach backend
  • same app on worker-2 cannot

That pattern is a huge clue.


How to debug it

1. Prove it’s node-specific

List pods and nodes:

oc get pods -A -o wide

Run the same network test from a pod on each node:

oc exec -it <good-pod> -- curl http://<target-pod-ip>:<port>
oc exec -it <bad-pod> -- curl http://<target-pod-ip>:<port>

If one node always works and another always fails, focus on the bad node.


2. Check the OVN pod on the bad node

Find the ovnkube-node pod for that worker:

oc get pods -n openshift-ovn-kubernetes -o wide

Look for the pod scheduled on the failing node.

Then inspect:

oc describe pod -n openshift-ovn-kubernetes <ovnkube-node-pod>
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Things that matter:

  • restarts
  • readiness failures
  • DB connection errors
  • OVS/flow programming errors

If ovnkube-node is unhealthy there, that is often the root cause.


3. Check node readiness and basic health

oc get node
oc describe node <bad-node>

Look for:

  • NotReady
  • memory/disk pressure
  • network-related events

Sometimes OVN is fine and the node itself is degraded.


4. Inspect OVS on the bad node

Open a debug shell:

oc debug node/<bad-node>
chroot /host

Then:

ovs-vsctl show

You want to see expected bridges such as br-int.

Also useful:

ovs-ofctl dump-ports br-int
ovs-appctl bond/show

Red flags:

  • missing br-int
  • interfaces missing
  • counters not increasing on expected ports

If OVS is broken on that node, pod traffic there will fail even while the rest of the cluster looks fine.


5. Check the node’s host networking

Still on the node:

ip addr
ip route
ip link

Look for:

  • missing routes
  • down interfaces
  • wrong MTU

A node can have OVN running, but if the host interface or route is wrong, encapsulated traffic will still fail.


6. Compare MTU with a working node

MTU mismatches are sneaky.

On both a good node and bad node:

ip link

Look at the main NIC and OVN-related interfaces.

Symptoms of MTU trouble:

  • DNS works sometimes
  • small pings work
  • larger curls/higher-volume traffic fail or hang

A quick test from a pod can help:

ping -M do -s 1400 <target-ip>

If smaller packets work and larger ones fail, suspect MTU.
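The arithmetic behind that test, as a sketch: OVN-Kubernetes encapsulates pod traffic in Geneve, which typically costs on the order of 100 bytes of headroom. Treat the exact figure as an assumption and check your cluster's configured MTU:

```python
# Rough MTU arithmetic. GENEVE_OVERHEAD is a rule-of-thumb assumption;
# the exact overhead depends on IP version and Geneve options.

HOST_MTU = 1500
GENEVE_OVERHEAD = 100
cluster_mtu = HOST_MTU - GENEVE_OVERHEAD
print(cluster_mtu)   # 1400

# `ping -M do -s N` sends N payload bytes plus 28 bytes of ICMP/IPv4 headers:
def fits(payload, mtu=cluster_mtu):
    return payload + 28 <= mtu

print(fits(1372))  # True  -> should pass
print(fits(1400))  # False -> will fail if the cluster MTU is 1400
```

This is why `-s 1400` fails on a 1400-byte cluster MTU even though the host MTU is 1500.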


7. Check if pod wiring exists on the bad node

From the failing node’s ovnkube-node logs, check whether the affected pod sandbox/interface got programmed correctly.

Also inspect pods on that node:

oc get pods -A -o wide | grep <bad-node>

If all pods on that node fail, it is likely node OVN/OVS or host network.
If only one pod fails, it may be a pod-specific attachment/setup issue.


8. Test service vs direct pod IP

From a failing pod:

curl http://<target-pod-ip>:<port>
curl http://<service-cluster-ip>:<port>

Interpretation:

  • both fail → node/local OVN path likely broken
  • pod IP works, service fails → service/load-balancer programming problem
  • DNS name fails, ClusterIP works → DNS problem

This helps avoid blaming OVN for the wrong layer.


9. Check for node-local firewall or host changes

On the bad node, inspect whether something changed outside OpenShift:

iptables -S
nft list ruleset
systemctl status ovs-vswitchd
systemctl status ovn-controller

A manual host change, bad firewall rule, or failed service can break just one node.


10. Restart scope carefully

If evidence points clearly to the bad node’s OVN stack, a targeted recovery step is safer than broad cluster changes.

Typical sequence:

  • cordon/drain the bad node if workloads are impacted
  • restart or recover the bad node’s OVN/OVS components
  • verify traffic before uncordoning

Avoid random restarts cluster-wide unless you’ve ruled out a local issue.


What this usually turns out to be

Most common causes:

  • ovnkube-node unhealthy on one node
  • broken or stale OVS state on that node
  • host NIC / route / MTU mismatch
  • node-specific firewall or kernel/network issue
  • the node recently rebooted or partially lost connectivity to OVN DB

Fast triage checklist

When traffic fails only on one node, I’d do this in order:

oc get pods -A -o wide
oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes <ovnkube-node-on-bad-node>
oc debug node/<bad-node>
chroot /host
ovs-vsctl show
ip route
ip link
systemctl status ovs-vswitchd
systemctl status ovn-controller

That usually gets you very close.


Mental model

When only one node is broken:

  • cluster-wide policy is less likely
  • app config is less likely
  • service config is less likely
  • node-local data plane is most likely

So think:
bad node → ovnkube-node → OVS → host NIC/route/MTU


Here’s a realistic example:

  • pods on worker-2 cannot reach anything off-node
  • pods on worker-1 are fine
  • ovnkube-node on worker-2 shows repeated connection/programming errors
  • ovs-vsctl show on worker-2 is missing expected state

That strongly suggests the fix is on worker-2, not in the app or service definitions.

Debugging OVN Issues in OpenShift

Let’s walk through a realistic, production-style OVN debugging scenario in
OpenShift Container Platform using OVN-Kubernetes.


Scenario

A frontend pod cannot reach a backend service

You have:

  • frontend pod
  • backend pod
  • backend-service (ClusterIP)

And:

curl http://backend-service

fails


Step-by-step debugging (real flow)

Step 1: Check if backend pod is healthy

oc get pods -o wide

You want:

  • Backend pod = Running
  • Has an IP (e.g., 10.128.2.15)

If pod is not running → stop here (not an OVN issue)


Step 2: Test direct pod-to-pod connectivity

From frontend pod:

oc exec -it frontend -- curl http://10.128.2.15

Outcomes:

Case A: This FAILS

→ Problem is networking (OVN / policy / routing)

Case B: This WORKS

→ Networking is fine → problem is service layer


Branch A: Pod-to-pod FAILS (OVN issue)

Step 3A: Check NetworkPolicies

oc get networkpolicy -A

Look for anything like:

  • Deny all ingress
  • Missing allow rules

Quick test:
Create temporary allow-all policy

If it suddenly works → root cause = NetworkPolicy


Step 4A: Check node-level OVN

Find nodes:

oc get pods -o wide

Then:

oc get pods -n openshift-ovn-kubernetes -o wide

Check:

  • Is ovnkube-node running on both nodes?
  • Any restarts?

Step 5A: Test OVS health

oc debug node/<node>
chroot /host
ovs-vsctl show

Look for:

  • br-int bridge
  • Proper interfaces

Missing interfaces = OVN not wiring pods correctly


Step 6A: Check OVN logs

oc logs -n openshift-ovn-kubernetes <ovnkube-node>

Common errors:

  • Flow install failures
  • DB sync issues

Branch B: Pod-to-pod WORKS, Service FAILS

This is VERY common and often misunderstood.


Step 3B: Check service

oc get svc backend-service -o wide

Check:

  • ClusterIP exists
  • Correct port

Step 4B: Check endpoints

oc get endpoints backend-service

If EMPTY:

→ Service is not linked to pods

Root cause:

  • Wrong selector labels

Fix:

selector:
  app: backend

Step 5B: Test service IP directly

curl <ClusterIP>

Fails but pod IP works:

→ OVN load-balancing issue


Step 6B: Check OVN load balancer

On node:

ovn-nbctl lb-list

You should see:

  • Service IP mapped to pod IPs

If missing → OVN not programming service


Bonus: DNS check (often confused with OVN)

From frontend:

nslookup backend-service

If fails:

→ DNS issue, NOT OVN

Check:

oc get pods -n openshift-dns

Real root cause examples (from production)

Case 1: Wrong labels

  • Service selector doesn’t match pod
    → No endpoints → service fails

Case 2: NetworkPolicy blocking traffic

  • Default deny policy applied
    → Pods isolated

Case 3: OVN desync

  • Pod exists but not in OVN DB
    → No routing

Case 4: Node issue

  • Only pods on one node fail
    ovnkube-node broken there

Case 5: MTU mismatch

  • Small packets work, large fail
    → Very tricky to spot

The mental model (this is what experts use)

When debugging:

  1. Pod IP → works?
    • ❌ → OVN / policy / routing
    • ✅ → go to service layer
  2. Service endpoints exist?
    • ❌ → labels problem
    • ✅ → OVN load balancing
  3. DNS works?
    • ❌ → DNS, not OVN

Pro move (what senior engineers do)

Spin up a debug pod:

oc run debug --image=busybox -it --rm -- sh

Then test:

  • ping
  • wget (busybox has no curl)
  • nslookup

This removes app complexity completely.


Understanding OVN in OpenShift: A Networking Overview

In OpenShift Container Platform (OCP), OVN refers to Open Virtual Network, used via OVN-Kubernetes. It’s the default networking solution in modern OpenShift clusters.


What OVN is (in simple terms)

OVN is a software-defined networking (SDN) system that:

  • Creates virtual networks for pods
  • Handles routing, switching, and network policies
  • Replaces older OpenShift SDN implementations

Think of it as the “network brain” of your cluster.


How OVN works in OCP

Core components

  • OVN Northbound DB → stores high-level network config
  • OVN Southbound DB → translates config to actual rules
  • OVN Controller (on each node) → programs networking rules
  • Open vSwitch (OVS) → does the actual packet forwarding

What it actually does

1. Pod networking

Each pod gets:

  • Its own IP address
  • Ability to talk to other pods across nodes

2. Routing

Handles:

  • Pod-to-pod communication
  • Pod-to-service traffic
  • External access (ingress/egress)

3. Network policies

Controls traffic like:

  • “Only allow frontend → backend”
  • “Block everything except specific ports”

4. Load balancing

Implements Kubernetes Services internally


Traffic flow (simplified)

  1. Pod sends traffic
  2. Goes through Open vSwitch (OVS) on the node
  3. OVN rules decide:
    • Where it goes
    • Whether it’s allowed
  4. Traffic reaches destination pod/service

Why OCP uses OVN

Compared to older SDN:

  • Better performance and scalability
  • Native support for Kubernetes NetworkPolicy
  • More flexible routing and IP management
  • Cleaner architecture (decoupled control + data plane)

OVN vs OpenShift SDN

Feature           OVN-Kubernetes    OpenShift SDN (legacy)
Performance       Higher            Lower
NetworkPolicy     Full support      Limited
IPv6              Supported         Limited
Future support    ✅ Active         ❌ Deprecated

Key concepts to remember

  • OVN = control plane for networking
  • OVS = data plane (moves packets)
  • Pods communicate via virtual networking managed by OVN
  • Policies are enforced at the network layer

Real-world example

If you deploy:

  • frontend pod
  • backend pod

With OVN:

  • Both get IPs
  • OVN ensures routing between them
  • A NetworkPolicy can allow only frontend → backend traffic

Debugging OVN in OpenShift Container Platform (via OVN-Kubernetes) can feel overwhelming at first, but there’s a clear, structured way to approach it.

Below is a practical, field-tested workflow you can follow.


0. Start with the symptom

Before touching OVN internals, identify the issue type:

  • ❌ Pod can’t reach another pod
  • ❌ Pod can’t reach a service
  • ❌ External traffic not working
  • ❌ DNS failing
  • ❌ Only some nodes affected

This helps narrow the scope fast.


1. Check cluster networking health

oc get co network

  • Should be Available=True
  • If Degraded → OVN problem likely

2. Check OVN pods

oc get pods -n openshift-ovn-kubernetes

Look for:

  • CrashLoopBackOff
  • NotReady pods

Key pods:

  • ovnkube-node (runs on every node)
  • ovnkube-master

3. Check logs (most important step)

Node-level (data plane issues)

oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod>

Control plane

oc logs -n openshift-ovn-kubernetes <ovnkube-master-pod>

Look for:

  • Flow programming errors
  • DB connection failures
  • OVS issues

4. Validate pod networking

Get pod IPs:

oc get pods -o wide

Test connectivity:

oc exec -it <pod> -- ping <other-pod-ip>

If this fails:

  • Likely OVN routing or policy issue

5. Check NetworkPolicies

oc get networkpolicy -A

Common mistake:

  • Policy blocking traffic unintentionally

Test by temporarily removing policy or creating an allow-all:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
  egress:
  - {}

6. Check Open vSwitch (OVS)

SSH into a node:

oc debug node/<node-name>
chroot /host

Then:

ovs-vsctl show

Look for:

  • Bridges (like br-int)
  • Missing interfaces = problem

7. Inspect OVN DB state

From master node:

ovn-nbctl show

Check:

  • Logical switches
  • Ports for pods

If missing → OVN not programming correctly


8. Check services & kube-proxy replacement

OVN replaces kube-proxy.

Check:

oc get svc

Test:

curl <service-cluster-ip>

If service fails but pod IP works:
→ Load balancing issue in OVN


9. Check egress / external connectivity

From pod:

curl google.com

If fails:

  • Check EgressFirewall / EgressIP
  • Check node routing

10. Use must-gather (for deep issues)

oc adm must-gather -- /usr/bin/gather_network_logs

This collects:

  • OVN DB state
  • OVS config
  • Logs

Common real-world issues

1. MTU mismatch

Symptoms:

  • Intermittent connectivity
  • Large packets fail

2. NetworkPolicy blocking traffic

Very common in production


3. OVN DB not syncing

Symptoms:

  • Pods exist but no routes

4. Node-specific issues

  • Only pods on one node fail → check that node’s ovnkube-node

5. DNS issues (often misdiagnosed as OVN)

Check:

oc get pods -n openshift-dns

Debugging mindset (this is key)

Always go in this order:

  1. Is cluster networking healthy?
  2. Are OVN pods running?
  3. Is traffic blocked (policy)?
  4. Is routing broken (OVN/OVS)?
  5. Is it actually DNS or app issue?

Pro tip

Use a debug pod:

oc run test --image=busybox -it --rm -- sh

From there:

  • ping
  • nslookup
  • wget (busybox has no curl)

This isolates networking from your app.

Image Signing Guide with Tekton Chains on OCP

A complete guide to image signing with Tekton Chains on OCP — covering the concept, the setup, the pipeline integration, and verification.

What Tekton Chains does

Tekton Chains works by watching TaskRuns and PipelineRuns. Once a run is observed as completed, Chains takes a snapshot of the finished TaskRun/PipelineRun and starts its core work in order: formatting (generate the provenance JSON) → signing (sign the payload with the configured key) → uploading (push the provenance and its signature to the configured storage).

It operates entirely automatically — you don’t modify your pipeline at all. Chains watches completed runs and signs in the background.


Step 1 — Chains is already installed on OCP

The Red Hat OpenShift Pipelines Operator installs Tekton Chains by default. You can configure Tekton Chains by modifying the TektonConfig custom resource; the Operator automatically applies the changes that you make.

# Verify Chains is running
oc get pods -n openshift-pipelines | grep chains
# tekton-chains-controller-xxx Running

Step 2 — Generate a signing key pair

# Install cosign (if not already)
brew install cosign # or download binary
# Generate key pair — stores private key as K8s secret automatically
cosign generate-key-pair k8s://openshift-pipelines/signing-secrets
# This creates:
# signing-secrets (K8s Secret) — holds cosign.key + cosign.password
# cosign.pub (local file) — distribute this for verification
# Extract public key for distribution/verification
oc get secret signing-secrets -n openshift-pipelines \
-o jsonpath='{.data.cosign\.pub}' | base64 -d > cosign.pub

For production, use a KMS (AWS KMS, HashiCorp Vault, GCP KMS) instead of a file-based key:

# AWS KMS example
cosign generate-key-pair --kms awskms:///arn:aws:kms:ca-central-1:123456:key/abc-def

Step 3 — Configure Chains via TektonConfig

Cluster administrators can use Tekton Chains to sign and verify images and provenances by: creating an encrypted x509 key pair and saving it as a Kubernetes secret; setting up authentication for the OCI registry to store images, image signatures, and signed image attestations; and configuring Tekton Chains to generate and sign provenance.

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  chain:
    # Format for TaskRun attestations
    artifacts.taskrun.format: "slsa/v1"        # SLSA v1.0 provenance
    artifacts.taskrun.storage: "oci"           # store in OCI registry
    # Format for PipelineRun attestations (recommended)
    artifacts.pipelinerun.format: "slsa/v1"
    artifacts.pipelinerun.storage: "oci"
    artifacts.pipelinerun.enable-deep-inspection: "true"  # inspect child TaskRuns
    # OCI image signature format
    artifacts.oci.format: "simplesigning"
    artifacts.oci.storage: "oci"
    # Transparency log (Sigstore Rekor)
    transparency.enabled: "true"
    transparency.url: "https://rekor.sigstore.dev"  # or your internal Rekor
    # Signing key reference
    signers.cosign.key: "k8s://openshift-pipelines/signing-secrets"

Apply via oc patch if you prefer:

oc patch tektonconfig config --type=merge -p='{
  "spec": {
    "chain": {
      "artifacts.pipelinerun.format": "slsa/v1",
      "artifacts.pipelinerun.storage": "oci",
      "artifacts.oci.format": "simplesigning",
      "artifacts.oci.storage": "oci",
      "transparency.enabled": "true"
    }
  }
}'

Step 4 — Type-hint your pipeline so Chains knows what to sign

Chains discovers what the output artifact is via type hints in Task results. Your build task must emit IMAGE_URL and IMAGE_DIGEST results:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: buildah-push
spec:
  params:
  - name: IMAGE
    type: string
  results:
  # Type hints — Chains reads these to find the artifact
  - name: IMAGE_URL
    description: The image URL
  - name: IMAGE_DIGEST
    description: The image digest (sha256)
  steps:
  - name: build-and-push
    image: registry.redhat.io/rhel8/buildah
    script: |
      buildah bud -t $(params.IMAGE) .
      buildah push $(params.IMAGE) \
        --digestfile /tmp/digest
      # Emit type hints for Chains
      echo -n "$(params.IMAGE)" | tee $(results.IMAGE_URL.path)
      cat /tmp/digest | tee $(results.IMAGE_DIGEST.path)
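Conceptually, Chains' type-hint discovery boils down to pairing IMAGE_URL with IMAGE_DIGEST from a run's results. A simplified Python sketch (the structure mirrors a run's status.results; this is not Chains' actual code):

```python
# Sketch: pair the IMAGE_URL / IMAGE_DIGEST type hints to identify the
# artifact to sign. Result entries beyond these two names are ignored.

def artifact_from_results(results):
    r = {x["name"]: x["value"].strip() for x in results}
    if "IMAGE_URL" in r and "IMAGE_DIGEST" in r:
        return f'{r["IMAGE_URL"]}@{r["IMAGE_DIGEST"]}'
    return None   # no type hints: nothing for Chains to sign on this run

results = [
    {"name": "IMAGE_URL", "value": "quay.io/my-org/my-app\n"},
    {"name": "IMAGE_DIGEST", "value": "sha256:abc123"},
]
print(artifact_from_results(results))
# quay.io/my-org/my-app@sha256:abc123
```

This is also why the `echo -n` matters: a trailing newline in the result would corrupt the image reference.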

For pipeline-level provenance, also emit results at the Pipeline level:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  results:
  - name: IMAGE_URL
    value: $(tasks.build.results.IMAGE_URL)
  - name: IMAGE_DIGEST
    value: $(tasks.build.results.IMAGE_DIGEST)
  tasks:
  - name: build
    taskRef:
      name: buildah-push

Step 5 — What happens automatically after a run

Once your PipelineRun completes, Chains fires automatically. You can watch for the signed annotation:

# Watch for Chains to finish signing
oc get pipelinerun my-run -o json | jq '.metadata.annotations'
# {
# "chains.tekton.dev/signed": "true",
# "chains.tekton.dev/transparency": "https://rekor.sigstore.dev/api/v1/log/entries?logIndex=12345678"
# }
# What gets stored in the OCI registry alongside your image:
# myimage:sha256-abc123.sig ← cosign image signature
# myimage:sha256-abc123.att ← SLSA provenance attestation

Step 6 — Verify images before deployment

# Set your image reference (always use digest, not tag)
IMAGE="quay.io/my-org/my-app@sha256:abc123..."
# 1. Verify the image signature
cosign verify \
  --key cosign.pub \
  "$IMAGE"
# 2. Verify the SLSA provenance attestation
cosign verify-attestation \
  --key cosign.pub \
  --type slsaprovenance \
  "$IMAGE" | jq '.payload | @base64d | fromjson'
# 3. Check the Rekor transparency log entry
rekor-cli search --sha sha256:abc123...

The SLSA provenance JSON tells you exactly what built the image — the git commit, the pipeline name, each task step, and all input dependencies.
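As a sketch of what a policy check reads out of that attestation: the provenance travels base64-encoded in the payload field. The example below builds a synthetic payload; the builder-id location follows the v0.2-style predicate layout, so adjust the path for SLSA v1 predicates (where it sits under runDetails):

```python
import base64
import json

# Decode an in-toto style attestation payload and read the builder id.
# The payload here is synthetic, for illustration only.

def builder_id(attestation):
    body = json.loads(base64.b64decode(attestation["payload"]))
    return body["predicate"]["builder"]["id"]

provenance = {"predicate": {"builder": {"id": "https://tekton.dev/chains/v2"}}}
att = {"payload": base64.b64encode(json.dumps(provenance).encode()).decode()}
print(builder_id(att))  # https://tekton.dev/chains/v2
```

This is the same field the Kyverno policy below matches on with `{{ builder.id }}`.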


Step 7 — Enforce signatures at admission (policy gate)

Verification at deployment time is where this pays off. Use OCP’s built-in image policy or Kyverno/OPA to block unsigned images:

# Kyverno policy — block any image without a valid Chains signature
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-signed-images
spec:
validationFailureAction: Enforce
rules:
- name: check-image-signature
match:
resources:
kinds: [Pod]
verifyImages:
- imageReferences:
- "quay.io/my-org/*"
attestors:
- entries:
- keys:
publicKeys: |-
-----BEGIN PUBLIC KEY-----
<your cosign.pub contents>
-----END PUBLIC KEY-----
attestations:
- predicateType: https://slsa.dev/provenance/v1
conditions:
- all:
- key: "{{ builder.id }}"
operator: Equals
value: "https://tekton.dev/chains/v2"

SLSA levels Tekton Chains achieves

SLSA level   Requirement                                           Chains status
Level 1      Provenance exists                                     Achieved — attestation generated automatically
Level 2      Signed provenance, hosted build                       Achieved — cosign signature + Rekor log entry
Level 3      Hardened build platform, non-falsifiable provenance   Achieved with OCP’s isolated pod builds
Level 4      Two-party review, hermetic builds                     Partial — requires additional hermetic build config

The key benefit: by implementing provenance in CI/CD pipelines, you protect your supply chain from tampering and unauthorized access, streamline compliance with evolving industry and government regulations, and enhance visibility and trust throughout your software lifecycle.

Understanding Tekton: A Comprehensive CI/CD Framework for Kubernetes

Tekton is a cloud-native CI/CD framework built for Kubernetes. Here’s a full breakdown:


What it is

Tekton is a Kubernetes-native open source framework for creating continuous integration and continuous delivery (CI/CD) systems. It installs and runs as an extension on a Kubernetes cluster and comprises a set of Kubernetes Custom Resources that define the building blocks you can create and reuse for your pipelines.

Tekton standardizes CI/CD tooling and processes across vendors, languages, and deployment environments. It lets you create CI/CD systems quickly, giving you scalable, serverless, cloud-native execution out of the box.


Core building blocks

Everything in Tekton is composed of these layers:

  • Step — the most basic entity, such as running unit tests or compiling a program. Tekton performs each step with a provided container image.
  • Task — a collection of steps in a specific order. Tekton runs a task in the form of a Kubernetes pod, where each step becomes a running container in the pod.
  • Pipeline — a collection of tasks in a specific order. Tekton collects all tasks, connects them in a directed acyclic graph (DAG), and executes the graph in sequence.
  • TaskRun — a specific execution of a task.
  • PipelineRun — a specific execution of a pipeline.

Example pipeline (clone → build → deploy)

# Step 1: Define a Task
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: build-and-push
spec:
  params:
    - name: IMAGE
      type: string
  steps:
    - name: build
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --destination=$(params.IMAGE)
        - --context=/workspace/source
---
# Step 2: Compose Tasks into a Pipeline
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  tasks:
    - name: clone
      taskRef:
        name: git-clone   # from Tekton Catalog
    - name: build
      runAfter: [clone]
      taskRef:
        name: build-and-push
    - name: deploy
      runAfter: [build]
      taskRef:
        name: kubectl-apply
---
# Step 3: Trigger a run
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: ci-pipeline-run-001
spec:
  pipelineRef:
    name: ci-pipeline

Major components

The Tekton ecosystem includes:

  • Pipelines — the core CRDs (Task, Pipeline, etc.)
  • Triggers — allows you to create pipelines based on event payloads, such as triggering a run every time a merge request is opened against a Git repo
  • CLI (tkn) — command-line interface to interact with Tekton from your terminal
  • Dashboard — a web-based graphical interface showing pipeline execution history
  • Catalog — a repository of high-quality, community-contributed reusable Tasks and Pipelines
  • Chains — manages supply chain security, including artifact signing and SLSA provenance

Key advantages

  • Truly Kubernetes-native — every pipeline run is a real Kubernetes pod; no external CI server needed
  • Reusable and composable — Tasks from the Tekton Hub can be dropped into any pipeline
  • Event-driven — Triggers fire pipelines automatically on Git webhooks, image pushes, etc.
  • Scalable — each step runs in its own container; pipelines scale with the cluster
  • Supply chain security — Tekton Chains can sign images and generate SLSA provenance automatically

Tekton on OpenShift

Red Hat ships Tekton as OpenShift Pipelines, the officially supported Tekton operator available directly from OperatorHub. It adds OCP-specific integrations such as the OpenShift internal image registry, S2I (Source-to-Image) tasks, and the Pipeline UI in the OpenShift console. This makes Tekton the natural CI tool to pair with Argo CD or Flux for a full GitOps workflow on OCP (Tekton handles CI/build, Argo CD or Flux handles CD/deploy).


Here’s the full picture of how Tekton (CI) and Argo CD / Flux (CD) work together on OCP: first how the two halves divide responsibility, then the complete practical reference for wiring it up.


How the two halves divide responsibility

When code changes are pushed to a Git repository, OpenShift Pipelines initiates a pipeline run. This pipeline might include tasks such as building container images, running unit tests, and generating artifacts. Once the pipeline successfully completes, Argo CD continuously monitors the Git repository for changes in application manifests. Once the new image version is committed, Argo CD synchronizes the application state to match the declared state in Git.

The key insight is that Tekton owns the source repo (code → image) and Argo CD / Flux owns the config repo (manifests → cluster). Tekton never deploys directly. It commits the new image tag to a separate GitOps manifests repo, then hands off.


Step 1 — Install both operators on OCP

# OpenShift Pipelines (Tekton) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift Pipelines" → Install
# OpenShift GitOps (Argo CD) — via OperatorHub
# Operators → OperatorHub → "Red Hat OpenShift GitOps" → Install
# Verify both are running
oc get pods -n openshift-pipelines
oc get pods -n openshift-gitops

Step 2 — The Tekton CI pipeline

On every push or pull-request to the source Git repository, the following steps execute within the Tekton pipeline: code is cloned and unit tests are run; the application is analyzed by SonarQube in parallel; a container image is built using S2I and pushed to the OpenShift internal registry; then Kubernetes manifests are updated in the Git repository with the image digest that was built within the pipeline.

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: ci-pipeline
  namespace: cicd
spec:
  workspaces:
    - name: source
    - name: dockerconfig
  params:
    - name: GIT_URL
      type: string
    - name: IMAGE
      type: string
    - name: GIT_MANIFEST_URL   # separate repo for k8s manifests
      type: string
  tasks:
    - name: clone
      taskRef:
        name: git-clone
        kind: ClusterTask
      workspaces:
        - name: output
          workspace: source
      params:
        - name: url
          value: $(params.GIT_URL)
    - name: unit-test
      runAfter: [clone]
      taskRef:
        name: maven
        kind: ClusterTask
      workspaces:
        - name: source
          workspace: source
    - name: build-image
      runAfter: [unit-test]
      taskRef:
        name: buildah
        kind: ClusterTask
      params:
        - name: IMAGE
          value: $(params.IMAGE)
      workspaces:
        - name: source
          workspace: source
        - name: dockerconfig
          workspace: dockerconfig
    - name: scan-image
      runAfter: [build-image]
      taskRef:
        name: trivy-scanner   # from Tekton Hub
      params:
        - name: IMAGE
          value: $(params.IMAGE)
    - name: update-manifest   # THE HANDOFF to GitOps
      runAfter: [scan-image]
      taskRef:
        name: git-cli
        kind: ClusterTask
      params:
        - name: GIT_USER_NAME
          value: tekton-bot
        - name: COMMANDS
          value: |
            git clone $(params.GIT_MANIFEST_URL) /workspace/manifest
            cd /workspace/manifest
            # Update image tag in kustomization
            kustomize edit set image myapp=$(params.IMAGE)
            git add -A
            git commit -m "ci: update image to $(params.IMAGE)"
            git push

Step 3 — Tekton Triggers (webhook → pipeline)

# EventListener — receives the GitHub/GitLab webhook
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: git-push-listener
  namespace: cicd
spec:
  serviceAccountName: pipeline
  triggers:
    - name: push-trigger
      bindings:
        - ref: github-push-binding
      template:
        ref: pipeline-trigger-template
---
# TriggerTemplate — what to create when the webhook fires
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: pipeline-trigger-template
  namespace: cicd
spec:
  params:
    - name: git-revision
    - name: git-repo-url
  resourcetemplates:
    - apiVersion: tekton.dev/v1
      kind: PipelineRun
      metadata:
        generateName: ci-run-
      spec:
        pipelineRef:
          name: ci-pipeline
        params:
          - name: GIT_URL
            value: $(tt.params.git-repo-url)
          - name: IMAGE
            value: image-registry.openshift-image-registry.svc:5000/myapp/app:$(tt.params.git-revision)
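
The EventListener references a github-push-binding that is not defined anywhere in this section. A minimal sketch of that TriggerBinding, assuming the standard GitHub push-event JSON payload (field paths would differ for GitLab), could look like:

```yaml
# TriggerBinding (sketch) — extracts template params from a GitHub push payload.
# The JSONPath expressions assume GitHub's push-event body shape.
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: github-push-binding
  namespace: cicd
spec:
  params:
    - name: git-revision
      value: $(body.head_commit.id)        # commit SHA of the push
    - name: git-repo-url
      value: $(body.repository.clone_url)  # HTTPS clone URL of the repo
```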

Expose the EventListener as an OCP Route so GitHub/GitLab can reach it:

oc expose svc el-git-push-listener -n cicd
# Then add the route URL as a webhook in GitHub/GitLab

Step 4 — Argo CD watches and deploys

Once the manifests repo is updated by Tekton, Argo CD detects the change. With automated.prune: true and selfHeal: true, it syncs immediately and deploys the new revision.

# Argo CD Application — dev environment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    targetRevision: main
    path: environments/dev   # Kustomize overlay for dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-dev
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual changes to the cluster
    syncOptions:
      - CreateNamespace=true
---
# Promotion to staging requires a PR merge (no auto-deploy to prod)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-staging
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/manifests.git
    path: environments/staging
    targetRevision: staging   # separate branch = manual promotion
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp-staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The GitOps repo layout Tekton writes to

manifests-repo/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── environments/
    ├── dev/
    │   └── kustomization.yaml     ← Tekton updates image tag here
    ├── staging/
    │   └── kustomization.yaml     ← promoted via PR merge
    └── prod/
        └── kustomization.yaml     ← promoted via PR merge + approval
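
The dev overlay that Tekton rewrites is an ordinary Kustomize overlay. A sketch of what the kustomize edit set image command in the pipeline actually mutates (the image names and tag are illustrative):

```yaml
# environments/dev/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: myapp        # image name as referenced in base/deployment.yaml
    newName: image-registry.openshift-image-registry.svc:5000/myapp/app
    newTag: abc1234    # the field "kustomize edit set image" updates on each CI run
```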

Promotion flow (dev → staging → prod)

Once the pipeline finishes successfully, the image reference in the manifests repo is updated and automatically deployed to the dev environment by Argo CD. To promote to staging, a pull request is generated targeting the staging branch. Merging that PR triggers Argo CD to sync the staging environment. Production follows the same pattern with an additional approval gate.

The promotion task in Tekton creates a PR automatically:

- name: promote-to-staging
  runAfter: [update-manifest]
  taskRef:
    name: github-open-pr   # from Tekton Hub
  params:
    - name: REPO_FULL_NAME
      value: my-org/manifests
    - name: HEAD
      value: feature/new-image-$(params.git-revision)
    - name: BASE
      value: staging
    - name: TITLE
      value: "Promote $(params.IMAGE) to staging"

Putting it all together — the complete flow

| Step | Actor | Action |
|---|---|---|
| 1 | Developer | git push to source repo |
| 2 | GitHub/GitLab | Sends webhook to Tekton EventListener |
| 3 | Tekton | Clones, tests, builds image with Buildah/S2I |
| 4 | Tekton | Scans image with Trivy / ACS |
| 5 | Tekton | Pushes image to OCP internal registry or Quay |
| 6 | Tekton | Updates image tag in manifests repo, opens PR to staging |
| 7 | Argo CD / Flux | Detects change in manifests repo, deploys to dev automatically |
| 8 | Team | Reviews and merges PR → staging auto-deploys |
| 9 | Team | Approves prod PR → production deploys |

This pattern — Tekton handles CI, Argo CD / Flux handles CD, and Git is the only bridge between them — is the standard GitOps delivery model on OCP.

Ultimate Guide to Velero for Kubernetes Backups

Velero is an open-source tool used to back up, restore, and migrate Kubernetes cluster resources and persistent volumes.

Think of it as a safety net for your Kubernetes environment.


What Velero actually does

Velero helps you:

  • Back up cluster data (like deployments, services, configs)
  • Restore your cluster if something breaks
  • Migrate workloads between clusters or cloud providers
  • Schedule automatic backups

How it works (simple view)

Velero connects your Kubernetes cluster to external storage (like cloud object storage — e.g., AWS S3, Azure Blob, etc.) and:

  1. Takes a snapshot of cluster resources
  2. Optionally backs up persistent volumes
  3. Stores everything outside the cluster
  4. Lets you restore it later when needed
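
Under the hood, the link to external storage is itself a Kubernetes custom resource, a BackupStorageLocation. A hedged sketch for an S3 bucket (the bucket name and region are placeholders) looks roughly like:

```yaml
# BackupStorageLocation (sketch) — tells Velero where backups live outside the cluster
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws                 # which object-storage plugin to use
  objectStorage:
    bucket: my-velero-backups   # placeholder bucket name
  config:
    region: us-east-1           # placeholder region
```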

What gets backed up

  • Kubernetes resources (Pods, Deployments, Services, etc.)
  • Persistent Volume data (via snapshots or file-level backups)
  • Namespaces and metadata

Common use cases

  • Disaster recovery (cluster crash, accidental deletion)
  • Migrating apps between clusters/clouds
  • Testing environments (restore production snapshot into staging)
  • Compliance backups

Velero vs basic backups

Without Velero, you’d have to manually export configs and handle storage snapshots yourself. Velero automates and organizes all of that.


Ecosystem

Velero is often used alongside:

  • Kubernetes-native tools
  • Cloud providers (AWS, Azure, GCP)
  • Storage plugins (for volume snapshots)

Here’s a simple, practical walkthrough to install Velero and run your first backup.

I’ll show the most common setup: Kubernetes + AWS S3 (others like Azure/GCP are similar).


1. Prerequisites

Make sure you have:

  • A running Kubernetes cluster
  • kubectl configured
  • An S3 bucket (or equivalent object storage)
  • AWS credentials (access key + secret)

2. Install Velero CLI

Download and install the Velero CLI:

# Mac (Homebrew)
brew install velero
# Or via binary (release assets are versioned; substitute the latest release)
VERSION=v1.13.2
curl -L https://github.com/vmware-tanzu/velero/releases/download/${VERSION}/velero-${VERSION}-darwin-amd64.tar.gz | tar -xz
sudo mv velero-${VERSION}-darwin-amd64/velero /usr/local/bin/

Verify:

velero version

3. Create credentials file

Create a file called credentials-velero:

[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY

4. Install Velero in your cluster

Run this command (replace bucket + region):

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket YOUR_BUCKET_NAME \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

This will:

  • Deploy Velero into your cluster
  • Connect it to your S3 bucket
  • Set up volume snapshot support

5. Verify installation

kubectl get pods -n velero

You should see a running Velero pod.


6. Create your first backup

Backup entire cluster:

velero backup create my-first-backup

Backup a specific namespace:

velero backup create my-backup \
  --include-namespaces my-namespace

7. Check backup status

velero backup get

Describe it:

velero backup describe my-first-backup

8. Restore from backup

velero restore create --from-backup my-first-backup

9. (Optional) Schedule automatic backups

velero schedule create daily-backup \
  --schedule="0 2 * * *"

This runs every day at 2 AM.
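
The CLI command above is shorthand for creating a Schedule custom resource. An equivalent sketch (the namespace scope and TTL are example values, not defaults):

```yaml
# Schedule (sketch) — what "velero schedule create" produces in the cluster
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"      # cron: every day at 02:00
  template:                  # same fields as a Backup spec
    includedNamespaces:
      - my-namespace         # example scope; omit to back up everything
    ttl: 168h0m0s            # keep each backup for 7 days
```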


Tips that actually matter

  • Start with namespace backups, not full cluster
  • Use labels to target specific apps
  • Test restore early (don’t wait for disaster)
  • Monitor storage costs (snapshots + S3)

Common mistakes

  • Wrong IAM permissions → backups silently fail
  • Forgetting persistent volumes → incomplete recovery
  • Not testing restores → risky in real incidents

Understanding the Shared Responsibility Model in ROSA/ARO

When moving from on-premises to a managed service like ROSA (Red Hat OpenShift Service on AWS) or ARO (Azure Red Hat OpenShift), the interview focus shifts. The technical “heavy lifting” of managing master nodes and etcd is now handled by Red Hat and AWS/Microsoft (the SRE team).

Your role as an administrator moves from “Keeping the lights on” to “Governance, Cost Optimization, and Integration.”


1. The Shared Responsibility Model

This is the #1 question for managed services.

Q: Who is responsible for what in a ROSA/ARO environment?

  • The Provider (Red Hat/Cloud Provider): Manages the Control Plane (Masters), etcd health, patching of the underlying OS, and the core OpenShift Operators.
  • The Customer (You): Manages Worker nodes (scaling), Application lifecycle, RBAC, Network Policies, and Quotas.

Interview Tip: Mention that you no longer have cluster-admin in the traditional sense on ARO; you have a customer-admin role. You cannot SSH into master nodes or modify the etcd configuration directly.


2. Day 1: Provisioning & Connectivity

Q1: How does networking differ in ROSA/ARO compared to on-prem?

Answer: In a managed service, OpenShift is integrated into the Cloud’s Virtual Private Cloud (VPC/VNet).

  • Private vs. Public Clusters: You must decide if the API and Ingress are “Public” (accessible over the internet) or “Private” (only accessible via VPN/DirectConnect/ExpressRoute).
  • VPC Peering/Transit Gateway: You are responsible for connecting the OpenShift VPC to the rest of your cloud infrastructure (e.g., to reach a managed RDS database or Azure SQL).

Q2: What is the “Assisted Installer” vs. “Cloud CLI”?

Answer: For ROSA, you use the rosa CLI. For ARO, you use the az aro command. These tools abstract the CloudFormation or ARM templates required to spin up the infrastructure.


3. Day 2: Managed Operations

Q3: How do you handle cluster upgrades in ROSA/ARO?

Answer: You don’t just “hit update” and pray.

  • In ROSA, you can schedule upgrade windows via the OpenShift Cluster Manager (OCM).
  • The Red Hat SRE team monitors the upgrade. If it fails, they are the ones paged, not you. However, you must ensure your applications have correct Pod Disruption Budgets (PDBs) so the rolling update doesn’t take down your service.
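
A PodDisruptionBudget for that point might look like the following sketch (the namespace and label selector are hypothetical); it keeps at least two replicas running while the upgrade drains nodes:

```yaml
# PodDisruptionBudget (sketch) — protects the app during node drains/upgrades
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: myapp-prod      # hypothetical namespace
spec:
  minAvailable: 2            # never voluntarily evict below 2 running pods
  selector:
    matchLabels:
      app: myapp             # hypothetical app label
```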

Q4: How do you scale the cluster in the cloud?

Answer: You use MachineAutoscalers.

  • Unlike on-prem, where you are limited by physical hardware, in ROSA/ARO you define a MachineAutoscaler that monitors the cluster’s resource requests. If a pod can’t be scheduled due to lack of CPU, the autoscaler automatically provisions a new EC2/Azure VM and joins it to the cluster.
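
A MachineAutoscaler sketch (the MachineSet name is cluster-specific and illustrative):

```yaml
# MachineAutoscaler (sketch) — scales one MachineSet between min and max replicas
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-abc12-worker-us-east-1a   # illustrative MachineSet name
```

Note that a cluster-wide ClusterAutoscaler resource must also exist for MachineAutoscalers to take effect.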

4. Cost & Security

Q5: How do you control costs in a managed OpenShift environment?

Answer: Since you pay for every worker node, I implement:

  1. Cluster Autoscaling: Scaling down to minimum nodes at night.
  2. Resource Quotas: Preventing developers from requesting 16GB of RAM for a “Hello World” app.
  3. Spot Instances: Using AWS Spot or Azure Priority instances for non-production workloads to save up to 70% on compute costs.
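
Point 2 above is typically enforced per project with a ResourceQuota; a sketch with example limits:

```yaml
# ResourceQuota (sketch) — caps what one project can request; limits are examples
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-team-quota
  namespace: dev-team        # hypothetical project
spec:
  hard:
    requests.cpu: "10"       # total CPU requests across the namespace
    requests.memory: 20Gi
    limits.memory: 32Gi
    pods: "50"
```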

Q6: How do you handle Authentication?

Answer: You typically don’t use local users. You integrate OpenShift with Azure AD (Entra ID) or AWS IAM/OIDC.

  • Question: “How do pods access cloud resources (like S3 or Azure Vault)?”
  • Answer: STS (Security Token Service) or Managed Identities. This allows pods to assume a cloud role without needing to store static “Access Keys” inside a Secret.
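
On ROSA with STS, for example, pod identity is typically wired through a ServiceAccount annotation that the pod identity webhook picks up; a hedged sketch (the role ARN and namespace are placeholders):

```yaml
# ServiceAccount (sketch) — pods using it receive temporary STS credentials
# for the annotated IAM role; no static access keys in a Secret.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: myapp-prod      # hypothetical namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/myapp-s3-reader
```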

5. Summary Comparison: On-Prem vs. Managed

| Feature | On-Prem (Bare Metal/VMware) | Managed (ROSA/ARO) |
|---|---|---|
| Control Plane | You manage (3 VMs/Servers) | Managed by SRE (Hidden/Bundled) |
| Updates | Manual / High Risk | Scheduled / Automated |
| Load Balancer | MetalLB / F5 / HAProxy | AWS NLB/ALB or Azure LB |
| Storage | ODF / vSphere CSI | EBS/EFS or Azure Disk/Files |
| Failure Response | You get paged at 3 AM | Red Hat/Cloud SRE handles infra |

The “Pro” Managed Question:

“If the cluster is managed by Red Hat, why do they still need an Administrator like you?”

Winning Answer: “Because while Red Hat manages the platform, I manage the consumption. I ensure the networking between our VPCs is secure, I manage the RBAC and onboarding for our developers, I optimize costs so we aren’t over-provisioning cloud resources, and I implement the CI/CD patterns that allow our apps to run reliably on that platform.”

Mastering Day 1 and Day 2 of Cluster Management

This is a classic way for interviewers to see if you have actually managed a cluster in production. Day 1 is about getting the cluster alive; Day 2 is about keeping it from dying.

In a senior interview, they expect you to spend most of your time talking about Day 2, as that represents 99% of a cluster’s lifespan.


Day 1: Installation & Provisioning

Focus: Automation, Infrastructure, and “Getting to Green.”

| Task | On-Premise Reality Check |
|---|---|
| DNS Setup | Creating the critical records: api, api-int, and *.apps. Without these, the bootstrap will fail. |
| Load Balancing | Setting up external HAProxy or F5 (for UPI) or ensuring VIPs are reserved (for IPI). |
| Ignition Configs | Using the installer to generate .ign files and serving them via HTTP/PXE to the bare metal/VM nodes. |
| Certificate Approval | Manually running oc get csr and approving them to allow nodes to join the cluster. |
| Registry Mirroring | (If Air-gapped) Setting up the local Quay/Nexus registry and the ImageContentSourcePolicy. |

Day 2: Maintenance & Operations

Focus: Stability, Compliance, and Scaling.

1. Lifecycle Management
  • Cluster Upgrades: Navigating the “Update Graph.” Choosing between the stable and fast channels.
  • Certificate Rotation: Monitoring the expiration of the internal API and Ingress CA (though OpenShift now automates most of this, an admin must know how to fix a “stuck” rotation).
  • Node Scaling: Adding new Bare Metal workers via the Assisted Installer or expanding VMware Resource Pools.
2. Performance & Health
  • Etcd Maintenance: Performing periodic defragmentation and manual snapshots before any major change.
  • Logging Stack Management: Tuning the Elasticsearch/Fluentd (or Loki) stack. On-premise, this often means managing “PVC full” issues when logs grow too fast.
  • Pruning: Running oc adm prune to clean up old builds, images, and deployments that are cluttering the etcd database.
3. Security & Governance
  • RBAC Auditing: Ensuring developers aren’t using cluster-admin for daily tasks.
  • SCC Policy: Managing exceptions for specialized workloads (e.g., giving a monitoring agent privileged access).
  • Quota Management: Defining ResourceQuotas per Project to prevent a single “noisy neighbor” from consuming all physical RAM on your ESXi hosts.

The “Senior Admin” Bonus: Disaster Recovery (DR)

An interviewer will almost certainly ask: “What is your DR strategy for on-prem?”

A high-quality answer includes:

  1. Etcd Backups: Stored outside the cluster (e.g., on an external S3 bucket or NAS).
  2. Velero: Using the Velero operator to back up application metadata and Persistent Volumes (using CSI snapshots).
  3. Multi-Cluster: Having a second “Passive” cluster in a different data center and using Red Hat Advanced Cluster Management (RHACM) to shift traffic via DNS if the primary DC goes dark.

Final Interview Tip: The “Why”

When answering, don’t just say what you did; say why it matters for the business:

  • Wrong: “I configured the MTU to 1400.”
  • Right: “I lowered the MTU to 1400 to prevent packet fragmentation over our Geneve tunnels, which reduced our application latency by 30%.”