Top OpenShift Interview Questions for Cluster Administrators

For a Cluster Administrator role, the interview questions will pivot away from application development (like S2I or basic deployments) and focus heavily on infrastructure, cluster stability, day-2 operations, security, and underlying architecture.

Here are the high-yield, advanced OpenShift interview questions tailored specifically for a Cluster Administrator.

1. Installation, Infrastructure & Architecture

Q1: What is the difference between IPI (Installer-Provisioned Infrastructure) and UPI (User-Provisioned Infrastructure)? When would you choose one over the other?
  • IPI (Full Automation): The OpenShift installer controls everything. It talks directly to the cloud provider API (like AWS, Azure, or vSphere), provisions the networking, load balancers, storage, virtual machines, and installs the cluster.
    • When to use: When you want a quick, standard, hands-off deployment and have full administrative rights to the underlying cloud/infra provider.
  • UPI (Customized Control): The administrator manually provisions the infrastructure (compute, networking, storage, firewalls) ahead of time. The OpenShift installer is only used to generate the Ignition files to boot the nodes.
    • When to use: Common in strict enterprise environments, for example when you have complex pre-existing networking (DMZs, specific subnets), custom firewalls, disconnected (air-gapped) environments, or strict security governance under which an installer cannot be given API access to create infrastructure.
Q2: What is Red Hat Enterprise Linux CoreOS (RHCOS), and why does OpenShift require it for the Control Plane?

RHCOS is a minimal, immutable, container-optimized operating system.

  • Immutability: The underlying host OS is read-only (except for /etc and /var). This prevents “configuration drift” where individual administrators make untracked manual changes to specific nodes.
  • Managed by the Cluster: RHCOS is managed directly by the cluster itself via the Machine Config Operator (MCO). Upgrading OpenShift automatically upgrades the OS on the nodes. You treat nodes as cattle, not pets.
  • Control Plane Requirement: OpenShift strictly requires RHCOS for master nodes to ensure total predictability, security, and atomic updates of the control plane.

2. Day-2 Operations & Upgrades

Q3: You are planning a cluster upgrade from version 4.x to 4.y. Walk me through your pre-requisites and execution steps.

A seasoned admin doesn’t just click “Upgrade”. The response should show a structured process:

  1. Check the Upgrade Graph: Use the OpenShift Update Graph tool or oc adm upgrade to verify a valid, supported path exists between your current version and the target version.
  2. Evaluate Operator Compatibility: Check the Operator Lifecycle Manager (OLM) to ensure all installed 3rd-party operators (e.g., databases, service meshes) are compatible with the target OpenShift version.
  3. Verify Cluster Health: Ensure all ClusterOperators are Available=True, Progressing=False, and Degraded=False. Never upgrade a degraded cluster.
  4. Back Up the etcd Database: Take a manual etcd snapshot on a control plane node before initiating the upgrade (oc debug node/... -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup).
  5. Monitor Worker Node Capacity: Ensure there is enough spare capacity in the cluster. Because nodes are drained and rebooted sequentially during an upgrade, the remaining nodes must be able to handle the shifted workload.
  6. Trigger and Monitor: Execute oc adm upgrade channel <channel>, then oc adm upgrade --to=<version>. Monitor via oc get clusterversion.
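
The health check in step 3 can be sketched in Python. This is a minimal illustration, assuming the input is the parsed JSON from oc get clusteroperators -o json; the function name and the sample data are hypothetical, not part of any OpenShift tooling.

```python
import json

# Conditions a ClusterOperator must report before an upgrade is attempted.
EXPECTED = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def unhealthy_operators(cluster_operators: dict) -> list[str]:
    """Return names of operators whose conditions differ from EXPECTED."""
    bad = []
    for item in cluster_operators.get("items", []):
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        if any(conditions.get(kind) != status for kind, status in EXPECTED.items()):
            bad.append(item["metadata"]["name"])
    return bad


# Hypothetical sample shaped like `oc get clusteroperators -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "etcd"},
   "status": {"conditions": [
     {"type": "Available", "status": "True"},
     {"type": "Progressing", "status": "False"},
     {"type": "Degraded", "status": "False"}]}},
  {"metadata": {"name": "ingress"},
   "status": {"conditions": [
     {"type": "Available", "status": "True"},
     {"type": "Progressing", "status": "False"},
     {"type": "Degraded", "status": "True"}]}}
]}
""")

print(unhealthy_operators(sample))
```

An empty list means every operator is in the expected state. Any name in the list is a reason to stop and investigate before upgrading.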
Q4: How do you handle a scenario where a Worker Node becomes NotReady due to disk pressure?
  1. Identify the Culprit: Use oc describe node <node-name> and look at the Conditions section to confirm that DiskPressure is True. The matching taint on the node is node.kubernetes.io/disk-pressure.
  2. Determine the Cause: Access node metrics or use oc debug node/<node-name> to check if the issue is in /var/lib/containers (stuck/bloated container logs, uncleaned images) or a specific application writing local data.
  3. Short-Term Remediation:
    • OpenShift’s Kubelet should automatically trigger garbage collection for unused images. If it fails, manual clearing of stopped containers or safe log rotation might be necessary.
    • Evict pods if necessary, though the DiskPressure taint should stop new pods from scheduling there.
  4. Long-Term Root Cause Analysis:
    • Enforce stricter log limits for applications that log heavily to the console (stdout/stderr).
    • Tune the evictionHard thresholds in the KubeletConfig so that the kubelet starts reclaiming disk space earlier.
    • Consider scaling up node disk sizes or adding more worker nodes.
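
To make the threshold idea concrete, here is a simplified Python model of how an eviction threshold such as "10%" or "1Gi" compares against free disk space. It is only a sketch of the arithmetic, not the kubelet's implementation; the function names and the sample numbers are assumptions chosen for illustration.

```python
BINARY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    """Convert a threshold such as '10%' or '1Gi' into a byte count."""
    if threshold.endswith("%"):
        return int(capacity_bytes * float(threshold[:-1]) / 100)
    for suffix, factor in BINARY_UNITS.items():
        if threshold.endswith(suffix):
            return int(float(threshold[: -len(suffix)]) * factor)
    return int(threshold)  # plain byte count


def under_disk_pressure(available_bytes: int, capacity_bytes: int, threshold: str) -> bool:
    """True when free space has dropped below the configured threshold."""
    return available_bytes < threshold_bytes(threshold, capacity_bytes)


GIB = 1024**3
# Hypothetical node: 100 GiB filesystem with 12 GiB free.
print(under_disk_pressure(12 * GIB, 100 * GIB, "10%"))  # threshold is 10 GiB
print(under_disk_pressure(12 * GIB, 100 * GIB, "15%"))  # threshold is 15 GiB
```

Raising the threshold from 10% to 15% makes the same node cross the line sooner, which is the effect the tuning in the list above aims for.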

3. Advanced Networking, Security & Storage

Q5: What is the default CNI for modern OpenShift 4 clusters, and how does it differ from its predecessor?
  • Modern OpenShift clusters use OVN-Kubernetes (Open Virtual Network) as the default container network interface (CNI). It replaced OpenShift SDN.
  • Key Advantages of OVN-K:
    • It supports dual-stack IPv4/IPv6 networking out of the box.
    • It implements Kubernetes NetworkPolicies natively in OVN and adds capabilities such as egress firewalls, IPsec encryption of pod traffic, and advanced routing.
    • Better integration with hybrid cloud environments and massive scalability compared to the older OVS-based OpenShift SDN.
Q6: How would you secure a multi-tenant OpenShift cluster where Team A and Team B must share the same hardware but cannot communicate?
  1. Network Isolation: Implement NetworkPolicies in each namespace. By default, namespaces can talk to each other. I would apply a default-deny policy for cross-namespace ingress traffic and explicitly whitelist only what is necessary.
  2. RBAC (Role-Based Access Control): Bind Team A’s users to a specific ClusterRole (like admin or edit) scoped strictly to Team A’s namespaces using RoleBindings (not ClusterRoleBindings).
  3. Resource Quotas and LimitRanges: Prevent one team from starving the other of resources. Apply ResourceQuotas to limit maximum CPU/Memory per namespace and LimitRanges to enforce default requests/limits on pods.
  4. Security Context Constraints (SCC): Ensure both teams are locked into the default restricted-v2 SCC so neither team can escalate privileges to the underlying worker node.
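
As an illustration of the network isolation step, the sketch below builds a NetworkPolicy manifest in Python and prints it as JSON, which oc apply -f also accepts. The policy name allow-same-namespace and the namespace team-a are hypothetical. Selecting every pod and allowing ingress only from pods in the same namespace has the effect of denying cross-namespace ingress.

```python
import json


def allow_same_namespace_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that only admits ingress from the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector: applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # A podSelector without a namespaceSelector matches only pods
            # in the policy's own namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


policy = allow_same_namespace_policy("team-a")
print(json.dumps(policy, indent=2))
```

With only this policy in place, traffic from the ingress routers or the monitoring stack is blocked as well, so those sources would need their own explicit allow rules.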

4. Scenario-Based Troubleshooting (The “Hero” Questions)

Q7: The etcd leader crashes, or the etcd cluster loses quorum. How do you recover?

Losing etcd quorum is a critical P1 outage.

  1. Assess the Damage: Check the status of the etcd pods using oc get pods -n openshift-etcd. If quorum is lost, the control plane APIs will stop responding (oc commands will time out).
  2. Locate the Backup: Hopefully, automated or recent manual etcd backups exist on a secure external server (as performed in Q3).
  3. Perform an etcd Recovery:
    • You must access one of the Master nodes directly via SSH (using the core user and recovery keys).
    • Stop the etcd static pods.
    • Run the recovery script provided by OpenShift (/usr/local/bin/cluster-restore.sh on current 4.x releases; the earliest 4.x releases shipped etcd-snapshot-restore.sh instead), pointing it at the backup snapshot.
    • This script rewrites the etcd data directory and forces a single-node etcd cluster, allowing the other masters to slowly re-sync once the API comes back online.
Q8: A critical security vulnerability (CVE) is announced for the Linux Kernel. How do you patch all 100+ worker nodes in your cluster without disrupting application traffic?

You leverage the power of the Machine Config Operator (MCO):

  1. Red Hat releases the kernel fix in an updated RHCOS image, which is delivered as part of an OpenShift patch (z-stream) release.
  2. When you apply that update, the worker pool receives a new rendered MachineConfig that references the patched OS image, and the MCO takes over.
  3. The MCO acts as a rolling controller: It selects a worker node, marks it as unschedulable (cordon), safely evicts all running pods (drain), applies the kernel patch/reboot, waits for it to come back to Ready status, and moves to the next node.
  4. Because this happens sequentially, applications with proper replica counts and Pod Disruption Budgets (PDBs) should see no downtime.
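
The role of the PDB during a drain comes down to simple arithmetic, sketched here in Python. The function is a hypothetical helper that mirrors how a budget with minAvailable limits voluntary evictions; it is not taken from any OpenShift component.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods a drain may evict right now without violating the PDB."""
    return max(0, healthy_pods - min_available)


# Three healthy replicas with minAvailable=2: one pod may be evicted at a time.
print(allowed_disruptions(healthy_pods=3, min_available=2))

# A single replica with minAvailable=1: no eviction is allowed, so the drain waits.
print(allowed_disruptions(healthy_pods=1, min_available=1))
```

The second case is the classic reason a node drain stalls: a single-replica workload whose budget never allows an eviction.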

etcd recovery and MachineConfig concepts

Let’s break these two critical Cluster Administration concepts down to the architectural and operational level. Mastering etcd management and the Machine Config Operator (MCO) is what separates a junior administrator from a senior cluster engineer.

1. Deep Dive: etcd Recovery & Architecture

etcd is a distributed, consistent key-value store that acts as OpenShift’s single source of truth. Every single object—Pods, Routes, Secrets, CRDs, and ConfigMaps—is stored here.

The Quorum Mechanics

etcd uses the Raft consensus algorithm to ensure data consistency across the master nodes. To function, it requires a strict majority (quorum) of healthy members. The formula for quorum is:

$$\text{Quorum} = \left\lfloor \frac{N}{2} \right\rfloor + 1$$

Where N is the total number of members in the cluster and the division is rounded down.

  • For a standard 3-master cluster, quorum is ⌊3/2⌋ + 1 = 2. You can lose 1 master without losing quorum.
  • If 2 masters fail simultaneously, you have 1 node left. 1 < 2, so quorum is lost. The cluster API immediately locks up and stops responding.
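
The majority rule is easy to check with a few lines of Python. This is a minimal sketch using integer (floor) division; the function names are chosen here for illustration.

```python
def quorum(members: int) -> int:
    """Smallest number of healthy members that still forms a strict majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail before quorum is lost."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Note that 4 members tolerate the same single failure as 3, which is why etcd clusters are sized with an odd number of members.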
Scenario: Step-by-Step Loss of Quorum Recovery

When quorum is broken, you cannot use oc commands because the API server depends on etcd. You must bypass the API entirely and interact directly with the master host operating system via SSH.

Step 1: Access a Surviving Master Node

SSH into one of the surviving master nodes using your cluster’s private SSH key:

Bash

ssh core@master-0.example.com

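Step 2: Verify the Backup (Pre-requisite Check)

Before recovering, ensure you actually have a valid snapshot. Backups are written to the directory passed to the backup script (/home/core/assets/backup in this walkthrough) and should also be copied to an external location. A valid backup consists of a snapshot_<timestamp>.db file accompanied by a static_kuberesources_<timestamp>.tar.gz archive of the static pod resources.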

Step 3: Initiate the Single-Node Recovery

OpenShift provides a built-in recovery script that is available on the host file system of the control plane nodes. Run the recovery script on the master node as root:

sudo -i
# cluster-restore.sh takes the backup directory (snapshot .db plus static resources archive)
/usr/local/bin/cluster-restore.sh /home/core/assets/backup

What this script does under the hood:

  1. Stops the static pods: It moves the manifests for the etcd, kube-apiserver, and kube-controller-manager out of /etc/kubernetes/manifests/ so the Kubelet stops trying to run them.
  2. Wipes existing data: It clears the corrupted/out-of-sync etcd data directory (/var/lib/etcd/).
  3. Restores the snapshot: It unpacks your .db file into /var/lib/etcd/.
  4. Rewrites the cluster membership: It modifies the etcd configuration to trick the node into believing it is a single-node cluster ($N=1$, meaning quorum = 1). It erases the metadata of the other dead masters.
  5. Restarts the static pods: It moves the manifests back, forcing the API server to spin up using this new single-member database.

Step 4: Re-syncing the Other Masters

Once the API server is back up on master-0, use your local terminal again. The remaining master nodes (master-1 and master-2) will still be out of sync.

You don’t need to manually restore them. Instead, you force the Cluster Etcd Operator to redeploy etcd on the control plane nodes by setting a new forceRedeploymentReason:

Bash

# Force the etcd operator to roll out a fresh etcd revision on all control plane nodes
oc patch etcd cluster --type=merge -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}'

The operator will detect that master-1 and master-2 are missing from the current active state, pave their etcd directories automatically, and catch them up via Raft replication from master-0.

2. Deep Dive: The Machine Config Operator (MCO)

In vanilla Kubernetes, if you want to change a kernel parameter (sysctl), add an SSH key, or configure an enterprise container registry mirror on your worker nodes, you have to use external configuration management tools like Ansible, Puppet, or SaltStack.

OpenShift instead uses the Machine Config Operator (MCO) to manage the operating system (RHCOS) natively through Kubernetes Custom Resources, so external tools are not needed for node-level configuration.

The MCO Component Hierarchy

To understand how a configuration change reaches a node, you must understand these 4 core pieces (two custom resources and two components):

  1. MachineConfig (MC): A YAML file that outlines the exact state you want the OS to be in. It can contain files to write, systemd units to enable, or kernel settings to apply.
  2. MachineConfigPool (MCP): A grouping of nodes that should receive the same configurations (typically mapped to roles, like master or worker).
  3. Machine Config Controller (MCC): Monitors the MachineConfigs and compiles them into a single rendered “target” configuration for each pool.
  4. Machine Config Daemon (MCD): A pod that runs as a DaemonSet on every single node in the cluster. It runs with root privileges (chroot /host) and is responsible for actually writing the changes to its local disk.
Scenario: Writing a Custom Security File to 100+ Nodes

Let’s look at how the MCO executes a change. Suppose your security team requires a corporate banner (/etc/issue) to be present on every worker node.

Step 1: Create the MachineConfig Object

The file contents inside a MachineConfig are embedded in an Ignition config as data URLs (URL-encoded or base64-encoded text).

YAML

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker # Ties this config to the worker pool
  name: 99-worker-corporate-banner
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:,WARNING%3A%20Authorized%20Access%20Only%21%0A
          mode: 420 # Decimal value of octal 0644 file permissions
          path: /etc/issue
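
The source and mode values in the manifest above can be reproduced with Python's standard library. This is a small sketch; the helper name data_url is hypothetical, and the banner text is the one used in this scenario.

```python
from urllib.parse import quote, unquote


def data_url(text: str) -> str:
    """Percent-encode text into a plain data URL suitable for an Ignition file source."""
    return "data:," + quote(text, safe="")


banner = "WARNING: Authorized Access Only!\n"
source = data_url(banner)
mode = int("644", 8)  # Ignition expects the decimal form of the octal permission bits

print(source)
print(mode)

# Decoding the payload gives back the original banner text.
decoded = unquote(source[len("data:,"):])
```

Passing safe="" makes quote encode every reserved character, including the colon and the exclamation mark, so the result matches the string used in the MachineConfig.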

Step 2: The MCO Reconciliation Flow

When you apply this file (oc apply -f banner.yaml), the following chain reaction occurs:

  1. Compilation: The MCO notices a new MachineConfig with the worker label. It takes this file and merges it with all existing worker configs into a new, single rendered MachineConfig whose name ends in a hash (rendered-worker-<hash>).
  2. Pool Update: The MachineConfigPool for workers changes its status to UPDATING=True.
  3. The Rollout Lifecycle (Node by Node):
    • The Machine Config Daemon (MCD) running on worker-0 notices the pool’s target hash no longer matches its local current hash.
    • The MCD requests the cluster to drain the node. The cluster cordons the node and gracefully evicts all workloads to other worker nodes.
    • Once empty, the MCD steps out of the container boundary using chroot, accesses the host file system, and writes the string WARNING: Authorized Access Only! into /etc/issue.
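    • By default, the MCD then reboots the host. Most MachineConfig changes, including kernel parameter changes and ordinary file writes like this one, trigger a reboot; only a short list of changes (for example, SSH key updates) is applied without one.
    • The MCD verifies the file exists and is correct, then the node is uncordoned so it can accept workloads again.
  4. Next Node: The MCO moves to worker-1, repeating the exact same process. By default it updates one node at a time; the pool’s maxUnavailable setting controls this.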
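
The node-by-node rollout can be modelled as processing the pool in batches. The Python sketch below is a simplified simulation, not MCO code; it assumes a batch size equal to the pool's maxUnavailable value (1 by default), and the node names are hypothetical.

```python
def rollout_batches(nodes: list[str], max_unavailable: int = 1) -> list[list[str]]:
    """Split a pool into the groups of nodes that are updated together."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    return [
        nodes[i : i + max_unavailable]
        for i in range(0, len(nodes), max_unavailable)
    ]


workers = [f"worker-{i}" for i in range(100)]

sequential = rollout_batches(workers)  # one node at a time
faster = rollout_batches(workers, max_unavailable=10)

print(len(sequential), sequential[0])
print(len(faster), faster[0][:3])
```

A larger batch size shortens the rollout but takes more capacity offline at once, so it only makes sense when the remaining nodes can absorb the drained workloads.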
How to Monitor This as an Administrator

During a massive rollout, you monitor the orchestration via the pools:

oc get mcp

Output example during update:

Plaintext

NAME     CONFIG                      UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT
master   rendered-master-1a2b3c...   True      False      False      3              3
worker   rendered-worker-4f5e6d...   False     True       False      100            45

If a node fails to apply a configuration (e.g., a systemd service fails to start), the pool will mark DEGRADED=True and immediately halt the entire rollout to prevent breaking the remaining 55 nodes.
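
For scripted monitoring, the pool table can be parsed to see how far a rollout has progressed. The Python sketch below works on the example output shown above, pasted in as a string. It assumes whitespace-separated columns with no spaces inside a field, and the function names are hypothetical.

```python
def parse_mcp_table(text: str) -> list[dict]:
    """Turn `oc get mcp` style text output into one dict per pool."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def machines_remaining(pool: dict) -> int:
    """Machines in the pool that are not yet reported as ready."""
    return int(pool["MACHINECOUNT"]) - int(pool["READYMACHINECOUNT"])


sample = """\
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT
master rendered-master-1a2b3c... True False False 3 3
worker rendered-worker-4f5e6d... False True False 100 45
"""

pools = parse_mcp_table(sample)
remaining = {pool["NAME"]: machines_remaining(pool) for pool in pools}
degraded = [pool["NAME"] for pool in pools if pool["DEGRADED"] == "True"]

print(remaining)
print(degraded)
```

In a real script, requesting JSON with oc get mcp -o json would be more robust than parsing the table, since the text layout can vary between versions.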

If you were to encounter a Degraded MachineConfigPool in production, your immediate next step would be to check the logs of the specific daemon pod on the failing node.
