For a Cluster Administrator role, the interview questions will pivot away from application development (like S2I or basic deployments) and focus heavily on infrastructure, cluster stability, day-2 operations, security, and underlying architecture.
Here are the high-yield, advanced OpenShift interview questions tailored specifically for a Cluster Administrator.
1. Installation, Infrastructure & Architecture
Q1: What is the difference between IPI (Installer-Provisioned Infrastructure) and UPI (User-Provisioned Infrastructure)? When would you choose one over the other?
- IPI (Full Automation): The OpenShift installer controls everything. It talks directly to the cloud provider API (like AWS, Azure, or vSphere), provisions the networking, load balancers, storage, virtual machines, and installs the cluster.
- When to use: When you want a quick, standard, hands-off deployment and have full administrative rights to the underlying cloud/infra provider.
- UPI (Customized Control): The administrator manually provisions the infrastructure (compute, networking, storage, firewalls) ahead of time. The OpenShift installer is only used to generate the ignition files to boot the nodes.
- When to use: Strict enterprise environments with complex pre-existing networking (DMZs, specific subnets), custom firewalls, disconnected (air-gapped) networks, or security governance that forbids giving an installer API access to create infrastructure.
Q2: What is Red Hat Enterprise Linux CoreOS (RHCOS), and why does OpenShift require it for the Control Plane?
RHCOS is a minimal, immutable, container-optimized operating system.
- Immutability: The underlying host OS is read-only (except for /etc and /var). This prevents “configuration drift”, where individual administrators make untracked manual changes to specific nodes.
- Managed by the Cluster: RHCOS is managed directly by the cluster itself via the Machine Config Operator (MCO). Upgrading OpenShift automatically upgrades the OS on the nodes. You treat nodes as cattle, not pets.
- Control Plane Requirement: OpenShift requires RHCOS on control plane (master) nodes so that the control plane gets predictable, secure, atomic updates.
2. Day-2 Operations & Upgrades
Q3: You are planning a cluster upgrade from version 4.x to 4.y. Walk me through your pre-requisites and execution steps.
A seasoned admin doesn’t just click “Upgrade”. The response should show a structured process:
- Check the Upgrade Graph: Use the OpenShift Update Graph tool or oc adm upgrade to verify a valid, supported path exists between your current version and the target version.
- Evaluate Operator Compatibility: Check the Operator Lifecycle Manager (OLM) to ensure all installed third-party operators (e.g., databases, service meshes) are compatible with the target OpenShift version.
- Verify Cluster Health: Ensure all ClusterOperators are Available=True, Progressing=False, and Degraded=False. Never upgrade a degraded cluster.
- Backup the etcd Database: Take a manual etcd snapshot before initiating the upgrade (oc debug node/... -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup; the script was named etcd-snapshot-backup.sh in early 4.x releases).
- Monitor Worker Node Capacity: Ensure there is enough spare capacity in the cluster. Because nodes are drained and rebooted sequentially during an upgrade, the remaining nodes must be able to handle the shifted workload.
- Trigger and Monitor: Execute oc adm upgrade channel <channel>, then oc adm upgrade --to=<version>. Monitor via oc get clusterversion.
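The cluster health check in the list above can be scripted. The following is a minimal sketch, assuming the usual column order of oc get clusteroperators --no-headers (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED); the function name unhealthy_operators and the sample rows are hypothetical, and an operator that reports no version would shift the columns.

```shell
# Hypothetical helper: expects `oc get clusteroperators --no-headers` rows on stdin
# (assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE ...).
unhealthy_operators() {
  # Print the name of every operator that is not Available=True,
  # Progressing=False, Degraded=False.
  awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1 }'
}

# Made-up sample rows standing in for real cluster output.
unhealthy_operators <<'EOF'
authentication   4.14.8   True   False   False   5d
etcd             4.14.8   True   False   True    5d
EOF
# prints: etcd
```

Against a live cluster the same filter would be fed with oc get clusteroperators --no-headers | unhealthy_operators; empty output means every operator is healthy enough to proceed.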
Q4: How do you handle a scenario where a Worker Node becomes NotReady due to disk pressure?
- Identify the Culprit: Use oc describe node <node-name> and look at the conditions to confirm DiskPressure is True (the node then carries the node.kubernetes.io/disk-pressure taint).
- Determine the Cause: Access node metrics or use oc debug node/<node-name> to check if the issue is in /var/lib/containers (stuck/bloated container logs, uncleaned images) or a specific application writing local data.
- Short-Term Remediation:
  - The kubelet should automatically trigger garbage collection for unused images. If it fails, manual clearing of stopped containers or safe log rotation might be necessary.
  - Evict pods if necessary, though the DiskPressure taint should stop new pods from scheduling there.
- Long-Term Root Cause Analysis:
  - Implement stricter log limits for application console logging.
  - Adjust the evictionHard thresholds in the KubeletConfig so the kubelet starts reclaiming disk space earlier.
  - Consider scaling up node disk sizes or adding more worker nodes.
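To narrow down which filesystem is filling up, the output of df -P from the node can be filtered. This is only a sketch: the function name filesystems_over, the 85 percent limit, and the sample rows are hypothetical.

```shell
# Hypothetical helper: expects POSIX `df -P` output on stdin and prints every
# mount point whose usage is at or above the given percentage.
filesystems_over() {
  awk -v limit="$1" 'NR > 1 { pct = $5; sub(/%/, "", pct); if (pct + 0 >= limit) print $6, $5 }'
}

# Made-up sample standing in for output collected from a node.
filesystems_over 85 <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda4 125275116 112747604 12527512 90% /var
/dev/sda3 358271 101234 233000 31% /boot
EOF
# prints: /var 90%
```

On a real node the input could come from oc debug node/<node-name> -- chroot /host df -P, piped into filesystems_over 85.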
3. Advanced Networking, Security & Storage
Q5: What is the default CNI for modern OpenShift 4 clusters, and how does it differ from its predecessor?
- Modern OpenShift clusters use OVN-Kubernetes (Open Virtual Network) as the default container network interface (CNI). It replaced OpenShift SDN.
- Key Advantages of OVN-K:
- It supports dual-stack IPv4/IPv6 networking out of the box.
- It includes native support for Kubernetes NetworkPolicies and advanced routing.
- It scales better on large clusters than the older OVS-based OpenShift SDN and adds features such as hybrid networking for clusters that mix Linux and Windows nodes.
Q6: How would you secure a multi-tenant OpenShift cluster where Team A and Team B must share the same hardware but cannot communicate?
- Network Isolation: Implement NetworkPolicies in each namespace. By default, namespaces can talk to each other. I would apply a default-deny policy for cross-namespace ingress traffic and explicitly whitelist only what is necessary.
- RBAC (Role-Based Access Control): Bind Team A’s users to a specific ClusterRole (like admin or edit) scoped strictly to Team A’s namespaces using RoleBindings (not ClusterRoleBindings).
- Resource Quotas and LimitRanges: Prevent one team from starving the other of resources. Apply ResourceQuotas to limit maximum CPU/Memory per namespace and LimitRanges to enforce default requests/limits on pods.
- Security Context Constraints (SCC): Ensure both teams are locked into the default restricted-v2 SCC so neither team can escalate privileges to the underlying worker node.
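The network isolation point can be illustrated with a NetworkPolicy that only admits ingress from pods in the same namespace. The sketch below generates such a manifest; the function name same_namespace_policy, the policy name, and the team-a namespace are assumptions, and a real cluster would usually also need a rule admitting traffic from the ingress router so Routes keep working.

```shell
# Hypothetical helper: prints a NetworkPolicy for the namespace given as $1.
# Selecting all pods and allowing ingress only from pods in the same namespace
# blocks ingress from every other namespace.
same_namespace_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
EOF
}

same_namespace_policy team-a
```

The generated manifest could then be applied with same_namespace_policy team-a | oc apply -f -, and again for Team B's namespace.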
4. Scenario-Based Troubleshooting (The “Hero” Questions)
Q7: The etcd leader crashes, or the etcd cluster loses quorum. How do you recover?
Losing etcd quorum is a critical P1 outage.
- Assess the Damage: Check the status of the etcd pods using oc get pods -n openshift-etcd. If quorum is lost, the control plane APIs will stop responding (oc commands will time out).
- Locate the Backup: Hopefully, automated or recent manual etcd backups exist on a secure external server (as performed in Q3).
- Perform an etcd Recovery:
  - You must access one of the master nodes directly via SSH (as the core user, with the cluster’s SSH key).
  - Stop the etcd static pods.
  - Run the recovery script provided by OpenShift (/usr/local/bin/cluster-restore.sh; named etcd-snapshot-restore.sh in early 4.x releases) against the backup.
  - This script rewrites the etcd data directory and forces a single-node etcd cluster, allowing the other masters to re-sync once the API comes back online.
Q8: A critical security vulnerability (CVE) is announced for the Linux Kernel. How do you patch all 100+ worker nodes in your cluster without disrupting application traffic?
You leverage the power of the Machine Config Operator (MCO):
- Red Hat ships the kernel fix in an updated RHCOS image, delivered through an OpenShift update.
- When the update produces a new rendered MachineConfig for the worker pool, the MCO takes over.
- The MCO acts as a rolling controller: It selects a worker node (one at a time by default, governed by the pool’s maxUnavailable setting), marks it as unschedulable (cordon), safely evicts all running pods (drain), applies the kernel patch and reboots, waits for it to come back to Ready status, and moves to the next node.
- Because this happens sequentially, applications with proper replica counts and Pod Disruption Budgets (PDBs) should see no downtime.
etcd recovery and MachineConfig concepts
Let’s break these two critical Cluster Administration concepts down to the architectural and operational level. Mastering etcd management and the Machine Config Operator (MCO) is what separates a junior administrator from a senior cluster engineer.
1. Deep Dive: etcd Recovery & Architecture
etcd is a distributed, consistent key-value store that acts as OpenShift’s single source of truth. Every single object—Pods, Routes, Secrets, CRDs, and ConfigMaps—is stored here.
The Quorum Mechanics
etcd uses the Raft consensus algorithm to ensure data consistency across the master nodes. To function, it requires a strict majority (quorum) of healthy members. The formula for quorum is:
$$\text{Quorum} = \left\lfloor \frac{N}{2} \right\rfloor + 1$$
Where N is the total number of members in the cluster.
- For a standard 3-master cluster, quorum is $\lfloor 3/2 \rfloor + 1 = 2$. You can lose 1 master without losing quorum.
- If 2 masters fail simultaneously, you have 1 node left. 1 < 2, so quorum is lost. The cluster API immediately locks up and stops responding.
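The quorum arithmetic can be checked quickly in a shell, where integer division already rounds down. The function names quorum and fault_tolerance are made up for this sketch.

```shell
# Quorum is floor(N/2) + 1; shell integer division already floors.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still held.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5 7; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
# members=7 quorum=4 tolerated_failures=3
```

Note that an even member count buys nothing: 4 members need a quorum of 3 and still tolerate only 1 failure, which is why etcd clusters use odd sizes.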
Scenario: Step-by-Step Loss of Quorum Recovery
When quorum is broken, you cannot use oc commands because the API server depends on etcd. You must bypass the API entirely and interact directly with the master host operating system via SSH.
Step 1: Access a Surviving Master Node
SSH into one of the surviving master nodes using your cluster’s private SSH key:
Bash
ssh core@master-0.example.com
Step 2: Verify the Backup (Pre-requisite Check)
Before recovering, ensure you actually have a valid snapshot. The backup script writes to whatever directory it was given (commonly /home/core/assets/backup), and copies should also be kept on an external server. A valid backup is a snapshot .db file accompanied by an archive of the static pod resources taken at the same time.
Step 3: Initiate the Single-Node Recovery
OpenShift provides a built-in recovery script located inside the cluster-etcd-operator container image, but it is exposed to the host path. Run the recovery script on the master node:
# cluster-restore.sh takes the backup directory (etcd-snapshot-restore.sh in early 4.x releases took the snapshot file)
sudo -E /usr/local/bin/cluster-restore.sh /home/core/assets/backup
What this script does under the hood:
- Stops the static pods: It moves the manifests for etcd, kube-apiserver, and kube-controller-manager out of /etc/kubernetes/manifests/ so the Kubelet stops trying to run them.
- Wipes existing data: It clears the corrupted/out-of-sync etcd data directory (/var/lib/etcd/).
- Restores the snapshot: It unpacks your .db file into /var/lib/etcd/.
- Rewrites the cluster membership: It modifies the etcd configuration to trick the node into believing it is a single-node cluster ($N=1$, meaning quorum = 1). It erases the metadata of the other dead masters.
- Restarts the static pods: It moves the manifests back, forcing the API server to spin up using this new single-member database.
Step 4: Re-syncing the Other Masters
Once the API server is back up on master-0, use your local terminal again. The remaining master nodes (master-1 and master-2) will still be out of sync.
You don’t need to manually restore them. Instead, you force the Cluster Etcd Operator to roll out a fresh etcd revision:
Bash
# Force the etcd operator to redeploy etcd so the other members rejoin and re-sync
oc patch etcd cluster --type=merge -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}'
The operator will detect that master-1 and master-2 are missing from the current active state, re-add them as members, and catch them up via Raft replication from master-0.
2. Deep Dive: The Machine Config Operator (MCO)
In vanilla Kubernetes, if you want to change a kernel parameter (sysctl), add an SSH key, or configure an enterprise container registry mirror on your worker nodes, you have to use external configuration management tools like Ansible, Puppet, or SaltStack.
OpenShift discards external tools entirely and uses the Machine Config Operator (MCO) to manage the operating system (RHCOS) natively through Kubernetes Custom Resources.
The MCO Component Hierarchy
To understand how a configuration change reaches a node, you must understand these 4 core objects:
- MachineConfig (MC): A YAML file that outlines the exact state you want the OS to be in. It can contain files to write, systemd units to enable, or kernel settings to apply.
- MachineConfigPool (MCP): A grouping of nodes that should receive the same configurations (typically mapped to roles, like master or worker).
- Controller: Monitors the MachineConfigs and compiles them into a single, master “target” configuration for the entire pool.
- Machine Config Daemon (MCD): A pod that runs as a DaemonSet on every single node in the cluster. It runs with root privileges (chroot /host) and is responsible for actually writing the changes to its local disk.
Scenario: Writing a Custom Security File to 100+ Nodes
Let’s look at how the MCO executes a change. Suppose your security team requires a corporate banner (/etc/issue) to be present on every worker node.
Step 1: Create the MachineConfig Object
File contents inside a MachineConfig are written in Ignition format: each file’s source is a data URL carrying URL-encoded or base64 text.
YAML
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker # Ties this config to the worker pool
  name: 99-worker-corporate-banner
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:,WARNING%3A%20Authorized%20Access%20Only%21%0A
          mode: 420 # Decimal form of octal 0644
          path: /etc/issue
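For longer files, hand-writing URL-encoded text gets error-prone, so the base64 form of the data URL is often easier to generate. The following is a minimal sketch, assuming the base64 utility is available; the function name ignition_data_url is hypothetical.

```shell
# Hypothetical helper: reads file contents on stdin and prints a base64 data URL
# that can be pasted into the contents.source field of a MachineConfig.
ignition_data_url() {
  # tr strips the line wrapping that some base64 implementations add.
  printf 'data:text/plain;charset=utf-8;base64,%s\n' "$(base64 | tr -d '\n')"
}

printf 'WARNING: Authorized Access Only!\n' | ignition_data_url
```

The printed value replaces the data:, URL in the manifest; the rest of the MachineConfig stays the same.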
Step 2: The MCO Reconciliation Flow
When you apply this file (oc apply -f banner.yaml), the following chain reaction occurs:
- Compilation: The MCO notices a new MachineConfig with the worker label. It merges this file with all existing worker configs into a new rendered configuration, identified by a hash in its name (rendered-worker-<hash>).
- Pool Update: The MachineConfigPool for workers changes its status to UPDATING=True.
- The Rollout Lifecycle (Node by Node):
  - The Machine Config Daemon (MCD) running on worker-0 notices the pool’s target hash no longer matches its local current hash.
  - The MCD requests the cluster to drain the node. The cluster cordons the node and gracefully evicts all workloads to other worker nodes.
  - Once empty, the MCD steps out of the container boundary using chroot, accesses the host file system, and writes the string WARNING: Authorized Access Only! into /etc/issue.
  - If the change requires a reboot (like a kernel parameter change), the MCD triggers a host reboot. By default most changes do, typically including a plain file change like this one; only a short list of changes, such as SSH key updates, is applied without a reboot.
  - The MCD verifies the file exists and is correct, then uncordons the node so it is schedulable again.
- Next Node: The MCO moves to worker-1, repeating the exact same process.
How to Monitor This as an Administrator
During a massive rollout, you monitor the orchestration via the pools:
oc get mcp
Output example during update:
Plaintext
NAME     CONFIG                      UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT
master   rendered-master-1a2b3c...   True      False      False      3              3
worker   rendered-worker-4f5e6d...   False     True       False      100            45
If a node fails to apply a configuration (e.g., a systemd service fails to start), the pool will mark DEGRADED=True and immediately halt the entire rollout to prevent breaking the remaining 54 nodes.
If you were to encounter a Degraded MachineConfigPool in production, your immediate next step would be to check the logs of the specific daemon pod on the failing node.
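Spotting the degraded pool can be automated before going to the daemon logs. This sketch assumes DEGRADED is the fifth column of oc get mcp --no-headers, as in the output above; the function name degraded_pools and the sample rows are hypothetical.

```shell
# Hypothetical helper: expects `oc get mcp --no-headers` rows on stdin
# (assumed columns: NAME CONFIG UPDATED UPDATING DEGRADED ...).
degraded_pools() {
  awk '$5 == "True" { print $1 }'
}

# Made-up sample rows standing in for real cluster output.
degraded_pools <<'EOF'
master   rendered-master-1a2b3c   True    False   False   3     3
worker   rendered-worker-4f5e6d   False   True    True    100   45
EOF
# prints: worker

# Follow-up against a live cluster (not executed here):
#   oc get pods -n openshift-machine-config-operator -o wide
#   oc logs -n openshift-machine-config-operator <machine-config-daemon-pod> -c machine-config-daemon
```

The first follow-up command shows which machine-config-daemon pod runs on the failing node; the second reads that pod's log, where the failed step is normally reported.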