Mastering Day 1 and Day 2 of Cluster Management

Asking about Day 1 versus Day 2 is a classic way for interviewers to see whether you have actually managed a cluster in production. Day 1 is about getting the cluster alive; Day 2 is about keeping it from dying.

In a senior-level interview, you are expected to spend most of your time on Day 2, which accounts for roughly 99% of a cluster’s lifespan.


Day 1: Installation & Provisioning

Focus: Automation, Infrastructure, and “Getting to Green.”

On-Premise Reality Check, by task:

  • DNS Setup: Creating the critical records: api, api-int, and *.apps. Without these, the bootstrap will fail.
  • Load Balancing: Setting up an external HAProxy or F5 (for UPI) or ensuring VIPs are reserved (for IPI).
  • Ignition Configs: Using the installer to generate .ign files and serving them via HTTP/PXE to the bare metal/VM nodes.
  • Certificate Approval: Running oc get csr and approving the pending requests (oc adm certificate approve) so that nodes can join the cluster.
  • Registry Mirroring: (If air-gapped) Setting up the local Quay/Nexus registry and the ImageContentSourcePolicy (ImageDigestMirrorSet on newer releases).
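
The certificate approval step can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: it assumes you have saved the output of oc get csr -o json, and it treats a request as pending when its status carries no Approved or Denied condition. The function names and the sample data are made up for this example.

```python
import json


def pending_csr_names(csr_list_json: str) -> list[str]:
    """Return names of CSRs that have been neither approved nor denied."""
    items = json.loads(csr_list_json).get("items", [])
    pending = []
    for csr in items:
        # A pending CSR has an empty status, so "conditions" may be missing.
        conditions = csr.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            pending.append(csr["metadata"]["name"])
    return pending


def approve_commands(names: list[str]) -> list[str]:
    """Build the oc commands an admin would review before running them."""
    return [f"oc adm certificate approve {name}" for name in names]


# Trimmed, made-up sample shaped like `oc get csr -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "csr-aaaaa"}, "status": {}},
        {"metadata": {"name": "csr-bbbbb"},
         "status": {"conditions": [{"type": "Approved"}]}},
        {"metadata": {"name": "csr-ccccc"}, "status": {}},
    ]
})

if __name__ == "__main__":
    for command in approve_commands(pending_csr_names(SAMPLE)):
        print(command)
```

The sketch only prints commands rather than running them, so a human still reviews each request. A new node usually raises a client CSR first and a serving CSR afterwards, so expect to repeat the check at least once per node.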

Day 2: Maintenance & Operations

Focus: Stability, Compliance, and Scaling.

1. Lifecycle Management

  • Cluster Upgrades: Navigating the “Update Graph.” Choosing between the stable and fast channels.
  • Certificate Rotation: Monitoring the expiration of the internal API and Ingress CA (though OpenShift now automates most of this, an admin must know how to fix a “stuck” rotation).
  • Node Scaling: Adding new Bare Metal workers via the Assisted Installer or expanding VMware Resource Pools.
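
Certificate monitoring comes down to simple date arithmetic. The sketch below assumes you already have a certificate's notAfter value in the OpenSSL text form (for example, from openssl x509 -noout -enddate); the 30-day warning window and the function names are assumptions for illustration, not OpenShift defaults.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the OpenSSL text form, e.g. "Jan  1 00:00:00 2025 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    remaining_seconds = expires - now.timestamp()
    return int(remaining_seconds // 86400)


def needs_attention(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    today = datetime.now(timezone.utc)
    print(days_until_expiry("Jan  1 00:00:00 2035 GMT", today))
```

A negative result means the certificate has already expired, which is the point at which a "stuck" rotation usually becomes visible.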

2. Performance & Health

  • Etcd Maintenance: Performing periodic defragmentation and manual snapshots before any major change.
  • Logging Stack Management: Tuning the Elasticsearch/Fluentd (or Loki) stack. On-premise, this often means managing “PVC full” issues when logs grow too fast.
  • Pruning: Running oc adm prune to clean up old builds, images, and deployments that clutter the etcd database and consume registry storage.
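
To make "periodic defragmentation" less of a guess, you can compare the size of the etcd database on disk with the space it actually uses. The sketch below assumes you have already collected those two byte counts from your monitoring; the 50% threshold is an illustrative assumption, so tune it for your environment.

```python
def fragmentation_percent(db_size_bytes: int, db_in_use_bytes: int) -> float:
    """Share of the on-disk database that defragmentation could reclaim."""
    if db_size_bytes <= 0:
        raise ValueError("db_size_bytes must be positive")
    reclaimable = max(db_size_bytes - db_in_use_bytes, 0)
    return 100.0 * reclaimable / db_size_bytes


def should_defragment(db_size_bytes: int, db_in_use_bytes: int,
                      threshold_percent: float = 50.0) -> bool:
    """True when reclaimable space reaches the (assumed) threshold."""
    return fragmentation_percent(db_size_bytes, db_in_use_bytes) >= threshold_percent


if __name__ == "__main__":
    # 2 GB on disk, 0.8 GB in use: 60% of the file is reclaimable.
    print(fragmentation_percent(2_000_000_000, 800_000_000))
    print(should_defragment(2_000_000_000, 800_000_000))
```

In an interview, the number matters less than showing that you decide when to defragment from data and that you take a snapshot first.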

3. Security & Governance

  • RBAC Auditing: Ensuring developers aren’t using cluster-admin for daily tasks.
  • SCC Policy: Managing exceptions for specialized workloads (e.g., giving a monitoring agent privileged access).
  • Quota Management: Defining ResourceQuotas per Project to prevent a single “noisy neighbor” from consuming all physical RAM on your ESXi hosts.
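
As a concrete example of quota management, the sketch below builds a ResourceQuota manifest for a single Project and prints it as JSON, which oc apply -f can read. The quota name compute-quota, the project name, and the sizes are placeholders chosen for illustration.

```python
import json


def memory_quota_manifest(project: str, requests_gib: int, limits_gib: int) -> dict:
    """Build a ResourceQuota that caps total memory requests and limits."""
    if requests_gib > limits_gib:
        raise ValueError("requests should not exceed limits")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # "compute-quota" is a hypothetical name; pick your own convention.
        "metadata": {"name": "compute-quota", "namespace": project},
        "spec": {
            "hard": {
                "requests.memory": f"{requests_gib}Gi",
                "limits.memory": f"{limits_gib}Gi",
            }
        },
    }


if __name__ == "__main__":
    manifest = memory_quota_manifest("team-a", requests_gib=16, limits_gib=32)
    print(json.dumps(manifest, indent=2))
```

Once a quota on limits.memory exists, pods in that Project have to declare a memory limit, so a quota like this is usually paired with a LimitRange that supplies defaults.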

The “Senior Admin” Bonus: Disaster Recovery (DR)

An interviewer will almost certainly ask: “What is your DR strategy for on-prem?”

A high-quality answer includes:

  1. Etcd Backups: Stored outside the cluster (e.g., on an external S3 bucket or NAS).
  2. Velero: Using Velero (delivered on OpenShift through the OADP operator) to back up application resources and Persistent Volumes (using CSI snapshots).
  3. Multi-Cluster: Having a second “Passive” cluster in a different data center and using Red Hat Advanced Cluster Management (RHACM) to shift traffic via DNS if the primary DC goes dark.
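
A backup strategy only counts if someone verifies that the backups exist and are recent. The sketch below checks the age of the newest snapshot file in an off-cluster directory. The *.db file pattern, the 24-hour limit, and the mount path are assumptions for illustration; adapt them to however your backups are named and stored.

```python
import time
from pathlib import Path
from typing import Optional


def newest_snapshot_age_hours(backup_dir: Path, now: Optional[float] = None,
                              pattern: str = "*.db") -> Optional[float]:
    """Age in hours of the most recently modified snapshot, or None if absent."""
    snapshots = list(backup_dir.glob(pattern))
    if not snapshots:
        return None
    newest_mtime = max(p.stat().st_mtime for p in snapshots)
    current = time.time() if now is None else now
    return (current - newest_mtime) / 3600.0


def backup_is_fresh(backup_dir: Path, max_age_hours: float = 24.0,
                    now: Optional[float] = None) -> bool:
    """True when at least one snapshot exists and is recent enough."""
    age = newest_snapshot_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours


if __name__ == "__main__":
    # Hypothetical mount point for the NAS share that holds etcd snapshots.
    print(backup_is_fresh(Path("/mnt/etcd-backups")))
```

A freshness check like this says nothing about whether a restore works, so a strong DR answer also mentions periodic restore tests.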

Final Interview Tip: The “Why”

When answering, don’t just say what you did; say why it matters for the business:

  • Wrong: “I configured the MTU to 1400.”
  • Right: “I lowered the MTU to 1400 to prevent packet fragmentation over our Geneve tunnels, which reduced our application latency by 30%.”
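
The arithmetic behind the MTU answer is worth knowing, because interviewers often ask where 1400 comes from. The sketch below assumes a 100-byte encapsulation overhead, the value OpenShift documents for OVN-Kubernetes (Geneve); treat it as a parameter if your network plugin differs.

```python
GENEVE_OVERHEAD_BYTES = 100  # assumed overhead for OVN-Kubernetes (Geneve)


def overlay_mtu(physical_mtu: int, overhead: int = GENEVE_OVERHEAD_BYTES) -> int:
    """Largest pod-network MTU that still fits inside one physical frame."""
    if physical_mtu <= overhead:
        raise ValueError("physical MTU must be larger than the overhead")
    return physical_mtu - overhead


def exceeds_physical_mtu(pod_mtu: int, physical_mtu: int,
                         overhead: int = GENEVE_OVERHEAD_BYTES) -> bool:
    """True when an encapsulated full-size pod packet is too big for the wire."""
    return pod_mtu + overhead > physical_mtu


if __name__ == "__main__":
    print(overlay_mtu(1500))                 # standard Ethernet
    print(overlay_mtu(9000))                 # jumbo frames
    print(exceeds_physical_mtu(1500, 1500))  # pod MTU left at 1500
```

With a standard 1500-byte physical MTU the result is 1400, which matches the example answer above; with jumbo frames at 9000 it is 8900.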
