What is an Operator? Operators are pieces of software that ease the operational complexity of running another piece of software. They act like an extension of the software vendor’s engineering team, monitoring a Kubernetes environment and using its current state to make decisions in real time. Advanced Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts (for example, skipping a backup step to save time during an upgrade). They are built on two things: a CRD (which extends the Kubernetes API with a new object type) and a controller (which watches instances of that type and drives the cluster toward the desired state).
The reconciliation loop is the engine behind every Operator. Every OpenShift Operator runs a control loop that continuously compares actual cluster state against the desired state defined in your CRDs. When discrepancies appear, the Operator executes operations to reconcile the difference — creating missing resources, updating outdated configurations, or removing excess components.
OLM (Operator Lifecycle Manager) is the package manager that runs by default on every OCP cluster. Within a CatalogSource, Operators are organised into packages and streams of updates called channels — a familiar update pattern from software on a continuous release cycle. A Subscription binds to a particular package and channel, and OLM handles installation and upgrades. OLM considerably simplifies lifecycle automation at enterprise scale — certified Operators can be automatically upgraded for patch releases and set to manual approval for major releases across the entire fleet.
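A Subscription is itself just an API object you can create like any other. A minimal example might look like the following (the operator, package, and namespace names here are illustrative, not a specific product):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator            # illustrative name
  namespace: openshift-operators
spec:
  name: example-operator            # package name within the CatalogSource
  channel: stable                   # the update channel to track
  source: redhat-operators          # the CatalogSource to pull from
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic    # or Manual, to gate upgrades on approval
```

Setting `installPlanApproval: Manual` is how the patch-automatic / major-manual policy described above is enforced per Subscription.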
The maturity model runs from Level 1 (basic install) up to Level 5 (auto-pilot with self-healing and auto-scaling), reflecting how much operational knowledge the Operator encodes.
In OpenShift (OCP), Operators are the single most important architectural concept. While Kubernetes provides the basic building blocks (Pods, Services, Deployments), Operators provide the intelligence to manage complex, stateful applications automatically.
Think of an Operator as a “Digital SRE” (Site Reliability Engineer). It is a piece of software that packages the human operational knowledge required to manage a service.
1. How an Operator Works
An Operator functions as a continuous control loop. It follows a simple three-step cycle called the Reconciliation Loop:
- Observe: It watches the cluster’s current state (e.g., “There are 2 database pods running”).
- Analyze: It compares that to your desired state (e.g., “The user requested 3 database pods”).
- Act: It performs the necessary actions to fix the discrepancy (e.g., “Start a 3rd database pod and run the ‘seed-data’ script”).
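The three steps above can be sketched in shell, purely for illustration; a real Operator implements this loop in its controller code, not in a script. The `reconcile` function and replica counts here are hypothetical:

```bash
# Toy reconciliation loop: `have` is the observed replica count,
# `want` is the desired count from the Custom Resource.
reconcile() {
  local have=$1 want=$2
  while [ "$have" -ne "$want" ]; do       # Observe + Analyze
    if [ "$have" -lt "$want" ]; then
      echo "Act: start replica $((have + 1))"
      have=$((have + 1))
    else
      echo "Act: remove replica $have"
      have=$((have - 1))
    fi
  done
  echo "in sync at $have replicas"        # desired state reached
}

reconcile 2 3   # observed 2 pods, user requested 3
```

The loop only ever acts on the *difference* between observed and desired state, which is why the same logic handles scale-up, scale-down, and "nothing to do" uniformly.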
2. Key Components
To an administrator, an Operator is made up of two main parts:
- Custom Resource Definition (CRD): This extends the OpenShift API. It creates a new “object type” that the cluster didn’t have before (like `Kind: Database` or `Kind: Backup`).
- Controller: The actual code (running as a Pod) that watches that CRD and performs the work.
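As a sketch, a CRD that adds a `Kind: Database` type could look like this (the group, names, and schema are illustrative):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com     # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Database
    plural: databases
    singular: database
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer   # the "desired state" the controller reconciles
```

Once applied, `oc get databases` works like any built-in resource; the controller is what gives those objects behavior.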
3. The Operator Lifecycle Manager (OLM)
OpenShift doesn’t just run Operators; it manages them using the OLM. This is what makes OpenShift’s implementation of Operators superior to “vanilla” Kubernetes. The OLM handles:
- OperatorHub: A built-in “app store” for the cluster where you can find and install certified operators (like databases, monitoring, or AI tools).
- Over-the-Air Updates: Just like your phone updates apps, the OLM tracks “Update Channels” and can automatically patch your Operators when Red Hat releases a security fix.
- Dependency Management: If a specific Operator requires another one to function (e.g., a Service Mesh needing a specific Monitoring tool), the OLM resolves and installs those dependencies for you.
4. Operator Maturity Model
Not all Operators are created equal. Red Hat categorizes them by their “intelligence” levels:
| Level | Capability | Description |
|---|---|---|
| 1 | Basic Install | Automates the initial setup and configuration. |
| 2 | Seamless Upgrades | Handles version updates and minor/patch releases. |
| 3 | Full Lifecycle | Manages backup, recovery, and storage scaling. |
| 4 | Deep Insights | Provides metrics, alerts, and log analysis. |
| 5 | Auto Pilot | Self-healing, auto-scaling, and abnormal behavior detection. |
5. Why They Matter in 2026
In modern clusters, everything is an Operator.
- Cluster Operators: OpenShift 4.x is actually built of ~30 internal operators that manage the network, the console, and the internal registry. If you run `oc get co`, you are looking at the health of the cluster’s “internal organs.”
- Operand vs. Operator: The Operator is the manager (the pod running the code), while the Operand is the actual application being managed (the database pods).
Interview Tip
If an interviewer asks, “Why can’t I just use a Helm chart?”
Answer: “A Helm chart is a template for a one-time installation. An Operator is a living controller that stays active after the install. Helm doesn’t know how to handle a database failover or a schema migration at 3 AM; an Operator does.”
Troubleshooting a Degraded operator is the ultimate “Admin Level” task. When an operator is degraded, it means it can’t reach its desired state, usually due to a configuration error, a networking block, or a problem with the underlying infrastructure.
Here is the professional workflow for fixing a degraded operator in 2026.
1. Identify the Target
First, pinpoint which operator is failing and check the “Message” column for a quick hint.
```bash
oc get clusteroperators
```
What to look for: Look for `DEGRADED=True`. Note the `MESSAGE` column; it often tells you exactly what’s wrong (e.g., “MachineConfigPool master is not ready” or “Failed to reach endpoint”).
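On a large cluster the table can be long, so it helps to filter it down with `jq` (assuming `jq` is installed). Normally the input comes from `oc get co -o json`; here a small inline sample JSON stands in for cluster output so the filter itself can be demonstrated:

```bash
# Sample of the structure `oc get co -o json` returns (two fake operators).
sample='{"items":[
  {"metadata":{"name":"authentication"},
   "status":{"conditions":[{"type":"Degraded","status":"True"}]}},
  {"metadata":{"name":"ingress"},
   "status":{"conditions":[{"type":"Degraded","status":"False"}]}}]}'

# Print only the names of operators whose Degraded condition is True.
echo "$sample" | jq -r '.items[]
  | select(any(.status.conditions[]; .type=="Degraded" and .status=="True"))
  | .metadata.name'
```

Against a live cluster you would replace `echo "$sample"` with `oc get co -o json`.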
2. The “Deep Dive” Command
To see the detailed history and conditions of the failure, inspect the YAML of the clusteroperator itself.
```bash
oc get clusteroperator <operator-name> -o yaml
```
Focus on the status.conditions section. You are looking for a condition with type: Degraded and status: "True". Read the reason and message fields—they contain the specific logs from the operator’s last reconciliation attempt.
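The shape of such a condition looks like this (the `reason`, `message`, and timestamp values are illustrative):

```yaml
status:
  conditions:
    - type: Degraded
      status: "True"
      reason: MachineConfigPoolNotReady          # illustrative reason code
      message: "MachineConfigPool master is not ready"
      lastTransitionTime: "2026-01-15T09:30:00Z"
```

The `lastTransitionTime` is also useful: it tells you when the operator first went degraded, which you can correlate with upgrades or infrastructure events.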
3. Trace the “Related Objects”
Operators manage other things (Pods, ConfigMaps, Secrets). If the operator is degraded, it’s usually because one of its Operands (the things it manages) is broken.
- Find the Managed Objects: `oc get clusteroperator <operator-name> -o json | jq .status.relatedObjects`
- Check those namespaces: Most core operators live in namespaces starting with `openshift-` (e.g., `openshift-apiserver`, `openshift-ingress`).
- Inspect the Pods: `oc get pods -n <operator-namespace>`
  - CrashLoopBackOff? Check logs: `oc logs <pod-name> -n <namespace>`
  - Pending? Describe the pod: `oc describe pod <pod-name> -n <namespace>`
4. Common 2026 Failure Scenarios
| Degraded Operator | Common Cause | The Admin Fix |
|---|---|---|
| authentication | Certificates out of sync or API server unreachable. | Check oc get secrets -n openshift-authentication. Restart the operator pod to force a re-sync. |
| machine-config | A node is stuck draining during an update (often due to a PDB). | oc get nodes to find the SchedulingDisabled node. Check oc get pdb -A to see if a pod is blocked from moving. |
| image-registry | The cloud storage (S3/Azure/GCS) is full or has wrong credentials. | oc edit configs.imageregistry.operator.openshift.io cluster and verify the storage block. |
| etcd | Disk latency is too high or a member is down. | oc get pods -n openshift-etcd. Check logs for “etcdserver: publish error: etcdserver: request timed out”. |
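For the image-registry case, the storage block you are verifying lives under `spec` of the registry config. For S3-backed storage it looks roughly like this (bucket and region values are illustrative):

```yaml
# Fragment of configs.imageregistry.operator.openshift.io/cluster
spec:
  storage:
    s3:
      bucket: my-registry-bucket   # illustrative bucket name
      region: us-east-1
```

If the bucket name, region, or the credentials secret backing it are wrong, the registry pods fail to start and the operator reports Degraded.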
5. The “Restart” (When all else fails)
Because Operators are designed to be self-healing, sometimes the “manager” (the operator pod) just needs a kick to re-evaluate the cluster state.
```bash
# Example: Restarting the Ingress Operator
oc delete pod -l name=ingress-operator -n openshift-ingress-operator
```
Pro Tip: This does NOT take down your traffic. It only restarts the “manager.” The actual HAProxy routers stay running while the operator restarts.
Final Summary for the Interview
“If I see a Degraded operator, I follow the Outside-In approach:
- Check the Status: `oc get co` and `oc describe co`.
- Check the Namespace: Look at the operator’s pods and logs.
- Check the Infrastructure: Verify the storage, network, and node health that the operator relies on.
- Collect Evidence: If I can’t fix it in 15 minutes, I run `oc adm must-gather` to get a full diagnostic snapshot for Red Hat Support.”