Monitoring and observability in Red Hat OpenShift Container Platform (OCP) are built on a highly resilient, enterprise-grade telemetry stack managed by the Cluster Monitoring Operator (CMO).
Unlike upstream Kubernetes, where you have to manually configure Prometheus operators, persistent storage, and scrapers, OpenShift provides a fully integrated, out-of-the-box observability ecosystem.
1. Architectural Blueprint: Dual Stack Monitoring
OpenShift splits its monitoring architecture into two completely isolated functional domains to prevent application-level telemetry spikes from crashing cluster control plane visibility.
Platform Monitoring (Managed by Red Hat)
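- Components: Prometheus, Alertmanager, and Thanos Querier, plus supporting exporters. Older releases also bundled a read-only Grafana, which was removed from the platform stack in OpenShift 4.11.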
- Scope: Dedicated strictly to tracking infrastructure components, node health, and core OpenShift operators (e.g., API server, etcd, SDN/OVN).
- Governance: Configured out-of-the-box. Read-only to standard cluster users; Red Hat SREs use this data to calculate cluster health and drive automated alerts.
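Although the platform stack works without any setup, administrators usually tune it through the `cluster-monitoring-config` ConfigMap in the `openshift-monitoring` namespace. The following is a minimal sketch, assuming a storage class called `fast-ssd` (a hypothetical name) and illustrative retention and volume sizes:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d                   # illustrative; size this to your disks
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd   # hypothetical storage class name
          resources:
            requests:
              storage: 40Gi            # illustrative PVC size
```

There is only one ConfigMap with this name per cluster, so any other keys it already holds should be merged into the same `config.yaml` rather than overwritten.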
User Workload Monitoring (UWM)
- Scope: Dedicated to developers and application SREs for scraping custom application metrics (e.g., JVM stats, HTTP request rates).
- Isolation: Runs its own Prometheus instances, each with a Thanos sidecar, separately from the platform stack, so that a massive surge in developer metrics cannot exhaust the memory of the platform Prometheus instances that watch the control plane.
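The isolation can be tightened further by capping what the user workload Prometheus is allowed to ingest and consume. A hedged sketch, assuming User Workload Monitoring is already enabled; every number below is illustrative rather than a recommendation:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h               # illustrative local retention
      enforcedSampleLimit: 50000   # illustrative cap on samples accepted per scrape
      resources:
        requests:
          cpu: 200m                # illustrative
          memory: 2Gi              # illustrative
        limits:
          memory: 4Gi              # illustrative
```

With a sample limit in place, a target that suddenly exposes far more series than expected fails its scrape instead of inflating the memory footprint of the shared instance.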
2. Deep Dive: Global Observability via Thanos
Because OpenShift clusters frequently run across multiple availability zones or scale to hundreds of nodes, keeping raw metrics on short-lived disk volumes is dangerous and expensive. OpenShift solves this by incorporating Thanos into the native monitoring architecture.
Plaintext
┌───────────────────────────┐    ┌───────────────────────────┐
│   Worker Node (Zone A)    │    │   Worker Node (Zone B)    │
│ ┌────────────┐┌─────────┐ │    │ ┌────────────┐┌─────────┐ │
│ │ Prometheus ││ Thanos  │ │    │ │ Prometheus ││ Thanos  │ │
│ │(Local TSDB)││ Sidecar │ │    │ │(Local TSDB)││ Sidecar │ │
│ └────────────┘└─────────┘ │    │ └────────────┘└─────────┘ │
└───────┬────────────┬──────┘    └───────┬────────────┬──────┘
        │            │                   │            │
        │ Scrapes    │ Pushes            │ Scrapes    │ Pushes
        ▼            ▼                   ▼            ▼
┌──────────────┐ ┌───────────────────────────────────────────┐
│ Applications │ │       Enterprise S3 Object Storage        │
└──────────────┘ └─────────────────────▲─────────────────────┘
                                       │
                                       │ Queries via gRPC
                                 ┌─────┴─────────────┐
                                 │  Thanos Querier   │◀── [Grafana / Console]
                                 └───────────────────┘
- Thanos Sidecar: Every Prometheus instance runs alongside a Thanos Sidecar container. The sidecar exposes the local Time Series Database (TSDB) to the Thanos Querier over gRPC and, where object storage is configured, uploads each newly completed TSDB block to long-term S3-compatible object storage (e.g., AWS S3, Azure Blob, ODF/Ceph).
- Thanos Querier: When an engineer queries a metric through a Grafana dashboard or the OpenShift console, the Thanos Querier handles the request. It fans the query out to the local, short-term Prometheus TSDBs and to the historical data held in the object storage layer, then merges the results and deduplicates data points from replicated Prometheus instances on the fly.
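For reference, upstream Thanos components locate the bucket through a small object storage configuration file, typically passed with the `--objstore.config-file` flag. The sketch below uses the upstream Thanos S3 format. The bucket name, endpoint, and credentials are placeholders, and whether an OpenShift deployment exposes this setting at all depends on the product and version in use, so treat the wiring as an assumption:

```yaml
type: S3
config:
  bucket: "metrics-long-term"     # hypothetical bucket name
  endpoint: "s3.example.com"      # placeholder S3-compatible endpoint
  region: "us-east-1"             # placeholder region
  access_key: "<ACCESS_KEY>"      # supply from a Secret, never commit real keys
  secret_key: "<SECRET_KEY>"
  insecure: false                 # keep TLS enabled towards the endpoint
```

The same file format works for any S3-compatible backend, which is why AWS S3 and Ceph-based stores can be swapped by changing only the endpoint and credentials.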
3. The Full Observability Pillars: Logging & Tracing
True observability extends beyond metrics. OpenShift packages native operators to handle the other two pillars of observability: Logs and Distributed Traces.
A. Centralized Log Aggregation (Vector & Loki)
The Red Hat OpenShift Logging Operator deploys and manages a centralized log pipeline for all infrastructure and container output.
- Vector (The Collector): Runs as a DaemonSet on every node, capturing stdout/stderr logs from running pods along with node OS and audit logs, and normalizing them into JSON objects.
- Loki (The Storage Engine): Vector forwards the processed logs to a multi-tenant Loki store. Loki indexes logs by labels rather than by full text, which keeps the storage footprint small and label-based queries fast.
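The Vector-to-Loki pairing described above is usually declared in a single custom resource. The following is a minimal sketch, assuming the Logging 5.x API group `logging.openshift.io/v1` and a LokiStack named `logging-loki` (a hypothetical name) that the Loki Operator has already created; newer Logging releases use a different API group, so check the version installed on your cluster:

```yaml
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  managementState: Managed
  collection:
    type: vector              # run Vector as the per-node collector
  logStore:
    type: lokistack           # store logs in Loki rather than Elasticsearch
    lokistack:
      name: logging-loki      # hypothetical LokiStack name
```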
B. Distributed Tracing (OpenTelemetry & Jaeger)
For complex microservice environments where a web request hits dozens of decoupled backends, metrics and logs aren’t enough to trace bottlenecks. OpenShift relies on the Red Hat OpenShift distributed tracing platform.
- OpenTelemetry (OTel): Standardized collectors deployed into application namespaces receive spans from instrumented applications. The instrumentation itself comes from OTel SDKs or auto-instrumentation agents running inside the application runtime.
- Jaeger: Receives the trace data and renders each request as a timeline of spans, showing exactly how many milliseconds it spent inside an API gateway, a business service, or a database query.
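A collector that accepts OTLP traces and forwards them to Jaeger could look roughly like the sketch below. It assumes an OpenTelemetry operator version that serves the `opentelemetry.io/v1beta1` API (older versions take `config` as a string under `v1alpha1`). The namespace and the Jaeger endpoint `jaeger-collector.tracing-system.svc:4317` are hypothetical:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: finance-prod            # hypothetical application namespace
spec:
  mode: deployment
  config:
    receivers:
      otlp:                          # applications send spans here
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}                      # batch spans before export
    exporters:
      otlp:
        endpoint: jaeger-collector.tracing-system.svc:4317   # hypothetical Jaeger OTLP endpoint
        tls:
          insecure: true             # demo only; enable TLS in production
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
```

Keeping the collector between the applications and the tracing backend means the backend can change later without touching application configuration.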
4. Operational Practice: Enabling User Workload Monitoring
To allow developers to define their own custom metrics rules and scrape application payloads, a cluster administrator must explicitly activate the User Workload Monitoring domain.
Step 1: Create the Global Cluster Monitoring ConfigMap
Apply the configuration block below to the openshift-monitoring administrative namespace:
YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true  # Activates the isolated developer monitoring plane
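Enabling the domain does not by itself let developers create monitoring resources; they also need permission in their own namespace. One common approach, sketched here with a hypothetical user `alice` and the `finance-prod` namespace, is to bind the `monitoring-edit` cluster role that OpenShift ships for this purpose:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitoring-edit-alice        # hypothetical binding name
  namespace: finance-prod
subjects:
  - kind: User
    name: alice                      # hypothetical developer account
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: monitoring-edit              # allows managing ServiceMonitor, PodMonitor and PrometheusRule objects
  apiGroup: rbac.authorization.k8s.io
```

Because this is a namespaced RoleBinding, the permission applies only inside `finance-prod`, even though the role it references is cluster-scoped.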
Step 2: Declare an Application ServiceMonitor
Once activated, application engineers do not configure Prometheus endpoints manually. They deploy a ServiceMonitor manifest within their application namespace. The user-workload monitoring engine will automatically discover the resource and start pulling application performance metrics:
YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: springboot-app-monitor
  namespace: finance-prod
spec:
  selector:
    matchLabels:
      app: springboot-banking   # Selects Services carrying this label
  endpoints:
    - port: web                 # The named port inside the target Kubernetes Service
      path: /actuator/prometheus
      interval: 30s             # Scrape interval
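A ServiceMonitor selects Services, not pods, so it only finds targets if a Service carries the matching label and exposes a port with the matching name. The sketch below shows such a Service and, as an example of a custom rule, a PrometheusRule in the same namespace. The port number, alert name, and threshold are assumptions, and the metric `http_server_requests_seconds_count` is what Spring Boot's Micrometer integration exposes by default:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: springboot-banking
  namespace: finance-prod
  labels:
    app: springboot-banking          # matched by the ServiceMonitor selector
spec:
  selector:
    app: springboot-banking          # pods backing this Service
  ports:
    - name: web                      # must equal the ServiceMonitor endpoint port name
      port: 8080                     # assumed application port
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: springboot-app-alerts        # hypothetical rule name
  namespace: finance-prod
spec:
  groups:
    - name: springboot-banking.rules
      rules:
        - alert: HighHttp5xxRate     # hypothetical alert name
          expr: |
            sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
              /
            sum(rate(http_server_requests_seconds_count[5m])) > 0.05
          for: 10m                   # illustrative: fire after 10 minutes above 5%
          labels:
            severity: warning
          annotations:
            summary: More than 5% of HTTP requests are failing with 5xx responses.
```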