Understanding OpenShift’s Observability Stack

Monitoring and observability in Red Hat OpenShift Container Platform (OCP) are built on a Prometheus-based telemetry stack that is deployed and managed by the Cluster Monitoring Operator (CMO).

Unlike upstream Kubernetes, where you must install and configure the Prometheus Operator, persistent storage, and scrape targets yourself, OpenShift ships a preconfigured, integrated monitoring stack as part of the platform.

1. Architectural Blueprint: Dual Stack Monitoring

OpenShift splits its monitoring architecture into two separate functional domains so that a spike in application-level telemetry cannot take down visibility into the cluster control plane.

Platform Monitoring (Managed by Red Hat)

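Components: Prometheus, Alertmanager, Thanos Querier, and exporters such as node-exporter and kube-state-metrics. Grafana was bundled in older releases but was removed from the platform stack in OpenShift 4.11.
Scope: Dedicated strictly to tracking infrastructure components, node health, and core OpenShift operators (e.g., API server, etcd, SDN/OVN).
Governance: Configured out of the box and reconciled by the CMO; only a supported set of options can be changed, through the cluster-monitoring-config ConfigMap. Standard cluster users cannot modify it. On managed OpenShift offerings, Red Hat SREs use this data to assess cluster health and drive automated alerts.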

User Workload Monitoring (UWM)

Scope: Dedicated to developers and application SREs for scraping custom application metrics (e.g., JVM stats, HTTP request rates).
Isolation: Runs its own Prometheus instances (each with a Thanos sidecar) in the openshift-user-workload-monitoring namespace, so that a massive surge in developer metrics cannot starve the platform Prometheus of memory.
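The isolation can be reinforced with explicit limits on the user workload Prometheus. The following is a minimal sketch, assuming user workload monitoring is already enabled; the numbers are illustrative assumptions, not recommendations, and should be sized for your own cluster.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h              # how long user workload samples are kept locally
      enforcedSampleLimit: 50000  # assumption: per-scrape cap on accepted samples
      resources:
        requests:
          cpu: 200m               # assumption: illustrative sizing only
          memory: 2Gi
        limits:
          memory: 4Gi
```

A target that exceeds the sample limit has its scrape rejected, which keeps one misbehaving application from inflating the memory use of the shared user workload Prometheus.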

2. Deep Dive: Global Observability via Thanos

On multi-node clusters, OpenShift runs each Prometheus as a pair of replicas, typically spread across nodes or availability zones, and it keeps platform and user workload metrics in separate instances. No single Prometheus therefore holds the complete picture, and querying one replica directly can show gaps whenever that replica restarts. OpenShift solves this by incorporating Thanos into the native monitoring architecture as a single, deduplicating query layer.

Plaintext

                      ┌───────────────────┐
                      │   Thanos Querier  │◀── [Grafana / Console]
                      └─────────┬─────────┘
                                │ Queries via gRPC
               ┌────────────────┴─────────────────┐
               ▼                                  ▼
 ┌───────────────────────────┐      ┌───────────────────────────┐
 │  Prometheus pod (Zone A)  │      │  Prometheus pod (Zone B)  │
 │ ┌────────────┐┌─────────┐ │      │ ┌────────────┐┌─────────┐ │
 │ │ Prometheus ││ Thanos  │ │      │ │ Prometheus ││ Thanos  │ │
 │ │(Local TSDB)││Sidecar  │ │      │ │(Local TSDB)││Sidecar  │ │
 │ └────────────┘└─────────┘ │      │ └────────────┘└─────────┘ │
 └─────────────┬─────────────┘      └─────────────┬─────────────┘
               │ Scrapes                          │ Scrapes
               ▼                                  ▼
 ┌──────────────────────────────────────────────────────────────┐
 │     Scrape targets: platform components or applications      │
 └──────────────────────────────────────────────────────────────┘

Thanos Sidecar: Every Prometheus pod runs a Thanos Sidecar container next to Prometheus. The sidecar exposes the local Time Series Database (TSDB) over the Thanos Store API (gRPC) so that the Querier can read it. In upstream Thanos deployments the sidecar can also ship TSDB blocks to S3-compatible object storage (e.g., AWS S3, Azure Blob, ODF/Ceph), but the in-cluster stack managed by the CMO does not configure that upload. Retention is instead governed by the Prometheus retention settings, an optional persistent volume, and optional remote write to an external system.
Thanos Querier: When an engineer queries a metric through a Grafana dashboard or the OpenShift console, the request goes to the Thanos Querier. It fans the query out to the sidecars of the platform and user workload Prometheus replicas, merges the results, and deduplicates on the fly the data points that the replicas collected twice.
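How long metrics survive, and whether they outlive a pod restart, is set on the platform Prometheus through the cluster-monitoring-config ConfigMap. The following is a minimal sketch; the storage class name, volume size, and remote-write URL are placeholders assumed for illustration and must be replaced with values from your environment.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d                 # how long samples stay in the local TSDB
      volumeClaimTemplate:           # persistent volume so data survives pod restarts
        spec:
          storageClassName: example-storage-class   # assumption: use a real storage class
          resources:
            requests:
              storage: 40Gi          # assumption: size for your metric volume
      remoteWrite:                   # optional: forward samples to an external long-term store
        - url: "https://metrics.example.com/api/v1/write"   # hypothetical endpoint
```

This is the same ConfigMap that carries the enableUserWorkload switch, so all keys belong in one config.yaml; applying a second manifest with only some of the keys would overwrite the others.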

3. The Full Observability Pillars: Logging & Tracing

True observability extends beyond metrics. OpenShift packages native operators to handle the other two pillars of observability: Logs and Distributed Traces.

A. Centralized Log Aggregation (Vector & Loki)

The Red Hat OpenShift Logging Operator deploys and manages a centralized collection pipeline for infrastructure, audit, and application container logs.

Vector (The Collector): Runs as a DaemonSet on every node, collecting the stdout/stderr logs of running pods along with node journal and audit logs, and normalizing them into structured JSON records.
Loki (The Storage Engine): Vector forwards the processed logs to a multi-tenant Loki deployment (a LokiStack managed by the Loki Operator). Loki indexes only a small set of labels rather than the full log text, which keeps the index and storage footprint small; the log content itself is filtered at query time.
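The Loki side of this pipeline is declared as a LokiStack custom resource. The following is a minimal sketch, assuming the Loki Operator is installed; the resource name, the secret holding the object storage credentials, and the storage class are hypothetical, and the available size values depend on the operator version.

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki                 # hypothetical name
  namespace: openshift-logging
spec:
  size: 1x.small                     # assumption: pick a size supported by your operator version
  storage:
    secret:
      name: logging-loki-s3          # hypothetical secret with object storage credentials
      type: s3
  storageClassName: example-storage-class   # assumption: use a real storage class
  tenants:
    mode: openshift-logging          # separates application, infrastructure, and audit tenants
```

The tenants mode is what makes the store multi-tenant: application, infrastructure, and audit logs are kept apart and access is checked against OpenShift permissions.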

B. Distributed Tracing (OpenTelemetry & Jaeger)

For complex microservice environments where a web request hits dozens of decoupled backends, metrics and logs aren’t enough to trace bottlenecks. OpenShift relies on the Red Hat OpenShift distributed tracing platform.

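OpenTelemetry (OTel): The Red Hat build of OpenTelemetry deploys collectors that receive, process, and export telemetry. Application code is instrumented either with the OpenTelemetry SDKs or, where supported, by agents that the operator injects into application pods.
Jaeger: Receives the tracing data and renders each request as a timeline of spans, showing exactly how many milliseconds it spent inside an API gateway, a business service, or a database query. Recent releases of the distributed tracing platform deprecate Jaeger in favor of Tempo as the trace store, so check which backend your version supports.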
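Agent injection is driven by an Instrumentation resource from the OpenTelemetry Operator. The following is a minimal sketch for a Java workload, assuming the operator is installed and a collector Service is reachable at the address shown; the resource name, namespace, and endpoint are hypothetical, and the support status of auto-instrumentation on OpenShift varies by release.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation         # hypothetical name
  namespace: finance-prod            # assumption: the application namespace
spec:
  exporter:
    # hypothetical collector Service; 4318 is OTLP over HTTP, 4317 is OTLP over gRPC
    endpoint: http://otel-collector.finance-prod.svc:4318
  propagators:
    - tracecontext                   # W3C trace context headers
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"                 # sample roughly a quarter of new traces
```

A workload opts in by carrying the annotation `instrumentation.opentelemetry.io/inject-java: "true"` on its pod template; pods without the annotation are left untouched.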

4. Operational Practice: Enabling User Workload Monitoring

To allow developers to define their own alerting and recording rules and to have their application endpoints scraped, a cluster administrator must explicitly enable the User Workload Monitoring domain.

Step 1: Create the Global Cluster Monitoring ConfigMap

Apply the configuration block below to the openshift-monitoring administrative namespace:

YAML

			
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true # Activates the isolated developer monitoring plane
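Enabling the stack does not necessarily let non-administrator users create monitoring resources. One way to delegate this is to bind the monitoring-edit cluster role inside the application namespace. The following is a minimal sketch; the binding name, namespace, and user are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitoring-edit-developers   # hypothetical name
  namespace: finance-prod            # assumption: the application namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring-edit              # allows managing ServiceMonitor, PodMonitor, and PrometheusRule objects
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: developer@example.com      # hypothetical user
```

Because this is a RoleBinding rather than a ClusterRoleBinding, the permission applies only within that one namespace.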

		

Step 2: Declare an Application `ServiceMonitor`

Once activated, application engineers do not configure Prometheus endpoints manually. They deploy a ServiceMonitor manifest within their application namespace. The user-workload monitoring engine will automatically discover the resource and start pulling application performance metrics:

YAML

			
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: springboot-app-monitor
  namespace: finance-prod
spec:
  selector:
    matchLabels:
      app: springboot-banking # Selects Services that carry this label
  endpoints:
    - port: web               # Name of the port in the target Kubernetes Service
      path: /actuator/prometheus
      interval: 30s           # Scrape interval
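The ServiceMonitor selects Services, not pods, so it only finds targets if a matching Service exists. The following sketch shows a Service that the manifest above would pick up, followed by an example alerting rule. The Service name, port number, rule name, and threshold are assumptions for illustration, and the metric name is the one Micrometer typically exposes for Spring Boot, so verify it against your application's own output.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: springboot-banking           # hypothetical Service name
  namespace: finance-prod
  labels:
    app: springboot-banking          # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: springboot-banking          # assumption: the pods carry the same label
  ports:
    - name: web                      # must equal endpoints[].port in the ServiceMonitor
      port: 8080                     # assumption: the application listens on 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: springboot-app-rules         # hypothetical name
  namespace: finance-prod
spec:
  groups:
    - name: springboot-banking.rules
      rules:
        - alert: HighServerErrorRate
          # assumption: metric name as typically exposed by Micrometer
          expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) > 1
          for: 10m                   # must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: Sustained HTTP 5xx responses from springboot-banking
```

If no targets appear after applying the ServiceMonitor, a mismatch between the Service label or port name and the values in the ServiceMonitor is a common cause.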

		

Infra Cloud Solutions
