Understanding OpenShift’s Observability Stack

Monitoring and observability in Red Hat OpenShift Container Platform (OCP) are built on a highly resilient, enterprise-grade telemetry stack managed by the Cluster Monitoring Operator (CMO).

Unlike upstream Kubernetes, where you have to manually configure Prometheus operators, persistent storage, and scrapers, OpenShift provides a fully integrated, out-of-the-box observability ecosystem.

1. Architectural Blueprint: Dual Stack Monitoring

OpenShift splits its monitoring architecture into two completely isolated functional domains to prevent application-level telemetry spikes from crashing cluster control plane visibility.

Platform Monitoring (Managed by Red Hat)
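  • Components: Prometheus, Alertmanager, Thanos Querier, and exporters such as node-exporter and kube-state-metrics. (Grafana shipped with older releases but was removed from the platform stack in OpenShift 4.11.)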
  • Scope: Dedicated strictly to tracking infrastructure components, node health, and core OpenShift operators (e.g., API server, etcd, SDN/OVN).
  • Governance: Configured out of the box and read-only to standard cluster users. The built-in alerts are driven by this data, and Red Hat uses a subset of it (through Telemetry, or directly by Red Hat SREs on managed offerings) to assess cluster health.

User Workload Monitoring (UWM)
  • Scope: Dedicated to developers and application SREs for scraping custom application metrics (e.g., JVM stats, HTTP request rates).
  • Isolation: Runs its own Prometheus instances (with Thanos sidecars) and a Thanos Ruler in the openshift-user-workload-monitoring namespace, so that a surge in developer metrics cannot starve the platform Prometheus, and with it control plane visibility, of memory.
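
As an illustration of how the two domains are tuned separately, the user workload stack reads its settings from its own ConfigMap, user-workload-monitoring-config, in the openshift-user-workload-monitoring namespace. The sketch below caps retention, memory, and samples per scrape for the UWM Prometheus. All values are assumptions chosen for illustration, not recommendations, and the available keys can differ between OpenShift releases.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h               # illustrative: keep developer metrics for one day
      enforcedSampleLimit: 50000   # illustrative: reject scrapes exposing more samples than this
      resources:
        requests:
          cpu: 200m
          memory: 1Gi
        limits:
          memory: 2Gi              # illustrative ceiling for the UWM Prometheus container
```

Because these limits apply only to the user workload Prometheus, a noisy application hits its own ceiling instead of competing with the platform instances.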

2. Deep Dive: Global Observability via Thanos

Because OpenShift clusters frequently run across multiple availability zones or scale to hundreds of nodes, keeping raw metrics on short-lived disk volumes is dangerous and expensive. OpenShift solves this by incorporating Thanos into the native monitoring architecture.

Plaintext

 ┌───────────────────────────┐      ┌───────────────────────────┐
 │   Worker Node (Zone A)    │      │   Worker Node (Zone B)    │
 │ ┌────────────┐┌─────────┐ │      │ ┌────────────┐┌─────────┐ │
 │ │ Prometheus ││ Thanos  │ │      │ │ Prometheus ││ Thanos  │ │
 │ │(Local TSDB)││Sidecar  │ │      │ │(Local TSDB)││Sidecar  │ │
 └───────┬────────────┬──────┘      └───────┬────────────┬──────┘
         │            │                     │            │
         │ Scrapes    │ Pushes              │ Scrapes    │ Pushes
         ▼            ▼                     ▼            ▼
 ┌──────────────┐   ┌───────────────────────────────────────────┐
 │ Applications │   │          Enterprise S3 Object Storage     │
 └──────────────┘   └─────────────────────▲─────────────────────┘
                                          │
                                          │ Queries via gRPC
                                    ┌─────┴─────────────┐
                                    │    Thanos Querier │◀── [Grafana / Console]
                                    └───────────────────┘


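  • Thanos Sidecar: Every Prometheus pod runs a Thanos Sidecar container that exposes the local Time Series Database (TSDB) to the Thanos Querier over gRPC. Where Thanos is configured with an object storage bucket, the sidecar also uploads completed TSDB blocks to long-term S3-compatible Object Storage (e.g., AWS S3, Azure Blob, ODF/Ceph). The default in-cluster stack keeps metrics on local persistent volumes, so long-term retention is typically added through remote write or a multicluster observability layer.
  • Thanos Querier: When an engineer queries a metric through a Grafana dashboard or the OpenShift console, the Thanos Querier serves the request. It fans the query out to the Prometheus instances through their sidecars and, where object storage is configured, to the historical blocks stored there, then merges the results and deduplicates samples from replicated Prometheus pods on the fly.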
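
To make the storage concern concrete, here is a minimal sketch of how the platform Prometheus can be given persistent storage, an explicit retention window, and a remote write target for long-term retention. The storage class name, volume size, and remote write URL are hypothetical placeholders. The settings live in the same cluster-monitoring-config ConfigMap used in section 4, so in practice the keys are merged into one config.yaml.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d                      # how long samples stay in the local TSDB
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd      # hypothetical storage class
          resources:
            requests:
              storage: 100Gi              # illustrative size
      remoteWrite:
        - url: "https://metrics.example.com/api/v1/write"   # hypothetical long-term store
```

Without a volumeClaimTemplate, the Prometheus pods use ephemeral storage and lose their metrics when they are rescheduled.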

3. The Full Observability Pillars: Logging & Tracing

True observability extends beyond metrics. OpenShift packages native operators to handle the other two pillars of observability: Logs and Distributed Traces.

A. Centralized Log Aggregation (Vector & Loki)

The Red Hat OpenShift Logging Operator deploys and manages centralized log collection for all infrastructure and container output.

  • Vector (The Collector): Runs as a DaemonSet on every node, collecting stdout/stderr from running pods along with node (journal) and audit logs, and normalizing them into structured JSON records.
  • Loki (The Storage Engine): Vector forwards the processed logs to a multi-tenant LokiStack deployment. Loki indexes only a small set of labels rather than the full log text, which keeps the storage footprint small and label-scoped queries fast.
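
The collector-to-store flow described above is declared through a ClusterLogForwarder resource. The sketch below assumes the Logging 5.x API (logging.openshift.io/v1), where the resource is named instance and the default output points at the log store configured for the cluster; newer Logging releases use a different API group and schema, so treat this as an illustration of the shape rather than a manifest to copy. The pipeline name is hypothetical.

```yaml
apiVersion: logging.openshift.io/v1      # assumption: Logging 5.x API
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines:
    - name: all-logs-to-default-store    # hypothetical pipeline name
      inputRefs:
        - application                    # container stdout/stderr from user projects
        - infrastructure                 # platform pods and node journal logs
        - audit                          # API server and node audit logs
      outputRefs:
        - default                        # the cluster's configured log store
```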

B. Distributed Tracing (OpenTelemetry & Jaeger)

For complex microservice environments where a web request hits dozens of decoupled backends, metrics and logs aren’t enough to trace bottlenecks. OpenShift relies on the Red Hat OpenShift distributed tracing platform.

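  • OpenTelemetry (OTel): Collectors deployed in or alongside application namespaces receive, process, and export telemetry. The application code itself is instrumented with OpenTelemetry SDKs or, where supported, through auto-instrumentation injected by the operator.
  • Jaeger: Receives the trace data and renders each request as a timeline of spans, showing how many milliseconds it spent in an API gateway, a business service, or a database query. (Recent releases of the distributed tracing platform deprecate Jaeger in favor of Tempo.)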
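
A collector of the kind described above can be sketched as an OpenTelemetryCollector resource. This example assumes the v1beta1 API of the OpenTelemetry Operator, where the collector configuration is structured YAML; older operator versions use v1alpha1 with the configuration as a string. The resource name, namespace, and exporter endpoint are hypothetical.

```yaml
apiVersion: opentelemetry.io/v1beta1     # assumption: operator version serving v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel                             # hypothetical name
  namespace: finance-prod                # hypothetical application namespace
spec:
  mode: deployment
  config:
    receivers:
      otlp:                              # applications send spans here over OTLP
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}                          # batch spans before export
    exporters:
      otlp:
        endpoint: jaeger-collector.tracing.svc:4317   # hypothetical tracing backend
        tls:
          insecure: true                 # illustration only; use TLS in real clusters
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
```

Applications would then point their OTLP exporter at the collector's Service instead of talking to the tracing backend directly.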

4. Operational Practice: Enabling User Workload Monitoring

To let developers define their own alerting and recording rules and scrape metrics from their applications, a cluster administrator must explicitly enable the User Workload Monitoring domain.

Step 1: Create the Global Cluster Monitoring ConfigMap

Apply the configuration block below to the openshift-monitoring administrative namespace:

YAML

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true # Activates the isolated developer monitoring plane

Step 2: Declare an Application ServiceMonitor

Once activated, application engineers do not configure Prometheus endpoints manually. They deploy a ServiceMonitor manifest within their application namespace. The user-workload monitoring engine will automatically discover the resource and start pulling application performance metrics:

YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: springboot-app-monitor
  namespace: finance-prod
spec:
  selector:
    matchLabels:
      app: springboot-banking # Selects Services carrying this label
  endpoints:
    - port: web # The named port inside the target Kubernetes Service
      path: /actuator/prometheus
      interval: 30s # Scrape interval
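
A ServiceMonitor selects Services, not pods, so it only finds targets when a matching Service exists. The sketch below shows a Service that the manifest above would match, followed by a user-defined alerting rule in the same namespace. The Service name, port number, rule names, threshold, and the http_server_requests_seconds_count metric (the name Spring Boot's Micrometer integration commonly exposes) are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: springboot-banking               # hypothetical Service name
  namespace: finance-prod
  labels:
    app: springboot-banking              # matched by the ServiceMonitor's selector
spec:
  selector:
    app: springboot-banking              # assumed pod label
  ports:
    - name: web                          # referenced by the ServiceMonitor's endpoint port
      port: 8080                         # assumed application port
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: springboot-banking-rules         # hypothetical rule name
  namespace: finance-prod
spec:
  groups:
    - name: springboot-banking.rules
      rules:
        - alert: HighHttpErrorRate
          # Fires when more than 5% of requests return a 5xx status for 10 minutes.
          expr: |
            sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
              /
            sum(rate(http_server_requests_seconds_count[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: More than 5% of HTTP requests are failing with 5xx responses.
```

If the port name or the Service label does not match the ServiceMonitor, no targets are discovered and the alert expression returns no data.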
