Understanding PromQL: A Complete Guide

What is PromQL?

PromQL (Prometheus Query Language) is the query language used by Prometheus to retrieve, filter, aggregate, and analyze time-series metrics.

It is the primary language used in:

  • Prometheus UI
  • Grafana dashboards
  • Alerting rules
  • Recording rules

PromQL Data Model

Metrics are stored as:

metric_name{label1="value1",label2="value2"} value timestamp

Example:

node_cpu_seconds_total{instance="server1",mode="idle"} 12345

Where:

Component                 Meaning
node_cpu_seconds_total    Metric name
instance="server1"        Label
mode="idle"               Label
12345                     Metric value

Basic PromQL Examples

1. Show a Metric
up

Returns one series per scrape target, showing whether the most recent scrape succeeded.

Example:

up{instance="server1"} 1
up{instance="server2"} 1
  • 1 = healthy
  • 0 = down

2. Filter by Label
up{instance="server1:9100"}

Returns only the series whose instance label is exactly server1:9100.


3. Multiple Labels
node_cpu_seconds_total{
instance="server1:9100",
mode="idle"
}

Range Queries

Append a duration in square brackets to a selector to retrieve the samples recorded over that period.

Example:

node_cpu_seconds_total[5m]

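Returns every sample recorded in the last 5 minutes for each matching series (a range vector). Range vectors are usually passed to a function such as rate() rather than graphed directly.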


Rate Functions

One of the most common interview topics.

rate()

Calculates the average per-second rate of increase of a counter over the given range, compensating for counter resets.

Example:

rate(http_requests_total[5m])

Meaning:

On average, how many requests per second were served over the last 5 minutes?
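To make the calculation concrete, here is a minimal Python sketch of the idea behind rate(). It is a simplification rather than Prometheus's actual implementation: it divides the counter increase by the time between the first and last sample and treats any drop in value as a counter reset, but it skips the extrapolation to the window edges that Prometheus performs. The function name simple_rate and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) pairs.

    Simplified model: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate

    increase = 0.0
    prev_value = samples[0][1]
    for _, value in samples[1:]:
        if value < prev_value:
            # Counter reset: the counter restarted from zero, so the whole
            # new value counts as increase.
            increase += value
        else:
            increase += value - prev_value
        prev_value = value

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical http_requests_total samples scraped every 60 s over 5 minutes.
window = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 3400), (300, 4000)]
print(simple_rate(window))  # 10.0 requests per second
```

The counter grew by 3000 over 300 seconds, so the sketch reports 10 requests per second.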


irate()

Calculates the rate using only the two most recent samples.

irate(http_requests_total[5m])

More responsive but noisier.
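The difference is easier to see with numbers. The Python sketch below models irate() as "increase between the last two samples divided by the time between them". It is an illustration under that assumption, not Prometheus's code, and simple_irate and the sample data are hypothetical.

```python
def simple_irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # If the counter reset between the two samples, count the new value
    # as the increase.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


# Hypothetical samples: steady 10 req/s, then a burst in the final minute.
window = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 3400), (300, 7000)]
print(simple_irate(window))  # 60.0, the burst dominates
```

Averaged from the first to the last sample, the same window works out to 20 req/s, which is why irate() reacts faster to spikes and also looks noisier on a graph.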


CPU Usage Example

Node Exporter provides:

node_cpu_seconds_total

Idle CPU, as a fraction between 0 and 1:

avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

CPU Usage Percentage:

100 - (
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
)

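Very common Grafana dashboard query. As written, it averages over every core of every instance; use avg by (instance) (...) to get one value per host.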
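Why does this work? A core's idle counter can grow by at most one second per second, so rate() of mode="idle" is the fraction of time that core spent idle. Here is a rough Python sketch of the arithmetic, using made-up per-core values and a hypothetical helper name, cpu_usage_percent:

```python
def cpu_usage_percent(idle_rates):
    """idle_rates: per-core rate() of mode="idle", each between 0 and 1."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle fractions for four cores over the last 5 minutes.
print(cpu_usage_percent([0.5, 0.75, 0.25, 1.0]))  # 37.5
```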


Memory Usage Example

Used Memory:

node_memory_MemTotal_bytes
-
node_memory_MemAvailable_bytes

Memory Percentage:

(
(node_memory_MemTotal_bytes
-
node_memory_MemAvailable_bytes)
/
node_memory_MemTotal_bytes
) * 100

Aggregation Functions

sum()
sum(http_requests_total)

Adds the values of all matching series into a single result.


avg()
avg(node_load1)

Average 1-minute load across all instances.


max()
max(node_memory_MemAvailable_bytes)

Highest value across all matching series.


min()
min(node_memory_MemAvailable_bytes)

Lowest value across all matching series.


Group By

Example:

sum(rate(http_requests_total[5m])) by (instance)

Output:

server1 = 100 req/s
server2 = 150 req/s
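Conceptually, by (instance) drops every label except instance and adds up the series that end up with the same label value. A small Python sketch of that grouping, where sum_by and the sample series are invented for illustration:

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum the values of series that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, value in series:
        # A missing label is treated like an empty label value.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-series request rates, as rate(http_requests_total[5m]) might return.
rates = [
    ({"instance": "server1", "method": "GET"}, 80.0),
    ({"instance": "server1", "method": "POST"}, 20.0),
    ({"instance": "server2", "method": "GET"}, 150.0),
]
print(sum_by(rates, "instance"))  # {'server1': 100.0, 'server2': 150.0}
```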

Top Consumers

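Top 5 CPU-consuming pods:

topk(
  5,
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  by (pod)
)

The container!="" matcher skips the pod-level cgroup series that cAdvisor also exports, which would otherwise be counted twice.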

Very common in Kubernetes/OpenShift interviews.
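In plain terms, topk(5, ...) keeps the five series with the largest values. A minimal Python equivalent, assuming the per-pod sums have already been computed (the pod names and values are made up):

```python
import heapq


def topk(k, series):
    """Return the k (name, value) pairs with the largest values, largest first."""
    return heapq.nlargest(k, series, key=lambda item: item[1])


# Hypothetical per-pod CPU usage in cores.
pod_cpu = [("api", 1.2), ("db", 2.5), ("cache", 0.3), ("worker", 1.9)]
print(topk(2, pod_cpu))  # [('db', 2.5), ('worker', 1.9)]
```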


Kubernetes Examples

Pod Count
count(kube_pod_info)

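Running Pods

sum(kube_pod_status_phase{phase="Running"})

kube-state-metrics exports this metric for every pod and every phase with a value of 1 or 0, so sum the values; count() would count every pod regardless of its phase.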

Node Count
count(kube_node_info)

OpenShift Examples

API Server Latency
histogram_quantile(
0.99,
sum(rate(apiserver_request_duration_seconds_bucket[5m]))
by (le)
)

etcd Latency
histogram_quantile(
0.99,
rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

OVN Pod Status
up{job="ovn-kubernetes-node"}

Alert Rule Example

CPU > 80%

groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPU
        # Block scalar keeps the multi-line expression valid YAML.
        expr: |
          100 - (
            avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
          ) > 80
        for: 5m

Meaning:

  • Average CPU usage is above 80%
  • The condition holds continuously for 5 minutes (the alert is pending until then)
  • The alert fires
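The pending-then-firing behaviour of the for clause can be modelled with a few lines of Python. This is a sketch that assumes one rule evaluation per minute; alert_states and the input data are hypothetical, not part of Prometheus.

```python
def alert_states(condition_by_minute, for_minutes):
    """Return the alert state after each evaluation (one evaluation per minute)."""
    states = []
    active_since = None  # minute at which the condition last became true
    for minute, condition in enumerate(condition_by_minute):
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = minute
        held_for = minute - active_since
        states.append("firing" if held_for >= for_minutes else "pending")
    return states


# Hypothetical results of "CPU usage > 80" over ten evaluations.
usage_above_80 = [False, True, True, False, True, True, True, True, True, True]
print(alert_states(usage_above_80, 5))
```

In this run the first spike clears before 5 minutes pass and never fires; the second one stays pending from minute 4 and fires at minute 9.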

Recording Rule Example

Instead of calculating CPU usage on every dashboard refresh:

- record: node:cpu_usage:avg
  # Block scalar keeps the multi-line expression valid YAML.
  expr: |
    100 - (
      avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
    )

Then dashboards query:

node:cpu_usage:avg

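This improves dashboard performance because Prometheus evaluates the expression once per rule evaluation interval and stores the result as a new time series.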


Common Interview Questions

What is the difference between rate() and irate()?
rate()                                    irate()
Averages over the whole range             Uses the last two samples
Smoother                                  More responsive
Good for alerts and slow-moving counters  Good for graphing volatile, fast-moving counters

What is a counter?

A cumulative metric that only increases, or resets to zero when the process restarts.

Examples:

http_requests_total
container_cpu_usage_seconds_total

What is a gauge?

A metric that can increase or decrease.

Examples:

node_memory_MemAvailable_bytes
node_load1

What is a histogram?

A metric that counts observations into cumulative buckets (identified by the le label) and is used to measure distributions such as latency.

Example:

http_request_duration_seconds_bucket
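To show how a quantile comes out of bucket counts, here is a simplified Python sketch of the linear interpolation that histogram_quantile() applies to classic histograms. It assumes cumulative buckets sorted by upper bound with a lowest boundary of 0, and it ignores edge cases such as negative bounds. The bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second rates of http_request_duration_seconds_bucket, keyed by le.
buckets = [(0.25, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.5, buckets))   # 0.25
print(histogram_quantile(0.75, buckets))  # 0.40625
```

Because the result is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.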

What is cardinality?

The number of unique time series, i.e. unique combinations of metric name and label values.

Example:

http_requests_total{user="1"}
http_requests_total{user="2"}
http_requests_total{user="3"}
...

Labels with millions of unique values, such as a user ID, create high cardinality, which can cause Prometheus performance and memory issues.
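A quick back-of-the-envelope check helps here: in the worst case, the number of series for one metric is the product of the number of distinct values of each label. The Python sketch below uses invented label counts, and worst_case_series is a hypothetical helper name.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label set for http_requests_total.
labels = {"method": 5, "status": 10, "instance": 20}
print(worst_case_series(labels))                       # 1000
print(worst_case_series({**labels, "user": 100_000}))  # 100000000
```

Adding a single per-user label turns a thousand potential series into a hundred million, which is why unbounded values should stay out of labels.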

Interview answer

PromQL is Prometheus’s query language used to retrieve, filter, aggregate, and calculate metrics. It supports functions such as rate(), sum(), avg(), histogram_quantile(), and label filtering, making it the foundation for Grafana dashboards, alerting rules, and monitoring in Kubernetes and OpenShift environments.
