Fixing etcd Latency Issues in OpenShift

Troubleshooting etcd latency in OpenShift (OCP) is one of the most important skills for an OpenShift Architect because etcd performance directly impacts the Kubernetes API server and the entire cluster.

When etcd becomes slow, you’ll typically see:

oc get pods --> slow
oc apply --> slow
Operators --> degraded
Authentication --> slow
API Server --> high latency

Understanding the flow

User → API Server → etcd → Disk

Most etcd latency issues are actually caused by:

  1. Storage latency (most common)
  2. CPU starvation
  3. Memory pressure
  4. Network latency between masters
  5. Large etcd database
  6. Excessive API writes

Symptoms

Check cluster operators:

oc get co

Typical:

etcd Degraded=True
kube-apiserver Degraded=True

Step 1: Check etcd operator status
oc get co etcd

Detailed:

oc describe co etcd

Look for messages:

High fsync latency
High commit latency
Leader changes
Database size warning

Step 2: Check etcd metrics

Open the metrics view in the OpenShift web console:

Observe → Metrics

Important metrics:

WAL fsync latency
histogram_quantile(
0.99,
rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

Target:

< 10ms

Problem:

> 50ms

Backend commit latency
histogram_quantile(
0.99,
rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
)

Target:

< 25ms

Leader changes
increase(etcd_server_leader_changes_seen_total[1h])

Expected:

0

Frequent changes indicate network or resource issues.
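The latency targets above can be turned into a small check. This is only a sketch: the function name classify_p99 and the "above target" label for values between the target and the problem threshold are assumptions, not part of any OpenShift tooling. It expects the p99 value in seconds, which is the unit the queries above return.

```shell
# classify_p99 METRIC SECONDS
# METRIC is "wal_fsync" or "backend_commit" (hypothetical labels).
# SECONDS is the p99 value returned by the histogram_quantile query.
classify_p99() {
  awk -v metric="$1" -v secs="$2" 'BEGIN {
    ms = secs * 1000
    if (metric == "wal_fsync") {
      # Target < 10ms, problem > 50ms (thresholds from this article)
      if (ms < 10)       print "ok"
      else if (ms <= 50) print "above target"
      else               print "problem"
    } else if (metric == "backend_commit") {
      # Target < 25ms
      if (ms < 25) print "ok"
      else         print "above target"
    } else {
      print "unknown metric: " metric
    }
  }'
}
```

For example, classify_p99 wal_fsync 0.004 reports a healthy value, while classify_p99 wal_fsync 0.120 reports a problem.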


Step 3: Check API server latency

oc adm top pods -n openshift-kube-apiserver

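That command shows CPU and memory usage of the API server pods, not request latency. For latency, query the request duration histogram in Prometheus:

histogram_quantile(
0.99,
sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m]))
)

If API latency spikes at the same time as etcd latency, etcd is likely the bottleneck.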


Step 4: Check etcd pod health
oc get pods -n openshift-etcd

Should be:

etcd-master-0 Running
etcd-master-1 Running
etcd-master-2 Running

Check logs:

oc logs -n openshift-etcd etcd-master-0 -c etcd

Look for:

apply request took too long
waiting for ReadIndex response took too long
leader changed

These are classic etcd latency indicators.
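To get a rough sense of how often these messages appear, the log stream can be piped through a counter. A minimal sketch, assuming the three message strings listed above appear verbatim in the log lines; the function name count_latency_warnings is made up for illustration.

```shell
# count_latency_warnings: reads etcd log lines on stdin and prints how many
# of them contain one of the latency indicators listed in this article.
count_latency_warnings() {
  grep -c -E 'apply request took too long|waiting for ReadIndex response took too long|leader changed'
}

# Possible usage (pod name is the example name used in this article):
#   oc logs -n openshift-etcd etcd-master-0 -c etcd --since=1h | count_latency_warnings
```

A handful of hits over an hour may be noise; a steadily growing count is a stronger signal that the disk or network checks in the following steps are worth doing.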


Step 5: Check database size
oc exec -n openshift-etcd etcd-master-0 -c etcdctl \
-- etcdctl endpoint status -w table

Look at:

DB SIZE

Rule of thumb:

Size      Status
< 4GB     Good
4-8GB     Watch
> 8GB     Investigate
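The rule of thumb can be expressed as a small helper. This is a sketch under two assumptions: the input is a size in bytes (for example, the dbSize value that etcdctl endpoint status -w json is expected to report), and "GB" is read as GiB (1024^3 bytes). The function name classify_db_size is hypothetical.

```shell
# classify_db_size BYTES
# Maps an etcd database size in bytes to the rule-of-thumb status.
classify_db_size() {
  awk -v bytes="$1" 'BEGIN {
    gib = bytes / (1024 * 1024 * 1024)
    if (gib < 4)       print "good"
    else if (gib <= 8) print "watch"
    else               print "investigate"
  }'
}
```

For instance, classify_db_size 6442450944 (6 GiB) lands in the "watch" band.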

Step 6: Check storage latency (MOST COMMON)

Open a debug shell on a control plane node:

oc debug node/master-0
chroot /host

Run:

iostat -xm 5

Look at:

await (reported as r_await and w_await in newer sysstat versions; w_await matters most for etcd)

Healthy:

< 10ms

Problem:

> 20ms

Critical:

> 50ms

Also check:

sar -d 1 10
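Reading await values by eye gets tedious across several devices, so the output can be filtered. The sketch below locates the latency columns by header name, since the column layout differs between sysstat versions. The function name slow_devices is an assumption, and the parsing assumes the usual iostat -x layout with a header row starting with "Device".

```shell
# slow_devices THRESHOLD_MS
# Reads `iostat -x` output on stdin and prints "device column value"
# for every await-style column above the threshold.
slow_devices() {
  awk -v limit="$1" '
    # Header row: remember which columns hold latency values.
    $1 ~ /^Device/ {
      split("", cols)   # portable way to clear the array
      for (i = 2; i <= NF; i++)
        if ($i == "await" || $i == "r_await" || $i == "w_await")
          cols[i] = $i
      in_table = 1
      next
    }
    # A blank line ends the device table.
    NF == 0 { in_table = 0; next }
    in_table {
      for (i in cols)
        if ($i + 0 > limit + 0)
          print $1, cols[i], $i
    }
  '
}

# Possible usage, with the 20ms problem threshold from this article:
#   iostat -xm 5 2 | slow_devices 20
```

Note that the first report iostat prints covers the time since boot, so the later intervals are the ones that reflect current load.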

Step 7: Check CPU pressure
top

or

oc adm top nodes

Look for CPU > 80% on control plane nodes.


Step 8: Check memory pressure
free -g

or

oc adm top nodes

Check for swapping:

vmstat 1

Nonzero si/so (swap-in/swap-out) columns mean the node is swapping, and etcd performance drops dramatically when that happens.


Step 9: Check network latency between masters

From master:

ping <master2>
ping <master3>

Latency should be:

< 1ms

Also:

mtr <master2>

Look for:

  • packet loss
  • jitter
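The 1ms guideline can be checked from the ping summary line. A rough sketch, assuming the Linux iputils summary format ("rtt min/avg/max/mdev = ... ms"); the function names ping_avg_ms and rtt_ok are made up for illustration.

```shell
# ping_avg_ms: reads ping output on stdin and prints the average RTT in ms.
ping_avg_ms() {
  awk -F'=' '/min\/avg\/max/ { split($2, v, "/"); print v[2] }'
}

# rtt_ok AVG_MS: succeeds when the average is below the 1ms guideline.
rtt_ok() {
  awk -v avg="$1" 'BEGIN { exit !(avg + 0 < 1) }'
}

# Possible usage (<master2> is the placeholder used in this article):
#   avg=$(ping -c 10 <master2> | ping_avg_ms)
#   rtt_ok "$avg" || echo "average RTT ${avg}ms is above 1ms"
```

The average hides jitter, so the mdev field in the same summary line and the mtr output are still worth reading directly.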

Step 10: Check API churn

A less obvious cause is excessive load on the API server:

Too many updates
Too many watches
Bad operators

Check:

rate(apiserver_request_total[5m])

and:

apiserver_current_inflight_requests

Common Real-World Root Causes
Storage issue (70% of cases)

Symptoms:

High WAL fsync
High backend commit

Fix:

  • Faster disks
  • Premium SSD
  • Dedicated storage

Large etcd DB

Symptoms:

DB > 8GB

Fix:

etcdctl compact <revision>
etcdctl defrag

(OpenShift usually automates maintenance, but investigate excessive object growth.)
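Compaction needs a revision number to compact up to. The upstream etcd maintenance guide obtains it from the JSON form of the endpoint status; the sketch below follows the same idea. The function name current_revision is an assumption, and the parsing assumes the JSON contains a "revision" field as in that guide.

```shell
# current_revision: reads `etcdctl endpoint status -w json` on stdin
# and prints the first revision number it finds.
current_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Possible manual sequence, run inside the etcdctl container:
#   rev=$(etcdctl endpoint status -w json | current_revision)
#   etcdctl compact "$rev"
#   etcdctl defrag
```

Since OpenShift normally handles this maintenance itself, a manual run is best treated as a last resort after the source of the object growth has been identified.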


API storm

Example:

Misconfigured operator
Bad controller
Thousands of writes/sec

Fix:

Identify offending namespace/operator.


Leader instability

Symptoms:

Leader changes
Election storms

Fix:

Network troubleshooting between control plane nodes.


Quick Triage Flow

  1. etcd latency alert
  2. Check etcd operator
  3. Check WAL fsync latency
  4. Check backend commit latency
  5. Check storage (iostat)
  6. Check DB size
  7. Check leader changes
  8. Check API churn

Interview Answer (2-Minute Version)

When troubleshooting etcd latency in OpenShift, I first check the etcd operator and Prometheus metrics such as WAL fsync latency, backend commit latency, and leader changes. The most common root cause is storage latency, so I verify disk performance using iostat and ensure etcd is running on low-latency SSD storage. I also check etcd database size, API server request rates, CPU and memory utilization on control plane nodes, and network latency between masters. If needed, I investigate excessive API writes, large object counts, or leader election instability. Since etcd is the cluster’s source of truth, even small increases in disk latency can significantly impact API responsiveness and overall cluster health.
