Fixing etcd Latency Issues in OpenShift

Troubleshooting etcd latency is one of the most important skills for an OpenShift (OCP) architect: every Kubernetes API server read and write goes through etcd, so etcd performance directly affects the entire cluster.

When etcd becomes slow, you’ll typically see:

			
oc get pods      --> slow
oc apply         --> slow
Operators        --> degraded
Authentication   --> slow
API Server       --> high latency

		

Understanding the flow

			
User
 ↓
API Server
 ↓
etcd
 ↓
Disk

		

Most etcd latency issues are actually caused by:

Storage latency (most common)
CPU starvation
Memory pressure
Network latency between masters
Large etcd database
Excessive API writes

Symptoms

Check cluster operators:

oc get co

Typical signs when etcd is struggling (the DEGRADED column shows True):

etcd Degraded=True
kube-apiserver Degraded=True

Step 1: Check etcd operator status

oc get co etcd

Detailed:

oc describe co etcd

Look for status conditions and messages that mention:

			
High fsync latency
High commit latency
Leader changes
Database size warning

Step 2: Check etcd metrics

In the web console, open Observe → Metrics and run the queries below.

Important metrics:

WAL fsync latency

			
histogram_quantile(
  0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
)

Target:

< 10ms (the query returns seconds, so below 0.01)

Problem:

> 50ms
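To make the query above less of a black box, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets. It is a simplified take on the interpolation that histogram_quantile performs, not Prometheus's actual implementation, and the bucket values in `sample` are hypothetical.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes at least two buckets, the last one being +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical bucket data: (upper bound in seconds, cumulative count).
sample = [(0.001, 0), (0.002, 10), (0.004, 50), (0.008, 90),
          (0.016, 99), (0.032, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
print(f"p99 WAL fsync: {p99 * 1000:.1f} ms")
```

The result is only as precise as the bucket boundaries allow, which is why a p99 often lands exactly on a bucket's upper bound.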

Backend commit latency

			
histogram_quantile(
  0.99,
  rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
)

Target:

< 25ms

Leader changes

increase(etcd_server_leader_changes_seen_total[1h])

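Expected:

0 in steady state. An occasional change during an upgrade or a control plane node reboot is normal; frequent changes indicate network or resource issues.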
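As a rough illustration of what increase() measures, the sketch below sums the growth of a counter across ordered samples and tolerates counter resets. It is a simplification: Prometheus also extrapolates to the window boundaries, which is omitted here. The function name and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter restarted from zero (e.g. the pod restarted),
            # so everything counted since the restart is new.
            total += cur
    return total

# Hypothetical scrapes of etcd_server_leader_changes_seen_total over one hour.
print(counter_increase([3, 3, 4, 4]))
```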

Step 3: Check API server latency

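Check resource usage of the API server pods:

oc adm top pods -n openshift-kube-apiserver

This shows CPU and memory, not request latency. For latency, use the histogram metric (query its _bucket series with histogram_quantile, as in Step 2):

apiserver_request_duration_seconds

If API latency spikes at the same time as etcd latency, etcd is likely the bottleneck.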

Step 4: Check etcd pod health

oc get pods -n openshift-etcd

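There should be one Running etcd pod per control plane node. Pods are named etcd-<node-name>; master-0, master-1 and master-2 are example node names:

etcd-master-0 Running
etcd-master-1 Running
etcd-master-2 Running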

Check logs (the pod runs several containers, so select the etcd container):

oc logs -n openshift-etcd etcd-master-0 -c etcd

Look for:

			
apply request took too long
waiting for ReadIndex response took too long
leader changed

These are classic etcd latency indicators.
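Counting how often each indicator appears gives a quick sense of severity. The sketch below does this for a list of log lines; the function name and the sample lines are hypothetical, and only the three phrases listed above are matched.

```python
from collections import Counter

INDICATORS = (
    "apply request took too long",
    "waiting for ReadIndex response took too long",
    "leader changed",
)

def count_indicators(log_lines):
    """Count occurrences of each latency indicator phrase in the given lines."""
    counts = Counter()
    for line in log_lines:
        for phrase in INDICATORS:
            if phrase in line:
                counts[phrase] += 1
    return counts

# Hypothetical log lines, shaped loosely like etcd's JSON logs.
sample_lines = [
    '{"level":"warn","msg":"apply request took too long","took":"1.2s"}',
    '{"level":"info","msg":"compaction finished"}',
    '{"level":"warn","msg":"apply request took too long","took":"2.4s"}',
    '{"level":"warn","msg":"waiting for ReadIndex response took too long, retrying"}',
]
for phrase, n in count_indicators(sample_lines).most_common():
    print(f"{n:4d}  {phrase}")
```

To use it on real logs, save the output of the oc logs command to a file and pass the file's lines to count_indicators.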

Step 5: Check database size

			
oc exec -n openshift-etcd -c etcdctl etcd-master-0 \
  -- etcdctl endpoint status --cluster -w table

Look at:

DB SIZE

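Rule of thumb:

Size      Status
< 4GB     Good
4-8GB     Watch
> 8GB     Investigate

Keep in mind that etcd enforces a backend quota (8 GiB by default in OpenShift). Once the database reaches the quota, etcd raises a NOSPACE alarm and rejects writes, so treat a database approaching 8GB as urgent rather than waiting for it to cross the line.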
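The rule of thumb above is easy to encode if you collect DB SIZE values in bytes. This is a minimal sketch; it assumes binary gigabytes (GiB), and the function name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: sizes are interpreted as binary gigabytes

def classify_db_size(size_bytes):
    """Map an etcd DB size in bytes to the rule-of-thumb status."""
    size_gib = size_bytes / GIB
    if size_gib < 4:
        return "good"
    if size_gib <= 8:
        return "watch"
    return "investigate"

print(classify_db_size(5 * GIB))
```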

Step 6: Check storage latency (MOST COMMON)

Open a debug shell on a control plane node (master-0 is an example node name):

oc debug node/master-0
chroot /host

Run:

iostat -xm 5

Look at:

await (shown as r_await and w_await in recent sysstat versions; w_await matters most for etcd)

Healthy:

< 10ms

Problem:

> 20ms

Critical:

> 50ms
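When comparing several nodes, it can help to extract the await columns programmatically. The sketch below parses iostat -x style output by column name and applies the thresholds above. The sample text is hypothetical and trimmed to a few columns, and the "borderline" label for the 10-20 ms gap is an assumption, since the thresholds above do not name that range.

```python
def parse_iostat_await(text):
    """Return {device: (r_await, w_await)} from `iostat -x` style output."""
    result, columns = {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            columns = None      # a blank line ends a device table
        elif fields[0].startswith("Device"):
            columns = fields    # the header row names the columns
        elif columns and "r_await" in columns and "w_await" in columns:
            row = dict(zip(columns, fields))
            result[fields[0]] = (float(row["r_await"]), float(row["w_await"]))
    return result

def classify_await(ms):
    """Apply the article's await thresholds (milliseconds)."""
    if ms > 50:
        return "critical"
    if ms > 20:
        return "problem"
    if ms < 10:
        return "healthy"
    return "borderline"  # 10-20 ms: label chosen for this sketch

# Hypothetical, trimmed iostat output.
SAMPLE = """\
Device            r/s     w/s     rMB/s     wMB/s   r_await w_await  %util
sda              1.20   85.40      0.02      1.10      0.80   23.50  41.00
nvme0n1          0.50   90.10      0.01      1.30      0.20    1.10   9.00
"""
for dev, (r, w) in parse_iostat_await(SAMPLE).items():
    print(dev, classify_await(w))
```

If the input contains several reports (as with an interval of 5 seconds), later reports overwrite earlier ones, so the result reflects the most recent sample.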

Also check:

sar -d 1 10

Step 7: Check CPU pressure

On the node:

top

From the cluster:

oc adm top nodes

Look for CPU > 80% on control plane nodes.
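The 80% check can be applied to the text output of oc adm top nodes. This is a small sketch that locates the CPU% column by its header; the function name and the sample output are hypothetical.

```python
def busy_nodes(top_output, cpu_limit=80):
    """Return [(node, cpu_percent)] for nodes above cpu_limit."""
    lines = top_output.strip().splitlines()
    cpu_idx = lines[0].split().index("CPU%")
    busy = []
    for line in lines[1:]:
        fields = line.split()
        cpu = fields[cpu_idx].rstrip("%")
        # Skip rows where usage is not a plain number (e.g. metrics unavailable).
        if cpu.isdigit() and int(cpu) > cpu_limit:
            busy.append((fields[0], int(cpu)))
    return busy

# Hypothetical output of `oc adm top nodes`.
SAMPLE_TOP = """\
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master-0   3400m        85%    12000Mi         78%
master-1   1200m        30%    9000Mi          58%
worker-0   7000m        90%    20000Mi         65%
"""
print(busy_nodes(SAMPLE_TOP))
```

The sketch reports every node above the limit; filter the result by node name or role if you only care about the control plane.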

Step 8: Check memory pressure

On the node:

free -g

From the cluster:

oc adm top nodes

Check for swapping:

vmstat 1

If the si and so columns are non-zero, the node is swapping and etcd performance drops dramatically.
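Swap activity is easy to miss when scanning vmstat output by eye. The following sketch reads the swap-in and swap-out columns by name and flags any non-zero value; the function names and the sample output are hypothetical.

```python
def swap_activity(vmstat_output):
    """Return a list of (si, so) pairs, one per vmstat sample row."""
    columns, samples = None, []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            columns = fields    # header row that names the columns
        elif columns and len(fields) == len(columns) and fields[0].isdigit():
            row = dict(zip(columns, fields))
            samples.append((int(row["si"]), int(row["so"])))
    return samples

def is_swapping(vmstat_output):
    """True if any sample shows pages swapped in or out."""
    return any(si > 0 or so > 0 for si, so in swap_activity(vmstat_output))

# Hypothetical `vmstat 1` output.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 123456   1234 567890    0    0     5    10  100  200  5  2 92  1  0
 3  1  20480  10240   1100 450000  120  340    80   900  400  900 20 10 50 20  0
"""
print(is_swapping(SAMPLE_VMSTAT))
```

Note that the first row vmstat prints reports averages since boot, so it is the later rows that show current behaviour.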

Step 9: Check network latency between masters

From master:

			
ping <master2>
ping <master3>

Round-trip latency should be:

< 1ms (typical for control plane nodes in the same datacenter)

Also:

mtr <master2>

Look for:

packet loss
jitter
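To compare results across nodes, the ping summary can be reduced to a few numbers. The sketch below assumes the summary format printed by Linux iputils ping and uses mdev as a rough jitter indicator; the function name and sample text are hypothetical.

```python
import re

def parse_ping_summary(output):
    """Extract packet loss and RTT statistics from a Linux ping summary."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_ms": float(rtt.group(2)) if rtt else None,
        # mdev is the mean deviation of the RTT, used here as a jitter hint.
        "mdev_ms": float(rtt.group(4)) if rtt else None,
    }

# Hypothetical ping summary lines.
SAMPLE_PING = """\
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.321/0.389/0.456/0.055 ms
"""
stats = parse_ping_summary(SAMPLE_PING)
print(stats, "OK" if stats["avg_ms"] < 1 and stats["loss_pct"] == 0 else "CHECK")
```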

Step 10: Check API churn

A commonly overlooked cause is API churn:

Too many updates
Too many watches
Misbehaving operators
Check:

rate(apiserver_request_total[5m])

and:

apiserver_current_inflight_requests
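To find out where the write load comes from, it helps to rank request rates by resource and verb. The sketch below computes per-second rates from two snapshots of request counters and keeps only write verbs. The snapshot structure, function name and numbers are hypothetical; in practice the counts would come from apiserver_request_total broken down by its resource and verb labels.

```python
WRITE_VERBS = {"POST", "PUT", "PATCH", "DELETE"}

def top_writers(before, after, interval_s, limit=3):
    """Rank (resource, verb) pairs by write requests per second."""
    rates = {}
    for (resource, verb), end in after.items():
        if verb.upper() not in WRITE_VERBS:
            continue
        start = before.get((resource, verb), 0)
        # Clamp at zero in case the counter was reset between snapshots.
        rates[(resource, verb)] = max(end - start, 0) / interval_s
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:limit]

# Hypothetical counter snapshots taken 300 seconds apart.
before = {("configmaps", "PUT"): 1000, ("pods", "GET"): 5000, ("leases", "PUT"): 200}
after = {("configmaps", "PUT"): 4000, ("pods", "GET"): 9000, ("leases", "PUT"): 500}
for (resource, verb), rate in top_writers(before, after, 300):
    print(f"{rate:8.1f}/s  {verb:6s} {resource}")
```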

Common Real-World Root Causes

Storage issue (most common)

Symptoms:

			
High WAL fsync
High backend commit

Fix:

Faster disks
Premium SSD
Dedicated storage

Large etcd DB

Symptoms:

DB > 8GB

Fix:

etcdctl compact <revision>
etcdctl defrag

(compact requires the revision to compact up to; defrag then releases the freed space. OpenShift usually automates maintenance, but investigate excessive object growth.)

API storm

Example:

			
Misconfigured operator
Bad controller
Thousands of writes/sec

Fix:

Identify offending namespace/operator.

Leader instability

Symptoms:

			
Leader changes
Election storms

Fix:

Network troubleshooting between control plane nodes.

Quick Triage Flow

			
etcd latency alert
      ↓
Check etcd operator
      ↓
Check WAL fsync latency
      ↓
Check backend commit latency
      ↓
Check storage (iostat)
      ↓
Check DB size
      ↓
Check leader changes
      ↓
Check API churn
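The flow above can be sketched as an ordered checklist that reports which checks fail for a set of observations. The thresholds are the ones quoted in this article; the observation keys and function name are hypothetical. API churn is left out because the article gives no numeric threshold for it.

```python
# Each entry: (finding, predicate over an observations dict), in triage order.
TRIAGE_ORDER = [
    ("etcd operator degraded", lambda o: o.get("operator_degraded", False)),
    ("WAL fsync p99 above 10 ms", lambda o: o.get("wal_fsync_p99_s", 0) > 0.010),
    ("backend commit p99 above 25 ms", lambda o: o.get("backend_commit_p99_s", 0) > 0.025),
    ("disk await above 20 ms", lambda o: o.get("disk_await_ms", 0) > 20),
    ("DB size above 8 GB", lambda o: o.get("db_size_gb", 0) > 8),
    ("leader changes in the last hour", lambda o: o.get("leader_changes_1h", 0) > 0),
]

def triage(observations):
    """Return the failed checks in the order they should be investigated."""
    return [finding for finding, failed in TRIAGE_ORDER if failed(observations)]

# Hypothetical observations gathered from the steps above.
observed = {
    "operator_degraded": True,
    "wal_fsync_p99_s": 0.060,
    "backend_commit_p99_s": 0.020,
    "disk_await_ms": 35,
    "db_size_gb": 3.2,
    "leader_changes_1h": 0,
}
for finding in triage(observed):
    print("-", finding)
```

Keeping the checks in a list preserves the triage order, so the first finding printed is the first thing to investigate.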

		

Interview Answer (2-Minute Version)

When troubleshooting etcd latency in OpenShift, I first check the etcd operator and Prometheus metrics such as WAL fsync latency, backend commit latency, and leader changes. The most common root cause is storage latency, so I verify disk performance using iostat and ensure etcd is running on low-latency SSD storage. I also check etcd database size, API server request rates, CPU and memory utilization on control plane nodes, and network latency between masters. If needed, I investigate excessive API writes, large object counts, or leader election instability. Since etcd is the cluster’s source of truth, even small increases in disk latency can significantly impact API responsiveness and overall cluster health.

Infra Cloud Solutions
