Troubleshooting etcd latency in OpenShift (OCP) is one of the most important skills for an OpenShift Architect because etcd performance directly impacts the Kubernetes API server and the entire cluster.
When etcd becomes slow, you’ll typically see:
oc get pods --> slow
oc apply --> slow
Operators --> degraded
Authentication --> slow
API Server --> high latency
Understanding the flow
User
↓
API Server
↓
etcd
↓
Disk
Most etcd latency issues are actually caused by:
- Storage latency (most common)
- CPU starvation
- Memory pressure
- Network latency between masters
- Large etcd database
- Excessive API writes
Symptoms
Check cluster operators:
oc get co
Typical:
etcd Degraded=True
kube-apiserver Degraded=True
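On a cluster with many operators, the degraded ones are easy to miss in the full listing. The helper below is a minimal sketch that filters `oc get co` output by its DEGRADED column. The function name `degraded_operators` is hypothetical, and the sketch assumes the default table output with a header row.

```shell
# degraded_operators: reads `oc get co` table output on stdin and prints
# the name of every operator whose DEGRADED column is "True".
# The column is located by its header name rather than a fixed position.
degraded_operators() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DEGRADED") d = i; next }
    d && $d == "True" { print $1 }'
}

# Usage against a live cluster:
#   oc get co | degraded_operators
```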
Step 1: Check etcd operator status
oc get co etcd
Detailed:
oc describe co etcd
Look for messages:
High fsync latency
High commit latency
Leader changes
Database size warning
Step 2: Check etcd metrics
Open the metrics view in the web console:
Observe → Metrics
Important metrics:
WAL fsync latency
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
Target:
< 10ms
Problem:
> 50ms
Backend commit latency
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
Target:
< 25ms
Leader changes
increase(etcd_server_leader_changes_seen_total[1h])
Expected:
0
Frequent changes indicate network or resource issues.
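Prometheus reports these durations in seconds, while the targets above are in milliseconds, so a small conversion helper can prevent misreading a value. The following is a sketch only: the function name `classify_fsync_p99` is hypothetical, and the "watch" label for the range between the 10ms target and the 50ms problem threshold is an assumption, since only the two endpoints are given above.

```shell
# classify_fsync_p99 SECONDS
# Converts a p99 WAL fsync latency from seconds to milliseconds and
# compares it with the thresholds above: < 10ms target, > 50ms problem.
# The in-between "watch" label is an assumption made for this sketch.
classify_fsync_p99() {
  awk -v s="$1" 'BEGIN {
    ms = s * 1000
    if (ms < 10)       verdict = "ok"
    else if (ms <= 50) verdict = "watch"
    else               verdict = "problem"
    printf "%.1f ms: %s\n", ms, verdict
  }'
}

classify_fsync_p99 0.004
classify_fsync_p99 0.120
```

The argument is the value returned by the WAL fsync query, copied from the metrics view.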
Step 3: Check API server latency
oc adm top pods -n openshift-kube-apiserver
Also check the API server request latency histogram in Prometheus:
apiserver_request_duration_seconds
If API latency spikes at the same time as etcd latency, etcd is likely the bottleneck.
Step 4: Check etcd pod health
oc get pods -n openshift-etcd
Should be:
etcd-master-0 Running
etcd-master-1 Running
etcd-master-2 Running
Check logs:
oc logs -n openshift-etcd etcd-master-0
Look for:
apply request took too long
waiting for ReadIndex response took too long
leader changed
These are classic etcd latency indicators.
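Reading a long log by eye is slow, so it can help to count how often each indicator appears. The helper below is a minimal sketch that tallies the three messages listed above from log text on stdin. The function name `count_etcd_latency_hints` is hypothetical, and the match is a plain substring search, so it assumes the messages appear verbatim in the log lines.

```shell
# count_etcd_latency_hints: reads etcd log text on stdin and prints one
# "count<TAB>message" line for each latency indicator named above.
count_etcd_latency_hints() {
  awk '
    /apply request took too long/                  { a++ }
    /waiting for ReadIndex response took too long/ { r++ }
    /leader changed/                               { l++ }
    END {
      printf "%d\tapply request took too long\n", a
      printf "%d\twaiting for ReadIndex response took too long\n", r
      printf "%d\tleader changed\n", l
    }'
}

# Usage against a live cluster (pod name taken from the example above):
#   oc logs -n openshift-etcd etcd-master-0 | count_etcd_latency_hints
```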
Step 5: Check database size
oc exec -n openshift-etcd etcd-master-0 -- etcdctl endpoint status -w table
Look at:
DB SIZE
Rule of thumb:
| Size | Status |
|---|---|
| <4GB | Good |
| 4-8GB | Watch |
| >8GB | Investigate |
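The rule of thumb in the table can be sketched as a small classifier for a size given in bytes, for example a value read from the `etcd_mvcc_db_total_size_in_bytes` metric. The function name `classify_db_size` is hypothetical, and the sketch assumes that "GB" in the table means 1024^3 bytes and that exactly 4GB and 8GB fall into the "Watch" band.

```shell
# classify_db_size BYTES
# Applies the rule of thumb above: <4GB Good, 4-8GB Watch, >8GB Investigate.
# Assumption: 1 GB is treated as 1024^3 bytes.
classify_db_size() {
  awk -v b="$1" 'BEGIN {
    gb = b / (1024 * 1024 * 1024)
    if (gb < 4)       status = "Good"
    else if (gb <= 8) status = "Watch"
    else              status = "Investigate"
    printf "%.2f GB: %s\n", gb, status
  }'
}

classify_db_size 5368709120
```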
Step 6: Check storage latency (MOST COMMON)
Open a debug shell on a control plane node:
oc debug node/master-0
chroot /host
Run:
iostat -xm 5
Look at:
await (newer sysstat versions report it as separate r_await and w_await columns)
Healthy:
< 10ms
Problem:
> 20ms
Critical:
> 50ms
Also check:
sar -d 1 10
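Scanning repeated iostat reports for slow devices is error-prone, so a filter can help. The following is a sketch, assuming the extended iostat layout in which a header line starting with "Device" names the columns. The function name `flag_slow_disks` is hypothetical, and the default limit of 20 ms is taken from the "Problem" threshold above.

```shell
# flag_slow_disks [LIMIT_MS]
# Reads `iostat -x` output on stdin and prints each device whose await,
# r_await or w_await value exceeds LIMIT_MS (default 20).
flag_slow_disks() {
  awk -v limit="${1:-20}" '
    NF == 0 { split("", col); next }   # blank line ends a device table
    $1 == "Device" || $1 == "Device:" {
      # Record which columns hold await values; names vary by sysstat version.
      split("", col)
      for (i = 2; i <= NF; i++)
        if ($i == "await" || $i == "r_await" || $i == "w_await") col[i] = $i
      next
    }
    NF > 1 {
      for (i in col)
        if ($i + 0 > limit) printf "%s %s=%s ms\n", $1, col[i], $i
    }'
}

# Usage on the node:
#   iostat -xm 5 3 | flag_slow_disks 20
```

Locating the columns by header name keeps the filter independent of the exact column order.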
Step 7: Check CPU pressure
top
or
oc adm top nodes
Look for:
CPU > 80%
on control plane nodes.
Step 8: Check memory pressure
free -g
or
oc adm top nodes
To check for swapping:
vmstat 1
Non-zero values in the si and so columns mean the node is swapping, and etcd performance drops dramatically.
Step 9: Check network latency between masters
From master:
ping <master2>
ping <master3>
Latency should be:
< 1ms
Also:
mtr <master2>
Look for:
- packet loss
- jitter
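To compare the measured round-trip time with the 1ms target, the summary line that Linux ping prints at the end of a run can be parsed. This is a sketch under that assumption (a line of the form `rtt min/avg/max/mdev = ... ms`). The function name `ping_avg_check` is hypothetical.

```shell
# ping_avg_check [LIMIT_MS]
# Reads ping output on stdin, extracts the average round-trip time from the
# "min/avg/max" summary line, and compares it with LIMIT_MS (default 1).
ping_avg_check() {
  awk -v limit="${1:-1}" '
    /min\/avg\/max/ {
      split($0, halves, "=")        # halves[2] holds " min/avg/max/mdev ms"
      split(halves[2], rtt, "/")    # rtt[2] is the average
      avg = rtt[2] + 0
      printf "avg=%s ms %s\n", rtt[2], (avg < limit ? "ok" : "high")
    }'
}

# Usage from a control plane node:
#   ping -c 10 <master2> | ping_avg_check
```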
Step 10: Check API churn
One hidden cause:
Too many updates
Too many watches
Bad operators
Check:
rate(apiserver_request_total[5m])
and:
apiserver_current_inflight_requests
Common Real-World Root Causes
Storage issue (70% of cases)
Symptoms:
High WAL fsync
High backend commit
Fix:
- Faster disks
- Premium SSD
- Dedicated storage
Large etcd DB
Symptoms:
DB > 8GB
Fix:
etcdctl compact <revision>
etcdctl defrag
(OpenShift usually automates maintenance, but investigate excessive object growth.)
API storm
Example:
Misconfigured operator
Bad controller
Thousands of writes/sec
Fix:
Identify offending namespace/operator.
Leader instability
Symptoms:
Leader changes
Election storms
Fix:
Network troubleshooting between control plane nodes.
Quick Triage Flow
etcd latency alert
↓
Check etcd operator
↓
Check WAL fsync latency
↓
Check backend commit latency
↓
Check storage (iostat)
↓
Check DB size
↓
Check leader changes
↓
Check API churn
Interview Answer (2-Minute Version)
When troubleshooting etcd latency in OpenShift, I first check the etcd operator and Prometheus metrics such as WAL fsync latency, backend commit latency, and leader changes. The most common root cause is storage latency, so I verify disk performance using iostat and ensure etcd is running on low-latency SSD storage. I also check etcd database size, API server request rates, CPU and memory utilization on control plane nodes, and network latency between masters. If needed, I investigate excessive API writes, large object counts, or leader election instability. Since etcd is the cluster’s source of truth, even small increases in disk latency can significantly impact API responsiveness and overall cluster health.