If starting or deploying the cattle-system pods (especially cattle-cluster-agent) crashes your cluster,
it means the Rancher agents are broken or misconfigured and are overloading, blocking, or breaking Kubernetes internally.
In detail, here’s why this can happen:
| Cause | What Happens | Why It Crushes the Cluster |
|---|---|---|
| ❌ Rancher agents fail to connect and keep retrying | They flood the Kubernetes API server with reconnect attempts | API server gets overloaded, becomes unresponsive |
| ❌ Wrong Rancher URL or network broken | Agents enter infinite loops trying to reach Rancher | Node CPU/memory gets exhausted |
| ❌ Authentication errors (cert expired, token invalid) | Agents spam auth failures on kube-apiserver | API server becomes slow or hangs |
| ❌ Agent version mismatch | Older agents send malformed requests to newer servers | API server rejects them, generating constant error churn |
| ❌ Cluster registration issues (wrong cluster state in Rancher DB) | Rancher tries to sync invalid resources | etcd or kube-apiserver crash |
| ❌ cattle-node-agent overload | Each node spawns bad agents, DaemonSet restarts infinitely | kubelet, container runtime get overloaded |
🚨 Important:
- Rancher’s `cattle-cluster-agent` talks both to your Kubernetes API server and to Rancher’s API.
- If it is broken (wrong URL, invalid token, expired cert, bad Rancher setup), it spams the cluster.
- Symptoms you usually see:
  - `kubectl get nodes` hangs
  - `kubectl get pods -A` freezes
  - CPU of kube-apiserver jumps high
  - Cluster nodes become `NotReady`
  - etcd warnings appear if etcd is running locally
✅ Proof: If you kill (scale to 0) the cattle-cluster-agent, the cluster usually becomes healthy again.
Example:
kubectl scale deployment cattle-cluster-agent -n cattle-system --replicas=0
→ Cluster becomes stable.
💬 Summary in simple words:
Rancher agents are deeply tied into your Kubernetes cluster.
If they are broken, they spam your Kubernetes system like crazy, and that overloads and crashes your cluster.
🔎 Next step: To solve it properly:
- Collect logs:
  kubectl logs -n cattle-system cattle-cluster-agent-xxxx
  kubectl logs -n cattle-system cattle-node-agent-xxxx
- Fix whatever is wrong (wrong URL, expired cert, wrong Rancher token, broken network).
- Then carefully redeploy `cattle-system`.
👉 Question for you to proceed:
- Did you recently upgrade Rancher or move Rancher server (IP change, domain change)?
- Did you renew SSL/TLS certificates on Rancher?
- Or is this a new cluster registration you are trying?
Depending on your answer, I will tell you exactly how to fix it.
Would you like me to guide you step-by-step? 🚀
(If yes, tell me Rancher version too — 2.6.x, 2.7.x, 2.8.x, etc.)
This is a serious architecture conflict:
Rancher’s cattle-node-agent is destabilizing your AWX deployment.
And the reason is resource exhaustion or cluster network/messaging overload.
🔥 Why does this happen?
| Cause | Description | Why it breaks AWX |
|---|---|---|
| Resource exhaustion | Rancher’s cattle-node-agent is heavy (especially when reconnecting or retrying). It uses a lot of CPU, memory, and bandwidth. | AWX Postgres and Web pods starve for CPU/memory or lose network |
| Node pressure | When 2 cattle-node-agents come alive, node CPU/mem pressure increases. | Kubernetes starts evicting pods (including AWX pods) |
| Pod preemption / eviction | If resource requests, limits, and priorities are not set properly, the scheduler can preempt “low-priority” pods and the kubelet evicts pods under node pressure. | Your AWX web or postgres pod might get evicted |
| Flapping networking | Rancher’s agents may try to reach Rancher server and overload K8s networking. | AWX needs stable DNS, API server access; it breaks |
| PersistentVolume interruptions | If Rancher agents cause node instability, your Postgres PVCs (storage) might get stuck or remount improperly. | AWX relies heavily on Postgres; database disruptions kill it |
🧠 Let’s go deep technical quickly:
If you run:
kubectl get events -n awx --sort-by='.metadata.creationTimestamp'
or
kubectl describe pod <awx-pod-name> -n awx
you will likely see:
- `Evicted` messages
- `NodeAffinity` errors
- `OutOfcpu` / `OutOfmemory`
- `Pod sandbox changed`
- Volume mount errors
- Postgres `connection refused` or file corruption
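Since eviction and scheduling failures are what you are hunting for in that event stream, a quick grep helps. The block below demos the filter on a canned sample (a made-up stand-in for real `kubectl get events -n awx` output); in practice you would pipe the kubectl command into the same grep.

```shell
# Canned sample standing in for `kubectl get events -n awx` output:
events='10m Warning Evicted pod/awx-web-abc The node was low on resource: memory
12m Normal Pulled pod/awx-task-def Container image already present
15m Warning FailedScheduling pod/awx-postgres-0 0/2 nodes are available'

# Keep only the lines that point at pressure/scheduling problems:
printf '%s\n' "$events" | grep -E 'Evicted|FailedScheduling'
```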
🛠 How to properly fix this
✅ Step 1: Check Node Resources
Look at:
kubectl describe node <node-name>
Check for:
- `Allocatable` CPU/memory
- Current usage
- `Conditions` like `MemoryPressure`, `DiskPressure`
✅ Step 2: Set Strong Resource Requests & Limits for AWX
In your AWX YAML (awx.yaml), add hard limits like this:
spec:
web_resource_requirements:
requests:
memory: 1Gi
cpu: 500m
limits:
memory: 2Gi
cpu: 1000m
task_resource_requirements:
requests:
memory: 1Gi
cpu: 500m
limits:
memory: 2Gi
cpu: 1000m
postgres_resource_requirements:
requests:
memory: 1Gi
cpu: 500m
limits:
memory: 2Gi
cpu: 1000m
✅ Step 3: Pin AWX pods to 1 specific node (optional)
If your nodes are unstable, you can force AWX to run only on the “good” node:
spec:
  node_selector: |
    node-role.kubernetes.io/awx: "true"
Or use taints and tolerations.
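If you go the taints route, here is a hedged sketch (the `dedicated=awx` taint key/value is an illustrative assumption; the AWX Operator accepts `tolerations` as a block-scalar string):

```yaml
# First taint the node you dedicated to AWX (illustrative taint):
#   kubectl taint nodes <node-name> dedicated=awx:NoSchedule
spec:
  tolerations: |
    - key: "dedicated"
      operator: "Equal"
      value: "awx"
      effect: "NoSchedule"
```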
✅ Step 4: Fix Rancher agents
If your Rancher agents (cattle-node-agent) are going crazy (e.g., retrying, error loops):
- Make sure nodes can reach Rancher server stably
- Update the registration URL if needed
- Fix certificate errors
You can check the cattle-node-agent logs:
kubectl logs -n cattle-system cattle-node-agent-xxxxx
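If you suspect an expired certificate, openssl prints the expiry date directly. The block below demonstrates the check on a throwaway self-signed cert so it is self-contained; against a live Rancher server you would instead feed `openssl s_client -connect <rancher-host>:443` output into the same `x509` command. This is a generic sketch, not Rancher-specific tooling.

```shell
# Create a throwaway self-signed cert purely to demonstrate the check
# (in practice, inspect the actual Rancher server certificate).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=rancher-demo" \
  -keyout /tmp/rancher-demo.key -out /tmp/rancher-demo.crt 2>/dev/null

# Print the expiry date; an expired cert here would explain agent auth failures.
openssl x509 -noout -enddate -in /tmp/rancher-demo.crt
```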
✅ Step 5: Stabilize PVCs
If AWX Postgres uses PVC (Persistent Volume Claim), ensure that:
- Storage class is fast and resilient
- Volumes aren’t remounting during node pressure
If you use Longhorn or local-path, make sure your storage is attached and healthy.
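As a hedged sketch, the AWX Operator exposes storage knobs on the same spec; `longhorn` and `8Gi` below are placeholders for whatever StorageClass and size you actually run:

```yaml
spec:
  postgres_storage_class: longhorn   # placeholder: use your real StorageClass
  postgres_storage_requirements:
    requests:
      storage: 8Gi
```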
⚡ Quick Summary
| Symptom | Root Cause | Fix |
|---|---|---|
| AWX web and Postgres pods crash when Rancher agent wakes up | Node pressure, network overload, volume flapping | Set AWX hard resource requests/limits, stabilize Rancher agent, pin critical pods to best nodes, ensure fast storage |
✍️ What I Recommend for You Right Now
- Look at AWX pods:
  kubectl get pods -n awx
  kubectl describe pod <awx-pod> -n awx
  kubectl get events -n awx --sort-by='.metadata.creationTimestamp'
- Look at node status:
  kubectl describe node <node-name>
- Look at Rancher cattle-node-agent logs:
  kubectl logs -n cattle-system cattle-node-agent-xxxx
Paste me (if you want) the errors and I can tell you exactly what’s killing AWX.
You’re very close to solving it. 🚀
❓ Follow-up request
Create a better AWX deployment YAML that sets:
- Node Affinity
- Hard CPU/memory limits
- Stable PVC
- Tolerations (if needed)
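Putting the pieces above together, here is a hedged sketch of a full AWX custom resource in AWX Operator style. The field names follow the operator’s spec, but the `node-role.kubernetes.io/awx` label, the `dedicated=awx` taint, and the `longhorn` StorageClass are illustrative assumptions — substitute your own values.

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # Pin AWX to a labeled node (label the node first)
  node_selector: |
    node-role.kubernetes.io/awx: "true"
  tolerations: |
    - key: "dedicated"
      operator: "Equal"
      value: "awx"
      effect: "NoSchedule"
  # Hard CPU/memory requests and limits
  web_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  task_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  postgres_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  # Stable PVC for Postgres (placeholder StorageClass)
  postgres_storage_class: longhorn
  postgres_storage_requirements:
    requests:
      storage: 8Gi
```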
Troubleshooting
kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode
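Watch the straight quotes and the double dash in `--decode` — copy-pasting typographic quotes or an en dash silently breaks the command. A self-contained demo of just the decode step (`sup3rs3cret` is a made-up stand-in for the real admin password):

```shell
# Kubernetes stores Secret data base64-encoded, so the jsonpath output
# must be piped through `base64 --decode` to recover the plaintext.
encoded=$(printf 'sup3rs3cret' | base64)   # stand-in for .data.password
printf '%s' "$encoded" | base64 --decode   # prints: sup3rs3cret
```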