Deploy AWX (Rancher)

If starting (deploying) the cattle-system pods (especially cattle-cluster-agent) crashes your cluster,
it means the Rancher agents are broken or misconfigured and are overloading, blocking, or breaking Kubernetes internally.


In detail, here’s why this can happen:

| Cause | What Happens | Why It Crushes the Cluster |
| --- | --- | --- |
| ❌ Rancher agents fail to connect and keep retrying | They flood the Kubernetes API server with reconnect attempts | API server gets overloaded, becomes unresponsive |
| ❌ Wrong Rancher URL or broken network | Agents enter infinite loops trying to reach Rancher | Node CPU/memory gets exhausted |
| ❌ Authentication errors (cert expired, token invalid) | Agents spam auth failures on kube-apiserver | API server becomes slow or hangs |
| ❌ Agent version mismatch | Older agents send bad requests to newer servers | API server repeatedly rejects them |
| ❌ Cluster registration issues (wrong cluster state in Rancher DB) | Rancher tries to sync invalid resources | etcd or kube-apiserver crash |
| ❌ cattle-node-agent overload | Each node spawns failing agents; the DaemonSet restarts infinitely | kubelet and container runtime get overloaded |
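
A quick way to see whether the agents are stuck in a retry loop is to check their restart counts and recent logs. A minimal sketch (the `app=cattle-cluster-agent` label follows Rancher's defaults; the block is guarded so it degrades gracefully on a machine without cluster access):

```shell
NS=cattle-system
if command -v kubectl >/dev/null 2>&1; then
  # High RESTARTS counts on these pods usually mean a reconnect loop
  kubectl get pods -n "$NS" -o wide
  # Recent agent logs; look for repeated connection/auth/TLS errors
  kubectl logs -n "$NS" -l app=cattle-cluster-agent --tail=50
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```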

🚨 Important:

  • Rancher’s cattle-cluster-agent talks both to your Kubernetes API server and to Rancher’s API.
  • If it is broken (wrong URL, invalid token, expired cert, bad Rancher setup), it spams the cluster.
  • Symptoms you usually see:
    • kubectl get nodes hangs
    • kubectl get pods -A freezes
    • CPU of kube-apiserver jumps high
    • Cluster nodes become NotReady
    • etcd warnings appear if etcd is running locally
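
You can confirm API-server pressure directly before touching anything. A small sketch (assumes metrics-server is installed for `kubectl top`; guarded for machines without cluster access):

```shell
NS=kube-system
if command -v kubectl >/dev/null 2>&1; then
  # A trivial API call that should return "ok" fast; a hang confirms pressure
  kubectl get --raw='/healthz' --request-timeout=5s
  # CPU/memory per node and per control-plane pod (requires metrics-server)
  kubectl top nodes
  kubectl top pods -n "$NS" --sort-by=cpu
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```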

Proof: If you kill (scale to 0) the cattle-cluster-agent, the cluster usually becomes healthy again.
Example:

kubectl scale deployment cattle-cluster-agent -n cattle-system --replicas=0

→ Cluster becomes stable.
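
Once the root cause is fixed, you bring the agent back the same way (1 replica is typical for cattle-cluster-agent; newer Rancher versions may run more, so check what your cluster had before):

```shell
NS=cattle-system
if command -v kubectl >/dev/null 2>&1; then
  # Restore the agent and verify it stays Running without restart loops
  kubectl scale deployment cattle-cluster-agent -n "$NS" --replicas=1
  kubectl get pods -n "$NS"
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```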


💬 Summary in simple words:

Rancher agents are deeply tied into your Kubernetes cluster.
If they are broken, they spam your Kubernetes system like crazy, and that overloads and crashes your cluster.


🔎 Next step: To solve it properly:

  1. Collect logs:
     kubectl logs -n cattle-system cattle-cluster-agent-xxxx
     kubectl logs -n cattle-system cattle-node-agent-xxxx
  2. Fix whatever is wrong (wrong URL, expired cert, wrong Rancher token, broken network).
  3. Then redeploy cattle-system carefully.
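
If registration itself is broken, the usual remedy is to re-apply the cluster's import manifest from the Rancher UI (Cluster Management → your cluster → Registration). The URL below is a placeholder, not your real one:

```shell
# Placeholder URL: copy the real import command from your Rancher UI
IMPORT_URL="https://rancher.example.com/v3/import/TOKEN.yaml"
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f "$IMPORT_URL"
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```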

👉 Question for you to proceed:

  • Did you recently upgrade Rancher or move Rancher server (IP change, domain change)?
  • Did you renew SSL/TLS certificates on Rancher?
  • Or is this a new cluster registration you are trying?

Depending on your answer, I will tell you exactly how to fix it.
Would you like me to guide you step-by-step? 🚀
(If yes, tell me Rancher version too — 2.6.x, 2.7.x, 2.8.x, etc.)

This is a serious architecture conflict:

Rancher’s cattle-node-agent is destabilizing your AWX deployment.

And the reason is resource exhaustion or cluster network/messaging overload.


🔥 Why does this happen?

| Cause | Description | Why it breaks AWX |
| --- | --- | --- |
| Resource exhaustion | Rancher’s cattle-node-agent is heavy (especially when reconnecting or retrying). It uses a lot of CPU, memory, and bandwidth. | AWX Postgres and web pods starve for CPU/memory or lose network |
| Node pressure | When 2 cattle-node-agents come alive, node CPU/memory pressure increases. | Kubernetes starts evicting pods (including AWX pods) |
| Pod eviction | If no resource limits are properly set, the kubelet evicts “low-priority” pods to protect critical ones. | Your AWX web or postgres pod might get evicted |
| Flapping networking | Rancher’s agents may try to reach the Rancher server and overload K8s networking. | AWX needs stable DNS and API server access; it breaks |
| PersistentVolume interruptions | If Rancher agents cause node instability, your Postgres PVCs (storage) might get stuck or remount improperly. | AWX relies heavily on Postgres; database disruptions kill it |

🧠 Let’s go deep technical quickly:

If you run:

kubectl get events -n awx --sort-by='.metadata.creationTimestamp'

or

kubectl describe pod <awx-pod-name> -n awx

you will likely see:

  • Evicted messages
  • NodeAffinity errors
  • OutOfcpu
  • OutOfmemory
  • Pod sandbox changed
  • Volume mount errors
  • Postgres connection refused or file corruption
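
To pull exactly those signals out in one pass, a small sketch (namespace `awx` assumed, guarded for machines without cluster access):

```shell
NS=awx
if command -v kubectl >/dev/null 2>&1; then
  # Pods that were evicted or otherwise failed
  kubectl get pods -n "$NS" --field-selector=status.phase=Failed
  # Warning events only, oldest first
  kubectl get events -n "$NS" --field-selector=type=Warning \
    --sort-by='.metadata.creationTimestamp'
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```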

🛠 How to properly fix this

Step 1: Check Node Resources

Look at:

kubectl describe node <node-name>

Check for:

  • Allocatable CPU/memory
  • Current Usage
  • Conditions like MemoryPressure, DiskPressure
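
The relevant lines are easy to miss in the full `kubectl describe node` output; this sketch filters for them (the node name is a placeholder, and `kubectl top` assumes metrics-server):

```shell
NODE=my-node   # replace with your node's name
if command -v kubectl >/dev/null 2>&1; then
  # MemoryPressure / DiskPressure / Ready conditions
  kubectl describe node "$NODE" | grep -A6 "Conditions:"
  # Requests vs. allocatable capacity
  kubectl describe node "$NODE" | grep -A8 "Allocated resources:"
  kubectl top node "$NODE"   # live usage; requires metrics-server
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```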

Step 2: Set Strong Resource Requests & Limits for AWX

In your AWX YAML (awx.yaml), add hard limits like this:

spec:
  web_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  task_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  postgres_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m

Step 3: Pin AWX pods to 1 specific node (optional)

If your nodes are unstable, you can force AWX to run only on the “good” node:

spec:
  node_selector: |
    node-role.kubernetes.io/awx: "true"

Or use taints and tolerations.
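
For that node_selector to match anything, the label has to exist on the node first. A minimal sketch (the node name and label value are examples; adjust to your cluster):

```shell
NODE=good-node   # replace with your stable node's name
LABEL="node-role.kubernetes.io/awx=true"
if command -v kubectl >/dev/null 2>&1; then
  kubectl label node "$NODE" "$LABEL" --overwrite
  # Verify the label landed
  kubectl get nodes --show-labels | grep awx
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```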

Step 4: Fix Rancher agents

If your Rancher agents (cattle-node-agent) are going crazy (e.g., retrying, error loops):

  • Make sure nodes can reach Rancher server stably
  • Update the registration URL if needed
  • Fix certificate errors

You can check the cattle-node-agent logs:

kubectl logs -n cattle-system cattle-node-agent-xxxxx

Step 5: Stabilize PVCs

If AWX Postgres uses PVC (Persistent Volume Claim), ensure that:

  • Storage class is fast and resilient
  • Volumes aren’t remounting during node pressure

If you use Longhorn or local-path, make sure your storage is attached and healthy.
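
A quick storage health check, sketched (namespace `awx` assumed; guarded for machines without cluster access):

```shell
NS=awx
if command -v kubectl >/dev/null 2>&1; then
  # All PVCs should be Bound; Pending or Lost means storage trouble
  kubectl get pvc -n "$NS"
  # Attachment state of the underlying volumes across nodes
  kubectl get volumeattachments
else
  echo "kubectl not found; run this on a machine with cluster access"
fi
```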


⚡ Quick Summary

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| AWX web and Postgres pods crash when the Rancher agent wakes up | Node pressure, network overload, volume flapping | Set hard AWX resource requests/limits, stabilize the Rancher agent, pin critical pods to the best nodes, ensure fast storage |

✍️ What I Recommend for You Right Now

  1. Look at AWX pods:
     kubectl get pods -n awx
     kubectl describe pod <awx-pod> -n awx
     kubectl get events -n awx --sort-by='.metadata.creationTimestamp'
  2. Look at node status:
     kubectl describe node <node-name>
  3. Look at Rancher cattle-node-agent logs:
     kubectl logs -n cattle-system cattle-node-agent-xxxx

Paste me (if you want) the errors and I can tell you exactly what’s killing AWX.
You’re very close to solving it. 🚀


❓ Quick Questions

Create a better AWX deployment YAML that sets:

  • Node Affinity
  • Hard CPU/memory limits
  • Stable PVC
  • Tolerations (if needed)
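
A sketch of such a spec for the AWX operator, combining the pieces above. Field names follow the AWX operator CRD (where node_selector and tolerations are multiline strings); the node label, toleration key, and storage class are assumptions — adjust them to your cluster:

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # Pin to a labeled node (the label must exist on the node)
  node_selector: |
    node-role.kubernetes.io/awx: "true"
  # Tolerate a matching taint, if you taint that node (example key/value)
  tolerations: |
    - key: "awx"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  # Hard limits so AWX survives node pressure
  web_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  task_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  postgres_resource_requirements:
    requests: {memory: 1Gi, cpu: 500m}
    limits: {memory: 2Gi, cpu: 1000m}
  # Stable storage for Postgres (example storage class)
  postgres_storage_class: longhorn
  postgres_storage_requirements:
    requests:
      storage: 8Gi
```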

Troubleshooting

kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode
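
The second half of that pipe is plain base64 decoding (Kubernetes stores secret data base64-encoded). A tiny local demo; the encoded string here is just the word "password":

```shell
# What "| base64 --decode" does to the secret's data field:
echo "cGFzc3dvcmQ=" | base64 --decode   # prints: password
echo
```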

Leave a comment