Deploy AWX (Rancher)

If starting the cattle-system pods (especially cattle-cluster-agent) crashes your cluster,
it means the Rancher agents are broken or misconfigured and are overloading, blocking, or breaking Kubernetes internally.


In detail, here’s why this can happen:

| Cause | What Happens | Why It Crashes the Cluster |
|---|---|---|
| ❌ Rancher agents fail to connect and keep retrying | They flood the Kubernetes API server with reconnect attempts | API server gets overloaded, becomes unresponsive |
| ❌ Wrong Rancher URL or broken network | Agents enter infinite loops trying to reach Rancher | Node CPU/memory gets exhausted |
| ❌ Authentication errors (expired cert, invalid token) | Agents spam auth failures on kube-apiserver | API server becomes slow or hangs |
| ❌ Agent version mismatch | Older agents send bad requests to newer servers | API server rejects them repeatedly |
| ❌ Cluster registration issues (wrong cluster state in Rancher DB) | Rancher tries to sync invalid resources | etcd or kube-apiserver crashes |
| ❌ cattle-node-agent overload | DaemonSet restarts infinitely, each node spawns bad agents | kubelet and container runtime get overloaded |

🚨 Important:

  • Rancher’s cattle-cluster-agent talks both to your Kubernetes API server and to Rancher’s API.
  • If it is broken (wrong URL, invalid token, expired cert, bad Rancher setup), it spams the cluster.
  • Symptoms you usually see:
    • kubectl get nodes hangs
    • kubectl get pods -A freezes
    • CPU of kube-apiserver jumps high
    • Cluster nodes become NotReady
    • etcd warnings appear if etcd is running locally
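These symptoms can be checked quickly. A triage sketch, assuming kubectl access to the affected cluster; the namespace and deployment names are the Rancher defaults:

```shell
#!/bin/bash
# Quick triage: is a Rancher agent hammering the API server?
# Namespace and deployment names are the Rancher defaults.
NS="cattle-system"
AGENT="cattle-cluster-agent"

if command -v kubectl >/dev/null 2>&1; then
  # A broken agent shows a rapidly growing restart count
  kubectl -n "$NS" get pods -l app="$AGENT" \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
  # API server responsiveness: time a cheap request
  time kubectl get --raw='/healthz'
else
  echo "kubectl not found; run this from a machine with cluster access"
fi
```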

Proof: If you kill (scale to 0) the cattle-cluster-agent, the cluster usually becomes healthy again.
Example:

kubectl scale deployment cattle-cluster-agent -n cattle-system --replicas=0

→ Cluster becomes stable.


💬 Summary in simple words:

Rancher agents are deeply tied into your Kubernetes cluster.
If they are broken, they spam your Kubernetes system like crazy, and that overloads and crashes your cluster.


🔎 Next step: To solve it properly:

  1. Collect logs:

     kubectl logs -n cattle-system cattle-cluster-agent-xxxx
     kubectl logs -n cattle-system cattle-node-agent-xxxx
  2. Fix whatever is wrong (wrong URL, expired cert, wrong Rancher token, network broken).
  3. Then redeploy cattle-system carefully.
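The three steps above can be sketched as one sequence (assumes kubectl access; log paths and the replica count are example choices):

```shell
#!/bin/bash
# Recovery sequence sketch: quiesce the agent, capture logs, fix, redeploy.
NS="cattle-system"

if command -v kubectl >/dev/null 2>&1; then
  # 1. Stop the crash loop so the API server can recover
  kubectl -n "$NS" scale deployment cattle-cluster-agent --replicas=0
  # 2. Capture logs from every pod in the namespace (pod names vary)
  for pod in $(kubectl -n "$NS" get pods -o name); do
    kubectl -n "$NS" logs "$pod" --tail=200 > "/tmp/${pod##*/}.log" 2>&1
  done
  # 3. After fixing the root cause, scale the agent back up
  echo "When fixed: kubectl -n $NS scale deployment cattle-cluster-agent --replicas=1"
else
  echo "kubectl not found; run this from a machine with cluster access"
fi
```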

👉 Question for you to proceed:

  • Did you recently upgrade Rancher or move Rancher server (IP change, domain change)?
  • Did you renew SSL/TLS certificates on Rancher?
  • Or is this a new cluster registration you are trying?

Depending on your answer, I will tell you exactly how to fix it.
Would you like me to guide you step-by-step? 🚀
(If yes, tell me Rancher version too — 2.6.x, 2.7.x, 2.8.x, etc.)

This is a serious architecture conflict:

Rancher’s cattle-node-agent is destabilizing your AWX deployment.

And the reason is resource exhaustion or cluster network/messaging overload.


🔥 Why does this happen?

| Cause | Description | Why It Breaks AWX |
|---|---|---|
| Resource exhaustion | Rancher's cattle-node-agent is heavy (especially when reconnecting or retrying); it uses a lot of CPU, memory, and bandwidth | AWX Postgres and web pods starve for CPU/memory or lose network |
| Node pressure | When two cattle-node-agents come alive, node CPU/memory pressure increases | Kubernetes starts evicting pods (including AWX pods) |
| Pod preemption | If no resource limits are properly set, kube-scheduler evicts "low-priority" pods to protect critical ones | Your AWX web or Postgres pod might get evicted |
| Flapping networking | Rancher's agents may try to reach the Rancher server and overload Kubernetes networking | AWX needs stable DNS and API server access; it breaks |
| PersistentVolume interruptions | If Rancher agents cause node instability, your Postgres PVCs (storage) might get stuck or remount improperly | AWX relies heavily on Postgres; database disruptions kill it |

🧠 Let’s go deep technical quickly:

If you run:

kubectl get events -n awx --sort-by='.metadata.creationTimestamp'

or

kubectl describe pod <awx-pod-name> -n awx

you will likely see:

  • Evicted messages
  • NodeAffinity errors
  • OutOfcpu
  • OutOfmemory
  • Pod sandbox changed
  • Volume mount errors
  • Postgres connection refused or file corruption

🛠 How to properly fix this

Step 1: Check Node Resources

Look at:

kubectl describe node <node-name>

Check for:

  • Allocatable CPU/memory
  • Current Usage
  • Conditions like MemoryPressure, DiskPressure
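A sketch that summarizes those checks across all nodes (assumes kubectl access; `kubectl top` additionally needs metrics-server):

```shell
#!/bin/bash
# Summarize pressure conditions and usage per node.
PRESSURE_TYPES='MemoryPressure|DiskPressure|PIDPressure'

if command -v kubectl >/dev/null 2>&1; then
  for node in $(kubectl get nodes -o name); do
    echo "== ${node##*/} =="
    kubectl describe "$node" | grep -E "$PRESSURE_TYPES"
  done
  # Current usage requires metrics-server
  kubectl top nodes 2>/dev/null || echo "kubectl top unavailable (no metrics-server?)"
else
  echo "kubectl not found"
fi
```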

Step 2: Set Strong Resource Requests & Limits for AWX

In your AWX YAML (awx.yaml), add hard limits like this:

spec:
  web_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  task_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m
  postgres_resource_requirements:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 1000m

Step 3: Pin AWX pods to 1 specific node (optional)

If your nodes are unstable, you can force AWX to run only on the “good” node:

spec:
  node_selector: |
    node-role.kubernetes.io/awx: "true"

Or use taints and tolerations.
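A taint/toleration sketch for dedicating a node to AWX; "worker-1" and the dedicated=awx key/value are example names, not anything your cluster already has:

```shell
#!/bin/bash
# Dedicate a node to AWX with a taint plus a matching toleration.
# "worker-1" and the dedicated=awx key/value are example names.
NODE="worker-1"

if command -v kubectl >/dev/null 2>&1; then
  kubectl taint nodes "$NODE" dedicated=awx:NoSchedule --overwrite
  kubectl label nodes "$NODE" node-role.kubernetes.io/awx="true" --overwrite
else
  echo "kubectl not found"
fi

# Matching toleration for the AWX spec (the operator accepts it as a YAML string):
#   tolerations: |
#     - key: dedicated
#       operator: Equal
#       value: awx
#       effect: NoSchedule
```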

Step 4: Fix Rancher agents

If your Rancher agents (cattle-node-agent) are going crazy (e.g., retrying, error loops):

  • Make sure nodes can reach Rancher server stably
  • Update the registration URL if needed
  • Fix certificate errors

You can check the cattle-node-agent logs:

kubectl logs -n cattle-system cattle-node-agent-xxxxx

Step 5: Stabilize PVCs

If AWX Postgres uses PVC (Persistent Volume Claim), ensure that:

  • Storage class is fast and resilient
  • Volumes aren’t remounting during node pressure

If you use Longhorn or local-path, make sure your storage is attached and healthy.
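A quick PVC health check sketch (assumes the AWX namespace is `awx`; the event reason filter is one common failure mode, not an exhaustive list):

```shell
#!/bin/bash
# Check that AWX PVCs are Bound and look for recent mount failures.
NS="awx"

if command -v kubectl >/dev/null 2>&1; then
  kubectl -n "$NS" get pvc
  # Volume-related events such as FailedMount (FailedAttachVolume is another reason to check)
  kubectl -n "$NS" get events --field-selector reason=FailedMount
else
  echo "kubectl not found"
fi
```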


⚡ Quick Summary

| Symptom | Root Cause | Fix |
|---|---|---|
| AWX web and Postgres pods crash when the Rancher agent wakes up | Node pressure, network overload, volume flapping | Set hard AWX resource requests/limits, stabilize the Rancher agent, pin critical pods to healthy nodes, ensure fast storage |

✍️ What I Recommend for You Right Now

  1. Look at AWX pods:

     kubectl get pods -n awx
     kubectl describe pod <awx-pod> -n awx
     kubectl get events -n awx --sort-by='.metadata.creationTimestamp'

  2. Look at node status:

     kubectl describe node <node-name>

  3. Look at Rancher cattle-node-agent logs:

     kubectl logs -n cattle-system cattle-node-agent-xxxx

Paste me (if you want) the errors and I can tell you exactly what’s killing AWX.
You’re very close to solving it. 🚀


❓ Quick Task

Create a better AWX deployment YAML that sets:

  • Node Affinity
  • Hard CPU/memory limits
  • Stable PVC
  • Tolerations (if needed)
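A hedged sketch of such an AWX custom resource. The field names follow the AWX operator's spec, but the node label, toleration key, storage class, and sizes are all example values to adapt:

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # Pin to labeled nodes (label is an example; apply it with kubectl label)
  node_selector: |
    node-role.kubernetes.io/awx: "true"
  # Tolerate a dedicated-node taint, if you use one
  tolerations: |
    - key: dedicated
      operator: Equal
      value: awx
      effect: NoSchedule
  # Hard limits so the scheduler protects AWX under node pressure
  web_resource_requirements:
    requests: {cpu: 500m, memory: 1Gi}
    limits: {cpu: "1", memory: 2Gi}
  task_resource_requirements:
    requests: {cpu: 500m, memory: 1Gi}
    limits: {cpu: "1", memory: 2Gi}
  postgres_resource_requirements:
    requests: {cpu: 500m, memory: 1Gi}
    limits: {cpu: "1", memory: 2Gi}
  # Stable storage for Postgres (storage class name is an example)
  postgres_storage_class: longhorn
  postgres_storage_requirements:
    requests:
      storage: 8Gi
```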

Troubleshooting

kubectl get secret awx-admin-password -n awx -o jsonpath="{.data.password}" | base64 --decode

Install MapR license

#!/bin/bash

# === CONFIGURATION ===
LICENSE_FILE="/path/to/your/maprlicense.txt"
PRIMARY_NODE="cldb-node.example.com"   # change to your CLDB node hostname or IP

# === STEP 1: Copy license to the primary node ===
echo "Copying license file to $PRIMARY_NODE..."
scp "$LICENSE_FILE" "$PRIMARY_NODE:/tmp/maprlicense.txt"

# === STEP 2: Add the license on the cluster ===
echo "Adding license using maprcli on $PRIMARY_NODE..."
ssh "$PRIMARY_NODE" "sudo maprcli license add -license /tmp/maprlicense.txt"

# === STEP 3: Verify license ===
echo "Verifying license status..."
ssh "$PRIMARY_NODE" "sudo maprcli license list -json | jq ."

echo "✅ License added and verified successfully!"


How Cluster ID is normally generated
When you first install MapR, you run something like:


sudo /opt/mapr/server/configure.sh -C <cldb-nodes> -Z <zookeeper-nodes> -N <cluster-name>
When configure.sh runs for the first time:

It creates /opt/mapr/conf/clusterid

It creates /opt/mapr/conf/mapr-clusters.conf

It registers your node with CLDB

The Cluster ID is a random large number, created automatically.

If you need to (re)generate a Cluster ID manually:
If you're setting up a new cluster, and no CLDB is initialized yet, you can force generate a Cluster ID like this:

Stop Warden if running:


sudo systemctl stop mapr-warden
Clean old config (careful, if cluster already had data, don't do this):


sudo rm -rf /opt/mapr/conf/clusterid
sudo rm -rf /opt/mapr/conf/mapr-clusters.conf
Re-run configure.sh:

Example:


sudo /opt/mapr/server/configure.sh -N mycluster.example.com -C cldb-node1,cldb-node2 -Z zk-node1,zk-node2
-N: Cluster Name

-C: CLDB nodes

-Z: ZooKeeper nodes

After that:
cat /opt/mapr/conf/clusterid
→ You will now see the new Cluster ID!

Kong Konnect vs Kong Gateway

| Aspect | Kong Gateway (self-hosted) | Kong Konnect (cloud SaaS) |
|---|---|---|
| Deployment | You install and manage Kong yourself (VMs, Kubernetes, bare metal) | Kong hosts the control plane; you run minimal Data Planes |
| Upgrades | You upgrade Kong manually | Kong upgrades the control plane automatically |
| Scaling | You manage scaling (HA, clustering) | Kong auto-scales the Control Plane; you scale only Data Planes |
| Security | You manage certificates, patching, hardening | Kong secures the control plane; you secure the data planes |
| Analytics | Optional, via your own Prometheus/Grafana, or Enterprise Edition analytics | Built in with Konnect: real-time metrics, usage, dashboards |
| Admin GUI | Kong Manager on your VM | Cloud UI in Konnect (always updated) |
| Developer Portal | Host and manage it yourself | Cloud-hosted Developer Portal included |
| RBAC / Single Sign-On (SSO) | Enterprise feature; you configure LDAP, OIDC yourself | Native SSO, multi-organization support |
| Pricing | License + cost of infrastructure + admin work | Subscription pricing (includes hosting and support) |
| Reliability | Depends on your HA setup | Konnect is SaaS with an uptime SLA |

Why choose Kong Konnect over self-managed Kong Gateway?

  • Faster to deploy: no need to install, configure, secure Control Plane.
  • Zero maintenance: no patching, backups, upgrades for control plane.
  • Global availability: Control Plane is multi-region by default.
  • Modern features: You get new Kong features faster (Konnect users get earlier access).
  • Built-in observability: native dashboards, logging, analytics ready.
  • Multi-tenant support: You can separate teams, apps, etc easily.
  • Reduced DevOps overhead: focus only on managing lightweight Data Planes.

JWTs and client certificates

Great question — JWTs and client certificates are both authentication methods, but they are not directly dependent on each other. They solve different security goals, and in some advanced setups, they can complement each other.

Let’s break it down:


🔄 JWT vs. Client Certificate — Purpose

| Feature | JWT | Client Certificate (mTLS) |
|---|---|---|
| Type | Token-based authentication | Certificate-based mutual TLS (mTLS) |
| Validated by | Application / API gateway (e.g., Kong) | TLS handshake (mutual authentication) |
| Authenticates | Who you are (user/app identity) | What you are (trusted machine or client) |
| Revocation | Hard to revoke unless you use a blocklist | Can be revoked via CRL or OCSP |
| Stateless | ✅ Yes, self-contained | ❌ No, cert revocation/status may require state |
| Setup complexity | Moderate | Higher (requires PKI, CA, trust setup) |
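Because they answer different questions, the two can be layered on one Kong service. A sketch of what that would look like via the Admin API; the service name, Admin API address, and CA certificate id are placeholders, and mtls-auth is a Kong Enterprise plugin:

```shell
#!/bin/bash
# Layering both on one service: mTLS authenticates the client machine,
# JWT authenticates the caller identity.
KONG_ADMIN="${KONG_ADMIN:-http://localhost:8001}"   # example Admin API address
SERVICE="my-service"                                # example service name

JWT_CMD="curl -s -X POST $KONG_ADMIN/services/$SERVICE/plugins --data name=jwt"
MTLS_CMD="curl -s -X POST $KONG_ADMIN/services/$SERVICE/plugins \
  --data name=mtls-auth --data config.ca_certificates=<ca-cert-uuid>"

# Printed rather than executed, so the sketch is safe to run anywhere
echo "$JWT_CMD"
echo "$MTLS_CMD"
```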

Kong Troubleshooting

“Invalid status code received from the token endpoint” means Kong tried to exchange an authorization code for a token, but the PingFederate token endpoint replied with an error.

302 Found:

  • Kong redirects the client to the authorization endpoint of PingFederate.
  • This is normal behavior during the initial OIDC flow (when no token is present).

401 Unauthorized (after redirect):

  • The client is redirected back to Kong with an authorization code.
  • Then Kong calls the token endpoint to exchange code → tokens.
  • But this step fails (e.g., bad client credentials, redirect URI mismatch, wrong token endpoint).
  • Result: 401 Unauthorized, often shown to the user after the browser returns from the IdP.

A 400 Bad Request from the OpenID Connect token endpoint usually means something is wrong with the request payload you’re sending. This often happens during a token exchange or authorization code flow.

Let’s troubleshoot it step by step:

🔍 Common Causes of 400 from Token Endpoint

  1. Invalid or missing parameters
    • Missing grant_type, client_id, client_secret, code, or redirect_uri
    • Using wrong grant_type (e.g., should be authorization_code, client_credentials, refresh_token, etc.)
  2. Mismatched or invalid redirect URI
    • Must match the URI registered with the provider exactly.
  3. Invalid authorization code
    • Expired or already used.
  4. Invalid client credentials
    • Bad client_id / client_secret
  5. Wrong Content-Type
    • The request should use: Content-Type: application/x-www-form-urlencoded
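For comparison, this is what a well-formed authorization_code token request looks like. The PingFederate endpoint path and all client values are placeholders to substitute:

```shell
#!/bin/bash
# A well-formed authorization_code token exchange request.
TOKEN_ENDPOINT="https://pingfederate.example.com/as/token.oauth2"   # placeholder

REQUEST="curl -s -X POST $TOKEN_ENDPOINT \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  --data-urlencode grant_type=authorization_code \
  --data-urlencode code=<auth-code> \
  --data-urlencode redirect_uri=https://kong.example.com/callback \
  --data-urlencode client_id=<client-id> \
  --data-urlencode client_secret=<client-secret>"

echo "$REQUEST"   # printed, not executed; fill in real values first
```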

To know why Ping returned 400, you need to:

  1. Check PingFederate logs, which often show a detailed error such as:
    • Invalid redirect_uri
    • Invalid client credentials
    • Unsupported grant_type

Kong is probably misconfigured or failing to capture the code from the redirect step before trying the token exchange.

This usually happens due to:

  • Misconfigured redirect_uri
  • Missing or misrouted callback handling (/callback)
  • Client app hitting the wrong route first
  • Kong OIDC plugin misconfigured (missing session_secret, or improper auth_methods)
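A configuration sketch covering those points, using Kong's openid-connect (Enterprise) plugin field names; every value here is a placeholder:

```shell
#!/bin/bash
# Sketch of an openid-connect plugin setup that covers the usual misconfigurations.
KONG_ADMIN="${KONG_ADMIN:-http://localhost:8001}"   # example Admin API address

SETUP="curl -s -X POST $KONG_ADMIN/plugins \
  --data name=openid-connect \
  --data config.issuer=https://pingfederate.example.com \
  --data config.client_id=<client-id> \
  --data config.client_secret=<client-secret> \
  --data config.redirect_uri=https://kong.example.com/callback \
  --data config.auth_methods=authorization_code \
  --data config.session_secret=<random-string>"

echo "$SETUP"   # printed, not executed
```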

Troubleshooting

Migrating from Kong Gateway to Kong Konnect

Migrating from Kong Gateway (self-managed/on-prem) to Kong Konnect (cloud-managed) involves a combination of:

  • Exporting your current Kong configuration
  • Translating any on-prem customizations or plugins
  • Importing services and routes into Konnect
  • Updating auth, plugins, and Dev Portal configuration
  • Re-pointing your traffic and observability tools

Here’s a step-by-step migration plan with optional tooling for automation:


Step 1: Inventory Your Current Kong Gateway

Start by identifying all current components:

  • Services
  • Routes
  • Plugins
  • Consumers & credentials
  • RBAC users & roles
  • Custom plugins (if any)
  • Certificates
  • Upstreams / Targets
  • Rate limiting or security policies

You can use:

deck dump --kong-addr http://<admin-api>:8001 --output-file kong-export.yaml

This uses decK, a declarative config tool for Kong.


Step 2: Set Up Kong Konnect

  1. Sign up for Kong Konnect
  2. Create a Runtime Group (this is where your data plane will connect)
  3. Install Kong Gateway (in Konnect mode) as the Data Plane:

     curl -O https://download.konghq.com/gateway-3.x-centos/Packages/k/kong-3.x.rpm

     Then configure it with:

     role: data_plane
     cluster_control_plane: <Konnect CP endpoint>
     cluster_telemetry_endpoint: <Telemetry CP endpoint>

Step 3: Translate & Import Configuration

Use decK to sync into Konnect:

deck sync --konnect-runtime-group <runtime-group-name> \
          --konnect-token <your-token> \
          --state kong-export.yaml

DecK v1.16+ supports direct Konnect import via --konnect flags.

Note: decK does not migrate:

  • RBAC user roles
  • Developer Portal assets (you’ll need to re-upload manually)
  • Custom plugins (must be re-implemented and built for Konnect if supported)

Step 4: Migrate Authentication & Plugins

  • Consumers / Auth: Recreate consumers in Konnect or use Konnect Dev Portal to register apps
  • Certificates: Re-upload any TLS certs to Konnect
  • Custom Plugins: Migrate only if they are supported on Kong Konnect. Otherwise, consider rewriting logic using Lua/Python and submit to Kong support if needed.

Step 5: Reconfigure Observability

Kong Konnect offers built-in integrations:

  • Logs: Datadog, HTTP log, Splunk (via plugin)
  • Metrics: Prometheus, Kong Vitals
  • Use the Konnect GUI or API to configure logging plugins

Step 6: Redirect Traffic to Konnect Runtime

  • Update DNS or Load Balancer to send traffic to new Konnect Data Plane IPs
  • Perform traffic shadowing/canary if needed

Final Step: Validation & Cutover

  • Smoke test all endpoints
  • Test rate limits, auth flows, consumer access
  • Validate logs and metrics collection
  • Disable/decommission legacy Kong Gateway only after validation
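The smoke-test step can be scripted. A minimal sketch; the proxy address and endpoint paths are example values:

```shell
#!/bin/bash
# Minimal post-cutover smoke test through the Konnect data plane.
PROXY="https://konnect-dp.example.com"      # example data plane address
ENDPOINTS=(/api/v1/health /api/v1/users)    # example paths
DRY_RUN="${DRY_RUN:-1}"                     # set DRY_RUN=0 to actually send requests

for path in "${ENDPOINTS[@]}"; do
  if [ "$DRY_RUN" = "1" ]; then
    echo "would check: $PROXY$path"
  else
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$PROXY$path")
    echo "$PROXY$path -> HTTP $code"
  fi
done
```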

Databricks

Databricks is a cloud-based data platform built for data engineering, data science, machine learning, and analytics. It provides a unified environment that integrates popular open-source tools like Apache Spark, Delta Lake, and MLflow, and is designed to simplify working with big data and AI workloads at scale.


What Databricks Does

Databricks allows you to:

  • Ingest, clean, and transform large volumes of data
  • Run machine learning models and notebooks collaboratively
  • Perform interactive and batch analytics using SQL, Python, R, Scala, and more
  • Securely govern and share data across teams and workspaces

Core Components

| Component | Description |
|---|---|
| Databricks Workspace | Your development environment for notebooks, jobs, and clusters |
| Clusters | Scalable compute resources (based on Apache Spark) |
| Delta Lake | Open-source storage layer that adds ACID transactions and versioning to data lakes |
| Unity Catalog | Centralized data governance and access control layer |
| MLflow | Manages the lifecycle of machine learning experiments, models, and deployments |
| Jobs | Scheduled or triggered ETL pipelines and batch workloads |
| SQL Warehouses | Serverless SQL compute for BI and analytics workloads |

Runs on Major Clouds

  • AWS
  • Microsoft Azure
  • Google Cloud

Use Cases

  • Data lakehouse architecture
  • ETL/ELT processing
  • Business intelligence and analytics
  • Real-time streaming data processing
  • Machine learning and MLOps
  • GenAI development using large language models

Quick Analogy:

Think of Databricks as a “data factory + AI lab + SQL analytics tool” all in one, built on top of scalable cloud compute and storage.

Shell script: check a file across servers

#!/bin/bash

# List of your servers (can be IPs or hostnames)
SERVERS=(
  server1.example.com
  server2.example.com
  server3.example.com
  server4.example.com
  server5.example.com
  server6.example.com
)

FILE_PATH="/opt/pfengine/file.txt"

for server in "${SERVERS[@]}"; do
  echo "🔍 Checking $FILE_PATH on $server"

  if ! ssh -o ConnectTimeout=5 "$server" "ls -l $FILE_PATH" 2>/dev/null; then
    echo "❌ Could not access file on $server"
  fi

  echo "--------------------------------------"
done

Allow LDAP users to access the Kong Manager GUI in Kong Gateway

To allow LDAP users to access the Kong Manager GUI in Kong Gateway Enterprise 3.4, you’ll need to integrate LDAP authentication via the Kong Enterprise Role-Based Access Control (RBAC) system.

Here’s how you can get it working step-by-step 👇


👤 Step 1: Configure LDAP Authentication for Kong Manager

Edit your kong.conf or pass these as environment variables if you’re using a container setup.

admin_gui_auth = ldap-auth
admin_gui_auth_conf = {
  "ldap_host": "ldap.example.com",
  "ldap_port": 389,
  "ldap_base_dn": "dc=example,dc=com",
  "ldap_attribute": "uid",
  "ldap_bind_dn": "cn=admin,dc=example,dc=com",
  "ldap_password": "adminpassword",
  "start_tls": false,
  "verify_ldap_host": false
}

✅ If you’re using LDAPS, set ldap_port = 636 and keep start_tls = false: StartTLS upgrades a plain-text connection on port 389, so it does not apply to LDAPS.

Restart Kong after updating this config.
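Before restarting Kong, it can save time to verify the bind DN and credentials outside Kong. A sketch using ldapsearch (needs ldap-utils / openldap-clients; host and DNs mirror the sample config above, password supplied via an environment variable):

```shell
#!/bin/bash
# Verify the LDAP bind works before pointing Kong at it.
LDAP_HOST="ldap.example.com"
BASE_DN="dc=example,dc=com"
BIND_DN="cn=admin,dc=example,dc=com"

if command -v ldapsearch >/dev/null 2>&1; then
  # Look up the test user; a clean "dn:" result means bind + search both work
  ldapsearch -x -H "ldap://$LDAP_HOST:389" \
    -D "$BIND_DN" -w "$LDAP_PASSWORD" \
    -b "$BASE_DN" "(uid=jdoe)" dn
else
  echo "ldapsearch not found (install ldap-utils / openldap-clients)"
fi
```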


👥 Step 2: Create an RBAC User Linked to the LDAP Username

Kong still needs an RBAC user that maps to the LDAP-authenticated identity.

curl -i -X POST http://localhost:8001/rbac/users \
  --data "name=jdoe" \
  --data "user_token=jdoe-admin-token"

The name here must match the LDAP uid or whatever attribute you configured with ldap_attribute.


🔐 Step 3: Assign a Role to the RBAC User

curl -i -X POST http://localhost:8001/rbac/users/jdoe/roles \
  --data "roles=read-only"  # Or "admin", "super-admin", etc.

Available roles: read-only, admin, super-admin, or your own custom roles.


🔓 Step 4: Log into Kong Manager with LDAP User

Go to your Kong Manager GUI:

https://<KONG_MANAGER_URL>:8445

Enter:

  • Username: jdoe (LDAP uid)
  • Password: LDAP user’s actual password (Kong will bind to LDAP and verify it)

🛠️ Optional: Test LDAP Config from CLI

You can test the LDAP binding from Kong CLI:

curl -i -X POST http://localhost:8001/rbac/users \
  --data "name=testuser" \
  --data "user_token=test123"

Then try logging into Kong Manager with testuser using their LDAP password.


Kong logs (2 Zones, 4 Servers—-> Splunk)

In Your Setup:

each zone has its own shared DB:

  • Zone A (A1 & A2) → DB-A
  • Zone B (B1 & B2) → DB-B

That implies:

  • You need to configure the plugin in both DBs (once per zone).
  • So you only need to enable the plugin once per zone, using the Admin API on one node in each zone.

✅ What You Should Do:

  1. Run this plugin setup command on one Kong node per zone (e.g., A1 and B1):

curl -i -X POST http://localhost:8001/plugins/ \
  --data "name=http-log" \
  --data "config.http_endpoint=https://splunk-hec.example.com:8088/services/collector" \
  --data "config.method=POST" \
  --data "config.headers[Authorization]=Splunk YOUR-HEC-TOKEN" \
  --data "config.queue.size=1000"

  2. Confirm it’s active via:

curl http://localhost:8001/plugins
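To see only the http-log entry rather than the full plugin list, a jq filter sketch (Admin API address is the default; adjust as needed):

```shell
#!/bin/bash
# Filter the Admin API response down to the http-log plugin with jq.
KONG_ADMIN="${KONG_ADMIN:-http://localhost:8001}"

CHECK="curl -s $KONG_ADMIN/plugins | jq '.data[] | select(.name==\"http-log\") | {name, enabled}'"
echo "$CHECK"   # run on one node in each zone; printed here, not executed
```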


🛡️ Bonus Tip: Tag Your Logs by Node/Zone

To make Splunk logs more useful, you can:

  • Add custom headers or query parameters with zone info.
  • Use a transform or custom_fields in Splunk to tag logs from Zone A vs B.

Example:

--data "config.headers[X-Kong-Zone]=zone-a"