GKE Best Practices for Optimal Performance


What is GKE?

Google Kubernetes Engine is Google Cloud’s managed Kubernetes service — Google manages the control plane, you manage the worker nodes (or let Autopilot manage everything).

GKE Modes:
┌─────────────────────────────┬────────────────────────────┐
│ Standard Mode               │ Autopilot Mode             │
├─────────────────────────────┼────────────────────────────┤
│ You manage node pools       │ Google manages everything  │
│ You choose machine types    │ Pay per pod, not per node  │
│ Full node customization     │ No node management         │
│ More control                │ More managed/serverless    │
│ Best for: complex workloads │ Best for: simplicity       │
└─────────────────────────────┴────────────────────────────┘
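If you choose Autopilot, cluster creation shrinks to a single command; a minimal sketch (cluster name is a placeholder):

```shell
# Autopilot clusters are always regional and fully managed:
# you pick a name and region, Google handles nodes, scaling, and security defaults
gcloud container clusters create-auto my-autopilot-cluster \
  --region us-central1
```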

1. Cluster Architecture Best Practices

Use Regional Clusters (Not Zonal)
# ❌ Zonal — single point of failure
gcloud container clusters create my-cluster \
  --zone us-central1-a

# ✅ Regional — control plane + nodes across 3 zones
gcloud container clusters create my-cluster \
  --region us-central1 \
  --num-nodes 2  # 2 per zone = 6 total nodes

Zonal cluster: the control plane and all nodes sit in us-central1-a; if that zone fails, the cluster is down.
Regional cluster: control plane replicas and nodes are spread across us-central1-a, -b, and -c; a single zone failure leaves the cluster healthy.
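Once the regional cluster is up, you can confirm the zone spread from the node labels:

```shell
# List nodes with their zone label — expect nodes in all three zones
kubectl get nodes -L topology.kubernetes.io/zone
```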
Separate Node Pools by Workload Type
# System node pool — for cluster components
gcloud container node-pools create system-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-standard-2 \
  --num-nodes 1 \
  --node-taints CriticalAddonsOnly=true:NoSchedule \
  --node-labels pool=system

# Application node pool — for your apps
gcloud container node-pools create app-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-standard-4 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --node-labels pool=application

# GPU node pool — for ML workloads
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 5 \
  --node-taints nvidia.com/gpu=present:NoSchedule

# Spot node pool — for batch / fault-tolerant workloads
gcloud container node-pools create spot-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n2-standard-4 \
  --spot \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 20
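Taints only keep workloads out; pods opt in with a matching toleration plus a node selector. A sketch of a pod targeting the GPU pool above (image name is a placeholder; `cloud.google.com/gke-accelerator` is the label GKE sets on accelerator nodes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  # Land on GPU nodes only...
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  # ...and tolerate the pool's taint so scheduling is allowed
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  containers:
  - name: trainer
    image: gcr.io/myproject/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1  # request one GPU
```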
Terraform Cluster Setup
# main.tf
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1" # regional

  # Remove default node pool — use custom ones
  remove_default_node_pool = true
  initial_node_count       = 1

  # Networking
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name

  # Intra-node visibility — flow logs cover pod-to-pod traffic on the same node
  enable_intra_node_visibility = true

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  # Private cluster — no public node IPs
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Authorized networks for control plane access
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8"
      display_name = "internal"
    }
    cidr_blocks {
      cidr_block   = var.office_ip
      display_name = "office"
    }
  }

  # Security
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Enable addons
  addons_config {
    horizontal_pod_autoscaling { disabled = false }
    http_load_balancing { disabled = false }
    network_policy_config { disabled = false }
    gce_persistent_disk_csi_driver_config { enabled = true }
    gcs_fuse_csi_driver_config { enabled = true }
  }

  # Enable network policy
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  # Cluster autoscaling
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 100
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 400
    }
    auto_provisioning_defaults {
      service_account = google_service_account.nodes.email
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
    }
  }

  # Maintenance window
  maintenance_policy {
    recurring_window {
      start_time = "2024-01-01T02:00:00Z"
      end_time   = "2024-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }

  # Logging and monitoring
  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus {
      enabled = true
    }
  }

  # Release channel — get automatic updates
  release_channel {
    channel = "REGULAR"
  }
}
# System node pool
resource "google_container_node_pool" "system" {
  name       = "system-pool"
  cluster    = google_container_cluster.primary.name
  location   = "us-central1"
  node_count = 1

  node_config {
    machine_type    = "n2-standard-2"
    service_account = google_service_account.nodes.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    workload_metadata_config {
      mode = "GKE_METADATA" # Workload Identity
    }
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    taint {
      key    = "CriticalAddonsOnly"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
    labels = {
      pool = "system"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

# Application node pool with autoscaling
resource "google_container_node_pool" "application" {
  name     = "app-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    machine_type    = "n2-standard-4"
    disk_size_gb    = 100
    disk_type       = "pd-ssd"
    service_account = google_service_account.nodes.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    workload_metadata_config {
      mode = "GKE_METADATA"
    }
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    labels = {
      pool = "application"
      env  = "production"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
}

2. Security Best Practices

Workload Identity (No Service Account Keys)
# Create GCP service account
gcloud iam service-accounts create api-sa \
  --display-name="API Service Account"

# Grant permissions to GCP SA
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:api-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Create Kubernetes service account
kubectl create serviceaccount api-ksa -n production

# Bind K8s SA to GCP SA
gcloud iam service-accounts add-iam-policy-binding \
  api-sa@$PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:$PROJECT_ID.svc.id.goog[production/api-ksa]"

# Annotate K8s SA
kubectl annotate serviceaccount api-ksa \
  -n production \
  iam.gke.io/gcp-service-account=api-sa@$PROJECT_ID.iam.gserviceaccount.com

# Pod uses Workload Identity — no key files needed
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  template:
    spec:
      serviceAccountName: api-ksa # ← K8s SA with WI annotation
      containers:
      - name: api
        image: gcr.io/myproject/api:latest
        # GCP SDK auto-detects credentials via metadata server
        # No GOOGLE_APPLICATION_CREDENTIALS needed
Pod Security Standards
# Enforce restricted security for namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Pod that meets restricted standards
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        image: gcr.io/myproject/api:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp # writable tmp dir
        - name: cache
          mountPath: /app/cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
Binary Authorization
# Enable Binary Authorization
gcloud services enable binaryauthorization.googleapis.com

# Create attestor — only signed images can deploy
gcloud container binauthz attestors create production-attestor \
  --attestation-authority-note=projects/$PROJECT_ID/notes/production-note \
  --attestation-authority-note-project=$PROJECT_ID

# Set policy — require attestation
cat > /tmp/policy.yaml << EOF
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  requireAttestationsBy:
  - projects/$PROJECT_ID/attestors/production-attestor
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
EOF
gcloud container binauthz policy import /tmp/policy.yaml
Network Policies
# Default deny all
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow api to reach database only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - port: 5432
---
# Allow egress to Google APIs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-google-apis
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 199.36.153.4/30 # restricted.googleapis.com VIP
    ports:
    - port: 443
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - port: 53
      protocol: UDP # DNS
Secret Management with Secret Manager
# Use External Secrets Operator to sync GCP secrets → K8s secrets
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcp-secret-store
  namespace: production
spec:
  provider:
    gcpsm:
      projectID: my-project-id
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-store
    kind: SecretStore
  target:
    name: api-secrets # creates K8s secret
    creationPolicy: Owner
  data:
  - secretKey: db-password
    remoteRef:
      key: prod/api/db-password
  - secretKey: api-key
    remoteRef:
      key: prod/api/external-api-key

3. Resource Management Best Practices

Always Set Resource Requests and Limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        image: gcr.io/myproject/api:latest
        resources:
          requests:
            cpu: "250m"     # guaranteed CPU
            memory: "256Mi" # guaranteed memory
          limits:
            cpu: "500m"     # max CPU (throttled if exceeded)
            memory: "512Mi" # max memory (OOM killed if exceeded)
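Kubernetes quantities mix millicore and binary/decimal byte suffixes, which makes capacity math error-prone. A small hypothetical helper (not part of any Kubernetes client library) to convert them to base units:

```python
# Convert Kubernetes resource quantities to base units.
# "250m" CPU → 0.25 cores; "512Mi" memory → bytes.

def parse_cpu(q: str) -> float:
    """CPU quantity: plain cores ("2") or millicores ("250m")."""
    return int(q[:-1]) / 1000 if q.endswith("m") else float(q)

def parse_memory(q: str) -> int:
    """Memory quantity with binary (Ki/Mi/Gi) or decimal (K/M/G) suffix."""
    suffixes = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
                "K": 1000, "M": 1000**2, "G": 1000**3}
    for suffix, factor in suffixes.items():  # binary suffixes checked first
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # plain bytes

print(parse_cpu("250m"))      # 0.25
print(parse_memory("512Mi"))  # 536870912
```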
LimitRange — Default Limits per Namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:        # default limit if not set
      cpu: "500m"
      memory: "512Mi"
    defaultRequest: # default request if not set
      cpu: "100m"
      memory: "128Mi"
    max:            # hard max per container
      cpu: "4"
      memory: "8Gi"
    min:            # minimum per container
      cpu: "50m"
      memory: "64Mi"
  - type: Pod
    max:
      cpu: "8"
      memory: "16Gi"
  - type: PersistentVolumeClaim
    max:
      storage: "100Gi"
ResourceQuota per Namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    # Compute
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    # Objects
    pods: "100"
    services: "20"
    persistentvolumeclaims: "20"
    secrets: "50"
    configmaps: "50"
    # Service types
    services.loadbalancers: "3"
    services.nodeports: "0"
Priority Classes
# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
globalDefault: false
description: "Critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
description: "Important production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 1000
description: "Batch and background jobs"
---
# Use in deployment
spec:
  template:
    spec:
      priorityClassName: critical # ← won't be evicted for lower priority

4. Autoscaling Best Practices

Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60 # scale at 60% CPU
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: my-subscription
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100 # double pods in one step
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5min before scale down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
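The controller's core scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the real controller also applies a tolerance band and the behavior policies above, but the arithmetic itself is simple:

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float) -> int:
    """HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_value / target_value)

# 5 pods at 90% CPU against a 60% target → scale out to 8
print(desired_replicas(5, 90, 60))   # 8
# 10 pods at 30% against a 60% target → scale in to 5
print(desired_replicas(10, 30, 60))  # 5
```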
Vertical Pod Autoscaler
# Install VPA first (ships in the kubernetes/autoscaler repo):
#   git clone https://github.com/kubernetes/autoscaler
#   cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off" # Recommend only — don't auto-update
    # Options: Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
      controlledResources:
      - cpu
      - memory
---
# Check VPA recommendations:
#   kubectl describe vpa api-vpa -n production
# Look for: Status.Recommendation.ContainerRecommendations
Cluster Autoscaler Best Practices
# Pod Disruption Budget — prevent CA from evicting too many pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2 # keep at least 2 pods running
  # OR
  # maxUnavailable: 1 # allow at most 1 pod down
  selector:
    matchLabels:
      app: api

# Configure cluster autoscaler behavior
# (the autoscaling profile is a cluster-level setting, not per node pool)
gcloud container clusters update my-cluster \
  --region us-central1 \
  --autoscaling-profile optimize-utilization # or balanced

5. Networking Best Practices

Use Private Cluster with VPC-Native Networking
# Create VPC and subnets
gcloud compute networks create prod-vpc \
  --subnet-mode custom

gcloud compute networks subnets create prod-subnet \
  --network prod-vpc \
  --region us-central1 \
  --range 10.0.0.0/20 \
  --secondary-range pods=10.4.0.0/14,services=10.0.16.0/20

# Create private cluster
gcloud container clusters create prod-cluster \
  --region us-central1 \
  --network prod-vpc \
  --subnetwork prod-subnet \
  --cluster-secondary-range-name pods \
  --services-secondary-range-name services \
  --enable-private-nodes \
  --master-ipv4-cidr 172.16.0.0/28 \
  --enable-ip-alias
Cloud Armor WAF for Ingress
# BackendConfig — attach Cloud Armor policy
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: api-backend-config
  namespace: production
spec:
  securityPolicy:
    name: prod-waf-policy # Cloud Armor policy name
  connectionDraining:
    drainingTimeoutSec: 60
  healthCheck:
    checkIntervalSec: 15
    timeoutSec: 15
    healthyThreshold: 1
    unhealthyThreshold: 2
    type: HTTP
    requestPath: /health
    port: 8080
---
# Service references BackendConfig
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
  annotations:
    cloud.google.com/backend-config: '{"default":"api-backend-config"}'
    cloud.google.com/neg: '{"ingress": true}' # Container-native LB
spec:
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
# GKE Ingress with HTTPS and managed cert
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: gce
    kubernetes.io/ingress.global-static-ip-name: prod-ip
    networking.gke.io/managed-certificates: api-cert
    kubernetes.io/ingress.allow-http: "false"
spec:
  rules:
  - host: api.acme.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80

# Create Cloud Armor WAF policy
gcloud compute security-policies create prod-waf-policy \
  --description "Production WAF policy"

# Enable OWASP rules
gcloud compute security-policies rules create 1000 \
  --security-policy prod-waf-policy \
  --expression "evaluatePreconfiguredExpr('xss-v33-stable')" \
  --action deny-403

# Rate limiting
gcloud compute security-policies rules create 2000 \
  --security-policy prod-waf-policy \
  --expression "true" \
  --action throttle \
  --rate-limit-threshold-count 1000 \
  --rate-limit-threshold-interval-sec 60 \
  --conform-action allow \
  --exceed-action deny-429

6. Reliability Best Practices

Pod Anti-Affinity — Spread Across Zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 6
  template:
    spec:
      # Spread across zones
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api
      # Don't put two api pods on same node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: api
            topologyKey: kubernetes.io/hostname
Readiness and Liveness Probes
containers:
- name: api
  image: gcr.io/myproject/api:latest
  ports:
  - containerPort: 8080
  # Liveness — restart pod if unhealthy
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 30 # wait before first check
    periodSeconds: 10
    failureThreshold: 3 # fail 3 times = restart
    timeoutSeconds: 5
  # Readiness — remove from LB if not ready
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 3
    successThreshold: 1
  # Startup — for slow-starting apps
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30 # allow 5 min to start
    periodSeconds: 10
Graceful Shutdown
spec:
  # Allow time for preStop + app shutdown (pod-level setting)
  terminationGracePeriodSeconds: 60
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - sleep 15 # allow LB to drain connections

7. Cost Optimization Best Practices

Use Spot VMs for Non-Critical Workloads
# Schedule batch jobs on spot nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      # Target spot node pool
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
      - key: "cloud.google.com/gke-spot"
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Handle spot preemption gracefully
      terminationGracePeriodSeconds: 25 # spot gives 30s warning
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: gcr.io/myproject/processor:latest
Committed Use Discounts
# Purchase committed use for baseline workloads
gcloud compute commitments create prod-commitment \
  --plan 12-month \
  --region us-central1 \
  --resources vcpu=20,memory=80GB
# Savings: ~37% for 1-year, ~55% for 3-year
Node Auto-Provisioning with Resource Limits
# Set cluster-level resource limits for NAP
gcloud container clusters update prod-cluster \
  --region us-central1 \
  --enable-autoprovisioning \
  --max-cpu 100 \
  --max-memory 400 \
  --min-cpu 4 \
  --min-memory 16 \
  --autoprovisioning-scopes=https://www.googleapis.com/auth/cloud-platform

8. Observability Best Practices

Google Cloud Managed Prometheus

# Enable managed Prometheus (built into GKE)
gcloud container clusters update prod-cluster \
  --region us-central1 \
  --enable-managed-prometheus

# Deploy PodMonitoring to scrape your apps
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: api-monitoring
  namespace: production
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
Structured Logging
# Always log in JSON format for Cloud Logging
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
            "component": record.name,
            "httpRequest": getattr(record, "httpRequest", None),
            "labels": {
                "service": "api",
                "version": "v2",
                "env": "production",
            },
        })
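Wiring such a formatter into a logger is the standard stdlib pattern, not GKE-specific; a self-contained sketch with a pared-down formatter:

```python
import json
import logging

# Minimal JSON formatter in the same style as above
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({"severity": record.levelname,
                           "message": record.getMessage()})

handler = logging.StreamHandler()  # stdout/stderr is picked up by the logging agent
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("request served")
# emits one JSON object per line, e.g. {"severity": "INFO", "message": "request served"}
```

Cloud Logging maps the top-level "severity" field to its own log levels, so log lines become filterable without any parsing configuration.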

Cloud Trace Integration

# Auto-instrument with OpenTelemetry
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
trace.set_tracer_provider(provider)

9. CI/CD Best Practices

Cloud Build + Artifact Registry
# cloudbuild.yaml
steps:
# Build image
- name: gcr.io/cloud-builders/docker
  args:
  - build
  - -t
  - us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
  - -t
  - us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:latest
  - .
# Scan for vulnerabilities
- name: gcr.io/cloud-builders/gcloud
  args:
  - artifacts
  - docker
  - images
  - scan
  - us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
  - --format=json
# Push to Artifact Registry
- name: gcr.io/cloud-builders/docker
  args:
  - push
  - --all-tags
  - us-central1-docker.pkg.dev/$PROJECT_ID/prod/api
# Deploy to GKE
- name: gcr.io/cloud-builders/kubectl
  args:
  - set
  - image
  - deployment/api
  - api=us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
  - -n
  - production
  env:
  - CLOUDSDK_COMPUTE_REGION=us-central1
  - CLOUDSDK_CONTAINER_CLUSTER=prod-cluster
options:
  machineType: E2_HIGHCPU_8
  logging: CLOUD_LOGGING_ONLY

10. GKE Best Practices Checklist

Cluster Setup
✅ Regional cluster (not zonal)
✅ Private cluster (no public node IPs)
✅ Separate node pools by workload type
✅ Release channel enabled (auto-updates)
✅ Maintenance window set
✅ VPC-native networking
Security
✅ Workload Identity (no SA keys)
✅ Binary Authorization
✅ Pod Security Standards (restricted)
✅ Network Policies (default deny)
✅ Secrets in Secret Manager
✅ Shielded nodes enabled
✅ Container image scanning
✅ Cloud Armor WAF on ingress
Resource Management
✅ Requests and limits on every container
✅ LimitRange per namespace
✅ ResourceQuota per namespace
✅ Priority classes defined
✅ PodDisruptionBudgets set
Reliability
✅ Minimum 3 replicas for prod services
✅ Pod anti-affinity across zones
✅ HPA configured
✅ Liveness + readiness + startup probes
✅ Graceful shutdown (preStop + terminationGrace)
✅ PodDisruptionBudget (minAvailable ≥ 1)
Cost
✅ Spot VMs for batch/non-critical
✅ Committed use discounts for baseline
✅ Cluster autoscaler enabled
✅ VPA recommendations reviewed
✅ Node auto-provisioning for mixed workloads
Observability
✅ Managed Prometheus enabled
✅ Cloud Logging with structured JSON
✅ Cloud Trace instrumented
✅ Dashboards for golden signals
✅ Alerts on SLO breaches

GKE best practices come down to three pillars — security by default (private cluster, Workload Identity, least privilege), reliability by design (regional cluster, anti-affinity, autoscaling, probes), and cost efficiency (spot VMs, committed use, right-sizing with VPA). Get these right from day one and you avoid the most painful production incidents.
