GKE Security Best Practices: A Comprehensive Guide

Here’s a practical, real-world GKE security best practices guide. I’ll focus on production-grade controls, interview depth, and actionable configs.


GKE Security Best Practices (Enterprise-Ready)

1) Cluster Architecture & Isolation

Use Private Clusters (MANDATORY)

  • Disable public control plane access
  • Use authorized networks if public endpoint is required
  • Enable:
    • Private nodes
    • Private control plane endpoint
gcloud container clusters create secure-cluster \
  --enable-ip-alias \
  --enable-private-nodes \
  --enable-private-endpoint \
  --enable-master-authorized-networks \
  --master-ipv4-cidr=172.16.0.0/28

Separate Node Pools (Blast Radius Control)

  • System workloads vs application workloads
  • High-risk workloads in isolated pools

Multi-zone / Regional Clusters

  • Improves availability: a single-zone outage does not take down the control plane or every node

2) Identity & Access Management (IAM + RBAC)

Use Google Cloud IAM + Kubernetes RBAC together

  • IAM → controls access to GKE API
  • RBAC → controls inside cluster

Enable Workload Identity (CRITICAL)

  • Replace service account keys (never use JSON keys)
  • Secure pod → GCP API access
gcloud container clusters update secure-cluster \
--workload-pool=PROJECT_ID.svc.id.goog

Principle of Least Privilege

  • No cluster-admin unless absolutely required
  • Use Role + RoleBinding instead of ClusterRole

3) Network Security

✅ Enable Network Policies (Calico)

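# On an existing cluster, enable the add-on first, then turn on enforcement
gcloud container clusters update secure-cluster --update-addons=NetworkPolicy=ENABLED
gcloud container clusters update secure-cluster --enable-network-policy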

Example:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
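
A default-deny policy only covers the namespace it is created in, so it usually has to be repeated per namespace. The following is a minimal Python sketch, assuming hypothetical namespace names (`production`, `staging`), that builds the same manifest as a dictionary and prints it as JSON (which `kubectl apply -f` also accepts). It is an illustration, not a replacement for a policy-as-code tool.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a default-deny NetworkPolicy manifest for one namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules blocks all ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace names, for illustration only.
    for ns in ["production", "staging"]:
        print(json.dumps(default_deny_policy(ns), indent=2))
```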

✅ Restrict Egress Traffic

  • Prevent data exfiltration
  • Only allow required endpoints (e.g., APIs)

✅ Use Internal Load Balancers

  • Avoid public exposure unless necessary

✅ Use Service Mesh (mTLS)

Use Istio:

  • Encrypt pod-to-pod traffic
  • Enforce zero-trust networking

4) Node & OS Security

✅ Use Shielded GKE Nodes

  • Secure boot
  • Integrity monitoring

✅ Enable GKE Sandbox (gVisor)

  • Strong workload isolation

✅ Use COS (Container-Optimized OS)

  • Minimal attack surface
  • Auto-updates

✅ Disable SSH Access

  • Use IAP or OS Login instead

5) Workload Security (Pods)

✅ Use Pod Security Standards (PSS)

  • Enforce:
    • restricted policy
    • No privileged containers

✅ Run as Non-Root

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false

✅ Read-Only Root Filesystem

securityContext:
  readOnlyRootFilesystem: true

✅ Drop Linux Capabilities

capabilities:
  drop:
    - ALL
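
The pod-level settings above can be checked mechanically before a manifest reaches the cluster. Below is a minimal Python sketch, assuming the container spec has already been loaded into a dictionary; the function name and the finding messages are made up for illustration. It inspects only the container-level `securityContext`, so a `runAsNonRoot` set at pod level would be reported as missing.

```python
def security_context_findings(container: dict) -> list:
    """Return a list of hardening problems for one container spec."""
    sc = container.get("securityContext", {})
    findings = []
    if sc.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot must be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation must be false")
    if sc.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem must be true")
    if sc.get("privileged") is True:
        findings.append("privileged containers are not allowed")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        findings.append("capabilities.drop must include ALL")
    return findings


if __name__ == "__main__":
    hardened = {
        "name": "api",
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
    }
    print(security_context_findings(hardened))
    print(security_context_findings({"name": "legacy"}))
```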

6) Image Security

Use Artifact Registry (private images)

  • Avoid Docker Hub in production

Enable Image Scanning

Use Google Artifact Registry:

  • Detect CVEs automatically

Use Trusted Images Only

  • Distroless images preferred
  • Pin image versions (no latest)
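
One way to enforce the "no latest" rule in a CI step is a small check on image references. The sketch below is an assumption about how such a check might look, not an official tool: it treats a reference as pinned if it carries a digest or any tag other than `latest`, and treats a missing tag as unpinned because the runtime then defaults to `latest`.

```python
def is_pinned(image: str) -> bool:
    """Return True if the image uses a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Look only at the last path segment so a registry port
    # (e.g. localhost:5000/api) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the runtime falls back to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    for ref in [
        "gcr.io/myproject/api:latest",
        "us-central1-docker.pkg.dev/myproject/prod/api:1.4.2",
        "localhost:5000/api",
    ]:
        print(ref, is_pinned(ref))
```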

7) Secrets Management

Never store secrets in YAML


Use Google Secret Manager

  • Integrate with Workload Identity

Enable Secret Encryption

--database-encryption-key=projects/.../cryptoKeys/...

8) Logging, Monitoring & Threat Detection

Enable Cloud Logging & Monitoring

  • Audit logs
  • VPC flow logs

Use Google Security Command Center

  • Detect misconfigurations
  • Threat detection

Enable Kubernetes Audit Logs

Critical for:

  • Who did what
  • API misuse

9) Policy Enforcement (VERY IMPORTANT)

Use Open Policy Agent / Gatekeeper

Example:

  • Block privileged containers
  • Enforce labels
  • Restrict images

Use Pod Security Admission (PSA)

  • Replaces PodSecurityPolicy (deprecated in Kubernetes 1.21, removed in 1.25)

10) Patch & Upgrade Strategy

Enable Auto Upgrade

  • Nodes + control plane

Use Release Channels

  • Rapid / Regular / Stable (use Regular/Stable for prod)

11) API & Ingress Security

Use Cloud Armor (WAF)

  • Protect ingress endpoints

Enable HTTPS Only

  • Use managed certs

Rate Limiting

  • Prevent abuse

12) Supply Chain Security (Advanced)

Binary Authorization

  • Only allow signed images

SBOM + Provenance

  • Verify build pipeline

Interview Cheat Sheet (Memorize This)

If asked: “How do you secure GKE?” → Answer like this:

👉 5-layer model:

  1. Identity
    • IAM + RBAC + Workload Identity
  2. Network
    • Private cluster + Network policies + mTLS
  3. Compute
    • Shielded nodes + gVisor
  4. Workloads
    • Non-root, no privilege, PSS
  5. Supply Chain
    • Image scanning + Binary Authorization

Real-World Failure Scenarios (Interview Gold)

Scenario 1: Data Exfiltration

  • Cause: No egress restrictions
  • Fix: NetworkPolicy + firewall rules

Scenario 2: Pod Escape

  • Cause: Privileged container
  • Fix: PSS + OPA

Scenario 3: Credential Leak

  • Cause: Service account JSON key
  • Fix: Workload Identity

Scenario 4: Public Exposure

  • Cause: Public LoadBalancer
  • Fix: Internal LB + Cloud Armor

Mastering GKE: Essential Questions for Kubernetes Interviews

Transitioning from AKS to GKE (Google Kubernetes Engine) for an interview requires understanding Google’s specific “flavor” of managed Kubernetes. GKE is often considered the most advanced managed service because it was built by the company that invented Kubernetes.

Here are the top GKE-specific interview questions categorized by role and complexity for 2026.


1. Architectural & Foundational

These questions test your understanding of GKE’s unique management models.

  • Standard vs. Autopilot: What is the primary difference between GKE Standard and GKE Autopilot? When would you choose one over the other?
    Answer Focus: Standard gives you full control over node management and configuration. Autopilot is a fully managed “hands-off” experience where Google manages the nodes, scaling, and security hardening, and you only pay for the pods you run.
  • Regional vs. Zonal Clusters: Why would you choose a Regional cluster over a Zonal one for a production environment?
    Answer Focus: Regional clusters replicate the Control Plane across three zones in a region, providing high availability (99.95% SLA) even if a whole zone goes down.
  • VPC-Native Clusters: What are VPC-native clusters, and why are they the default in 2026?
    Answer Focus: They use Alias IP ranges, allowing pod IPs to be natively routable within the VPC. This improves performance and allows pods to talk directly to other Google Cloud services (like Cloud SQL) without complex NAT rules.
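
To make the 99.95% figure concrete in an interview, it helps to translate it into a downtime budget. The following is a small arithmetic sketch, assuming a 30-day month as the measurement window (the window length is an assumption here, not something the SLA text above specifies).

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # 30 days = 43,200 minutes; 0.05% of that is the downtime budget.
    print(round(allowed_downtime_minutes(99.95), 1), "minutes per 30 days")
```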

2. Networking & Security

GKE has specific tools for identity and traffic management that differ from AKS.

  • Workload Identity: Explain how Workload Identity works. Why is it superior to using Service Account JSON keys?
    Answer Focus: It binds a Kubernetes Service Account (KSA) to a Google Cloud Service Account (GSA). This allows pods to securely call GCP APIs (like Storage or Vision) using short-lived tokens instead of risky, permanent static keys.
  • Gateway API vs. Ingress: GKE was one of the first to implement the Gateway API. How does it differ from traditional Ingress?
    Answer Focus: Gateway API is more expressive and role-oriented. It separates the infrastructure (GatewayClass) from the routing (HTTPRoute), allowing Ops and Dev teams to manage their parts independently.
  • Private Clusters: In a Private GKE cluster, how do nodes communicate with the Control Plane and the Internet?
    Answer Focus: Nodes have no public IPs. They use a Private Endpoint to talk to the Control Plane. To reach the internet (e.g., for updates), you must configure a Cloud NAT.
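
For the Workload Identity question, the KSA-to-GSA binding comes down to two strings: the IAM member that identifies the Kubernetes service account, and the annotation placed on it. The sketch below builds both; the helper names are made up for illustration, and the project, namespace, and account names in the usage example are placeholders.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string granted roles/iam.workloadIdentityUser on the GSA."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def gsa_annotation(gsa_name: str, project_id: str) -> dict:
    """Annotation that points a Kubernetes service account at its GSA."""
    return {
        "iam.gke.io/gcp-service-account":
            f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(workload_identity_member("my-project", "production", "api-ksa"))
    print(gsa_annotation("api-sa", "my-project"))
```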

3. Scaling & Operations

  • Cluster Autoscaler vs. Horizontal Pod Autoscaler (HPA): How do they work together during a traffic spike?
    Answer Focus: HPA detects high CPU/memory and adds more Pods. When those pods have no room to run (Pending state), the Cluster Autoscaler detects this and adds more Nodes.
  • Node Auto-Provisioning (NAP): How is NAP different from the standard Cluster Autoscaler?
    Answer Focus: Standard Autoscaler adds nodes to existing pools. NAP can create entirely new node pools with different machine types (e.g., adding a GPU pool) on the fly based on what the pods need.
  • Binary Authorization: How do you ensure only “trusted” images are deployed to GKE?
    Answer Focus: Binary Authorization is a deploy-time security control. It ensures that images have been signed by your CI/CD pipeline (e.g., Cloud Build) before they are allowed to run.
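
For the HPA question, it can help to show the scaling rule itself. The sketch below follows the formula documented for the Kubernetes HPA, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance and clamping to the min/max bounds. It is a simplified model: it ignores stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale in proportion to metric/target."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 pods at 90% CPU against a 60% target.
    print(desired_replicas(3, 90, 60, min_replicas=3, max_replicas=50))
```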

4. Advanced & “2026” Trends

  • GKE Enterprise (Anthos): What is GKE Enterprise, and how does it handle multi-cluster management?
    Answer Focus: It uses Fleet Management to group clusters. It includes Config Sync (GitOps) and Anthos Service Mesh to manage policies and traffic across multiple regions or even other clouds.
  • AI Workloads: How does GKE simplify running LLMs or AI training jobs?
    Answer Focus: Mention GKE’s native support for TPUs (Tensor Processing Units) and GPU sharing (time-sharing vs. multi-instance GPU). Avoid citing the Kubernetes AI Toolchain Operator (KAITO) as a GKE feature: it comes from the Azure/AKS ecosystem.
  • Cost Optimization: What are “Spot VMs” in GKE, and what is the best practice for using them?
    Answer Focus: Spot VMs offer up to 91% savings but can be preempted. Best practice is to use them for fault-tolerant, stateless batch jobs and use Node Taints to keep critical system pods off them.

Interview Pro-Tips for GKE:

  1. Mention the “Managed” Benefit: Emphasize that GKE handles Auto-Repair (fixing broken nodes) and Auto-Upgrade (keeping K8s versions current) for you; only compare it with other providers if you can back the comparison up.
  2. Infrastructure as Code: Expect questions on how to provision GKE using Terraform or Config Connector.
  3. Observability: Familiarize yourself with Cloud Operations Suite (formerly Stackdriver). In GKE, logs and metrics are “on by default” and integrated directly into the Google Cloud Console.

GKE Best Practices for Optimal Performance

What is GKE?

Google Kubernetes Engine is Google Cloud’s managed Kubernetes service — Google manages the control plane, you manage the worker nodes (or let Autopilot manage everything).

GKE Modes:
┌─────────────────────────────┬─────────────────────────────┐
│ Standard Mode               │ Autopilot Mode              │
├─────────────────────────────┼─────────────────────────────┤
│ You manage node pools       │ Google manages everything   │
│ You choose machine types    │ Pay per pod, not per node   │
│ Full node customization     │ No node management          │
│ More control                │ More managed/serverless     │
│ Best for: complex workloads │ Best for: simplicity        │
└─────────────────────────────┴─────────────────────────────┘

1. Cluster Architecture Best Practices

Use Regional Clusters (Not Zonal)
# ❌ Zonal — single point of failure
gcloud container clusters create my-cluster \
--zone us-central1-a
# ✅ Regional — control plane + nodes across 3 zones
gcloud container clusters create my-cluster \
--region us-central1 \
--num-nodes 2 # 2 per zone = 6 total nodes
Zonal Cluster:                Regional Cluster:
us-central1-a                 us-central1-a   us-central1-b   us-central1-c
control plane                 control plane   control plane   control plane
node node node                node            node            node
node node node
Zone fails = cluster down     Zone fails = cluster healthy
Separate Node Pools by Workload Type
# System node pool — for cluster components
gcloud container node-pools create system-pool \
--cluster my-cluster \
--region us-central1 \
--machine-type n2-standard-2 \
--num-nodes 1 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--node-labels pool=system
# Application node pool — for your apps
gcloud container node-pools create app-pool \
--cluster my-cluster \
--region us-central1 \
--machine-type n2-standard-4 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--node-labels pool=application
# GPU node pool — for ML workloads
gcloud container node-pools create gpu-pool \
--cluster my-cluster \
--region us-central1 \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-t4,count=1 \
--num-nodes 0 \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 5 \
--node-taints nvidia.com/gpu=present:NoSchedule
# Spot node pool — for batch / fault-tolerant workloads
gcloud container node-pools create spot-pool \
--cluster my-cluster \
--region us-central1 \
--machine-type n2-standard-4 \
--spot \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 20
Terraform Cluster Setup
# main.tf
resource "google_container_cluster" "primary" {
name = "prod-cluster"
location = "us-central1" # regional
# Remove default node pool — use custom ones
remove_default_node_pool = true
initial_node_count = 1
# Networking
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.subnet.name
enable_intra_node_visibility = true
ip_allocation_policy {
cluster_secondary_range_name = "pods"
services_secondary_range_name = "services"
}
# Private cluster — no public node IPs
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = "172.16.0.0/28"
}
# Authorized networks for control plane access
master_authorized_networks_config {
cidr_blocks {
cidr_block = "10.0.0.0/8"
display_name = "internal"
}
cidr_blocks {
cidr_block = var.office_ip
display_name = "office"
}
}
# Security
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Enable addons
addons_config {
horizontal_pod_autoscaling { disabled = false }
http_load_balancing { disabled = false }
network_policy_config { disabled = false }
gce_persistent_disk_csi_driver_config { enabled = true }
gcs_fuse_csi_driver_config { enabled = true }
}
# Enable network policy
network_policy {
enabled = true
provider = "CALICO"
}
# Cluster autoscaling
cluster_autoscaling {
enabled = true
resource_limits {
resource_type = "cpu"
minimum = 4
maximum = 100
}
resource_limits {
resource_type = "memory"
minimum = 16
maximum = 400
}
auto_provisioning_defaults {
service_account = google_service_account.nodes.email
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
}
}
# Maintenance window
maintenance_policy {
recurring_window {
start_time = "2024-01-01T02:00:00Z"
end_time = "2024-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
}
}
# Logging and monitoring
logging_config {
enable_components = [
"SYSTEM_COMPONENTS",
"WORKLOADS"
]
}
monitoring_config {
enable_components = [
"SYSTEM_COMPONENTS",
"WORKLOADS"
]
managed_prometheus {
enabled = true
}
}
# Release channel — get automatic updates
release_channel {
channel = "REGULAR"
}
}
# System node pool
resource "google_container_node_pool" "system" {
name = "system-pool"
cluster = google_container_cluster.primary.name
location = "us-central1"
node_count = 1
node_config {
machine_type = "n2-standard-2"
service_account = google_service_account.nodes.email
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
workload_metadata_config {
mode = "GKE_METADATA" # Workload Identity
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
taint {
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}
labels = {
pool = "system"
}
}
management {
auto_repair = true
auto_upgrade = true
}
}
# Application node pool with autoscaling
resource "google_container_node_pool" "application" {
name = "app-pool"
cluster = google_container_cluster.primary.name
location = "us-central1"
autoscaling {
min_node_count = 1
max_node_count = 10
}
node_config {
machine_type = "n2-standard-4"
disk_size_gb = 100
disk_type = "pd-ssd"
service_account = google_service_account.nodes.email
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
workload_metadata_config {
mode = "GKE_METADATA"
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
labels = {
pool = "application"
env = "production"
}
}
management {
auto_repair = true
auto_upgrade = true
}
upgrade_settings {
max_surge = 1
max_unavailable = 0
}
}

2. Security Best Practices

Workload Identity (No Service Account Keys)
# Create GCP service account
gcloud iam service-accounts create api-sa \
--display-name="API Service Account"
# Grant permissions to GCP SA
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:api-sa@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# Create Kubernetes service account
kubectl create serviceaccount api-ksa -n production
# Bind K8s SA to GCP SA
gcloud iam service-accounts add-iam-policy-binding \
api-sa@$PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:$PROJECT_ID.svc.id.goog[production/api-ksa]"
# Annotate K8s SA
kubectl annotate serviceaccount api-ksa \
-n production \
iam.gke.io/gcp-service-account=api-sa@$PROJECT_ID.iam.gserviceaccount.com
# Pod uses Workload Identity: no key files needed
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  selector:                # required for apps/v1 Deployments
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      serviceAccountName: api-ksa   # K8s SA with WI annotation
      containers:
        - name: api
          image: gcr.io/myproject/api:latest
          # Google Cloud client libraries auto-detect credentials via the metadata server
          # No GOOGLE_APPLICATION_CREDENTIALS needed
Pod Security Standards
# Enforce restricted security for namespace
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Pod that meets restricted standards
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: production
spec:
template:
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: api
image: gcr.io/myproject/api:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp # writable tmp dir
- name: cache
mountPath: /app/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
Binary Authorization
# Enable Binary Authorization
gcloud services enable binaryauthorization.googleapis.com
# Create attestor — only signed images can deploy
gcloud container binauthz attestors create production-attestor \
--attestation-authority-note=projects/$PROJECT_ID/notes/production-note \
--attestation-authority-note-project=$PROJECT_ID
# Set policy — require attestation
cat > /tmp/policy.yaml << EOF
defaultAdmissionRule:
evaluationMode: REQUIRE_ATTESTATION
requireAttestationsBy:
- projects/$PROJECT_ID/attestors/production-attestor
enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
EOF
gcloud container binauthz policy import /tmp/policy.yaml
Network Policies
# Default deny all
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
# Allow api to reach database only
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-to-db
namespace: production
spec:
podSelector:
matchLabels:
app: database
ingress:
- from:
- podSelector:
matchLabels:
app: api
ports:
- port: 5432
---
# Allow egress to Google APIs and cluster DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-google-apis
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 199.36.153.8/30   # private.googleapis.com (restricted.googleapis.com is 199.36.153.4/30)
      ports:
        - port: 443
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP   # DNS
Secret Management with Secret Manager
# Use External Secrets Operator to sync GCP secrets → K8s secrets
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gcp-secret-store
  namespace: production
spec:
  provider:
    gcpsm:
      projectID: my-project-id
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-store
    kind: SecretStore
  target:
    name: api-secrets   # creates K8s secret
    creationPolicy: Owner
  data:
    # Secret Manager secret IDs allow only letters, digits, "-" and "_" (no "/")
    - secretKey: db-password
      remoteRef:
        key: prod-api-db-password
    - secretKey: api-key
      remoteRef:
        key: prod-api-external-api-key

3. Resource Management Best Practices

Always Set Resource Requests and Limits
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
containers:
- name: api
image: gcr.io/myproject/api:latest
resources:
requests:
cpu: "250m" # guaranteed CPU
memory: "256Mi" # guaranteed memory
limits:
cpu: "500m" # max CPU (throttled if exceeded)
memory: "512Mi" # max memory (OOM killed if exceeded)
LimitRange — Default Limits per Namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- type: Container
default: # default limit if not set
cpu: "500m"
memory: "512Mi"
defaultRequest: # default request if not set
cpu: "100m"
memory: "128Mi"
max: # hard max per container
cpu: "4"
memory: "8Gi"
min: # minimum per container
cpu: "50m"
memory: "64Mi"
- type: Pod
max:
cpu: "8"
memory: "16Gi"
- type: PersistentVolumeClaim
max:
storage: "100Gi"
ResourceQuota per Namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
# Compute
requests.cpu: "20"
requests.memory: 40Gi
limits.cpu: "40"
limits.memory: 80Gi
# Objects
pods: "100"
services: "20"
persistentvolumeclaims: "20"
secrets: "50"
configmaps: "50"
# Service types
services.loadbalancers: "3"
services.nodeports: "0"
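
A quota is easier to reason about once you know which limit binds first. The sketch below is a rough capacity check, assuming single-container pods with the 250m / 256Mi requests shown earlier and an otherwise empty namespace; the quantity parsers handle only the suffixes used in this guide (`m`, `Ki`, `Mi`, `Gi`), not the full Kubernetes quantity grammar.

```python
def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as '250m' or '20' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity such as '256Mi' or '40Gi' into bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def max_replicas(quota: dict, requests: dict) -> int:
    """Largest replica count that fits the CPU, memory, and pod-count quota."""
    by_cpu = cpu_millicores(quota["requests.cpu"]) // cpu_millicores(requests["cpu"])
    by_mem = memory_bytes(quota["requests.memory"]) // memory_bytes(requests["memory"])
    return min(by_cpu, by_mem, int(quota["pods"]))


if __name__ == "__main__":
    quota = {"requests.cpu": "20", "requests.memory": "40Gi", "pods": "100"}
    print(max_replicas(quota, {"cpu": "250m", "memory": "256Mi"}))
```

With these numbers CPU is the binding constraint (20 cores / 250m), ahead of both memory and the pod-count cap.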
Priority Classes
# Define priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical
value: 1000000
globalDefault: false
description: "Critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high
value: 100000
description: "Important production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low
value: 1000
description: "Batch and background jobs"
---
# Use in deployment
spec:
template:
spec:
priorityClassName: critical # ← won't be evicted for lower priority

4. Autoscaling Best Practices

Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60 # scale at 60% CPU
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70
- type: External
external:
metric:
name: pubsub.googleapis.com|subscription|num_undelivered_messages
selector:
matchLabels:
resource.labels.subscription_id: my-subscription
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100 # double pods in one step
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # wait 5min before scale down
policies:
- type: Pods
value: 2
periodSeconds: 60
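
The `behavior` policies above bound how fast the replica count can move. The following is a rough Python model of those limits, assuming each 60-second period applies the policy once and ignoring the stabilization windows and the metric calculation itself: a 100% Percent policy lets the count at most double per period, and a Pods policy of 2 removes at most two pods per period.

```python
import math


def scale_up_path(start: int, target: int, percent_per_period: int) -> list:
    """Replica counts per period when growth is capped by a Percent policy."""
    path = [start]
    current = start
    while current < target:
        allowed = math.ceil(current * percent_per_period / 100)
        current = min(target, current + max(1, allowed))
        path.append(current)
    return path


def scale_down_periods(start: int, target: int, pods_per_period: int) -> int:
    """Number of periods needed when removal is capped by a Pods policy."""
    return math.ceil(max(0, start - target) / pods_per_period)


if __name__ == "__main__":
    print(scale_up_path(3, 50, 100))      # minReplicas -> maxReplicas
    print(scale_down_periods(50, 3, 2))   # maxReplicas -> minReplicas
```

Under these assumptions, going from 3 to 50 replicas takes five periods, while draining back down takes 24, which reflects the deliberate bias toward fast scale-up and slow scale-down.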
Vertical Pod Autoscaler
# Enable VPA first (on GKE it is a managed cluster feature, not a manual install)
# gcloud container clusters update my-cluster --region us-central1 --enable-vertical-pod-autoscaling
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
updatePolicy:
updateMode: "Off" # Recommend only — don't auto-update
# Options: Off | Initial | Recreate | Auto
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: "2"
memory: 2Gi
controlledResources:
- cpu
- memory
---
# Check VPA recommendations
# kubectl describe vpa api-vpa -n production
# Look for: Status.Recommendation.ContainerRecommendations
Cluster Autoscaler Best Practices
# Pod Disruption Budget — prevent CA from evicting too many pods
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
minAvailable: 2 # keep at least 2 pods running
# OR
# maxUnavailable: 1 # allow at most 1 pod down
selector:
matchLabels:
app: api
# Configure cluster autoscaler behavior (the profile is a cluster-level setting)
gcloud container clusters update my-cluster \
  --region us-central1 \
  --autoscaling-profile optimize-utilization # or balanced
# Set autoscaling bounds per node pool; scale-down aggressiveness comes from the profile above
gcloud container node-pools update app-pool \
  --cluster my-cluster \
  --region us-central1 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10

5. Networking Best Practices

Use Private Cluster with VPC-Native Networking
# Create VPC and subnets
gcloud compute networks create prod-vpc \
--subnet-mode custom
gcloud compute networks subnets create prod-subnet \
--network prod-vpc \
--region us-central1 \
--range 10.0.0.0/20 \
--secondary-range pods=10.4.0.0/14,services=10.0.16.0/20
# Create private cluster
gcloud container clusters create prod-cluster \
--region us-central1 \
--network prod-vpc \
--subnetwork prod-subnet \
--cluster-secondary-range-name pods \
--services-secondary-range-name services \
--enable-private-nodes \
--master-ipv4-cidr 172.16.0.0/28 \
--enable-ip-alias
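
Before creating the subnet it is worth checking that the three ranges do not overlap and estimating how many nodes the pod range can support. The sketch below uses Python's standard `ipaddress` module with the CIDRs from the commands above; the per-node pod range of /24 is an assumption (it matches GKE's default of up to 110 pods per node) and changes if you set a different maximum pods per node.

```python
import ipaddress
from itertools import combinations

RANGES = {
    "nodes": ipaddress.ip_network("10.0.0.0/20"),
    "pods": ipaddress.ip_network("10.4.0.0/14"),
    "services": ipaddress.ip_network("10.0.16.0/20"),
}


def overlapping_pairs(ranges: dict) -> list:
    """Return the names of any two ranges that overlap."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


def max_nodes(pod_range, per_node_prefix: int = 24) -> int:
    """Number of per-node pod blocks that fit in the pod range."""
    return 2 ** (per_node_prefix - pod_range.prefixlen)


if __name__ == "__main__":
    print("overlaps:", overlapping_pairs(RANGES))
    print("pod IPs:", RANGES["pods"].num_addresses)
    print("max nodes at /24 per node:", max_nodes(RANGES["pods"]))
```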
Cloud Armor WAF for Ingress
# BackendConfig — attach Cloud Armor policy
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
name: api-backend-config
namespace: production
spec:
securityPolicy:
name: prod-waf-policy # Cloud Armor policy name
connectionDraining:
drainingTimeoutSec: 60
healthCheck:
checkIntervalSec: 15
timeoutSec: 15
healthyThreshold: 1
unhealthyThreshold: 2
type: HTTP
requestPath: /health
port: 8080
---
# Service references BackendConfig
apiVersion: v1
kind: Service
metadata:
name: api-service
namespace: production
annotations:
cloud.google.com/backend-config: '{"default":"api-backend-config"}'
cloud.google.com/neg: '{"ingress": true}' # Container-native LB
spec:
selector:
app: api
ports:
- port: 80
targetPort: 8080
type: ClusterIP
---
# GKE Ingress with HTTPS and managed cert
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
namespace: production
annotations:
kubernetes.io/ingress.class: gce
kubernetes.io/ingress.global-static-ip-name: prod-ip
networking.gke.io/managed-certificates: api-cert
kubernetes.io/ingress.allow-http: "false"
spec:
rules:
- host: api.acme.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-service
port:
number: 80
# Create Cloud Armor WAF policy
gcloud compute security-policies create prod-waf-policy \
--description "Production WAF policy"
# Enable OWASP rules
gcloud compute security-policies rules create 1000 \
--security-policy prod-waf-policy \
--expression "evaluatePreconfiguredExpr('xss-v33-stable')" \
--action deny-403
# Rate limiting
gcloud compute security-policies rules create 2000 \
--security-policy prod-waf-policy \
--expression "true" \
--action throttle \
--rate-limit-threshold-count 1000 \
--rate-limit-threshold-interval-sec 60 \
--conform-action allow \
--exceed-action deny-429

6. Reliability Best Practices

Pod Anti-Affinity — Spread Across Zones
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: production
spec:
replicas: 6
template:
spec:
# Spread across zones
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api
# Don't put two api pods on same node
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: api
topologyKey: kubernetes.io/hostname
Readiness and Liveness Probes
containers:
- name: api
image: gcr.io/myproject/api:latest
ports:
- containerPort: 8080
# Liveness — restart pod if unhealthy
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 30 # wait before first check
periodSeconds: 10
failureThreshold: 3 # fail 3 times = restart
timeoutSeconds: 5
# Readiness — remove from LB if not ready
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
successThreshold: 1
# Startup — for slow-starting apps
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30 # allow 5 min to start
periodSeconds: 10
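
The probe comments above ("allow 5 min to start", "fail 3 times = restart") follow from multiplying the threshold by the period. The small sketch below spells out that arithmetic; it is an approximation that ignores probe timeouts and jitter.

```python
def probe_budget_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time before a probe's failure threshold is reached."""
    return initial_delay_seconds + failure_threshold * period_seconds


if __name__ == "__main__":
    # Startup probe: how long a slow app may take before it is killed.
    print("startup budget:", probe_budget_seconds(30, 10), "s")
    # Liveness probe: how long an unhealthy app runs before a restart.
    print("liveness detection:", probe_budget_seconds(3, 10), "s")
    # Readiness probe: how long before a failing pod leaves the load balancer.
    print("readiness detection:", probe_budget_seconds(3, 5), "s")
```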
Graceful Shutdown
containers:
- name: api
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- sleep 15 # allow LB to drain connections
# Allow time for preStop + app shutdown
terminationGracePeriodSeconds: 60

7. Cost Optimization Best Practices

Use Spot VMs for Non-Critical Workloads
# Schedule batch jobs on spot nodes
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
spec:
template:
spec:
# Target spot node pool
nodeSelector:
cloud.google.com/gke-spot: "true"
tolerations:
- key: "cloud.google.com/gke-spot"
operator: Equal
value: "true"
effect: NoSchedule
# Handle spot preemption gracefully
terminationGracePeriodSeconds: 25 # spot gives 30s warning
restartPolicy: OnFailure
containers:
- name: processor
image: gcr.io/myproject/processor:latest
Committed Use Discounts
# Purchase committed use for baseline workloads
gcloud compute commitments create prod-commitment \
  --plan 12-month \
  --region us-central1 \
  --resources vcpu=20,memory=80GB
# Savings: ~37% for 1-year, ~55% for 3-year
Node Auto-Provisioning with Resource Limits
# Set cluster-level resource limits for NAP
gcloud container clusters update prod-cluster \
--region us-central1 \
--enable-autoprovisioning \
--max-cpu 100 \
--max-memory 400 \
--min-cpu 4 \
--min-memory 16 \
--autoprovisioning-scopes=https://www.googleapis.com/auth/cloud-platform

8. Observability Best Practices

Google Cloud Managed Prometheus

# Enable managed Prometheus (built into GKE)
gcloud container clusters update prod-cluster \
--region us-central1 \
--enable-managed-prometheus
# Deploy PodMonitoring to scrape your apps
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
name: api-monitoring
namespace: production
spec:
selector:
matchLabels:
app: api
endpoints:
- port: metrics
interval: 30s
path: /metrics
Structured Logging
# Always log in JSON format for Cloud Logging
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
            "component": record.name,
            "httpRequest": getattr(record, "httpRequest", None),
            "labels": {
                "service": "api",
                "version": "v2",
                "env": "production"
            }
        })
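
A formatter only takes effect once it is attached to a handler. The sketch below shows one way to wire that up, using a trimmed-down formatter so the example stays self-contained; the `build_logger` helper is a made-up name, and writing to stdout assumes the usual GKE setup where container stdout/stderr is collected by Cloud Logging.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Trimmed-down JSON formatter, for illustration only."""

    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "component": record.name,
        })


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes one JSON object per line to the stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("api")
    log.info("order %s created", 42)
```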

Cloud Trace Integration

# Export OpenTelemetry spans to Cloud Trace
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
trace.set_tracer_provider(provider)

9. CI/CD Best Practices

Cloud Build + Artifact Registry
# cloudbuild.yaml
steps:
# Build image
- name: gcr.io/cloud-builders/docker
args:
- build
- -t
- us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
- -t
- us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:latest
- .
# Scan for vulnerabilities
- name: gcr.io/cloud-builders/gcloud
args:
- artifacts
- docker
- images
- scan
- us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
- --format=json
# Push to Artifact Registry
- name: gcr.io/cloud-builders/docker
args:
- push
- --all-tags
- us-central1-docker.pkg.dev/$PROJECT_ID/prod/api
# Deploy to GKE
- name: gcr.io/cloud-builders/kubectl
args:
- set
- image
- deployment/api
- api=us-central1-docker.pkg.dev/$PROJECT_ID/prod/api:$SHORT_SHA
- -n
- production
env:
- CLOUDSDK_COMPUTE_REGION=us-central1
- CLOUDSDK_CONTAINER_CLUSTER=prod-cluster
options:
machineType: E2_HIGHCPU_8
logging: CLOUD_LOGGING_ONLY

10. GKE Best Practices Checklist

Cluster Setup
✅ Regional cluster (not zonal)
✅ Private cluster (no public node IPs)
✅ Separate node pools by workload type
✅ Release channel enabled (auto-updates)
✅ Maintenance window set
✅ VPC-native networking
Security
✅ Workload Identity (no SA keys)
✅ Binary Authorization
✅ Pod Security Standards (restricted)
✅ Network Policies (default deny)
✅ Secrets in Secret Manager
✅ Shielded nodes enabled
✅ Container image scanning
✅ Cloud Armor WAF on ingress
Resource Management
✅ Requests and limits on every container
✅ LimitRange per namespace
✅ ResourceQuota per namespace
✅ Priority classes defined
✅ PodDisruptionBudgets set
Reliability
✅ Minimum 3 replicas for prod services
✅ Pod anti-affinity across zones
✅ HPA configured
✅ Liveness + readiness + startup probes
✅ Graceful shutdown (preStop + terminationGrace)
✅ PodDisruptionBudget (minAvailable ≥ 1)
Cost
✅ Spot VMs for batch/non-critical
✅ Committed use discounts for baseline
✅ Cluster autoscaler enabled
✅ VPA recommendations reviewed
✅ Node auto-provisioning for mixed workloads
Observability
✅ Managed Prometheus enabled
✅ Cloud Logging with structured JSON
✅ Cloud Trace instrumented
✅ Dashboards for golden signals
✅ Alerts on SLO breaches

GKE best practices come down to three pillars — security by default (private cluster, Workload Identity, least privilege), reliability by design (regional cluster, anti-affinity, autoscaling, probes), and cost efficiency (spot VMs, committed use, right-sizing with VPA). Get these right from day one and you avoid the most painful production incidents.

Enterprise RAG: Streamlining Internal AI on GCP

What is RAG?

Retrieval-Augmented Generation (RAG) = give an LLM access to your private data at query time, so it answers based on your documents — not just its training data.


GCP-Native RAG Architecture (Full Stack)

┌─────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ (Web App / Slack Bot / Internal Portal) │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ API LAYER │
│ Cloud Run / Cloud Functions │
└──────┬───────────────┬──────────────────┬───────────────────┘
↓ ↓ ↓
┌────────────┐ ┌─────────────┐ ┌──────────────────┐
│ Retrieval │ │ LLM Layer │ │ Auth & Security │
│ Engine │ │ (Vertex AI)│ │ (IAM / IAP) │
└────────────┘ └─────────────┘ └──────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ VECTOR STORE │
│ Vertex AI Vector Search / AlloyDB / pgvector │
└──────────────────────┬──────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ KNOWLEDGE BASE (Raw Docs) │
│ GCS Buckets │ BigQuery │ Drive │ Confluence │ Jira │
└─────────────────────────────────────────────────────────────┘

GCP Services Mapping

RAG Component | GCP Service
Document Storage | Cloud Storage (GCS)
Embedding Model | Vertex AI Embeddings (text-embedding-005)
Vector Store | Vertex AI Vector Search or AlloyDB pgvector
LLM | Vertex AI Gemini 1.5 Pro / Flash
Orchestration | Cloud Run, Cloud Functions, or Vertex AI Pipelines
Document parsing | Document AI
Data ingestion pipeline | Dataflow / Cloud Composer (Airflow)
Metadata & structured data | BigQuery
Auth & access control | IAM, Identity-Aware Proxy (IAP)
Monitoring | Cloud Logging, Cloud Monitoring, Vertex AI Model Monitoring
Secret management | Secret Manager

Phase 1 — Document Ingestion Pipeline

[ Raw Documents ]
GCS / Drive / Confluence / SharePoint
      ↓
[ Document AI ] ← OCR, form parsing, table extraction
      ↓
[ Chunking & Cleaning ] ← Split into ~512 token chunks with overlap
      ↓
[ Vertex AI Embeddings ] ← text-embedding-005 → vector per chunk
      ↓
[ Vector Store ]
Vertex AI Vector Search (managed) or AlloyDB + pgvector (flexible)
      ↓
[ Metadata → BigQuery ] ← source, timestamp, doc_id, chunk_id

Chunking Strategy (Critical for Quality)

Strategy | Best for
Fixed size (512 tokens, 20% overlap) | General documents
Semantic chunking | Mixed-content docs
Sentence-level | FAQs, support docs
Section/header-based | Structured docs (manuals, wikis)
Parent-child chunking | Retrieve child, return parent context
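
The parent-child row is the least obvious of these, so here is a minimal sketch of the idea. It assumes blank-line-separated paragraphs as parents and fixed-size word windows as children; the names `Chunk` and `parent_child_chunks` are hypothetical, and word counts stand in for real tokenizer counts.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str
    text: str


def parent_child_chunks(doc: str, child_words: int = 50) -> tuple[dict[str, str], list[Chunk]]:
    """Split a document into parent paragraphs and smaller child chunks."""
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        parent_id = f"p{p_idx}"
        parents[parent_id] = paragraph
        words = paragraph.split()
        # Children are small windows that get embedded and searched.
        for start in range(0, len(words), child_words):
            children.append(Chunk(
                chunk_id=f"{parent_id}-c{start // child_words}",
                parent_id=parent_id,
                text=" ".join(words[start:start + child_words]),
            ))
    return parents, children


doc = "Refunds are available within 30 days.\n\nReturns start in the portal. Keep your receipt."
parents, children = parent_child_chunks(doc, child_words=4)

# Pretend the last child was the best vector match: return its parent text.
best_child = children[-1]
context_for_llm = parents[best_child.parent_id]
```

The small children keep the embedding focused on one idea, while the parent lookup gives the LLM enough surrounding text to answer.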

Phase 2 — Retrieval Engine

# Simplified RAG retrieval flow on GCP.
# vertexai_embed, vector_search, rerank, and fetch_chunks are placeholders
# for your own wrappers around the corresponding services.
def retrieve(query: str, top_k: int = 5):
    # 1. Embed the user query
    query_embedding = vertexai_embed(query)  # text-embedding-005
    # 2. Vector similarity search
    results = vector_search.find_neighbors(
        embedding=query_embedding,
        num_neighbors=top_k
    )
    # 3. Optional: Re-rank results
    reranked = rerank(query, results)  # Vertex AI Ranking API
    # 4. Fetch full chunk text from GCS / BigQuery
    chunks = fetch_chunks(reranked)
    return chunks

Retrieval Techniques (Use in Combination)

Technique | What it does
Dense retrieval | Vector similarity (semantic search)
Sparse retrieval | BM25 keyword search
Hybrid search | Dense + sparse combined (best quality)
Re-ranking | Vertex AI Ranking API re-orders top results
HyDE | LLM generates hypothetical answer → embed that for retrieval
Multi-query retrieval | LLM generates N query variants → retrieve for all
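
The table does not say how dense and sparse results are combined for hybrid search. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method this architecture prescribes. The function name and the constant `k = 60` are illustrative choices; the same merge also works for the N result lists produced by multi-query retrieval.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every ID it contains,
    so IDs that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (descending); break ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c3", "c1", "c7"]   # order returned by vector search
sparse = ["c1", "c9", "c3"]  # order returned by BM25 keyword search
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only needs rank positions, so it avoids having to normalize cosine scores against BM25 scores, which live on different scales.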

Phase 3 — Generation (LLM Layer)

# gemini_pro is assumed to be an already-initialized Vertex AI Gemini model client.
def generate_answer(query: str, chunks: list):
    context = "\n\n".join(c.text for c in chunks)
    prompt = f"""
You are an internal AI assistant for Acme Corp.
Answer ONLY based on the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite the source document.
CONTEXT:
{context}
QUESTION:
{query}
ANSWER:
"""
    response = gemini_pro.generate_content(prompt)
    return response.text
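
The prompt asks the model to cite the source document, but it can only do that if each chunk in the context carries its source. The sketch below shows one way to build such a context string. `RetrievedChunk`, `build_context`, and the `max_chars` budget are hypothetical names and values, not part of any Vertex AI API.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    source: str
    page: int


def build_context(chunks: list[RetrievedChunk], max_chars: int = 8000) -> str:
    """Label every chunk with a number and its source so the LLM can cite it."""
    blocks: list[str] = []
    used = 0
    for n, chunk in enumerate(chunks, start=1):
        block = f"[{n}] Source: {chunk.source} (page {chunk.page})\n{chunk.text}"
        # Stop before exceeding the character budget (a rough proxy for tokens).
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


sample_chunks = [
    RetrievedChunk("Refunds are available within 30 days.", "refund_policy.pdf", 1),
    RetrievedChunk("To initiate a return, visit our portal.", "returns_guide.pdf", 3),
]
context = build_context(sample_chunks)
```

The resulting string can replace the plain join in `generate_answer`, so that "cite the source" has something concrete to refer to.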

Gemini Models on Vertex AI

Model | Best for
Gemini 1.5 Pro | Complex reasoning, long documents (1M context)
Gemini 1.5 Flash | Fast, cost-efficient responses
Gemini 1.0 Pro | Simpler Q&A tasks
Claude on Vertex | Alternative via Model Garden

Phase 4 — API & Serving Layer

Cloud Run (containerized FastAPI)
├── POST /chat → RAG query endpoint
├── POST /ingest → Trigger document ingestion
├── GET /sources → List indexed documents
└── GET /health → Health check

Cloud Run is ideal because:

  • Serverless, scales to zero
  • Fast cold starts
  • Easy CI/CD via Cloud Build
  • Integrates with IAP for auth

Phase 5 — Internal AI Assistant UI

Options for the frontend:

Option | Best for
Cloud Run + React/Next.js | Custom internal portal
Slack Bot | Teams already using Slack
Google Chat Bot | Google Workspace shops
Vertex AI Agent Builder | No-code, managed RAG UI
Looker / Looker Studio (formerly Data Studio) embed | Analytics-heavy teams

Enterprise-Grade Features

1. Access Control (Critical)

IAM Roles → control who can call the RAG API
IAP → protect the web UI (Google SSO)
Document-level ACL → filter retrieved chunks by user's permissions
VPC Service Controls → isolate all GCP services in a perimeter
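
Document-level ACL filtering is the item here that has to live in application code. A minimal sketch, assuming each chunk's metadata carries an `allowed_groups` list written at ingestion time (a hypothetical field name), might look like this:

```python
def filter_by_acl(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks the user is allowed to see.

    Chunks without an allowed_groups field are dropped (deny by default).
    """
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]


retrieved = [
    {"chunk_id": "hr_001", "allowed_groups": ["hr"]},
    {"chunk_id": "fin_001", "allowed_groups": ["finance", "execs"]},
    {"chunk_id": "legacy_001"},  # no ACL metadata: treated as not visible
]
visible = filter_by_acl(retrieved, {"finance"})
```

Filtering must happen before the chunks reach the prompt. Where the vector store supports metadata filters, pushing the same condition into the search query avoids retrieving chunks that will be thrown away.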

2. Observability Stack

Cloud Logging → all query logs, errors
Cloud Monitoring → latency, throughput, error rate dashboards
BigQuery → store all Q&A pairs for analysis
Vertex AI Evals → measure answer quality over time

3. Guardrails

Vertex AI Safety Filters → block harmful outputs
Grounding checks → ensure answer comes from retrieved context
Confidence scoring → flag low-confidence answers for human review
Citation enforcement → always return source doc + page
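
As a rough illustration of a grounding check combined with confidence scoring, the sketch below measures what fraction of the answer's words also appear in the retrieved context. This is a deliberately naive heuristic with a hypothetical threshold, not how Vertex AI implements grounding; production systems typically use an LLM or NLI-based judge instead.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer: str, context: str) -> float:
    """Fraction of distinct answer words that also occur in the context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


def needs_review(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Flag answers whose word overlap with the context is below the threshold."""
    return grounding_score(answer, context) < threshold


context = "Refunds are available within 30 days of purchase."
grounded = "Refunds are available within 30 days."
ungrounded = "Shipping is free worldwide."
```

Word overlap catches answers that drift entirely away from the documents, but it cannot detect a wrong claim assembled from the right words, so treat it as a cheap first filter only.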

Full GCP RAG Stack — Production Setup

┌─ INGESTION (Batch + Real-time) ──────────────────────────────┐
│ Cloud Composer (Airflow) → Document AI → Embeddings → VectorDB│
└──────────────────────────────────────────────────────────────┘
┌─ SERVING ────────────────────────────────────────────────────┐
│ Cloud Run (FastAPI RAG service) │
│ ├── Vertex AI Vector Search (retrieval) │
│ ├── Vertex AI Ranking API (re-rank) │
│ └── Gemini 1.5 Pro (generation) │
└──────────────────────────────────────────────────────────────┘
┌─ FRONTEND ───────────────────────────────────────────────────┐
│ Next.js on Cloud Run + IAP (Google SSO) │
│ or Slack / Google Chat Bot │
└──────────────────────────────────────────────────────────────┘
┌─ OBSERVABILITY ──────────────────────────────────────────────┐
│ Cloud Logging → BigQuery → Looker Dashboard │
└──────────────────────────────────────────────────────────────┘

Vertex AI Agent Builder (Managed RAG — Fastest Path)

If you want to skip building from scratch, GCP offers a fully managed RAG solution:

  1. Upload docs to GCS
  2. Create a Data Store in Agent Builder
  3. Create an Agent and attach the data store
  4. Deploy — get a chat UI + API instantly

Great for POCs and internal tools where customization isn’t critical.


Cost Optimization Tips

Tip | Saving
Use Gemini Flash for simple Q&A | ~10x cheaper than Pro
Cache frequent queries (Memorystore/Redis) | Reduce LLM calls
Batch embed documents overnight | Lower embedding costs
Limit top_k retrieval chunks | Reduce context = fewer tokens
Use committed use discounts on Vertex | Up to 20% off
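
The caching tip can be sketched with an in-memory stand-in for Memorystore. The class below is an assumption-laden illustration: `QueryCache` is a hypothetical name, the dictionary would be replaced by Redis `GET`/`SET` calls with an expiry in a real deployment, and exact-match caching on a normalized query is only one of several possible strategies.

```python
import hashlib
import time
from typing import Callable, Optional


class QueryCache:
    """Exact-match answer cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivial variants share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (self._clock() + self._ttl, answer)
```

A typical call pattern is to check `cache.get(question)` before retrieval and generation, and call `cache.put(question, answer)` afterwards. Keep the TTL short enough that updated documents are reflected in answers.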

RAG Quality Evaluation

Always measure these metrics:

Metric | What it measures
Faithfulness | Is the answer grounded in retrieved docs?
Answer Relevance | Does it actually answer the question?
Context Precision | Are retrieved chunks relevant?
Context Recall | Did retrieval find all needed info?

Tools: RAGAS framework, Vertex AI Evaluation Service, custom BigQuery dashboards.
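
For intuition, the two retrieval metrics can be computed by hand against a small labeled set. The sketch below uses simplified set-based definitions and assumes you have a list of chunk IDs known to be relevant for each test question; frameworks such as RAGAS define these metrics differently (they rely on LLM judgments), so the numbers are not directly comparable.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that retrieval managed to find."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


retrieved_ids = ["a", "b", "c", "d"]
relevant_ids = {"a", "c", "e"}
precision = context_precision(retrieved_ids, relevant_ids)
recall = context_recall(retrieved_ids, relevant_ids)
```

Low precision suggests lowering `top_k` or adding re-ranking; low recall points at chunking or embedding quality.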


Timeline for Enterprise RAG on GCP

Phase | Timeline | Deliverable
POC | 1–2 weeks | Agent Builder + sample docs
MVP | 4–6 weeks | Cloud Run RAG API + basic UI
Production | 8–12 weeks | Full pipeline, auth, monitoring
Optimization | Ongoing | Eval loop, fine-tuning, cost control

This is a battle-tested architecture used by enterprises running internal knowledge assistants, HR bots, IT support agents, and compliance Q&A systems on GCP.

Vertex AI: Google Cloud’s All-in-One AI Solution

Vertex AI is Google Cloud’s unified AI/ML platform — a single place where you can build, deploy, train, and manage machine learning models and AI applications at enterprise scale.

Think of it as Google’s answer to Azure AI + AWS SageMaker — it brings together everything an AI team needs under one roof.


The Core Idea

Before Vertex AI, Google had many scattered AI tools:

AI Platform (training)
AutoML (no-code ML)
AI Hub (model sharing)
Notebooks (experimentation)
Predictions (serving)

Vertex AI unified all of them into one platform in 2021.


Vertex AI — Main Components

What is Vertex AI?

Vertex AI is Google Cloud’s fully managed, unified AI/ML platform — a single place to build, train, deploy, and manage machine learning models and generative AI applications at enterprise scale.


The 4 Main Pillars

1. Data

Everything starts with data. Vertex AI provides tools to manage, label, and store training data in a structured way.

  • Datasets — upload and manage structured, image, video, text, or tabular data
  • Feature Store — a centralized repository to store and share ML features across teams, avoiding redundant computation
  • Data Labeling — human-in-the-loop tool to annotate training data (images, text, video)
  • BigQuery ML — run ML models directly inside BigQuery using SQL, no data movement needed

2. Build

Where models are actually created — either automatically or with full custom code.

  • AutoML — no-code model training; you bring data, Google finds the best model architecture automatically
  • Custom training — full control; use TensorFlow, PyTorch, scikit-learn, or any framework on managed compute
  • Workbench — managed JupyterLab notebooks with GCP integrations pre-wired
  • Colab Enterprise — Google Colab but enterprise-grade, with IAM, VPC, and persistent storage

3. Deploy

Serving models to production reliably and at scale.

  • Endpoints — deploy models as REST APIs with autoscaling, A/B testing, and traffic splitting
  • Batch prediction — run predictions on large datasets offline without a live endpoint
  • Model registry — versioned catalog of all your trained models with lineage tracking
  • Explainability — understand why a model made a prediction (feature attribution)

4. MLOps

The operational layer that makes ML repeatable and production-grade.

  • Pipelines — orchestrate end-to-end ML workflows (data → train → evaluate → deploy) as DAGs
  • Experiments — track hyperparameters, metrics, and artifacts across training runs
  • Model monitoring — detect data drift and prediction drift in production automatically
  • Metadata — full lineage tracking of every artifact, dataset, and model version

Generative AI Layer

On top of classical ML, Vertex AI has a dedicated generative AI tier:

  • Model Garden — a catalog of 130+ foundation models (Gemini, Llama, Claude, Mistral, etc.) ready to use or fine-tune
  • Gemini API — access Google’s most capable multimodal model (text, images, video, code, audio)
  • Vertex AI Studio — a UI playground to prompt, test, and compare models without writing code
  • Embeddings API — convert text into vectors for semantic search and RAG (text-embedding-004)

Vertex AI Search + Vector Search

A specialized layer for RAG and semantic search:

  • Vertex AI Search — fully managed search engine over your documents, grounded in your data
  • Vector Search — high-scale approximate nearest neighbor (ANN) search, stores and queries billions of vectors using Google’s ScaNN algorithm

This is what powers the GCP RAG pipeline from the previous article.


Vertex AI vs Competitors

Feature | Vertex AI (GCP) | Azure AI (Microsoft) | SageMaker (AWS)
AutoML | | |
Managed notebooks | ✅ Workbench | ✅ Azure ML Studio | ✅ Studio Lab
Foundation models | ✅ Gemini, Model Garden | ✅ Azure OpenAI | ✅ Bedrock
Vector search | ✅ Vertex AI Search | ✅ Azure AI Search | ✅ OpenSearch
Embeddings | ✅ text-embedding-004 | ✅ ada-002 / text-3 | ✅ Titan
MLOps pipelines | ✅ Vertex Pipelines | ✅ Azure ML Pipelines | ✅ SageMaker Pipelines
Tight GCP integration | ✅ Native | |

Key Takeaway

Vertex AI is to machine learning what Google Cloud is to infrastructure — fully managed, deeply integrated, and designed to scale from prototype to production without switching tools. Whether you’re training a custom model, deploying Gemini, or building a RAG pipeline with vector search, it all lives under one unified platform with shared IAM, billing, and networking.

Integrate n8n with GCP for Efficient Document Management

Integrating n8n with GCP for Document Management

This mirrors the Azure RAG architecture but uses Google Cloud Platform services — Vertex AI for embeddings, Vertex AI Search (or AlloyDB/Cloud SQL with pgvector) for vector storage, and n8n as the orchestration layer.


The Full Architecture

Your Documents (PDFs, Docs, Sheets)
      ↓
Google Cloud Storage (GCS)
      ↓
Document AI / Dataflow (chunk + clean)
      ↓
Vertex AI Embeddings (text → vector)
      ↓
Vertex AI Search / pgvector (store vectors)
      ↓
n8n Workflow
      ↓
User gets grounded answer + sources

GCP Services Mapping

Azure Service | GCP Equivalent | Role
Azure Data Lake | Google Cloud Storage (GCS) | Store raw documents
Azure Data Factory | Cloud Dataflow / Document AI | Process & chunk text
Azure OpenAI Embeddings | Vertex AI Embeddings | Convert text → vectors
Azure AI Search | Vertex AI Search / pgvector | Store & search vectors
Azure OpenAI Chat | Vertex AI Gemini / PaLM | Generate answers
n8n | n8n | Orchestrate everything

Step-by-Step Implementation


Step 1 — Store Documents in GCS

Upload all your PDFs, Word docs, and text files to a GCS bucket:

# Create a bucket
gsutil mb gs://my-company-docs
# Upload documents
gsutil cp *.pdf gs://my-company-docs/raw/

Bucket structure:

gs://my-company-docs/
├── raw/ ← original documents
├── processed/ ← cleaned text chunks
└── embeddings/ ← vector JSON files

Step 2 — Process & Chunk Documents

Use Google Document AI to extract clean text from PDFs, then split into chunks:

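# Cloud Function or Dataflow job
# Text extraction with Document AI is assumed to have happened already;
# this function only splits the extracted text into overlapping chunks.
from google.cloud import documentai, storage

def chunk_document(text, source, chunk_size=500, overlap=50):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for n, start in enumerate(range(0, len(words), step), start=1):
        chunks.append({
            # e.g. "refund_policy.pdf" -> "refund_policy_001"
            "chunk_id": f"{source.rsplit('.', 1)[0]}_{n:03d}",
            "text": " ".join(words[start:start + chunk_size]),
            "source": source,
            # Rough position estimate, not a real PDF page number
            "page": start // chunk_size + 1
        })
    return chunks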

Output chunk format:

{
"chunk_id": "refund_policy_001",
"text": "Refunds are available within 30 days of purchase...",
"source": "refund_policy.pdf",
"page": 1,
"metadata": {
"department": "finance",
"last_updated": "2026-01-15"
}
}

Step 3 — Generate Embeddings with Vertex AI

Call the Vertex AI Embeddings API to convert each chunk into a vector:

# REST API call
POST https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT/
locations/us-central1/publishers/google/models/text-embedding-004:predict
Headers:
Authorization: Bearer $(gcloud auth print-access-token)
Content-Type: application/json
Body:
{
"instances": [
{ "content": "Refunds are available within 30 days of purchase..." }
]
}

Response:

{
"predictions": [
{
"embeddings": {
"values": [0.023, -0.841, 0.334, ...],
"statistics": { "truncated": false, "token_count": 42 }
}
}
]
}
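
To make the request and response shapes above concrete, here is a small sketch that builds the request body for several chunks and pulls the vectors back out of the response. The helper names are hypothetical, and the HTTP call itself is left out so the example stays self-contained; the sample response uses made-up three-dimensional values rather than real 768-dimensional output.

```python
def build_embedding_request(texts: list[str]) -> dict:
    """Request body with one instance per text chunk."""
    return {"instances": [{"content": text} for text in texts]}


def extract_vectors(response: dict) -> list[list[float]]:
    """Return one embedding vector per prediction, in request order."""
    return [
        prediction["embeddings"]["values"]
        for prediction in response["predictions"]
    ]


# Hypothetical response in the shape shown above (values shortened).
sample_response = {
    "predictions": [
        {
            "embeddings": {
                "values": [0.1, -0.2, 0.3],
                "statistics": {"truncated": False, "token_count": 4},
            }
        }
    ]
}
body = build_embedding_request(["Refunds are available within 30 days."])
vectors = extract_vectors(sample_response)
```

Checking `statistics.truncated` for each prediction is worthwhile during ingestion, since a truncated chunk means part of the text was not embedded.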

Vertex AI embedding models:

Model | Dimensions | Best for
text-embedding-004 | 768 | General text, RAG
text-multilingual-embedding-002 | 768 | Multi-language docs
text-embedding-preview-0815 | 768 | Latest preview

Step 4 — Store Vectors

You have two main options on GCP:

Option A — Vertex AI Search (fully managed)

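# Create a data store (via the Discovery Engine REST API)
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: YOUR_PROJECT" \
"https://discoveryengine.googleapis.com/v1/projects/YOUR_PROJECT/locations/global/collections/default_collection/dataStores?dataStoreId=company-docs" \
-d '{
"displayName": "company-docs",
"industryVertical": "GENERIC",
"solutionTypes": ["SOLUTION_TYPE_SEARCH"]
}'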

Option B — AlloyDB / Cloud SQL with pgvector (more control)

-- Enable pgvector extension
CREATE EXTENSION vector;
-- Create table with vector field
CREATE TABLE document_chunks (
    chunk_id TEXT PRIMARY KEY,
    text TEXT,
    source TEXT,
    page INT,
    metadata JSONB,
    embedding VECTOR(768) -- matches Vertex AI output dimensions
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Insert a chunk with its vector:

INSERT INTO document_chunks
(chunk_id, text, source, embedding)
VALUES (
'refund_policy_001',
'Refunds are available within 30 days...',
'refund_policy.pdf',
'[0.023, -0.841, 0.334, ...]'::vector
);
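
When the insert is issued from code, the embedding has to be serialized into the bracketed text form that pgvector accepts. The sketch below shows one way to do that with a parameterized statement. It assumes a psycopg-style driver that uses `%s` placeholders, and the helper names are hypothetical; no database connection is opened here.

```python
def to_pgvector_literal(values, expected_dim=None) -> str:
    """Serialize a sequence of numbers as a pgvector text literal, e.g. "[0.1,0.2]"."""
    if expected_dim is not None and len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Parameterized statement: values are passed separately, never formatted into the SQL.
INSERT_SQL = (
    "INSERT INTO document_chunks (chunk_id, text, source, embedding) "
    "VALUES (%s, %s, %s, %s::vector)"
)


def insert_params(chunk: dict, embedding, expected_dim: int = 768) -> tuple:
    """Parameter tuple matching INSERT_SQL for one chunk."""
    return (
        chunk["chunk_id"],
        chunk["text"],
        chunk["source"],
        to_pgvector_literal(embedding, expected_dim=expected_dim),
    )


literal = to_pgvector_literal([0.023, -0.841, 1])
```

With a real connection this would be used as `cursor.execute(INSERT_SQL, insert_params(chunk, embedding))`. The dimension check catches a mismatch with the `VECTOR(768)` column before the database does.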

Step 5 — Build the n8n Workflow

The n8n workflow has these nodes:

Webhook Trigger
HTTP Request → Vertex AI Embeddings
HTTP Request → pgvector / Vertex AI Search
Code Node → Format retrieved context
HTTP Request → Vertex AI Gemini (chat)
Respond to Webhook

Step 6 — Webhook Receives User Question

Incoming request to n8n:

{
"question": "What is the refund policy?",
"user_id": "user_123"
}

Step 7 — n8n Calls Vertex AI Embeddings

HTTP Request node configuration:

Method: POST
URL: https://us-central1-aiplatform.googleapis.com/v1/projects/
{{ $env.GCP_PROJECT }}/locations/us-central1/publishers/google/
models/text-embedding-004:predict
Headers:
Authorization: Bearer {{ $env.GCP_ACCESS_TOKEN }}
Content-Type: application/json
Body:
{
"instances": [
{ "content": "{{ $json.question }}" }
]
}

Output stored in state:

{ "query_vector": [0.021, -0.834, 0.291, ...] }

Step 8 — n8n Searches pgvector

Postgres node (connecting through the Cloud SQL Auth Proxy or directly to AlloyDB), running a query like:

-- n8n Code Node generates this query
SELECT
chunk_id,
text,
source,
page,
1 - (embedding <=> '[0.021, -0.834, 0.291, ...]'::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> '[0.021, -0.834, 0.291, ...]'::vector
LIMIT 5;

pgvector distance operators:

Operator | Metric | Use case
<=> | Cosine distance | Text similarity (recommended)
<-> | Euclidean distance | Image embeddings
<#> | Negative dot product | Normalized vectors

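To see what the three operators compute, the plain-Python sketch below reproduces each metric for small vectors. The function names are illustrative and this is not how pgvector is implemented internally; it only mirrors the math behind the operators.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Equivalent of <=>: 1 minus cosine similarity."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


def l2_distance(a: list[float], b: list[float]) -> float:
    """Equivalent of <->: straight-line (Euclidean) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: list[float], b: list[float]) -> float:
    """Equivalent of <#>: negated so that smaller still means closer."""
    return -inner_product(a, b)


a = [0.6, 0.8]  # unit length
b = [1.0, 0.0]  # unit length
cos_d = cosine_distance(a, b)
l2_d = l2_distance(a, b)
neg_ip = negative_inner_product(a, b)
```

For unit-length vectors the three metrics produce the same ordering (the squared Euclidean distance equals twice the cosine distance), which is why the negative dot product is a reasonable choice when embeddings are already normalized.
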
Results returned:

[
{ "chunk_id": "refund_policy_001", "text": "Refunds are available within 30 days...", "source": "refund_policy.pdf", "similarity": 0.97 },
{ "chunk_id": "returns_guide_003", "text": "To initiate a return, visit our portal...", "source": "returns_guide.pdf", "similarity": 0.81 }
]

Step 9 — Format Context in n8n Code Node

// n8n Code Node
const results = items[0].json.results;
const question = $node["Webhook Trigger"].json.question;
const context = results
  .map(r => `Source: ${r.source} (Page ${r.page})\nContent: ${r.text}`)
  .join("\n\n---\n\n");
return [{
  json: {
    question: question,
    context: context,
    sources: results.map(r => r.source)
  }
}];

Step 10 — Send Grounded Prompt to Vertex AI Gemini

HTTP Request node:

Method: POST
URL: https://us-central1-aiplatform.googleapis.com/v1/projects/
{{ $env.GCP_PROJECT }}/locations/us-central1/publishers/google/
models/gemini-1.5-pro:generateContent
Body:
{
"contents": [{
"role": "user",
"parts": [{
"text": "You are an internal company assistant.\nAnswer ONLY using the context below.\nIf the answer is not in the context, say: I don't know.\nAlways cite the source document.\n\nContext:\n{{ $json.context }}\n\nQuestion: {{ $json.question }}"
}]
}],
"generationConfig": {
"temperature": 0.2,
"maxOutputTokens": 512
}
}
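
The `generateContent` response nests the generated text under `candidates[0].content.parts`. The sketch below shows that extraction in Python for readers who prefer to test the logic outside n8n; inside the workflow the same steps would go in a Code node. The helper names and the sample response are illustrative.

```python
def extract_answer(response: dict) -> str:
    """Concatenate the text parts of the first candidate, or "" if none."""
    candidates = response.get("candidates", [])
    if not candidates:
        return ""  # e.g. the response was blocked or empty
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


def to_webhook_reply(response: dict, sources: list[str]) -> dict:
    """Shape the final payload: answer text plus de-duplicated sources."""
    return {
        "answer": extract_answer(response).strip(),
        "sources": list(dict.fromkeys(sources)),  # keeps first-seen order
    }


# Hypothetical response in the generateContent shape.
sample_response = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {"text": "Refunds are available "},
                    {"text": "within 30 days."},
                ],
            },
            "finishReason": "STOP",
        }
    ]
}
reply = to_webhook_reply(
    sample_response,
    ["refund_policy.pdf", "returns_guide.pdf", "refund_policy.pdf"],
)
```

Handling the empty-candidates case explicitly keeps the workflow from failing when a response comes back without text.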

Step 11 — Return Answer to User

n8n Respond to Webhook node:

{
"answer": "Refunds are available within 30 days of purchase. To initiate a return, visit our returns portal.",
"sources": ["refund_policy.pdf", "returns_guide.pdf"],
"confidence": "high"
}

Complete n8n Workflow Diagram

┌─────────────────────────────────────────────────────────┐
│ n8n WORKFLOW │
│ │
│ [Webhook]──→[Vertex AI Embed]──→[pgvector Search] │
│ ↓ │
│ [Code: Format] │
│ ↓ │
│ [Gemini Chat] │
│ ↓ │
│ [Respond] │
└─────────────────────────────────────────────────────────┘

GCP vs Azure — Side by Side

Step | Azure | GCP
Document storage | Azure Data Lake | Google Cloud Storage
Text extraction | Azure Form Recognizer | Document AI
Chunking | Azure Data Factory | Cloud Dataflow / Functions
Embedding model | text-embedding-ada-002 | text-embedding-004
Vector dimensions | 1,536 | 768
Vector store | Azure AI Search | AlloyDB pgvector / Vertex AI Search
Search algorithm | HNSW (built-in) | HNSW via pgvector
LLM | Azure OpenAI Chat | Vertex AI Gemini
Orchestration | n8n | n8n

Security Best Practices on GCP

n8n running on a GCP VM / Cloud Run
Uses the attached service account (Workload Identity if on GKE), no hardcoded keys
Accesses GCS, Vertex AI, AlloyDB / Cloud SQL
via IAM roles:
- roles/aiplatform.user
- roles/storage.objectViewer
- roles/cloudsql.client (Cloud SQL) or roles/alloydb.client (AlloyDB)

Store secrets in Google Secret Manager, not in n8n environment variables directly:

# Store API credentials securely
gcloud secrets create vertex-ai-key --data-file=key.json
# n8n fetches at runtime via HTTP Request node
GET https://secretmanager.googleapis.com/v1/projects/YOUR_PROJECT/
secrets/vertex-ai-key/versions/latest:access
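
The `:access` call returns the secret base64-encoded under `payload.data`, so the workflow has to decode it before use. A minimal sketch of that step, with a hypothetical helper name and a made-up sample value:

```python
import base64


def decode_secret(access_response: dict) -> str:
    """Decode the base64 payload returned by the versions/...:access call."""
    encoded = access_response["payload"]["data"]
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical response body; the name and value are examples only.
sample_access_response = {
    "name": "projects/YOUR_PROJECT/secrets/vertex-ai-key/versions/1",
    "payload": {"data": base64.b64encode(b"example-secret-value").decode("ascii")},
}
secret_value = decode_secret(sample_access_response)
```

Avoid writing the decoded value to logs or to the workflow's stored execution data.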

Key Takeaway

The GCP RAG pipeline with n8n gives you:

  • GCS for durable, scalable document storage
  • Document AI for accurate PDF/text extraction
  • Vertex AI Embeddings for state-of-the-art semantic vectors
  • pgvector on AlloyDB for flexible, SQL-native vector search
  • Gemini for grounded, citation-aware answer generation
  • n8n as the glue, with very little custom application code needed

The result is a fully managed, enterprise-grade document Q&A system where every answer is grounded in your actual documents, with sources always cited.