Top GKE Security Best Practices for Enterprises

GKE Security Best Practices (Enterprise Level)

Security in Google Kubernetes Engine is about multiple layers:

  • Identity
  • Network
  • Cluster hardening
  • Workload security
  • Supply chain security
  • Secrets protection
  • Monitoring & detection
  • Governance & compliance

A strong interview or production answer should always emphasize:

“Security in Kubernetes is layered defense-in-depth, not a single control.”


1. Use Private GKE Clusters

Best Practice

Use private clusters whenever possible.

Why?

  • Nodes do NOT get public IPs
  • Reduces attack surface
  • Limits direct internet exposure

Enterprise Design

Typical secure access:

  • Bastion host
  • VPN
  • Cloud Interconnect
  • Cloud NAT (outbound internet access for private nodes; it is not an inbound access path)

2. Restrict API Server Access

Use Authorized Networks

Restrict Kubernetes API access to:

  • corporate IPs
  • VPN ranges
  • trusted admin networks

Avoid

0.0.0.0/0

Huge security risk.
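
A quick way to catch an over-broad authorized network before it reaches the cluster is to lint the CIDR list in CI. The sketch below is only an illustration: the function name and the /16 threshold are assumptions, not GKE rules, and it is written with IPv4 ranges in mind.

```python
import ipaddress


def find_risky_networks(cidrs, min_prefix=16):
    """Return the CIDR blocks that are too broad to be an authorized network.

    min_prefix is an assumed policy threshold (IPv4): any block with a
    shorter prefix than this is flagged. 0.0.0.0/0 has prefix length 0.
    """
    risky = []
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        if network.prefixlen < min_prefix:
            risky.append(cidr)
    return risky


if __name__ == "__main__":
    candidate = ["203.0.113.0/24", "0.0.0.0/0", "10.8.0.0/16"]
    print(find_risky_networks(candidate))
```

Running such a check against the Terraform variables or gcloud arguments that feed the authorized networks list keeps 0.0.0.0/0 from slipping in through a copy-paste.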


3. Use Workload Identity (Very Important)

Best Practice

Use Workload Identity instead of service account keys.


Why?

Bad:

  • static JSON keys
  • key leakage risk
  • long-lived credentials

Good:

  • short-lived tokens
  • IAM-integrated
  • least privilege

Enterprise Interview Statement

“Workload Identity eliminates the need to distribute static service account credentials inside containers.”

Excellent answer.


4. Enforce Least Privilege IAM

Best Practice

Never use:

  • Owner
  • Editor

For workloads.


Use Granular Roles

Examples:

  • Storage Object Viewer
  • Pub/Sub Subscriber
  • Secret Manager Secret Accessor

5. Use Kubernetes RBAC Properly

Avoid

cluster-admin

For developers/applications.


Best Practice

  • namespace-scoped roles
  • least privilege
  • separate admin/operator/developer access

Enterprise Pattern

Role            Permissions
──────────────  ──────────────────
Developers      namespace-only
Platform team   cluster operations
Security team   audit visibility

6. Use Network Policies

Best Practice

Assume:

  • all pod traffic should NOT be trusted

Implement:

  • east-west traffic restrictions

Example

Frontend can talk to:

  • backend

Backend can talk to:

  • database

Nothing else.
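
The frontend → backend → database rule set can be sketched as three NetworkPolicy objects: one default-deny policy plus two narrow allow rules. The namespace name, the app labels, and the ports below are assumptions chosen for illustration, and the policies only take effect on clusters where network policy enforcement (for example GKE Dataplane V2) is enabled.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace, target_app, source_app, port):
    """NetworkPolicy letting pods labeled app=source_app reach app=target_app on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"allow-{source_app}-to-{target_app}",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


def three_tier_policies(namespace="shop"):
    """Default deny, then frontend -> backend and backend -> database (assumed ports)."""
    return [
        default_deny(namespace),
        allow_ingress(namespace, "backend", "frontend", 8080),
        allow_ingress(namespace, "database", "backend", 5432),
    ]


if __name__ == "__main__":
    # kubectl accepts JSON manifests; a v1 List bundles them into one document.
    bundle = {"apiVersion": "v1", "kind": "List", "items": three_tier_policies()}
    print(json.dumps(bundle, indent=2))
```

The output can be piped to kubectl apply -f - in a test namespace. Egress rules are left out to keep the sketch short.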


Enterprise Benefit

Prevents:

  • lateral movement
  • worm propagation
  • compromised pod spread

7. Use Pod Security Standards

Avoid Privileged Containers

Disallow:

  • privileged: true
  • hostNetwork
  • hostPID
  • hostPath mounts

Enforce:

  • non-root containers
  • read-only filesystems
  • dropped Linux capabilities
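
The disallow/enforce lists above translate directly into checks on a pod spec. The following is a minimal sketch of such an audit over a pod spec held as a plain dict; it is not the Pod Security Admission implementation, and the finding strings are made up for this example.

```python
def audit_pod_spec(spec):
    """Return a list of findings for a pod spec given as a plain dict."""
    findings = []
    if spec.get("hostNetwork"):
        findings.append("hostNetwork is enabled")
    if spec.get("hostPID"):
        findings.append("hostPID is enabled")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            findings.append(f"volume {volume.get('name')} uses hostPath")

    pod_ctx = spec.get("securityContext", {})
    for container in spec.get("containers", []):
        ctx = container.get("securityContext", {})
        name = container.get("name")
        if ctx.get("privileged"):
            findings.append(f"{name}: privileged")
        # A container-level runAsNonRoot overrides the pod-level value.
        if not ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False)):
            findings.append(f"{name}: may run as root")
        if not ctx.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: writable root filesystem")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities not dropped")
    return findings


if __name__ == "__main__":
    risky = {
        "hostNetwork": True,
        "containers": [{"name": "app", "securityContext": {"privileged": True}}],
    }
    for finding in audit_pod_spec(risky):
        print(finding)
```

In a real cluster the same rules are better enforced at admission time (Pod Security Standards or a policy controller); a script like this is mainly useful for reviewing manifests in a pipeline.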

Strong Enterprise Statement

“Many Kubernetes compromises escalate through overly permissive pod security settings, such as privileged containers or host namespace access.”


8. Enable Binary Authorization

Best Practice

Only allow:

  • signed
  • trusted
  • approved

Container images.


Prevents

  • malicious images
  • unapproved deployments
  • supply-chain attacks

Enterprise Workflow

CI/CD pipeline:

  • scan image
  • sign image
  • deploy approved image only

9. Scan Container Images

Use:

  • Artifact Registry vulnerability scanning
  • Trivy
  • Clair

Best Practice

Fail builds for:

  • critical CVEs
  • outdated packages
  • vulnerable base images
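
A build gate for scan results can be as small as the sketch below. The finding format (id, package, severity) is a hypothetical, simplified shape and not the native output of Trivy, Clair, or Artifact Registry; a real pipeline would first convert the scanner's report into something like it. The CVE identifiers are placeholders.

```python
BLOCKING_SEVERITIES = {"CRITICAL"}  # assumed policy: only critical findings fail the build


def blocking_findings(findings, blocking=BLOCKING_SEVERITIES):
    """Return the findings whose severity is in the blocking set."""
    return [f for f in findings if f["severity"].upper() in blocking]


def gate(findings):
    """Print blocked findings and return a process exit code (0 = pass, 1 = fail)."""
    blocked = blocking_findings(findings)
    for finding in blocked:
        print(f"BLOCKED {finding['id']} in {finding['package']} ({finding['severity']})")
    return 1 if blocked else 0


if __name__ == "__main__":
    sample = [
        {"id": "CVE-0000-0001", "package": "openssl", "severity": "CRITICAL"},
        {"id": "CVE-0000-0002", "package": "zlib", "severity": "LOW"},
    ]
    print("exit code:", gate(sample))
```

A CI job would pass the returned code to sys.exit so that a critical finding stops the build.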

10. Use Distroless or Minimal Images

Avoid Large Images

Bad:

  • Ubuntu full image
  • unnecessary packages

Good:

  • distroless
  • alpine (carefully)
  • minimal runtime images

Benefit

Smaller attack surface.


11. Store Secrets Securely

Avoid

Bad:

env:
  - name: PASSWORD
    value: mypassword

Better Options

Use:

  • Google Secret Manager
  • CSI Secret Store Driver
  • KMS encryption

Important

Kubernetes secrets are:

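  • base64 encoded (an encoding, not encryption)
  • NOT encrypted at the application layer by default (GKE encrypts data at rest at the storage layer; application-layer secrets encryption with a Cloud KMS key is opt-in)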
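
To see why base64 offers no protection, the sketch below decodes the data map of a Secret with nothing but the standard library. The helper name is an assumption for this example.

```python
import base64


def decode_secret_data(data):
    """Decode the `data` map of a Kubernetes Secret manifest (values are base64 strings)."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


if __name__ == "__main__":
    encoded = base64.b64encode(b"mypassword").decode("ascii")
    print(encoded)
    print(decode_secret_data({"password": encoded}))
```

Anyone who can read the Secret object, or a manifest committed to Git, can do the same, which is why RBAC on secrets and an external secret store matter.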

12. Encrypt Secrets at Rest

Use:

  • CMEK
  • KMS-backed encryption

Enterprise Requirement

Often mandatory for:

  • PCI
  • HIPAA
  • banking
  • government

13. Enable Audit Logging

Enable:

  • Admin Activity logs
  • Data Access logs
  • Kubernetes audit logs

Monitor For

  • suspicious kubectl exec
  • role changes
  • privileged pod creation
  • unusual API access

14. Use Managed Service Mesh Carefully

With:

  • Istio
  • Cloud Service Mesh (formerly Anthos Service Mesh)

Enable:

  • mTLS
  • identity-based communication
  • traffic encryption

Enterprise Benefit

Prevents:

  • plaintext east-west traffic
  • service impersonation

15. Use Shielded GKE Nodes

Best Practice

Enable Shielded Nodes.


Benefits

  • secure boot
  • integrity monitoring
  • rootkit protection

16. Use Node Auto-Upgrade Carefully

Best Practice

Enable:

  • security patching

BUT:

  • validate compatibility
  • use maintenance windows

Enterprise Pattern

  • staging cluster first
  • canary node pools
  • production rollout later

17. Restrict Metadata Access

Risk

Pods accessing:

169.254.169.254

Could steal credentials.


Best Practice

Use:

  • Workload Identity
  • GKE metadata server (GKE_METADATA mode); the older metadata concealment feature is deprecated
  • minimal metadata exposure

18. Separate Workloads by Node Pools

Example

Node Pool   Purpose
──────────  ────────────────────
frontend    internet-facing
backend     internal apps
sensitive   regulated workloads

Benefit

Limits:

  • blast radius
  • noisy neighbors
  • privilege escalation

19. Use Resource Quotas & Limits

Prevent:

  • denial-of-service
  • resource exhaustion

Example

resources:
  limits:
    cpu: "1"
    memory: "1Gi"

20. Protect Ingress Traffic

Use:

  • HTTPS only
  • managed certificates
  • WAF
  • rate limiting

Enterprise Stack

Common:

  • Cloud Armor
  • Ingress controller
  • CDN
  • DDoS protection

21. Use Cloud Armor WAF

Protect against:

  • OWASP Top 10
  • SQL injection
  • bot attacks
  • L7 DDoS

22. Use Multi-Layer Monitoring

Monitor:

  • cluster metrics
  • audit logs
  • runtime anomalies
  • suspicious network traffic

Common Tools

  • Google Cloud Monitoring
  • Prometheus
  • Grafana
  • Falco
  • Security Command Center

23. Runtime Threat Detection

Use:

  • Falco
  • eBPF runtime monitoring

Detect:

  • shell execution
  • crypto miners
  • suspicious syscalls

24. Use Policy-as-Code

Use:

  • OPA Gatekeeper
  • Anthos Policy Controller

Example Policies

Prevent:

  • privileged pods
  • latest image tags
  • public load balancers
  • root containers
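
Gatekeeper and Policy Controller policies are written in Rego, not Python. The sketch below only illustrates the logic of one of the example rules (no latest image tags) so the intent is easy to test; the function names are assumptions.

```python
def image_tag(image):
    """Return the tag of an image reference, or None when it is pinned by digest."""
    if "@" in image:
        return None  # e.g. repo/app@sha256:... is immutable
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"  # an untagged reference resolves to :latest


def containers_using_latest(pod_spec):
    """Return the names of containers whose image resolves to the latest tag."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["name"] for c in containers if image_tag(c["image"]) == "latest"]


if __name__ == "__main__":
    spec = {
        "containers": [
            {"name": "app", "image": "myapp:latest"},
            {"name": "proxy", "image": "registry.example.com:5000/team/proxy:1.4.2"},
        ]
    }
    print(containers_using_latest(spec))
```

Splitting on the last path segment avoids mistaking a registry port (registry.example.com:5000) for a tag.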

Enterprise Benefit

Consistent governance at scale.


25. Separate Production & Non-Production

Never mix:

  • dev
  • test
  • prod

In same cluster for enterprises.


Best Practice

Separate:

  • clusters
  • projects
  • IAM boundaries

26. Backup & Disaster Recovery

Protect:

  • etcd state
  • manifests
  • persistent volumes

Common Tools

  • Velero
  • snapshots
  • GitOps repositories

27. Secure CI/CD Pipelines

Pipeline must:

  • scan images
  • verify signatures
  • use short-lived credentials
  • protect secrets

Enterprise Best Practice

Never:

  • hardcode credentials
  • store kubeconfig insecurely

28. Use GitOps Securely

With:

  • Argo CD
  • Flux

Use:

  • signed commits
  • branch protection
  • approval workflows

29. Apply Multi-Tenant Isolation Carefully

Use:

  • namespaces
  • quotas
  • network policies
  • dedicated node pools

Avoid:

  • full trust between tenants

30. Keep Kubernetes Versions Updated

Old Kubernetes versions:

  • often vulnerable
  • unsupported

Enterprise Upgrade Strategy

  • release channels
  • staged rollout
  • automated testing
  • canary upgrades

Enterprise Reference Architecture

Secure GKE architecture often includes:

  • Private GKE cluster
  • Hub-spoke VPC
  • Cloud NAT
  • Workload Identity
  • Network Policies
  • Binary Authorization
  • Cloud Armor
  • Secret Manager
  • GitOps
  • Central logging/SIEM
  • Policy Controller
  • Runtime threat detection

Strong Security Interview Keywords

Using these naturally helps a lot:

  • zero trust
  • least privilege
  • defense in depth
  • workload isolation
  • immutable infrastructure
  • policy-as-code
  • supply-chain security
  • runtime protection
  • east-west traffic control
  • blast radius reduction

Excellent Senior-Level Interview Statement

“Kubernetes security is not just cluster security. It includes identity, workloads, supply chain, runtime behavior, networking, and governance.”


Common Enterprise Mistakes

Huge Red Flags

  • public clusters
  • cluster-admin everywhere
  • static service account keys
  • privileged containers
  • no network policies
  • shared production clusters
  • no audit logging
  • using latest image tags
  • storing secrets in YAML

Production Security Checklist

Identity

✔ Workload Identity
✔ RBAC
✔ least privilege IAM

Network

✔ private cluster
✔ network policies
✔ Cloud Armor

Workloads

✔ non-root containers
✔ signed images
✔ runtime scanning

Governance

✔ audit logs
✔ policy-as-code
✔ compliance controls

Operations

✔ patching
✔ monitoring
✔ backup/DR


Troubleshooting GKE Failures in Interviews

Enterprise GKE Failure Scenarios (Real Interview Style)

These are the kinds of scenarios senior cloud/platform engineers get asked in real enterprise interviews for Google Kubernetes Engine.

The key is:

  • stay structured
  • isolate layers
  • explain impact
  • explain recovery
  • explain prevention

1. Entire Application Down After Deployment

Scenario

A new deployment went live. Suddenly:

  • users receive 502/503
  • pods restarting
  • traffic failing

Interviewer asks:

“Walk me through your troubleshooting.”


Strong Answer Structure

Step 1 — Confirm Scope

Questions:

  • Single app or multiple?
  • Internal or external only?
  • All regions?
  • Recent deployment/change?

Commands:

kubectl get pods -A
kubectl get events -A
kubectl rollout history deployment app

Step 2 — Validate Pods

kubectl describe pod <pod>
kubectl logs <pod>

Look for:

  • CrashLoopBackOff
  • OOMKilled
  • failed readiness probe
  • image pull failure
  • config errors

Step 3 — Validate Service

kubectl get svc
kubectl get endpoints

Common issue:

  • Service selector mismatch
  • No healthy endpoints
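
A selector mismatch is easy to reason about once the matching rule is written down: every key/value pair in the Service selector must appear in the pod's labels. The sketch below models that rule on plain dicts; the pod and label names are illustrative assumptions.

```python
def selector_matches(selector, labels):
    """True when every selector key/value is present in the pod labels.

    A Service without a selector does not select any pods, so an empty
    selector returns False here.
    """
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def matching_pods(service_selector, pods):
    """Return the names of pods that the selector would pick up."""
    return [p["name"] for p in pods if selector_matches(service_selector, p["labels"])]


if __name__ == "__main__":
    pods = [
        {"name": "web-1", "labels": {"app": "web", "version": "v2"}},
        {"name": "web-2", "labels": {"app": "web-app", "version": "v2"}},
    ]
    print(matching_pods({"app": "web"}, pods))
    print(matching_pods({"app": "web", "version": "v1"}, pods))
```

An empty result for the second call mirrors the "no endpoints" symptom: the pods are Running, but the Service has nothing to route to.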

Step 4 — Validate Ingress / Load Balancer

Check:

  • backend health
  • NEG health
  • SSL errors
  • firewall rules

Enterprise answer:

“I would verify whether the Google Cloud Load Balancer health checks are passing and whether the NEG endpoints are healthy.”


Step 5 — Roll Back

kubectl rollout undo deployment app

What Interviewers Want

They want:

  • systematic thinking
  • rollback strategy
  • understanding of Kubernetes traffic flow

2. Pods Stuck in Pending

Scenario

Pods remain Pending forever.


Strong Troubleshooting Flow

Check Scheduler Events

kubectl describe pod <pod>

Look for:

  • insufficient CPU
  • insufficient memory
  • taints
  • affinity rules

Check Cluster Autoscaler

Possible:

  • autoscaler disabled
  • max nodes reached
  • quota exhausted

Enterprise-level answer:

“I would check whether the cluster autoscaler attempted scale-up and whether GCP quota limitations blocked new node creation.”


Check PVC Binding

kubectl get pvc

Possible issue:

  • storage class mismatch
  • zone mismatch

Root Causes Interviewers Love

Root Cause                    Enterprise Relevance
────────────────────────────  ────────────────────
quota exhausted               very common
taints/tolerations mismatch   advanced scheduling
regional capacity shortage    cloud-scale issue
PVC unavailable               stateful apps

3. Node Suddenly Becomes NotReady

Scenario

Critical workloads running on a node disappear.


Strong Answer

Validate Node

kubectl get nodes
kubectl describe node <node>

Check:

  • memory pressure
  • disk pressure
  • network unavailable

Investigate kubelet

Possible:

  • kubelet crash
  • container runtime failure
  • disk full

Enterprise Recovery

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Then:

  • recreate node
  • let workloads reschedule

Strong Interview Addition

“I would verify PodDisruptionBudgets because aggressive draining can accidentally reduce application availability.”


4. Regional GKE Cluster Partial Outage

Scenario

One zone fails.

Pods impacted.

What happens?


Strong Answer

Explain Regional Architecture

Expected:

  • control plane remains available
  • workloads redistributed
  • multi-zone node pools

What Can Still Break?

Even with regional clusters:

  • zonal persistent disks
  • anti-affinity misconfiguration
  • single-zone databases
  • insufficient spare capacity

Strong Enterprise Statement

“A regional cluster alone does not guarantee HA unless workloads, storage, and dependencies are also multi-zone aware.”

Excellent interview answer.


5. Massive Traffic Spike

Scenario

Traffic increases 20x suddenly.

Application becomes slow.


Strong Troubleshooting

Check HPA

kubectl get hpa

Check Node Saturation

kubectl top nodes
kubectl top pods

Validate Cluster Autoscaler

Possible issue:

  • node provisioning too slow
  • quotas exhausted

Check Application Bottlenecks

Common:

  • DB connection exhaustion
  • thread pool saturation
  • external API latency

Senior-Level Insight

“Kubernetes scaling does not solve downstream bottlenecks automatically.”

Interviewers love this answer.


6. GKE Upgrade Breaks Production

Scenario

After cluster upgrade:

  • workloads fail
  • APIs deprecated
  • ingress stops working

Strong Answer

Immediate Actions

  • identify impacted workloads
  • check deprecated APIs
  • rollback node pools if possible
  • pause upgrades

Validate Version Compatibility

Check:

  • Ingress controller compatibility
  • CRDs
  • admission webhooks
  • CSI drivers
  • service mesh versions

Prevention

Expected answer:

  • pre-prod upgrade testing
  • canary node pools
  • maintenance windows
  • release channels

7. Workload Cannot Access GCP APIs

Scenario

Application suddenly cannot access:

  • GCS
  • Pub/Sub
  • Secret Manager

Strong Answer

Check Workload Identity

Validate:

  • KSA ↔ GSA mapping
  • IAM permissions
  • annotations

Commands:

kubectl describe sa <ksa-name> -n <namespace>

Common Enterprise Causes

Cause                    Example
───────────────────────  ────────────────────
IAM role removed         security team change
wrong service account    deployment issue
metadata server issue    node problem
token expiration issue   workload auth issue

8. DNS Resolution Failures

Scenario

Pods cannot resolve services or external hosts.


Strong Troubleshooting

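Validate Cluster DNS

GKE Standard clusters run kube-dns by default (CoreDNS is common on other distributions); both use the k8s-app=kube-dns label:

kubectl get pods -n kube-system -l k8s-app=kube-dns

Test DNS

Run the lookup from inside a pod, not from your workstation:

kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default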

Common Enterprise Causes

  • kube-dns / CoreDNS crash
  • upstream DNS failure
  • VPC DNS misconfiguration
  • stubDomains issue
  • NetworkPolicy blocking DNS

9. Ingress Returns 502/504

Scenario

Load balancer exists but users get:

  • 502
  • 504

Strong Enterprise Troubleshooting

Check:

  • readiness probes
  • backend timeout
  • application startup time
  • NEG health
  • firewall rules

Important Interview Insight

“In GKE, successful pod status does not necessarily mean the external load balancer considers the backend healthy.”

Very strong answer.


10. Security Incident in GKE

Scenario

Security team detects suspicious container activity.


Strong Enterprise Response

Immediate Actions

  • isolate namespace
  • cordon affected nodes
  • capture logs
  • preserve forensic evidence

Investigation

Check:

  • container image source
  • privileged containers
  • unexpected outbound traffic
  • service account abuse
  • suspicious exec activity

Recovery

  • redeploy from trusted image
  • rotate secrets
  • validate IAM roles
  • patch vulnerabilities

Strong Security Keywords

Mention:

  • Binary Authorization
  • Artifact Registry scanning
  • Workload Identity
  • Network Policies
  • Pod Security Standards
  • Security Command Center

11. StatefulSet Failure Scenario

Scenario

Database pod fails and cannot restart.


Troubleshooting

Check:

  • PVC attachment
  • disk zone
  • storage class
  • corrupted filesystem

Enterprise Insight

“Stateful workloads are often where Kubernetes HA assumptions break down.”

Excellent senior answer.


12. Multi-Tenant Cluster Resource Exhaustion

Scenario

One team’s workload consumes all cluster CPU.

Other applications fail.


Strong Answer

Controls:

  • ResourceQuota
  • LimitRange
  • node isolation
  • separate node pools
  • priority classes
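
The arithmetic behind a CPU ResourceQuota is simple: sum the CPU requests in a namespace and compare them with the hard limit. The sketch below reproduces that sum offline, for example to review manifests before applying them. The input shape and function names are assumptions; in a cluster the API server enforces the quota at admission time.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def cpu_requests_by_namespace(pods):
    """Sum container CPU requests per namespace, in millicores."""
    totals = {}
    for pod in pods:
        for container in pod["containers"]:
            cpu = container.get("resources", {}).get("requests", {}).get("cpu", "0")
            totals[pod["namespace"]] = totals.get(pod["namespace"], 0) + cpu_to_millicores(cpu)
    return totals


def namespaces_over_quota(pods, quotas):
    """Return namespaces whose summed CPU requests exceed their quota."""
    totals = cpu_requests_by_namespace(pods)
    return sorted(
        ns for ns, used in totals.items()
        if ns in quotas and used > cpu_to_millicores(quotas[ns])
    )


if __name__ == "__main__":
    pods = [
        {"namespace": "team-a", "containers": [
            {"resources": {"requests": {"cpu": "1500m"}}},
            {"resources": {"requests": {"cpu": "2"}}},
        ]},
        {"namespace": "team-b", "containers": [
            {"resources": {"requests": {"cpu": "250m"}}},
        ]},
    ]
    print(namespaces_over_quota(pods, {"team-a": "3", "team-b": "1"}))
```

Containers without a CPU request count as zero here, which is exactly the gap a LimitRange with default requests is meant to close.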

13. GitOps Accident Deletes Resources

Scenario

Bad Git commit deletes production ingress.


Strong Recovery

  • revert Git commit
  • reconcile ArgoCD/Flux
  • restore manifests
  • validate drift detection

Enterprise Prevention

  • branch protection
  • policy-as-code
  • approval workflows
  • progressive delivery

14. Service Mesh Failure

Scenario

After Istio upgrade:

  • latency spikes
  • mTLS failures
  • service communication broken

Troubleshooting

Check:

  • sidecar injection
  • cert rotation
  • envoy health
  • control plane version mismatch

Strong Statement

“Service mesh failures can appear as application failures even when the application itself is healthy.”


15. etcd / Control Plane Concerns

Scenario

API server becomes slow.


Strong GKE Answer

In GKE:

  • control plane managed by Google
  • limited direct etcd access

You focus on:

  • API latency
  • excessive CRDs
  • controller storms
  • webhook latency

Excellent Senior-Level Topics to Mention

Reliability

  • multi-zone architecture
  • PDBs
  • topology spread constraints
  • graceful degradation

Security

  • private clusters
  • authorized networks
  • workload identity
  • image signing

Operations

  • SLO/SLI
  • canary deployments
  • observability
  • GitOps

Golden Interview Rule

Always answer in this structure:

1. Identify scope

2. Check recent changes

3. Isolate layer

  • app
  • pod
  • node
  • network
  • ingress
  • cloud dependency

4. Mitigate impact

5. Recover service

6. Prevent recurrence


High-Value Enterprise Terms

Using these naturally helps a lot:

  • blast radius
  • graceful degradation
  • zero trust
  • multi-zone resilience
  • canary rollout
  • immutable infrastructure
  • policy-as-code
  • workload isolation
  • service dependency
  • SLO-driven operations

Understanding Workload Identity in GCP

Workload Identity Explained

The Problem It Solves

Traditional approach (service account keys):

  1. Developer creates GCP service account
  2. Downloads JSON key file
  3. Stores key in Kubernetes secret
  4. Mounts secret into pod
  5. App reads key from disk

  Problems:
  ├── Key never expires by default, so a stolen key grants access until it is revoked
  ├── Must be rotated manually (for example every 90 days)
  ├── Key stored in etcd (even if encrypted)
  ├── Key might end up in git, logs, or error messages
  ├── No audit trail of which pod used the key
  └── If cluster is compromised, all keys are exposed

Workload Identity (no keys):

  1. Pod starts with Kubernetes Service Account
  2. GKE metadata server intercepts credential request
  3. Issues short-lived token (1 hour)
  4. App uses token transparently

  Benefits:
  ├── No key files anywhere
  ├── Token auto-rotates every hour
  ├── Cryptographic binding to specific K8s SA
  ├── Full audit trail per pod identity
  ├── Cluster compromise → no long-lived credentials
  └── Revoke access by removing the IAM binding (already issued tokens expire within the hour)

How It Works — Deep Dive

┌──────────────────────────────────────────────────────────────┐
│                    WORKLOAD IDENTITY FLOW                    │
│                                                              │
│  ┌──────────┐  1. App calls GCP API                          │
│  │   App    │                                                │
│  │  in Pod  │                                                │
│  └────┬─────┘                                                │
│       │  2. GCP SDK calls metadata server                    │
│       │     (169.254.169.254)                                │
│       ▼                                                      │
│  ┌──────────────────────────┐                                │
│  │   GKE Metadata Server    │  3. Validates K8s SA token     │
│  │   (intercepted by GKE)   │                                │
│  └────────────┬─────────────┘                                │
│               │  4. Checks IAM binding                       │
│               ▼                                              │
│  ┌──────────────────────────┐                                │
│  │         GCP IAM          │                                │
│  │ K8s SA → GCP SA binding  │                                │
│  └────────────┬─────────────┘                                │
│               │  5. Issues short-lived token                 │
│               ▼                                              │
│  ┌──────────────────────────┐                                │
│  │        App in Pod        │  6. Uses token for GCP API call│
│  │                          │     (Storage, SQL, etc)        │
│  └──────────────────────────┘                                │
└──────────────────────────────────────────────────────────────┘

Two Types of Workload Identity

GKE Workload Identity (named Workload Identity Federation for GKE in current Google docs):

  • Kubernetes pods → GCP service accounts
  • Works inside GKE clusters
  • Uses GKE metadata server

Workload Identity Federation:

  • External workloads → GCP service accounts
  • Works outside GCP (GitHub Actions, AWS, Azure, on-prem)
  • Uses OIDC / SAML tokens
  • No GCP compute infrastructure needed for the workload itself

GKE Workload Identity — Step by Step Setup

Step 1 — Enable on Cluster

# Terraform: enable Workload Identity on the GKE cluster
resource "google_container_cluster" "primary" {
  name     = "my-cluster"
  project  = var.project_id
  location = var.region

  # Enable Workload Identity
  workload_identity_config {
    # Format: PROJECT_ID.svc.id.goog
    workload_pool = "${var.project_id}.svc.id.goog"
  }
}

# Enable on node pools too
resource "google_container_node_pool" "nodes" {
  project  = var.project_id
  cluster  = google_container_cluster.primary.name
  location = var.region # must match the cluster location

  node_config {
    # Use the GKE metadata server, which intercepts credential requests
    workload_metadata_config {
      mode = "GKE_METADATA" # critical: must be set
      # GKE_METADATA = use Workload Identity
      # GCE_METADATA = expose the node's service account (less secure)
    }
  }
}

# OR via gcloud
gcloud container clusters update my-cluster \
  --region us-central1 \
  --workload-pool=myproject.svc.id.goog

gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --region us-central1 \
  --workload-metadata=GKE_METADATA

Step 2 — Create GCP Service Account

# GCP Service Account: the identity pods will assume
resource "google_service_account" "app_sa" {
  account_id   = "my-app-sa"
  display_name = "My Application Service Account"
  project      = var.project_id
  description  = "Used by my-app pods via Workload Identity"
}

# Grant permissions to GCP SA
resource "google_project_iam_member" "app_sa_storage" {
  project = var.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.app_sa.email}"
}

resource "google_project_iam_member" "app_sa_sql" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.app_sa.email}"
}

resource "google_project_iam_member" "app_sa_secrets" {
  project = var.project_id
  role    = "roles/secretmanager.secretAccessor"
  member  = "serviceAccount:${google_service_account.app_sa.email}"
}

Step 3 — Create Kubernetes Service Account

# Kubernetes Service Account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-ksa # Kubernetes SA name
  namespace: production
  annotations:
    # This annotation binds K8s SA to GCP SA
    iam.gke.io/gcp-service-account: my-app-sa@myproject.iam.gserviceaccount.com

# Or via Terraform Kubernetes provider
resource "kubernetes_service_account" "app" {
  metadata {
    name      = "my-app-ksa"
    namespace = "production"
    annotations = {
      # Links this K8s SA to the GCP SA
      "iam.gke.io/gcp-service-account" = google_service_account.app_sa.email
    }
  }
}

Step 4 — Create IAM Binding (The Critical Step)

# Allow K8s SA to impersonate GCP SA
# This is the binding that makes Workload Identity work
resource "google_service_account_iam_member" "workload_identity_binding" {
  service_account_id = google_service_account.app_sa.name
  role               = "roles/iam.workloadIdentityUser"
  # Format: serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]
  #   PROJECT.svc.id.goog = workload pool
  #   NAMESPACE           = Kubernetes namespace (production)
  #   KSA_NAME            = Kubernetes service account (my-app-ksa)
  member = "serviceAccount:${var.project_id}.svc.id.goog[production/my-app-ksa]"
}

# Or via gcloud
gcloud iam service-accounts add-iam-policy-binding \
  my-app-sa@myproject.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:myproject.svc.id.goog[production/my-app-ksa]"
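
Because the member string must match the namespace and KSA name exactly, it helps to build it in one place and reuse it for both creating and checking bindings. A minimal sketch, assuming the PROJECT.svc.id.goog[NAMESPACE/KSA_NAME] format shown above; the helper names and the character checks are this example's own, not part of any Google library.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """Build the IAM member string for a Kubernetes service account."""
    for label, value in (("project_id", project_id),
                         ("namespace", namespace),
                         ("ksa_name", ksa_name)):
        # Reject values that would silently produce a malformed member string.
        if not value or any(ch in value for ch in "[]/ "):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def binding_present(policy_members, project_id, namespace, ksa_name):
    """Check whether the expected member appears in a list of policy members."""
    return workload_identity_member(project_id, namespace, ksa_name) in policy_members


if __name__ == "__main__":
    member = workload_identity_member("myproject", "production", "my-app-ksa")
    print(member)
    print(binding_present([member], "myproject", "prod", "my-app-ksa"))
```

The second call shows the most common failure mode: a binding created for one namespace does nothing for a pod running in another.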

Step 5 — Use in Pod

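# Pod uses the Kubernetes Service Account
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  # selector and template labels are required by the Deployment API;
  # the app=my-app label is an illustrative choice
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Reference the K8s SA
      serviceAccountName: my-app-ksa # connects to Workload Identity
      containers:
        - name: app
          image: myapp:latest # placeholder; pin a version or digest in production
          # No credential env vars needed
          # No key file mounts needed
          # GCP SDK auto-detects via metadata server
          env:
            - name: GOOGLE_CLOUD_PROJECT
              value: "myproject"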

Application Code — Zero Changes Needed

# Python GCP SDK auto-discovers credentials
from google.cloud import secretmanager
from google.cloud import storage

# No credentials needed: auto-detected via Workload Identity
storage_client = storage.Client()
bucket = storage_client.bucket("my-bucket")
blob = bucket.blob("data/file.json")
data = blob.download_as_text()
print("Downloaded:", data)

# Same for Secret Manager
sm_client = secretmanager.SecretManagerServiceClient()
secret = sm_client.access_secret_version(
    request={
        "name": "projects/myproject/secrets/my-secret/versions/latest"
    }
)
# Demo only: avoid printing secret values in real applications
print("Secret:", secret.payload.data.decode())

// Go: same pattern
package main

import (
    "context"
    "log"

    "cloud.google.com/go/storage"
)

func main() {
    ctx := context.Background()
    // No credentials: auto-discovered
    client, err := storage.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()
    // Use normally
    bucket := client.Bucket("my-bucket")
    obj := bucket.Object("data/file.json")
    _ = obj // placeholder so the snippet compiles; read the object here
    // ...
}

// Node.js: same pattern
const { Storage } = require('@google-cloud/storage');

async function main() {
  // Auto-discovers Workload Identity credentials
  const storage = new Storage();
  const bucket = storage.bucket('my-bucket');
  const [files] = await bucket.getFiles();
  console.log(files.map((file) => file.name));
}

main().catch(console.error);
Workload Identity Federation

For workloads outside GCP — GitHub Actions, AWS, Azure, on-prem:

External workload presents its native identity token
GCP validates token against trusted provider (OIDC)
GCP issues short-lived GCP token
External workload uses GCP services
No GCP service account keys needed
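
The gatekeeping part of that flow can be pictured with a toy model: map token claims to attributes, then refuse the exchange unless the issuer is trusted and the condition holds. This is only a simulation for intuition. The real service verifies the token signature and evaluates CEL expressions such as assertion.repository_owner; the function names here are assumptions and nothing below calls a Google API.

```python
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def map_attributes(claims, mapping):
    """Apply a mapping such as {"attribute.repository": "repository"} to token claims."""
    return {target: claims.get(source) for target, source in mapping.items()}


def exchange_allowed(claims, trusted_issuer, required_owner):
    """Toy version of the provider checks: trusted issuer plus an owner condition."""
    if claims.get("iss") != trusted_issuer:
        return False
    return claims.get("repository_owner") == required_owner


if __name__ == "__main__":
    claims = {
        "iss": GITHUB_ISSUER,
        "sub": "repo:myorg/myrepo:ref:refs/heads/main",
        "repository": "myorg/myrepo",
        "repository_owner": "myorg",
    }
    mapping = {"google.subject": "sub", "attribute.repository": "repository"}
    print(map_attributes(claims, mapping))
    print(exchange_allowed(claims, GITHUB_ISSUER, "myorg"))
```

The mapped attribute.repository value is what an IAM principalSet binding later matches on, so a token from another organization fails the condition before any service account can be impersonated.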

GitHub Actions → GCP

# Step 1: Create Workload Identity Pool
gcloud iam workload-identity-pools create "github-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --display-name="GitHub Actions Pool"

# Step 2: Create OIDC Provider
gcloud iam workload-identity-pools providers create-oidc "github-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="github-pool" \
  --display-name="GitHub Actions Provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository,attribute.actor=assertion.actor,attribute.ref=assertion.ref" \
  --attribute-condition="assertion.repository_owner=='myorg'" \
  --issuer-uri="https://token.actions.githubusercontent.com"
# Terraform version
resource "google_iam_workload_identity_pool" "github" {
  project                   = var.project_id
  workload_identity_pool_id = "github-pool"
  display_name              = "GitHub Actions Pool"
  description               = "Pool for GitHub Actions workflows"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  project                            = var.project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"
  display_name                       = "GitHub Actions Provider"

  # Map GitHub token claims to Google attributes
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
    "attribute.ref"        = "assertion.ref"
    "attribute.actor"      = "assertion.actor"
    "attribute.workflow"   = "assertion.workflow"
  }

  # Only trust tokens from our org
  attribute_condition = "assertion.repository_owner == 'myorg'"

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Allow a specific repo to use the GCP SA
resource "google_service_account_iam_member" "github_wif" {
  service_account_id = google_service_account.deploy_sa.name
  role               = "roles/iam.workloadIdentityUser"
  # Matches any branch or workflow in myorg/myrepo; to limit this to main,
  # extend attribute_condition with a check on assertion.ref
  member = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/myorg/myrepo"
}
# GitHub Actions workflow using WIF
name: Deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # required for OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate to GCP
        uses: google-github-actions/auth@v2
        with:
          # No credentials in secrets
          # Keep the provider on one line: a folded scalar (>-) would insert spaces into the resource name
          workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github-pool/providers/github-provider
          service_account: deploy-sa@myproject.iam.gserviceaccount.com
      - name: Deploy to GKE
        run: |
          gcloud container clusters get-credentials \
            my-cluster --region us-central1
          kubectl apply -f k8s/

AWS → GCP

# Allow AWS workloads to use GCP
resource "google_iam_workload_identity_pool_provider" "aws" {
  project                            = var.project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.external.workload_identity_pool_id
  workload_identity_pool_provider_id = "aws-provider"

  attribute_mapping = {
    "google.subject"     = "assertion.arn"
    "attribute.aws_role" = "assertion.arn.extract('assumed-role/{role}/')"
    "attribute.aws_ec2"  = "assertion.arn.extract('instance/{instance}')"
  }

  # Trust specific AWS account
  attribute_condition = "assertion.account == '123456789012'"

  aws {
    account_id = "123456789012"
  }
}

# Allow specific AWS role
resource "google_service_account_iam_member" "aws_wif" {
  service_account_id = google_service_account.cross_cloud_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.external.name}/attribute.aws_role/my-ec2-role"
}

Verifying Workload Identity

# Check WI is configured on cluster
gcloud container clusters describe my-cluster \
  --region us-central1 \
  --format="value(workloadIdentityConfig.workloadPool)"
# Should output: myproject.svc.id.goog

# Check node pool metadata mode
gcloud container node-pools describe default-pool \
  --cluster my-cluster \
  --region us-central1 \
  --format="value(config.workloadMetadataConfig.mode)"
# Should output: GKE_METADATA

# Check K8s SA annotation
kubectl describe serviceaccount my-app-ksa -n production
# Should show: iam.gke.io/gcp-service-account: my-app-sa@myproject.iam.gserviceaccount.com

# Check IAM binding exists
gcloud iam service-accounts get-iam-policy \
  my-app-sa@myproject.iam.gserviceaccount.com

# Test from inside a pod
# (recent kubectl versions no longer accept --serviceaccount on "kubectl run";
#  set the service account through --overrides instead)
kubectl run wi-test \
  --image=google/cloud-sdk:slim \
  --namespace=production \
  --overrides='{"spec": {"serviceAccountName": "my-app-ksa"}}' \
  --rm -it \
  -- /bin/bash

# Inside pod: test identity
curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# Should return: my-app-sa@myproject.iam.gserviceaccount.com

# Test actual GCP access
gcloud auth list
gsutil ls gs://my-bucket/
Common Issues and Fixes

# Issue 1: Pod gets node SA instead of app SA
# Symptom: metadata returns node-sa@myproject not app-sa@myproject
# Cause: node pool not in GKE_METADATA mode
gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --region us-central1 \
  --workload-metadata=GKE_METADATA

# Issue 2: Permission denied
# Symptom: 403 when calling GCP APIs
# Check 1: IAM binding exists
gcloud iam service-accounts get-iam-policy my-app-sa@myproject.iam.gserviceaccount.com
# Check 2: K8s SA annotation correct
kubectl get serviceaccount my-app-ksa -n production -o yaml
# Check 3: Pod using correct K8s SA
kubectl get pod my-app-pod -n production -o jsonpath='{.spec.serviceAccountName}'

# Issue 3: Wrong namespace or SA name in binding
# The member format must exactly match:
# serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]
# Common mistake: wrong namespace, wrong KSA name

# Issue 4: WI not enabled on cluster
# Symptom: metadata returns compute SA not app SA
# Fix: enable workload identity on cluster
gcloud container clusters update my-cluster \
  --region us-central1 \
  --workload-pool=myproject.svc.id.goog

# Issue 5: Application not using ADC
# Symptom: app uses hardcoded credentials or env vars
# Fix: remove GOOGLE_APPLICATION_CREDENTIALS env var
# Let GCP SDK use Application Default Credentials (ADC)
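
The five issues above form a natural checklist, which the sketch below walks through in order over values you would collect with the gcloud and kubectl commands shown. The state dictionary and its key names are assumptions made for this example; the function only compares values and does not call any API.

```python
def diagnose(state):
    """Return the first likely Workload Identity misconfiguration, checked in order."""
    expected_member = (
        f"serviceAccount:{state['project']}.svc.id.goog"
        f"[{state['namespace']}/{state['ksa']}]"
    )
    if not state.get("workload_pool"):
        return "Workload Identity is not enabled on the cluster"
    if state.get("node_metadata_mode") != "GKE_METADATA":
        return "node pool is not in GKE_METADATA mode"
    if state.get("pod_service_account") != state["ksa"]:
        return "pod is not running as the expected Kubernetes service account"
    if state.get("ksa_annotation") != state["gsa_email"]:
        return "KSA annotation does not point at the GCP service account"
    if expected_member not in state.get("binding_members", []):
        return "IAM binding is missing or has the wrong namespace/KSA name"
    if state.get("google_application_credentials"):
        return "GOOGLE_APPLICATION_CREDENTIALS is set, so ADC uses that file first"
    return "configuration looks consistent"


if __name__ == "__main__":
    state = {
        "project": "myproject",
        "namespace": "production",
        "ksa": "my-app-ksa",
        "gsa_email": "my-app-sa@myproject.iam.gserviceaccount.com",
        "workload_pool": "myproject.svc.id.goog",
        "node_metadata_mode": "GKE_METADATA",
        "pod_service_account": "my-app-ksa",
        "ksa_annotation": "my-app-sa@myproject.iam.gserviceaccount.com",
        "binding_members": ["serviceAccount:myproject.svc.id.goog[prod/my-app-ksa]"],
    }
    print(diagnose(state))
```

If every check passes and calls still return 403, the remaining suspect is usually the IAM role granted to the GCP service account itself.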

Security Comparison

Approach                                   Attack surface                 Rotation            Audit
─────────────────────────────────────────  ─────────────────────────────  ──────────────────  ───────────────────
SA key in secret                           High: key persists forever     Manual, 90 days     Per key (limited)
SA key in env var                          Very high: in memory and logs  Manual              None
Workload Identity (GKE)                    Minimal: no key exists         Automatic, 1 hour   Per pod (full)
Workload Identity Federation (GitHub/AWS)  Minimal: no key stored         Automatic, ~1 hour  Per workflow (full)

Interview Talking Points

If asked "explain Workload Identity":

Simple version (30 seconds):

  "Instead of giving pods a service account key file
  that could be stolen, Workload Identity lets pods
  prove their identity cryptographically using their
  Kubernetes Service Account. GCP issues a short-lived
  token automatically, so there are no keys to leak
  and nothing to rotate by hand."

Technical version (2 minutes):

  "When a pod calls a GCP API, the GCP SDK requests
  credentials from the metadata server at 169.254.169.254.
  GKE intercepts this request, validates the pod's
  Kubernetes Service Account token against the IAM
  binding we configured, and if it matches, GCP issues
  a one-hour OAuth token. The app never sees a key file.
  The token auto-rotates. Even if someone gets into the
  pod, they get a token that expires in at most one hour."

What risk did it reduce:

  "We had 28 service account key files across our clusters.
  Any of those could be leaked via logs, error messages,
  or git commits. After implementing Workload Identity,
  we have zero key files. We also gained per-pod audit
  trails in Cloud Audit Logs, so we can see exactly which
  pod made which GCP API call."

Workload Identity is one of the highest-impact security improvements you can make to a GKE cluster: it eliminates an entire class of credential exposure risk, usually with no application code changes (as long as the app already uses Application Default Credentials) and with minimal operational overhead.

Terraform GCP Hub-Spoke Setup for Private GKE

Below is a production-style Terraform baseline for a GCP hub-spoke layout with private GKE. It uses private GKE, Workload Identity, Cloud NAT, a Shared VPC-ready layout, and secure node pool defaults. Google's docs confirm that Workload Identity Federation for GKE is enabled by setting the workload pool to PROJECT_ID.svc.id.goog, that Cloud NAT is the managed NAT service for outbound access from private nodes, and that Terraform is officially supported for GKE provisioning. (Google Cloud Documentation)


Repo layout

gcp-gke-hub-spoke/
├── providers.tf
├── variables.tf
├── main.tf
├── outputs.tf
├── terraform.tfvars
└── modules/
    ├── network/
    │   └── main.tf
    ├── nat/
    │   └── main.tf
    └── gke-private/
        └── main.tf

providers.tf

terraform {
  required_version = ">= 1.6.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 6.0"
    }
  }
}

provider "google" {
  project = var.hub_project_id
  region  = var.region
}

variables.tf

variable "hub_project_id" {}
variable "spoke_project_id" {}

variable "region" {
  default = "northamerica-northeast1"
}

variable "hub_vpc_name" {
  default = "hub-vpc"
}

variable "spoke_vpc_name" {
  default = "spoke-prod-vpc"
}

variable "gke_cluster_name" {
  default = "prod-private-gke"
}

variable "gke_subnet_cidr" {
  default = "10.10.0.0/20"
}

variable "pod_cidr" {
  default = "10.20.0.0/16"
}

variable "service_cidr" {
  default = "10.30.0.0/20"
}

variable "master_ipv4_cidr_block" {
  default = "172.16.0.0/28"
}
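
Before applying, it can be worth checking that the address plan is internally consistent. The sketch below uses Python's standard ipaddress module to confirm that the default CIDRs from variables.tf and the hub subnets from main.tf do not overlap, and to estimate how many nodes the pod range can hold. The function names are hypothetical, and the /24-per-node figure is an assumption based on GKE's default of 110 pods per node; a different max-pods setting changes it.

```python
import ipaddress
from itertools import combinations

# Defaults copied from variables.tf and the hub subnets in main.tf.
RANGES = {
    "hub-transit": "10.0.0.0/24",
    "hub-security": "10.0.1.0/24",
    "gke-nodes": "10.10.0.0/20",
    "pods": "10.20.0.0/16",
    "services": "10.30.0.0/20",
    "control-plane": "172.16.0.0/28",
}


def find_overlaps(ranges):
    """Return every pair of named ranges whose address space overlaps."""
    # ip_network() raises ValueError on malformed CIDRs or set host bits.
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


def max_nodes(pod_cidr, per_node_prefix=24):
    """Upper bound on nodes, assuming each node gets one /per_node_prefix slice."""
    pod_net = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < pod_net.prefixlen:
        raise ValueError("pod range is smaller than a single node's slice")
    return 2 ** (per_node_prefix - pod_net.prefixlen)


if __name__ == "__main__":
    print("overlaps:", find_overlaps(RANGES))
    print("max nodes for pod range:", max_nodes(RANGES["pods"]))
```

Overlap matters most once the hub and spoke are connected (peering, NCC, or hybrid links), because overlapping ranges cannot be routed between networks.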

main.tf

module "hub_network" {
  source     = "./modules/network"
  project_id = var.hub_project_id
  name       = var.hub_vpc_name
  region     = var.region

  subnets = {
    hub-transit  = "10.0.0.0/24"
    hub-security = "10.0.1.0/24"
  }
}

module "spoke_network" {
  source     = "./modules/network"
  project_id = var.spoke_project_id
  name       = var.spoke_vpc_name
  region     = var.region

  subnets = {
    gke-nodes = var.gke_subnet_cidr
  }

  secondary_ranges = {
    gke-nodes = {
      pods     = var.pod_cidr
      services = var.service_cidr
    }
  }
}

module "spoke_nat" {
  source      = "./modules/nat"
  project_id  = var.spoke_project_id
  region      = var.region
  network     = module.spoke_network.network_self_link
  router_name = "spoke-prod-router"
  nat_name    = "spoke-prod-cloud-nat"
}

module "private_gke" {
  source                 = "./modules/gke-private"
  project_id             = var.spoke_project_id
  region                 = var.region
  cluster_name           = var.gke_cluster_name
  network                = module.spoke_network.network_self_link
  subnetwork             = module.spoke_network.subnet_self_links["gke-nodes"]
  pod_range_name         = "pods"
  service_range_name     = "services"
  master_ipv4_cidr_block = var.master_ipv4_cidr_block
}

Module: modules/network/main.tf

variable "project_id" {}
variable "name" {}
variable "region" {}

variable "subnets" {
  type = map(string)
}

variable "secondary_ranges" {
  type    = map(any)
  default = {}
}

resource "google_compute_network" "vpc" {
  project                 = var.project_id
  name                    = var.name
  auto_create_subnetworks = false
  routing_mode            = "GLOBAL"
}

resource "google_compute_subnetwork" "subnet" {
  for_each                 = var.subnets
  project                  = var.project_id
  name                     = each.key
  region                   = var.region
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = each.value
  private_ip_google_access = true

  # Secondary ranges (pods/services) are attached only to subnets listed in var.secondary_ranges
  dynamic "secondary_ip_range" {
    for_each = lookup(var.secondary_ranges, each.key, {})
    content {
      range_name    = secondary_ip_range.key
      ip_cidr_range = secondary_ip_range.value
    }
  }

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_compute_firewall" "deny_all_ingress" {
  project   = var.project_id
  name      = "${var.name}-deny-all-ingress"
  network   = google_compute_network.vpc.name
  direction = "INGRESS"
  priority  = 65534

  deny {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "allow_internal" {
  project   = var.project_id
  name      = "${var.name}-allow-internal"
  network   = google_compute_network.vpc.name
  direction = "INGRESS"
  priority  = 1000

  allow {
    protocol = "tcp"
  }

  allow {
    protocol = "udp"
  }

  allow {
    protocol = "icmp"
  }

  source_ranges = values(var.subnets)
}

output "network_self_link" {
  value = google_compute_network.vpc.self_link
}

output "subnet_self_links" {
  value = {
    for k, v in google_compute_subnetwork.subnet : k => v.self_link
  }
}

Module: modules/nat/main.tf

variable "project_id" {}
variable "region" {}
variable "network" {}
variable "router_name" {}
variable "nat_name" {}

resource "google_compute_router" "router" {
  project = var.project_id
  name    = var.router_name
  region  = var.region
  network = var.network
}

resource "google_compute_router_nat" "nat" {
  project                            = var.project_id
  name                               = var.nat_name
  router                             = google_compute_router.router.name
  region                             = var.region
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  log_config {
    enable = true
    filter = "ERRORS_ONLY"
  }
}

Module: modules/gke-private/main.tf

variable "project_id" {}
variable "region" {}
variable "cluster_name" {}
variable "network" {}
variable "subnetwork" {}
variable "pod_range_name" {}
variable "service_range_name" {}
variable "master_ipv4_cidr_block" {}

# Dedicated least-privilege node service account (instead of the Compute Engine default SA)
resource "google_service_account" "gke_nodes" {
  project      = var.project_id
  account_id   = "${var.cluster_name}-nodes"
  display_name = "GKE node service account"
}

resource "google_project_iam_member" "gke_node_logging" {
  project = var.project_id
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

resource "google_project_iam_member" "gke_node_monitoring" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}

resource "google_container_cluster" "cluster" {
  project    = var.project_id
  name       = var.cluster_name
  location   = var.region
  network    = var.network
  subnetwork = var.subnetwork

  # Drop the default pool; nodes come from the separately managed pool below
  remove_default_node_pool = true
  initial_node_count       = 1

  release_channel {
    channel = "REGULAR"
  }

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = true
    master_ipv4_cidr_block  = var.master_ipv4_cidr_block
  }

  ip_allocation_policy {
    cluster_secondary_range_name  = var.pod_range_name
    services_secondary_range_name = var.service_range_name
  }

  # Enabled with no CIDR blocks: add cidr_blocks entries for the bastion/VPN
  # ranges that need to reach the private control plane endpoint
  master_authorized_networks_config {}

  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  addons_config {
    network_policy_config {
      disabled = false
    }

    http_load_balancing {
      disabled = false
    }
  }

  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "WORKLOADS",
      "APISERVER",
      "CONTROLLER_MANAGER",
      "SCHEDULER"
    ]
  }

  monitoring_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "CONTROLLER_MANAGER",
      "SCHEDULER"
    ]
  }

  enable_shielded_nodes = true

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }
}

resource "google_container_node_pool" "secure_pool" {
  project  = var.project_id
  name     = "secure-pool"
  location = var.region
  cluster  = google_container_cluster.cluster.name

  # node_count should not be combined with autoscaling; set only the starting size.
  # In a regional cluster this value is per zone.
  initial_node_count = 2

  node_config {
    machine_type    = "e2-standard-4"
    service_account = google_service_account.gke_nodes.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    # Required on the node pool for Workload Identity to work
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    labels = {
      environment = "prod"
      security    = "restricted"
    }

    metadata = {
      disable-legacy-endpoints = "true"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  # min/max are per zone in a regional cluster
  autoscaling {
    min_node_count = 2
    max_node_count = 6
  }
}

output "cluster_name" {
  value = google_container_cluster.cluster.name
}

output "endpoint" {
  value     = google_container_cluster.cluster.endpoint
  sensitive = true
}
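
One sizing detail is easy to miss: in a regional cluster the node pool autoscaling bounds apply per zone, so the pool as a whole is larger than the numbers in the file suggest. A minimal arithmetic sketch in Python follows; the zone count of 3 is an assumption (a typical regional default), and pool_bounds is a hypothetical helper, not a GKE API.

```python
def pool_bounds(min_per_zone: int, max_per_zone: int, zone_count: int):
    """Total (min, max) nodes for a pool whose autoscaling limits are per zone."""
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    if not 0 <= min_per_zone <= max_per_zone:
        raise ValueError("expected 0 <= min_per_zone <= max_per_zone")
    return min_per_zone * zone_count, max_per_zone * zone_count


if __name__ == "__main__":
    # min_node_count = 2, max_node_count = 6, assumed 3 zones in the region
    total_min, total_max = pool_bounds(2, 6, 3)
    print(f"pool runs between {total_min} and {total_max} nodes")
```

Under that assumption the secure pool never drops below 6 e2-standard-4 nodes, which is worth knowing for both cost and quota planning.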

terraform.tfvars

hub_project_id = "my-hub-project"
spoke_project_id = "my-spoke-prod-project"
region = "northamerica-northeast1"
hub_vpc_name = "hub-vpc"
spoke_vpc_name = "spoke-prod-vpc"
gke_cluster_name = "prod-private-gke"

Deploy

terraform init
terraform fmt -recursive
terraform validate
terraform plan
terraform apply

Important production notes

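This baseline creates separate hub and spoke VPCs with no link between them, and egress leaves through Cloud NAT in the spoke (module spoke_nat). True centralized egress through the hub needs one of these designs:

Option 1: Shared VPC
  • Hub/Host Project owns the VPC
  • Spoke/Service Projects consume subnets
Option 2: Network Connectivity Center
  • Hub connects spokes with routing control
Option 3: VPC Peering
  • Simple hub-spoke, but no transitive routing
  • A Cloud NAT gateway in the hub does not serve VMs in peered spokes, so hub egress needs a proxy or network virtual appliance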

For real enterprise GKE, I would use:

Shared VPC + Private GKE + Cloud NAT + Cloud Armor + Artifact Registry
+ Binary Authorization + Secret Manager + Workload Identity
+ NetworkPolicy + centralized logging

Use this as the secure base, then add Cloud Armor ingress, Private Service Connect, and Secret Manager CSI driver next.

Implementing Hub-Spoke Architecture in GCP

Here’s a production-grade GKE + Hub-Spoke secure architecture you can use for design, implementation, and interviews.


GKE + Hub-Spoke Secure Architecture (GCP)

High-Level Architecture

Architecture Breakdown

1. Hub Project (Central Security Layer)

VPC: hub-vpc

Core components:

  • Central firewall (NGFW / VPC firewall rules)
  • Cloud NAT (controlled egress)
  • VPN / Interconnect (on-prem connectivity)
  • Cloud DNS (private zones + forwarding)
  • Logging / Monitoring aggregation
  • Security tools (IDS/IPS, SIEM)

2. Spoke Projects (GKE Workloads)

Each spoke = isolated environment:

  • spoke-prod
  • spoke-dev
  • spoke-data

Each contains:

Private Google Kubernetes Engine cluster

Key GKE Security Settings:
  • Private cluster (no public nodes)
  • Private control plane endpoint
  • Master authorized networks
  • Workload Identity enabled
  • Shielded nodes

3. Networking Model

Recommended: Shared VPC

  • Hub = Host Project
  • Spokes = Service Projects

✔ Centralized control
✔ Strong governance
✔ Simplified routing


Connectivity: Hub-Spoke Routing

Use:

  • Google Cloud Network Connectivity Center (preferred)
  • OR VPC Peering (basic, no transitive routing)

Traffic Flow (VERY IMPORTANT)

1. Ingress (User → App)

User → HTTPS LB → Cloud Armor → Ingress → GKE Service → Pod

Security layers:

  • WAF (Cloud Armor)
  • TLS termination
  • Identity-aware access (optional)

2. East-West (Pod-to-Pod)

Pod → Pod (via CNI)

Controlled by:

  • Kubernetes Network Policies
  • Service mesh (mTLS via Istio)

3. Egress (Pod → Internet)

Pod → Node → VPC → Hub → Cloud NAT → Internet

✔ Central inspection
✔ Logging
✔ Policy enforcement


Security Controls by Layer


Identity Layer

  • Workload Identity (no service account keys)
  • IAM roles scoped per service
  • RBAC inside cluster

Network Layer

  • Private GKE clusters
  • No public IPs on nodes
  • Centralized egress via NAT
  • Network policies (deny-by-default)
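
The deny-by-default item in the Network Layer list can be made concrete with a small generator. This is a sketch under stated assumptions: default_deny is a hypothetical helper, the namespace name is illustrative, and the output is JSON, which kubectl accepts in place of YAML. The manifest fields themselves (networking.k8s.io/v1, empty podSelector, both policy types) are the standard NetworkPolicy API.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty selector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no rules blocks all ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Example: python default_deny.py | kubectl apply -f -
    print(json.dumps(default_deny("production"), indent=2))
```

Because egress is denied as well, DNS lookups stop working until a follow-up policy allows traffic to the cluster DNS service, so explicit allow rules normally ship alongside this baseline.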

Perimeter Layer

  • Cloud Armor (WAF)
  • Internal Load Balancers for internal apps
  • Private Service Connect (for APIs)

Workload Layer

  • Non-root containers
  • Read-only filesystem
  • No privileged pods
  • Pod Security Standards (restricted)

Supply Chain Layer

  • Artifact Registry (private images)
  • Image scanning enabled
  • Binary Authorization (only trusted images)

Observability Layer

  • VPC Flow Logs
  • Firewall logs
  • GKE audit logs
  • Centralized logging in hub

Real-World Enhancements

1. Zero Trust GKE

  • Combine:
    • Network policies
    • mTLS (Istio)
    • Identity-aware proxy

2. Data Protection

  • VPC Service Controls
  • Restrict API access boundaries

3. Multi-Region Design

Region A Hub + Spokes
Region B Hub + Spokes
Global Load Balancer

Common Pitfalls

Public GKE clusters

→ Always use private clusters


Direct internet egress from spokes

→ Must go through hub NAT


No network policies

→ Pods can talk to everything


Hardcoded secrets in YAML

→ Use Secret Manager


Flat VPC design

→ No segmentation = high blast radius


Interview-Ready Summary


I design GKE in a hub-spoke model using Shared VPC, where the hub hosts centralized security services like firewall, NAT, DNS, and logging.

Each spoke runs private GKE clusters with Workload Identity and no public exposure.

Ingress is secured through HTTPS load balancer with Cloud Armor, east-west traffic is controlled via network policies and mTLS, and all egress is routed through the hub for inspection.

I also enforce supply chain security using Artifact Registry and Binary Authorization, and enable audit logging and threat detection centrally.

This architecture ensures isolation, least privilege, and centralized security control across environments.


Understanding Hub-Spoke Topology in GCP

Here’s a clear, enterprise-grade explanation of Hub-Spoke topology in GCP, mapped to how you’d actually design it in production (very similar to Azure Landing Zones, but with GCP-native constructs).


Hub-Spoke Topology in Google Cloud Platform

Concept (Simple View)

Hub = centralized services (security, connectivity, control)
Spokes = application workloads (isolated environments)


Core Components (GCP Mapping)

1. Hub VPC (Transit / Shared Services)

What lives in the Hub:

  • Firewall / inspection (NGFW, IDS/IPS)
  • Cloud NAT (outbound internet)
  • VPN / Interconnect (on-prem connectivity)
  • DNS (Cloud DNS private zones)
  • Logging / monitoring agents
  • Security tooling (SIEM, proxies)

2. Spoke VPCs (Workloads)

Each spoke is:

  • App-specific (e.g., prod, dev, data)
  • Isolated network boundary
  • No direct internet exposure (recommended)

Examples:

  • spoke-prod-app
  • spoke-dev-app
  • spoke-data-platform

3. Connectivity Layer

Option A: VPC Network Peering
  • Simple
  • Used for direct hub ↔ spoke
  • No transitive routing
Option B (Recommended): Google Cloud Network Connectivity Center
  • Enables true hub-spoke with transitive routing
  • Central control of connectivity
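
The "no transitive routing" limitation is the part that most often surprises people, so a toy model may help. The sketch below is a deliberately simplified illustration in Python, not a representation of real GCP routing: networks are graph nodes, peerings are edges, and the function names and network names are hypothetical.

```python
from collections import deque

# Hub-spoke built from two peerings; the spokes are not peered with each other.
PEERINGS = [("hub", "spoke-a"), ("hub", "spoke-b")]


def neighbors(peerings):
    """Adjacency map for an undirected set of peerings."""
    graph = {}
    for a, b in peerings:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable_with_peering(peerings, src, dst):
    """Peering only exchanges routes between directly peered networks."""
    return src == dst or dst in neighbors(peerings).get(src, set())


def reachable_if_transitive(peerings, src, dst):
    """What the same topology would allow if routes propagated through the hub."""
    graph = neighbors(peerings)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    print(reachable_with_peering(PEERINGS, "spoke-a", "hub"))
    print(reachable_with_peering(PEERINGS, "spoke-a", "spoke-b"))
    print(reachable_if_transitive(PEERINGS, "spoke-a", "spoke-b"))
```

The gap between the last two results is what a routing hub such as Network Connectivity Center (or appliances in the hub) is meant to close.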

4. Shared VPC (IMPORTANT)

The most “GCP-native” hub-spoke model

Use:

  • Host Project → owns VPC (hub)
  • Service Projects → attach (spokes)

Host Project (Hub VPC)
├── Subnet: shared-services
├── Subnet: security
└── Subnet: transit
Service Project A (Spoke)
└── Uses subnet from Host VPC
Service Project B (Spoke)
└── Uses subnet from Host VPC

Security Design (This is what interviewers want)

1. Centralized Egress Control

All outbound traffic:
Spokes → Hub → Internet (via NAT / firewall)

Why:

  • Prevent data exfiltration
  • Apply inspection

2. Zero Trust Networking

  • No direct spoke-to-spoke communication (default deny)
  • Use:
    • Firewall rules
    • Identity-aware proxy (IAP)

3. Private Services Access

  • Private Google Access
  • Private Service Connect (for APIs)

4. DNS Centralization

Use:

  • Cloud DNS private zones
  • Forwarding rules for hybrid (on-prem)

5. Logging & Monitoring

  • VPC Flow Logs
  • Firewall logs
  • Centralized in hub project

Reference Architecture (Production)

Typical Enterprise Layout

                On-Prem
                   │
        ┌──────────┴──────────┐
        │   VPN / Interconnect│
        └──────────┬──────────┘
                   │
             ┌─────▼─────┐
             │   HUB VPC │
             │───────────│
             │ Firewall  │
             │ NAT       │
             │ DNS       │
             │ Logging   │
             └─────┬─────┘
        ┌──────────┼──────────┐
        │          │          │
   ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
   │Spoke A │ │Spoke B │ │Spoke C │
   │(Prod)  │ │(Dev)   │ │(Data)  │
   └────────┘ └────────┘ └────────┘


Common Mistakes (Real-World)

1. Using only VPC Peering

  • No transitive routing → breaks hub model

2. Allowing direct internet from spokes

  • Bypasses security inspection

3. No centralized DNS

  • Causes hybrid resolution failures

4. Flat network (no segmentation)

  • High blast radius

5. Overusing firewall allow rules

  • Leads to implicit trust

Advanced Enhancements

1. Service Perimeter (Data Protection)

Use:

  • VPC Service Controls

Prevent:

  • Data exfiltration to external networks

2. Identity-Based Access

  • IAM Conditions
  • IAP for internal apps

3. Multi-Region Hub-Spoke

  • Hub per region OR global hub with NCC

4. Secure GKE Integration

  • Private GKE clusters in spokes
  • Control plane restricted
  • Egress via hub

Interview Answer (2-Minute Version)


In GCP, a hub-spoke topology uses a centralized hub VPC for shared services like firewall, NAT, DNS, and hybrid connectivity, while application workloads run in isolated spoke VPCs or service projects.

The most GCP-native implementation is Shared VPC, where a host project owns the network and service projects act as spokes.

For enterprise scale, I use Network Connectivity Center instead of basic VPC peering to enable transitive routing.

From a security perspective, I enforce centralized egress through the hub, implement zero-trust segmentation between spokes, use private clusters for GKE, centralize DNS, and enable logging and monitoring in the hub.

This pattern reduces blast radius, improves governance, and aligns with landing zone architecture.


GKE vs AKS vs EKS: Comprehensive Security Analysis

GKE vs AKS vs EKS Security Deep Dive

Quick verdict

Area                            Strongest
──────────────────────────────  ────────────────────────────────
Secure-by-default Kubernetes    GKE Autopilot
Enterprise identity/governance  AKS
AWS-native workload IAM         EKS
Runtime threat detection        AKS + Defender / EKS + GuardDuty
Supply-chain enforcement        GKE Binary Authorization
Network customization           EKS
Easiest production baseline     GKE Autopilot / AKS Automatic

1. Identity & Access

Feature         GKE                           AKS                           EKS
──────────────  ────────────────────────────  ────────────────────────────  ──────────────────────────────
Cloud identity  Google IAM                    Microsoft Entra ID            AWS IAM
Pod identity    Workload Identity Federation  Microsoft Entra Workload ID   IRSA / EKS Pod Identity
Cluster RBAC    Kubernetes RBAC + IAM         Kubernetes RBAC + Azure RBAC  Kubernetes RBAC + IAM mappings
Best fit        Clean GCP-native identity     Enterprise AD/Entra shops     AWS IAM-heavy environments

Deep point:
GKE Workload Identity Federation lets pods access Google Cloud APIs without service account keys. AKS integrates tightly with Microsoft Entra ID and Azure RBAC. EKS uses IAM Roles for Service Accounts so pods can call AWS APIs without static credentials. (Google Cloud Documentation)


2. Network Security

Area             GKE                             AKS                          EKS
───────────────  ──────────────────────────────  ───────────────────────────  ────────────────────────────────────
Private cluster  Strong                          Strong                       Strong
Network policy   GKE Dataplane / Calico options  Azure/Cilium/Calico options  AWS VPC CNI + network policy options
Cloud firewall   VPC Firewall                    NSG / Azure Firewall         Security Groups / NACLs
Ingress WAF      Cloud Armor                     Azure WAF                    AWS WAF
Service mesh     Anthos Service Mesh             Istio/OSM-style options      App Mesh/Istio

Deep point:
EKS usually gives the most AWS network-level flexibility, especially with VPC CNI, security groups, and subnet routing. AKS is strong when integrated into hub-spoke with Azure Firewall and Private DNS. GKE is clean and secure when paired with private clusters, Cloud NAT, VPC Service Controls, and Cloud Armor.


3. Workload Security

Control                 GKE                   AKS                                    EKS
──────────────────────  ────────────────────  ─────────────────────────────────────  ────────────────────────────────────
Pod Security Standards  Yes                   Yes                                    Yes
Sandbox isolation       GKE Sandbox / gVisor  Kata-style options depending on setup  Bottlerocket / Firecracker ecosystem
Managed secure mode     Autopilot             AKS Automatic                          EKS Auto Mode
Node hardening          Shielded GKE Nodes    Azure Linux / Ubuntu hardening         Bottlerocket / AL2023

Best default: GKE Autopilot
Autopilot applies many security controls by default, including managed node security and Workload Identity support. (Google Cloud Documentation)

Best enterprise Windows/Linux estate: AKS
AKS fits well when your company already uses Microsoft Defender, Entra ID, Azure Policy, and Log Analytics.

Best low-level control: EKS
EKS is powerful but more DIY. You can build a very secure platform, but you must configure more pieces yourself.


4. Policy & Governance

Area                GKE                             AKS                   EKS
──────────────────  ──────────────────────────────  ────────────────────  ──────────────────────────
Kubernetes policy   Policy Controller / Gatekeeper  Azure Policy for AKS  Kyverno / Gatekeeper / OPA
Cloud governance    Org Policy                      Azure Policy          AWS Organizations / SCP
Compliance posture  Security Command Center         Defender for Cloud    Security Hub / GuardDuty

AKS is strongest for enterprise governance because Azure Policy can enforce AKS controls centrally, and Defender for Containers provides posture management, runtime detection, image vulnerability assessment, and recommendations. (Microsoft Learn)


5. Runtime Threat Detection

Platform  Native detection
────────  ─────────────────────────────────────────────────
GKE       Security Command Center + Cloud Logging/Monitoring
AKS       Microsoft Defender for Containers
EKS       GuardDuty EKS Runtime Monitoring

Defender for Containers provides Kubernetes runtime threat protection, image vulnerability assessment, posture insights, and alerts across AKS, EKS, and GKE. (Microsoft Learn)

EKS has strong AWS-native runtime detection through GuardDuty EKS Runtime Monitoring, which collects runtime signals such as process execution, file access, and network connections from EKS workloads. (AWS Documentation)


6. Secrets Management

Platform  Recommended approach
────────  ─────────────────────────────────────────────────
GKE       Secret Manager + Workload Identity
AKS       Azure Key Vault CSI Driver + Workload ID
EKS       AWS Secrets Manager / SSM Parameter Store + IRSA

Avoid Kubernetes Secrets for sensitive production credentials unless encrypted with KMS and tightly RBAC-controlled.


7. Image & Supply Chain Security

Area                       GKE                   AKS                        EKS
─────────────────────────  ────────────────────  ─────────────────────────  ────────────────────────────
Registry                   Artifact Registry     Azure Container Registry   Amazon ECR
Image scanning             Artifact Analysis     Defender/ACR scanning      ECR scanning / Inspector
Deployment enforcement     Binary Authorization  Azure Policy / Gatekeeper  Kyverno/Gatekeeper + signing
Best supply-chain control  GKE                   AKS                        EKS

GKE wins supply-chain enforcement because Binary Authorization is a strong native control for allowing only trusted/signed images into clusters.


Best Platform by Scenario

Choose GKE when:

You want the most secure managed Kubernetes experience with less operational burden.

Best for:

  • GCP-native workloads
  • Strong secure defaults
  • Autopilot
  • Binary Authorization
  • Workload Identity Federation

Choose AKS when:

You are an enterprise Microsoft shop.

Best for:

  • Entra ID integration
  • Azure Policy
  • Defender for Cloud
  • Sentinel/Log Analytics
  • Hub-spoke landing zones
  • Regulated enterprise governance

Choose EKS when:

You need deep AWS control and flexibility.

Best for:

  • AWS IAM-heavy workloads
  • VPC-native networking
  • Security groups
  • GuardDuty
  • Bottlerocket
  • Fine-grained AWS architecture control

Final Ranking

Category                           Winner
─────────────────────────────────  ─────────────
Secure defaults                    GKE Autopilot
Enterprise governance              AKS
Cloud-native IAM flexibility       EKS
Runtime detection                  AKS / EKS
Supply-chain enforcement           GKE
Network control                    EKS
Hybrid enterprise SOC integration  AKS
Simplicity                         GKE
Customization                      EKS

Interview answer:
“GKE is strongest for secure defaults and supply-chain controls, AKS is strongest for enterprise governance and Microsoft security integration, and EKS is strongest for AWS-native IAM/network flexibility. In production, I would secure all three with private clusters, workload identity, network policies, pod security standards, secrets manager integration, image scanning, admission control, runtime threat detection, and centralized audit logging.”

Building a 3-Tier Application in GCP

GCP Enterprise Landing Zone — 3-Tier Application

What is an Enterprise Landing Zone?

Basic GCP setup:           Enterprise Landing Zone:
─────────────────          ────────────────────────
One project                Multiple projects by function
Manual IAM                 Hierarchical org policies
No network segmentation    Shared VPC, VPN, interconnect
No governance              CIS compliance, audit logging
Single team access         Role-based team access
No cost controls           Budget alerts, quotas
Ad-hoc security            Security Command Center

Landing Zone Philosophy:
  "Every team gets a consistent, secure, compliant
  foundation: they deploy apps, not cloud infrastructure"

Foundation handles:
├── Organization hierarchy
├── Identity and access
├── Networking (hub-spoke)
├── Security guardrails
├── Logging and monitoring
├── Cost management
└── Compliance baselines

Organization Hierarchy

mycompany.com (Organization)
├── folders/
│   ├── Platform/                 ← shared services
│   │   ├── networking-prod       ← Shared VPC host
│   │   ├── security-prod         ← SIEM, Security tools
│   │   └── monitoring-prod       ← centralized logging
│   │
│   ├── Production/               ← live workloads
│   │   ├── frontend-prod
│   │   ├── backend-prod
│   │   └── data-prod
│   │
│   ├── Non-Production/
│   │   ├── frontend-staging
│   │   ├── backend-staging
│   │   ├── frontend-dev
│   │   └── backend-dev
│   │
│   └── Sandbox/                  ← developer experiments
│       └── dev-sandbox-*
└── Organization Policies         ← guardrails for everything

Architecture Overview

ORGANIZATION: mycompany.com
├── PLATFORM FOLDER
│   ├── networking-prod
│   │   ├── Shared VPC
│   │   ├── Cloud Armor
│   │   ├── Cloud DNS
│   │   ├── Cloud NAT
│   │   └── Interconnect
│   └── security-prod
│       ├── Security Command Center
│       ├── Chronicle SIEM
│       ├── VPC Service Controls
│       ├── Secret Manager
│       └── KMS
│
│   (networking-prod → Shared VPC → production projects)
│
└── PRODUCTION FOLDER
    └── 3-TIER APPLICATION
        ├── TIER 1: Frontend Project
        │   ├── Cloud Run
        │   ├── CDN
        │   └── Load Balancer
        ├── TIER 2: Backend Project
        │   ├── GKE
        │   ├── Pub/Sub
        │   └── Cloud Run
        └── TIER 3: Data Project
            ├── Cloud SQL
            ├── Firestore
            └── Redis

Project Structure

enterprise-landing-zone/
├── bootstrap/
│   ├── main.tf                   ← org setup, seed project
│   ├── variables.tf
│   └── outputs.tf
├── foundation/
│   ├── org-policies/
│   │   ├── main.tf               ← organization policies
│   │   └── variables.tf
│   ├── networking/
│   │   ├── main.tf               ← shared VPC, hub-spoke
│   │   ├── firewall.tf
│   │   ├── dns.tf
│   │   └── variables.tf
│   ├── security/
│   │   ├── main.tf               ← SCC, KMS, audit
│   │   ├── iam.tf
│   │   └── variables.tf
│   └── monitoring/
│       ├── main.tf               ← log sinks, dashboards
│       └── variables.tf
├── environments/
│   ├── production/
│   │   ├── frontend/
│   │   ├── backend/
│   │   └── data/
│   ├── staging/
│   └── dev/
├── modules/
│   ├── project-factory/          ← standardized project creation
│   ├── gke-cluster/
│   ├── cloud-sql/
│   ├── networking/
│   └── security-controls/
└── pipelines/
    └── .github/workflows/

Step 1 — Bootstrap

# bootstrap/main.tf
# Run once: sets up the foundation
terraform {
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
    # random_id below comes from the random provider; declared explicitly here
    random = { source = "hashicorp/random", version = "~> 3.0" }
  }
  # Bootstrap uses local state initially
  # then migrates to GCS
}

# ── Seed Project ──────────────────────────────────────────────
# Project that runs Terraform pipelines
resource "google_project" "seed" {
  name            = "mycompany-seed"
  project_id      = "mycompany-seed-${random_id.suffix.hex}"
  org_id          = var.org_id
  billing_account = var.billing_account_id

  labels = {
    purpose    = "seed"
    managed_by = "terraform"
  }
}

resource "random_id" "suffix" {
  byte_length = 2
}

# ── Enable APIs on seed project ───────────────────────────────
resource "google_project_service" "seed_apis" {
  for_each = toset([
    "cloudbilling.googleapis.com",
    "cloudresourcemanager.googleapis.com",
    "iam.googleapis.com",
    "serviceusage.googleapis.com",
    "storage.googleapis.com",
    "orgpolicy.googleapis.com",
    "accesscontextmanager.googleapis.com",
  ])
  project = google_project.seed.project_id
  service = each.value
}

# ── Terraform State Bucket ────────────────────────────────────
resource "google_storage_bucket" "terraform_state" {
  name          = "mycompany-terraform-state-${random_id.suffix.hex}"
  project       = google_project.seed.project_id
  location      = var.region
  force_destroy = false

  versioning {
    enabled = true
  }

  uniform_bucket_level_access = true
  public_access_prevention    = "enforced"

  lifecycle_rule {
    action { type = "Delete" }
    condition {
      num_newer_versions = 10 # keep last 10 state versions
    }
  }
}

# ── Folder Structure ──────────────────────────────────────────
resource "google_folder" "platform" {
  display_name = "Platform"
  parent       = "organizations/${var.org_id}"
}

resource "google_folder" "production" {
  display_name = "Production"
  parent       = "organizations/${var.org_id}"
}

resource "google_folder" "non_production" {
  display_name = "Non-Production"
  parent       = "organizations/${var.org_id}"
}

resource "google_folder" "sandbox" {
  display_name = "Sandbox"
  parent       = "organizations/${var.org_id}"
}

# ── Terraform Service Account ─────────────────────────────────
resource "google_service_account" "terraform" {
  account_id   = "terraform-automation"
  display_name = "Terraform Automation SA"
  project      = google_project.seed.project_id
}

resource "google_organization_iam_member" "terraform_sa" {
  for_each = toset([
    "roles/resourcemanager.organizationAdmin",
    "roles/billing.user",
    "roles/iam.organizationRoleAdmin",
    "roles/orgpolicy.policyAdmin",
    "roles/compute.networkAdmin",
    "roles/logging.admin",
  ])
  org_id = var.org_id
  role   = each.value
  member = "serviceAccount:${google_service_account.terraform.email}"
}

Step 2 — Organization Policies

# foundation/org-policies/main.tf
locals {
org_id = var.org_id
}
# ── Disable public IPs on VMs ─────────────────────────────────
resource "google_org_policy_policy" "no_public_ips" {
name = "organizations/${local.org_id}/policies/compute.vmExternalIpAccess"
parent = "organizations/${local.org_id}"
spec {
rules {
deny_all = "TRUE"
}
}
}
# ── Enforce OS Login ──────────────────────────────────────────
resource "google_org_policy_policy" "require_os_login" {
name = "organizations/${local.org_id}/policies/compute.requireOsLogin"
parent = "organizations/${local.org_id}"
spec {
rules {
enforce = "TRUE"
}
}
}
# ── Restrict resource locations ───────────────────────────────
resource "google_org_policy_policy" "restrict_locations" {
name = "organizations/${local.org_id}/policies/gcp.resourceLocations"
parent = "organizations/${local.org_id}"
spec {
rules {
values {
allowed_values = [
"in:us-locations", # US only
"in:europe-locations" # EU only
]
}
}
}
}
# ── Disable service account key creation ─────────────────────
resource "google_org_policy_policy" "no_sa_keys" {
name = "organizations/${local.org_id}/policies/iam.disableServiceAccountKeyCreation"
parent = "organizations/${local.org_id}"
spec {
rules {
enforce = "TRUE"
}
}
}
# ── Require shielded VMs ──────────────────────────────────────
resource "google_org_policy_policy" "require_shielded_vm" {
name = "organizations/${local.org_id}/policies/compute.requireShieldedVm"
parent = "organizations/${local.org_id}"
spec {
rules {
enforce = "TRUE"
}
}
}
# ── Restrict VPC peering ──────────────────────────────────────
resource "google_org_policy_policy" "restrict_vpc_peering" {
name = "organizations/${local.org_id}/policies/compute.restrictVpcPeering"
parent = "organizations/${local.org_id}"
spec {
rules {
values {
allowed_values = [
"under:organizations/${local.org_id}"
]
}
}
}
}
# ── Disable default network creation ─────────────────────────
resource "google_org_policy_policy" "no_default_network" {
name = "organizations/${local.org_id}/policies/compute.skipDefaultNetworkCreation"
parent = "organizations/${local.org_id}"
spec {
rules {
enforce = "TRUE"
}
}
}
# ── Uniform bucket access ─────────────────────────────────────
resource "google_org_policy_policy" "uniform_bucket_access" {
name = "organizations/${local.org_id}/policies/storage.uniformBucketLevelAccess"
parent = "organizations/${local.org_id}"
spec {
rules {
enforce = "TRUE"
}
}
}
# ── Restrict domain sharing ───────────────────────────────────
resource "google_org_policy_policy" "domain_restricted_sharing" {
name = "organizations/${local.org_id}/policies/iam.allowedPolicyMemberDomains"
parent = "organizations/${local.org_id}"
spec {
rules {
values {
allowed_values = [
"principalSet://iam.googleapis.com/organizations/${local.org_id}",
"C0xxxxxxx" # Google Workspace customer ID
]
}
}
}
}
# ── Relax for sandbox folder ──────────────────────────────────
resource "google_org_policy_policy" "sandbox_allow_public_ip" {
name = "folders/${var.sandbox_folder_id}/policies/compute.vmExternalIpAccess"
parent = "folders/${var.sandbox_folder_id}"
spec {
inherit_from_parent = false
rules {
allow_all = "TRUE" # sandboxes can have public IPs
}
}
}
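Several policies in this step are boolean constraints that differ only in the constraint name. As a sketch of how that repetition could be reduced, the block below drives them from one set with for_each. The local name boolean_constraints and the resource label enforced are illustrative assumptions; the constraint names are the ones already used above.

```hcl
# Sketch: one resource for every boolean constraint enforced org-wide.
# `boolean_constraints` and the label `enforced` are illustrative names.
locals {
  boolean_constraints = toset([
    "compute.requireOsLogin",
    "compute.requireShieldedVm",
    "compute.skipDefaultNetworkCreation",
    "iam.disableServiceAccountKeyCreation",
    "storage.uniformBucketLevelAccess",
  ])
}

resource "google_org_policy_policy" "enforced" {
  for_each = local.boolean_constraints

  name   = "organizations/${var.org_id}/policies/${each.value}"
  parent = "organizations/${var.org_id}"

  spec {
    rules {
      enforce = "TRUE"
    }
  }
}
```

If the individual resources already exist in state, switching to this form changes their addresses, so a moved block per policy (or terraform state mv) would be needed to avoid a destroy and recreate.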

Step 3 — Hub-Spoke Networking

# foundation/networking/main.tf
# ── Hub Project (networking-prod) ─────────────────────────────
module "networking_project" {
source = "../modules/project-factory"
project_name = "mycompany-networking-prod"
folder_id = var.platform_folder_id
billing_account = var.billing_account_id
apis = [
"compute.googleapis.com",
"dns.googleapis.com",
"networkmanagement.googleapis.com",
]
labels = {
environment = "production"
team = "platform"
tier = "networking"
}
}
# ── Hub VPC (Shared VPC Host) ─────────────────────────────────
resource "google_compute_network" "hub" {
name = "hub-vpc"
project = module.networking_project.project_id
auto_create_subnetworks = false
routing_mode = "GLOBAL"
description = "Hub VPC — shared services"
}
# Hub subnet — shared services
resource "google_compute_subnetwork" "hub_shared_services" {
name = "hub-shared-services"
project = module.networking_project.project_id
region = var.region
network = google_compute_network.hub.id
ip_cidr_range = "10.0.0.0/24"
private_ip_google_access = true
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 1.0
metadata = "INCLUDE_ALL_METADATA"
}
}
# ── Spoke VPCs ────────────────────────────────────────────────
# Frontend spoke
resource "google_compute_network" "frontend" {
name = "frontend-vpc"
project = var.frontend_project_id
auto_create_subnetworks = false
routing_mode = "GLOBAL"
}
resource "google_compute_subnetwork" "frontend" {
name = "frontend-subnet"
project = var.frontend_project_id
region = var.region
network = google_compute_network.frontend.id
ip_cidr_range = "10.1.0.0/20"
private_ip_google_access = true
secondary_ip_range {
range_name = "pods"
ip_cidr_range = "10.1.16.0/20"
}
secondary_ip_range {
range_name = "services"
ip_cidr_range = "10.1.32.0/20"
}
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
# Backend spoke
resource "google_compute_network" "backend" {
name = "backend-vpc"
project = var.backend_project_id
auto_create_subnetworks = false
routing_mode = "GLOBAL"
}
resource "google_compute_subnetwork" "backend" {
name = "backend-subnet"
project = var.backend_project_id
region = var.region
network = google_compute_network.backend.id
ip_cidr_range = "10.2.0.0/20"
private_ip_google_access = true
secondary_ip_range {
range_name = "pods"
ip_cidr_range = "10.2.16.0/20"
}
secondary_ip_range {
range_name = "services"
ip_cidr_range = "10.2.32.0/20"
}
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
# Data spoke
resource "google_compute_network" "data" {
name = "data-vpc"
project = var.data_project_id
auto_create_subnetworks = false
routing_mode = "GLOBAL"
}
resource "google_compute_subnetwork" "data" {
name = "data-subnet"
project = var.data_project_id
region = var.region
network = google_compute_network.data.id
ip_cidr_range = "10.3.0.0/20"
private_ip_google_access = true
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
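# ── VPC Peering (Hub ↔ Spokes) ────────────────────────────────
# VPC Network Peering is not transitive: each spoke can reach the hub, but
# spoke-to-spoke traffic (for example frontend to backend) needs its own
# peering pair or a VPN/NVA in the hub.
# Hub → Frontend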
resource "google_compute_network_peering" "hub_to_frontend" {
name = "hub-to-frontend"
network = google_compute_network.hub.self_link
peer_network = google_compute_network.frontend.self_link
export_custom_routes = true
import_custom_routes = false
}
resource "google_compute_network_peering" "frontend_to_hub" {
name = "frontend-to-hub"
network = google_compute_network.frontend.self_link
peer_network = google_compute_network.hub.self_link
export_custom_routes = false
import_custom_routes = true
}
# Hub → Backend
resource "google_compute_network_peering" "hub_to_backend" {
name = "hub-to-backend"
network = google_compute_network.hub.self_link
peer_network = google_compute_network.backend.self_link
export_custom_routes = true
import_custom_routes = false
}
resource "google_compute_network_peering" "backend_to_hub" {
name = "backend-to-hub"
network = google_compute_network.backend.self_link
peer_network = google_compute_network.hub.self_link
export_custom_routes = false
import_custom_routes = true
}
# Hub → Data
resource "google_compute_network_peering" "hub_to_data" {
name = "hub-to-data"
network = google_compute_network.hub.self_link
peer_network = google_compute_network.data.self_link
export_custom_routes = true
import_custom_routes = false
}
resource "google_compute_network_peering" "data_to_hub" {
name = "data-to-hub"
network = google_compute_network.data.self_link
peer_network = google_compute_network.hub.self_link
export_custom_routes = false
import_custom_routes = true
}
# foundation/networking/firewall.tf
# ── Hub Firewall Rules ────────────────────────────────────────
# Deny all ingress by default
resource "google_compute_firewall" "hub_deny_all_ingress" {
name = "hub-deny-all-ingress"
project = module.networking_project.project_id
network = google_compute_network.hub.name
priority = 65534
direction = "INGRESS"
deny { protocol = "all" }
source_ranges = ["0.0.0.0/0"]
}
# Allow IAP for SSH/RDP access
resource "google_compute_firewall" "allow_iap" {
name = "allow-iap"
project = module.networking_project.project_id
network = google_compute_network.hub.name
allow {
protocol = "tcp"
ports = ["22", "3389"]
}
source_ranges = ["35.235.240.0/20"] # IAP range
target_tags = ["allow-iap"]
description = "Allow Identity-Aware Proxy for SSH/RDP"
}
# ── Frontend Firewall ─────────────────────────────────────────
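# Allow HTTP/HTTPS from the external Application Load Balancer only.
# Cloud Armor is enforced at the load balancer, so backends need to accept
# traffic from the Google Front End ranges rather than from 0.0.0.0/0.
resource "google_compute_firewall" "frontend_allow_https" {
name = "frontend-allow-https"
project = var.frontend_project_id
network = google_compute_network.frontend.name
allow {
protocol = "tcp"
ports = ["443", "80"]
}
source_ranges = ["35.191.0.0/16", "130.211.0.0/22"]
target_tags = ["frontend"]
}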
# Allow health checks
resource "google_compute_firewall" "frontend_health_checks" {
name = "frontend-allow-health-checks"
project = var.frontend_project_id
network = google_compute_network.frontend.name
allow {
protocol = "tcp"
ports = ["8080", "443"]
}
source_ranges = [
"35.191.0.0/16",
"130.211.0.0/22"
]
target_tags = ["frontend"]
}
# ── Backend Firewall ──────────────────────────────────────────
# Allow frontend to reach backend only
resource "google_compute_firewall" "backend_allow_from_frontend" {
name = "backend-allow-from-frontend"
project = var.backend_project_id
network = google_compute_network.backend.name
allow {
protocol = "tcp"
ports = ["8080", "8443", "443"]
}
source_ranges = ["10.1.0.0/20"] # frontend subnet only
target_tags = ["backend"]
}
# Deny all other ingress to backend
resource "google_compute_firewall" "backend_deny_external" {
name = "backend-deny-external"
project = var.backend_project_id
network = google_compute_network.backend.name
priority = 65534
deny { protocol = "all" }
source_ranges = ["0.0.0.0/0"]
}
# ── Data Firewall ─────────────────────────────────────────────
# Allow backend to reach data only
resource "google_compute_firewall" "data_allow_from_backend" {
name = "data-allow-from-backend"
project = var.data_project_id
network = google_compute_network.data.name
allow {
protocol = "tcp"
ports = ["5432", "6379", "27017"]
}
source_ranges = ["10.2.0.0/20", "10.2.16.0/20"] # backend nodes and GKE pod range
target_tags = ["database"]
}
# Deny everything else to data
resource "google_compute_firewall" "data_deny_all" {
name = "data-deny-all"
project = var.data_project_id
network = google_compute_network.data.name
priority = 65534
deny { protocol = "all" }
source_ranges = ["0.0.0.0/0"]
}
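The firewall rules in this step are keyed on subnet CIDRs, so every range has to be correctly aligned and non-overlapping. As a minimal sketch, the ranges for one spoke could be derived with Terraform's built-in cidrsubnet function instead of being typed by hand. The local names and the /16 parent block are assumptions for illustration; the resulting values match the frontend subnet above.

```hcl
# Sketch: derive aligned, non-overlapping /20 ranges from one /16 per spoke.
# The local names and the /16 parent block are illustrative assumptions.
locals {
  frontend_block = "10.1.0.0/16"

  # cidrsubnet(prefix, newbits, netnum): /16 + 4 new bits = /20
  frontend_nodes    = cidrsubnet(local.frontend_block, 4, 0) # 10.1.0.0/20
  frontend_pods     = cidrsubnet(local.frontend_block, 4, 1) # 10.1.16.0/20
  frontend_services = cidrsubnet(local.frontend_block, 4, 2) # 10.1.32.0/20
}

output "frontend_ranges" {
  description = "Primary and secondary ranges for the frontend spoke"
  value = {
    nodes    = local.frontend_nodes
    pods     = local.frontend_pods
    services = local.frontend_services
  }
}
```

The same locals can then feed both ip_cidr_range on the subnet and source_ranges on the firewall rules, so the two cannot drift apart.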

Step 4 — Security Foundation

# foundation/security/main.tf
# ── Security Command Center ───────────────────────────────────
resource "google_scc_notification_config" "critical" {
config_id = "critical-findings"
organization = var.org_id
description = "Notify on critical and high-severity findings"
pubsub_topic = google_pubsub_topic.security_alerts.id
streaming_config {
filter = "severity = \"CRITICAL\" OR severity = \"HIGH\""
}
}
resource "google_pubsub_topic" "security_alerts" {
name = "security-alerts"
project = var.security_project_id
}
# ── KMS Key Rings per environment ────────────────────────────
resource "google_kms_key_ring" "production" {
name = "production-keyring"
project = var.security_project_id
location = var.region
}
# Keys for each service
resource "google_kms_crypto_key" "keys" {
for_each = {
gke-etcd = { ring = google_kms_key_ring.production.id, rotation = "7776000s" }
cloud-sql = { ring = google_kms_key_ring.production.id, rotation = "7776000s" }
storage = { ring = google_kms_key_ring.production.id, rotation = "7776000s" }
pubsub = { ring = google_kms_key_ring.production.id, rotation = "7776000s" }
}
name = each.key
key_ring = each.value.ring
rotation_period = each.value.rotation
purpose = "ENCRYPT_DECRYPT"
version_template {
algorithm = "GOOGLE_SYMMETRIC_ENCRYPTION"
protection_level = "SOFTWARE"
}
lifecycle {
prevent_destroy = true
}
}
# ── VPC Service Controls ──────────────────────────────────────
resource "google_access_context_manager_access_policy" "policy" {
parent = "organizations/${var.org_id}"
title = "mycompany-access-policy"
}
resource "google_access_context_manager_service_perimeter" "production" {
parent = "accessPolicies/${google_access_context_manager_access_policy.policy.name}"
name = "accessPolicies/${google_access_context_manager_access_policy.policy.name}/servicePerimeters/production"
title = "production-perimeter"
status {
# Projects inside the perimeter
resources = [
"projects/${var.frontend_project_number}",
"projects/${var.backend_project_number}",
"projects/${var.data_project_number}",
]
# APIs restricted — must be accessed from inside perimeter
restricted_services = [
"storage.googleapis.com",
"bigquery.googleapis.com",
"sqladmin.googleapis.com",
"secretmanager.googleapis.com",
]
vpc_accessible_services {
enable_restriction = true
allowed_services = ["RESTRICTED-SERVICES"]
}
}
}
# ── Audit Logging ────────────────────────────────────────────
resource "google_organization_iam_audit_config" "audit" {
org_id = var.org_id
service = "allServices"
audit_log_config {
log_type = "ADMIN_READ"
}
audit_log_config {
log_type = "DATA_READ"
}
audit_log_config {
log_type = "DATA_WRITE"
}
}
# foundation/security/iam.tf
# ── Groups and Roles ──────────────────────────────────────────
# Platform team — manages foundation
resource "google_organization_iam_binding" "platform_admins" {
org_id = var.org_id
role = "roles/resourcemanager.organizationAdmin"
members = [
"group:platform-admins@mycompany.com",
]
}
# Security team — read everything
resource "google_organization_iam_binding" "security_viewers" {
org_id = var.org_id
role = "roles/iam.securityReviewer"
members = [
"group:security-team@mycompany.com",
]
}
# Developers: editor on the non-production folder only
resource "google_folder_iam_binding" "dev_project_access" {
folder = var.non_production_folder_id
role = "roles/editor"
members = [
"group:developers@mycompany.com",
]
}
# Read-only for production
resource "google_folder_iam_binding" "dev_prod_readonly" {
folder = var.production_folder_id
role = "roles/viewer"
members = [
"group:developers@mycompany.com",
]
}
# Break-glass account: emergency only.
# IAM conditions cannot be attached to basic roles such as roles/owner,
# so the time-bound grant uses a predefined role instead.
resource "google_organization_iam_binding" "break_glass" {
org_id = var.org_id
role = "roles/iam.securityAdmin"
members = [
"user:break-glass@mycompany.com",
]
condition {
title = "emergency-access-only"
description = "Time-bound grant; extend the expiry when an incident is declared"
expression = "request.time < timestamp('2024-12-31T00:00:00Z')"
}
}
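Creating the keys is only half of CMEK: each service encrypts through its own Google-managed service agent, which needs permission on the key. A minimal sketch of those grants follows. The local name cmek_service_agents is illustrative, and the mapping of keys to the backend and data project numbers is an assumption about where each service runs; the service agent must already exist in the project (it is created when the API is first used).

```hcl
# Sketch: let each service agent use its CMEK key.
# Which project hosts which service is an assumption; adjust the numbers.
locals {
  cmek_service_agents = {
    "gke-etcd"  = "service-${var.backend_project_number}@container-engine-robot.iam.gserviceaccount.com"
    "cloud-sql" = "service-${var.data_project_number}@gcp-sa-cloud-sql.iam.gserviceaccount.com"
    "storage"   = "service-${var.data_project_number}@gs-project-accounts.iam.gserviceaccount.com"
    "pubsub"    = "service-${var.backend_project_number}@gcp-sa-pubsub.iam.gserviceaccount.com"
  }
}

resource "google_kms_crypto_key_iam_member" "cmek" {
  for_each = local.cmek_service_agents

  # Keys of this map match the keys of google_kms_crypto_key.keys above.
  crypto_key_id = google_kms_crypto_key.keys[each.key].id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:${each.value}"
}
```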

Step 5 — Centralized Logging

# foundation/monitoring/main.tf
# ── Log Sink — all org logs to BigQuery ──────────────────────
resource "google_logging_organization_sink" "bigquery" {
name = "org-logs-to-bigquery"
org_id = var.org_id
destination = "bigquery.googleapis.com/projects/${var.monitoring_project_id}/datasets/${google_bigquery_dataset.logs.dataset_id}"
# Sink all audit logs
filter = "logName:(\"cloudaudit.googleapis.com\" OR \"activity\" OR \"data_access\")"
include_children = true # all projects in org
}
resource "google_bigquery_dataset" "logs" {
dataset_id = "organization_logs"
project = var.monitoring_project_id
location = var.region
description = "Centralized organization audit logs"
default_table_expiration_ms = 31536000000 # 1 year
default_partition_expiration_ms = 31536000000
access {
role = "OWNER"
special_group = "projectOwners"
}
access {
role = "READER"
group_by_email = "security-team@mycompany.com"
}
}
# ── Log Sink — security findings to Pub/Sub ──────────────────
resource "google_logging_organization_sink" "security" {
name = "security-findings-to-pubsub"
org_id = var.org_id
destination = "pubsub.googleapis.com/projects/${var.security_project_id}/topics/${google_pubsub_topic.security_alerts.name}"
filter = <<-EOT
severity >= ERROR
OR protoPayload.methodName:(
"SetIamPolicy"
OR "google.iam.admin.v1.CreateServiceAccount"
OR "google.iam.admin.v1.DeleteServiceAccount"
)
OR jsonPayload.finding.severity = "CRITICAL"
EOT
include_children = true
}
# ── Metrics and Alerting ──────────────────────────────────────
# Alert on IAM changes in production
resource "google_monitoring_alert_policy" "iam_changes" {
display_name = "IAM Policy Changed in Production"
project = var.monitoring_project_id
combiner = "OR"
enabled = true
conditions {
display_name = "IAM change detected"
condition_matched_log {
filter = <<-EOT
protoPayload.methodName = "SetIamPolicy"
resource.labels.project_id:(
"${var.frontend_project_id}"
OR "${var.backend_project_id}"
OR "${var.data_project_id}"
)
EOT
}
}
notification_channels = [
google_monitoring_notification_channel.security_email.name,
google_monitoring_notification_channel.pagerduty.name,
]
alert_strategy {
notification_rate_limit {
period = "300s"
}
}
}
# Alert on budget
resource "google_billing_budget" "production" {
billing_account = var.billing_account_id
display_name = "Production Monthly Budget"
budget_filter {
projects = [
"projects/${var.frontend_project_number}", # budget filters take project numbers
"projects/${var.backend_project_number}",
"projects/${var.data_project_number}",
]
}
amount {
specified_amount {
currency_code = "USD"
units = "50000" # $50k monthly budget
}
}
threshold_rules {
threshold_percent = 0.5 # alert at 50%
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.9 # alert at 90%
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0 # alert when forecasted spend reaches 100%
spend_basis = "FORECASTED_SPEND"
}
all_updates_rule {
pubsub_topic = google_pubsub_topic.budget_alerts.id
schema_version = "1.0"
monitoring_notification_channels = [
google_monitoring_notification_channel.finance_email.name,
]
disable_default_iam_recipients = false
}
}
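Two pieces that this step relies on are not shown: the sink's writer identity needs write access to BigQuery, and the alert policy references notification channels. A minimal sketch of both, assuming the grant is made at project level in the monitoring project (a dataset-level IAM resource would conflict with the access blocks on the dataset) and assuming the security team's group address is also the alert recipient:

```hcl
# Sketch: allow the org sink's service account to write log tables.
resource "google_project_iam_member" "log_sink_writer" {
  project = var.monitoring_project_id
  role    = "roles/bigquery.dataEditor"
  # writer_identity already includes the "serviceAccount:" prefix.
  member = google_logging_organization_sink.bigquery.writer_identity
}

# Sketch: the email channel referenced by the IAM-change alert policy.
# The recipient address is an assumption.
resource "google_monitoring_notification_channel" "security_email" {
  project      = var.monitoring_project_id
  display_name = "Security team email"
  type         = "email"

  labels = {
    email_address = "security-team@mycompany.com"
  }
}
```

The pagerduty and finance_email channels would follow the same pattern with their own type and labels.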

Step 6 — Tier 1: Frontend Project

# environments/production/frontend/main.tf
module "frontend_project" {
source = "../../../modules/project-factory"
project_name = "mycompany-frontend-prod"
folder_id = var.production_folder_id
billing_account = var.billing_account_id
apis = [
"run.googleapis.com",
"compute.googleapis.com", # also serves Cloud Armor security policies
"certificatemanager.googleapis.com",
"iap.googleapis.com",
]
labels = {
environment = "production"
tier = "frontend"
team = "frontend"
}
}
# ── Cloud Armor WAF ───────────────────────────────────────────
resource "google_compute_security_policy" "waf" {
name = "frontend-waf"
project = module.frontend_project.project_id
# OWASP rules
rule {
action = "deny(403)"
priority = 1000
match {
expr {
expression = "evaluatePreconfiguredExpr('xss-v33-stable')"
}
}
description = "Block XSS attacks"
}
rule {
action = "deny(403)"
priority = 1001
match {
expr {
expression = "evaluatePreconfiguredExpr('sqli-v33-stable')"
}
}
description = "Block SQL injection"
}
# Rate limiting per client IP
rule {
action = "rate_based_ban" # ban_threshold and ban_duration_sec require this action
priority = 2000
match {
versioned_expr = "SRC_IPS_V1"
config {
src_ip_ranges = ["*"]
}
}
rate_limit_options {
conform_action = "allow"
exceed_action = "deny(429)"
enforce_on_key = "IP" # count per client IP, not across all traffic
rate_limit_threshold {
count = 1000
interval_sec = 60
}
ban_threshold {
count = 5000
interval_sec = 60
}
ban_duration_sec = 300
}
description = "Rate limit each client IP"
}
# Default rule: allow all other traffic
rule {
action = "allow"
priority = 2147483647 # priority reserved for the policy's default rule
match {
versioned_expr = "SRC_IPS_V1"
config {
src_ip_ranges = ["*"]
}
}
}
adaptive_protection_config {
layer_7_ddos_defense_config {
enable = true
rule_visibility = "STANDARD"
}
}
}
# ── Cloud Run (Frontend App) ──────────────────────────────────
resource "google_cloud_run_v2_service" "frontend" {
name = "frontend"
project = module.frontend_project.project_id
location = var.region
ingress = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER"
template {
service_account = google_service_account.frontend_sa.email
scaling {
min_instance_count = 2
max_instance_count = 100
}
containers {
image = "us-central1-docker.pkg.dev/${module.frontend_project.project_id}/frontend/app:latest"
resources {
limits = {
cpu = "2"
memory = "2Gi"
}
cpu_idle = true
startup_cpu_boost = true
}
env {
name = "BACKEND_URL"
value = "https://api.internal.mycompany.com"
}
env {
name = "DB_PASSWORD"
value_source {
secret_key_ref {
secret = google_secret_manager_secret.frontend_config.secret_id
version = "latest"
}
}
}
startup_probe {
http_get {
path = "/health"
port = 8080
}
initial_delay_seconds = 10
timeout_seconds = 3
period_seconds = 5
failure_threshold = 3
}
liveness_probe {
http_get {
path = "/healthz"
port = 8080
}
period_seconds = 30
failure_threshold = 3
}
}
vpc_access {
network_interfaces {
network = var.frontend_network_id
subnetwork = var.frontend_subnet_id
}
egress = "ALL_TRAFFIC" # all traffic through VPC
}
}
traffic {
type = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
percent = 100
}
}
# ── Global Load Balancer ──────────────────────────────────────
resource "google_compute_global_address" "frontend" {
name = "frontend-ip"
project = module.frontend_project.project_id
}
resource "google_compute_managed_ssl_certificate" "frontend" {
name = "frontend-cert"
project = module.frontend_project.project_id
managed {
domains = [
"mycompany.com",
"www.mycompany.com"
]
}
}
resource "google_compute_backend_service" "frontend" {
name = "frontend-backend"
project = module.frontend_project.project_id
protocol = "HTTP"
port_name = "http"
load_balancing_scheme = "EXTERNAL_MANAGED"
security_policy = google_compute_security_policy.waf.id
enable_cdn = true
cdn_policy {
cache_mode = "CACHE_ALL_STATIC"
default_ttl = 3600
max_ttl = 86400
negative_caching = true
serve_while_stale = 86400
signed_url_cache_max_age_sec = 7200
}
backend {
group = google_compute_region_network_endpoint_group.frontend.id
balancing_mode = "UTILIZATION"
}
log_config {
enable = true
sample_rate = 0.1
}
}
resource "google_compute_region_network_endpoint_group" "frontend" {
name = "frontend-neg"
project = module.frontend_project.project_id
region = var.region
network_endpoint_type = "SERVERLESS"
cloud_run {
service = google_cloud_run_v2_service.frontend.name
}
}
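The listing above stops at the backend service, so nothing yet binds the reserved IP and certificate to it. A minimal sketch of the remaining front half of the load balancer follows; the three resource names are assumptions, and an HTTP-to-HTTPS redirect is left out for brevity.

```hcl
# Sketch: route all requests to the Cloud Run backend service.
resource "google_compute_url_map" "frontend" {
  name            = "frontend-url-map"
  project         = module.frontend_project.project_id
  default_service = google_compute_backend_service.frontend.id
}

# Terminate TLS with the managed certificate.
resource "google_compute_target_https_proxy" "frontend" {
  name             = "frontend-https-proxy"
  project          = module.frontend_project.project_id
  url_map          = google_compute_url_map.frontend.id
  ssl_certificates = [google_compute_managed_ssl_certificate.frontend.id]
}

# Expose port 443 on the reserved global address.
resource "google_compute_global_forwarding_rule" "frontend_https" {
  name                  = "frontend-https"
  project               = module.frontend_project.project_id
  target                = google_compute_target_https_proxy.frontend.id
  ip_address            = google_compute_global_address.frontend.id
  port_range            = "443"
  load_balancing_scheme = "EXTERNAL_MANAGED" # must match the backend service
}
```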

Step 7 — Tier 2: Backend Project

# environments/production/backend/main.tf
module "backend_project" {
source = "../../../modules/project-factory"
project_name = "mycompany-backend-prod"
folder_id = var.production_folder_id
billing_account = var.billing_account_id
apis = [
"container.googleapis.com",
"compute.googleapis.com",
"pubsub.googleapis.com",
"redis.googleapis.com",
"servicemesh.googleapis.com",
]
labels = {
environment = "production"
tier = "backend"
team = "backend"
}
}
# ── GKE Cluster (Backend APIs) ────────────────────────────────
module "gke" {
source = "../../../modules/gke-cluster"
project_id = module.backend_project.project_id
cluster_name = "backend-prod"
region = var.region
environment = "production"
network_id = var.backend_network_id
subnet_id = var.backend_subnet_id
pods_range_name = "pods"
services_range_name = "services"
master_cidr = "172.16.1.0/28"
kms_key_id = var.gke_kms_key_id
node_sa_email = module.security.node_sa_email
authorized_networks = [
{ cidr_block = "10.0.0.0/8", display_name = "internal" }
]
node_pools = {
application = {
machine_type = "n2-standard-8"
min_nodes = 3
max_nodes = 50
disk_size_gb = 100
disk_type = "pd-ssd"
spot = false
taints = []
labels = { pool = "application" }
}
}
labels = {
environment = "production"
tier = "backend"
}
}
# ── Internal Load Balancer for Backend ───────────────────────
resource "google_compute_address" "backend_ilb" {
name = "backend-ilb-ip"
project = module.backend_project.project_id
region = var.region
address_type = "INTERNAL"
subnetwork = var.backend_subnet_id
address = "10.2.0.100" # fixed internal IP
}
# ── Cloud Pub/Sub for async messaging ─────────────────────────
resource "google_pubsub_topic" "events" {
name = "application-events"
project = module.backend_project.project_id
kms_key_name = var.pubsub_kms_key_id
message_retention_duration = "86400s" # 24 hours
labels = {
environment = "production"
managed_by = "terraform"
}
}
resource "google_pubsub_subscription" "events_processor" {
name = "events-processor"
project = module.backend_project.project_id
topic = google_pubsub_topic.events.name
ack_deadline_seconds = 60
message_retention_duration = "86400s"
retain_acked_messages = false
expiration_policy {
ttl = "" # never expires
}
retry_policy {
minimum_backoff = "10s"
maximum_backoff = "600s"
}
dead_letter_policy {
dead_letter_topic = google_pubsub_topic.dead_letter.id
max_delivery_attempts = 5
}
}
# ── Memorystore Redis ─────────────────────────────────────────
resource "google_redis_instance" "cache" {
name = "backend-cache"
project = module.backend_project.project_id
region = var.region
tier = "STANDARD_HA" # HA with failover
memory_size_gb = 4
redis_version = "REDIS_7_0"
authorized_network = var.backend_network_id
connect_mode = "PRIVATE_SERVICE_ACCESS"
transit_encryption_mode = "SERVER_AUTHENTICATION"
auth_enabled = true
redis_configs = {
maxmemory-policy = "allkeys-lru"
notify-keyspace-events = "Ex"
}
maintenance_policy {
weekly_maintenance_window {
day = "SUNDAY"
start_time {
hours = 3
minutes = 0
}
}
}
labels = {
environment = "production"
managed_by = "terraform"
}
}
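The subscription's dead_letter_policy points at google_pubsub_topic.dead_letter, which is not defined in this step. A minimal sketch of that topic and the two grants the Pub/Sub service agent needs in order to forward failed messages is shown below. The topic name, the local, and the use of var.backend_project_number are assumptions.

```hcl
# Sketch: dead-letter topic for messages that exhaust max_delivery_attempts.
resource "google_pubsub_topic" "dead_letter" {
  name    = "application-events-dead-letter" # illustrative name
  project = module.backend_project.project_id
}

locals {
  # Google-managed Pub/Sub service agent of the backend project.
  pubsub_agent = "serviceAccount:service-${var.backend_project_number}@gcp-sa-pubsub.iam.gserviceaccount.com"
}

# The agent publishes failed messages to the dead-letter topic...
resource "google_pubsub_topic_iam_member" "dead_letter_publisher" {
  project = module.backend_project.project_id
  topic   = google_pubsub_topic.dead_letter.name
  role    = "roles/pubsub.publisher"
  member  = local.pubsub_agent
}

# ...and acknowledges them on the source subscription.
resource "google_pubsub_subscription_iam_member" "dead_letter_subscriber" {
  project      = module.backend_project.project_id
  subscription = google_pubsub_subscription.events_processor.name
  role         = "roles/pubsub.subscriber"
  member       = local.pubsub_agent
}
```

A subscription on the dead-letter topic is also worth adding; without one, forwarded messages are not retained for inspection.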

Step 8 — Tier 3: Data Project

# environments/production/data/main.tf
module "data_project" {
source = "../../../modules/project-factory"
project_name = "mycompany-data-prod"
folder_id = var.production_folder_id
billing_account = var.billing_account_id
apis = [
"sqladmin.googleapis.com",
"servicenetworking.googleapis.com",
"secretmanager.googleapis.com",
"bigquery.googleapis.com",
"dataflow.googleapis.com",
]
labels = {
environment = "production"
tier = "data"
team = "data"
}
}
# ── Cloud SQL PostgreSQL (Primary DB) ─────────────────────────
resource "google_sql_database_instance" "primary" {
name = "mycompany-postgres-prod"
project = module.data_project.project_id
database_version = "POSTGRES_15"
region = var.region
deletion_protection = true
encryption_key_name = var.sql_kms_key_id
settings {
tier = "db-custom-8-30720" # 8 vCPU, 30 GB; PostgreSQL uses custom machine types
availability_type = "REGIONAL" # HA with standby
disk_size = 500
disk_type = "PD_SSD"
disk_autoresize = true
disk_autoresize_limit = 1000
# Backup config
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
transaction_log_retention_days = 7
backup_retention_settings {
retained_backups = 30
retention_unit = "COUNT"
}
}
# IP config — private only
ip_configuration {
ipv4_enabled = false
private_network = var.data_network_id
ssl_mode = "TRUSTED_CLIENT_CERTIFICATE_REQUIRED" # successor to require_ssl in recent google provider versions
enable_private_path_for_google_cloud_services = true
}
# Maintenance window
maintenance_window {
day = 7 # Sunday
hour = 4 # 4 AM
update_track = "stable"
}
# Flags for security and performance
database_flags {
name = "log_min_duration_statement"
value = "1000" # log queries > 1s
}
database_flags {
name = "log_connections"
value = "on"
}
database_flags {
name = "log_disconnections"
value = "on"
}
database_flags {
name = "cloudsql.iam_authentication"
value = "on" # enable IAM auth
}
insights_config {
query_insights_enabled = true
query_string_length = 1024
record_application_tags = true
record_client_address = true
}
}
}
# Read replicas for read scaling
resource "google_sql_database_instance" "read_replica" {
count = 2
name = "mycompany-postgres-prod-replica-${count.index}"
project = module.data_project.project_id
database_version = "POSTGRES_15"
region = var.region
master_instance_name = google_sql_database_instance.primary.name
replica_configuration {
failover_target = false
}
settings {
tier = "db-custom-4-15360" # 4 vCPU, 15 GB
availability_type = "ZONAL"
disk_autoresize = true
ip_configuration {
ipv4_enabled = false
private_network = var.data_network_id
ssl_mode = "TRUSTED_CLIENT_CERTIFICATE_REQUIRED" # successor to require_ssl in recent google provider versions
}
}
deletion_protection = true
}
# ── BigQuery Data Warehouse ───────────────────────────────────
resource "google_bigquery_dataset" "warehouse" {
dataset_id = "production_warehouse"
project = module.data_project.project_id
friendly_name = "Production Data Warehouse"
location = var.region
default_table_expiration_ms = null # tables don't expire
default_encryption_configuration {
kms_key_name = var.bq_kms_key_id
}
access {
role = "OWNER"
special_group = "projectOwners"
}
access {
role = "READER"
group_by_email = "data-analysts@mycompany.com"
}
access {
role = "WRITER"
group_by_email = "data-engineers@mycompany.com"
}
}
# ── Secret Manager ────────────────────────────────────────────
resource "google_secret_manager_secret" "db_password" {
secret_id = "postgres-app-password"
project = module.data_project.project_id
replication {
user_managed {
replicas {
location = var.region
customer_managed_encryption {
kms_key_name = var.secret_kms_key_id
}
}
replicas {
location = var.secondary_region
customer_managed_encryption {
kms_key_name = var.secret_kms_key_secondary_id
}
}
}
}
labels = {
environment = "production"
managed_by = "terraform"
}
}
resource "google_secret_manager_secret_version" "db_password" {
secret = google_secret_manager_secret.db_password.id
secret_data = var.db_password # pass via TF_VAR_db_password; the value is still written to Terraform state
lifecycle {
ignore_changes = [secret_data] # don't rotate via Terraform
}
}
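The Cloud SQL instances above set private_network, which only works once the data VPC has a private services access connection. A minimal sketch of that prerequisite follows, together with a sensitive declaration for the password variable. The resource names and the /20 prefix length are assumptions, and the address is placed in var.data_project_id on the assumption that this project owns the data VPC.

```hcl
# Sketch: reserve an internal range for Google-managed services.
resource "google_compute_global_address" "private_services" {
  name          = "data-private-services" # illustrative name
  project       = var.data_project_id
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 20
  network       = var.data_network_id
}

# Peer the data VPC with the service producer network.
resource "google_service_networking_connection" "private_services" {
  network                 = var.data_network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_services.name]
}

# Keep the password out of plan output and logs.
variable "db_password" {
  description = "Initial application password for PostgreSQL"
  type        = string
  sensitive   = true
}
```

The primary instance would then declare depends_on = [google_service_networking_connection.private_services] so it is not created before the connection exists.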

Step 9 — Project Factory Module

# modules/project-factory/main.tf
resource "google_project" "project" {
name = var.project_name
project_id = "${var.project_name}-${random_id.suffix.hex}"
folder_id = var.folder_id
billing_account = var.billing_account
auto_create_network = false # no default network
labels = merge(var.labels, {
managed_by = "terraform"
})
}
resource "random_id" "suffix" {
byte_length = 2
}
# Enable APIs
resource "google_project_service" "apis" {
for_each = toset(var.apis)
project = google_project.project.project_id
service = each.value
disable_on_destroy = false
disable_dependent_services = false
}
# Default compute SA — restrict permissions
resource "google_project_default_service_accounts" "default" {
project = google_project.project.project_id
action = "DEPRIVILEGE" # remove editor role from default SA
}
# Enable audit logging
resource "google_project_iam_audit_config" "audit" {
project = google_project.project.project_id
service = "allServices"
audit_log_config { log_type = "ADMIN_READ" }
audit_log_config { log_type = "DATA_READ" }
audit_log_config { log_type = "DATA_WRITE" }
}
# Budget alert per project
resource "google_billing_budget" "project" {
billing_account = var.billing_account
display_name = "${var.project_name} Budget"
budget_filter {
projects = ["projects/${google_project.project.number}"]
}
amount {
specified_amount {
currency_code = "USD"
units = tostring(var.monthly_budget_usd)
}
}
threshold_rules {
threshold_percent = 0.8
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "FORECASTED_SPEND"
}
}
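The module body references several inputs, and callers in earlier steps read module.<name>.project_id, so the module also needs variable and output declarations. A minimal sketch of those two files is below. The default budget and the 25-character limit are assumptions; the limit follows from the 30-character cap on project IDs minus the hyphen and 4 hex characters of suffix that the module appends.

```hcl
# modules/project-factory/variables.tf (sketch)
variable "project_name" {
  type = string

  validation {
    condition     = length(var.project_name) <= 25
    error_message = "project_name must be at most 25 characters so the generated project ID fits in 30."
  }
}

variable "folder_id" {
  type = string
}

variable "billing_account" {
  type = string
}

variable "apis" {
  type    = list(string)
  default = []
}

variable "labels" {
  type    = map(string)
  default = {}
}

variable "monthly_budget_usd" {
  type    = number
  default = 1000 # assumed default; callers above do not set it
}

# modules/project-factory/outputs.tf (sketch)
output "project_id" {
  value = google_project.project.project_id
}

output "project_number" {
  value = google_project.project.number
}
```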

Deployment Pipeline

# .github/workflows/landing-zone.yml
name: Landing Zone Deployment
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  # ── Foundation first ────────────────────────────────────────
  foundation:
    name: Foundation
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: [org-policies, networking, security, monitoring]
      max-parallel: 1 # sequential: order matters
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.ORG_TF_SA }}
      - uses: hashicorp/setup-terraform@v3
      - name: Plan ${{ matrix.component }}
        run: |
          cd foundation/${{ matrix.component }}
          terraform init
          terraform plan -out=tfplan
      - name: Apply ${{ matrix.component }}
        if: github.ref == 'refs/heads/main'
        run: |
          cd foundation/${{ matrix.component }}
          terraform apply -auto-approve tfplan
  # ── Then environments ────────────────────────────────────────
  production:
    name: Production
    runs-on: ubuntu-latest
    needs: foundation
    if: github.ref == 'refs/heads/main'
    environment: production # requires approval
    strategy:
      matrix:
        tier: [frontend, backend, data]
      max-parallel: 1
    permissions:
      id-token: write # every job that authenticates through WIF needs this
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.PROD_TF_SA }}
      - uses: hashicorp/setup-terraform@v3
      - name: Apply ${{ matrix.tier }}
        run: |
          cd environments/production/${{ matrix.tier }}
          terraform init
          terraform apply -auto-approve \
            -var-file="production.tfvars"
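Because each workflow run starts on a fresh runner, the terraform init calls above only work if every component keeps its state remotely. A minimal sketch of the block each component directory would carry is shown below. The bucket name and version constraints are assumptions; the prefix mirrors the directory path so components do not share a state file.

```hcl
# Sketch: remote state and provider pinning for one component.
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.0"
    }
  }

  backend "gcs" {
    bucket = "mycompany-terraform-state" # hypothetical bucket name
    prefix = "foundation/networking"     # one prefix per component directory
  }
}
```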

What This Achieves

Security:
├── No public IPs on VMs (org policy)
├── No default networks (org policy)
├── No SA key files (org policy + Workload Identity)
├── All data encrypted at rest (CMEK via Cloud KMS)
├── VPC Service Controls (data exfiltration protection)
├── Cloud Armor WAF (OWASP protection)
├── Centralized audit logging (all API calls)
├── Security Command Center (threat detection)
└── CIS compliance score: 94%
Networking:
├── Hub-spoke topology (centralized visibility)
├── Zero east-west traffic between tiers by default
├── Frontend → Backend only on ports 8080/8443/443
├── Backend → Data only on DB ports
├── All traffic logged via VPC flow logs
└── Private Google Access (no internet for GCP APIs)
Governance:
├── Folder hierarchy enforces access boundaries
├── Org policies prevent misconfigurations at scale
├── Budget alerts per project and for production as a whole
├── All changes via Terraform — no manual console access
├── Break-glass account for emergencies only
└── Audit trail for every IAM and API change
Operations:
├── Provisioning time: 4 hours → 18 minutes
├── Config drift incidents: eliminated
├── Compliance violations: 234 → 8 (-97%)
├── New project request: 2 weeks → 30 minutes
└── Security findings: visible within 5 minutes

The enterprise landing zone transforms GCP from a raw cloud platform into a governed, secure, auditable platform where development teams can move fast inside well-defined guardrails — without the platform team becoming a bottleneck for every security decision.

Mastering GKE and Terraform for Interviews

GKE + Terraform Interview Explanation

How to Frame Your Answer

Interview tip — always structure answers as:
1. WHY — what problem does it solve
2. WHAT — what you built
3. HOW — how it works
4. RESULT — what it achieved
Never just list technologies.
Say: "I built X to solve Y, which resulted in Z"

Start with the Big Picture

What you say:

“I built a production-grade GKE infrastructure using Terraform — fully automated, modular, and deployed across dev, staging, and production environments. The goal was to eliminate manual cloud provisioning, enforce security by default, and let developers get a new environment in minutes rather than days.”

What we built:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Developer opens PR │
│ ↓ │
│ GitHub Actions triggers automatically │
│ ↓ │
│ ┌─────────────────────────────────────────────┐ │
│ │ Validate → Security Scan → Plan → Apply │ │
│ └─────────────────────────────────────────────┘ │
│ ↓ │
│ GKE cluster fully provisioned in 12 minutes │
│ ↓ │
│ Private cluster, Workload Identity, │
│ KMS encryption, auto-scaling — all by default │
│ │
└─────────────────────────────────────────────────────────────┘

Explain the Module Structure

What you say:

“Instead of writing one massive Terraform file, I broke it into three focused modules — networking, security, and GKE. Each module has one responsibility and can be tested independently. This meant the platform team owned the modules, and application teams consumed them without needing to know the security details.”

WHY modules matter:
Without modules:
├── 1 giant main.tf
├── 2000 lines of code
├── Hard to test
├── Hard to reuse
├── Copy-paste between envs
└── Security gets missed
With modules:
├── networking/ → VPC, subnets, NAT
├── security/ → IAM, KMS, SA
└── gke/ → cluster, node pools
App team just does:
module "gke" {
source = "./modules/gke"
env = "production"
}
# all security built in

Explain Networking — Why Private Cluster

What you say:

“The cluster is private — nodes have no public IP addresses. All outbound traffic goes through Cloud NAT, and the Kubernetes API is only accessible from authorized networks like our VPN and bastion host. This massively reduces the attack surface.”

Public cluster (avoid):
Internet → anyone can reach K8s API
→ nodes have public IPs
→ brute force, CVE exploitation risk
Private cluster (what we built):
Internet
└──▶ Authorized networks only (office VPN, bastion)
└──▶ K8s API (control plane range 172.16.0.0/28, reachable only from authorized networks)
Worker nodes (no public IPs)
└──▶ Cloud NAT → outbound to internet (pull images, call APIs)
# What makes it private — explain these two lines
private_cluster_config {
enable_private_nodes = true # nodes get no public IP
enable_private_endpoint = false # API accessible via authorized nets
master_ipv4_cidr_block = "172.16.0.0/28"
}
master_authorized_networks_config {
cidr_blocks {
cidr_block = "10.0.0.0/8" # internal traffic
display_name = "internal"
}
cidr_blocks {
cidr_block = "203.0.113.0/24" # office VPN
display_name = "office-vpn"
}
}
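To make the authorized-networks idea concrete, here is a minimal Python sketch of the matching logic. It only illustrates how a client address is compared against the allowed CIDR blocks; the real check is enforced by the GKE control plane, and the function name `matching_network` is an assumption made for this example.

```python
import ipaddress
from typing import Optional

# CIDR blocks copied from the master_authorized_networks_config example above
AUTHORIZED_NETWORKS = {
    "internal": "10.0.0.0/8",
    "office-vpn": "203.0.113.0/24",
}


def matching_network(client_ip: str) -> Optional[str]:
    """Return the display name of the first block containing client_ip, else None."""
    address = ipaddress.ip_address(client_ip)
    for name, cidr in AUTHORIZED_NETWORKS.items():
        if address in ipaddress.ip_network(cidr):
            return name
    return None


print(matching_network("203.0.113.25"))  # an office VPN address
print(matching_network("198.51.100.7"))  # an address outside every block
```

A request from an address that matches no block never reaches the API server, which is why 0.0.0.0/0 should not appear in this list.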

Explain VPC-Native Networking

What you say:

“We use VPC-native networking with secondary IP ranges — one range for pods and one for services. This means pods get real VPC IP addresses, so there’s no NAT between pods and GCP services like Cloud SQL or Pub/Sub. It also enables better network policies and visibility.”

VPC-Native (alias IPs):
Subnet: 10.0.0.0/20 ← nodes live here
Pods: 10.4.0.0/14 ← pods get IPs from here
Services: 10.0.16.0/20 ← ClusterIP services
Why it matters:
├── Pods are first-class VPC citizens
├── Firewall rules apply directly to pods
├── No double-NAT overhead
├── Cloud SQL can whitelist pod IPs directly
└── VPC flow logs show pod-level traffic
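The three ranges above must not overlap, and the pod range also caps how many nodes the cluster can hold. The following Python sketch checks both, assuming each node is carved a /24 out of the pod range (the GKE default when the maximum is 110 pods per node). The helper names are illustrative.

```python
import ipaddress
from itertools import combinations

# Ranges taken from the VPC-native layout above
RANGES = {
    "nodes": ipaddress.ip_network("10.0.0.0/20"),
    "pods": ipaddress.ip_network("10.4.0.0/14"),
    "services": ipaddress.ip_network("10.0.16.0/20"),
}


def overlapping_pairs(ranges):
    """Return the names of every pair of ranges that share addresses."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


def max_nodes_for_pod_range(pod_range, per_node_prefix=24):
    """Number of per-node blocks that fit in the pod range.

    Assumption: one /24 per node, the default for 110 max pods per node.
    """
    return 2 ** (per_node_prefix - pod_range.prefixlen)


print(overlapping_pairs(RANGES))                 # empty list means no conflicts
print(max_nodes_for_pod_range(RANGES["pods"]))   # node ceiling for a /14 pod range
```

Running a check like this before `terraform apply` catches range conflicts that are otherwise painful to fix, because secondary ranges in use by a cluster cannot simply be resized.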

Explain Security — Workload Identity

What you say:

“The biggest security win was Workload Identity. Before, teams would create service account key files, store them in Kubernetes secrets, and rotate them manually — which is risky and error-prone. With Workload Identity, pods automatically get a GCP identity without any key files. The binding is cryptographic and managed by Google.”

WITHOUT Workload Identity (bad):
─────────────────────────────────
1. Create GCP service account key file (JSON)
2. Store key in K8s secret
3. Mount secret into pod
4. App reads key file from disk
5. Remember to rotate every 90 days
6. Key can be stolen from secret store
Risk: leaked key = compromised GCP account
WITH Workload Identity (what we built):
────────────────────────────────────────
1. K8s Service Account (KSA) created
2. KSA annotated with GCP SA email
3. Pod uses KSA
4. GCP metadata server issues short-lived token
5. No key files anywhere
6. Token auto-rotates every hour
Risk: no long-lived key to steal
# How it works in Terraform
# GKE cluster has Workload Identity enabled
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Node pool uses GKE_METADATA mode
workload_metadata_config {
mode = "GKE_METADATA" # intercepts metadata requests
}
# GCP SA allows K8s SA to impersonate it
resource "google_service_account_iam_member" "wi_binding" {
service_account_id = google_service_account.app.name
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project_id}.svc.id.goog[production/app-ksa]"
# ↑ This K8s SA in this namespace gets this GCP identity
}
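The two strings that tie a Kubernetes service account to a GCP service account are easy to get wrong by hand. As a small illustration, this Python sketch builds the IAM member string used in the binding and the annotation placed on the KSA. The function names and the sample email are hypothetical; the formats follow the Terraform example above.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that identifies one KSA in one namespace."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_email: str) -> dict:
    """Annotation to put on the Kubernetes service account."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


member = workload_identity_member("myproject", "production", "app-ksa")
annotation = ksa_annotation("app@myproject.iam.gserviceaccount.com")  # hypothetical GSA
print(member)
print(annotation)
```

If the namespace or KSA name in the member string does not match the pod's actual service account, token requests fail, so generating both values from the same inputs avoids a common misconfiguration.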

Explain Security — KMS Encryption

What you say:

“All Kubernetes secrets stored in etcd are encrypted using a customer-managed key in Cloud KMS. This means even if someone gained access to the etcd data directly, they couldn’t read the secrets without also having access to the KMS key — and we control that separately with different IAM permissions.”

Without KMS:
etcd stores K8s secrets → base64 encoded, with only the platform's default disk encryption underneath
Anyone who can read etcd data at the application layer sees the secret values
With KMS (what we built):
K8s Secret created
K8s API Server encrypts it with KMS key
Encrypted blob stored in etcd
Even raw etcd access shows encrypted data
KMS key is:
├── Separate from cluster IAM
├── Rotated every 90 days automatically
├── Audited — every decrypt operation logged
└── prevent_destroy = true — can't accidentally delete

Explain Node Pools Strategy

What you say:

“We have three node pools serving different purposes. The system pool runs cluster components like monitoring and ingress controllers — it’s tainted so application pods don’t land there. The application pool runs business workloads on stable on-demand machines. The spot pool handles batch jobs and can scale to zero — this alone saved about 40% on compute costs.”

Node Pool Strategy:
┌──────────────────────────────────────────────────────┐
│ SYSTEM POOL │
│ n2-standard-2, on-demand, taint: CriticalAddonsOnly │
│ Runs: Prometheus, Ingress, Cert-manager, etc. │
│ → Isolated from app workloads │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ APPLICATION POOL │
│ n2-standard-8, on-demand, min=3 max=50 │
│ Runs: Production APIs, databases, services │
│ → Stable, always available, no eviction │
└──────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ SPOT POOL │
│ n2-standard-4, spot VMs, min=0 max=20 │
│ Runs: batch jobs, ML training, data processing │
│ → 60-90% cheaper, can be preempted │
│ → Tainted — only tolerating pods land here │
└──────────────────────────────────────────────────────┘
Cost impact:
Before: all on-demand → $8,000/month
After: mixed pools → $4,800/month (-40%)
# Interviewer might ask: how does the taint work?
# Node taint — repels pods
taint {
key = "cloud.google.com/gke-spot"
value = "true"
effect = "NO_SCHEDULE" # pods won't land here unless...
}
# Pod toleration — allows landing on spot nodes
tolerations:
  - key: "cloud.google.com/gke-spot"
    operator: Equal
    value: "true"
    effect: NoSchedule # ...they explicitly tolerate it
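If the interviewer pushes further on how taints and tolerations match, the rule can be sketched in a few lines of Python. This is a simplified model of the scheduler's matching rule, not the scheduler itself; it uses the Kubernetes spelling `NoSchedule` (Terraform's GKE resource spells the same effect `NO_SCHEDULE`), and the function names are illustrative.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches every effect
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key tolerates every taint
        return not key or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations: list, node_taints: list) -> bool:
    """A pod may land on a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(toleration, taint) for toleration in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


spot_taint = {"key": "cloud.google.com/gke-spot", "value": "true", "effect": "NoSchedule"}
batch_toleration = {
    "key": "cloud.google.com/gke-spot",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}
print(can_schedule([batch_toleration], [spot_taint]))  # batch pod with the toleration
print(can_schedule([], [spot_taint]))                  # ordinary pod without it
```

Note that a toleration only permits scheduling on the tainted nodes. To force batch pods onto the spot pool, pair it with a node selector or node affinity.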

Explain Autoscaling

What you say:

“We have two levels of autoscaling. Horizontal Pod Autoscaler scales pods when CPU or memory is high. Cluster Autoscaler scales nodes when pods can’t be scheduled because there’s no capacity. Together they handle traffic spikes automatically and scale down during quiet periods to save cost.”

Two-level autoscaling:
Traffic spike hits:
HPA detects high CPU on pods
HPA adds more pods
No nodes available for new pods
Cluster Autoscaler detects pending pods
Cluster Autoscaler adds new node
Pods schedule on new node
Traffic returns to normal
HPA removes excess pods
Cluster Autoscaler removes empty node (after 10 min)
Result: zero manual intervention
cost scales with actual usage
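The HPA step in this flow follows a simple ratio formula: desired replicas = ceil(current replicas × current metric / target metric). The Python sketch below applies it, including the default 10% tolerance band and the min/max clamp. The parameter defaults for `min_replicas` and `max_replicas` are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count the HPA would aim for, given one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# 4 pods at 90% CPU against a 60% target
print(desired_replicas(4, 90, 60, max_replicas=20))
# 6 pods at 20% CPU against a 60% target
print(desired_replicas(6, 20, 60))
```

When the extra replicas cannot be placed, they stay Pending, and that is the signal the Cluster Autoscaler reacts to.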

Explain the CI/CD Pipeline

What you say:

“The deployment pipeline has four stages — validate, security scan, plan, apply. Every PR triggers a plan so the team can see exactly what will change before merging. Security scanning with Checkov blocks the pipeline if it finds critical misconfigurations. Production requires a manual approval gate — two senior engineers must approve before the apply runs.”

PR opened:
┌─────────────────────────────────────────────────────┐
│ 1. VALIDATE (30 seconds) │
│ terraform fmt -check │
│ terraform validate │
│ Fails immediately on syntax errors │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 2. SECURITY SCAN (2 minutes) │
│ Checkov — 2000+ IaC checks │
│ tfsec — Terraform security scanner │
│ Fails on CRITICAL/HIGH findings │
│ Blocks deploy if public storage, open ports, etc.│
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ 3. PLAN (3 minutes) │
│ terraform plan for each environment │
│ Posts plan output as PR comment │
│ Team reviews: what will change? │
└──────────────────────┬──────────────────────────────┘
↓ (PR merged to main)
┌─────────────────────────────────────────────────────┐
│ 4. APPLY │
│ Dev → auto-apply │
│ Staging → 1 approval required │
│ Production → 2 approvals + manual gate │
└─────────────────────────────────────────────────────┘
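The gating rules in this pipeline boil down to two checks: no blocking security findings, and enough approvals for the target environment. A minimal Python sketch of that decision follows. In practice the rules live in GitHub Actions environment protection settings and scanner exit codes, so the function and constant names here are purely illustrative.

```python
# Approval counts per environment, as described in the APPLY stage above
REQUIRED_APPROVALS = {"dev": 0, "staging": 1, "production": 2}
# Severities that fail the SECURITY SCAN stage
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def may_apply(environment: str, approvals: int, finding_severities: list) -> bool:
    """True if terraform apply is allowed to run for this environment."""
    if any(severity.upper() in BLOCKING_SEVERITIES for severity in finding_severities):
        return False  # the scan blocks the pipeline regardless of approvals
    return approvals >= REQUIRED_APPROVALS[environment]


print(may_apply("dev", 0, ["LOW"]))
print(may_apply("production", 1, []))
print(may_apply("production", 2, ["HIGH"]))
```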

Explain OIDC Authentication

What you say:

“The pipeline uses OIDC — no GCP credentials stored in GitHub secrets. GitHub proves its identity to GCP using a short-lived cryptographic token. GCP validates the token against our Workload Identity Federation configuration, checks it’s coming from our specific repo and branch, then issues a short-lived access token for that job only.”

Traditional (risky):
Store GCP service account key in GitHub secret
→ Long-lived key, could be leaked
→ Must rotate manually
→ If leaked, attacker has permanent access
OIDC (what we use):
GitHub job starts
GitHub OIDC provider issues signed JWT
JWT contains: repo, branch, workflow, actor
Terraform job sends JWT to GCP
GCP validates: trusted issuer? correct repo?
GCP issues 1-hour access token
Terraform uses token to create/update resources
Token expires after at most 1 hour
No long-lived credentials stored in GitHub
No key rotation needed
Limited blast radius: only workflows from the trusted repo and branch can obtain tokens
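The "correct repo and branch" check is the part worth being able to explain in detail. The Python sketch below models it on an already-decoded set of JWT claims. It deliberately skips signature verification, which GCP performs against GitHub's published keys, and the repository name is a hypothetical placeholder.

```python
# Issuer of GitHub Actions OIDC tokens
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def claims_allowed(claims: dict, allowed_repo: str, allowed_ref: str) -> bool:
    """Mimic an attribute condition: trusted issuer, expected repo, expected branch."""
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("repository") == allowed_repo
        and claims.get("ref") == allowed_ref
    )


claims = {
    "iss": GITHUB_ISSUER,
    "repository": "mycompany/gke-terraform",  # hypothetical repo name
    "ref": "refs/heads/main",
    "actor": "alice",
}
print(claims_allowed(claims, "mycompany/gke-terraform", "refs/heads/main"))
print(claims_allowed({**claims, "repository": "attacker/fork"},
                     "mycompany/gke-terraform", "refs/heads/main"))
```

Without a condition on the repository claim, any GitHub repository could present a validly signed token, so this restriction is what makes the federation safe.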

Explain State Management

What you say:

“Terraform state is stored in a GCS bucket with versioning enabled. This means the state file is shared across the team — anyone can run Terraform and they’re working with the same view of the world. Versioning acts as a backup — if something goes wrong we can restore a previous state version.”

Why remote state matters:
Local state (bad):
Alice runs terraform apply → state on Alice's laptop
Bob runs terraform apply → state on Bob's laptop
→ Two different views of reality
→ Duplicate resources, conflicts
Remote state in GCS (what we built):
Alice runs terraform apply → reads/writes GCS
Bob runs terraform apply → reads same GCS
→ Single source of truth
→ State locking prevents simultaneous applies
→ Versioned — roll back if corrupted
State bucket config:
versioning: on ← backup every state change
encryption: CMEK ← encrypted with our KMS key
uniform access: on ← no ACLs, just IAM
public access: blocked ← never public

Common Interview Questions on This Topic


Q: Why Terraform over gcloud CLI scripts?

“Scripts are imperative — they tell you HOW to create something. Terraform is declarative — you tell it WHAT you want. If a script fails halfway, you have partial infrastructure and unclear state. Terraform tracks state, knows what exists, and only changes what’s different. Scripts don’t handle drift — Terraform can detect and fix it. Also Terraform plans show you exactly what will change before it happens — a gcloud script gives you no preview.”


Q: How do you handle Terraform state locking?

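“The GCS backend supports state locking natively. When Terraform starts an operation that could write state, it creates a lock object (a .tflock file) alongside the state object in the bucket. If two applies run simultaneously, the second one detects the lock and fails with a clear error rather than corrupting state. We pass -lock-timeout so a queued run waits briefly for the lock instead of failing immediately, and if a pipeline dies mid-run and leaves a stale lock behind, we clear it with terraform force-unlock.”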
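Backend locks of this kind rest on an atomic create-if-absent operation: whoever creates the lock first wins, and everyone else is refused. The Python sketch below shows the idea with a local file standing in for the lock object in the bucket. It is an analogy under that assumption, not Terraform's actual implementation, and the file name is illustrative.

```python
import os
import tempfile


def try_acquire_lock(lock_path: str, owner: str) -> bool:
    """Create the lock file atomically; fail if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else holds the lock
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)  # record who holds it, for debugging stale locks
    return True


def release_lock(lock_path: str) -> None:
    """Remove the lock so the next run can proceed."""
    os.remove(lock_path)


with tempfile.TemporaryDirectory() as workdir:
    lock = os.path.join(workdir, "default.tflock")
    print(try_acquire_lock(lock, "alice"))  # first run gets the lock
    print(try_acquire_lock(lock, "bob"))    # concurrent run is refused
    release_lock(lock)
    print(try_acquire_lock(lock, "bob"))    # succeeds once released
```

A crashed run never reaches the release step, which is why a stale lock has to be removed deliberately rather than timing out on its own.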


Q: What happens if a node pool update requires node replacement?

“GKE handles this through surge upgrades — configured as max_surge=1, max_unavailable=0. It adds one new node with the new configuration, waits for it to be ready, drains one old node, then repeats. With PodDisruptionBudgets set on workloads, no service goes below its minimum replicas during the process. Zero downtime.”

upgrade_settings {
strategy = "SURGE"
max_surge = 1 # add 1 extra node during upgrade
max_unavailable = 0 # never reduce below desired count
}
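The effect of these two settings can be worked out with simple arithmetic. The Python sketch below is a simplified model that assumes nodes are upgraded in batches of max_surge + max_unavailable. It ignores drain time and PodDisruptionBudgets, and the function name is illustrative.

```python
import math


def surge_upgrade_plan(node_count: int, max_surge: int = 1, max_unavailable: int = 0) -> dict:
    """Rough shape of a surge upgrade for one node pool."""
    batch_size = max_surge + max_unavailable  # nodes upgraded at the same time
    if batch_size < 1:
        raise ValueError("max_surge + max_unavailable must be at least 1")
    return {
        "rounds": math.ceil(node_count / batch_size),
        "peak_nodes": node_count + max_surge,                   # temporary extra capacity
        "min_available_nodes": node_count - max_unavailable,    # capacity floor
    }


print(surge_upgrade_plan(6))                                  # the settings shown above
print(surge_upgrade_plan(6, max_surge=2, max_unavailable=1))  # faster, less capacity
```

With max_surge=1 and max_unavailable=0 the pool never dips below its desired size, at the cost of one round per node. Raising either value shortens the upgrade but changes the capacity floor or the temporary cost.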

Q: How do you manage secrets in your GKE workloads?

“Three layers. First, GCP Secret Manager stores the actual secrets — never in code or Terraform variables. Second, External Secrets Operator syncs them from Secret Manager into Kubernetes secrets automatically, with a 1-hour refresh. Third, pods reference Kubernetes secrets as environment variables or volume mounts — never as plain text in YAML. The whole chain is encrypted — KMS at rest in etcd, TLS in transit.”


Q: How do you handle cluster upgrades?

“We’re on the REGULAR release channel — Google automatically upgrades the control plane. For node pools, auto_upgrade=true handles it during our maintenance window, Saturday to Sunday 2-6 AM. We have a maintenance exclusion for high-traffic periods like Black Friday. The surge upgrade strategy ensures zero downtime. We test new versions in dev first since dev is on RAPID channel — so dev gets updates weeks before production.”


Q: What would you do differently if starting again?

“Two things. First, I’d use Terraform workspaces or Terragrunt from day one to reduce the environment config duplication — our three tfvars files have a lot of overlap. Second, I’d implement drift detection earlier — a daily GitHub Actions cron job running terraform plan and alerting if it detects changes. We added it later but it should have been day one because manual changes to the cluster are the biggest source of incidents.”
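A drift check like the one described can be built on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The Python sketch below wraps that call; the function names and result labels are assumptions, and the alerting step is left out.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`
EXIT_MEANING = {0: "no-drift", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Translate the plan exit code into a drift status."""
    return EXIT_MEANING.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report whether real infra differs from code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


print(classify_plan_exit(2))  # a non-empty plan on an unchanged branch means drift
```

Scheduled daily against the main branch, a "drift" result means someone changed infrastructure outside Terraform, which is the signal worth alerting on.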


One-Line Summaries for Each Component

If interviewer asks "explain X in one sentence":
VPC: "Private network that isolates our cluster from the public internet"
Private cluster: "Nodes have no public IPs, so attackers can't reach them directly even if they find them"
Workload Identity: "Pods prove their identity cryptographically, with no key files that can be stolen"
KMS encryption: "Even if someone steals etcd, they can't read our secrets without the encryption key"
Node pools: "Different machine types for different jobs; spot VMs for batch save 40% on cost"
Cluster Autoscaler: "Automatically adds nodes when pods can't schedule and removes them when idle"
OIDC auth: "GitHub proves who it is to GCP without storing any credentials"
Remote state: "Single source of truth for what Terraform thinks exists in the cloud"
Modules: "Reusable, tested building blocks so every team gets secure infra without knowing the details"
Release channel: "Google auto-upgrades our cluster on a schedule we control"

The Killer Answer Structure

When asked “Tell me about your GKE Terraform setup”, use this structure:

30-second version:
"I built a modular Terraform platform that provisions
production-grade GKE clusters — private networking,
Workload Identity, KMS encryption — all by default.
Provisioning time went from 4 hours to 12 minutes,
and config drift dropped from hundreds of incidents
per month to near zero."
2-minute version:
Add: module structure, three node pools, CI/CD pipeline,
OIDC auth, one specific technical challenge you solved
5-minute version:
Add: specific decisions you made and why, alternatives
you considered, what you'd do differently, metrics

The key is to always anchor to outcomes: faster provisioning, better security posture, reduced drift, lower cost. Interviewers remember stories and numbers, not YAML.

Terraform Configuration for GKE: Best Practices

Deploy GKE with Terraform

Project Structure

gke-terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf
├── terraform.tfvars
├── modules/
│ ├── networking/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── gke/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── security/
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
└── environments/
├── dev.tfvars
├── staging.tfvars
└── production.tfvars

Step 1 — Versions and Provider Config

# versions.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
google = {
source = "hashicorp/google"
version = "~> 5.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.0"
}
}
# Remote state backend
backend "gcs" {
bucket = "mycompany-terraform-state"
prefix = "gke/production"
}
}
provider "google" {
project = var.project_id
region = var.region
}
provider "google-beta" {
project = var.project_id
region = var.region
}
# K8s provider — uses GKE cluster output
provider "kubernetes" {
host = "https://${module.gke.endpoint}"
token = data.google_client_config.default.access_token
cluster_ca_certificate = base64decode(
module.gke.ca_certificate
)
}
data "google_client_config" "default" {}

Step 2 — Variables

# variables.tf
variable "project_id" {
description = "GCP Project ID"
type = string
}
variable "region" {
description = "GCP region"
type = string
default = "us-central1"
}
variable "environment" {
description = "Environment name"
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Must be dev, staging, or production"
}
}
variable "cluster_name" {
description = "GKE cluster name"
type = string
}
# Networking
variable "vpc_cidr" {
description = "VPC CIDR range"
type = string
default = "10.0.0.0/16"
}
variable "subnet_cidr" {
description = "Subnet CIDR"
type = string
default = "10.0.0.0/20"
}
variable "pods_cidr" {
description = "Pod IP range"
type = string
default = "10.4.0.0/14"
}
variable "services_cidr" {
description = "Services IP range"
type = string
default = "10.0.16.0/20"
}
variable "master_cidr" {
description = "Control plane CIDR"
type = string
default = "172.16.0.0/28"
}
# Node pools
variable "node_pools" {
description = "Node pool configurations"
type = map(object({
machine_type = string
min_nodes = number
max_nodes = number
disk_size_gb = number
disk_type = string
spot = bool
taints = list(object({
key = string
value = string
effect = string
}))
labels = map(string)
}))
}
# Security
variable "authorized_networks" {
description = "Networks authorized to access K8s API"
type = list(object({
cidr_block = string
display_name = string
}))
default = []
}
variable "enable_private_endpoint" {
description = "Disable public K8s API endpoint"
type = bool
default = false
}
# Labels
variable "labels" {
description = "Labels applied to all resources"
type = map(string)
default = {}
}
# terraform.tfvars
project_id = "mycompany-prod"
region = "us-central1"
environment = "production"
cluster_name = "prod-cluster"
vpc_cidr = "10.0.0.0/16"
subnet_cidr = "10.0.0.0/20"
pods_cidr = "10.4.0.0/14"
services_cidr = "10.0.16.0/20"
master_cidr = "172.16.0.0/28"
node_pools = {
# System components pool
system = {
machine_type = "n2-standard-2"
min_nodes = 1
max_nodes = 3
disk_size_gb = 50
disk_type = "pd-ssd"
spot = false
taints = [{
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}]
labels = { pool = "system" }
}
# Application workloads
application = {
machine_type = "n2-standard-4"
min_nodes = 2
max_nodes = 20
disk_size_gb = 100
disk_type = "pd-ssd"
spot = false
taints = []
labels = { pool = "application" }
}
# Spot pool for batch workloads
spot = {
machine_type = "n2-standard-4"
min_nodes = 0
max_nodes = 10
disk_size_gb = 100
disk_type = "pd-ssd"
spot = true
taints = [{
key = "cloud.google.com/gke-spot"
value = "true"
effect = "NO_SCHEDULE"
}]
labels = { pool = "spot" }
}
}
authorized_networks = [
{
cidr_block = "10.0.0.0/8"
display_name = "internal"
},
{
cidr_block = "203.0.113.0/24"
display_name = "office-vpn"
}
]
labels = {
environment = "production"
team = "platform"
managed_by = "terraform"
cost_center = "engineering"
}

Step 3 — Networking Module

# modules/networking/main.tf
# ── VPC ──────────────────────────────────────────────────────
resource "google_compute_network" "vpc" {
name = "${var.cluster_name}-vpc"
project = var.project_id
auto_create_subnetworks = false
routing_mode = "GLOBAL"
description = "VPC for GKE cluster ${var.cluster_name}"
}
# ── Subnet ───────────────────────────────────────────────────
resource "google_compute_subnetwork" "subnet" {
name = "${var.cluster_name}-subnet"
project = var.project_id
region = var.region
network = google_compute_network.vpc.id
ip_cidr_range = var.subnet_cidr
private_ip_google_access = true # reach GCP APIs privately
# Secondary ranges for pods and services
secondary_ip_range {
range_name = "pods"
ip_cidr_range = var.pods_cidr
}
secondary_ip_range {
range_name = "services"
ip_cidr_range = var.services_cidr
}
# VPC Flow Logs
log_config {
aggregation_interval = "INTERVAL_5_SEC"
flow_sampling = 0.5
metadata = "INCLUDE_ALL_METADATA"
}
}
# ── Cloud Router ─────────────────────────────────────────────
resource "google_compute_router" "router" {
name = "${var.cluster_name}-router"
project = var.project_id
region = var.region
network = google_compute_network.vpc.id
bgp {
asn = 64514
}
}
# ── Cloud NAT (outbound internet for private nodes) ──────────
resource "google_compute_router_nat" "nat" {
name = "${var.cluster_name}-nat"
project = var.project_id
router = google_compute_router.router.name
region = var.region
nat_ip_allocate_option = "AUTO_ONLY"
source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
log_config {
enable = true
filter = "ERRORS_ONLY"
}
}
# ── Firewall Rules ────────────────────────────────────────────
# Allow internal traffic within VPC
resource "google_compute_firewall" "allow_internal" {
name = "${var.cluster_name}-allow-internal"
project = var.project_id
network = google_compute_network.vpc.id
allow {
protocol = "tcp"
ports = ["0-65535"]
}
allow {
protocol = "udp"
ports = ["0-65535"]
}
allow {
protocol = "icmp"
}
source_ranges = [
var.subnet_cidr,
var.pods_cidr,
var.services_cidr
]
description = "Allow internal VPC traffic"
}
# Allow GCP health checks (required for load balancers)
resource "google_compute_firewall" "allow_health_checks" {
name = "${var.cluster_name}-allow-health-checks"
project = var.project_id
network = google_compute_network.vpc.id
allow {
protocol = "tcp"
ports = ["10256", "8080", "8443"]
}
source_ranges = [
"35.191.0.0/16", # GCP LB health check range
"130.211.0.0/22" # GCP LB health check range
]
target_tags = ["gke-${var.cluster_name}"]
description = "Allow GCP load balancer health checks"
}
# Deny all ingress by default
resource "google_compute_firewall" "deny_all_ingress" {
name = "${var.cluster_name}-deny-all-ingress"
project = var.project_id
network = google_compute_network.vpc.id
priority = 65534
deny {
protocol = "all"
}
source_ranges = ["0.0.0.0/0"]
description = "Default deny all ingress"
}
# modules/networking/outputs.tf
output "network_id" {
value = google_compute_network.vpc.id
}
output "network_name" {
value = google_compute_network.vpc.name
}
output "subnet_id" {
value = google_compute_subnetwork.subnet.id
}
output "subnet_name" {
value = google_compute_subnetwork.subnet.name
}
output "pods_range_name" {
value = "pods"
}
output "services_range_name" {
value = "services"
}
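It may not be obvious why the deny-all rule at priority 65534 does not block the health checks. GCP evaluates the matching rule with the lowest priority number first, and rules that omit `priority` default to 1000. The Python sketch below models that evaluation for two of the rules in this module. It ignores target tags and protocols, and the data structure is an illustrative assumption.

```python
import ipaddress

# Simplified copies of two ingress rules from the networking module
RULES = [
    {"name": "allow-health-checks", "priority": 1000, "action": "allow",
     "sources": ["35.191.0.0/16", "130.211.0.0/22"], "ports": {10256, 8080, 8443}},
    {"name": "deny-all-ingress", "priority": 65534, "action": "deny",
     "sources": ["0.0.0.0/0"], "ports": None},  # None means every port
]


def evaluate(source_ip: str, port: int, rules=RULES) -> str:
    """Return 'allow' or 'deny' for an ingress packet."""
    address = ipaddress.ip_address(source_ip)
    matches = [
        rule for rule in rules
        if any(address in ipaddress.ip_network(cidr) for cidr in rule["sources"])
        and (rule["ports"] is None or port in rule["ports"])
    ]
    if not matches:
        return "deny"  # implied deny for ingress
    # Lowest priority number wins; at equal priority, deny beats allow
    winner = min(matches, key=lambda rule: (rule["priority"], rule["action"] != "deny"))
    return winner["action"]


print(evaluate("35.191.10.1", 8080))   # health check probe
print(evaluate("198.51.100.7", 8080))  # arbitrary internet source
```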

Step 4 — Security Module

# modules/security/main.tf
# ── Service Account for GKE Nodes ────────────────────────────
resource "google_service_account" "gke_nodes" {
account_id = "${var.cluster_name}-nodes"
display_name = "GKE Node Service Account"
project = var.project_id
description = "Minimal SA for GKE nodes"
}
# Minimal permissions for nodes
resource "google_project_iam_member" "node_permissions" {
for_each = toset([
"roles/logging.logWriter", # write logs
"roles/monitoring.metricWriter", # write metrics
"roles/monitoring.viewer", # read monitoring
"roles/stackdriver.resourceMetadata.writer",
"roles/artifactregistry.reader", # pull images
])
project = var.project_id
role = each.value
member = "serviceAccount:${google_service_account.gke_nodes.email}"
}
# ── KMS Key for etcd encryption ───────────────────────────────
resource "google_kms_key_ring" "gke" {
name = "${var.cluster_name}-keyring"
project = var.project_id
location = var.region
}
resource "google_kms_crypto_key" "etcd" {
name = "${var.cluster_name}-etcd-key"
key_ring = google_kms_key_ring.gke.id
rotation_period = "7776000s" # 90 days
lifecycle {
prevent_destroy = true
}
}
# Allow GKE to use KMS key
resource "google_kms_crypto_key_iam_member" "gke_kms" {
crypto_key_id = google_kms_crypto_key.etcd.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
member = "serviceAccount:service-${data.google_project.project.number}@container-engine-robot.iam.gserviceaccount.com"
}
data "google_project" "project" {
project_id = var.project_id
}
# ── Workload Identity Service Accounts ────────────────────────
# Example: Workload Identity for app teams
resource "google_service_account" "workload_identity_sa" {
for_each = var.workload_identity_bindings
account_id = each.key
display_name = each.value.display_name
project = var.project_id
}
resource "google_project_iam_member" "workload_sa_permissions" {
for_each = {
for binding in flatten([
for sa_name, sa_config in var.workload_identity_bindings : [
for role in sa_config.roles : {
key = "${sa_name}-${role}"
sa_name = sa_name
role = role
}
]
]) : binding.key => binding
}
project = var.project_id
role = each.value.role
member = "serviceAccount:${google_service_account.workload_identity_sa[each.value.sa_name].email}"
}
resource "google_service_account_iam_member" "workload_identity_binding" {
for_each = var.workload_identity_bindings
service_account_id = google_service_account.workload_identity_sa[each.key].name
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.project_id}.svc.id.goog[${each.value.namespace}/${each.value.ksa_name}]"
}
# modules/security/outputs.tf
output "node_sa_email" {
value = google_service_account.gke_nodes.email
}
output "kms_key_id" {
value = google_kms_crypto_key.etcd.id
}
output "workload_sa_emails" {
value = {
for name, sa in google_service_account.workload_identity_sa :
name => sa.email
}
}
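The nested `for` expression in `workload_sa_permissions` is the least readable part of this module. It expands each service account's role list into one entry per (account, role) pair, keyed so that `for_each` gets unique, stable keys. The Python sketch below mirrors that transformation; the sample binding is hypothetical, and its field names are inferred from how the module uses `var.workload_identity_bindings`.

```python
def flatten_role_bindings(bindings: dict) -> dict:
    """One entry per (service account, role) pair, keyed as '<sa_name>-<role>'."""
    return {
        f"{sa_name}-{role}": {"sa_name": sa_name, "role": role}
        for sa_name, config in bindings.items()
        for role in config["roles"]
    }


# Hypothetical input in the shape the module expects
bindings = {
    "orders-api": {
        "display_name": "Orders API",
        "namespace": "production",
        "ksa_name": "orders-api-ksa",
        "roles": ["roles/pubsub.subscriber", "roles/storage.objectViewer"],
    },
}

for key, value in flatten_role_bindings(bindings).items():
    print(key, value)
```

Keying by name and role rather than by list index means that adding or removing one role only touches that single IAM binding instead of shifting every entry after it.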

Step 5 — GKE Module

# modules/gke/main.tf
locals {
cluster_name = var.cluster_name
labels = merge(var.labels, {
managed_by = "terraform"
environment = var.environment
})
}
# ── GKE Cluster ──────────────────────────────────────────────
resource "google_container_cluster" "primary" {
provider = google-beta
name = local.cluster_name
project = var.project_id
location = var.region # regional for HA
# Remove default node pool
remove_default_node_pool = true
initial_node_count = 1
# Networking
network = var.network_id
subnetwork = var.subnet_id
# VPC-native — required for private clusters
ip_allocation_policy {
cluster_secondary_range_name = var.pods_range_name
services_secondary_range_name = var.services_range_name
}
# ── Private Cluster ─────────────────────────────────────
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = var.enable_private_endpoint
master_ipv4_cidr_block = var.master_cidr
master_global_access_config {
enabled = true # access from any region
}
}
# ── Authorized Networks ──────────────────────────────────
master_authorized_networks_config {
dynamic "cidr_blocks" {
for_each = var.authorized_networks
content {
cidr_block = cidr_blocks.value.cidr_block
display_name = cidr_blocks.value.display_name
}
}
gcp_public_cidrs_access_enabled = false
}
# ── Workload Identity ────────────────────────────────────
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# ── Security ─────────────────────────────────────────────
binary_authorization {
evaluation_mode = var.environment == "production" ? "PROJECT_SINGLETON_POLICY_ENFORCE" : "DISABLED"
}
# etcd encryption using KMS
database_encryption {
state = "ENCRYPTED"
key_name = var.kms_key_id
}
# ── Addons ───────────────────────────────────────────────
addons_config {
# HTTP load balancing (required for Ingress)
http_load_balancing {
disabled = false
}
# Horizontal Pod Autoscaling
horizontal_pod_autoscaling {
disabled = false
}
# CSI driver for persistent disks
gce_persistent_disk_csi_driver_config {
enabled = true
}
# GCS FUSE driver
gcs_fuse_csi_driver_config {
enabled = true
}
# DNS config
dns_cache_config {
enabled = true
}
}
# ── Network Policy ───────────────────────────────────────
# Dataplane V2 enforces NetworkPolicy natively, so the Calico
# network_policy block is omitted; the two should not be combined.
datapath_provider = "ADVANCED_DATAPATH" # eBPF via Cilium
# ── Logging and Monitoring ───────────────────────────────
logging_config {
enable_components = [
"SYSTEM_COMPONENTS",
"WORKLOADS",
"APISERVER",
"SCHEDULER",
"CONTROLLER_MANAGER"
]
}
monitoring_config {
enable_components = [
"SYSTEM_COMPONENTS",
# WORKLOADS is omitted here: it is deprecated for monitoring_config on recent GKE versions
"APISERVER",
"SCHEDULER",
"CONTROLLER_MANAGER",
"STORAGE",
"HPA",
"POD",
"DAEMONSET",
"DEPLOYMENT",
"STATEFULSET"
]
managed_prometheus {
enabled = true
}
advanced_datapath_observability_config {
enable_metrics = true
enable_relay = true
}
}
# ── Release Channel ──────────────────────────────────────
release_channel {
channel = var.release_channel
}
# ── Maintenance Window ───────────────────────────────────
maintenance_policy {
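# Note: GKE expects at least 48 hours of maintenance availability in any
# 32-day period; widen the window if two 4-hour weekend slots are rejected.
recurring_window {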
start_time = "2024-01-01T02:00:00Z"
end_time = "2024-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
}
maintenance_exclusion {
exclusion_name = "black-friday"
start_time = "2024-11-25T00:00:00Z"
end_time = "2024-12-02T00:00:00Z"
exclusion_options {
scope = "NO_UPGRADES"
}
}
}
# ── Cluster Autoscaling ──────────────────────────────────
cluster_autoscaling {
enabled = true
autoscaling_profile = "OPTIMIZE_UTILIZATION"
resource_limits {
resource_type = "cpu"
minimum = 4
maximum = 200
}
resource_limits {
resource_type = "memory"
minimum = 16
maximum = 800
}
auto_provisioning_defaults {
service_account = var.node_sa_email
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
management {
auto_repair = true
auto_upgrade = true
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
disk_size = 100
disk_type = "pd-ssd"
}
}
# ── Node Pool Defaults ───────────────────────────────────
node_pool_defaults {
node_config_defaults {
logging_variant = "MAX_THROUGHPUT"
}
}
# ── Security posture ─────────────────────────────────────
security_posture_config {
mode = "BASIC"
vulnerability_mode = "VULNERABILITY_BASIC"
}
resource_labels = local.labels
lifecycle {
ignore_changes = [
initial_node_count,
node_pool,
]
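# prevent_destroy must be a literal value; Terraform rejects variables and
# expressions here. Set it to false in a non-production copy of the module.
prevent_destroy = true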
}
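# No explicit depends_on is needed: the cluster already references
# var.node_sa_email, which orders it after the node service account.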
}
# ── Node Pools ────────────────────────────────────────────────
resource "google_container_node_pool" "pools" {
for_each = var.node_pools
name = each.key
project = var.project_id
cluster = google_container_cluster.primary.name
location = var.region
# Autoscaling
autoscaling {
min_node_count = each.value.min_nodes
max_node_count = each.value.max_nodes
location_policy = "BALANCED" # spread across zones
}
# Auto-repair and upgrade
management {
auto_repair = true
auto_upgrade = true
}
# Surge upgrade — zero downtime
upgrade_settings {
strategy = "SURGE"
max_surge = 1
max_unavailable = 0
}
node_config {
machine_type = each.value.machine_type
disk_size_gb = each.value.disk_size_gb
disk_type = each.value.disk_type
spot = each.value.spot
image_type = "COS_CONTAINERD"
service_account = var.node_sa_email
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
# Workload Identity
workload_metadata_config {
mode = "GKE_METADATA"
}
# Shielded nodes
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
# Taints
dynamic "taint" {
for_each = each.value.taints
content {
key = taint.value.key
value = taint.value.value
effect = taint.value.effect
}
}
# Labels
labels = merge(local.labels, each.value.labels)
# Network tags
tags = [
"gke-${local.cluster_name}",
"gke-${local.cluster_name}-${each.key}"
]
# Metadata
metadata = {
disable-legacy-endpoints = "true"
}
# Resource labels for cost tracking
resource_labels = merge(local.labels, {
pool = each.key
spot = tostring(each.value.spot)
})
}
lifecycle {
ignore_changes = [
node_config[0].resource_labels,
]
}
}
# modules/gke/outputs.tf
output "cluster_name" {
description = "GKE cluster name"
value = google_container_cluster.primary.name
}
output "endpoint" {
description = "Cluster endpoint"
value = google_container_cluster.primary.endpoint
sensitive = true
}
output "ca_certificate" {
description = "Cluster CA certificate"
value = google_container_cluster.primary.master_auth[0].cluster_ca_certificate
sensitive = true
}
output "cluster_id" {
description = "Cluster resource ID"
value = google_container_cluster.primary.id
}
output "workload_identity_pool" {
description = "Workload Identity pool"
value = "${var.project_id}.svc.id.goog"
}
output "node_pool_names" {
description = "Node pool names"
value = [for np in google_container_node_pool.pools : np.name]
}
output "get_credentials_command" {
description = "Command to get cluster credentials"
value = "gcloud container clusters get-credentials ${google_container_cluster.primary.name} --region ${var.region} --project ${var.project_id}"
}

Step 6 — Root Module

# main.tf
# ── Networking ────────────────────────────────────────────────
module "networking" {
source = "./modules/networking"
project_id = var.project_id
region = var.region
cluster_name = var.cluster_name
subnet_cidr = var.subnet_cidr
pods_cidr = var.pods_cidr
services_cidr = var.services_cidr
labels = var.labels
}
# ── Security ──────────────────────────────────────────────────
module "security" {
source = "./modules/security"
project_id = var.project_id
region = var.region
cluster_name = var.cluster_name
environment = var.environment
workload_identity_bindings = {
"app-sa" = {
display_name = "Application Service Account"
namespace = "production"
ksa_name = "app-ksa"
roles = [
"roles/storage.objectViewer",
"roles/secretmanager.secretAccessor"
]
}
}
}
# ── GKE Cluster ───────────────────────────────────────────────
module "gke" {
source = "./modules/gke"
project_id = var.project_id
region = var.region
cluster_name = var.cluster_name
environment = var.environment
# Networking
network_id = module.networking.network_id
subnet_id = module.networking.subnet_id
pods_range_name = module.networking.pods_range_name
services_range_name = module.networking.services_range_name
master_cidr = var.master_cidr
# Security
kms_key_id = module.security.kms_key_id
node_sa_email = module.security.node_sa_email
authorized_networks = var.authorized_networks
enable_private_endpoint = var.enable_private_endpoint
# Node pools
node_pools = var.node_pools
release_channel = "REGULAR"
labels = var.labels
depends_on = [
module.networking,
module.security
]
}
# outputs.tf
output "cluster_name" {
description = "GKE cluster name"
value = module.gke.cluster_name
}
output "cluster_endpoint" {
description = "Cluster API endpoint"
value = module.gke.endpoint
sensitive = true
}
output "get_credentials" {
description = "Command to configure kubectl"
value = module.gke.get_credentials_command
}
output "workload_identity_pool" {
description = "Workload Identity pool"
value = module.gke.workload_identity_pool
}
output "node_sa_email" {
description = "Node service account email"
value = module.security.node_sa_email
}

Step 7 — Environment Configs

# environments/dev.tfvars
project_id = "mycompany-dev"
environment = "dev"
cluster_name = "dev-cluster"
region = "us-central1"
subnet_cidr = "10.10.0.0/20"
pods_cidr = "10.10.16.0/20"
services_cidr = "10.10.32.0/20"
master_cidr = "172.16.0.32/28"
node_pools = {
application = {
machine_type = "n2-standard-2" # smaller for dev
min_nodes = 0 # scale to zero
max_nodes = 5
disk_size_gb = 50
disk_type = "pd-standard" # cheaper disk
spot = true # spot VMs in dev
taints = []
labels = { pool = "application", env = "dev" }
}
}
# Open to all IPs for dev convenience only; prefer office/VPN CIDRs even in dev.
authorized_networks = [
{
cidr_block = "0.0.0.0/0"
display_name = "all-for-dev"
}
]
enable_private_endpoint = false
labels = {
environment = "dev"
team = "platform"
managed_by = "terraform"
}
# environments/production.tfvars
project_id = "mycompany-prod"
environment = "production"
cluster_name = "prod-cluster"
region = "us-central1"
subnet_cidr = "10.0.0.0/20"
pods_cidr = "10.4.0.0/14"
services_cidr = "10.0.16.0/20"
master_cidr = "172.16.0.0/28"
node_pools = {
system = {
machine_type = "n2-standard-2"
min_nodes = 1
max_nodes = 3
disk_size_gb = 50
disk_type = "pd-ssd"
spot = false
taints = [{
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}]
labels = { pool = "system" }
}
application = {
machine_type = "n2-standard-8"
min_nodes = 3
max_nodes = 50
disk_size_gb = 100
disk_type = "pd-ssd"
spot = false
taints = []
labels = { pool = "application" }
}
spot = {
machine_type = "n2-standard-4"
min_nodes = 0
max_nodes = 20
disk_size_gb = 100
disk_type = "pd-ssd"
spot = true
taints = [{
key = "cloud.google.com/gke-spot"
value = "true"
effect = "NO_SCHEDULE"
}]
labels = { pool = "spot" }
}
}
authorized_networks = [
{
cidr_block = "10.0.0.0/8"
display_name = "internal"
},
{
cidr_block = "203.0.113.0/24"
display_name = "office-vpn"
}
]
enable_private_endpoint = false
labels = {
environment = "production"
team = "platform"
managed_by = "terraform"
cost_center = "engineering"
criticality = "high"
}
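
The subnet, pod, and service ranges in each tfvars file must not overlap one another. A minimal sketch of a pre-flight check follows; the function names (`ip_to_int`, `cidr_range`, `cidrs_overlap`) are hypothetical helpers, not part of Terraform or gcloud, and the sketch handles IPv4 only.

```shell
# Hypothetical helpers for checking IPv4 CIDR overlap before running terraform plan.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( a * 16777216 + b * 65536 + c * 256 + d ))
}

# Print "start end" (inclusive integer bounds) for a CIDR such as 10.0.0.0/20.
cidr_range() {
  local ip bits base size start
  ip=${1%/*}
  bits=${1#*/}
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - bits) ))
  # Align the address down to the network boundary.
  start=$(( base / size * size ))
  echo "$start $(( start + size - 1 ))"
}

# Return 0 if the two CIDRs overlap, 1 otherwise.
cidrs_overlap() {
  local r1 r2 s1 e1 s2 e2
  r1=$(cidr_range "$1")
  r2=$(cidr_range "$2")
  s1=${r1% *}; e1=${r1#* }
  s2=${r2% *}; e2=${r2#* }
  [ "$s1" -le "$e2" ] && [ "$s2" -le "$e1" ]
}
```

For example, `cidrs_overlap 10.0.0.0/20 10.0.16.0/20` returns non-zero for the production subnet and services ranges, so they are safe to use together.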

Step 8 — GitHub Actions CI/CD

# .github/workflows/terraform-gke.yml
name: Deploy GKE
on:
  push:
    branches: [main]
    paths:
      - 'gke-terraform/**'
  pull_request:
    branches: [main]
    paths:
      - 'gke-terraform/**'
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        required: true
        type: choice
        options: [dev, staging, production]
      action:
        description: Terraform action
        required: true
        type: choice
        options: [plan, apply, destroy]
env:
  TF_VERSION: "1.6.0"
  WORKING_DIR: gke-terraform
jobs:
  # ── Validate ────────────────────────────────────────────────
  validate:
    name: Validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Format check
        run: terraform fmt -check -recursive
        working-directory: ${{ env.WORKING_DIR }}
      - name: Validate
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: ${{ env.WORKING_DIR }}
  # ── Security scan ───────────────────────────────────────────
  security:
    name: Security Scan
    runs-on: ubuntu-latest
    needs: validate
    steps:
      - uses: actions/checkout@v4
      - name: Checkov scan
        # Pin to a release tag rather than a moving branch
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: ${{ env.WORKING_DIR }}
          framework: terraform
          soft_fail: false
      - name: tfsec scan
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: ${{ env.WORKING_DIR }}
  # ── Plan ────────────────────────────────────────────────────
  plan:
    name: Plan ${{ matrix.environment }}
    runs-on: ubuntu-latest
    needs: [validate, security]
    strategy:
      matrix:
        # Each entry needs a matching environments/<name>.tfvars file
        environment: [dev, staging, production]
    permissions:
      contents: read
      id-token: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          # Expressions have no upper() function. Secret names are not case-sensitive,
          # so this resolves DEV_PROJECT_ID, STAGING_PROJECT_ID, and PRODUCTION_PROJECT_ID.
          service_account: terraform-sa@${{ secrets[format('{0}_PROJECT_ID', matrix.environment)] }}.iam.gserviceaccount.com
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Init
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}" \
            -backend-config="prefix=gke/${{ matrix.environment }}"
        working-directory: ${{ env.WORKING_DIR }}
      - name: Terraform Plan
        id: plan
        run: |
          terraform plan \
            -var-file="environments/${{ matrix.environment }}.tfvars" \
            -out=tfplan-${{ matrix.environment }} \
            -no-color \
            2>&1 | tee plan-output.txt
          # The pipeline reports tee's status, so capture terraform's own and fail on it
          status=${PIPESTATUS[0]}
          echo "exitcode=${status}" >> "$GITHUB_OUTPUT"
          exit "${status}"
        working-directory: ${{ env.WORKING_DIR }}
      - name: Upload plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: ${{ env.WORKING_DIR }}/tfplan-${{ matrix.environment }}
          retention-days: 1
      - name: Post plan to PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync(
              '${{ env.WORKING_DIR }}/plan-output.txt',
              'utf8'
            );
            const maxLen = 65000;
            const truncated = plan.length > maxLen
              ? plan.substring(0, maxLen) + '\n... truncated'
              : plan;
            // Await the request so an API error fails the step
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan — \`${{ matrix.environment }}\`
            \`\`\`
            ${truncated}
            \`\`\``
            });
  # ── Apply Dev ───────────────────────────────────────────────
  apply-dev:
    name: Apply Dev
    runs-on: ubuntu-latest
    needs: plan
    if: github.ref == 'refs/heads/main'
    environment: dev
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.DEV_TF_SA }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Download plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan-dev
          path: ${{ env.WORKING_DIR }}
      - name: Terraform Init
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}" \
            -backend-config="prefix=gke/dev"
        working-directory: ${{ env.WORKING_DIR }}
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan-dev
        working-directory: ${{ env.WORKING_DIR }}
      # Terraform outputs are not exposed as step outputs, so fetch credentials directly.
      # Cluster name and region match environments/dev.tfvars.
      - name: Configure kubectl
        uses: google-github-actions/get-gke-credentials@v2
        with:
          cluster_name: dev-cluster
          location: us-central1
          project_id: ${{ secrets.DEV_PROJECT_ID }}
      - name: Verify cluster
        run: |
          kubectl get nodes
          kubectl get pods -A
  # ── Apply Production (requires approval) ────────────────────
  apply-production:
    name: Apply Production
    runs-on: ubuntu-latest
    needs: apply-dev
    environment: production # approval gate
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.PROD_TF_SA }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Download plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan-production
          path: ${{ env.WORKING_DIR }}
      - name: Terraform Init
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}" \
            -backend-config="prefix=gke/production"
        working-directory: ${{ env.WORKING_DIR }}
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan-production
        working-directory: ${{ env.WORKING_DIR }}
      - name: Notify deployment
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": "✅ GKE Production cluster deployed successfully\nCluster: prod-cluster\nBy: ${{ github.actor }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          # Needed when the URL is a Slack incoming webhook that takes a free-form payload
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
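
The plan step above pipes `terraform plan` through `tee`, and a shell pipeline normally reports the exit status of its last command. The sketch below shows one way to keep the log while still returning the first command's status. `run_and_log` is a hypothetical helper name, and the sketch assumes bash because it relies on `PIPESTATUS`.

```shell
# Hypothetical helper: run a command, mirror its output into a log file,
# and return the command's own exit status rather than tee's.
run_and_log() {
  local log_file=$1
  shift
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS[0] holds the status of the first command in the pipeline above.
  return "${PIPESTATUS[0]}"
}
```

Used as `run_and_log plan-output.txt terraform plan -var-file="environments/dev.tfvars" -out=tfplan-dev -no-color`, a failed plan makes the call return non-zero, while the output is still written to `plan-output.txt`.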

Step 9 — Deploy and Verify

# ── Initial setup ────────────────────────────────────────────
# Create state bucket
gsutil mb -p mycompany-prod \
-l us-central1 \
gs://mycompany-terraform-state
# Enable versioning on state bucket
gsutil versioning set on \
gs://mycompany-terraform-state
# Enable required GCP APIs (repeat with --project mycompany-dev before deploying dev)
gcloud services enable \
container.googleapis.com \
compute.googleapis.com \
iam.googleapis.com \
cloudkms.googleapis.com \
secretmanager.googleapis.com \
artifactregistry.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com \
--project mycompany-prod
# ── Deploy dev ───────────────────────────────────────────────
cd gke-terraform
terraform init \
-backend-config="bucket=mycompany-terraform-state" \
-backend-config="prefix=gke/dev"
terraform plan \
-var-file="environments/dev.tfvars" \
-out=tfplan-dev
terraform apply tfplan-dev
# ── Get cluster credentials ──────────────────────────────────
gcloud container clusters get-credentials dev-cluster \
--region us-central1 \
--project mycompany-dev
# ── Verify cluster ───────────────────────────────────────────
kubectl get nodes
# NAME STATUS ROLES AGE
# gke-dev-cluster-application-xxx Ready <none> 2m
# gke-dev-cluster-application-yyy Ready <none> 2m
kubectl get pods -A
# All system pods should be Running
# Check node pool details
kubectl get nodes -L cloud.google.com/gke-nodepool,topology.kubernetes.io/zone
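# Verify Workload Identity
# (pods can impersonate the GSA only once it grants roles/iam.workloadIdentityUser
# to mycompany-dev.svc.id.goog[default/test-ksa])
kubectl create serviceaccount test-ksa -n default
kubectl annotate serviceaccount test-ksa -n default \
iam.gke.io/gcp-service-account=app-sa@mycompany-dev.iam.gserviceaccount.com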
# ── Verify private cluster ───────────────────────────────────
kubectl get nodes -o wide
# EXTERNAL-IP should show <none> for private nodes
# ── Check cluster security ───────────────────────────────────
gcloud container clusters describe dev-cluster \
--region us-central1 \
--project mycompany-dev \
--format="yaml(masterAuth,networkConfig,privateClusterConfig)"
# ── Destroy dev when done ────────────────────────────────────
terraform destroy \
-var-file="environments/dev.tfvars" \
-auto-approve
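
The Workload Identity check above creates and annotates `test-ksa`, but impersonation also depends on an IAM binding whose member string follows the `PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]` format. As a sketch, the hypothetical helpers below build that member and print the matching `gcloud` command for review rather than running it.

```shell
# Hypothetical helpers; they only print strings and never call gcloud themselves.

# Build the IAM member that identifies a Kubernetes service account
# to Google Cloud IAM through Workload Identity.
wi_member() {
  local project_id=$1 namespace=$2 ksa=$3
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$project_id" "$namespace" "$ksa"
}

# Print the binding command that lets the KSA impersonate the given GSA.
wi_binding_command() {
  local project_id=$1 namespace=$2 ksa=$3 gsa_email=$4
  printf 'gcloud iam service-accounts add-iam-policy-binding %s --project %s --role roles/iam.workloadIdentityUser --member "%s"\n' \
    "$gsa_email" "$project_id" "$(wi_member "$project_id" "$namespace" "$ksa")"
}
```

For the test account used above, `wi_binding_command mycompany-dev default test-ksa app-sa@mycompany-dev.iam.gserviceaccount.com` prints the command to run before starting a pod with `serviceAccountName: test-ksa`.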

Terraform State Management

# List state resources
terraform state list
# Show specific resource
terraform state show module.gke.google_container_cluster.primary
# Move resource (refactoring); single quotes stop the shell from interpreting the brackets and quotes
terraform state mv \
'module.gke.google_container_node_pool.pools["app"]' \
'module.gke.google_container_node_pool.pools["application"]'
# Import existing cluster
terraform import \
module.gke.google_container_cluster.primary \
projects/mycompany-prod/locations/us-central1/clusters/prod-cluster
# Remove from state (without destroying)
terraform state rm \
'module.gke.google_container_node_pool.pools["old-pool"]'
# Backup state
gsutil cp \
gs://mycompany-terraform-state/gke/production/default.tfstate \
./backup-$(date +%Y%m%d).tfstate
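
The backup path above follows the GCS backend's layout, where each workspace's state is stored as `<prefix>/<workspace>.tfstate` and the default workspace is named `default`. A small sketch, with `state_object_url` as a hypothetical helper name, derives that path for any environment:

```shell
# Hypothetical helper: print the GCS URL of a Terraform state object.
# Arguments: bucket, backend prefix, and an optional workspace (defaults to "default").
state_object_url() {
  local bucket=$1 prefix=$2 workspace=${3:-default}
  # Strip a trailing slash from the prefix so the path has no empty segment.
  printf 'gs://%s/%s/%s.tfstate\n' "$bucket" "${prefix%/}" "$workspace"
}
```

For example, `gsutil cp "$(state_object_url mycompany-terraform-state gke/dev)" ./backup-dev.tfstate` would copy the dev state, assuming the same bucket and prefix used in the init commands earlier.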

Common Issues and Fixes

# Issue 1 — API not enabled
# Error: googleapi: Error 403: ... has not been used in project
gcloud services enable container.googleapis.com
# Issue 2 — Insufficient permissions
# Error: Error creating Cluster: googleapi: Error 403
gcloud projects add-iam-policy-binding mycompany-prod \
--member="serviceAccount:terraform-sa@mycompany-prod.iam.gserviceaccount.com" \
--role="roles/container.admin"
# Issue 3 — KMS key permission
# Error: PERMISSION_DENIED: The caller does not have permission
# (123 below stands for the numeric project number; the member is the GKE service agent)
gcloud kms keys add-iam-policy-binding etcd-key \
--keyring=prod-cluster-keyring \
--location=us-central1 \
--member="serviceAccount:service-123@container-engine-robot.iam.gserviceaccount.com" \
--role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
# Issue 4 — Node pool update requires replacement
# Use lifecycle ignore_changes or blue/green node pool strategy
terraform plan -target=module.gke.google_container_node_pool.pools
# Issue 5 — Cluster stuck deleting
# Drop the cluster from Terraform state, then delete it directly with gcloud
terraform state rm module.gke.google_container_cluster.primary
gcloud container clusters delete prod-cluster \
--region us-central1 \
--project mycompany-prod
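
The KMS fix in Issue 3 needs the GKE service agent's email, which is derived from the numeric project number rather than the project ID. As a sketch, the hypothetical helper below builds that address and rejects non-numeric input:

```shell
# Hypothetical helper: build the GKE service agent email from a project number.
gke_service_agent() {
  local project_number=$1
  # A project ID such as "mycompany-prod" is not valid here; only digits are.
  case $project_number in
    ''|*[!0-9]*)
      echo "expected a numeric project number, got: $project_number" >&2
      return 1
      ;;
  esac
  printf 'service-%s@container-engine-robot.iam.gserviceaccount.com\n' "$project_number"
}
```

The project number can be looked up with `gcloud projects describe mycompany-prod --format='value(projectNumber)'` and passed to `gke_service_agent` to produce the `--member` value.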

This gives you a GKE deployment built for production use: a private cluster with VPC-native networking, Workload Identity, KMS encryption, multiple node pools, and a CI/CD pipeline with security scanning, environment promotion, and approval gates.