Enterprise RAG: Streamlining Internal AI on GCP

What is RAG?

Retrieval-Augmented Generation (RAG) = give an LLM access to your private data at query time, so it answers based on your documents — not just its training data.

User Question → Search your knowledge base → Feed relevant docs to LLM → Grounded Answer

GCP-Native RAG Architecture (Full Stack)

┌─────────────────────────────────────────────────────────────┐
│                       USER INTERFACE                        │
│           (Web App / Slack Bot / Internal Portal)           │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                          API LAYER                          │
│                 Cloud Run / Cloud Functions                 │
└──────┬───────────────┬──────────────────┬───────────────────┘
       ↓               ↓                  ↓
┌────────────┐  ┌─────────────┐   ┌──────────────────┐
│ Retrieval  │  │  LLM Layer  │   │ Auth & Security  │
│  Engine    │  │ (Vertex AI) │   │   (IAM / IAP)    │
└──────┬─────┘  └─────────────┘   └──────────────────┘
       ↓
┌─────────────────────────────────────────────────────────────┐
│                        VECTOR STORE                         │
│        Vertex AI Vector Search / AlloyDB / pgvector         │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                  KNOWLEDGE BASE (Raw Docs)                  │
│     GCS Buckets │ BigQuery │ Drive │ Confluence │ Jira      │
└─────────────────────────────────────────────────────────────┘

GCP Services Mapping

RAG Component                GCP Service
Document Storage             Cloud Storage (GCS)
Embedding Model              Vertex AI Embeddings (text-embedding-005)
Vector Store                 Vertex AI Vector Search or AlloyDB pgvector
LLM                          Vertex AI Gemini 1.5 Pro / Flash
Orchestration                Cloud Run, Cloud Functions, or Vertex AI Pipelines
Document parsing             Document AI
Data ingestion pipeline      Dataflow / Cloud Composer (Airflow)
Metadata & structured data   BigQuery
Auth & access control        IAM, Identity-Aware Proxy (IAP)
Monitoring                   Cloud Logging, Cloud Monitoring, Vertex AI Model Monitoring
Secret management            Secret Manager

Phase 1 — Document Ingestion Pipeline

[ Raw Documents ]
  GCS / Drive / Confluence / SharePoint
        ↓
[ Document AI ]            ← OCR, form parsing, table extraction
        ↓
[ Chunking & Cleaning ]    ← split into ~512-token chunks with overlap
        ↓
[ Vertex AI Embeddings ]   ← text-embedding-005 → one vector per chunk
        ↓
[ Vector Store ]
  Vertex AI Vector Search (managed) or AlloyDB + pgvector (flexible)
        ↓
[ Metadata → BigQuery ]    ← source, timestamp, doc_id, chunk_id
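
To make the embedding step concrete, here is a minimal sketch using the vertexai SDK; the project ID, region, and batch size are placeholder assumptions:

import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="my-project", location="us-central1")  # hypothetical project/region
model = TextEmbeddingModel.from_pretrained("text-embedding-005")

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed document chunks in small batches (the API accepts multiple texts per call)."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), 5):  # batch size of 5 is an arbitrary, safe choice
        batch = chunks[i : i + 5]
        vectors.extend(e.values for e in model.get_embeddings(batch))
    return vectors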

Chunking Strategy (Critical for Quality)

Strategy                               Best for
Fixed size (512 tokens, 20% overlap)   General documents
Semantic chunking                      Mixed-content docs
Sentence-level                         FAQs, support docs
Section/header-based                   Structured docs (manuals, wikis)
Parent-child chunking                  Retrieve child, return parent context
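
As an illustration of the first strategy, here is a minimal fixed-size chunker; it approximates tokens with whitespace-split words, whereas a production pipeline would use a real tokenizer:

def chunk_text(text: str, chunk_size: int = 512, overlap_ratio: float = 0.2) -> list[str]:
    """Split text into fixed-size chunks with overlap (word-based token approximation)."""
    words = text.split()
    step = int(chunk_size * (1 - overlap_ratio))  # 512 "tokens" with 20% overlap → step of ~410
    chunks = []
    for start in range(0, len(words), step):
        window = words[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks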

Phase 2 — Retrieval Engine

# Simplified RAG retrieval flow on GCP
def retrieve(query: str, top_k: int = 5):
    # 1. Embed the user query
    query_embedding = vertexai_embed(query)  # text-embedding-005

    # 2. Vector similarity search
    results = vector_search.find_neighbors(
        embedding=query_embedding,
        num_neighbors=top_k,
    )

    # 3. Optional: re-rank results
    reranked = rerank(query, results)  # Vertex AI Ranking API

    # 4. Fetch full chunk text from GCS / BigQuery
    chunks = fetch_chunks(reranked)
    return chunks

Retrieval Techniques (Use in Combination)

Technique               What it does
Dense retrieval         Vector similarity (semantic search)
Sparse retrieval        BM25 keyword search
Hybrid search           Dense + sparse combined (best quality)
Re-ranking              Vertex AI Ranking API re-orders top results
HyDE                    LLM generates hypothetical answer → embed that for retrieval
Multi-query retrieval   LLM generates N query variants → retrieve for all
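
Hybrid search raises the question of how to merge the dense and sparse result lists; reciprocal rank fusion (RRF) is a standard answer. A minimal sketch, with hypothetical document IDs:

def rrf_merge(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: combine two ranked lists without tuning score scales."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked highly by both lists floats to the top.
# rrf_merge(["d3", "d1", "d7"], ["d1", "d9", "d3"]) → ["d1", "d3", "d9", "d7"]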

Phase 3 — Generation (LLM Layer)

from vertexai.generative_models import GenerativeModel

gemini_pro = GenerativeModel("gemini-1.5-pro")

def generate_answer(query: str, chunks: list):
    context = "\n\n".join(c.text for c in chunks)
    prompt = f"""
You are an internal AI assistant for Acme Corp.
Answer ONLY based on the provided context.
If the answer is not in the context, say "I don't have that information."
Always cite the source document.

CONTEXT:
{context}

QUESTION:
{query}

ANSWER:
"""
    response = gemini_pro.generate_content(prompt)
    return response.text

Gemini Models on Vertex AI

Model              Best for
Gemini 1.5 Pro     Complex reasoning, long documents (1M context)
Gemini 1.5 Flash   Fast, cost-efficient responses
Gemini 1.0 Pro     Simpler Q&A tasks
Claude on Vertex   Alternative via Model Garden

Phase 4 — API & Serving Layer

Cloud Run (containerized FastAPI)
├── POST /chat → RAG query endpoint
├── POST /ingest → Trigger document ingestion
├── GET /sources → List indexed documents
└── GET /health → Health check

Cloud Run is ideal because:

  • Serverless, scales to zero
  • Fast cold starts
  • Easy CI/CD via Cloud Build
  • Integrates with IAP for auth
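
A minimal sketch of the FastAPI service behind those endpoints, reusing the retrieve() and generate_answer() functions from Phases 2 and 3 (the request schema and the doc_id field on chunks are assumptions):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/chat")
def chat(req: ChatRequest):
    # Retrieval (Phase 2) + generation (Phase 3), returned with source citations.
    chunks = retrieve(req.query, top_k=req.top_k)
    answer = generate_answer(req.query, chunks)
    return {"answer": answer, "sources": [c.doc_id for c in chunks]}

@app.get("/health")
def health():
    return {"status": "ok"}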

Phase 5 — Internal AI Assistant UI

Options for the frontend:

Option                       Best for
Cloud Run + React/Next.js    Custom internal portal
Slack Bot                    Teams already using Slack
Google Chat Bot              Google Workspace shops
Vertex AI Agent Builder      No-code, managed RAG UI
Looker / Data Studio embed   Analytics-heavy teams

Enterprise-Grade Features

1. Access Control (Critical)

IAM Roles → control who can call the RAG API
IAP → protect the web UI (Google SSO)
Document-level ACL → filter retrieved chunks by the user's permissions (see the sketch after this list)
VPC Service Controls → isolate all GCP services in a perimeter
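
A minimal sketch of that document-level filter, applied after retrieval. Here allowed_docs() is a hypothetical lookup against your ACL source; in practice you would push the filter into the vector store query itself (e.g. Vector Search metadata restricts) rather than filter afterwards:

def filter_by_acl(chunks: list, user_groups: set[str]) -> list:
    """Drop retrieved chunks the calling user is not allowed to see."""
    permitted = allowed_docs(user_groups)  # hypothetical: returns the set of permitted doc_ids
    return [c for c in chunks if c.doc_id in permitted]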

2. Observability Stack

Cloud Logging → all query logs, errors
Cloud Monitoring → latency, throughput, error rate dashboards
BigQuery → store all Q&A pairs for analysis (see the sketch after this list)
Vertex AI Evals → measure answer quality over time
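
A minimal sketch of the Q&A logging step with the google-cloud-bigquery client; the dataset and table names are hypothetical:

from datetime import datetime, timezone
from google.cloud import bigquery

bq = bigquery.Client()

def log_interaction(query: str, answer: str, sources: list[str]) -> None:
    """Append one Q&A record to a BigQuery table for offline analysis."""
    row = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": answer,
        "sources": sources,  # assumes a REPEATED STRING column
    }
    errors = bq.insert_rows_json("my-project.rag_logs.qa_pairs", [row])  # hypothetical table
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")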

3. Guardrails

Vertex AI Safety Filters → block harmful outputs
Grounding checks → ensure the answer comes from the retrieved context (see the sketch after this list)
Confidence scoring → flag low-confidence answers for human review
Citation enforcement → always return source doc + page
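
As one example, here is a deliberately naive grounding heuristic based on vocabulary overlap; it is a cheap stand-in for model-based grounding checks or the evaluation services mentioned later, not a replacement:

import re

def looks_grounded(answer: str, context: str, threshold: float = 0.5) -> bool:
    """Flag answers whose words barely appear in the retrieved context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    answer_words = re.findall(r"\w+", answer.lower())
    if not answer_words:
        return False
    overlap = sum(w in context_words for w in answer_words) / len(answer_words)
    return overlap >= threshold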

Full GCP RAG Stack — Production Setup

┌─ INGESTION (Batch + Real-time) ────────────────────────────────┐
│ Cloud Composer (Airflow) → Document AI → Embeddings → VectorDB │
└────────────────────────────────────────────────────────────────┘
┌─ SERVING ──────────────────────────────────────────────────────┐
│ Cloud Run (FastAPI RAG service)                                │
│   ├── Vertex AI Vector Search (retrieval)                      │
│   ├── Vertex AI Ranking API (re-rank)                          │
│   └── Gemini 1.5 Pro (generation)                              │
└────────────────────────────────────────────────────────────────┘
┌─ FRONTEND ─────────────────────────────────────────────────────┐
│ Next.js on Cloud Run + IAP (Google SSO)                        │
│ or Slack / Google Chat Bot                                     │
└────────────────────────────────────────────────────────────────┘
┌─ OBSERVABILITY ────────────────────────────────────────────────┐
│ Cloud Logging → BigQuery → Looker Dashboard                    │
└────────────────────────────────────────────────────────────────┘

Vertex AI Agent Builder (Managed RAG — Fastest Path)

If you want to skip building from scratch, GCP offers a fully managed RAG solution:

  1. Upload docs to GCS
  2. Create a Data Store in Agent Builder
  3. Create an Agent and attach the data store
  4. Deploy — get a chat UI + API instantly

Great for POCs and internal tools where customization isn’t critical.


Cost Optimization Tips

Tip                                          Saving
Use Gemini Flash for simple Q&A              ~10x cheaper than Pro
Cache frequent queries (Memorystore/Redis)   Fewer LLM calls (sketch below)
Batch-embed documents overnight              Lower embedding costs
Limit top_k retrieval chunks                 Smaller context = fewer tokens
Use committed use discounts on Vertex AI     Up to 20% off
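
A minimal sketch of the caching tip, using redis-py against a Memorystore instance; the host IP and TTL are hypothetical:

import hashlib
import redis

r = redis.Redis(host="10.0.0.3", port=6379)  # hypothetical Memorystore for Redis endpoint

def cached_answer(query: str, ttl_seconds: int = 3600) -> str:
    """Serve repeated questions from cache; fall through to the RAG pipeline on a miss."""
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    chunks = retrieve(query)                 # Phase 2
    answer = generate_answer(query, chunks)  # Phase 3
    r.set(key, answer, ex=ttl_seconds)
    return answer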

RAG Quality Evaluation

Always measure these metrics:

Metric              What it measures
Faithfulness        Is the answer grounded in the retrieved docs?
Answer Relevance    Does it actually answer the question?
Context Precision   Are retrieved chunks relevant?
Context Recall      Did retrieval find all needed info?

Tools: RAGAS framework, Vertex AI Evaluation Service, custom BigQuery dashboards.
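
For the two retrieval metrics, a hand-rolled version over a labeled eval set is an easy starting point before adopting RAGAS; the use of doc_ids as relevance labels is an assumption:

def context_precision_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Context precision and recall for one query, given labeled relevant doc_ids."""
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 3 of 5 retrieved chunks are relevant, and they cover 3 of the
# 4 labeled relevant docs → (0.6, 0.75).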


Timeline for Enterprise RAG on GCP

Phase          Timeline     Deliverable
POC            1–2 weeks    Agent Builder + sample docs
MVP            4–6 weeks    Cloud Run RAG API + basic UI
Production     8–12 weeks   Full pipeline, auth, monitoring
Optimization   Ongoing      Eval loop, fine-tuning, cost control

This is a battle-tested architecture used by enterprises running internal knowledge assistants, HR bots, IT support agents, and compliance Q&A systems on GCP.
