Peering strategy and traffic flow

In Microsoft Azure, networking strategy revolves around how you connect virtual networks (VNets) and how data actually travels between them. Understanding this is key to building a scalable, secure environment.


1. VNet Peering: The Fundamentals

VNet Peering connects two Azure Virtual Networks over the Microsoft backbone network. Once peered, the two networks behave as one for connectivity purposes.

  • Regional Peering: Connects VNets within the same Azure region.
  • Global Peering: Connects VNets across different Azure regions.

Key Characteristics:

  • Low Latency: Traffic stays on the private Microsoft fiber; it never touches the public internet.
  • Performance: Offers the same high-bandwidth connection as if the resources were in the same network.
  • No Gateway Required: Unlike VPNs, peering doesn’t require a virtual network gateway unless you are “chaining” transit traffic.

2. Common Peering Strategies

Hub-and-Spoke Topology

This is the “gold standard” for enterprise architecture.

  • The Hub: A central VNet that hosts shared services (Firewalls, ExpressRoute gateways, DNS).
  • The Spokes: Individual VNets (e.g., for different departments or apps) that peer with the Hub.
  • The Benefit: It centralizes security and reduces costs by sharing expensive resources like an Azure Firewall.

Mesh Topology

Every VNet is peered directly to every other VNet.

  • Use Case: Small environments with very few VNets (three or fewer).
  • The Downside: It becomes a management nightmare as you scale, because peering is not transitive. If VNet A is peered with B, and B is peered with C, A and C cannot talk to each other unless you peer them directly or use a “Transit” setup.
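Non-transitivity can be modeled as a graph with no path-following: two VNets communicate only if a direct peering edge exists between them. A toy illustration (the VNet names are hypothetical):

```python
# Minimal sketch: VNet peering is a direct link only, so reachability is
# NOT transitive -- A<->B and B<->C does not give A<->C.
peerings = {frozenset({"VNetA", "VNetB"}), frozenset({"VNetB", "VNetC"})}

def can_reach(src: str, dst: str) -> bool:
    """Two VNets can talk only if a direct peering exists between them."""
    return frozenset({src, dst}) in peerings

print(can_reach("VNetA", "VNetB"))  # True
print(can_reach("VNetA", "VNetC"))  # False: peering does not chain through B
```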

3. Understanding Traffic Flow

How data moves depends heavily on your User Defined Routes (UDRs) and Gateway Transit settings.

Default Flow (Direct)

In a simple peering, traffic flows directly from VM-A to VM-B. Azure handles the routing automatically via “System Routes.”

Hub-and-Spoke with Transit

If you want Spokes to communicate with each other or the internet through the Hub, you must configure:

  1. Allow Gateway Transit: Set on the Hub peering.
  2. Use Remote Gateways: Set on the Spoke peering.
  3. Network Virtual Appliance (NVA): Usually a Firewall in the Hub. You must create a UDR on the Spoke subnets to “force” traffic to the Firewall’s IP address instead of letting it go direct.
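A minimal sketch of the pairing rule behind the first two settings: “Use Remote Gateways” on the Spoke side only works when the matching Hub-side peering has “Allow Gateway Transit” enabled. The field names here are illustrative, not the Azure SDK's:

```python
from dataclasses import dataclass

@dataclass
class Peering:
    allow_gateway_transit: bool = False  # set on the Hub's peering to the Spoke
    use_remote_gateways: bool = False    # set on the Spoke's peering to the Hub

def transit_is_valid(hub_side: Peering, spoke_side: Peering) -> bool:
    """Gateway transit works only if both sides of the pairing agree."""
    if spoke_side.use_remote_gateways and not hub_side.allow_gateway_transit:
        return False  # spoke asks for a gateway the hub is not sharing
    return True

print(transit_is_valid(Peering(allow_gateway_transit=True),
                       Peering(use_remote_gateways=True)))           # True
print(transit_is_valid(Peering(), Peering(use_remote_gateways=True)))  # False
```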

Traffic Flow Table

Scenario | Path | Logic
VNet A to Peered VNet B | Direct | System route (peering)
Spoke A to Spoke B | Spoke A → Hub (Firewall) → Spoke B | User Defined Route (UDR)
On-Prem to Spoke | On-Prem → Hub Gateway → Spoke | Gateway Transit enabled

4. Crucial Constraints

  • Overlapping Address Space: You cannot peer two VNets if their IP address ranges overlap.
  • Transitivity: Peering is strictly 1:1. To move traffic “through” a VNet to another destination, you must use a Network Virtual Appliance (like Azure Firewall) or Azure Route Server.
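The overlap constraint is easy to pre-check with Python's standard library before attempting a peering; a quick sketch:

```python
import ipaddress

def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """Two VNets can only be peered if their address spaces are disjoint."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(can_peer("10.1.0.0/16", "10.2.0.0/16"))  # True  (disjoint ranges)
print(can_peer("10.1.0.0/16", "10.1.8.0/24"))  # False (nested range overlaps)
```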

In a multi-region Azure architecture, the strategy shifts from connecting individual workloads to connecting entire regional ecosystems. As of 2026, the industry standard has moved toward high-automation models to manage the complexity of cross-region routing.

There are two primary ways to architect this: Global Hub-and-Spoke and Azure Virtual WAN.


1. Global Hub-and-Spoke (Self-Managed)

In this model, you replicate the Hub-and-Spoke design in every region where you have a footprint and then link the hubs together.

The Strategy

  • Regional Hubs: Each region (e.g., East US, West Europe) has its own Hub VNet containing a local Azure Firewall and Gateway.
  • Hub Peering: You connect the Hubs using Global VNet Peering.
  • Full Mesh vs. Chain: Ideally, all Hubs are peered to each other (Full Mesh of Hubs) to ensure the lowest latency and avoid “bottlenecking” through a third region.

Traffic Flow

  • Inter-Region Spoke-to-Spoke: To go from Spoke A (Region 1) to Spoke B (Region 2), the traffic flow is: Spoke A → Hub 1 (Firewall) → Global Peering → Hub 2 (Firewall) → Spoke B.
  • Routing Logic: This requires meticulous User-Defined Routes (UDRs) on every spoke subnet to ensure traffic knows the “Next Hop” is the local Hub Firewall for remote region ranges.
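A hypothetical route-table builder for that logic: given the local hub firewall IP and the remote regions' address ranges, emit the UDR entries each spoke subnet needs. The dict shape is illustrative, not an ARM schema:

```python
def build_spoke_udrs(local_firewall_ip, remote_region_prefixes):
    """Emit one UDR per remote region plus a forced internet-egress route."""
    routes = [{"name": f"To-{name}", "addressPrefix": prefix,
               "nextHopType": "VirtualAppliance",
               "nextHopIpAddress": local_firewall_ip}
              for name, prefix in remote_region_prefixes.items()]
    # force internet egress through the local firewall as well
    routes.append({"name": "To-Internet", "addressPrefix": "0.0.0.0/0",
                   "nextHopType": "VirtualAppliance",
                   "nextHopIpAddress": local_firewall_ip})
    return routes

udrs = build_spoke_udrs("10.1.0.4", {"WestUS": "10.2.0.0/16"})
for r in udrs:
    print(r["name"], r["addressPrefix"], "->", r["nextHopIpAddress"])
```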

2. Azure Virtual WAN (Microsoft-Managed)

Virtual WAN (vWAN) is a managed networking service that replaces the manual Hub-and-Spoke with a “plug-and-play” global transit architecture.

The Strategy

  • The Virtual Hub: Microsoft manages the Hub VNet for you. You don’t manage the underlying VMs or scaling; you just toggle features (VPN, ExpressRoute, Firewall).
  • Automated Peering: When you connect a Spoke VNet to a Virtual Hub, the routing is automatically propagated.
  • Any-to-Any Connectivity: By default, all spokes connected to any hub in the vWAN can talk to each other globally via the Microsoft backbone.

Traffic Flow

  • Simplified Path: Traffic moves from a Spoke to its local Virtual Hub, which then uses the Global Mesh (automatically maintained by Azure) to reach a Spoke in another region.
  • Managed Routing: You don’t need to write dozens of UDRs. The vWAN “Hub Router” handles BGP propagation across regions automatically.

Key Differences for 2026

Feature | Global Hub-and-Spoke | Azure Virtual WAN
Effort | Manual (UDRs, peering, NVA configs) | Managed (point-and-click/policy)
Transitivity | Not native (must use Firewall/NVA) | Native (any-to-any transit)
Scale | Limited by the ~500-peerings-per-VNet cap on each Hub | Scales to thousands of VNets
Cost | Cheaper for small, simple setups | Better ROI for large, complex global builds
Routing | Static (UDR-based) | Dynamic (BGP-based)

Important Change: Default Outbound Access (March 2026)

As of March 31, 2026, Azure has retired “default outbound access” for newly deployed VMs. In a multi-region setup, you can no longer rely on VMs simply “finding” the internet.

The Strategy Update: You must now explicitly define an egress path. In a multi-region architecture, this is typically done by routing all regional internet traffic through the local Hub’s Azure Firewall or by attaching an Azure NAT Gateway to each spoke subnet for high-performance, non-inspected traffic.


Below is the User-Defined Route (UDR) configuration for a cross-region Hub-and-Spoke setup. This is the manual way to “force” traffic through your security stack.

Scenario

  • Region 1 (East US): supernet 10.1.0.0/16; Hub VNet 10.1.0.0/24 with a Firewall at 10.1.0.4.
  • Region 2 (West US): supernet 10.2.0.0/16; Hub VNet 10.2.0.0/24 with a Firewall at 10.2.0.4.
  • Spoke A (in East US): Spoke VNet 10.1.10.0/24.

1. The Route Table Logic

To get Spoke A (East US) to talk to a Spoke in West US, the packet needs instructions. Without a UDR, Spoke A has no route to the West US ranges (the spokes are not peered to each other), so the packet falls through to the default Internet system route and is dropped.

Example UDR for Spoke A (East US)

You would create a Route Table and associate it with the subnets in Spoke A:

Route Name | Address Prefix | Next Hop Type | Next Hop Address | Purpose
To-West-US | 10.2.0.0/16 | Virtual Appliance | 10.1.0.4 | Sends West US traffic to the local Hub Firewall first
To-Internet | 0.0.0.0/0 | Virtual Appliance | 10.1.0.4 | Forces all web traffic through the Firewall (egress)
Local-Spoke-Traffic | 10.1.10.0/24 | Virtual network | None | Keeps intra-spoke traffic direct instead of sending it to the Firewall
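Azure picks the most specific (longest) matching prefix, which is why the /24 local route wins over 0.0.0.0/0. A toy lookup over the same three routes, not Azure's actual routing engine:

```python
import ipaddress

# The three routes from the Spoke A table, as (prefix, next hop) pairs.
routes = [
    ("10.2.0.0/16", "10.1.0.4"),   # To-West-US: remote region via local firewall
    ("0.0.0.0/0", "10.1.0.4"),     # To-Internet: forced egress via firewall
    ("10.1.10.0/24", "local"),     # Local-Spoke-Traffic: stays direct
]

def next_hop(dest: str) -> str:
    """Return the next hop for dest using longest-prefix match."""
    addr = ipaddress.ip_address(dest)
    matches = [(ipaddress.ip_network(p), hop) for p, hop in routes
               if addr in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("10.2.5.9"))    # 10.1.0.4 (cross-region, via East US firewall)
print(next_hop("10.1.10.25"))  # local (intra-spoke traffic is not hairpinned)
print(next_hop("8.8.8.8"))     # 10.1.0.4 (internet egress via firewall)
```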

2. The Traffic Flow (Step-by-Step)

When a VM in Spoke A (East US) sends a packet to Spoke B (West US):

  1. Source Check: The VM looks at its Route Table. It sees the 10.2.0.0/16 prefix points to 10.1.0.4 (Local Hub Firewall).
  2. First Hop: Packet travels to the East US Firewall. The Firewall inspects it against your Network Rules.
  3. Global Transit: The Firewall sees the destination is in West US. It sends the packet across the Global VNet Peering to the West US Hub.
  4. Second Hop: The West US Firewall receives the packet, inspects it again (if desired), and forwards it to the destination VM in the West US Spoke.
  5. Return Path: Symmetric routing is critical. The West US Spoke must have a mirrored UDR pointing East US traffic (10.1.0.0/16) to the West US Hub Firewall (10.2.0.4).
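The symmetry requirement in step 5 can be expressed as a small check: every cross-region prefix routed out through the local firewall needs a mirrored route back through the remote firewall. A hypothetical validator using the sample addresses from this scenario:

```python
# East spoke routes West US ranges to the East firewall; West mirrors it back.
east_routes = {"10.2.0.0/16": "10.1.0.4"}
west_routes = {"10.1.0.0/16": "10.2.0.4"}

def is_symmetric(local_routes, remote_routes,
                 local_prefix, remote_prefix,
                 local_fw, remote_fw) -> bool:
    """True if traffic is firewalled in both directions (no asymmetric path)."""
    outbound = local_routes.get(remote_prefix) == local_fw
    inbound = remote_routes.get(local_prefix) == remote_fw
    return outbound and inbound

print(is_symmetric(east_routes, west_routes,
                   "10.1.0.0/16", "10.2.0.0/16",
                   "10.1.0.4", "10.2.0.4"))  # True: mirrored UDRs exist
```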

3. Comparison: Manual UDR vs. Virtual WAN (vWAN)

If writing these tables for 50 spokes sounds like a headache, that’s exactly why Azure Virtual WAN exists.

Feature | Manual Hub-and-Spoke (UDR) | Azure Virtual WAN
Routing Table | You manually create/update every UDR | Automated: routes are propagated via BGP
Transit | You must configure NVAs/Firewalls to route | Native: the Virtual Hub is a transit router
Complexity | High (risk of route leaks or loops) | Low (Microsoft manages the routing mesh)
Scaling | Hard to manage beyond 10-15 VNets | Designed for 100s of VNets globally

Pro-Tip: The “Gateway Subnet” Exception

Never associate a UDR with the GatewaySubnet in your Hub unless you are very experienced with BGP. Doing so can break the path between your on-premises VPN and your Azure VNets by creating routing loops.


Hub-and-Spoke architecture

The Hub-and-Spoke architecture is the gold standard for enterprise networking in Azure. It’s designed to centralize shared resources while providing isolation for individual workloads, effectively balancing security with scalability.

Think of it like an airport: the Hub is the main terminal (security, baggage, customs), and the Spokes are the individual gates where the planes (your apps) live.


1. The Hub: The “Central Command”

The Hub is a Virtual Network (VNet) that acts as the central point of connectivity. It typically contains resources shared by the entire organization.

  • Azure Firewall / NVA: All traffic entering or leaving the network passes through here for inspection.
  • VPN / ExpressRoute Gateway: Connects your on-premises data center to the Azure environment.
  • Shared Services: Domain Controllers (ADDS), DNS servers, or specialized management tools.
  • Azure Bastion: Provides secure RDP/SSH access to VMs without exposing public IPs.

2. The Spokes: Isolated Workloads

Spokes are separate VNets used to host specific workloads, such as a production environment, a dev/test environment, or a specific business application.

  • Isolation: Spokes do not communicate with each other by default. This “Zero Trust” approach ensures that if one spoke is compromised, the threat is contained.
  • Connectivity: Each spoke is connected to the Hub via VNet Peering.
  • Resource Management: Spokes are often managed by different teams but still rely on the Hub for security and connectivity.

3. How Traffic Flows (Routing)

The “magic” of the Hub-and-Spoke model lies in how data moves through the network.

  • Spoke-to-Internet: Traffic is usually forced through the Hub’s Firewall using User Defined Routes (UDRs). This ensures no application can “talk” to the internet without being inspected.
  • Spoke-to-Spoke: If Spoke A needs to talk to Spoke B, the traffic must travel to the Hub, be inspected by the Firewall, and then be routed back out to the destination spoke.
  • On-Premises-to-Spoke: The Hub acts as the gateway. External traffic hits the VPN/ExpressRoute in the Hub and is then routed to the correct Spoke.

4. Why Use This Model?

Feature | Benefit
Cost Efficiency | You only pay for one Firewall and one VPN Gateway in the Hub, rather than putting them in every Spoke
Security | Centralized “choke point” for security policies and traffic monitoring
Scalability | You can add new Spokes easily without redesigning the core network
Governance | Central IT can manage the Hub (security), while App Teams manage their own Spokes (agility)

5. The “Virtual WAN” Evolution

For very large global deployments, Microsoft offers Azure Virtual WAN. This is essentially “Hub-and-Spoke as a Service.” It automates the peering and routing, making it easier to manage hundreds of spokes across different geographic regions.

Implementing This for a New Migration

If you are starting a new migration, aligning your Hub-and-Spoke architecture with the Azure Landing Zone (ALZ) framework is the best move. It ensures you don’t just build a “network,” but a governed ecosystem that can grow from 10 VMs to 1,000 without a redesign.

Here is the blueprint for aligning a new migration to the Hub-and-Spoke model in 2026.


1. The Hierarchy (Management Groups)

Before touching the network, you must organize your subscriptions. Alignment starts with Management Groups (MGs).

  • Interim/Sandbox MG: For initial migration testing.
  • Platform MG: Holds your Hub subscription (Connectivity, Identity, Management).
  • Landing Zones MG: Holds your Spoke subscriptions, organized by archetype (e.g., “Corp” for internal apps, “Online” for public apps).

2. Connectivity: Hub vs. Spoke Strategy

The Hub (Centralized Services)

In a new migration, the Hub is your “Landing Strip.” It should be deployed first and contain:

  • Azure Firewall: Acts as the central security guard for all “North-South” (Internet) and “East-West” (Spoke-to-Spoke) traffic.
  • Private DNS Resolver: Critical for migrations. It ensures your migrated Azure VMs can still resolve names of servers remaining on-premises.
  • Gateway Subnet: Where your VPN or ExpressRoute lands to connect your old data center to the new cloud environment.

The Spokes (Workload Isolation)

Each application or business unit gets its own Spoke VNet.

  • VNet Peering: Connect each Spoke to the Hub.
  • Gateways: Spokes should not have their own VPN gateways. They “use the remote gateway” in the Hub to reach on-premises.

3. The Migration “Paved Path”

To make the migration repeatable, use a Subscription Vending Machine approach.

Step | Action | Why?
1. Standardize | Use Infrastructure as Code (IaC) (Bicep or Terraform) | Avoids “snowflake” configurations that are hard to support
2. Secure | Apply Azure Policy at the Landing Zone MG level | Automatically prevents Spokes from creating Public IPs or bypassing the Firewall
3. Route | Deploy User Defined Routes (UDRs) in every Spoke subnet | Forces all traffic (0.0.0.0/0) to the Hub Firewall for inspection
4. Migrate | Use Azure Migrate or Azure Site Recovery (ASR) | Replicates your on-prem VMs directly into the designated Spoke VNets

4. Key 2026 Best Practices

  • Identity-First Security: Don’t just rely on the network. Use Microsoft Entra ID (formerly Azure AD) and Managed Identities for all migrated apps.
  • Private Link Everything: Instead of opening ports, use Private Endpoints in your Spokes to connect to PaaS services (like SQL or Storage) privately.
  • Observability: Connect all Spokes to a central Log Analytics Workspace in the Hub’s “Management” subscription for a single pane of glass during the migration cutover.

5. Pro-Tip: Use the “ALZ Accelerator”

Microsoft provides a Landing Zone Accelerator (available in the Azure Portal or via GitHub). It allows you to deploy a fully compliant Hub-and-Spoke foundation in about 20 minutes. It handles the management groups, policies, and core networking for you, so you can focus on moving your data rather than building the pipes.

Warning: During a migration, watch out for IP Address Overlap. Ensure the CIDR ranges you choose for your Azure Spokes do not conflict with other VNets (overlap blocks peering) or with your existing on-premises network (overlap breaks hybrid routing).


VNet Peering

Q: What is the difference between regional and global VNet peering? Are there any restrictions with global peering?

Regional VNet peering connects two VNets within the same Azure region. Global VNet peering connects VNets across different Azure regions.

Restrictions with global peering:

  • Basic Load Balancer — Resources behind a Basic Load Balancer in one VNet cannot be reached from a globally peered VNet. Standard Load Balancer works fine.
  • Latency — Global peering crosses region boundaries, so latency is higher than regional peering. You need to account for this in latency-sensitive workloads.
  • Cost — Global peering incurs data transfer charges in both directions, whereas regional peering charges are lower.
  • No transitive routing — Same as regional peering, traffic does not flow transitively through a peered VNet without additional configuration.

Q: Can peered VNets communicate transitively by default? How would you work around this?

No — transitive routing is not supported natively in VNet peering. If Spoke A is peered to the Hub, and Spoke B is peered to the Hub, Spoke A cannot reach Spoke B directly through the Hub by default.

To work around this, you have two main options:

  1. Azure Firewall or NVA in the Hub — Route traffic from Spoke A through the Hub firewall, which then forwards it to Spoke B. This requires User Defined Routes (UDRs) on both Spokes pointing their traffic to the firewall’s private IP as the next hop. This is the most common enterprise approach and has the added benefit of traffic inspection.
  2. Azure Virtual WAN — Virtual WAN supports transitive routing natively, making it a cleaner option when you have many Spokes and don’t want to manage UDRs manually.

Q: Spoke A and Spoke B are peered to the Hub. Can Spoke A reach Spoke B? What needs to be in place?

Not by default. To enable this:

  • Deploy Azure Firewall (or an NVA) in the Hub VNet
  • Create a UDR on Spoke A’s subnet with a route: destination = Spoke B’s address space, next hop = Azure Firewall private IP
  • Create a mirror UDR on Spoke B’s subnet: destination = Spoke A’s address space, next hop = Azure Firewall private IP
  • Configure Azure Firewall network rules to allow the traffic between Spoke A and Spoke B
  • Enable “Allow Gateway Transit” on the Hub side and “Use Remote Gateways” on the Spoke side if on-premises traffic must also route through the Hub, and ensure “Allow Forwarded Traffic” is enabled on the peerings so the firewall can relay spoke-to-spoke packets

This gives you transitive connectivity with centralized inspection — a core benefit of Hub-and-Spoke.


Q: When would you choose VNet peering over VPN Gateway or ExpressRoute for VNet-to-VNet connectivity?

  • VNet Peering — Best for Azure-to-Azure connectivity. It uses the Microsoft backbone, offers the lowest latency, highest throughput, and is the simplest to configure. Use it whenever both VNets are in Azure.
  • VPN Gateway (VNet-to-VNet) — Used when you need encrypted tunnels between VNets, or when connecting across different Azure tenants/subscriptions where peering may be complex. Higher latency and limited throughput compared to peering.
  • ExpressRoute — Used for on-premises to Azure connectivity over a private, dedicated circuit. Not typically used for VNet-to-VNet unless traffic must flow through on-premises for compliance or inspection reasons.

In short: always prefer peering for Azure-to-Azure, and reserve VPN/ExpressRoute for hybrid or cross-tenant scenarios.

Azure DNS

DNS Architecture

Q: Can you explain the difference between Azure Public DNS and Azure Private DNS Zones, and when you would use each?

Azure Public DNS is used to host publicly resolvable domain names — for example, resolving www.yourcompany.com from the internet. Anyone on the internet can query it.

Azure Private DNS Zones, on the other hand, are only resolvable within a VNet or linked VNets. They are used for internal name resolution — for example, resolving a private endpoint for a storage account like mystorageaccount.privatelink.blob.core.windows.net from inside your network, without exposing it publicly.

You use Public DNS when you need external-facing resolution, and Private DNS Zones when you need secure, internal name resolution for resources that should never be reachable from the internet.


Q: How does DNS resolution work for a VM inside a VNet — what is the default behavior, and when would you override it?

By default, Azure provides a built-in DNS resolver at the special IP 168.63.129.16. Every VM in a VNet uses this address automatically. It can resolve Azure-internal hostnames and any Private DNS Zones linked to that VNet.

You would override this default when:

  • You need to resolve on-premises hostnames from Azure (hybrid scenarios)
  • You need conditional forwarding to route specific domain queries to specific DNS servers
  • You are using a centralized custom DNS server (e.g., a DNS forwarder VM or Azure DNS Private Resolver) to control and log all DNS traffic across the environment

In those cases, you configure a custom DNS server address at the VNet level, pointing VMs to your centralized resolver instead.


Q: What is conditional forwarding, and how would you set it up to resolve on-premises domain names from Azure?

Conditional forwarding is a DNS rule that says: “For queries matching this specific domain, forward them to this specific DNS server instead of resolving them normally.”

For example, if your on-premises domain is corp.contoso.local, you would configure your Azure DNS resolver to forward any query for corp.contoso.local to your on-premises DNS server IP.

The setup typically looks like this:

  • Deploy Azure DNS Private Resolver with an outbound endpoint in your Hub VNet
  • Create a DNS forwarding ruleset with a rule: corp.contoso.local → forward to on-premises DNS IP
  • Associate the ruleset with the relevant VNets
  • Ensure the on-premises DNS server can be reached over ExpressRoute or VPN
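The ruleset behavior can be modeled as a suffix match on the query name, falling back to Azure's default resolver at 168.63.129.16 when no rule applies. The on-premises DNS IP below is a placeholder:

```python
AZURE_DEFAULT = "168.63.129.16"  # Azure's built-in resolver
rules = {"corp.contoso.local": "10.50.0.10"}  # illustrative on-prem DNS IP

def pick_resolver(query: str) -> str:
    """Forward queries under a rule's domain to its target; else use default."""
    for domain, target in rules.items():
        if query == domain or query.endswith("." + domain):
            return target
    return AZURE_DEFAULT

print(pick_resolver("fileserver.corp.contoso.local"))  # 10.50.0.10
print(pick_resolver("myapp.azurewebsites.net"))        # 168.63.129.16
```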

Q: A client reports that their Azure VM cannot resolve a private endpoint hostname. What are the first things you check?

I would systematically check the following:

  1. Private DNS Zone linkage — Is the Private DNS Zone (e.g., privatelink.blob.core.windows.net) linked to the VNet the VM is in? Without the link, the zone is invisible to that VNet.
  2. A record presence — Does the Private DNS Zone actually contain an A record pointing to the private endpoint’s IP?
  3. Custom DNS configuration — If the VNet uses a custom DNS server, is that server forwarding queries for privatelink.* domains to Azure’s resolver (168.63.129.16)? This is a very common misconfiguration.
  4. nslookup / dig from the VM — Run nslookup <hostname> on the VM to see what IP is being returned. If it returns the public IP instead of the private IP, the DNS zone is not being picked up correctly.
  5. Network connectivity — Even if DNS resolves correctly, confirm NSG rules and routing aren’t blocking traffic to the private endpoint IP.
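The checklist above can be folded into a first-failure triage helper. The input dict is a hypothetical summary of what you would gather from the portal and nslookup, and the messages are illustrative:

```python
def triage(state: dict) -> str:
    """Walk the private-endpoint DNS checks in order; report the first failure."""
    if not state.get("zone_linked_to_vnet"):
        return "Link the Private DNS Zone to the VM's VNet"
    if not state.get("a_record_present"):
        return "Add the private endpoint's A record to the zone"
    if state.get("custom_dns") and not state.get("forwards_to_168_63_129_16"):
        return "Custom DNS must forward privatelink.* to 168.63.129.16"
    if state.get("resolved_ip_is_public"):
        return "Zone not picked up: VM still resolves the public IP"
    return "DNS looks fine -- check NSGs and routing next"

print(triage({"zone_linked_to_vnet": True, "a_record_present": False}))
```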

Q: How would you use Azure DNS Private Resolver, and how does it differ from a traditional DNS forwarder running on a VM?

Azure DNS Private Resolver is a fully managed, highly available DNS service that handles inbound and outbound DNS resolution without requiring you to manage VMs.

  • The inbound endpoint allows on-premises clients to send DNS queries into Azure and resolve Private DNS Zones — something that wasn’t possible before without a forwarder VM.
  • The outbound endpoint with forwarding rulesets allows Azure VMs to forward specific domain queries (e.g., on-premises domains) to external DNS servers.

Compared to a forwarder VM, DNS Private Resolver is:

  • Fully managed — no patching, no VM maintenance, no availability concerns
  • Scalable — handles high query volumes automatically
  • Integrated — natively understands Azure Private DNS Zones without extra configuration
  • More secure — no need to open management ports on a VM

The main reason teams still use forwarder VMs is legacy architecture or specific advanced configurations not yet supported by Private Resolver.



🔵 Hub-and-Spoke Network Design

Q: Explain the Hub-and-Spoke topology. What lives in the Hub, and what lives in the Spokes?

Hub-and-Spoke is a network design pattern where a central VNet (the Hub) acts as the connectivity and security backbone, and multiple Spoke VNets connect to it via peering.

The Hub hosts shared, centralized services:

  • Azure Firewall or NVA for traffic inspection and internet egress control
  • VPN Gateway or ExpressRoute Gateway for on-premises connectivity
  • Azure DNS Private Resolver
  • Bastion for secure VM access
  • Shared monitoring and logging infrastructure

The Spokes host workload-specific resources:

  • Application VMs, AKS clusters, App Services
  • Databases and storage
  • Each Spoke is isolated — it can only communicate outside its boundary through the Hub, which enforces security policies

This model gives you centralized governance and security without duplicating shared services in every workload environment.


Q: How do you enforce traffic inspection through the Hub for Spoke-to-internet traffic?

  • Deploy Azure Firewall in the Hub VNet
  • On each Spoke subnet, create a UDR with a default route: 0.0.0.0/0 → next hop = Azure Firewall private IP
  • This forces all outbound internet traffic from Spoke VMs through the firewall before it exits
  • On the Hub, configure Azure Firewall application and network rules to define what traffic is allowed out
  • Enable Azure Firewall DNS proxy if you want centralized DNS logging as well

For Spoke-to-Spoke, additional UDRs point inter-spoke traffic to the firewall as described earlier.


Q: A new business unit needs to be onboarded into your existing Hub-and-Spoke architecture. Walk me through the steps.

  1. IP planning — Allocate a non-overlapping address space for the new Spoke VNet from the enterprise IP plan
  2. Create the Spoke VNet — Deploy it in the appropriate subscription under the correct Management Group
  3. Establish peering — Create bidirectional peering between the new Spoke and the Hub (allow gateway transit on Hub side, use remote gateway on Spoke side if needed)
  4. Configure UDRs — Apply route tables on the Spoke subnets to direct internet and cross-spoke traffic through the Hub firewall
  5. DNS configuration — Point the Spoke VNet’s DNS settings to the centralized DNS Private Resolver in the Hub
  6. Firewall rules — Add rules in Azure Firewall to permit the business unit’s required traffic flows
  7. Azure Policy — Ensure the new subscription inherits enterprise policies (e.g., no public IPs, required tags, allowed regions)
  8. Private DNS Zone links — Link relevant Private DNS Zones to the new Spoke VNet for private endpoint resolution
  9. Connectivity testing — Validate DNS resolution, internet egress, and any required on-premises connectivity

🔵 Landing Zones & Enterprise Network Governance

Q: What is an Azure Landing Zone, and how does networking fit into it?

An Azure Landing Zone is a pre-configured, governed Azure environment that provides the foundation for hosting workloads securely and at scale. It is designed following Microsoft’s Cloud Adoption Framework (CAF) and covers identity, governance, security, networking, and management.

Networking is one of the most critical components. In the CAF Landing Zone model:

  • A Connectivity subscription hosts the Hub VNet, gateways, firewall, and DNS infrastructure
  • Landing Zone subscriptions host Spoke VNets for individual workloads or business units
  • All networking is governed centrally — workload teams cannot create arbitrary public IPs or peer VNets outside the approved architecture
  • Azure Policy enforces these constraints automatically

Q: What role do Azure Policy and Management Groups play in enforcing network governance?

Management Groups create a hierarchy of subscriptions (e.g., Root → Platform → Landing Zones → Business Units). Policies applied at a Management Group level automatically inherit down to all subscriptions beneath it.

Azure Policy enforces guardrails such as:

  • Deny creation of public IP addresses in Spoke subscriptions
  • Require all VNets to use a specific custom DNS server
  • Deny VNet peering unless it connects to the approved Hub
  • Enforce NSG association on every subnet
  • Require private endpoints for PaaS services like Storage and SQL

Together, they ensure that even if a workload team has Contributor access to their subscription, they cannot violate the network architecture — the policies block non-compliant actions automatically.
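The guardrail effect can be illustrated with a toy evaluator (this is not the Azure Policy engine): a deny-effect policy blocks the request regardless of the caller's RBAC role.

```python
# Illustrative policy set: deny rules block, audit rules merely log.
policies = [
    {"name": "deny-public-ip", "resourceType": "publicIPAddresses", "effect": "deny"},
    {"name": "require-nsg", "resourceType": "subnets", "effect": "audit"},
]

def is_allowed(resource_type: str) -> bool:
    """A resource request is blocked if any deny-effect policy matches it."""
    return not any(p["effect"] == "deny" and p["resourceType"] == resource_type
                   for p in policies)

print(is_allowed("publicIPAddresses"))  # False: blocked even for Contributors
print(is_allowed("virtualMachines"))    # True
```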


Q: How would you manage IP address space allocation across multiple subscriptions to avoid conflicts?

This is an area where discipline and tooling are both essential:

  • Centralized IP plan — Maintain a master IP address management (IPAM) document or tool (e.g., Azure’s native IPAM feature in preview, or third-party tools like Infoblox or NetBox) that tracks all allocated ranges across subscriptions
  • Non-overlapping ranges per Spoke — Assign each Landing Zone a dedicated, non-overlapping CIDR block from a master supernet (e.g., 10.0.0.0/8 split into /16 per region, then /24 per Spoke)
  • Azure Policy — Use policy to deny VNet creation if the address space conflicts with known ranges or falls outside the approved supernet
  • Automation — When onboarding new Landing Zones via Pulumi or other IaC, automatically pull the next available range from the IPAM system rather than relying on manual assignment
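The "pull the next available range" step can be sketched with nothing but Python's standard `ipaddress` module. This is a toy stand-in for a real IPAM system — the function name and sizes are illustrative, not any Azure or Infoblox API:

```python
import ipaddress


def next_available_spoke(supernet: str, allocated: list[str], prefix: int = 24) -> str:
    """Return the first /prefix subnet of the supernet that doesn't overlap
    any already-allocated CIDR block."""
    net = ipaddress.ip_network(supernet)
    taken = [ipaddress.ip_network(a) for a in allocated]
    for candidate in net.subnets(new_prefix=prefix):
        if not any(candidate.overlaps(t) for t in taken):
            return str(candidate)
    raise ValueError("Supernet exhausted")


# The first two /24s are taken, so the next Spoke gets 10.0.2.0/24
print(next_available_spoke("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24"]))  # → 10.0.2.0/24
```

Running this check inside the onboarding pipeline (and denying overlaps with Azure Policy as a backstop) is what keeps Global Peering possible later — peering fails outright if address spaces overlap.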

🔵 Hybrid DNS Resolution

Q: On-premises clients need to resolve privatelink.blob.core.windows.net. What DNS architecture changes are needed?

This is one of the most common hybrid DNS challenges. By default, privatelink.blob.core.windows.net resolves to a public IP from on-premises. To make it resolve to the private endpoint IP, you need:

On the Azure side:

  • Create a Private DNS Zone for privatelink.blob.core.windows.net and link it to the Hub VNet
  • Ensure the private endpoint A record is registered in the zone
  • Deploy Azure DNS Private Resolver with an inbound endpoint in the Hub VNet — this gives on-premises clients a routable IP to send DNS queries into Azure

On the on-premises side:

  • Configure your on-premises DNS server with a conditional forwarder: privatelink.blob.core.windows.net → forward to the DNS Private Resolver inbound endpoint IP
  • Ensure the inbound endpoint IP is reachable over ExpressRoute or VPN from on-premises

Result: On-premises clients query their local DNS → conditional forwarder redirects to Azure DNS Private Resolver → Private Resolver checks the linked Private DNS Zone → returns the private endpoint IP → traffic flows privately over ExpressRoute/VPN.
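A quick way to validate the cutover from a client is to check that the FQDN now resolves to a private (RFC 1918) address instead of the public endpoint. This helper is hypothetical (not part of any Azure SDK); the resolver parameter lets you stub the lookup for testing:

```python
import ipaddress
import socket


def resolves_privately(hostname: str, resolver=socket.gethostbyname) -> bool:
    """True if the hostname resolves to an RFC 1918 (private) address."""
    ip = resolver(hostname)
    return ipaddress.ip_address(ip).is_private


# Stubbed resolvers standing in for the on-premises DNS server:
print(resolves_privately("mystg.blob.core.windows.net", lambda _: "10.1.4.5"))   # → True (private endpoint)
print(resolves_privately("mystg.blob.core.windows.net", lambda _: "20.60.1.1"))  # → False (still public)
```

If this returns False from an on-premises client after the change, the usual culprits are a missing conditional forwarder or the inbound endpoint IP not being reachable over the ExpressRoute/VPN path.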


Q: You’re migrating from a custom DNS forwarder VM to Azure DNS Private Resolver. How do you ensure zero DNS disruption?

  1. Deploy Private Resolver in parallel — Set up the inbound and outbound endpoints and configure forwarding rulesets to mirror the existing forwarder VM’s rules exactly
  2. Test thoroughly — Validate resolution of all key domains (on-premises, private endpoints, public) from test VMs pointing to the new resolver
  3. Staged migration — Update the custom DNS server setting on VNets one at a time, starting with non-production VNets, monitoring for any resolution failures
  4. Update on-premises conditional forwarders — Once Azure-side is validated, update on-premises DNS to point to the Private Resolver inbound endpoint instead of the old forwarder VM IP
  5. Monitor — Use Azure Monitor and DNS metrics on the Private Resolver to confirm query volumes are healthy
  6. Decommission the VM — Only after all VNets and on-premises forwarders are updated and validated, remove the forwarder VM

The key principle is run both in parallel, migrate gradually, and never cut over until validation is complete.


Enterprise RAG Pipeline & Internal AI Assistant Azure Ecosystem: ADF, ADLS Gen2, Databricks, AI Search, OpenAI


1. The Project Title

Enterprise RAG Pipeline & Internal AI Assistant Azure Ecosystem: ADF, ADLS Gen2, Databricks, AI Search, OpenAI


2. Impact-Driven Bullet Points

Use the C-A-R (Context-Action-Result) method. Choose 3-4 from this list:

  • Architecture: Architected and deployed a multi-stage data lake (Medallion Architecture) using ADLS Gen2 and Terraform, reducing data fragmentation across internal departments.
  • Orchestration: Developed automated Azure Data Factory (ADF) pipelines with event-based triggers to ingest and preprocess 5,000+ internal documents (PDF/Office) with 99% reliability.
  • AI Engineering: Built a Databricks processing engine to perform recursive character chunking and vector embedding using text-embedding-3-large, optimizing retrieval context for a GPT-4o powered chatbot.
  • Search Optimization: Implemented Hybrid Search (Vector + Keyword) and Semantic Ranking in Azure AI Search, improving answer relevance by 35% compared to traditional keyword-only search.
  • Security & Governance: Integrated Microsoft Entra ID and ACL-based Security Trimming to ensure the AI assistant respects document-level permissions, preventing unauthorized data exposure.
  • Cost Management: Optimized cloud spend by 40% through Databricks Serverless compute and automated ADLS Lifecycle Management policies (Hot-to-Cold tiering).

3. Skills Section (Keywords for ATS)

  • Cloud & Data: Azure Data Factory (ADF), ADLS Gen2, Azure Databricks, Spark (PySpark), Medallion Architecture, Delta Lake.
  • AI & Search: Retrieval-Augmented Generation (RAG), Azure AI Search, Vector Databases, Semantic Ranking, Hybrid Retrieval.
  • LLMs: Azure OpenAI (GPT-4o), Embeddings, Prompt Engineering, LangChain/LlamaIndex.
  • DevOps/IaC: Terraform, Azure DevOps (CI/CD), Managed Identities, Unity Catalog.

4. The “Interview Hook”

In your Professional Summary or Project Description, add one sentence that proves you know the real-world challenges of AI:

“Implemented a production-ready RAG system that solves for LLM hallucinations by enforcing strict citation requirements and PII redaction during the ingestion phase.”


Pro-Tip for 2026:

Hiring managers currently care deeply about “Day 2 Operations.” If they ask about this project in an interview, mention how you monitored it for Cost (Azure Budgets) and Quality (using an evaluation framework like Ragas or Azure AI Content Safety). This proves you aren’t just a “tutorial follower” but a production-ready engineer.

To deploy Azure Databricks using Terraform, you need two main components: a Resource Group and the Databricks Workspace. (A Network Security Group is optional but recommended if you later use VNet injection.)

Below is a clean, modular example. This configuration uses the premium pricing tier, which Unity Catalog and security trimming require; the standard tier is cheaper but lacks those features.

1. The Terraform Configuration (main.tf)

Terraform

# 1. Define the Provider
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0" # Use the latest stable 3.x version
    }
  }
}

provider "azurerm" {
  features {}
}

# 2. Create a Resource Group
resource "azurerm_resource_group" "rg" {
  name     = "rg-databricks-internal-rag"
  location = "East US"
}

# 3. Create the Azure Databricks Workspace
resource "azurerm_databricks_workspace" "example" {
  name                = "dbw-internal-ai-processor"
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  sku                 = "premium" # Premium is required for Unity Catalog & Security Trimming

  # The Managed Resource Group is where Databricks puts its internal VMs/Storage
  managed_resource_group_name = "rg-dbw-managed-resources"

  tags = {
    Environment = "Dev"
    Project     = "Internal-RAG"
  }
}

# 4. Output the Workspace URL (you'll need this to log in)
output "databricks_host" {
  value = azurerm_databricks_workspace.example.workspace_url
}

2. Crucial Architecture Notes for Terraform

  • The Managed Resource Group: Databricks is a “Managed Service.” When you run the code above, Azure will automatically create a second resource group (e.g., rg-dbw-managed-resources). You should not manually delete items inside that group; Terraform and Databricks manage them.
  • SKU Selection:
    • Standard: Cheaper, but lacks advanced security.
    • Premium: Required if you want to use Unity Catalog or User-level Permissions, which are essential for your internal RAG system’s security.
  • VNet Injection (Advanced): If your company requires the Databricks clusters to stay entirely off the public internet, you would need to add a custom_parameters block to the resource to “inject” it into your own Virtual Network.

3. How to Deploy

  1. Initialize: download the Azure provider by running terraform init.
  2. Plan: preview what will be created with terraform plan.
  3. Apply: deploy to Azure with terraform apply (you will be prompted to type yes).

4. Next Steps for your RAG Pipeline

Once this workspace is deployed, you will likely want to use Terraform to create Personal Access Tokens (PATs) or Service Principals so that Azure Data Factory (ADF) has permission to trigger Databricks jobs.

II. Add the ADLS Storage Account and ADF to this same Terraform script so they are all connected automatically

This is the “Enterprise Starter Kit” in Terraform. This code creates the resource group, the ADLS Gen2 storage (the reservoir), the Data Factory (the pipes), and the Databricks Workspace (the engine), then connects them so they can talk to each other.

The Integrated Terraform Script

Terraform

# 1. Resource Group
resource "azurerm_resource_group" "main" {
  name     = "rg-internal-rag-prod"
  location = "East US"
}

# 2. ADLS Gen2 Storage (The Warehouse)
resource "azurerm_storage_account" "adls" {
  name                     = "stdatalakeprod001" # Must be unique globally
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # THIS TURNS IT INTO ADLS GEN2
}

resource "azurerm_storage_data_lake_gen2_filesystem" "raw" {
  name               = "bronze-raw"
  storage_account_id = azurerm_storage_account.adls.id
}

# 3. Azure Data Factory (The Orchestrator)
resource "azurerm_data_factory" "adf" {
  name                = "adf-internal-rag-orchestrator"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  identity {
    type = "SystemAssigned" # Allows ADF to talk to other services without passwords
  }
}

# 4. Databricks Workspace (The Processor)
resource "azurerm_databricks_workspace" "dbw" {
  name                = "dbw-internal-ai-engine"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "premium" # Required for security trimming/Unity Catalog
}

# 5. CONNECTING THEM: ADF Linked Service to Databricks
# This tells ADF how to trigger Databricks jobs
resource "azurerm_data_factory_linked_service_azure_databricks" "adf_to_dbw" {
  name            = "LS_Databricks_Connection"
  data_factory_id = azurerm_data_factory.adf.id
  adb_domain      = "https://${azurerm_databricks_workspace.dbw.workspace_url}"

  # Using Managed Identity for security (no keys stored!)
  msi_work_space_resource_id = azurerm_databricks_workspace.dbw.id

  new_cluster_config {
    node_type             = "Standard_DS3_v2"
    cluster_version       = "13.3.x-scala2.12"
    min_number_of_workers = 1
    max_number_of_workers = 2
  }
}

Why this setup is “RAG-Ready”:

  • is_hns_enabled = true: This is the most critical line. Without it, you just have a slow, flat blob storage. This enables the high-speed folder structure for your data lake.
  • Managed Identity (SystemAssigned): Notice there are no passwords or “Access Keys” in this code. ADF uses its own identity to log into Databricks. This is the gold standard for internal security.
  • sku = "premium": In an internal environment, you need “Premium” so you can enable Unity Catalog. This ensures that if a user doesn’t have permission to see an HR file in the Data Lake, the Databricks AI code won’t be able to read it either.
  • Automatic Cluster: The new_cluster_config block tells Data Factory to spin up a Databricks cluster only when it’s needed and shut it down immediately after, saving your department significant money.

How to use this:

  1. Save this as main.tf.
  2. Run terraform init and terraform apply.
  3. Once finished, go to the Azure Portal and you will see your entire data ecosystem ready to be loaded with documents.

III. Terraform code to deploy the Azure AI Search service to complete the RAG stack

To complete your Internal RAG stack, we need to add Azure AI Search. In 2026, this service is the “Vector Database” that allows your chatbot to find the right paragraphs within your internal documents.

The Final Terraform Piece: Azure AI Search

Add this block to your existing main.tf file. It deploys the Search service with Semantic Ranking enabled—which is the “secret sauce” for high-accuracy internal bots.

Terraform

# 6. Azure AI Search Service
resource "azurerm_search_service" "search" {
  name                = "ais-internal-rag-lookup"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location

  # 'basic' or 'standard' is required for Semantic Ranker and Vector Search
  sku = "basic"

  # Enabling Semantic Search (Standard plan for the ranker)
  semantic_search_sku = "standard"

  # Identity for secure connection to ADLS and OpenAI
  identity {
    type = "SystemAssigned"
  }

  tags = {
    Project = "Internal-RAG"
  }
}

# Output the Search Admin Key (for your App/Chatbot)
output "search_admin_key" {
  value     = azurerm_search_service.search.primary_key
  sensitive = true
}

output "search_endpoint" {
  value = "https://${azurerm_search_service.search.name}.search.windows.net"
}

How it all fits together (The Finished Product)

You now have a fully automated infrastructure code that creates:

  1. ADLS Gen2: Where your raw internal PDFs and Docs are stored.
  2. Azure Data Factory: The engine that “watches” for new files and tells the system to update.
  3. Azure Databricks: The brain that cleans data and can perform complex AI “chunking” of your documents.
  4. Azure AI Search: The searchable index that your chatbot queries to get answers.

2026 Pro-Tip: “Integrated Vectorization”

In 2026, you don’t actually need to write Python code to turn text into vectors. You can configure Azure AI Search to connect directly to your ADLS (using the adlsgen2 data source type) and it will:

  • Automatically “crack” the PDFs.
  • Chunk the text.
  • Call Azure OpenAI to create vectors.
  • Keep the index updated whenever a file in ADLS changes.

Final Step for Security

Since this is an Internal system, remember to create a Role Assignment, either in the Azure Portal or in Terraform. You must give the Search Service’s Managed Identity the role of Storage Blob Data Reader on your ADLS account so it can “read” the files to index them.
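If you prefer to keep that grant in code, it can be sketched in Terraform, reusing the azurerm_storage_account.adls and azurerm_search_service.search resources defined earlier (note that RBAC changes can take a few minutes to propagate):

```terraform
# Give AI Search's managed identity read access to the data lake for indexing
resource "azurerm_role_assignment" "search_to_adls" {
  scope                = azurerm_storage_account.adls.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_search_service.search.identity[0].principal_id
}
```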

IV. Terraform code to add the LLM to this stack as well

To finish the “Brain” of your RAG system, we will add Azure OpenAI. This provides the Embedding Model (to turn text into math/vectors) and the GPT Model (to actually talk to your employees).

In 2026, the best practice is to create these as model deployments (one per model) within a single Azure OpenAI resource.


7. Azure OpenAI Resource & Models

Add this to your main.tf. This script creates the AI service and deploys two specific models:

  1. text-embedding-3-large: For high-accuracy vector search.
  2. gpt-4o: The “Omni” model for fast, reasoning-based chat.

Terraform

# 7. Azure OpenAI Service
resource "azurerm_cognitive_account" "openai" {
  name                = "oai-internal-rag-brain"
  location            = "East US" # Check availability for gpt-4o in your region
  resource_group_name = azurerm_resource_group.main.name
  kind                = "OpenAI"
  sku_name            = "S0"

  identity {
    type = "SystemAssigned"
  }
}

# Deploy the Embedding Model (The "Encoder")
resource "azurerm_cognitive_deployment" "embedding" {
  name                 = "text-embedding-3-large"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "text-embedding-3-large"
    version = "1"
  }

  sku {
    name     = "Standard"
    capacity = 50 # Tokens per minute (adjust based on need)
  }
}

# Deploy the Chat Model (The "Talker")
resource "azurerm_cognitive_deployment" "gpt4" {
  name                 = "gpt-4o"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "gpt-4o"
    version = "2024-05-13"
  }

  sku {
    name     = "Standard"
    capacity = 30
  }
}

# Output the OpenAI Endpoint
output "openai_endpoint" {
  value = azurerm_cognitive_account.openai.endpoint
}

The Completed 2026 RAG Infrastructure Map

Now that your Terraform is complete, here is how the data flows through the resources you just built:

  1. Ingest: ADF detects a new PDF in ADLS Gen2 (Bronze).
  2. Process: Databricks chunks the text and uses the OpenAI Embedding Model to vectorize it.
  3. Index: The vectors and text are pushed into Azure AI Search.
  4. Query: An employee asks a question in your Chatbot UI.
  5. Retrieve: Azure AI Search finds the relevant chunks.
  6. Augment/Generate: Your app sends the chunks + the question to GPT-4o, which gives the final answer.

One Final Security Step (The “Handshake”)

For an Internal system, you must ensure Azure AI Search has permission to talk to Azure OpenAI to handle “Integrated Vectorization.”

You can add this Terraform block to grant that permission automatically:

Terraform

# Give AI Search permission to use OpenAI
resource "azurerm_role_assignment" "search_to_openai" {
scope = azurerm_cognitive_account.openai.id
role_definition_name = "Cognitive Services OpenAI User"
principal_id = azurerm_search_service.search.identity[0].principal_id
}

Final Checklist for Deployment:

  • [ ] Region: Ensure all services (especially OpenAI and AI Search) are in the same region to minimize latency and data egress costs.
  • [ ] Quotas: Check your Azure subscription for “Cognitive Services” quotas; sometimes new subscriptions have a limit on GPT-4o capacity.
  • [ ] Cost: Remember to turn off your Databricks clusters when not in use—they are the most expensive part of this stack!

V. With the infrastructure defined, here is the Python/Spark code you’ll need inside Databricks to start processing those internal documents

To get your internal RAG system running, you need a Databricks Notebook that acts as the bridge between your documents in ADLS and the searchable index in Azure AI Search.

In 2026, the most efficient way to do this is using Spark for parallel processing and the langchain-azure-ai library for the AI logic.


1. Library Installation

Run this in the first cell of your Databricks notebook to install the 2026-standard libraries.

Python

%pip install -U langchain-azure-ai azure-search-documents azure-identity pypdf
dbutils.library.restartPython()

2. The Ingestion Script (PySpark)

This script performs the “Heavy Lifting”: it reads PDFs from your Bronze folder, chunks them, and sends them to the AI Search Index.

Python

import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_azure_ai.embeddings import AzureOpenAIEmbeddings
from langchain_azure_ai.vectorstores import AzureAISearch

# --- Configuration ---
STORAGE_PATH = "abfss://bronze-raw@stdatalakeprod001.dfs.core.windows.net/hr-manuals/"
SEARCH_ENDPOINT = "https://ais-internal-rag-lookup.search.windows.net"
SEARCH_KEY = dbutils.secrets.get(scope="rag-scope", key="search-admin-key")

# 1. Load Data from ADLS
# Using Spark to list all PDF files in the lake
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.pdf").load(STORAGE_PATH)

# 2. Extract and Chunk Text
# (Simplification: in prod, use 'spark-pdf' or 'Azure AI Document Intelligence')
def process_pdf(content):
    import io
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(content))
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    # Split into 1000-character chunks with overlap for context
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_text(text)

# 3. Create Embeddings & Push to Azure AI Search
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-large",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

vector_store = AzureAISearch(
    threading=True,
    azure_search_endpoint=SEARCH_ENDPOINT,
    azure_search_key=SEARCH_KEY,
    index_name="internal-docs-index",
    embedding_function=embeddings.embed_query,
)

# Convert PDF data to chunks and add to the vector store
for row in df.collect():
    chunks = process_pdf(row.content)
    # Metadata helps with "Security Trimming" later
    metadata = [{"source": row.path, "id": f"{row.path}_{i}"} for i in range(len(chunks))]
    vector_store.add_texts(chunks, metadatas=metadata)

print("Indexing Complete!")

3. Key 2026 Features in this Code

  • RecursiveCharacterTextSplitter: This ensures that chunks don’t break in the middle of a word, which helps the LLM maintain “contextual integrity.”
  • text-embedding-3-large: We use the latest 2026 embedding model because it supports MRL (Matryoshka Representation Learning), allowing you to store smaller vectors without losing accuracy—saving you money on storage.
  • Managed Identity: In a real production setup, you would replace SEARCH_KEY with DefaultAzureCredential() so no keys are ever visible in the code.
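To see why the overlap parameter matters, here is a stripped-down, dependency-free version of the sliding-window idea — a toy stand-in for RecursiveCharacterTextSplitter, not its actual algorithm (the real splitter also prefers paragraph and sentence boundaries):

```python
def naive_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Fixed-size windows where each window repeats the last `overlap`
    characters of the previous one, so a sentence cut at a boundary still
    appears intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = naive_chunks("x" * 2500, chunk_size=1000, overlap=100)
print(len(chunks))                           # → 3 windows for 2500 characters
print(len(chunks[0]))                        # → 1000 (a full window)
print(chunks[0][-100:] == chunks[1][:100])   # → True: neighbors share the overlap
```

The trade-off is straightforward: more overlap means better contextual integrity at chunk boundaries, but more duplicated text to embed and store.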

4. How the Chatbot Queries this Data

Once that script runs, your “Search Index” is full. When a user asks a question, your backend app (likely a Python Web App or Azure Function) runs this simple logic:

Python

# The "Retriever"
query = "What is the policy on maternity leave?"
docs = vector_store.similarity_search(
query,
k=3,
search_type="hybrid" # 2026 standard: Vector + Keyword
)
# The "Generator"
# Pass 'docs' to GPT-4o to get the final human-friendly answer

Summary of your Pipeline

  1. ADF detects the file.
  2. Databricks (using the code above) reads the file.
  3. Databricks chunks it and calculates the math (embeddings).
  4. Azure AI Search stores the result.

Your infrastructure and code are now fully aligned. The last piece is the User Interface (like a Streamlit or React app) that actually talks to this bot.

VI. Setting up the User Interface

For the final piece of your internal RAG system, you need a user-friendly interface. In 2026, there are two primary paths: the “Pro-Code” way (custom and flexible) and the “Low-Code” way (fast and integrated).


Option 1: The “Low-Code” Way (Azure AI Foundry)

If you want a professional-grade internal portal in under 10 minutes, use the Azure AI Foundry (formerly AI Studio) “Deploy to Web App” feature.

  1. Go to Azure AI Foundry: Open your project.
  2. Open Playground: Select the “Chat” playground.
  3. Setup RAG: Under “Add your data,” select your Azure AI Search index.
  4. Deploy: Click the Deploy to button and select “as a web app”.
    • Azure will automatically provision an Azure App Service.
    • It creates a beautiful, ChatGPT-like interface that is pre-connected to your data.
    • Security: It integrates with Microsoft Entra ID out of the box, so only your employees can log in.

Option 2: The “Pro-Code” Way (Python + Streamlit)

If you want to customize the branding or add specific internal tools (like a “Feedback” button), Streamlit is the industry standard for 2026.

The app.py logic:

Python

import streamlit as st
from langchain_azure_ai.chat_models import AzureChatOpenAI
from langchain_azure_ai.vectorstores import AzureAISearch
from langchain_azure_ai.embeddings import AzureOpenAIEmbeddings

st.set_page_config(page_title="Internal HR Assistant", layout="wide")
st.title("🤖 Internal Knowledge Bot")

# 1. Initialize Connection
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")
vector_store = AzureAISearch(
    azure_search_endpoint=st.secrets["SEARCH_ENDPOINT"],
    index_name="internal-docs-index",
    embedding_function=embeddings.embed_query,
)
llm = AzureChatOpenAI(azure_deployment="gpt-4o", temperature=0)

# 2. Chat Interface
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask me about company policy..."):
    st.session_state.messages.append({"role": "user", "content": prompt})

    # 3. The RAG Step: Retrieve relevant chunks
    docs = vector_store.similarity_search(prompt, k=3)
    context = "\n".join([d.page_content for d in docs])

    # 4. The RAG Step: Generate Answer
    full_prompt = f"Use the following context to answer: {context}\n\nQuestion: {prompt}"
    response = llm.invoke(full_prompt).content

    with st.chat_message("assistant"):
        st.markdown(response)
        with st.expander("View Sources"):
            for d in docs:
                st.write(f"- {d.metadata['source']}")

    st.session_state.messages.append({"role": "assistant", "content": response})

3. Critical 2026 UX Features for Internal Bots

To make your bot actually useful for employees, ensure your UI includes:

  • Citations: Never show an answer without a link to the source document in ADLS. This builds trust and allows employees to verify the info.
  • “I Don’t Know” Guardrail: Explicitly tell the LLM in the system prompt: “If the answer is not in the provided documents, say you don’t know. Do not make up facts.”
  • Feedback Loop: Add a Thumbs Up/Down on every response. You can pipe this feedback back into ADF to log which documents are confusing the AI.
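The citation and "I don't know" guardrails above live in how you assemble the prompt. Here is a minimal sketch of that assembly — the system-prompt wording and chunk format are illustrative choices, not a fixed API:

```python
# Illustrative system prompt enforcing grounding, citations, and refusal.
SYSTEM_PROMPT = (
    "Answer ONLY from the provided context. "
    "Cite the source path after every claim, like [source: path]. "
    "If the answer is not in the provided documents, say you don't know. "
    "Do not make up facts."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Interleave each retrieved chunk with its source path so the model
    can quote the citation back in its answer."""
    context = "\n\n".join(f"[source: {c['source']}]\n{c['text']}" for c in chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt = build_prompt(
    "What is the maternity leave policy?",
    [{"source": "hr-manuals/leave.pdf", "text": "Maternity leave is 16 weeks."}],
)
print("[source: hr-manuals/leave.pdf]" in prompt)  # → True: citation survives into the prompt
```

Because the source paths travel inside the context, the "View Sources" expander in the UI and the citations in the model's answer stay consistent with each other.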

Summary of your Journey

You have successfully moved from raw data to a live AI assistant:

  1. Infrastructure: Deployed via Terraform.
  2. Pipeline: Data moved by ADF into ADLS.
  3. Processing: Cleaned and Vectorized in Databricks.
  4. Retrieval: Indexed and ranked in Azure AI Search.
  5. Interface: Served via App Service or Streamlit.

VII. How to monitor the costs of this stack so you don’t get a surprise bill at the end of the month

Building a RAG system in 2026 is an exercise in consumption-based engineering. Because every PDF page processed and every chat message sent has a price tag, monitoring isn’t just “good hygiene”—it’s a requirement to keep the project alive.

Here is how to monitor and optimize the costs of your specific ADF-ADLS-Databricks-AI Search stack.


1. The “Big Three” Cost Drivers

In your internal RAG architecture, these three will typically account for 90% of your bill:

  • Databricks — Cost driver: cluster uptime (DBUs). 2026 pro-tip: use Serverless Compute for job runs; it scales to zero the second the processing finishes.
  • AI Search — Cost driver: Search Units (SUs). 2026 pro-tip: start with the Basic tier; don’t move to Standard until your document count exceeds 15GB or 1 million chunks.
  • Azure OpenAI — Cost driver: token consumption. 2026 pro-tip: use gpt-4o-mini for simple summarization and only use gpt-4o for complex reasoning to save up to 80% on tokens.

2. Setting Up “Hard” Guardrails (Azure Budgets)

Don’t wait for the monthly invoice. Set up an automated kill-switch.

  1. Create a Resource Group Budget: Put all your RAG resources (ADF, ADLS, etc.) in one Resource Group.
  2. Set Thresholds:
    • 50%: Send an email to the team.
    • 90%: Send a high-priority alert to the Manager.
    • 100% (The Nuclear Option): In 2026, you can trigger an Azure Automation Runbook that programmatically disables the Azure OpenAI API keys, instantly stopping further spending.
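The budget itself can also live in the same Terraform. A sketch, assuming the azurerm_resource_group.main resource from earlier — the amount, start date, and email addresses are placeholders you would replace:

```terraform
# Monthly budget on the RAG resource group with email alerts at 50% and 90%
resource "azurerm_consumption_budget_resource_group" "rag_budget" {
  name              = "budget-internal-rag"
  resource_group_id = azurerm_resource_group.main.id
  amount            = 500 # placeholder monthly cap in your billing currency
  time_grain        = "Monthly"

  time_period {
    start_date = "2026-01-01T00:00:00Z" # placeholder
  }

  notification {
    enabled        = true
    threshold      = 50
    operator       = "GreaterThan"
    contact_emails = ["team@example.com"] # placeholder
  }

  notification {
    enabled        = true
    threshold      = 90
    operator       = "GreaterThan"
    contact_emails = ["manager@example.com"] # placeholder
  }
}
```

The 100% kill-switch (disabling OpenAI keys via an Automation Runbook) is wired separately through an Action Group; budgets themselves only alert, they never stop spending on their own.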

3. Optimization Checklist by Service

Azure Data Factory (ADF)

  • Data Integration Units (DIUs): When copying files from SharePoint/On-prem to ADLS, ADF defaults to 4 DIUs. For small internal docs, manually set this to 2 to halve the copy cost.
  • Avoid Over-Polling: Set your triggers to “Tumbling Window” or “Storage Event” rather than “Schedule” (e.g., checking every 1 minute) to reduce trigger run costs.

Azure Databricks

  • Auto-Termination: Ensure your clusters are set to terminate after 10 minutes of inactivity.
  • Photon Engine: Turn on the Photon query engine. While it costs slightly more per hour, it processes data so much faster that the total cost of the job is usually lower.

Azure Data Lake (ADLS)

  • Lifecycle Management: Set a policy to move files from Hot to Cold storage if they haven’t been accessed in 30 days. Your “Raw/Bronze” data almost never needs to be in the Hot tier.
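That lifecycle rule can also be codified next to the storage account in Terraform. A sketch, assuming the azurerm_storage_account.adls resource from earlier; note this variant tiers on days since modification — tiering on last access requires enabling access tracking on the account first:

```terraform
# Move Bronze-layer blobs to the Cool tier 30 days after their last write
resource "azurerm_storage_management_policy" "tiering" {
  storage_account_id = azurerm_storage_account.adls.id

  rule {
    name    = "bronze-to-cool"
    enabled = true

    filters {
      prefix_match = ["bronze-raw/"] # the Bronze filesystem/path from earlier
      blob_types   = ["blockBlob"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than = 30
      }
    }
  }
}
```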

Azure AI Search

  • Image Cracking: If your PDFs contain images, “Document Cracking” costs extra ($1 per 1,000 images). If you don’t need to “read” charts or photos, disable image extraction in your indexer settings.

4. 2026 “FinOps for AI” Dashboard

The most effective way to stay under budget is to build a Cost Visibility Dashboard.

  • Tagging: Ensure every resource in your Terraform code has a Project: Internal-RAG and Department: HR tag.
  • Cost Analysis: Use the Azure Cost Management tool to group by “Tag.” This allows you to show your stakeholders exactly how much the “HR Bot” is costing per department.

Summary

If you are just starting, your daily “Idle” cost (with everything turned on but no one using it) should be roughly $5–$10/day. If it’s higher, check if a Databricks cluster was left running or if you deployed a “Standard” tier AI Search unnecessarily.


Azure Data Lake Storage (ADLS)

If ADF is the plumbing and Databricks is the engine, Azure Data Lake Storage (ADLS) Gen2 is the actual physical warehouse where everything is kept.

In 2026, it remains the standard for “Big Data” because it combines the cheap, limitless nature of Cloud Object Storage with the high-speed organization of a File System.


1. The Secret Sauce: Hierarchical Namespace (HNS)

Standard cloud storage (like Azure Blob or Amazon S3) is “flat.” If you have a file at /logs/2026/March/data.csv, the computer sees that whole string as one long name. To move a folder, it has to copy every single file inside it.

With ADLS Gen2, folders are “real” (Hierarchical Namespace).

  • Rename/Move: Renaming a folder with 10 million files is instantaneous because it just changes one reference, not 10 million files.
  • Performance: When a tool like Databricks or Spark asks for “all files in the March folder,” ADLS knows exactly where they are without searching through the entire lake.

2. The Storage Tiers (Cost Savings)

You don’t pay the same price for all data. ADLS allows you to move data between “Tiers” automatically based on how often you touch it:

  • Hot Tier: Highest cost to store, lowest cost to access. Use this for data you are actively processing in your RAG pipeline today.
  • Cool/Cold Tier: Lower storage cost, but you pay a fee to read it. Great for data from last month.
  • Archive Tier: Dirt cheap (pennies per GB). The data is “offline”—it can take a few hours to “rehydrate” it so you can read it again. Perfect for legal compliance backups.

3. Security (ACLs vs. RBAC)

For your Internal RAG system, this is the most important part of ADLS. It uses two layers of security:

  1. RBAC (Role-Based Access Control): Broad permissions (e.g., “John is a Storage Contributor”).
  2. ACLs (Access Control Lists): POSIX-style permissions. You can say “John can see the ‘Public’ folder, but only HR can see the ‘Salaries’ folder.” 2026 Update: Azure AI Search now “respects” these ACLs. If you index files from ADLS, the search results will automatically hide files that the logged-in user doesn’t have permission to see in the Data Lake.
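To make the POSIX-style semantics concrete, here is a toy model of ACL evaluation — illustrative only; real ADLS ACLs are evaluated by the service, and AI Search security trimming compares the signed-in user's identity and group IDs against ACLs captured at indexing time:

```python
def can_read(user: str, groups: set[str], acl: dict) -> bool:
    """acl maps 'user:<name>' / 'group:<name>' entries to permission
    strings like 'r-x'. Read is granted if any matching entry has 'r'."""
    if "r" in acl.get(f"user:{user}", ""):
        return True
    return any("r" in acl.get(f"group:{g}", "") for g in groups)


# Hypothetical ACL on the 'Salaries' folder: only HR can read it
salaries_acl = {"group:hr": "r-x", "group:all-staff": "---"}
print(can_read("john", {"all-staff"}, salaries_acl))        # → False: not in HR
print(can_read("ana", {"hr", "all-staff"}, salaries_acl))   # → True: HR can read
```

This is exactly the behavior you want surfaced in the chatbot: when "john" asks about salaries, the Salaries chunks never appear in his search results, so the LLM never sees them.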

4. ADLS Gen2 vs. Microsoft Fabric OneLake

You might hear about OneLake (the “OneDrive for data”). Here is how to tell them apart in 2026:

  • ADLS Gen2: The “Infrastructure” choice. You have full control over networking, encryption keys, and regions. Best for custom data engineering and Databricks heavy-lifters.
  • OneLake: The “SaaS” choice. It is actually built on top of ADLS, but it manages the folders and permissions for you automatically within Microsoft Fabric.

Summary Checklist

  • Format: Use Delta or Parquet for your “Silver” and “Gold” layers. These are compressed and optimized for AI and BI.
  • Structure: Always follow the Bronze -> Silver -> Gold folder structure to keep your lake from becoming a “data swamp.”
  • Access: Use Managed Identities so ADF and Databricks can talk to ADLS without you ever having to copy-paste a password or a secret key.


Azure Databricks

In 2026, Azure Databricks is much more than just a “data processing tool.” It is now positioned as a Data Intelligence Platform. While it’s still based on Apache Spark, it has evolved to use AI to help you manage your data, write your code, and govern your security.

Think of it as the high-performance engine of your data factory.


1. The Core Technology: Spark + Delta Lake

At its heart, Databricks does two things exceptionally well:

  • Apache Spark: A distributed computing engine. If you have 100TB of data, Databricks breaks it into 1,000 tiny pieces and processes them all at the same time across a “cluster” of computers.
  • Delta Lake: This is the storage layer that sits on top of your ADLS. It gives your “data lake” (files) the powers of a “database” (tables), allowing for things like Undo (Time Travel) and ACID transactions (ensuring data isn’t corrupted if a write fails).

2. New in 2026: The “Intelligence” Layer

The biggest shift recently is that Databricks now uses AI to run its own infrastructure:

  • Genie Code (formerly Databricks Assistant): An agentic AI built into the notebooks. You can type “Clean this table and create a vector index for my RAG bot,” and it will write and execute the Spark code for you.
  • Serverless Compute: You no longer need to “size” clusters (deciding how many CPUs/RAM). You just run your code, and Databricks instantly scales the hardware up or down, charging you only for the seconds the code is running.
  • Liquid Clustering: In the past, data engineers had to manually “partition” data to keep it fast. Now, Databricks uses AI to automatically reorganize data based on how you query it, making searches up to 12x faster.

3. How it fits your RAG System

For your internal chatbot, Databricks is the “Processor” that prepares your data for Azure AI Search:

  1. Parsing: It opens your internal PDFs/Word docs from ADLS.
  2. Chunking: It breaks the text into logical paragraphs.
  3. Embedding: It calls an LLM (like OpenAI) to turn those paragraphs into Vectors.
  4. Syncing: It pushes those vectors into your Search Index.
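The four steps above can be sketched in plain Python. This is an illustrative chunker only: the paragraph-splitting rule and the size limit are assumptions, and the embedding and syncing steps are performed by external services in a real pipeline.

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into paragraph-based chunks no longer than max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Greedily pack paragraphs together until the size limit is hit.
        if len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}" if current else para
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph about policy.\n\nSection one details.\n\nSection two details."
print(chunk_text(doc, max_chars=40))  # three separate chunks
```

In the real pipeline, each chunk would then be sent to an embedding model, and the resulting vectors pushed into the search index.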

4. Databricks vs. The Competition (2026)

| Feature | Azure Databricks | Microsoft Fabric | Azure SQL |
| --- | --- | --- | --- |
| Best For | Heavy Data Engineering & AI | Business Intelligence (BI) | App Backend / Small Data |
| Language | Python, SQL, Scala, R | Mostly SQL & Low-Code | SQL |
| Philosophy | “Open” (Files in your ADLS) | “SaaS” (Everything managed) | “Relational” (Strict tables) |
| Power | Unlimited (Petabyte scale) | High (Enterprise scale) | Medium (GB to low TB) |

5. Unity Catalog (The “Traffic Cop”)

In an internal setting, Unity Catalog is the most important part of Databricks. It provides a single place to manage permissions. If you grant a user access to a table in Databricks, those permissions follow the data even if it’s moved or mirrored into other services like Power BI or Microsoft Fabric.

Summary

  • Use ADF to move the data.
  • Use ADLS to store the data.
  • Use Databricks to do the “heavy thinking,” cleaning, and AI vectorization.
  • Use Azure SQL / AI Search to give the data to your users/bot.

Azure data ecosystem

In the Azure data ecosystem, these four services form the “Modern Data Stack.” They work together to move, store, process, and serve data. If you think of your data as water, this ecosystem is the plumbing, the reservoir, the filtration plant, and the tap.


1. ADLS Gen2 (The Reservoir)

Azure Data Lake Storage Gen2 is the foundation. It is a highly scalable, cost-effective storage space where you keep all your data—structured (tables), semi-structured (JSON/Logs), and unstructured (PDFs/Images).

  • Role: The single source of truth (Data Lake).
  • Key Feature: Hierarchical Namespace. Unlike standard “flat” cloud storage, it allows for folders and subfolders, which makes data access much faster for big data analytics.
  • 2026 Context: It serves as the “Bronze” (Raw) and “Silver” (Filtered) layers in a Medallion Architecture.
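Because the medallion layers are just a folder convention on top of the hierarchical namespace, path construction is simple. A small sketch (the container and dataset names are hypothetical):

```python
def layer_path(container: str, layer: str, dataset: str) -> str:
    # Builds a medallion-style folder path, e.g. 'datalake/bronze/sales'.
    if layer not in ("bronze", "silver", "gold"):
        raise ValueError(f"unknown layer: {layer}")
    return f"{container}/{layer}/{dataset}"

print(layer_path("datalake", "bronze", "sales"))  # -> datalake/bronze/sales
```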

2. ADF (The Plumbing & Orchestrator)

Azure Data Factory is the glue. It doesn’t “own” the data; it moves it from point A to point B and tells other services when to start working.

  • Role: ETL/ELT Orchestration. It pulls data from on-premises servers or APIs and drops it into ADLS.
  • Key Feature: Low-code UI. You build “Pipelines” using a drag-and-drop interface.
  • Integration: It often has a “trigger” that tells Databricks: “I just finished moving the raw files to ADLS, now go clean them.”

3. Azure Databricks (The Filtration Plant)

Azure Databricks is where the heavy lifting happens. It is an Apache Spark-based platform used for massive-scale data processing, data science, and machine learning.

  • Role: Transformation & Analytics. It takes the messy data from ADLS and turns it into clean, aggregated “Gold” data.
  • Key Feature: Notebooks. Engineers write code (Python, SQL, Scala) in a collaborative environment.
  • 2026 Context: It is the primary engine for Vectorization in RAG systems—turning your internal documents into mathematical vectors for AI Search.

4. Azure SQL (The Tap)

Azure SQL Database (or Azure Synapse) is the final destination for business users. While ADLS is great for “big data,” it’s not the best for a quick dashboard or a mobile app.

  • Role: Data Serving. It stores the final, “Gold” level data that has been cleaned and structured.
  • Key Feature: High Performance for Queries. It is optimized for Power BI reports and standard business applications.
  • Usage: After Databricks cleans the data, it saves the final results into Azure SQL so the CEO can see a dashboard the next morning.

How they work together (The Flow)

| Step | Service | Action |
| --- | --- | --- |
| 1. Ingest | ADF | Copies logs from an on-prem server to the cloud. |
| 2. Store | ADLS | Holds the raw .csv files in a “Raw” folder. |
| 3. Process | Databricks | Reads the .csv, removes duplicates, and calculates monthly totals. |
| 4. Serve | Azure SQL | The cleaned totals are loaded into a SQL table. |
| 5. Visualize | Power BI | Connects to Azure SQL to show a “Sales Revenue” chart. |

Summary Table

| Service | Primary Skill Needed | Best For… |
| --- | --- | --- |
| ADF | Logic / Drag-and-Drop | Moving data & scheduling tasks. |
| ADLS | Folder Organization | Storing massive amounts of any data type. |
| Databricks | Python / SQL / Spark | Complex math, AI, and cleaning big data. |
| Azure SQL | Standard SQL | Powering apps and BI dashboards. |

To explain the pipeline between these four, we use the Medallion Architecture. This is the industry-standard way to move data from a “raw” state to an “AI-ready” or “Business-ready” state.


Phase 1: Ingestion (The “Collector”)

  • Services: ADF + ADLS Gen2 (Bronze Folder)
  • The Action: ADF acts as the trigger. It connects to your external source (like an internal SAP system, a REST API, or a local SQL Server).
  • The Result: ADF “copies” the data exactly as it is—warts and all—into the Bronze container of your ADLS.
  • Why? You always keep a raw copy. If your logic fails later, you don’t have to go back to the source; you just restart from the Bronze folder.

Phase 2: Transformation (The “Refinery”)

  • Services: Databricks + ADLS Gen2 (Silver Folder)
  • The Action: ADF sends a signal to Databricks to start a “Job.” Databricks opens the raw files from the Bronze folder.
    • It filters out null values.
    • It fixes date formats (e.g., changing 01-03-26 to 2026-03-01).
    • It joins tables together.
  • The Result: Databricks writes this “clean” data into the Silver container of your ADLS, usually in Delta format (a high-performance version of Parquet).
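The date-format fix mentioned above is a one-liner. A plain-Python sketch (the real job would run as Spark code, and the DD-MM-YY input format is an assumption based on the example):

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    # '01-03-26' (DD-MM-YY) -> '2026-03-01' (ISO 8601)
    return datetime.strptime(raw, "%d-%m-%y").strftime("%Y-%m-%d")

print(normalize_date("01-03-26"))  # -> 2026-03-01
```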

Phase 3: Aggregation & Logic (The “Chef”)

  • Services: Databricks + ADLS Gen2 (Gold Folder)
  • The Action: Databricks runs a second set of logic. Instead of just cleaning data, it calculates things. It creates “Gold” tables like Monthly_Sales_Summary or Employee_Vector_Embeddings.
  • The Result: These high-value tables are stored in the Gold container. This data is now ready to be consumed directly by dashboards or AI systems without further processing.

Phase 4: Serving (The “Storefront”)

  • Services: Azure SQL
  • The Action: ADF runs one final “Copy Activity.” It takes the small, aggregated tables from the Gold folder in ADLS and pushes them into Azure SQL Database.
  • The Result: Your internal dashboard (Power BI) or your Chatbot’s metadata storage connects to Azure SQL. Because the data is already cleaned and summarized, the dashboard loads instantly.

The Complete Workflow Summary

| Stage | Data State | Tool in Charge | Where it Sits |
| --- | --- | --- | --- |
| Ingest | Raw / Messy | ADF | ADLS (Bronze) |
| Clean | Filtered / Standardized | Databricks | ADLS (Silver) |
| Compute | Aggregated / Business Logic | Databricks | ADLS (Gold) |
| Serve | Final Tables / Ready for UI | ADF | Azure SQL |

How this connects to your RAG Chatbot:

In your specific case, Databricks is the MVP. It reads the internal PDFs from the Silver folder, uses an AI model to turn the text into Vectors, and then you can either store those vectors in Azure SQL (if they are small) or send them straight to Azure AI Search.

Building a data pipeline in Azure

Building a data pipeline in Azure using Azure Data Factory (ADF) and Azure Data Lake Storage (ADLS) is the “bread and butter” of modern cloud data engineering. Think of ADLS as your massive digital warehouse and ADF as the conveyor belts and robotic arms moving things around.

Here is the high-level workflow and the steps to get it running.


1. The Architecture

In a typical scenario, you move data from a source (like an on-premises SQL DB or an API) into ADLS, then process it.

Key Components:

  • Linked Services: Your “Connection Strings.” These store the credentials to talk to ADLS or your source.
  • Datasets: These point to specific folders or files within your Linked Service.
  • Pipelines: The logical grouping of activities (the workflow).
  • Activities: The individual actions (e.g., Copy Data, Databricks Notebook, Lookup).
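Under the drag-and-drop UI, these components are plain JSON objects. A stripped-down sketch of how a pipeline nests a Copy activity with its dataset references (all names here are hypothetical, and real ADF pipelines carry many more properties):

```python
# Minimal shape of an ADF pipeline: one Copy activity wiring a
# source dataset reference to an ADLS sink dataset reference.
pipeline = {
    "name": "CopyRawFiles",
    "properties": {
        "activities": [
            {
                "name": "CopyToLake",
                "type": "Copy",
                "inputs": [{"referenceName": "SourceSqlDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "AdlsRawDataset", "type": "DatasetReference"}],
            }
        ]
    },
}

print(pipeline["properties"]["activities"][0]["type"])  # -> Copy
```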

2. Step-by-Step Implementation

Step 1: Set up the Storage (ADLS Gen2)

  1. In the Azure Portal, create a Storage Account.
  2. Crucial: Under the “Advanced” tab, ensure Hierarchical Namespace is enabled. This turns standard Blob storage into ADLS Gen2.
  3. Create a Container (e.g., raw-data).

Step 2: Create the Linked Service in ADF

  1. Open Azure Data Factory Studio.
  2. Go to the Manage tab (toolbox icon) > Linked Services > New.
  3. Search for Azure Data Lake Storage Gen2.
  4. Select your subscription and the storage account you created. Test the connection and click Create.

Step 3: Define your Datasets

You need a “Source” dataset (where data comes from) and a “Sink” dataset (where data goes).

  1. Go to the Author tab (pencil icon) > Datasets > New Dataset.
  2. Select Azure Data Lake Storage Gen2.
  3. Choose the format (Parquet and Delimited Text/CSV are most common).
  4. Point it to the specific file path in your ADLS container.

Step 4: Build the Pipeline

  1. In the Author tab, click the + icon > Pipeline.
  2. From the Activities menu, drag and drop the Copy Data activity onto the canvas.
  3. Source Tab: Select your source dataset.
  4. Sink Tab: Select your ADLS dataset.
  5. Mapping Tab: Click “Import Schemas” to ensure the columns align correctly.

3. Best Practices for ADLS Pipelines

  • Folder Structure: Use a “Medallion Architecture” (Bronze/Raw, Silver/Cleaned, Gold/Aggregated) within your ADLS containers to keep data organized.
  • Triggering: Don’t just run things manually. Use Schedule Triggers (time-based) or Storage Event Triggers (runs automatically when a file drops into ADLS).
  • Parameters: Avoid hardcoding file names. Use Parameters and Dynamic Content so one pipeline can handle multiple different files.

4. Example Formula for Dynamic Paths

If you want to organize your data by date automatically in ADLS, you can use a dynamic expression in the dataset path:

@concat('raw/', formatDateTime(utcNow(), 'yyyy/MM/dd'))

This ensures that every time the pipeline runs, it creates a new folder for that day’s data.
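The same path logic, sketched in Python to make the resulting folder layout concrete (the 'raw/' prefix comes from the example above):

```python
from datetime import datetime, timezone

def daily_raw_path(now=None):
    # Python equivalent of: concat('raw/', formatDateTime(utcNow(), 'yyyy/MM/dd'))
    now = now or datetime.now(timezone.utc)
    return "raw/" + now.strftime("%Y/%m/%d")

print(daily_raw_path(datetime(2026, 3, 1, tzinfo=timezone.utc)))  # -> raw/2026/03/01
```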

Ingress

In Kubernetes, Ingress is an API object that acts as a “smart router” for your cluster. While a standard Service (like a LoadBalancer) simply opens a hole in the firewall for one specific app, Ingress allows you to consolidate many services behind a single entry point and route traffic based on the URL or path.

Think of it as the receptionist of an office building: instead of every employee having their own front door, everyone uses one main entrance, and the receptionist directs visitors to the correct room based on who they are looking for.


1. How Ingress Works

There are two distinct parts required to make this work:

  1. Ingress Resource: A YAML file where you define your “rules” (e.g., “Send all traffic for myapp.com/api to the api-service”).
  2. Ingress Controller: The actual software (like NGINX, HAProxy, or Traefik) that sits at the edge of your cluster, reads those rules, and physically moves the traffic. Kubernetes does not come with a controller by default; you must install one.
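To make the controller's job concrete, here is a toy longest-prefix router in Python. It only illustrates the rule-matching idea; the service names are hypothetical, and a real controller such as NGINX does this at Layer 7 with far more features:

```python
# Toy version of Ingress path-based routing: the longest matching
# rule prefix decides which backend service receives the request.
RULES = {
    "/billing": "billing-service",
    "/inventory": "inventory-service",
    "/": "default-service",
}

def route(path: str) -> str:
    matches = [prefix for prefix in RULES if path.startswith(prefix)]
    return RULES[max(matches, key=len)]

print(route("/billing/invoice/42"))  # -> billing-service
print(route("/unknown"))             # -> default-service
```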

2. Key Capabilities

Ingress is much more powerful than a simple Port or LoadBalancer because it operates at Layer 7 (HTTP/HTTPS).

  • Host-based Routing: Route blue.example.com to the Blue Service and green.example.com to the Green Service using a single IP.
  • Path-based Routing: Route example.com/login to the Auth service and example.com/search to the Search service.
  • SSL/TLS Termination: You can handle your SSL certificates at the Ingress level so your individual application pods don’t have to deal with encryption/decryption.
  • Name-based Virtual Hosting: Supporting multiple domain names on the same IP address.

3. Ingress vs. LoadBalancer vs. NodePort

Choosing how to expose your app is a common point of confusion. Here is the breakdown:

| Method | Best For | Pros/Cons |
| --- | --- | --- |
| NodePort | Testing/Dev | Opens a high-range port (30000+) on every node. Very insecure for production. |
| LoadBalancer | Single Apps | Standard way to get an external IP. Expensive if you have 50 apps, as you pay for 50 cloud load balancers. |
| Ingress | Production | Multiple apps share one Load Balancer. Extremely cost-effective and feature-rich. |

4. What about OpenShift “Routes”?

Since you were asking about OCP earlier, it’s important to note that OpenShift has its own version called a Route.

  • Routes were created by Red Hat before the Kubernetes Ingress standard existed.
  • In modern OpenShift, an Ingress is automatically converted into a Route behind the scenes.
  • The main difference: Routes natively support extra TLS modes and HAProxy-specific features, while Ingress is the industry-standard, “portable” way to do it.

5. Basic Ingress Example

This YAML routes traffic to two different services based on the URL path:

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /billing
        pathType: Prefix
        backend:
          service:
            name: billing-service
            port:
              number: 80
      - path: /inventory
        pathType: Prefix
        backend:
          service:
            name: inventory-service
            port:
              number: 80

To set up HTTPS for an Ingress, you need a TLS Secret. This secret acts as a container for your SSL certificate and private key, which the Ingress Controller (like NGINX or HAProxy) uses to encrypt the traffic before it reaches your pods.


1. Create the TLS Secret

First, you need your certificate files (tls.crt and tls.key). You can create a Kubernetes Secret from these files using the CLI:

Bash

kubectl create secret tls my-app-tls \
--cert=path/to/tls.crt \
--key=path/to/tls.key

Note: If you are using OpenShift, you can often skip this manual step by using Service Serving Certificates, which allow OpenShift to generate and manage the certificates for you automatically.
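Conceptually, the kubectl command above just base64-encodes the two files into a kubernetes.io/tls Secret. A Python sketch of the resulting manifest shape (the PEM bytes here are placeholders, not real key material):

```python
import base64

def tls_secret_manifest(name: str, cert_pem: bytes, key_pem: bytes) -> dict:
    # Same result as `kubectl create secret tls`: both values base64-encoded.
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/tls",
        "metadata": {"name": name},
        "data": {
            "tls.crt": base64.b64encode(cert_pem).decode(),
            "tls.key": base64.b64encode(key_pem).decode(),
        },
    }

manifest = tls_secret_manifest("my-app-tls", b"CERT", b"KEY")
print(manifest["data"]["tls.crt"])  # -> Q0VSVA==
```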


2. Update the Ingress YAML

Now, you tell the Ingress resource to use that secret for a specific hostname.

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: secure-ingress
  annotations:
    # This annotation tells the controller to redirect HTTP to HTTPS
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: my-app-tls  # This must match the secret name created in Step 1
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-web-service
            port:
              number: 80

3. Automation with Cert-Manager

Manually updating certificates before they expire is a headache. Most production clusters use Cert-Manager.

Cert-Manager is an operator that talks to certificate authorities like Let’s Encrypt. You simply add an annotation to your Ingress, and Cert-Manager handles the rest:

The “Magic” Annotation:

YAML

metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"

Once you add this, Cert-Manager will:

  1. See the Ingress request.
  2. Reach out to Let’s Encrypt to verify you own the domain.
  3. Generate the tls.crt and tls.key.
  4. Create the Secret for you and renew it every 90 days automatically.
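The renewal step boils down to a date check. An illustrative sketch (the 30-day window matches the behavior described above; this is not Cert-Manager's actual code):

```python
from datetime import datetime, timedelta

def should_renew(not_after: datetime, now: datetime, window_days: int = 30) -> bool:
    # Renew once 'now' falls inside the window before the expiry date.
    return now >= not_after - timedelta(days=window_days)

expiry = datetime(2026, 6, 1)
print(should_renew(expiry, datetime(2026, 5, 10)))  # -> True  (22 days left)
print(should_renew(expiry, datetime(2026, 4, 1)))   # -> False (61 days left)
```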

Summary Checklist for HTTPS

| Step | Action |
| --- | --- |
| 1. Certificate | Obtain a CA-signed cert or use Let’s Encrypt. |
| 2. Secret | Store the cert/key in a kind: Secret (type kubernetes.io/tls). |
| 3. Ingress Spec | Add the tls: section to your Ingress YAML. |
| 4. DNS | Ensure your domain points to the Ingress Controller’s IP. |

To automate SSL certificates with Cert-Manager, you need a ClusterIssuer. This is a cluster-wide resource that tells Cert-Manager how to talk to a Certificate Authority (CA) like Let’s Encrypt.

Before you start, ensure the Cert-Manager Operator is installed in your cluster (in OpenShift, you can find this in the OperatorHub).


1. Create a ClusterIssuer (The “Account”)

This YAML defines your identity with Let’s Encrypt. It uses the ACME (Automated Certificate Management Environment) protocol.

YAML

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # The ACME server address for Let's Encrypt production
    server: https://acme-v02.api.letsencrypt.org/directory
    # Your email address for expiration notices
    email: admin@yourdomain.com
    # Name of a secret used to store the ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    # Enable the HTTP-01 challenge provider
    solvers:
    - http01:
        ingress:
          class: nginx  # Or 'openshift-default' depending on your ingress controller

2. Update your Ingress to “Request” the Cert

Once the ClusterIssuer is created, you don’t need to manually create secrets anymore. You just “tag” your Ingress with an annotation. Cert-Manager will see this, perform the challenge, and create the secret for you.

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-secure-app
  annotations:
    # THIS IS THE TRIGGER: It links the Ingress to your ClusterIssuer
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - app.yourdomain.com
    secretName: app-tls-cert  # Cert-Manager will create this secret automatically
  rules:
  - host: app.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80

3. How to verify it’s working

After you apply the Ingress, Cert-Manager creates a Certificate object and a Challenge object. You can track the progress:

  • Check the certificate status: kubectl get certificate (look for READY: True)
  • Check the pending challenges (if it’s stuck): kubectl get challenges
  • Check the secret: kubectl get secret app-tls-cert (if this exists, your site is now HTTPS!)

Why use Let’s Encrypt?

  1. Cost: It is 100% free.
  2. Trust: It is recognized by all major browsers (unlike self-signed certs).
  3. No Maintenance: Cert-Manager automatically renews the cert 30 days before it expires.

A Small Warning:

Let’s Encrypt has rate limits. If you are just testing, use the “Staging” URL (https://acme-staging-v02.api.letsencrypt.org/directory) first. Browsers will show a warning for staging certs, but you won’t get blocked for hitting limit thresholds while debugging.

When Cert-Manager fails to issue a certificate, it usually gets stuck in the Challenge phase. Let’s look at how to diagnose and fix the most common “Let’s Encrypt” roadblocks.


1. The Troubleshooting Command Chain

If your certificate isn’t appearing, follow this hierarchy to find where the “handshake” broke:

  • Check the Certificate status with oc get certificate. If READY is False, move to the next step.
  • Check the Order with oc get order. The Order is the request sent to Let’s Encrypt; look at the STATE column.
  • Check the Challenge (the most important step) with oc get challenges. If a challenge exists, it means Let’s Encrypt is trying to verify your domain but can’t.
  • Describe the Challenge for the error message: oc describe challenge <challenge-name>

2. Common Failure Reasons

A. The “I Can’t See You” (Firewall/Network)

Let’s Encrypt uses the HTTP-01 challenge. It tries to reach http://yourdomain.com/.well-known/acme-challenge/<TOKEN>.

  • The Problem: Your firewall, Security Group (AWS/Azure), or OpenShift Ingress Controller is blocking Port 80.
  • The Fix: Ensure Port 80 is open to the public internet. Let’s Encrypt cannot verify your domain over Port 443 (HTTPS) because the certificate doesn’t exist yet!
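You can reproduce the probe by hand to test reachability, since the URL Let’s Encrypt requests is fully predictable (the domain and token below are placeholders):

```python
def http01_challenge_url(domain: str, token: str) -> str:
    # Let's Encrypt always probes plain HTTP on port 80 at this well-known path.
    return f"http://{domain}/.well-known/acme-challenge/{token}"

print(http01_challenge_url("app.yourdomain.com", "abc123"))
# -> http://app.yourdomain.com/.well-known/acme-challenge/abc123
```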

B. DNS Record Mismatch

  • The Problem: Your DNS A record or CNAME for app.yourdomain.com hasn’t propagated yet or is pointing to the wrong Load Balancer IP.
  • The Fix: Use dig app.yourdomain.com or nslookup to ensure the domain points exactly to your Ingress Controller’s external IP.

C. Rate Limiting

  • The Problem: You’ve tried to issue the same certificate too many times in one week (Let’s Encrypt has a limit of 5 duplicate certs per week).
  • The Fix: Switch your ClusterIssuer to use the Staging URL (mentioned in the previous step) until your configuration is 100% correct, then switch back to Production.

3. Dealing with Internal/Private Clusters

If your OpenShift cluster is behind a VPN and not accessible from the public internet, the HTTP-01 challenge will always fail because Let’s Encrypt can’t “see” your pods.

The Solution: DNS-01 Challenge

Instead of a web check, Cert-Manager proves ownership by adding a temporary TXT record to your DNS provider (Route53, Cloudflare, Azure DNS).

Example DNS-01 Issuer (Route53):

YAML

spec:
  acme:
    solvers:
    - dns01:
        route53:
          region: us-east-1
          hostedZoneID: Z123456789

Summary Checklist

  1. Is Port 80 open?
  2. Does DNS point to the cluster?
  3. Are you hitting Rate Limits?
  4. Is your Ingress Class correct in the Issuer?