Azure – Landing Zone

An Azure Landing Zone is the “plumbing and wiring” of your cloud environment. It is a set of best practices, configurations, and governance rules that ensure a subscription is ready to host workloads securely and at scale.

If you think of a workload (like a website or database) as a house, the Landing Zone is the city block: it provides the electricity, water, roads, and security so the house can function.


πŸ›οΈ The Conceptual Architecture

A landing zone follows a Hub-and-Spoke design, ensuring that common services (like firewalls and identity) aren’t repeated for every single application.

1. The Management Group Hierarchy

Instead of managing one giant subscription, you organize them into “folders” called Management Groups:

  • Platform: Contains the “Engine Room” (Identity, Management, and Connectivity).
  • Workloads (Landing Zones): Where your actual applications live (Production, Development, Sandbox).
  • Decommissioned: Where old subscriptions go to die while retaining data for audit.
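As an illustrative sketch (the group names below are example choices, not a mandated layout), the management-group tree can be pictured as nested "folders", with policies assigned at a parent flowing down to every child:

```python
# Illustrative sketch of a management-group tree (group names are examples).
hierarchy = {
    "Tenant Root Group": {
        "Platform": ["Identity", "Management", "Connectivity"],
        "Landing Zones": ["Production", "Development", "Sandbox"],
        "Decommissioned": [],
    }
}

# A guardrail assigned at "Landing Zones" applies to every child subscription,
# so Production and Development inherit it automatically.
children = hierarchy["Tenant Root Group"]["Landing Zones"]
print(children)
```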

πŸ—οΈ The 8 Critical Design Areas

When you build a landing zone, you must make decisions in these eight categories:

  1. Enterprise Agreement (EA) & Tenants: How you bill and manage the top-level account.
  2. Identity & Access Management (IAM): Setting up Microsoft Entra ID and RBAC.
  3. Network Topology: Designing the Hub-and-Spoke, VNet peering, and hybrid connectivity (VPN/ExpressRoute).
  4. Resource Organization: Establishing a naming convention and tagging strategy.
  5. Security: Implementing Defender for Cloud and Azure Policy.
  6. Management: Centralizing logging in a Log Analytics Workspace.
  7. Governance: Using Azure Policy to prevent “shadow IT” (e.g., “No VMs allowed outside of East US”).
  8. Deployment: Using Infrastructure as Code (Terraform, Bicep, or Pulumi) to deploy the environment.

🚀 Two Main Implementation Paths

A. “Platform” Landing Zone (The Hub)

This is the central infrastructure managed by your IT/Cloud Platform team.

  • Connectivity Hub: Contains Azure Firewall, VPN Gateway, and Private DNS Zones.
  • Identity: Dedicated subscription for Domain Controllers or Entra Domain Services.
  • Management: Centralized Log Analytics and Automation accounts.

B. “Application” Landing Zone (The Spoke)

This is a subscription handed over to a development team.

  • It comes pre-configured with network peering back to the Hub.
  • It has Policies already applied (e.g., “Encryption must be enabled on all disks”).
  • The dev team has “Contributor” rights to build their app, but they cannot break the underlying network or security rules.

πŸ› οΈ How do you actually deploy it?

Microsoft provides the “Accelerator”: a set of templates that allow you to deploy a fully functional enterprise-scale environment in a few clicks or via code.

  1. Portal-based: Use the “Azure Landing Zone Accelerator” in the portal.
  2. Bicep/Terraform: Use the official ALZ-Bicep repository or the Azure/terraform-azurerm-caf-enterprise-scale Terraform module.

✅ Why do it?

  • Scalability: You can add 100 subscriptions without manual setup.
  • Security: Guardrails are “baked in” from day one.
  • Cost Control: Centralized monitoring stops “orphan” resources from running up the bill.

Azure Private DNS Zone with Autoregistration Enabled

Here’s what it means in plain terms:

The short version

When you link a Virtual Network to a Private DNS Zone with autoregistration enabled, Azure automatically maintains DNS records for every VM in that VNet. You don’t touch the DNS zone manually; Azure handles it for you.

What happens at each VM lifecycle event

When you link a virtual network to a private DNS zone with this setting enabled, an address (A) record is created for each virtual machine deployed in the virtual network.

Azure Private DNS then updates those records whenever a virtual machine inside the linked virtual network is created, changes its IP address, or is deleted.

So the three automatic actions are:

  • VM created → A record added (vm-web-01 → 10.0.0.4)
  • VM IP changes → A record updated automatically
  • VM deleted or deallocated → A record removed from the zone
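The three automatic actions can be pictured with a toy model (this is not Azure code, just a sketch of the record lifecycle):

```python
# Toy model of how autoregistration keeps A records in sync with the VM
# lifecycle inside a linked VNet. The zone maps hostname -> private IP.
zone = {}

def vm_created(name, ip):
    zone[name] = ip          # VM created -> A record added

def vm_ip_changed(name, new_ip):
    zone[name] = new_ip      # IP changed -> A record updated

def vm_deleted(name):
    zone.pop(name, None)     # VM deleted -> A record removed

vm_created("vm-web-01", "10.0.0.4")
vm_ip_changed("vm-web-01", "10.0.0.9")
vm_deleted("vm-web-01")
print(zone)  # {}
```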

What powers it under the hood

The private zone’s records are populated by the Azure DHCP service; client registration messages are ignored. This means it’s the Azure platform doing the work, not the VM’s operating system. If you configure a static IP on the VM without using Azure’s DHCP, changes to the hostname or IP won’t be reflected in the zone.

Important limits to know

A specific virtual network can be linked to only one private DNS zone when automatic registration is enabled. You can, however, link multiple virtual networks to a single DNS zone.

Autoregistration works only for virtual machines. For all other resources like internal load balancers, you can create DNS records manually in the private DNS zone linked to the virtual network.

Also, autoregistration doesn’t support reverse DNS pointer (PTR) records.

The practical benefit

In a classic setup without autoregistration, every time a VM is deployed or its IP changes, someone has to manually update the DNS zone. With autoregistration on, your VMs are always reachable by a friendly name like vm-web-01.internal.contoso.com from anywhere inside the linked VNet, with zero manual effort and no stale records left behind after deletions.

AZ – IAM

Azure IAM is built around one question answered in two steps: who are you? and what are you allowed to do? Those two steps map to two distinct, interlocking systems that work together.


Pillar 1 - Microsoft Entra ID (formerly Azure Active Directory): identity

This is the authentication layer. It answers “who are you?” by verifying credentials and issuing a token. It manages every type of identity in Azure: human users, guest accounts, groups, service principals (for apps and automation), and managed identities (the zero-secret identity type where Azure owns the credential). It also enforces Conditional Access policies: rules that say things like “only allow login from compliant devices” or “require MFA when signing in from outside the corporate network.”

Pillar 2 - Azure RBAC (Role-Based Access Control): access

This is the authorization layer. It answers “what can you do?” once identity is proven. RBAC works through three concepts combined into a role assignment:

  • A security principal: the identity receiving the role (user, group, service principal, or managed identity)
  • A role definition: what actions are permitted (e.g., Owner, Contributor, Reader, or a custom role)
  • A scope: where the role applies, following the hierarchy Management Group → Subscription → Resource Group → individual Resource

A role assigned at a higher scope automatically inherits down. Give someone Reader on a subscription and they can read everything inside it.
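This inheritance can be pictured with a simplified model: Azure resource IDs nest by path, so a role assigned at a scope applies to any resource whose ID sits under that scope. The IDs below are made up for illustration.

```python
# Simplified model of RBAC scope inheritance (resource IDs are hypothetical).
# A role assignment at a scope covers that scope and everything nested below it.
def role_applies(assignment_scope: str, resource_id: str) -> bool:
    return resource_id == assignment_scope or resource_id.startswith(assignment_scope + "/")

sub = "/subscriptions/00000000-0000-0000-0000-000000000000"
vm = sub + "/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm-web-01"

print(role_applies(sub, vm))  # True: Reader on the subscription inherits down to the VM
print(role_applies(sub + "/resourceGroups/rg-other", vm))  # False: a different resource group
```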

The supporting tools

Three tools round out a mature IAM setup. PIM (Privileged Identity Management) implements just-in-time access: instead of being a permanent Owner, you request elevation for 2 hours, do the work, and the permission expires automatically. Access Reviews let you periodically re-validate who still needs access, cleaning up stale assignments. Azure Policy enforces guardrails at scale, for example preventing anyone from assigning Owner at the subscription level without an approval workflow.

The core principle threading through all of it

Least privilege: grant the minimum role, at the narrowest scope, for the shortest duration. This is what PIM, custom roles, and resource-group-level assignments all support: shrinking the blast radius if any identity is ever compromised.

Types of identities in Azure

Here’s the full breakdown:


πŸ† Most secure identity: Managed Identity

What makes managed identities uniquely secure is that no one knows the credentials; they are automatically created by Azure, including the credentials themselves. This eliminates the biggest risk in cloud security: leaked or hardcoded secrets. Managed identity replaces secrets such as access keys or passwords, and can also replace certificates or other forms of authentication for service-to-service dependencies.


How many identity types are there in Azure?

At a high level, there are two types of identities: human and machine/non-human identities. Machine/non-human identities consist of device and workload identities. In Microsoft Entra, workload identities are applications, service principals, and managed identities.

Breaking it down further, Azure has 4 main categories with several sub-types:

1. Human identities

  • User accounts (employees, admins)
  • Guest/B2B accounts (external partners)
  • Consumer/B2C accounts (end-users via social login)

2. Workload/machine identities

  • Managed Identity: most secure; no secrets to manage
    • System-assigned: tied to the lifecycle of an Azure resource; when the resource is deleted, Azure automatically deletes the service principal.
    • User-assigned: a standalone Azure resource that can be assigned to one or more Azure resources; the recommended type for Microsoft services.
  • Service Principal: three main types exist: Application service principals, Managed identity service principals, and Legacy service principals.

3. Device identities

  • Entra ID joined (corporate devices)
  • Hybrid joined (on-prem + cloud)
  • Entra registered / BYOD (personal devices)

Why prefer Managed Identity over Service Principal?

Microsoft Entra tokens expire every hour, reducing exposure risk compared to Personal Access Tokens, which can last up to one year. Managed identities handle credential rotation automatically, and there is no need to store long-lived credentials in code or configuration. Service principals, by contrast, require you to manually rotate client secrets or certificates. A 2025 report highlighted that 23.77 million secrets were leaked on GitHub in 2024 alone, underscoring the risks of hardcoded credentials.

The rule of thumb: use Managed Identity whenever your workload runs inside Azure. Use a Service Principal only when you need to authenticate from outside Azure (CI/CD pipelines, on-premises systems, multi-cloud).

How to investigate a cost spike in Azure


Quick Decision Tree First

Do you know which service spiked?

  • Yes → Skip to Step 3
  • No → Start at Step 1

Step 1: Pinpoint the Spike in Cost Management

  • Azure Portal β†’ Cost Management β†’ Cost Analysis
  • Set view to Daily to find the exact day
  • Group by Service Name first → tells you what spiked
  • Then group by Resource → tells you which specific resource

Step 2: Narrow by Dimension

Keep drilling down by:

  • Resource Group
  • Resource type
  • Region (unexpected cross-region egress is a common hidden cost)
  • Meter (very granular β€” shows exactly what operation you’re being charged for)

Step 3: Go to the Offending Resource

Once you know what it is:

| Service | Where to look |
|---|---|
| VM / VMSS | Check scaling events, uptime, instance count |
| Storage | Check blob transactions, egress, data written |
| Azure SQL / Synapse | Query history, DTU spikes, long-running queries |
| ADF (Data Factory) | Pipeline run history: loops, retries, backfills |
| Databricks | Cluster history: was a cluster left running? |
| App Service | Scale-out events, request volume |
| Azure Functions | Execution count: was something stuck in a loop? |

Step 4: Check Activity Log

  • Monitor β†’ Activity Log
  • Filter by the spike timeframe
  • Look for:
    • New resource deployments
    • Scaling events
    • Config changes
    • Who or what triggered it (user vs service principal)

This answers “what changed?”


Step 5: Check Azure Monitor Metrics

  • Go to the specific resource β†’ Metrics
  • Look at usage metrics around the spike time:
    • CPU / memory
    • Data in/out (egress is often the culprit)
    • Request count
    • DTU / vCore usage

Correlate the metric spike timeline with the cost spike timeline.


Step 6: Check Logs (Log Analytics / KQL)

If you have Log Analytics workspace connected:

// Example: Find expensive or long-running operations
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-11))
| where ActivityStatusValue == "Success"
| summarize count() by OperationNameValue, ResourceGroup
| order by count_ desc
// Check for VM scaling events
AzureActivity
| where OperationNameValue contains "virtualMachines"
| where TimeGenerated > ago(7d)
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup

Step 7: Check for Common Culprits

These are the most frequent causes of unexpected spikes:

  • 🔁 Pipeline/job stuck in a loop (ADF, Functions, Logic Apps)
  • 💾 Unexpected data egress (cross-region or internet-bound traffic)
  • 📈 Auto-scaling that didn’t scale back down
  • 🗄️ Full table scan or bad query in SQL/Synapse
  • 🖥️ VM or cluster left running after a job
  • 📦 Historical data backfill triggered accidentally
  • 🔄 Snapshot or backup policy changed

The Mental Model

Cost Analysis (when + what?)
→ Drill by dimension (which resource?)
→ Activity Log (what changed?)
→ Metrics (how did usage behave?)
→ Logs/KQL (why did it happen?)

The CIDR (Classless Inter-Domain Routing)

The CIDR (Classless Inter-Domain Routing) notation tells you two things: the starting IP address and the size of your network.

The number after the slash (e.g., /16, /24) represents how many bits are “locked” for the network prefix. Since an IPv4 address has 32 bits in total, you subtract the CIDR number from 32 to find how many bits are left for your “hosts” (the actual devices).


πŸ“ The “Rule of 32”

To calculate how many IPs you get, use this formula: $2^{(32 - \text{prefix})}$.

  • Higher number = Smaller network: /28 is a small room.
  • Lower number = Larger network: /16 is a massive warehouse.

Common Azure CIDR Sizes

| CIDR | Total IPs | Azure Usable IPs* | Common Use Case |
|---|---|---|---|
| /16 | 65,536 | 65,531 | VNet level: a massive space for a whole company’s environment. |
| /22 | 1,024 | 1,019 | VNet level: good for a standard “Hub” network. |
| /24 | 256 | 251 | Subnet level: perfect for a standard Web or App tier. |
| /27 | 32 | 27 | Service subnet: required for things like SQL Managed Instance. |
| /28 | 16 | 11 | Micro-subnet: used for small things like Azure Bastion or Gateways. |
| /29 | 8 | 3 | Minimum size: the smallest subnet Azure allows. |

🚫 The “Azure 5” (Critical)

In every subnet you create, Azure automatically reserves 5 IP addresses. You cannot use these for your VMs or Apps.

If you create a /28 (16 IPs), you only get 11 usable addresses.

  1. x.x.x.0: Network address
  2. x.x.x.1: Default gateway
  3. x.x.x.2 & x.x.x.3: Azure DNS mapping
  4. The last address in the subnet (x.x.x.255 in a /24): Broadcast address
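The “Rule of 32” and the “Azure 5” together can be checked with Python’s standard ipaddress module; a quick sketch:

```python
import ipaddress

def azure_usable_ips(cidr: str) -> int:
    """Total addresses in the range (2^(32 - prefix)) minus the 5 Azure reserves."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 5

print(azure_usable_ips("10.0.0.0/28"))  # 11
print(azure_usable_ips("10.0.0.0/24"))  # 251
print(azure_usable_ips("10.0.0.0/16"))  # 65531
```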

💡 How to choose for your VNet?

When designing your Azure network, follow these two golden rules:

  1. Don’t go too small: It is very difficult to “resize” a VNet once it’s full of resources. It’s better to start with a /16 or /20 even if you only need a few IPs today.
  2. Plan for Peering: If you plan to connect VNet A to VNet B (Peering), their CIDR ranges must not overlap. If VNet A is 10.0.0.0/16, VNet B should be something completely different, like 10.1.0.0/16.

Pro Tip: Think of it like a T-shirt sizing guide.

  • Small: /24 (256 IPs)
  • Medium: /22 (1,024 IPs)
  • Large: /20 (4,096 IPs)
  • Enterprise: /16 (65,536 IPs)

AZ – Service Endpoints and Private Endpoints

While both Service Endpoints and Private Endpoints are designed to secure your traffic by keeping it on the Microsoft backbone network, they do so in very different ways.

The simplest way to remember the difference is: Service Endpoints secure a public entrance, while Private Endpoints build a private side door.


πŸ› οΈ Service Endpoints

Service Endpoints “wrap” your virtual network identity around an Azure service’s public IP.

  • The Connection: Your VM still talks to the Public IP of the service (e.g., 52.x.x.x), but Azure magically reroutes that traffic so it never leaves the Microsoft network.
  • Granularity: It is broad. If you enable a Service Endpoint for “Storage,” your subnet can now reach any storage account in that region via the backbone.
  • On-Premise: Does not work for on-premise users. A user in your office cannot use a Service Endpoint to reach a database over a VPN.
  • Cost: Completely Free.

🔒 Private Endpoints (Powered by Private Link)

Private Endpoints actually “inject” a specific service instance into your VNet by giving it a Private IP address from your own subnet.

  • The Connection: Your VM talks to a Private IP (e.g., 10.0.0.5). To the VM, the database looks like just another server in the same room.
  • Granularity: Extremely high. The IP address is tied to one specific resource (e.g., only your “Production-DB”). You cannot use that same IP to reach a different database.
  • On-Premise: Fully supports on-premise connectivity via VPN or ExpressRoute. Your office can reach the database using its internal 10.x.x.x IP.
  • Cost: There is an hourly charge plus a fee for data processed (roughly $7-$8/month base + data).
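The DNS side of Private Endpoints follows a split-horizon pattern: the service’s public name CNAMEs to a “privatelink” name, and the Private DNS Zone linked to your VNet holds the A record with the private IP. A toy resolver (the names and IP below are hypothetical) makes the pattern concrete:

```python
# Toy resolver (hypothetical names) showing why Private Endpoints need a
# Private DNS Zone: the public name points to a privatelink name, which the
# zone linked to your VNet resolves to the private IP of the endpoint.
public_cnames = {"mydb.database.windows.net": "mydb.privatelink.database.windows.net"}
private_zone = {"mydb.privatelink.database.windows.net": "10.0.0.5"}  # A record

def resolve_from_vnet(name):
    target = public_cnames.get(name, name)  # follow the CNAME
    return private_zone.get(target)         # answered from the private zone

print(resolve_from_vnet("mydb.database.windows.net"))  # 10.0.0.5
```

From outside the VNet, the same public name would resolve to the service’s public endpoint instead, which is the “split-horizon” part.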

📊 Comparison Table

| Feature | Service Endpoint | Private Endpoint |
|---|---|---|
| Destination IP | Public IP of the service | Private IP from your VNet |
| DNS complexity | None (uses public DNS) | High (requires Private DNS Zones) |
| Granularity | Subnet to all services in region | Subnet to specific resource |
| On-prem access | No | Yes (via VPN/ExpressRoute) |
| Data exfiltration | Possible (if not restricted) | Protected (bound to one instance) |
| Cost | Free | Paid (hourly + data) |

🚀 Which one should you use?

Use Service Endpoints if:

  • You have a simple setup and want to save money.
  • You only need to connect Azure-to-Azure (no on-premise users).
  • You don’t want to deal with the headache of managing Private DNS Zones.

Use Private Endpoints if:

  • Security is your #1 priority (Zero Trust).
  • You need to reach the service from your on-premise data center.
  • You must strictly prevent “Data Exfiltration” (ensuring employees can’t copy data from your VNet to their own personal storage accounts).
  • You are in a highly regulated industry (Finance, Healthcare, Government).

Expert Tip: In 2026, most enterprises have moved toward Private Endpoints as the standard. While they are more expensive and harder to set up (DNS is the biggest hurdle), they offer the “cleanest” security architecture for a hybrid cloud environment.

Azure Virtual Network (VNet) or its subnets are out of IP addresses

This is a classic “architectural corner” that many engineers find themselves in. When an Azure Virtual Network (VNet) or its subnets are out of IP addresses, you cannot simply “resize” a subnet that has active resources in it.

Here is the hierarchy of solutions, from the easiest to the most complex.


πŸ› οΈ Option 1: The “Non-Disruptive” Fix (Add Address Space)

In 2026, Azure allows you to expand a VNet without taking it down. You can add a Secondary Address Space to the VNet.

  1. Add a New Range: Go to the VNet > Address space and add a completely new CIDR block (e.g., if you used 10.0.0.0/24, add 10.1.0.0/24).
  2. Create a New Subnet: Create a new subnet (e.g., Subnet-2) within that new range.
  3. Deploy New Workloads: Direct all new applications or VMs to the new subnet.
  4. Sync Peerings: If this VNet is peered with others, you must click the Sync button on the peering configuration so the other VNets “see” the new IP range.

🔄 Option 2: The “Migration” Fix (VNet Integration)

If your existing applications need more room to grow (scaling up) but their current subnet is full:

  1. Create a Parallel Subnet: Add a new, larger subnet to the VNet (assuming you have space in the address range).
  2. Migrate Resources: For VMs, you can actually change the subnet of a Network Interface (NIC) while the VM is stopped.
  3. App Services: If you are using VNet Integration for App Services, simply disconnect the integration and reconnect it to a new, larger subnet.

🌐 Option 3: The “Expansion” Fix (VNet Peering)

If you cannot add more address space to your current VNet (perhaps because it would overlap with your on-prem network), you can “spill over” into a second VNet.

  1. Create VNet-B: Set up a brand new VNet with its own IP range.
  2. Peer them: Use VNet Peering to connect VNet-A and VNet-B.
  3. Routing: Use Internal Load Balancers or Private Endpoints to bridge the gap between applications in both networks.

⚠️ Important “Gotchas” to Remember

  • The “Azure 5”: Remember that Azure reserves 5 IP addresses in every subnet (the first four and the last one). If you create a /29 subnet, you think you have 8 IPs, but you actually only have 3 usable ones.
  • Subnet Resizing: You cannot resize a subnet if it has any resources in it (even one dormant NIC). You must delete the resources or move them first.
  • NAT Gateway: If you are running out of Public IPs for outbound traffic, attach an Azure NAT Gateway to your subnet. Each public IP on the gateway provides 64,512 SNAT ports, preventing “SNAT Port Exhaustion.”

💡 The “Pro” Recommendation:

If this is a production environment, use Option 1. Add a secondary address space (like 172.16.0.0/16 or 100.64.0.0/10 if you’re out of 10.x.x.x space) and start a new subnet. It’s the only way to get more IPs without a “stop-everything” maintenance window.

“peering” in Azure

When discussing “peering” in Azure, it’s important to clarify the context. Usually, this refers to VNet Peering (connecting virtual networks) or Direct Peering (which can refer to Azure Peering Service for optimized internet or ExpressRoute Direct for high-speed private fiber).

Here is what you need to consider for each to ensure a secure and performant design.


1. VNet Peering (Connecting VNets)

VNet Peering is the primary way to connect two Azure Virtual Networks. They behave as a single network using private IP addresses.

🔑 Key Considerations:

  • Address Space Overlap: CRITICAL. You cannot peer VNets if their IP address spaces (CIDR blocks) overlap. Plan your IP schema early; fixing an overlap later requires deleting and recreating the VNet.
  • Transitivity: VNet peering is not transitive. If VNet A is peered with VNet B, and VNet B is peered with VNet C, VNet A cannot talk to VNet C.
    • Solution: Use a Hub-and-Spoke model with an Azure Firewall/NVA or Azure Virtual WAN for transitive routing.
  • Gateway Transit: If VNet A has a VPN/ExpressRoute gateway, you can allow VNet B to use it.
    • Check: Enable “Allow gateway transit” on VNet A and “Use remote gateways” on VNet B.
  • Cost: Local peering (same region) is cheaper than Global peering (different regions). You are charged for both inbound and outbound data transfer on both sides of the peering.
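The overlap rule is easy to verify up front with Python’s ipaddress module; a quick sketch:

```python
import ipaddress

def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """VNet peering is only possible when the two address spaces don't overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(can_peer("10.0.0.0/16", "10.1.0.0/16"))  # True: safe to peer
print(can_peer("10.0.0.0/16", "10.0.1.0/24"))  # False: the /24 sits inside the /16
```

Running this check across your whole IP plan before creating any VNets is far cheaper than deleting and recreating one later.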

2. Direct Peering (ExpressRoute Direct & Peering Service)

“Direct Peering” usually refers to ExpressRoute Direct, where you connect your own hardware directly to Microsoft’s edge routers at 10 Gbps or 100 Gbps.

🔑 Key Considerations:

  • Physical Connectivity: You are responsible for the “Last Mile” fiber from your data center to the Microsoft Peering Location.
  • SKU Selection:
    • Local: For traffic within the same geopolitical region (cheapest).
    • Standard: For traffic within the same continent.
    • Premium: Required for global connectivity and more than 10 VNet links.
  • Microsoft Peering vs. Private Peering:
    • Private Peering: Connects your on-prem network to your Azure VNets (internal traffic).
    • Microsoft Peering: Connects your on-prem network to Microsoft 365, Dynamics 365, and Azure Public PaaS services (Storage, SQL) over a private link.

3. Comparison Summary

| Consideration | VNet Peering | Direct Peering (ExpressRoute Direct) |
|---|---|---|
| Primary use | Cloud-to-cloud connectivity | On-prem-to-cloud (high bandwidth) |
| Medium | Microsoft global backbone | Dedicated physical fiber + backbone |
| Bandwidth | Limited by VM/Gateway SKU | Up to 100 Gbps |
| Complexity | Low (point-and-click) | High (requires physical fiber/BGP) |
| Security | Encapsulated in Azure backbone | Private, dedicated physical path |

🚦 Common Pitfall: Asymmetric Routing

If you have both a VNet Peering and an ExpressRoute circuit connecting the same two locations, Azure might send traffic out via the peering but receive it back via ExpressRoute.

The Fix: Use User-Defined Routes (UDRs) or BGP weights to ensure the “return” path matches the “outbound” path. Azure will prioritize VNet Peering routes over ExpressRoute routes by default if the address prefixes are the same.


A “Traffic Spike” or “Burst” in resource usage

When a surge is clearly visible in your telemetry but no alerts have fired, you are in “detective mode”: hunting a silent spike that hasn’t crossed any threshold yet. The same approach covers “Cold Start” spikes (common in Serverless/Functions) and sudden request spikes. Here are the steps to track down the source of a sudden surge in Azure:


πŸ” Step 1: Use Azure Monitor “Metrics Explorer”

Since you don’t have alerts, you need to visualize the spike to see its “shape.”

  1. Go to the resource (e.g., App Service, VM, or Load Balancer).
  2. Select Metrics from the left menu.
  3. Add the Request Count (for apps) or CPU/Network In (for VMs) metric.
  4. The Secret Step: Change the “Aggregation” to Count or Sum and look for the exact timestamp of the spike.
  5. Apply Splitting: Split the metric by “RemoteIP” or “Instance”. This tells you if the spike is coming from one specific user/IP or hitting one specific server.

πŸ•΅οΈ Step 2: Dig into Log Analytics (KQL)

If the metrics show a spike but not the “who,” you need the logs. This is where you find the “Source.”

  1. Go to Logs (Log Analytics Workspace).
  2. Run a query to find the top callers during that spike period.

Example KQL for App Gateways/Web Apps:


// Find the top 10 IP addresses causing the spike
AzureDiagnostics
| where TimeGenerated > datetime(2026-04-10T12:00:00Z) // Set to your spike time
| where Category == "ApplicationGatewayAccessLog"
| summarize RequestCount = count() by clientIP_s
| top 10 by RequestCount
  • Result: If one IP address has 50,000 requests while others have 10, you’ve found a bot or a misconfigured client.

🌐 Step 3: Check “Application Insights” (App Level)

If the spike is happening inside your application code (e.g., a “Cold Start” or a heavy API call):

  1. Go to Application Insights > Failures or Performance.
  2. Look at the “Top 10 Operations”.
  3. Check if a specific API endpoint (e.g., /api/export) suddenly jumped in volume.
  4. Use User Map to see if the traffic is coming from a specific geographic region (e.g., a sudden burst of traffic from a country you don’t usually service).

πŸ—ΊοΈ Step 4: Network Watcher (Infrastructure Level)

If you suspect the spike is at the “packet” level (like a DDoS attempt or a backup job gone rogue):

  1. Go to Network Watcher > NSG Flow Logs.
  2. Use Traffic Analytics. It provides a map showing which VNets or Public IPs are sending the most data.
  3. Check for “Flows”: It will show you the “Source Port” and “Destination Port.” If you see a spike on Port 22 (SSH) or 3389 (RDP), someone is likely trying to brute-force your VMs.

🤖 Step 5: Check for “Auto-Scaling” Events

Sometimes the “spike” isn’t a problem, but a reaction.

  1. Go to Activity Log.
  2. Filter for “Autoscale” events.
  3. If the spike happened exactly when a new instance was added, the “spike” might actually be the resource “warming up” (loading caches, etc.), which can look like a surge in CPU or Disk I/O.

Summary Checklist:

  • Metrics Explorer: To see when it happened and how big it was.
  • Log Analytics (KQL): To find the specific Client IP or User Agent.
  • Traffic Analytics: To see if it was a Network-level burst.
  • Activity Log: To see if any Manual Changes or Scaling occurred at that exact second.

A common real-world case is the “mystery spike”: a Cost Spike or a Request/Throughput Spike in your resource namespace with no obvious owner.

If there are no alerts firing, it means the spike either didn’t hit a specific threshold or was too brief to trigger a standard “Static” alert.


πŸ—οΈ Step 1: Establish the “When” and “What”

First, you need to see the “DNA” of the spike using Azure Monitor Metrics.

  • Look at the Graph: Is it a “Square” spike (starts and stops abruptly, like a scheduled job)? Or a “Needle” spike (hits a peak and drops, like a bot attack)?
  • Identify the Resource: Go to Metrics Explorer and check:
    • For VMs: Percentage CPU or Network In/Out.
    • For Storage/SQL: Transactions or DTU Consumption.
    • For App Services: Requests or Data In.

πŸ” Step 2: Finding the Source (The Detective Work)

Since you don’t know where it came from, you use “Splitting” and “Filtering” in Metrics Explorer.

  1. Split by Instance/Role: If you have 10 servers, split by InstanceName. Does only one server show the spike? If yes, it’s a local process (like a hanging Windows Update or a log-rotation fail).
  2. Split by Operation: For Storage or SQL, split by API Name. Is it GetBlob? PutBlob? This tells you if you are reading too much or writing too much.
  3. Split by Remote IP: If your load balancer shows the spike, split by ClientIP. If one IP has 100x the traffic of others, you’ve found your source.
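What “splitting” boils down to is a group-by count over the telemetry, with the biggest bucket pointing at the source. A sketch with made-up sample data:

```python
# What "split by Remote IP" amounts to: a group-by count over the request log.
# The IPs and counts below are made-up sample data.
from collections import Counter

request_ips = ["203.0.113.7"] * 500 + ["198.51.100.2"] * 4 + ["198.51.100.9"] * 3
top_ip, hits = Counter(request_ips).most_common(1)[0]
print(top_ip, hits)  # one client with 100x the traffic of the others: your source
```

The same pattern applies when splitting by instance name or API operation; only the grouping key changes.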

πŸ•΅οΈ Step 3: Deep Dive with Log Analytics (KQL)

Metrics only show numbers. Logs show names. You need to run a KQL query in your Log Analytics Workspace.

Query to find “Who is talking to me”:


// This finds the top 5 callers during the spike window
AzureDiagnostics
| where TimeGenerated > datetime(2026-04-10T12:00:00Z) // Use your spike time
| summarize RequestCount = count() by clientIP_s, requestUri_s
| top 5 by RequestCount
  • Result: This will literally list the IP address and the specific URL they were hitting.

💰 Step 4: The “Cost” Investigation

If the spike is financial (a “Cost Spike”), you check Azure Cost Management.

  1. Cost Analysis: View cost by Resource. Did one specific Disk or Data Transfer cost jump?
  2. Check for “Orphaned” Resources: Sometimes a spike comes from a process that created 1,000 snapshots or temporary disks and forgot to delete them.

🤖 Step 5: Check the “Silent” Sources

If the metrics and logs don’t show an external attacker, check internal Azure “automated” sources:

  • Resource Graph: Check for “Change Tracking.” Did someone deploy code or change a firewall rule at that exact minute?
  • Backup/Recovery Services: A “huge spike” in disk I/O often aligns with a Storage Snapshot or an Azure Backup job starting.
  • Defender for Cloud: Even if you don’t have a “Metric Alert,” check the Security Alerts. Defender might have seen the spike and flagged it as “Suspicious PowerShell Activity” or “Port Scanning.”

✅ Summary Checklist

| Step | Action | Tool |
|---|---|---|
| 1. Visualize | See the shape and duration of the spike. | Metrics Explorer |
| 2. Isolate | Split metrics by IP or instance. | Metrics Explorer |
| 3. Identify | Run a query to find the specific client IP or user. | Log Analytics (KQL) |
| 4. Correlate | Check if any deployments happened at that time. | Activity Log / Change Analysis |
| 5. Network | Check for massive data transfers between regions. | Network Watcher / Traffic Analytics |

How to prevent this next time? Once you find the source, create a Dynamic Threshold Alert. Unlike static alerts, these use AI to learn your “normal” pattern and will fire if a spike looks “unusual,” even if it doesn’t hit a high maximum number.
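The core idea behind a dynamic threshold can be sketched in a few lines: learn a baseline from history and flag values that deviate too far from it. (Azure Monitor’s actual model also accounts for seasonality and trends; this simplified version, with made-up sample numbers, is just the gist.)

```python
# Simplified sketch of the dynamic-threshold idea: learn a baseline from
# recent history and flag values more than k standard deviations away.
from statistics import mean, stdev

history = [100, 98, 103, 101, 99, 102, 97, 100]  # "normal" requests/minute

def is_anomaly(value, window, k=3.0):
    return abs(value - mean(window)) > k * stdev(window)

print(is_anomaly(250, history))  # True: far outside the learned pattern
print(is_anomaly(104, history))  # False: within normal variation
```

This is why a dynamic alert can catch a spike that never reaches a fixed static threshold: “unusual for this resource” is judged relative to its own history.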