How to Investigate a Cost Spike in Azure


Quick Decision Tree First

Do you know which service spiked?

  • Yes → Skip to Step 3
  • No → Start at Step 1

Step 1: Pinpoint the Spike in Cost Management

  • Azure Portal → Cost Management → Cost Analysis
  • Set view to Daily to find the exact day
  • Group by Service Name first → tells you what spiked
  • Then group by Resource → tells you which specific resource

Step 2: Narrow by Dimension

Keep drilling down by:

  • Resource Group
  • Resource type
  • Region (unexpected cross-region egress is a common hidden cost)
  • Meter (very granular — shows exactly what operation you’re being charged for)

Step 3: Go to the Offending Resource

Once you know what it is:

  • VM / VMSS: check scaling events, uptime, instance count
  • Storage: check blob transactions, egress, data written
  • Azure SQL / Synapse: query history, DTU spikes, long-running queries
  • ADF (Data Factory): pipeline run history (loops, retries, backfills)
  • Databricks: cluster history (was a cluster left running?)
  • App Service: scale-out events, request volume
  • Azure Functions: execution count (was something stuck in a loop?)
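For some of these checks you can go beyond the portal. As one sketch: if your Function App uses workspace-based Application Insights, a quick KQL query over the AppRequests table shows whether execution count exploded (the 7-day window is illustrative):

```kusto
// Azure Functions: executions per hour per function.
// A flat line that suddenly becomes a wall usually means a retry loop.
AppRequests
| where TimeGenerated > ago(7d)
| summarize Executions = count() by bin(TimeGenerated, 1h), OperationName
| order by TimeGenerated desc
```

Similar per-hour breakdowns work for App Service request volume from the same table.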

Step 4: Check Activity Log

  • Monitor → Activity Log
  • Filter by the spike timeframe
  • Look for:
    • New resource deployments
    • Scaling events
    • Config changes
    • Who or what triggered it (user vs service principal)

This answers “what changed?”
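If the Activity Log is routed to a Log Analytics workspace, the same filtering can be done in KQL (the date range below is illustrative):

```kusto
// Administrative changes during the spike window, grouped by who made them
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-11))
| where CategoryValue == "Administrative"
| summarize Changes = count() by Caller, OperationNameValue
| order by Changes desc
```

A Caller that looks like a GUID is typically a service principal or managed identity; an email address is a human user.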


Step 5: Check Azure Monitor Metrics

  • Go to the specific resource → Metrics
  • Look at usage metrics around the spike time:
    • CPU / memory
    • Data in/out (egress is often the culprit)
    • Request count
    • DTU / vCore usage

Correlate the metric spike timeline with the cost spike timeline.
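If the resource's platform metrics are routed to Log Analytics via diagnostic settings, you can pull the usage timeline with KQL instead of clicking through charts. A sketch (the resource name and metric are placeholders; pick the metric that matches your service):

```kusto
// Hourly network egress for one resource (assumes metrics flow to AzureMetrics)
AzureMetrics
| where ResourceId contains "myvm"            // hypothetical resource name
| where MetricName == "Network Out Total"
| summarize EgressBytes = sum(Total) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
```

Line this series up against the daily cost chart from Step 1: the hour the metric jumps is usually the hour the meter started running.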


Step 6: Check Logs (Log Analytics / KQL)

If you have a Log Analytics workspace connected:

// Example: most frequent successful operations during the spike window
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-11))
| where ActivityStatusValue == "Success"
| summarize count() by OperationNameValue, ResourceGroup
| order by count_ desc

// Example: recent VM scaling and management events
AzureActivity
| where OperationNameValue contains "virtualMachines"
| where TimeGenerated > ago(7d)
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup

Step 7: Check for Common Culprits

These are the most frequent causes of unexpected spikes:

  • 🔁 Pipeline/job stuck in a loop (ADF, Functions, Logic Apps)
  • 💾 Unexpected data egress (cross-region or internet-bound traffic)
  • 📈 Auto-scaling that didn’t scale back down
  • 🗄️ Full table scan or bad query in SQL/Synapse
  • 🖥️ VM or cluster left running after a job
  • 📦 Historical data backfill triggered accidentally
  • 🔄 Snapshot or backup policy changed
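Several of these culprits leave a trail in the Activity Log. As a rough sketch, this KQL surfaces autoscale activity so you can eyeball whether every scale-out had a matching scale-in (the "autoscale" match on operation names is an assumption; verify against the OperationNameValue values in your own workspace):

```kusto
// Autoscale actions in the last 7 days: scale-outs with no later scale-in
// are a classic source of lingering cost
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue contains "autoscale"
| project TimeGenerated, OperationNameValue, ResourceGroup, Caller
| order by TimeGenerated asc
```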

The Mental Model

Cost Analysis (when + what?)
→ Drill by dimension (which resource?)
→ Activity Log (what changed?)
→ Metrics (how did usage behave?)
→ Logs/KQL (why did it happen?)
