Quick Decision Tree First
Do you know which service spiked?
- Yes → Skip to Step 3
- No → Start at Step 1
Step 1: Pinpoint the Spike in Cost Management
- Azure Portal → Cost Management → Cost Analysis
- Set view to Daily to find the exact day
- Group by Service Name first → tells you what spiked
- Then group by Resource → tells you which specific resource
Step 2: Narrow by Dimension
Keep drilling down by:
- Resource Group
- Resource type
- Region (unexpected cross-region egress is a common hidden cost)
- Meter (very granular — shows exactly what operation you’re being charged for)
Step 3: Go to the Offending Resource
Once you know what it is:
| Service | Where to look |
|---|---|
| VM / VMSS | Check scaling events, uptime, instance count |
| Storage | Check blob transactions, egress, data written |
| Azure SQL / Synapse | Query history, DTU spikes, long-running queries |
| ADF (Data Factory) | Pipeline run history — loops, retries, backfills |
| Databricks | Cluster history — was a cluster left running? |
| App Service | Scale-out events, request volume |
| Azure Functions | Execution count — was something stuck in a loop? |
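For services that ship diagnostics to Log Analytics, you can check run history with a query instead of clicking through the portal. A sketch for Data Factory, assuming the ADFPipelineRun diagnostic table is enabled in your workspace (table and column names depend on your diagnostic settings):

```kusto
// Find pipelines with unusually many runs (possible loop, retry storm, or backfill)
ADFPipelineRun
| where TimeGenerated > ago(7d)
| summarize Runs = count(), Failed = countif(Status == "Failed") by PipelineName
| order by Runs desc
```

A pipeline with thousands of runs in a window where you expect dozens is a strong loop/retry signal.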
Step 4: Check Activity Log
- Monitor → Activity Log
- Filter by the spike timeframe
- Look for:
- New resource deployments
- Scaling events
- Config changes
- Who or what triggered it (user vs service principal)
This answers “what changed?”
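If the Activity Log is routed to a Log Analytics workspace, the same data is queryable as the AzureActivity table. A sketch for isolating administrative changes in the spike window (the dates are placeholders):

```kusto
// What changed during the spike window, and who or what triggered it?
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-03))
| where CategoryValue == "Administrative"
| summarize Operations = count() by Caller, OperationNameValue
| order by Operations desc
```

A service principal in `Caller` points at automation (pipelines, autoscale, IaC runs) rather than a human change.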
Step 5: Check Azure Monitor Metrics
- Go to the specific resource → Metrics
- Look at usage metrics around the spike time:
- CPU / memory
- Data in/out (egress is often the culprit)
- Request count
- DTU / vCore usage
Correlate the metric spike timeline with the cost spike timeline.
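If platform metrics are routed to the workspace via diagnostic settings, the AzureMetrics table lets you chart usage around the spike without opening each resource blade. A sketch using a VM egress metric as an example (metric names vary by resource type):

```kusto
// Hourly network egress around the spike ("Network Out Total" is a VM metric; other services use different names)
AzureMetrics
| where TimeGenerated > ago(7d)
| where MetricName == "Network Out Total"
| summarize TotalBytes = sum(Total) by bin(TimeGenerated, 1h), Resource
| order by TimeGenerated asc
```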
Step 6: Check Logs (Log Analytics / KQL)
If you have a Log Analytics workspace connected:
```kusto
// Example: Find expensive or long-running operations
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-11))
| where ActivityStatusValue == "Success"
| summarize count() by OperationNameValue, ResourceGroup
| order by count_ desc
```

```kusto
// Check for VM scaling events
AzureActivity
| where OperationNameValue contains "virtualMachines"
| where TimeGenerated > ago(7d)
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup
```
Step 7: Check for Common Culprits
These are the most frequent causes of unexpected spikes:
- 🔁 Pipeline/job stuck in a loop (ADF, Functions, Logic Apps)
- 💾 Unexpected data egress (cross-region or internet-bound traffic)
- 📈 Auto-scaling that didn’t scale back down
- 🗄️ Full table scan or bad query in SQL/Synapse
- 🖥️ VM or cluster left running after a job
- 📦 Historical data backfill triggered accidentally
- 🔄 Snapshot or backup policy changed
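Several of these culprits leave a trail in the Activity Log. A sketch that surfaces autoscale actions over the last week, assuming AzureActivity is collected, so you can spot a scale-out with no matching scale-in:

```kusto
// Autoscale actions: scan for a scale-out that was never followed by a scale-in
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue contains "autoscale"
| project TimeGenerated, OperationNameValue, ResourceGroup, Caller
| order by TimeGenerated asc
```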
The Mental Model
Cost Analysis (when + what?) → Drill by dimension (which resource?) → Activity Log (what changed?) → Metrics (how did usage behave?) → Logs/KQL (why did it happen?)