Quick Decision Tree First
Do you know which service spiked?
- Yes → Skip to Step 3
- No → Start at Step 1
Step 1: Pinpoint the Spike in Cost Management
- Azure Portal → Cost Management → Cost Analysis
- Set view to Daily to find the exact day
- Group by Service Name first → tells you what spiked
- Then group by Resource → tells you which specific resource
Step 2: Narrow by Dimension
Keep drilling down by:
- Resource Group
- Resource type
- Region (unexpected cross-region egress is a common hidden cost)
- Meter (very granular — shows exactly what operation you’re being charged for)
Step 3: Go to the Offending Resource
Once you know what it is:
| Service | Where to look |
|---|---|
| VM / VMSS | Check scaling events, uptime, instance count |
| Storage | Check blob transactions, egress, data written |
| Azure SQL / Synapse | Query history, DTU spikes, long-running queries |
| ADF (Data Factory) | Pipeline run history — loops, retries, backfills |
| Databricks | Cluster history — was a cluster left running? |
| App Service | Scale-out events, request volume |
| Azure Functions | Execution count — was something stuck in a loop? |
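For services that ship diagnostics to Log Analytics, you can check run history with a query instead of clicking through the portal. A sketch for Data Factory, assuming the ADFPipelineRun diagnostic table is enabled in your workspace (table and column names depend on your diagnostic settings):

```kusto
// Find pipelines with unusually many runs (possible loop, retry storm, or backfill)
ADFPipelineRun
| where TimeGenerated > ago(7d)
| summarize Runs = count(), Failed = countif(Status == "Failed") by PipelineName
| order by Runs desc
```

A pipeline with thousands of runs in a window where you expect dozens is a strong loop/retry signal.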
Step 4: Check Activity Log
- Monitor → Activity Log
- Filter by the spike timeframe
- Look for:
- New resource deployments
- Scaling events
- Config changes
- Who or what triggered it (user vs service principal)
This answers “what changed?”
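If the Activity Log is routed to a Log Analytics workspace, the same data is queryable as the AzureActivity table. A sketch for isolating administrative changes in the spike window (the dates are placeholders):

```kusto
// What changed during the spike window, and who or what triggered it?
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-03))
| where CategoryValue == "Administrative"
| summarize Operations = count() by Caller, OperationNameValue
| order by Operations desc
```

A service principal in `Caller` points at automation (pipelines, autoscale, IaC runs) rather than a human change.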
Step 5: Check Azure Monitor Metrics
- Go to the specific resource → Metrics
- Look at usage metrics around the spike time:
- CPU / memory
- Data in/out (egress is often the culprit)
- Request count
- DTU / vCore usage
Correlate the metric spike timeline with the cost spike timeline.
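If platform metrics are routed to the workspace via diagnostic settings, the AzureMetrics table lets you chart usage around the spike without opening each resource blade. A sketch using a VM egress metric as an example (metric names vary by resource type):

```kusto
// Hourly network egress around the spike ("Network Out Total" is a VM metric; other services use different names)
AzureMetrics
| where TimeGenerated > ago(7d)
| where MetricName == "Network Out Total"
| summarize TotalBytes = sum(Total) by bin(TimeGenerated, 1h), Resource
| order by TimeGenerated asc
```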
Step 6: Check Logs (Log Analytics / KQL)
If you have a Log Analytics workspace connected:
```kusto
// Example: Find expensive or long-running operations
AzureActivity
| where TimeGenerated between (datetime(2026-04-01) .. datetime(2026-04-11))
| where ActivityStatusValue == "Success"
| summarize count() by OperationNameValue, ResourceGroup
| order by count_ desc
```

```kusto
// Check for VM scaling events
AzureActivity
| where OperationNameValue contains "virtualMachines"
| where TimeGenerated > ago(7d)
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup
```
Step 7: Check for Common Culprits
These are the most frequent causes of unexpected spikes:
- 🔁 Pipeline/job stuck in a loop (ADF, Functions, Logic Apps)
- 💾 Unexpected data egress (cross-region or internet-bound traffic)
- 📈 Auto-scaling that didn’t scale back down
- 🗄️ Full table scan or bad query in SQL/Synapse
- 🖥️ VM or cluster left running after a job
- 📦 Historical data backfill triggered accidentally
- 🔄 Snapshot or backup policy changed
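Several of these culprits leave a trail in the Activity Log. A sketch that surfaces autoscale actions over the last week, assuming AzureActivity is collected, so you can spot a scale-out with no matching scale-in:

```kusto
// Autoscale actions: scan for a scale-out that was never followed by a scale-in
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue contains "autoscale"
| project TimeGenerated, OperationNameValue, ResourceGroup, Caller
| order by TimeGenerated asc
```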
The Mental Model
Cost Analysis (when + what?) → Drill by dimension (which resource?) → Activity Log (what changed?) → Metrics (how did usage behave?) → Logs/KQL (why did it happen?)