Monitoring AKS in 2026 has moved beyond simple “health checks.” It now centers on a Unified Observability strategy that combines infrastructure metrics, high-cardinality logs, and application traces.
To provide top-tier support, you should propose a two-pronged approach: Azure Native for the platform and Managed Prometheus/Grafana for the microservices.
1. The 2026 Monitoring Stack (The “Big Three”)
| Tool | Purpose | What to Propose |
| --- | --- | --- |
| Container Insights | Platform Health. Focuses on Inventory, Node CPU/Memory, and K8s Events. | “We’ll use this for the ‘Golden Signals’ of the cluster infrastructure.” |
| Managed Prometheus | Workload Metrics. High-resolution scraping of your microservices (Pod-level). | “This gives us deep visibility into app-specific metrics like ‘Orders Processed’.” |
| Managed Grafana | Visualization. The single pane of glass for both Azure and Prometheus data. | “I’ll build executive dashboards to show uptime and developer dashboards for debugging.” |
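The Prometheus and Grafana halves of this stack can be stood up in Terraform alongside the cluster. A minimal sketch, assuming an existing resource group referenced as `azurerm_resource_group.aks_rg`; the resource names (`amw-aks-prod-01`, `graf-aks-prod-01`) are illustrative:

```hcl
# Azure Monitor workspace: the managed Prometheus metrics store
resource "azurerm_monitor_workspace" "prom" {
  name                = "amw-aks-prod-01"
  resource_group_name = azurerm_resource_group.aks_rg.name
  location            = azurerm_resource_group.aks_rg.location
}

# Managed Grafana, wired to the Prometheus workspace as a data source
resource "azurerm_dashboard_grafana" "grafana" {
  name                  = "graf-aks-prod-01"
  resource_group_name   = azurerm_resource_group.aks_rg.name
  location              = azurerm_resource_group.aks_rg.location
  grafana_major_version = 10 # check the versions your provider release supports

  azure_monitor_workspace_integrations {
    resource_id = azurerm_monitor_workspace.prom.id
  }
}
```

With the integration block in place, the Prometheus workspace shows up as a pre-configured data source in Grafana, so dashboards can mix Azure Monitor and Prometheus panels.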
2. Advanced Log Management (Cost & Performance)
Log ingestion is usually the most expensive part of AKS support. As of 2026, Microsoft has introduced “Basic Logs” to help.
- Log Analytics (ContainerLogV2): Propose migrating to the ContainerLogV2 schema. It is faster and supports Basic Logs, which can reduce costs by up to 50-70% for high-volume logs you don’t query often.
- Diagnostic Settings: Enable these to capture Control Plane logs (API Server, Scheduler). Without them, you are blind to why a cluster upgrade failed or who deleted a namespace.
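The Basic Logs switch is a per-table setting on the workspace, so it can be codified too. A sketch using the azurerm provider’s `azurerm_log_analytics_workspace_table` resource, assuming a workspace defined elsewhere as `azurerm_log_analytics_workspace.aks_monitor`:

```hcl
# Move the high-volume container stdout/stderr table to the cheaper
# Basic tier. Basic tables trade interactive query features and long
# retention for a much lower ingestion price.
resource "azurerm_log_analytics_workspace_table" "container_log_v2" {
  workspace_id = azurerm_log_analytics_workspace.aks_monitor.id
  name         = "ContainerLogV2"
  plan         = "Basic"
}
```

Keep tables you query interactively (audit logs, control plane logs) on the default Analytics plan; only demote the high-volume, rarely queried ones.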
3. Network Observability (The 2026 Add-on)
Microsoft recently standardized the Network Observability add-on. This is a great “upsell” for security-conscious clients.
- What it does: It tracks “East-West” traffic (pod-to-pod).
- The Pitch: “If Service A can’t talk to Service B, I can tell you in 30 seconds if it’s a network drop, a DNS failure, or a security policy block.”
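The add-on reports into the managed Prometheus pipeline, so that pipeline must be enabled on the cluster first. A sketch of the relevant block inside your `azurerm_kubernetes_cluster` resource; note that, depending on your azurerm provider version, the network observability feature itself may still need to be switched on via the Azure CLI (treat the exact flag as something to verify against current docs):

```hcl
# Inside the azurerm_kubernetes_cluster resource: enables the managed
# Prometheus metrics pipeline that the Network Observability add-on
# publishes its pod-to-pod, DNS, and drop metrics into.
monitor_metrics {}
```

Once enabled, the node-level network metrics land in the Azure Monitor workspace and can be charted in Managed Grafana next to your application metrics.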
4. Proactive Alerting Strategy
Don’t just alert on “CPU > 80%.” That leads to alert fatigue. Instead, propose Service Level Objective (SLO) based alerting:
- Latency Alerts: “Alert if 5% of requests take longer than 2 seconds.”
- OOMKill Detection: “Alert if any pod in the production namespace is killed due to memory limits.”
- Disk Pressure: “Alert if node local storage is at 85% to prevent the ‘DiskPressure’ taint from evicting pods.”
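Rules like these can live in code as well. A sketch using the azurerm `azurerm_monitor_alert_prometheus_rule_group` resource; the cluster name and threshold are illustrative, the `scopes` reference assumes an `azurerm_monitor_workspace` named `prom` exists in your configuration, and the metric comes from kube-state-metrics, which Managed Prometheus scrapes by default:

```hcl
resource "azurerm_monitor_alert_prometheus_rule_group" "slo_alerts" {
  name                = "aks-slo-alerts"
  resource_group_name = azurerm_resource_group.aks_rg.name
  location            = azurerm_resource_group.aks_rg.location
  cluster_name        = "aks-prod"                          # illustrative
  scopes              = [azurerm_monitor_workspace.prom.id] # assumed workspace

  # OOMKill detection: fires when a production container's last
  # termination reason was OOMKilled and it stays that way for 5 minutes.
  rule {
    alert      = "PodOOMKilled"
    expression = "max by (pod) (kube_pod_container_status_last_terminated_reason{namespace=\"production\", reason=\"OOMKilled\"}) == 1"
    for        = "PT5M"
    severity   = 2
  }
}
```

The latency and disk-pressure alerts follow the same shape: one `rule` block each, with a PromQL expression over your request-duration histogram or `kubelet_volume_stats` metrics.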
The “SRE” Proposal Snippet
If you want to present this to your company, try this:
“To ensure 99.9% availability for our microservices, I propose implementing the Azure Managed Observability Stack. By moving our logs to the V2 schema and implementing Managed Prometheus, we can reduce our monitoring costs by roughly 30% while gaining the ability to trace transactions across our entire microservice mesh. This transforms our support from ‘fixing breaks’ to ‘preventing downtime’.”
If you are only looking at logs, you are seeing what happened (the “post-mortem”), but you are missing where the bottleneck is and how the services interact.
To take your support to the next level, you should propose moving toward Distributed Tracing. This is the “holy grail” of microservices support.
1. The Missing Piece: Distributed Tracing
When a user says “the app is slow,” logs usually show a bunch of successful 200 OK messages across five different services. You have no way of knowing which of those five services added the 3-second delay.
The Solution: OpenTelemetry (OTel)
By implementing OpenTelemetry (the industry standard in 2026), you give every request a “Trace ID.”
- Trace: The entire journey of a request from the user’s click to the database and back.
- Span: The time spent inside a single microservice or database call.
2. Azure Application Insights (The Easy Win)
Since you are on AKS, the fastest way to get tracing is Application Insights.
- Application Map: This is a live, auto-generated visual of your entire microservice architecture. It shows which services are talking to which, and highlights red links where errors or high latency are occurring.
- No-Code Instrumentation: For many Linux/Docker apps (especially .NET, Java, and Python), you can enable tracing without changing a single line of code by using the Azure Monitor OpenTelemetry Distro or a “Sidecar” container.
3. How to Propose “Distributed Tracing” as a Service
This is a major value-add. Frame it as “Reducing Mean Time to Recovery (MTTR).”
The Pitch:
“Currently, when a performance issue occurs, we have to manually comb through logs across multiple containers to find the root cause. I propose implementing Distributed Tracing. This will give us a visual ‘Application Map’ of our microservices, allowing us to pinpoint exactly which service or database query is slowing down the system in seconds rather than hours.”
4. Practical Implementation Plan
If they say “Yes,” here is your 3-step rollout:
- Infrastructure: Add the Application Insights resource via your Terraform code.
- Instrumentation: Add an OpenTelemetry “Sidecar” to your Docker deployments (or use the Azure Monitor AKS add-on).
- Dashboarding: Create a “Latency Heatmap” in Azure Managed Grafana that pulls data from App Insights.
The “SRE” Comparison
To show your client you know your stuff, use this comparison:
- Logs: Tell you the “What” (e.g., “Database connection failed”).
- Metrics: Tell you the “When” (e.g., “CPU spiked at 3 PM”).
- Tracing: Tells you the “Where” (e.g., “The delay is happening specifically in the Authentication Service’s call to the User DB”).
To move from basic logs to full distributed tracing, we need to add Application Insights and a Log Analytics Workspace to your Terraform configuration.
By 2026, the standard is to use Workspace-based Application Insights, which stores all its data in a centralized Log Analytics workspace. This makes it easier to query both your system logs and your application traces in one place.
1. The Terraform Code (Infrastructure)
Add this to your Terraform files. This creates the monitoring “bucket” and the tracing “engine.”
```hcl
# 1. Create the Log Analytics Workspace (the storage)
resource "azurerm_log_analytics_workspace" "aks_monitor" {
  name                = "law-aks-prod-01"
  location            = azurerm_resource_group.aks_rg.location
  resource_group_name = azurerm_resource_group.aks_rg.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# 2. Create Application Insights (the tracing engine)
resource "azurerm_application_insights" "aks_app_insights" {
  name                = "ai-microservices-prod"
  location            = azurerm_resource_group.aks_rg.location
  resource_group_name = azurerm_resource_group.aks_rg.name
  workspace_id        = azurerm_log_analytics_workspace.aks_monitor.id
  application_type    = "web"
}

# 3. Output the Connection String (your apps need this to send data)
output "app_insights_connection_string" {
  value     = azurerm_application_insights.aks_app_insights.connection_string
  sensitive = true
}
```
2. The “Last Mile”: Connecting your Docker Apps
Once the infrastructure is built, your microservices need to know where to send their traces. In 2026, we use Autoinstrumentation so you don’t have to change your code.
Step A: Enable the AKS Monitoring Add-on
In your azurerm_kubernetes_cluster resource, add this block:
```hcl
oms_agent {
  log_analytics_workspace_id      = azurerm_log_analytics_workspace.aks_monitor.id
  msi_auth_for_monitoring_enabled = true
}
```
Step B: Inject the Connection String into your Pods
In your Kubernetes deployment YAML, add the Connection String as an environment variable (best practice: pull this from the Azure Key Vault we set up earlier):
```yaml
env:
  - name: APPLICATIONINSIGHTS_CONNECTION_STRING
    value: "InstrumentationKey=xxxx-xxxx-xxxx;IngestionEndpoint=..."
```
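To follow the Key Vault best practice mentioned above, the connection string can be written into the vault directly from Terraform, so it never appears in YAML or source control. A sketch, where `azurerm_key_vault.main` is an assumed reference to the vault defined elsewhere in your configuration:

```hcl
# Store the App Insights connection string as a Key Vault secret instead
# of hardcoding it in deployment manifests. "azurerm_key_vault.main" is
# a placeholder for your existing vault resource.
resource "azurerm_key_vault_secret" "appinsights_conn" {
  name         = "appinsights-connection-string"
  value        = azurerm_application_insights.aks_app_insights.connection_string
  key_vault_id = azurerm_key_vault.main.id
}
```

Pods then pull the value at runtime via the Key Vault CSI driver (or an external-secrets operator) rather than a literal `value:` in the manifest.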
3. What this looks like for the Client
Once this is running, you can show the client two powerful views in the Azure Portal:
- The Application Map: A live, visual diagram showing how Service A calls Service B. If a service turns red, it means it’s failing. If the line is thick/slow, it’s a bottleneck.
- End-to-End Transaction Details: You can click on a single failed request and see exactly where it died—whether it was a code error in the container or a slow query in the database.
The Proposal Pitch
“Currently, we have ‘blind spots’ between our services. By implementing this Terraform-backed tracing, we can move from reactive log-hunting to Visual Troubleshooting. We’ll be able to see exactly how requests flow through our Docker containers, allowing us to fix performance issues before they impact the end-user.”