Here’s a breakdown of how MCP (Model Context Protocol) agents apply across all three domains:
MCP + Kubernetes Management
What it looks like: An LLM agent connects to a Kubernetes MCP server that exposes kubectl operations as tools. The agent can then:
- list_pods(namespace) → find failing pods
- get_pod_logs(pod, namespace) → fetch logs
- describe_deployment(name) → inspect rollout status
- scale_deployment(name, replicas) → auto-scale
- apply_manifest(yaml) → deploy changes
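A minimal sketch of the tool-dispatch side of such a server, with the kubectl backend injected so it can be stubbed out. The tool names mirror the list above; a real server would implement the MCP protocol and shell out to kubectl or use the official Kubernetes client:

```python
import json
from typing import Callable

class K8sMcpServer:
    """Illustrative tool registry; not a real MCP implementation."""

    def __init__(self, run_kubectl: Callable[[list], str]):
        self._run = run_kubectl
        self.tools = {
            "list_pods": self.list_pods,
            "get_pod_logs": self.get_pod_logs,
        }

    def list_pods(self, namespace: str) -> str:
        return self._run(["get", "pods", "-n", namespace, "-o", "json"])

    def get_pod_logs(self, pod: str, namespace: str, tail: int = 100) -> str:
        return self._run(["logs", pod, "-n", namespace, f"--tail={tail}"])

    def call_tool(self, name: str, **kwargs) -> str:
        # An MCP server dispatches tool calls by name with JSON arguments.
        return self.tools[name](**kwargs)

# Stubbed backend for illustration; a real one would invoke kubectl.
def fake_kubectl(args: list) -> str:
    return json.dumps({"cmd": args})

server = K8sMcpServer(fake_kubectl)
print(server.call_tool("list_pods", namespace="payments"))
```

The injected backend keeps the tool layer testable without a cluster, which is also how most open-source Kubernetes MCP servers separate tool schemas from API calls.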
Real implementations:
- kubectl-ai — natural language to kubectl commands
- Robusta — AI-powered Kubernetes troubleshooting with MCP support
- k8s-mcp-server — open-source MCP server wrapping the Kubernetes API
- OpenShift + ACM — Red Hat is building AI-assisted cluster management leveraging MCP for tool standardization
Example agent workflow:
User: “Why is the payments service degraded?”
Agent → list_pods(namespace="payments")
→ get_pod_logs(pod="payments-7f9b", tail=100)
→ describe_deployment("payments")
→ LLM reasons: "OOMKilled: memory limit too low"
→ Proposes: patch_deployment(memory_limit="1Gi")
→ HITL: “Approve this change?” → Engineer approves
→ apply_patch() → monitors rollout → confirms healthy
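The pivot point in that workflow is the human-in-the-loop gate: the agent may propose freely, but nothing mutates the cluster until an approver says yes. A sketch, with all names (`ProposedChange`, `run_with_approval`) hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedChange:
    description: str
    patch: dict  # e.g. a strategic-merge patch for the deployment

def run_with_approval(change: ProposedChange,
                      approve: Callable[[ProposedChange], bool],
                      apply_patch: Callable[[dict], str]) -> str:
    # The apply step is only reachable through the approval gate.
    if not approve(change):
        return "rejected: no change applied"
    return apply_patch(change.patch)

change = ProposedChange(
    description="OOMKilled: raise memory limit to 1Gi",
    patch={"spec": {"template": {"spec": {"containers": [
        {"name": "payments", "resources": {"limits": {"memory": "1Gi"}}}]}}}},
)
result = run_with_approval(change,
                           approve=lambda c: True,           # engineer approves
                           apply_patch=lambda p: "patched")  # stub for apply_patch()
print(result)  # → patched
```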
MCP + Terraform Pipelines
What it looks like: A Terraform MCP server exposes infrastructure operations. The agent can plan, review, and apply infrastructure changes conversationally.
MCP tools exposed:
- terraform_plan(module, vars) → generate and review a plan
- terraform_apply(plan_id) → apply approved changes
- terraform_state_show(resource) → inspect current state
- terraform_output(name) → read output values
- detect_drift() → compare actual vs declared state
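A detect_drift() tool can lean on `terraform plan -detailed-exitcode`, which exits 0 when state matches, 1 on error, and 2 when real infrastructure differs from the declared configuration. A sketch (the subprocess call assumes terraform is on PATH; only the exit-code interpretation is exercised here):

```python
import subprocess

def interpret_plan_exit(code: int) -> str:
    # Exit codes documented for `terraform plan -detailed-exitcode`.
    return {0: "in-sync", 1: "error", 2: "drift-detected"}.get(code, "unknown")

def detect_drift(workdir: str) -> str:
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return interpret_plan_exit(proc.returncode)
```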
Key use cases:
- Drift detection agent: continuously checks for infrastructure drift and auto-raises PRs to correct it
- Cost optimization agent: analyzes Terraform state, identifies oversized resources, proposes rightsizing
- Compliance agent: scans Terraform plans against OPA/Sentinel policies before apply
- PR review agent: reviews Terraform PRs, flags security misconfigs, suggests improvements
Example pipeline:
PR opened with Terraform changes
│
▼
MCP Terraform Agent
├── terraform_plan() → generates plan
├── scan_security(plan) → checks for open security groups, no encryption
├── estimate_cost(plan) → computes monthly cost delta
├── LLM summarizes: “This adds an unencrypted S3 bucket costing ~$12/mo”
└── Posts review comment to PR with findings + recommendations
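The scan_security step above can be sketched as a walk over the plan JSON that `terraform show -json plan.out` produces: each entry in `resource_changes` carries a `change.after` object describing the planned state. The two rules below are illustrative, not a real policy engine:

```python
def scan_security(plan: dict) -> list:
    """Flag a couple of common misconfigurations in a Terraform plan JSON."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc["type"] == "aws_security_group":
            for rule in after.get("ingress", []):
                if "0.0.0.0/0" in rule.get("cidr_blocks", []):
                    findings.append(f"{rc['address']}: ingress open to the world")
        if rc["type"] == "aws_s3_bucket" and not after.get(
                "server_side_encryption_configuration"):
            findings.append(f"{rc['address']}: S3 bucket without encryption")
    return findings

# Trimmed plan JSON for illustration.
plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"after": {"bucket": "logs"}}},
]}
print(scan_security(plan))  # → ['aws_s3_bucket.logs: S3 bucket without encryption']
```

A production agent would hand this job to OPA or Sentinel, as the compliance use case above suggests; the point is that the plan JSON gives the agent structured input to reason over.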
📊 MCP + Infrastructure Observability
What it looks like: Observability tools (Prometheus, Grafana, Loki, Datadog) are wrapped as MCP servers. The agent queries them in natural language and correlates signals across tools autonomously.
MCP tools exposed:
- query_prometheus(promql, time_range) → fetch metrics
- search_logs(query, service, time_range) → Loki/Elasticsearch
- get_traces(service, error_only) → Jaeger/Tempo
- list_active_alerts() → current firing alerts
- get_dashboard(name) → Grafana snapshot
- create_annotation(text, time) → mark events on dashboards
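Under the hood, query_prometheus is a thin wrapper over Prometheus's HTTP API (`GET /api/v1/query_range` with `query`, `start`, `end`, and `step` parameters). A sketch of the URL-building and response-flattening halves; the server address is a placeholder:

```python
from urllib.parse import urlencode

PROM = "http://prometheus:9090"  # placeholder address

def build_query_url(promql: str, start: float, end: float, step: str) -> str:
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{PROM}/api/v1/query_range?{params}"

def parse_response(body: dict) -> list:
    # Flatten the first series of a range-query result into (ts, value) pairs.
    if body.get("status") != "success":
        raise RuntimeError("query failed")
    result = body["data"]["result"]
    return [(ts, float(v)) for ts, v in result[0]["values"]] if result else []

# Trimmed response in the shape Prometheus returns for a matrix result.
sample = {"status": "success",
          "data": {"resultType": "matrix",
                   "result": [{"metric": {"service": "checkout"},
                               "values": [[1700000000, "0.02"],
                                          [1700000030, "0.05"]]}]}}
print(parse_response(sample))  # → [(1700000000, 0.02), (1700000030, 0.05)]
```

The LLM's job is the part neither function does: turning “error rate for checkout in the last 30 mins” into the PromQL string in the first place.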
Key use cases:
- Natural language observability: “Show me error rate for the checkout service in the last 30 mins” — no PromQL needed
- Automated RCA: agent correlates metrics + logs + traces to pinpoint root cause
- Alert noise reduction: agent groups related alerts, suppresses duplicates, and writes a single incident summary
- Capacity planning: agent queries historical metrics, detects trends, forecasts when resources will be exhausted
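The capacity-planning case reduces to simple math once the agent has the metrics: fit a linear trend to historical usage and project when it crosses the limit. A pure-Python sketch (a real agent would fetch the samples via the metrics tools above):

```python
from typing import List, Optional, Tuple

def time_to_exhaustion(samples: List[Tuple[float, float]],
                       limit: float) -> Optional[float]:
    """samples: (timestamp, usage) pairs. Returns the projected timestamp at
    which usage reaches `limit`, or None if usage is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Least-squares slope of usage over time.
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    return (limit - intercept) / slope

# Disk usage growing 1 GiB/hour from 50 GiB against a 100 GiB limit
# should project exhaustion roughly 50 hours in.
samples = [(h * 3600.0, 50.0 + h) for h in range(24)]
print(time_to_exhaustion(samples, 100.0) / 3600)  # ≈ 50 hours
```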
🔗 How MCP Ties It All Together
The power of MCP is that a single agent can hold tools from all three domains simultaneously:
┌─────────────────────────────────────────────┐
│                  LLM Agent                  │
│              (Claude / GPT-4o)              │
└──────────────────────┬──────────────────────┘
                       │ MCP
          ┌────────────┼────────────┐
          ▼            ▼            ▼
  ┌──────────────┐ ┌───────────┐ ┌──────────────────┐
  │  Kubernetes  │ │ Terraform │ │  Observability   │
  │  MCP Server  │ │ MCP Server│ │    MCP Server    │
  │  (kubectl,   │ │ (plan,    │ │(Prometheus, Loki,│
  │  Helm, ACM)  │ │  apply,   │ │ Grafana, Jaeger) │
  └──────────────┘ │  drift)   │ └──────────────────┘
                   └───────────┘
End-to-end scenario:
- Observability MCP detects CPU spike on node pool
- Agent queries Terraform MCP → finds node group is at max capacity
- Agent queries Kubernetes MCP → confirms pods are pending due to insufficient nodes
- Agent generates Terraform plan to scale node group from 3→5 nodes
- HITL approval → Terraform apply → Kubernetes confirms new nodes joined
- Agent posts incident summary to Slack with full audit trail
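The six steps above can be sketched as one agent loop holding a client per MCP server. Every client here is a stub lambda and every tool name is illustrative; a real agent would issue MCP tool calls over a stdio or HTTP transport:

```python
class IncidentAgent:
    """Illustrative cross-domain agent; clients are injected stubs."""

    def __init__(self, obs, tf, k8s, approve, notify):
        self.obs, self.tf, self.k8s = obs, tf, k8s
        self.approve, self.notify = approve, notify

    def handle_cpu_spike(self) -> str:
        alert = self.obs("list_active_alerts")       # 1. observability signal
        if not self.tf("node_group_at_max"):         # 2. terraform state check
            return "no-action"
        pending = self.k8s("pending_pods")           # 3. kubernetes confirms
        plan = f"scale node group 3→5 ({alert}; {pending} pods pending)"  # 4.
        if not self.approve(plan):                   # 5. HITL gate
            return "rejected"
        self.tf("apply")                             #    terraform apply
        self.notify(f"Resolved: {plan}")             # 6. summary + audit trail
        return "scaled"

messages = []
agent = IncidentAgent(
    obs=lambda tool: "NodeCPUHigh",
    tf=lambda tool: True,        # stub: node group at max / apply succeeds
    k8s=lambda tool: 4,
    approve=lambda plan: True,   # engineer approves
    notify=messages.append,
)
print(agent.handle_cpu_spike())  # → scaled
```

Because the agent only ever sees tool names and JSON arguments, swapping a stub for a live MCP connection changes none of the reasoning loop, which is the standardization argument in a nutshell.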