Here’s a breakdown of how MCP (Model Context Protocol) agents apply across all three domains:
MCP + Kubernetes Management
What it looks like: An LLM agent connects to a Kubernetes MCP server that exposes kubectl operations as tools. The agent can then:
- list_pods(namespace) → find failing pods
- get_pod_logs(pod, namespace) → fetch logs
- describe_deployment(name) → inspect rollout status
- scale_deployment(name, replicas) → auto-scale
- apply_manifest(yaml) → deploy changes
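A minimal sketch of the tool-dispatch side of such a server, with the kubectl backend injected so it can be stubbed out. The tool names mirror the list above; a real server would implement the MCP protocol and shell out to kubectl or use the official Kubernetes client:

```python
import json
from typing import Callable

class K8sMcpServer:
    """Illustrative tool registry; not a real MCP implementation."""

    def __init__(self, run_kubectl: Callable[[list], str]):
        self._run = run_kubectl
        self.tools = {
            "list_pods": self.list_pods,
            "get_pod_logs": self.get_pod_logs,
        }

    def list_pods(self, namespace: str) -> str:
        return self._run(["get", "pods", "-n", namespace, "-o", "json"])

    def get_pod_logs(self, pod: str, namespace: str, tail: int = 100) -> str:
        return self._run(["logs", pod, "-n", namespace, f"--tail={tail}"])

    def call_tool(self, name: str, **kwargs) -> str:
        # An MCP server dispatches tool calls by name with JSON arguments.
        return self.tools[name](**kwargs)

# Stubbed backend for illustration; a real one would invoke kubectl.
def fake_kubectl(args: list) -> str:
    return json.dumps({"cmd": args})

server = K8sMcpServer(fake_kubectl)
print(server.call_tool("list_pods", namespace="payments"))
```

The injected backend keeps the tool layer testable without a cluster, which is also how most open-source Kubernetes MCP servers separate tool schemas from API calls.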
Real implementations:
- kubectl-ai — natural language to kubectl commands
- Robusta — AI-powered Kubernetes troubleshooting with MCP support
- k8s-mcp-server — open-source MCP server wrapping the Kubernetes API
- OpenShift + ACM — Red Hat is building AI-assisted cluster management leveraging MCP for tool standardization
Example agent workflow:
User: “Why is the payments service degraded?”
Agent → list_pods(namespace="payments")
→ get_pod_logs(pod="payments-7f9b", tail=100)
→ describe_deployment("payments")
→ LLM reasons: "OOMKilled: memory limit too low"
→ Proposes: patch_deployment(memory_limit="1Gi")
→ HITL: “Approve this change?” → Engineer approves
→ apply_patch() → monitors rollout → confirms healthy
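The pivot point in that workflow is the human-in-the-loop gate: the agent may propose freely, but nothing mutates the cluster until an approver says yes. A sketch, with all names (`ProposedChange`, `run_with_approval`) hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedChange:
    description: str
    patch: dict  # e.g. a strategic-merge patch for the deployment

def run_with_approval(change: ProposedChange,
                      approve: Callable[[ProposedChange], bool],
                      apply_patch: Callable[[dict], str]) -> str:
    # The apply step is only reachable through the approval gate.
    if not approve(change):
        return "rejected: no change applied"
    return apply_patch(change.patch)

change = ProposedChange(
    description="OOMKilled: raise memory limit to 1Gi",
    patch={"spec": {"template": {"spec": {"containers": [
        {"name": "payments", "resources": {"limits": {"memory": "1Gi"}}}]}}}},
)
result = run_with_approval(change,
                           approve=lambda c: True,           # engineer approves
                           apply_patch=lambda p: "patched")  # stub for apply_patch()
print(result)  # → patched
```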
MCP + Terraform Pipelines
What it looks like: A Terraform MCP server exposes infrastructure operations. The agent can plan, review, and apply infrastructure changes conversationally.
MCP tools exposed:
- terraform_plan(module, vars) → generate and review a plan
- terraform_apply(plan_id) → apply approved changes
- terraform_state_show(resource) → inspect current state
- terraform_output(name) → read output values
- detect_drift() → compare actual vs declared state
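A detect_drift() tool can lean on `terraform plan -detailed-exitcode`, which exits 0 when state matches, 1 on error, and 2 when real infrastructure differs from the declared configuration. A sketch (the subprocess call assumes terraform is on PATH; only the exit-code interpretation is exercised here):

```python
import subprocess

def interpret_plan_exit(code: int) -> str:
    # Exit codes documented for `terraform plan -detailed-exitcode`.
    return {0: "in-sync", 1: "error", 2: "drift-detected"}.get(code, "unknown")

def detect_drift(workdir: str) -> str:
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return interpret_plan_exit(proc.returncode)
```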
Key use cases:
- Drift detection agent: continuously checks for infrastructure drift and auto-raises PRs to correct it
- Cost optimization agent: analyzes Terraform state, identifies oversized resources, proposes rightsizing
- Compliance agent: scans Terraform plans against OPA/Sentinel policies before apply
- PR review agent: reviews Terraform PRs, flags security misconfigs, suggests improvements
Example pipeline:
PR opened with Terraform changes
│
▼
MCP Terraform Agent
├── terraform_plan() → generates plan
├── scan_security(plan) → checks for open security groups, no encryption
├── estimate_cost(plan) → computes monthly cost delta
├── LLM summarizes: “This adds an unencrypted S3 bucket costing ~$12/mo”
└── Posts review comment to PR with findings + recommendations
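The scan_security step above can be sketched as a walk over the plan JSON that `terraform show -json plan.out` produces: each entry in `resource_changes` carries a `change.after` object describing the planned state. The two rules below are illustrative, not a real policy engine:

```python
def scan_security(plan: dict) -> list:
    """Flag a couple of common misconfigurations in a Terraform plan JSON."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc["type"] == "aws_security_group":
            for rule in after.get("ingress", []):
                if "0.0.0.0/0" in rule.get("cidr_blocks", []):
                    findings.append(f"{rc['address']}: ingress open to the world")
        if rc["type"] == "aws_s3_bucket" and not after.get(
                "server_side_encryption_configuration"):
            findings.append(f"{rc['address']}: S3 bucket without encryption")
    return findings

# Trimmed plan JSON for illustration.
plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"after": {"bucket": "logs"}}},
]}
print(scan_security(plan))  # → ['aws_s3_bucket.logs: S3 bucket without encryption']
```

A production agent would hand this job to OPA or Sentinel, as the compliance use case above suggests; the point is that the plan JSON gives the agent structured input to reason over.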
📊 MCP + Infrastructure Observability
What it looks like: Observability tools (Prometheus, Grafana, Loki, Datadog) are wrapped as MCP servers. The agent queries them in natural language and correlates signals across tools autonomously.
MCP tools exposed:
- query_prometheus(promql, time_range) → fetch metrics
- search_logs(query, service, time_range) → Loki/Elasticsearch
- get_traces(service, error_only) → Jaeger/Tempo
- list_active_alerts() → current firing alerts
- get_dashboard(name) → Grafana snapshot
- create_annotation(text, time) → mark events on dashboards
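Under the hood, query_prometheus is a thin wrapper over Prometheus's HTTP API (`GET /api/v1/query_range` with `query`, `start`, `end`, and `step` parameters). A sketch of the URL-building and response-flattening halves; the server address is a placeholder:

```python
from urllib.parse import urlencode

PROM = "http://prometheus:9090"  # placeholder address

def build_query_url(promql: str, start: float, end: float, step: str) -> str:
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{PROM}/api/v1/query_range?{params}"

def parse_response(body: dict) -> list:
    # Flatten the first series of a range-query result into (ts, value) pairs.
    if body.get("status") != "success":
        raise RuntimeError("query failed")
    result = body["data"]["result"]
    return [(ts, float(v)) for ts, v in result[0]["values"]] if result else []

# Trimmed response in the shape Prometheus returns for a matrix result.
sample = {"status": "success",
          "data": {"resultType": "matrix",
                   "result": [{"metric": {"service": "checkout"},
                               "values": [[1700000000, "0.02"],
                                          [1700000030, "0.05"]]}]}}
print(parse_response(sample))  # → [(1700000000, 0.02), (1700000030, 0.05)]
```

The LLM's job is the part neither function does: turning “error rate for checkout in the last 30 mins” into the PromQL string in the first place.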
Key use cases:
- Natural language observability: “Show me error rate for the checkout service in the last 30 mins” — no PromQL needed
- Automated RCA: agent correlates metrics + logs + traces to pinpoint root cause
- Alert noise reduction: agent groups related alerts, suppresses duplicates, and writes a single incident summary
- Capacity planning: agent queries historical metrics, detects trends, forecasts when resources will be exhausted
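The capacity-planning case reduces to simple math once the agent has the metrics: fit a linear trend to historical usage and project when it crosses the limit. A pure-Python sketch (a real agent would fetch the samples via the metrics tools above):

```python
from typing import List, Optional, Tuple

def time_to_exhaustion(samples: List[Tuple[float, float]],
                       limit: float) -> Optional[float]:
    """samples: (timestamp, usage) pairs. Returns the projected timestamp at
    which usage reaches `limit`, or None if usage is flat or shrinking."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Least-squares slope of usage over time.
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    return (limit - intercept) / slope

# Disk usage growing 1 GiB/hour from 50 GiB against a 100 GiB limit
# should project exhaustion roughly 50 hours in.
samples = [(h * 3600.0, 50.0 + h) for h in range(24)]
print(time_to_exhaustion(samples, 100.0) / 3600)  # ≈ 50 hours
```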
🔗 How MCP Ties It All Together
The power of MCP is that a single agent can hold tools from all three domains simultaneously:
┌─────────────────────────────────────────────┐
│                  LLM Agent                  │
│              (Claude / GPT-4o)              │
└──────────────────────┬──────────────────────┘
                       │ MCP
          ┌────────────┼────────────┐
          ▼            ▼            ▼
  ┌──────────────┐ ┌───────────┐ ┌──────────────────┐
  │  Kubernetes  │ │ Terraform │ │  Observability   │
  │  MCP Server  │ │ MCP Server│ │    MCP Server    │
  │  (kubectl,   │ │ (plan,    │ │(Prometheus, Loki,│
  │  Helm, ACM)  │ │  apply,   │ │ Grafana, Jaeger) │
  └──────────────┘ │  drift)   │ └──────────────────┘
                   └───────────┘
End-to-end scenario:
- Observability MCP detects CPU spike on node pool
- Agent queries Terraform MCP → finds node group is at max capacity
- Agent queries Kubernetes MCP → confirms pods are pending due to insufficient nodes
- Agent generates Terraform plan to scale node group from 3→5 nodes
- HITL approval → Terraform apply → Kubernetes confirms new nodes joined
- Agent posts incident summary to Slack with full audit trail
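The six steps above can be sketched as one agent loop holding a client per MCP server. Every client here is a stub lambda and every tool name is illustrative; a real agent would issue MCP tool calls over a stdio or HTTP transport:

```python
class IncidentAgent:
    """Illustrative cross-domain agent; clients are injected stubs."""

    def __init__(self, obs, tf, k8s, approve, notify):
        self.obs, self.tf, self.k8s = obs, tf, k8s
        self.approve, self.notify = approve, notify

    def handle_cpu_spike(self) -> str:
        alert = self.obs("list_active_alerts")       # 1. observability signal
        if not self.tf("node_group_at_max"):         # 2. terraform state check
            return "no-action"
        pending = self.k8s("pending_pods")           # 3. kubernetes confirms
        plan = f"scale node group 3→5 ({alert}; {pending} pods pending)"  # 4.
        if not self.approve(plan):                   # 5. HITL gate
            return "rejected"
        self.tf("apply")                             #    terraform apply
        self.notify(f"Resolved: {plan}")             # 6. summary + audit trail
        return "scaled"

messages = []
agent = IncidentAgent(
    obs=lambda tool: "NodeCPUHigh",
    tf=lambda tool: True,        # stub: node group at max / apply succeeds
    k8s=lambda tool: 4,
    approve=lambda plan: True,   # engineer approves
    notify=messages.append,
)
print(agent.handle_cpu_spike())  # → scaled
```

Because the agent only ever sees tool names and JSON arguments, swapping a stub for a live MCP connection changes none of the reasoning loop, which is the standardization argument in a nutshell.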