Building a data pipeline in Azure

Building a data pipeline in Azure using Azure Data Factory (ADF) and Azure Data Lake Storage (ADLS) is the “bread and butter” of modern cloud data engineering. Think of ADLS as your massive digital warehouse and ADF as the conveyor belts and robotic arms moving things around.

Here is the high-level workflow and the steps to get it running.


1. The Architecture

In a typical scenario, you move data from a source (like an on-premises SQL DB or an API) into ADLS, then process it.

Key Components:

  • Linked Services: Your “Connection Strings.” These store the credentials to talk to ADLS or your source.
  • Datasets: These point to specific folders or files within your Linked Service.
  • Pipelines: The logical grouping of activities (the workflow).
  • Activities: The individual actions (e.g., Copy Data, Databricks Notebook, Lookup).

2. Step-by-Step Implementation

Step 1: Set up the Storage (ADLS Gen2)

  1. In the Azure Portal, create a Storage Account.
  2. Crucial: Under the “Advanced” tab, ensure Hierarchical Namespace is enabled. This turns standard Blob storage into ADLS Gen2.
  3. Create a Container (e.g., raw-data).

Step 2: Create the Linked Service in ADF

  1. Open Azure Data Factory Studio.
  2. Go to the Manage tab (toolbox icon) > Linked Services > New.
  3. Search for Azure Data Lake Storage Gen2.
  4. Select your subscription and the storage account you created. Test the connection and click Create.

Step 3: Define your Datasets

You need a “Source” dataset (where data comes from) and a “Sink” dataset (where data goes).

  1. Go to the Author tab (pencil icon) > Datasets > New Dataset.
  2. Select Azure Data Lake Storage Gen2.
  3. Choose the format (Parquet and Delimited Text/CSV are most common).
  4. Point it to the specific file path in your ADLS container.

Step 4: Build the Pipeline

  1. In the Author tab, click the + icon > Pipeline.
  2. From the Activities menu, drag and drop the Copy Data activity onto the canvas.
  3. Source Tab: Select your source dataset.
  4. Sink Tab: Select your ADLS dataset.
  5. Mapping Tab: Click “Import Schemas” to ensure the columns align correctly.

3. Best Practices for ADLS Pipelines

  • Folder Structure: Use a “Medallion Architecture” (Bronze/Raw, Silver/Cleaned, Gold/Aggregated) within your ADLS containers to keep data organized.
  • Triggering: Don’t just run things manually. Use Schedule Triggers (time-based) or Storage Event Triggers (runs automatically when a file drops into ADLS).
  • Parameters: Avoid hardcoding file names. Use Parameters and Dynamic Content so one pipeline can handle multiple different files.

4. Example Formula for Dynamic Paths

If you want to organize your data by date automatically in ADLS, you can use a dynamic expression in the dataset path:

$$dataset().Directory = concat(‘raw/’, formatDateTime(utcNow(), ‘yyyy/MM/dd’))$$

This ensures that every time the pipeline runs, it creates a new folder for that day’s data.

How to audit an MCP server

Auditing a Model Context Protocol (MCP) server in 2026 requires a shift from traditional web auditing to Agentic Security Auditing. Since an LLM acts as the user of your server, you must audit not just the code, but the “instructions” and “boundaries” presented to the AI.

Here is the professional workflow for conducting a comprehensive MCP server audit.


1. Discovery & Tool Inspection

The first step is to see exactly what the AI sees. A malicious agent or a prompt injection can only exploit what is exposed in the tool definitions.

  • Use the MCP Inspector: Run npx @modelcontextprotocol/inspector to launch a local GUI. Connect your server and inspect the Tools tab.
  • Audit Tool Descriptions: Check if the descriptions are too “helpful.”
    • Bad: “This tool runs any bash command.”
    • Good: “This tool lists files in the /public directory only.”
  • Schema Strictness: Ensure every tool uses strict JSON Schema. AI agents are prone to “hallucinating” extra arguments; your server should reject any input that doesn’t perfectly match the schema.

2. Static Analysis (The “Code” Audit)

Since most MCP servers are written in TypeScript or Python, use standard security scanners with MCP-specific rules.

  • Dependency Check: Use npm audit or pip-audit. MCP is a new ecosystem; many early community servers use outdated, vulnerable libraries.
  • Path Traversal Check: This is the #1 vulnerability in MCP (found in 80% of filesystem-based servers).
    • Audit Task: Search your code for fs.readFile or open(). Ensure user-provided paths are sanitized using path.resolve and checked against a “Root” directory.
  • Command Injection: If your tool executes shell commands (e.g., a Git or Docker tool), ensure inputs are passed as arrays, never as strings.
    • Vulnerable: exec("git log " + user_input)
    • Secure: spawn("git", ["log", user_input])

3. Runtime & Behavioral Auditing

In 2026, we use eBPF-based monitoring or MCP Gateways to watch what the server actually does during a session.

  • Sandbox Verification: Run the server in a restricted Docker container. Audit the Dockerfile to ensure it runs as a non-root user (USER node or USER python).
  • Network Egress Audit: Does your server need to talk to the internet? If it’s a “Local File” tool, use firewall rules (or Docker network flags) to block all outgoing traffic. This prevents “Data Exfiltration” where an AI is tricked into sending your files to a remote server.
  • AIVSS Scoring: Use the AI Vulnerability Scoring System (AIVSS) to rank findings. A “Prompt Injection” that leads to a file read is a High; a “Prompt Injection” that leads to a shell execution is Critical.

4. The 2026 Audit Checklist

If you are performing a formal audit, ensure you can check “Yes” to all of the following:

CategoryAudit Check
AuthenticationDoes the server require a token for every request (especially for HTTP transports)?
SanitizationAre all LLM-generated arguments validated against a regex or allowlist?
Least PrivilegeDoes the server only have access to the specific folders/APIs it needs?
Human-in-LoopAre “Write” or “Delete” actions flagged to require manual user approval in the client?
LoggingDoes the server log the User ID, Tool Name, and Arguments for every call?

5. Automated Auditing Tools

To speed up the process, you can use these 2026-standard tools:

  1. mcpserver-audit: A GitHub-hosted tool that scans MCP source code for common dangerous patterns (like unparameterized SQL or open shell calls).
  2. Trivy / Docker Scout: For scanning the container image where your MCP server lives.
  3. Semgrep (MCP Ruleset): Use specialized Semgrep rules designed to find “AI Injection” points in Model Context Protocol implementations.

Multi-Layered Test Plan.

To perform a professional audit of an MCP server in 2026, you should follow a Multi-Layered Test Plan. Since MCP servers act as “Resource Servers” in an agentic ecosystem, your audit must verify that a compromised or malicious AI cannot “break out” of its intended scope.

Here is a 5-step Security Test Plan for an MCP server.


1. Static Analysis: “The Code Review”

Before running the server, scan the source code for common “agent-trap” patterns.

  • Check for shell=True (Python) or exec() (Node.js): These are the most common entry points for Remote Code Execution (RCE).
    • Test: Ensure all CLI tools use argument arrays instead of string concatenation.
  • Path Traversal Audit: Look for any tool that takes a path or filename as an argument.
    • Test: Verify that the code uses path.resolve() and checks if the resulting path starts with an allowed root directory.
    • Common Fail: Using simple string .startsWith() without resolving symlinks first (CVE-2025-53109).

2. Manifest & Metadata Audit

The LLM “sees” your server through its JSON-RPC manifest. If your tool descriptions are vague, the LLM might misuse them.

  • Tool Naming: Ensure tool names use snake_case (e.g., get_user_data) for optimal tokenization and clarity.
  • Prompt Injection Resilience: Check if tool descriptions include “Safety instructions.”
    • Example: “This tool reads files. Safety: Never read files ending in .env or .pem.
  • Annotations: Verify that “destructive” tools (delete, update, send) are marked with destructiveHint: true. This triggers a mandatory confirmation popup in modern MCP clients like Cursor or Claude Desktop.

3. Dynamic “Fuzzing” (The AI Stress Test)

In 2026, we use tools like mcp-sec-audit to “fuzz” the server. This involves sending nonsensical or malicious JSON-RPC payloads to see how the server reacts.

Test ScenarioPayload ExampleExpected Result
Path Traversal{"path": "../../../etc/passwd"}403 Forbidden or Error: Invalid Path
Command Injection{"cmd": "ls; rm -rf /"}The server should treat ; as a literal string, not a command separator.
Resource ExhaustionCalling read_file 100 times in 1 second.Server should trigger Rate Limiting.

4. Sandbox & Infrastructure Audit

An MCP server should never “run naked” on your host machine.

  • Docker Isolation: Audit the Dockerfile. It should use a distroless or minimal image (like alpine) and a non-root user.
  • Network Egress: Use iptables or Docker network policies to block the MCP server from reaching the internet unless its specific function requires it (e.g., a “Web Search” tool).
  • Memory/CPU Limits: Ensure the container has cpus: 0.5 and memory: 512mb limits to prevent a “Looping AI” from crashing your host.

5. OAuth 2.1 & Identity Verification

If your MCP server is shared over a network (HTTP transport), it must follow the June 2025 MCP Auth Spec.

  • PKCE Implementation: Verify that the server requires Proof Key for Code Exchange (PKCE) for all client connections. This prevents “Authorization Code Interception.”
  • Scope Enforcement: If a user only authorized the read_only scope, ensure the server rejects calls to delete_record even if the token is valid.
  • Audit Logging: Every tool call must be logged with:
    1. The user_id who initiated it.
    2. The agent_id that generated the call.
    3. The exact arguments used.

Pro-Tooling for 2026

  • MCP Inspector: Use npx @modelcontextprotocol/inspector for a manual “sanity check” of your tools.
  • Snyk / Trivy: Run these against your MCP server’s repository to catch vulnerable 3rd-party dependencies.

Would you like me to help you write a “Safety Wrapper” in Python or TypeScript that automatically validates all file paths before your MCP server processes them?

MCP security

The Model Context Protocol (MCP) is a powerful “USB-C for AI,” but because it allows LLMs to execute code and access private data, it introduces unique security risks.

In 2026, security for MCP has moved beyond simple API keys to a Zero Trust architecture. Here are the best practices for securing your MCP implementation.


1. The “Human-in-the-Loop” (HITL) Requirement

The most critical defense is ensuring an AI never executes “side-effect” actions (writing, deleting, or sending data) without manual approval.

  • Tiered Permissions: Classify tools into read-only (safe) and sensitive (requires approval).
  • Explicit Confirmation: The MCP client must display the full command and all arguments to the user before execution. Never allow the AI to “hide” parameters.
  • “Don’t Ask Again” Risks: Avoid persistent “allowlists” for bash commands or file writes; instead, scope approvals to a single session or specific directory.

2. Secure Architecture & Isolation

Running an MCP server directly on your host machine is a major risk. If the AI is tricked into running a malicious command, it has the same permissions as you.

  • Containerization: Always run MCP servers in a Docker container or a WebAssembly (Wasm) runtime. This prevents “Path Traversal” attacks where an AI might try to read your ~/.ssh/ folder.
  • Least Privilege: Use a dedicated, unprivileged service account to run the server. If the tool only needs to read one folder, do not give it access to the entire drive.
  • Network Egress: Block the MCP server from accessing the public internet unless it’s strictly necessary for that tool’s function.

3. Defense Against Injection Attacks

MCP is vulnerable to Indirect Prompt Injection, where a malicious instruction is hidden inside data the AI reads (like a poisoned webpage or email).

  • Tool Description Sanitization: Attackers can “poison” tool descriptions to trick the AI into exfiltrating data. Regularly audit the descriptions of third-party MCP servers.
  • Input Validation: Treat all inputs from the LLM as untrusted. Use strict typing (Pydantic/Zod) and regex patterns to ensure the AI isn’t passing malicious flags to a bash command.
  • Semantic Rate Limiting: Use an MCP Gateway to kill connections if an agent attempts to call a “Read File” tool hundreds of times in a few seconds—a classic sign of data exfiltration.

4. Identity & Authentication (2026 Standards)

For remote or enterprise MCP setups, static API keys are no longer sufficient.

  • OAuth 2.1 + PKCE: This is the mandated standard for HTTP-based MCP. It ensures that tokens are bound to specific users and cannot be easily intercepted.
  • Token Scoping: Never use a single “Master Key.” Issue short-lived tokens that are scoped only to the specific MCP tools the user needs.
  • Separation of Roles: Keep your Authorization Server (which identifies the user) separate from your Resource Server (the MCP server). This makes auditing easier and prevents a breach of one from compromising the other.

5. Supply Chain Security

The “Rug Pull” is a common 2026 threat where a popular open-source MCP server is updated with malicious code (e.g., a BCC field added to an email tool).

  • Pin Versions: Never pull the latest version of an MCP server in production. Pin to a specific, audited version or hash.
  • Vetted Registries: Only use servers from trusted sources like the Official MCP Catalog or internally vetted company registries.
  • Audit Logs: Log every tool invocation, including who requested it, what the arguments were, and what the output was.

Summary Checklist for Developers

RiskMitigation
Data ExfiltrationDisable network access for local tools; use PII redaction.
Command InjectionUse argument arrays (parameterized) instead of shell strings.
Unauthorized AccessImplement OAuth 2.1 with scope-based tool control.
Lateral MovementSandbox servers in Docker/Wasm; limit filesystem access.

MCP + Kubernetes Management

Here’s a breakdown of this topic across all three domains:


MCP + Kubernetes Management

What it looks like: An LLM agent connects to a Kubernetes MCP server that exposes kubectl operations as tools. The agent can then:

  • list_pods(namespace) → find failing pods
  • get_pod_logs(pod, namespace) → fetch logs
  • describe_deployment(name) → inspect rollout status
  • scale_deployment(name, replicas) → auto-scale
  • apply_manifest(yaml) → deploy changes

Real implementations:

  • kubectl-ai — natural language to kubectl commands
  • Robusta — AI-powered Kubernetes troubleshooting with MCP support
  • k8s-mcp-server — open-source MCP server wrapping the Kubernetes API
  • OpenShift + ACM — Red Hat is building AI-assisted cluster management leveraging MCP for tool standardization

Example agent workflow:

User: “Why is the payments service degraded?”

Agent →  list_pods(namespace=”payments”)

      →  get_pod_logs(pod=”payments-7f9b”, tail=100)

      →  describe_deployment(“payments”)

      →  LLM reasons: “OOMKilled — memory limit too low”

      →  Proposes: patch_deployment(memory_limit=”1Gi”)

      →  HITL: “Approve this change?” → Engineer approves

      →  apply_patch() → monitors rollout → confirms healthy


MCP + Terraform Pipelines

What it looks like: A Terraform MCP server exposes infrastructure operations. The agent can plan, review, and apply infrastructure changes conversationally.

MCP tools exposed:

  • terraform_plan(module, vars) → generate and review a plan
  • terraform_apply(plan_id) → apply approved changes
  • terraform_state_show(resource) → inspect current state
  • terraform_output(name) → read output values
  • detect_drift() → compare actual vs declared state

Key use cases:

  • Drift detection agent: continuously checks for infrastructure drift and auto-raises PRs to correct it
  • Cost optimization agent: analyzes Terraform state, identifies oversized resources, proposes rightsizing
  • Compliance agent: scans Terraform plans against OPA/Sentinel policies before apply
  • PR review agent: reviews Terraform PRs, flags security misconfigs, suggests improvements

Example pipeline:

PR opened with Terraform changes

       │

       ▼

MCP Terraform Agent

  ├── terraform_plan() → generates plan

  ├── scan_security(plan) → checks for open security groups, no encryption

  ├── estimate_cost(plan) → computes monthly cost delta

  ├── LLM summarizes: “This adds an unencrypted S3 bucket costing ~$12/mo”

  └── Posts review comment to PR with findings + recommendations


📊MCP + Infrastructure Observability

What it looks like: Observability tools (Prometheus, Grafana, Loki, Datadog) are wrapped as MCP servers. The agent queries them in natural language and correlates signals across tools autonomously.

MCP tools exposed:

  • query_prometheus(promql, time_range) → fetch metrics
  • search_logs(query, service, time_range) → Loki/Elasticsearch
  • get_traces(service, error_only) → Jaeger/Tempo
  • list_active_alerts() → current firing alerts
  • get_dashboard(name) → Grafana snapshot
  • create_annotation(text, time) → mark events on dashboards

Key use cases:

  • Natural language observability: “Show me error rate for the checkout service in the last 30 mins” — no PromQL needed
  • Automated RCA: agent correlates metrics + logs + traces to pinpoint root cause
  • Alert noise reduction: agent groups related alerts, suppresses duplicates, and writes a single incident summary
  • Capacity planning: agent queries historical metrics, detects trends, forecasts when resources will be exhausted

🔗 How MCP Ties It All Together

The power of MCP is that a single agent can hold tools from all three domains simultaneously:

┌─────────────────────────────────────────────────────┐

│                   LLM Agent                         │

│              (Claude / GPT-4o)                      │

└────────────────────┬────────────────────────────────┘

                     │ MCP

        ┌────────────┼────────────┐

        ▼            ▼            ▼

┌──────────────┐ ┌──────────┐ ┌──────────────────┐

│  Kubernetes  │ │Terraform │ │  Observability   │

│  MCP Server  │ │ MCP Server│ │   MCP Server     │

│  (kubectl,   │ │(plan,    │ │(Prometheus, Loki,│

│   Helm, ACM) │ │ apply,   │ │ Grafana, Jaeger) │

└──────────────┘ │ drift)   │ └──────────────────┘

                 └──────────┘

End-to-end scenario:

  1. Observability MCP detects CPU spike on node pool
  2. Agent queries Terraform MCP → finds node group is at max capacity
  3. Agent queries Kubernetes MCP → confirms pods are pending due to insufficient nodes
  4. Agent generates Terraform plan to scale node group from 3→5 nodes
  5. HITL approval → Terraform apply → Kubernetes confirms new nodes joined
  6. Agent posts incident summary to Slack with full audit trail

Securing an HPE Ezmeral Data Fabric

Security

Securing an HPE Ezmeral Data Fabric (formerly MapR) Hadoop cluster involves implementing a multi-layered security strategy that covers authentication, authorization, encryption, and monitoring. Below is a comprehensive guide to securing your HPE Ezmeral Hadoop cluster:


1. Authentication

Implement strong authentication mechanisms to ensure that only authorized users and applications can access the cluster.

  • Kerberos Integration:
    • Use Kerberos for secure authentication of users and services.
    • Configure Kerberos key distribution centers (KDCs) and set up service principals for all Hadoop components.
  • LDAP/AD Integration:
    • Integrate the cluster with LDAP or Active Directory (AD) for centralized user authentication.
    • Use Pluggable Authentication Modules (PAM) to synchronize user credentials.
  • Token-based Authentication:
    • Enable token-based authentication for inter-service communication to enhance security and reduce Kerberos dependency.

2. Authorization

Implement role-based access control (RBAC) to manage user and application permissions.

  • Access Control Lists (ACLs):
    • Configure ACLs for Hadoop Distributed File System (HDFS), YARN, and other services.
    • Restrict access to sensitive data directories.
  • Apache Ranger Integration:
    • Use Apache Ranger for centralized authorization management.
    • Define fine-grained policies for HDFS, Hive, and other components.
  • Group-based Permissions:
    • Assign users to appropriate groups and define group-level permissions for ease of management.

3. Encryption

Protect data at rest and in transit to prevent unauthorized access.

  • Data-at-Rest Encryption:
    • Use dm-crypt/LUKS for disk-level encryption of storage volumes.
    • Enable HDFS Transparent Data Encryption (TDE) for encrypting data blocks.
  • Data-in-Transit Encryption:
    • Configure TLS/SSL for all inter-service communication.
    • Use certificates signed by a trusted certificate authority (CA).
  • Key Management:
    • Implement a secure key management system, such as HPE Ezmeral Data Fabric’s built-in key management service or an external solution like HashiCorp Vault.

4. Network Security

Restrict network access to the cluster and its services.

  • Firewall Rules:
    • Limit inbound and outbound traffic to required ports only.
    • Use network segmentation to isolate the Hadoop cluster.
  • Private Networking:
    • Deploy the cluster in a private network (e.g., VPC on AWS or Azure).
    • Use VPN or Direct Connect for secure remote access.
  • Gateway Nodes:
    • Restrict direct access to Hadoop cluster nodes by using gateway or edge nodes.

5. Auditing and Monitoring

Monitor cluster activity and audit logs to detect and respond to security incidents.

  • Log Management:
    • Enable and centralize audit logging for HDFS, YARN, Hive, and other components.
    • Use tools like Splunk, Elasticsearch, or Fluentd for log aggregation and analysis.
  • Intrusion Detection:
    • Deploy intrusion detection systems (IDS) or intrusion prevention systems (IPS) to monitor network traffic.
  • Real-time Alerts:
    • Set up alerts for anomalous activities using monitoring tools like Prometheus, Grafana, or Nagios.

6. Secure Cluster Configuration

Ensure that the cluster components are securely configured.

  • Hadoop Configuration Files:
    • Disable unnecessary services and ports.
    • Set secure defaults for core-site.xml, hdfs-site.xml, and yarn-site.xml.
  • Service Accounts:
    • Run Hadoop services under dedicated user accounts with minimal privileges.
  • Regular Updates:
    • Keep the Hadoop distribution and all dependencies updated with the latest security patches.

7. User Security Awareness

Educate users on secure practices.

  • Strong Passwords:
    • Enforce password complexity requirements and periodic password changes.
  • Access Reviews:
    • Conduct regular access reviews to ensure that only authorized users have access.
  • Security Training:
    • Provide security awareness training to users and administrators.

8. Backup and Disaster Recovery

Ensure the availability and integrity of your data.

  • Backup Policy:
    • Regularly back up metadata and critical data to secure storage.
  • Disaster Recovery:
    • Implement a disaster recovery plan with off-site replication.

9. Compliance

Ensure the cluster complies with industry standards and regulations.

  • Data Protection Regulations:
    • Adhere to GDPR, HIPAA, PCI DSS, or other relevant standards.
    • Implement data masking and anonymization where required.
  • Third-party Audits:
    • Conduct periodic security assessments and audits.

By following these practices, you can ensure a robust security posture for your HPE Ezmeral Hadoop cluster.