Building a data pipeline in Azure

Building a data pipeline in Azure using Azure Data Factory (ADF) and Azure Data Lake Storage (ADLS) is the “bread and butter” of modern cloud data engineering. Think of ADLS as your massive digital warehouse and ADF as the conveyor belts and robotic arms moving things around.

Here is the high-level workflow and the steps to get it running.


1. The Architecture

In a typical scenario, you move data from a source (like an on-premises SQL DB or an API) into ADLS, then process it.

Key Components:

  • Linked Services: Your “Connection Strings.” These store the credentials to talk to ADLS or your source.
  • Datasets: These point to specific folders or files within your Linked Service.
  • Pipelines: The logical grouping of activities (the workflow).
  • Activities: The individual actions (e.g., Copy Data, Databricks Notebook, Lookup).
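These components chain together by reference: a dataset names its linked service, and an activity names its datasets. The fragments below are purely illustrative (not complete, deployable definitions), and names like LS_ADLS and DS_Source are placeholders:

```json
{
  "datasetFragment": {
    "linkedServiceName": { "referenceName": "LS_ADLS", "type": "LinkedServiceReference" }
  },
  "activityFragment": {
    "inputs": [ { "referenceName": "DS_Source", "type": "DatasetReference" } ]
  }
}
```

The outer keys ("datasetFragment", "activityFragment") are just labels for this sketch; in real definitions, each snippet lives inside its own dataset or pipeline JSON document.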

2. Step-by-Step Implementation

Step 1: Set up the Storage (ADLS Gen2)

  1. In the Azure Portal, create a Storage Account.
  2. Crucial: Under the “Advanced” tab, ensure Hierarchical Namespace is enabled. This turns standard Blob storage into ADLS Gen2.
  3. Create a Container (e.g., raw-data).
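If you prefer infrastructure-as-code over the Portal, the same account can be declared in an ARM template. The property that flips on Hierarchical Namespace is isHnsEnabled. A minimal sketch, where the account name, location, and SKU are placeholders you would change:

```json
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2023-01-01",
  "name": "mydatalakeacct",
  "location": "eastus",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "properties": {
    "isHnsEnabled": true
  }
}
```

Note that isHnsEnabled can only be set at creation time; you cannot toggle an existing standard Blob account into ADLS Gen2 this way.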

Step 2: Create the Linked Service in ADF

  1. Open Azure Data Factory Studio.
  2. Go to the Manage tab (toolbox icon) > Linked Services > New.
  3. Search for Azure Data Lake Storage Gen2.
  4. Select your subscription and the storage account you created. Test the connection and click Create.
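Behind the Studio UI, a linked service is stored as JSON. A minimal sketch for an ADLS Gen2 linked service (type AzureBlobFS) that authenticates with the factory's system-assigned managed identity, where the account URL is a placeholder:

```json
{
  "name": "LS_ADLS",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://mydatalakeacct.dfs.core.windows.net"
    }
  }
}
```

With managed identity auth, no secret is stored here; instead, you grant the factory's identity a role such as Storage Blob Data Contributor on the storage account.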

Step 3: Define your Datasets

You need a “Source” dataset (where data comes from) and a “Sink” dataset (where data goes).

  1. Go to the Author tab (pencil icon) > Datasets > New Dataset.
  2. Select Azure Data Lake Storage Gen2.
  3. Choose the format (Parquet and Delimited Text/CSV are most common).
  4. Point it to the specific file path in your ADLS container.
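The resulting dataset is also just JSON under the hood. A sketch of a Parquet dataset pointing at the raw-data container, where the dataset name, folder, and file are placeholders:

```json
{
  "name": "DS_RawParquet",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS",
      "type": "LinkedServiceReference"
    },
    "type": "Parquet",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw-data",
        "folderPath": "sales",
        "fileName": "orders.parquet"
      }
    }
  }
}
```

For CSV you would use "type": "DelimitedText" instead, with extra typeProperties such as columnDelimiter and firstRowAsHeader.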

Step 4: Build the Pipeline

  1. In the Author tab, click the + icon > Pipeline.
  2. From the Activities menu, drag and drop the Copy Data activity onto the canvas.
  3. Source Tab: Select your source dataset.
  4. Sink Tab: Select your ADLS dataset.
  5. Mapping Tab: Click “Import Schemas” to ensure the columns align correctly.
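The canvas work above produces a pipeline definition like the following sketch, assuming an Azure SQL source; the pipeline, activity, and dataset names are placeholders:

```json
{
  "name": "PL_CopyToLake",
  "properties": {
    "activities": [
      {
        "name": "CopySourceToRaw",
        "type": "Copy",
        "inputs": [
          { "referenceName": "DS_SourceTable", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DS_RawParquet", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

The source and sink types must match the dataset formats; a different source (e.g., on-premises SQL Server via a self-hosted integration runtime) would use a different source type.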

3. Best Practices for ADLS Pipelines

  • Folder Structure: Use a “Medallion Architecture” (Bronze/Raw, Silver/Cleaned, Gold/Aggregated) within your ADLS containers to keep data organized.
  • Triggering: Don’t just run things manually. Use Schedule Triggers (time-based) or Storage Event Triggers (runs automatically when a file drops into ADLS).
  • Parameters: Avoid hardcoding file names. Use Parameters and Dynamic Content so one pipeline can handle multiple different files.
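As one example of event-driven triggering, a Storage Event Trigger is defined with type BlobEventsTrigger. A sketch, where the trigger name, path, and pipeline reference are placeholders and the subscription/resource-group segments of the scope are deliberately elided:

```json
{
  "name": "TR_OnFileArrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw-data/blobs/incoming/",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Storage/storageAccounts/mydatalakeacct"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_CopyToLake",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

This fires the pipeline automatically whenever a new blob lands under incoming/ in the raw-data container.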

4. Example Expression for Dynamic Paths

If you want to organize your data by date automatically in ADLS, you can set the dataset's directory using a dynamic content expression (ADF expressions begin with `@`):

`@concat('raw/', formatDateTime(utcNow(), 'yyyy/MM/dd'))`

This ensures that every time the pipeline runs, it creates a new folder for that day’s data.
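In the dataset JSON, that expression is attached to folderPath as an object of type Expression. A sketch, reusing the placeholder linked service name from earlier:

```json
{
  "name": "DS_RawByDate",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_ADLS",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw-data",
        "folderPath": {
          "value": "@concat('raw/', formatDateTime(utcNow(), 'yyyy/MM/dd'))",
          "type": "Expression"
        }
      }
    }
  }
}
```

A run on 15 March 2026 would therefore write into raw/2026/03/15/ inside the raw-data container.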
