Azure Data Lake Storage (ADLS)

If ADF is the plumbing and Databricks is the engine, Azure Data Lake Storage (ADLS) Gen2 is the actual physical warehouse where everything is kept.

In 2026, it remains the standard for “Big Data” because it combines the low cost and practically unlimited capacity of cloud object storage with the directory structure and semantics of a file system.


1. The Secret Sauce: Hierarchical Namespace (HNS)

Standard cloud storage (like Azure Blob or Amazon S3) is “flat.” If you have a file at /logs/2026/March/data.csv, the service stores that whole string as one long object name; the “folders” are just a naming convention. To move or rename a folder, a client has to copy and then delete every single object that shares the prefix.

With ADLS Gen2, folders are “real” (Hierarchical Namespace).

  • Rename/Move: Renaming a folder with 10 million files is a single atomic metadata operation. It changes one directory reference instead of rewriting 10 million objects, so it completes almost instantly.
  • Performance: When a tool like Databricks or Spark asks for “all files in the March folder,” ADLS knows exactly where they are without searching through the entire lake.
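
To make the difference concrete, here is a toy in-memory model of the two designs. It is only a sketch of the idea, not how Azure implements either service; the class names FlatStore and HierarchicalStore are made up for this illustration.

```python
from dataclasses import dataclass, field


class FlatStore:
    """Flat object store: every object is keyed by its full path string."""

    def __init__(self):
        self.objects = {}  # full key -> bytes

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, old, new):
        """'Rename a folder' by rewriting every key under the prefix.

        Returns the number of objects that had to be touched.
        """
        moved = [k for k in self.objects if k.startswith(old + "/")]
        for key in moved:
            self.objects[new + key[len(old):]] = self.objects.pop(key)
        return len(moved)


@dataclass
class Directory:
    """A real directory node: maps child names to Directory nodes or file bytes."""
    children: dict = field(default_factory=dict)


class HierarchicalStore:
    """Tree-shaped namespace, loosely modelled on the HNS idea."""

    def __init__(self):
        self.root = Directory()

    def _walk(self, parts, create=False):
        node = self.root
        for part in parts:
            if create:
                node = node.children.setdefault(part, Directory())
            else:
                node = node.children[part]
        return node

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        self._walk(dirs, create=True).children[name] = data

    def rename_dir(self, old, new):
        """Rename a directory by re-linking one entry in its parent.

        Returns the number of entries touched, which is always 1.
        """
        *old_parent, old_name = old.strip("/").split("/")
        *new_parent, new_name = new.strip("/").split("/")
        node = self._walk(old_parent).children.pop(old_name)
        self._walk(new_parent, create=True).children[new_name] = node
        return 1

    def list_dir(self, path):
        """List one directory without scanning the rest of the store."""
        return sorted(self._walk(path.strip("/").split("/")).children)


flat, tree = FlatStore(), HierarchicalStore()
for i in range(1000):
    flat.put(f"logs/2026/March/data{i}.csv", b"")
    tree.put(f"logs/2026/March/data{i}.csv", b"")

print(flat.rename_prefix("logs/2026/March", "logs/2026/Q1"))  # 1000
print(tree.rename_dir("logs/2026/March", "logs/2026/Q1"))     # 1
```

The flat store touches one object per file, while the tree touches a single parent entry no matter how many files sit underneath. The same structure is why listing one folder does not require scanning unrelated keys.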

2. The Storage Tiers (Cost Savings)

You don’t pay the same price for all data. ADLS lets you move data between “Tiers,” either by hand or automatically through lifecycle management rules based on how long ago the data was last modified or accessed:

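  • Hot Tier: Highest cost to store, lowest cost to access. Use this for data you are actively processing in your RAG pipeline today.
  • Cool and Cold Tiers: Two separate tiers with lower storage cost than Hot, but higher read fees. Cool is meant for data you keep at least 30 days and Cold for at least 90 days; deleting or moving it sooner triggers an early deletion charge. Great for data from last month or last quarter.
  • Archive Tier: Dirt cheap to store (a fraction of a cent per GB per month in most regions), with a 180-day minimum. The data is “offline”: you have to “rehydrate” it to an online tier before you can read it again, which can take up to 15 hours at standard priority. Perfect for legal compliance backups.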
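
The “automatic” part is configured as a lifecycle management policy on the storage account. Below is a minimal sketch that builds such a policy as JSON in Python. The field names follow the lifecycle policy schema as I understand it, so check them against the current Azure documentation before applying anything. The rule name, the bronze/logs/ prefix, and the day thresholds are illustrative assumptions, not recommendations.

```python
import json


def build_lifecycle_policy(prefix, cool_after=30, cold_after=90, archive_after=180):
    """Build a lifecycle policy dict that ages block blobs through the tiers.

    `prefix` starts with the container name, e.g. "bronze/logs/" (hypothetical).
    Thresholds are days since last modification and must strictly increase.
    """
    if not (0 < cool_after < cold_after < archive_after):
        raise ValueError("thresholds must satisfy 0 < cool < cold < archive")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "age-out-raw-data",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after
                            },
                            "tierToCold": {
                                "daysAfterModificationGreaterThan": cold_after
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after
                            },
                        }
                    },
                },
            }
        ]
    }


policy = build_lifecycle_policy("bronze/logs/")
print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and attached to the storage account through the portal, the CLI, or an infrastructure-as-code template.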

3. Security (ACLs vs. RBAC)

For your internal RAG system, this is the most important part of ADLS. It uses two layers of security:

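  1. RBAC (Role-Based Access Control): Broad permissions at the account or container level (e.g., “John is a Storage Blob Data Contributor”).
  2. ACLs (Access Control Lists): POSIX-style permissions on individual folders and files. You can say “John can see the ‘Public’ folder, but only HR can see the ‘Salaries’ folder.”

One catch: RBAC is evaluated first. If a role assignment already grants John access to the data, the ACLs are never checked, so keep data roles narrow for anyone you want the ACLs to restrict.

2026 Update: Azure AI Search now “respects” these ACLs. If you index files from ADLS, the search results will automatically hide files that the logged-in user doesn’t have permission to see in the Data Lake.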
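
The folder-level ACL idea can be sketched as a small evaluator over the short-form ACL strings that ADLS uses (for example user::rwx,group::r-x,other::---). This is a deliberately simplified model: it handles only access ACL entries for named users, named groups, the mask, and “other”, and it skips the owning user, the owning group, default ACLs, and the RBAC check that runs first. In a real lake the qualifiers are Microsoft Entra object IDs; the names john, maria, and hr-team are placeholders.

```python
def parse_acl(acl):
    """Parse 'scope:qualifier:perms' entries into a dict keyed by (scope, qualifier)."""
    entries = {}
    for item in acl.split(","):
        scope, qualifier, perms = item.strip().split(":")
        entries[(scope, qualifier)] = perms
    return entries


def _allows(perms, wanted):
    """True if every requested permission letter is present in perms."""
    return all(flag in perms for flag in wanted)


def can_access(acl, user_id, group_ids, wanted="r"):
    """Simplified POSIX-style check: named user, then named groups, then other."""
    entries = parse_acl(acl)
    mask = entries.get(("mask", ""), "rwx")

    def masked(perms):
        # The mask caps what named users and named groups can get.
        return "".join(p for p in perms if p != "-" and p in mask)

    if ("user", user_id) in entries:
        return _allows(masked(entries[("user", user_id)]), wanted)

    group_perms = [
        entries[("group", g)] for g in group_ids if ("group", g) in entries
    ]
    if group_perms:
        return any(_allows(masked(p), wanted) for p in group_perms)

    return _allows(entries.get(("other", ""), "---"), wanted)


public_acl = "user::rwx,group::r-x,mask::r-x,other::r-x"
salaries_acl = "user::rwx,group::r-x,group:hr-team:r-x,mask::r-x,other::---"

print(can_access(public_acl, "john", ["engineering"]))    # True
print(can_access(salaries_acl, "john", ["engineering"]))  # False
print(can_access(salaries_acl, "maria", ["hr-team"]))     # True
```

John falls through to the “other” entry on both folders, which grants read on Public and nothing on Salaries, while Maria matches the hr-team group entry.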

4. ADLS Gen2 vs. Microsoft Fabric OneLake

You might hear about OneLake (the “OneDrive for data”). Here is how to tell them apart in 2026:

  • ADLS Gen2: The “Infrastructure” choice. You have full control over networking, encryption keys, and regions. Best for custom data engineering and Databricks-heavy workloads.
  • OneLake: The “SaaS” choice. It is actually built on top of ADLS, but it manages the folders and permissions for you automatically within Microsoft Fabric.

Summary Checklist

  • Format: Use Delta or Parquet for your “Silver” and “Gold” layers. Parquet is a compressed, columnar format that AI and BI engines read efficiently; Delta adds a transaction log on top of Parquet files.
  • Structure: Always follow the Bronze -> Silver -> Gold folder structure to keep your lake from becoming a “data swamp.”
  • Access: Use Managed Identities so ADF and Databricks can talk to ADLS without you ever having to copy-paste a password or a secret key.
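
The structure and access points in the checklist can be sketched together in a few lines. This assumes a single container with bronze, silver, and gold as top-level folders; the account name mylake, the container name lake, and the helper names lake_uri and connect are made up for the example. DefaultAzureCredential comes from the azure-identity package and picks up a managed identity when the code runs on an Azure resource that has one assigned.

```python
LAYERS = ("bronze", "silver", "gold")


def lake_uri(account, container, layer, *parts):
    """Build an abfss:// URI for a path inside one medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    path = "/".join(p.strip("/") for p in parts if p)
    base = f"abfss://{container}@{account}.dfs.core.windows.net/{layer}"
    return f"{base}/{path}" if path else base


def connect(account):
    """Return a Data Lake client authenticated without any stored secret.

    Imports are local so the URI helper works even without the Azure SDK.
    Requires the azure-identity and azure-storage-file-datalake packages.
    """
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient(
        account_url=f"https://{account}.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )


print(lake_uri("mylake", "lake", "silver", "sales", "2026"))
# abfss://lake@mylake.dfs.core.windows.net/silver/sales/2026
```

Rejecting unknown layer names in one helper is a cheap way to stop ad-hoc top-level folders from creeping into the lake.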

Are you planning to manage the ADLS folders yourself, or is your company moving toward a managed environment like Microsoft Fabric?
