Mount Blob containers into Databricks via DBFS

Preface

DBFS is the primary mechanism that Databricks uses to access data from external locations such as Amazon S3 buckets, Azure Blob containers, RDBMS databases, and more.

What makes DBFS powerful is its ability to behave like a local file system within Databricks while interacting with each of these storage platforms.

In other words, DBFS itself doesn’t store any data - it only acts as an interface for moving data between Databricks and the storage platforms it supports.
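
For example, anything mounted through DBFS can be browsed with the same dbutils.fs calls you would use for any local-style path. A quick illustration (the /mnt folder is simply the conventional home for mounts):

# List the DBFS root; mounted external storage conventionally lives under /mnt
display(dbutils.fs.ls("dbfs:/"))

# List the mounts that currently exist in this workspace
display(dbutils.fs.mounts())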

In this blog, I will quickly show you how to mount Azure Blob containers into Databricks using DBFS. Let’s begin!

Prerequisites

This post assumes you have a storage account and Blob container already set up in the Azure portal.

Steps

1. Create an Azure AD app

  • Go to the Azure portal

  • Go to Azure Active Directory

  • Under the Manage header on the left-hand pane, click on App registrations, then click on New registration

  • Enter a name for the Azure AD app

  • In the Redirect URI section, select Web as the type, then enter “https://localhost” as the URL

  • Click Register

2. Create a client secret

  • Once the app is created, open the Certificates & secrets option under the Manage header

  • Click on New client secret

  • Add a description for the new client secret and select an expiration period

  • Click Add

  • Note the value of the new client secret (this will be required later)

3. Grant access to the Azure AD App

  • Go to the Blob container you want to grant access to

  • Click on Access control (IAM), then click on the + Add button

  • Select Add role assignment

  • Select the Storage Blob Data Contributor role then click Next

  • Click on +Select members and search for the Azure AD app you created in the previous steps

  • Click Select

  • Click Review + assign, then click on Review + assign for the next page

4. Create SAS token and connection string

  • Go to Storage accounts

  • Click on the storage account that contains the container you're after

  • Under the Security + networking pane on the left-hand menu, click on Shared access signature

  • Under Allowed resource types check the Service, Container and Object boxes

  • Configure expiry dates to your preference

  • Click Generate SAS and connection string (this step can also be scripted, as shown below)
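
As an alternative to the portal, the azure-storage-blob Python SDK can generate the same account-level SAS. Here's a minimal sketch with placeholder account details (the account key is found under Access keys in the storage account):

from datetime import datetime, timedelta
from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

# Mirrors the portal settings above: Service, Container and Object resource types
sas_token = generate_account_sas(
    account_name="<storage-account-name>",  # placeholder
    account_key="<storage-account-key>",    # placeholder
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, write=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=24),
)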

5. Create a secret scope to store your Azure credentials as secrets

I’ve already written a blog post on how to create the secret scope and secrets for this stage here. Use the credentials listed below to create the scope, then advance to the next step.
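
If you'd rather script this step, here's a minimal sketch that creates the scope and stores a secret through the Databricks Secrets REST API (the workspace URL and personal access token are placeholders, and the scope name azure matches the code later in this post):

import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
token = "<your-personal-access-token>"                 # placeholder PAT
headers = {"Authorization": f"Bearer {token}"}

# Create a Databricks-backed secret scope named "azure"
requests.post(f"{host}/api/2.0/secrets/scopes/create",
              headers=headers,
              json={"scope": "azure"}).raise_for_status()

# Store one credential as a secret; repeat for each credential listed below
requests.post(f"{host}/api/2.0/secrets/put",
              headers=headers,
              json={"scope": "azure", "key": "client_id",
                    "string_value": "<your-client-id>"}).raise_for_status()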

Here are the credentials you need to store as secrets at this point:

  • Client ID

  • Client secret

  • Tenant ID

  • Storage account name

  • Container name

  • SAS token

  • SAS connection string

Here's how to find each of these credentials:

Client ID

  • Go to Azure Active Directory

  • Click on App registrations

  • Click on the app you've just created

The client ID is the same as the Application (client) ID, which should appear on this page.

Client secret

  • Go to Azure Active Directory

  • Click on App registrations

  • Click on the app you've just created

  • Click on Certificates & secrets

The client secret is the value of the secret created from the previous steps, which may be masked at this stage.

Note: If you don't have the value of the secret, you may be required to create another one as they are only displayed during the creation process. Follow the previous steps to create a new client secret for your app.

Tenant ID

  • Go to Azure Active Directory

  • Under the Overview header, click on the Properties tab

The Tenant ID should appear on the page.

Storage account name

  • Go to the Azure portal

  • On the homepage, click on Storage accounts under the Azure services pane

  • Click on the subscription that holds the storage account of your choice

The storage account name appears at the top left of the page, above the Storage account sub-header.

Container name

  • Go to the Azure portal

  • Click on Storage accounts

  • Select the subscription that contains the Blob storage account you're after

  • Under the Data storage pane on the left-hand menu, click on Containers

  • Select the container that contains the Blob you're after

  • Click on the blob you want to access

The container name should be displayed at the top of the page.

SAS Token & connection string

You can only view these values immediately after generating them. If you no longer have them, generate a new SAS token and connection string and note the values this time.

You can find the instructions in step 4 above.
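
With all seven credentials stored, you can sanity-check the scope from a notebook before moving on (this assumes the scope is named azure, matching the code in the next step):

# Confirm the scope exists and the expected keys are present (values stay redacted)
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("azure"))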

6. Run the configuration code to perform the mount

Retrieve the secrets from the secret scope

Read the secrets from your secret scope and other config details into Python objects:

client_id             = dbutils.secrets.get(scope="azure", key="client_id")
client_secret         = dbutils.secrets.get(scope="azure", key="client_secret")
tenant_id             = dbutils.secrets.get(scope="azure", key="tenant_id")
storage_account_name  = dbutils.secrets.get(scope="azure", key="storage_account_name")
container_name        = dbutils.secrets.get(scope="azure", key="container_name")
sas_token             = dbutils.secrets.get(scope="azure", key="sas_token")
sas_connection_string = dbutils.secrets.get(scope="azure", key="sas_connection_string")

source_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"
mount_point = f"/mnt/{container_name}-dbfs"
extra_configs = {
        f"fs.azure.account.auth.type.{storage_account_name}.blob.core.windows.net": "OAuth",
        f"fs.azure.account.oauth.provider.type.{storage_account_name}.blob.core.windows.net": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{storage_account_name}.blob.core.windows.net": client_id,
        f"fs.azure.account.oauth2.client.secret.{storage_account_name}.blob.core.windows.net": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.blob.core.windows.net": f"<https://login.microsoftonline.com/{tenant_id}/oauth2/token>",
        f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net": sas_token,
        f"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": sas_connection_string   
}

Perform the mount

dbutils.fs.mount(
    source=source_path,
    mount_point=mount_point,
    extra_configs=extra_configs
)

This will use the objects from the previous steps to mount the Blob container to the DBFS location specified.
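
Note that dbutils.fs.mount raises an error if the mount point is already in use. If you want the cell to be safely re-runnable, you can unmount first (a small optional guard using the same mount_point variable):

# Unmount any existing mount at this location so the mount call can be re-run
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)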

Verify the Blob mount to DBFS

Confirm the mount job was successful by listing the objects in the DBFS mount location:

dbutils.fs.ls(mount_point)

The results should match the content in your actual Azure Blob container.
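
From here, you can read files through the mount like any other DBFS path. For example, assuming a hypothetical sales.csv file sits at the container root:

# Read a (hypothetical) CSV file from the mounted container into a Spark DataFrame
df = spark.read.csv(f"{mount_point}/sales.csv", header=True, inferSchema=True)
display(df)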

Feel free to reach out via my handles: LinkedIn | Email | Twitter