Processing CSV Data from Azure Blob Storage with PySpark and Databricks
- geetharajesh20
- May 28, 2024
- 1 min read
Updated: May 30, 2024
Databricks, a cloud-based platform built on Apache Spark, provides a comprehensive and interactive workspace for data engineering, data science, and machine learning.
PySpark is the Python API for Apache Spark, a robust and unified analytics engine tailored for big-data processing and machine learning applications.
Create an Azure Databricks workspace
Step 1: In portal.azure.com, search for Azure Databricks
Step 2: On the Azure Databricks page, click Create to create a new workspace
After that, click Review + Create
Click Create.
Step 3: You can check that the Azure Databricks service has been created in the resource group
Click on Launch Workspace; the workspace will open in a new tab like below
Step 4: Create a cluster in the workspace
Click on New > Cluster
As this is a test resource group, I am selecting Single node; then click Create compute
Once the cluster is created, create a new notebook, like below.
Add the code below to the notebook, replacing the container name, storage account, and access key with your own:
try:
    sourceUrl = 'wasbs://<container_Name>@<storage_account>.blob.core.windows.net'
    mountpoint = '/mnt/customerblob'
    # Mount the container; the config key must reference the same storage account
    # as the source URL.
    dbutils.fs.mount(
        source=sourceUrl,
        mount_point=mountpoint,
        extra_configs={'fs.azure.account.key.<storage_account>.blob.core.windows.net': '<Access_Key>'}
    )
except Exception:
    print("Already mounted.... /mnt/customerblob")
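The source URL and the account-key config entry follow a fixed pattern and must name the same storage account. As a side note, a small helper like the hypothetical `build_blob_config` below (plain Python; the function name and sample values are illustrative, not part of the Databricks API) can keep the two strings consistent:

```python
# Hypothetical helper (name and sample values are illustrative): builds the
# wasbs source URL and the matching account-key config entry so the storage
# account name cannot drift between the two strings.
def build_blob_config(container: str, account: str, access_key: str):
    source_url = f"wasbs://{container}@{account}.blob.core.windows.net"
    config_key = f"fs.azure.account.key.{account}.blob.core.windows.net"
    return source_url, {config_key: access_key}

url, cfg = build_blob_config("customer", "mystorageacct", "<Access_Key>")
# url -> 'wasbs://customer@mystorageacct.blob.core.windows.net'
```

The returned tuple can then be passed straight to `dbutils.fs.mount` as `source` and `extra_configs`.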
Mounting lets you reference the container through the mount point instead of supplying the full URL on every request.
The dbutils.fs.mount command creates a connection to the blob container in the storage account.
display(dbutils.fs.ls(mountpoint))
The dbutils.fs.ls(mountpoint) command gives us the list of files in the container.
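In Databricks, dbutils.fs.ls returns FileInfo entries with path, name, and size fields. The sketch below (using a namedtuple stand-in for FileInfo and made-up file names, since dbutils only exists inside Databricks) shows one way to filter the listing for CSV files before reading:

```python
from collections import namedtuple

# Stand-in for the FileInfo objects dbutils.fs.ls returns in Databricks;
# the file names below are made up for illustration.
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

listing = [
    FileInfo("dbfs:/mnt/customerblob/orders.csv", "orders.csv", 2048),
    FileInfo("dbfs:/mnt/customerblob/readme.txt", "readme.txt", 128),
]

# Keep only CSV paths, e.g. to loop over them with spark.read
csv_paths = [f.path for f in listing if f.name.endswith(".csv")]
```

Inside a notebook, replace `listing` with `dbutils.fs.ls(mountpoint)` and the same comprehension works unchanged.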
Run the following command to read the CSV file from the mount point:
df_ord= spark.read.format("csv").option("header",True).load("dbfs:/mnt/<mount_point_Name>/<file_name>.csv")
The following command will display the first 10 rows of the DataFrame:
display(df_ord.limit(10))