Processing CSV Data from Azure Blob Storage with PySpark and Databricks
- geetharajesh20
- May 28, 2024
- 1 min read
Updated: May 30, 2024
Databricks, a cloud-based platform built on Apache Spark, provides a comprehensive and interactive workspace for data engineering, data science, and machine learning.
PySpark is the Python API for Apache Spark, a robust and unified analytics engine tailored for big-data processing and machine learning applications.
Create an Azure Databricks workspace
Step 1: In portal.azure.com, search for Azure Databricks
Step 2: On the Azure Databricks page, click Create to create a new workspace
After that, click Review + Create
Click Create.
Step 3: You can check that the Azure Databricks service has been created in the resource group
Click on Launch Workspace; the workspace will open in a new tab like below
Step 4: Create a cluster in the workspace
Click on New > Cluster
As this is a test resource group, I am selecting Single node; then click Create compute
Once the cluster is created, create a new notebook, like below.
Add the code below to the notebook, replacing the container name, storage account, and access key with your own:
try:
    sourceUrl = 'wasbs://<container_Name>@<storage_account>.blob.core.windows.net'
    mountpoint = '/mnt/customerblob'
    # Mount the container; the config key must reference the same storage account
    # as the source URL.
    dbutils.fs.mount(
        source=sourceUrl,
        mount_point=mountpoint,
        extra_configs={'fs.azure.account.key.<storage_account>.blob.core.windows.net': '<Access_Key>'}
    )
except Exception:
    print("Already mounted.... /mnt/customerblob")
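The source URL and the account-key config entry follow a fixed pattern and must name the same storage account. As a side note, a small helper like the hypothetical `build_blob_config` below (plain Python; the function name and sample values are illustrative, not part of the Databricks API) can keep the two strings consistent:

```python
# Hypothetical helper (name and sample values are illustrative): builds the
# wasbs source URL and the matching account-key config entry so the storage
# account name cannot drift between the two strings.
def build_blob_config(container: str, account: str, access_key: str):
    source_url = f"wasbs://{container}@{account}.blob.core.windows.net"
    config_key = f"fs.azure.account.key.{account}.blob.core.windows.net"
    return source_url, {config_key: access_key}

url, cfg = build_blob_config("customer", "mystorageacct", "<Access_Key>")
# url -> 'wasbs://customer@mystorageacct.blob.core.windows.net'
```

The returned tuple can then be passed straight to `dbutils.fs.mount` as `source` and `extra_configs`.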
Mounting lets you reference the container through the mount point instead of supplying the full URL on every request.
The dbutils.fs.mount command creates a connection to the blob container in the storage account.
display(dbutils.fs.ls(mountpoint))
The dbutils.fs.ls(mountpoint) command gives us the list of files in the container.
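In Databricks, dbutils.fs.ls returns FileInfo entries with path, name, and size fields. The sketch below (using a namedtuple stand-in for FileInfo and made-up file names, since dbutils only exists inside Databricks) shows one way to filter the listing for CSV files before reading:

```python
from collections import namedtuple

# Stand-in for the FileInfo objects dbutils.fs.ls returns in Databricks;
# the file names below are made up for illustration.
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

listing = [
    FileInfo("dbfs:/mnt/customerblob/orders.csv", "orders.csv", 2048),
    FileInfo("dbfs:/mnt/customerblob/readme.txt", "readme.txt", 128),
]

# Keep only CSV paths, e.g. to loop over them with spark.read
csv_paths = [f.path for f in listing if f.name.endswith(".csv")]
```

Inside a notebook, replace `listing` with `dbutils.fs.ls(mountpoint)` and the same comprehension works unchanged.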
Run the following command to read the CSV file from the mount point:
df_ord= spark.read.format("csv").option("header",True).load("dbfs:/mnt/<mount_point_Name>/<file_name>.csv")
The following command will display the first 10 rows of the DataFrame:
display(df_ord.limit(10))