
Processing CSV Data from Azure Blob Storage with PySpark and Databricks

Updated: May 30, 2024

Databricks, a cloud-based platform built on Apache Spark, provides a comprehensive, interactive workspace for data engineering, data science, and machine learning.


PySpark is the Python API for Apache Spark, a robust, unified analytics engine tailored for big data processing and machine learning applications.


Create an Azure Databricks workspace


Step 1: In the Azure portal (portal.azure.com), search for Azure Databricks.

Azure Portal

Step 2: On the Azure Databricks page, click Create to create a new workspace.



After filling in the details, click Review + Create.

Then click Create.


Step 3: Verify that the Azure Databricks service has been created in the resource group.

Click Launch Workspace; the workspace opens in a new tab, as shown below.


Step 4: Creating a Cluster in workspace

Click on New > Cluster

Since this is a test resource group, I am selecting Single node, then clicking Create compute.

Once the cluster is created, create a new notebook, as shown below.


Add the code below to the notebook, replacing the container name, storage account, and access key with your own values:


try:
    sourceUrl = 'wasbs://<container_Name>@<storage_account>.blob.core.windows.net'
    mountpoint = '/mnt/customerblob'
    dbutils.fs.mount(
        source=sourceUrl,
        mount_point=mountpoint,
        extra_configs={'fs.azure.account.key.<storage_account>.blob.core.windows.net': '<Access_Key>'}
    )
except Exception:
    print("Already mounted.... /mnt/customerblob")

Mounting the container creates a persistent mount point, so files can be accessed through the DBFS path without supplying the storage URL on every request.

    dbutils.fs.mount(source=sourceUrl, mount_point=mountpoint, extra_configs={'fs.azure.account.key.<storage_account>.blob.core.windows.net': '<Access_Key>'})

The dbutils.fs.mount command creates a connection to the blob container in the storage account.
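Rather than relying on a bare try/except, the mount step can be wrapped in a small helper that consults dbutils.fs.mounts() first. This is a sketch under the assumption that dbutils is available (as it is in Databricks notebooks); the helper name ensure_mounted is mine, not from the original post:

```python
def ensure_mounted(dbutils, source_url, mount_point, extra_configs):
    """Mount the container only if mount_point is not already mounted.

    Returns True if a new mount was created, False if it already existed.
    """
    existing = [m.mountPoint for m in dbutils.fs.mounts()]
    if mount_point in existing:
        print(f"Already mounted: {mount_point}")
        return False
    dbutils.fs.mount(source=source_url,
                     mount_point=mount_point,
                     extra_configs=extra_configs)
    return True
```

In a notebook you would call it as `ensure_mounted(dbutils, sourceUrl, mountpoint, {...})`, which also makes the "already mounted" case explicit instead of swallowing every exception.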

display(dbutils.fs.ls(mountpoint))

The dbutils.fs.ls(mountpoint) command lists the files in the container.


Run the following command to read the CSV file from the mount point:

df_ord = spark.read.format("csv").option("header", True).load("dbfs:/mnt/<mount_point_Name>/<file_name>.csv")

The following command displays the DataFrame's contents, limited to the first 10 rows:

display(df_ord.limit(10))

© 2024 by TechSphereInsights.
