Data Ingestion from Amazon S3 to Snowflake
Calibo Accelerate now supports ingestion of delta data from an Amazon S3 data source into a Snowflake data lake, where the source file format is .xlsx. When you select a folder during job creation, an audit table is created.
Let us create a pipeline with the following nodes: Amazon S3 (source), Databricks (data integration), and Snowflake (target).
To create a data integration job using Amazon S3 as source and Snowflake as target
- Configure the Amazon S3 node and select a folder that contains a .xlsx file.
- Configure the Snowflake node.
- Click the Databricks node in the data integration stage of the pipeline and click Create Templatized Job.
- Complete the following steps to create the job:
Job Name
- Job Name - Provide an appropriate name for the data integration job.
- Node Rerun Attempts - The number of times a rerun is attempted on this node in case of failure. You can select 1, 2, or 3. If you do not set the rerun attempts, the default setting defined at the pipeline level is used.
- Fault Tolerance - Select the behavior of the pipeline upon failure of a node. The options are:
  - Default - Subsequent nodes are placed in a pending state, and the overall pipeline shows a failed status.
  - Skip on Failure - The descendant nodes stop and skip execution.
  - Proceed on Failure - The descendant nodes continue their normal operation despite the failure.
Source
In this stage, the source details are auto-populated:
- Source
- Datastore
- File Type
- Folder and Path
Click Next.
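For orientation, the following is a minimal sketch of what reading the selected .xlsx file from Amazon S3 might look like outside the platform, using pandas and boto3. The bucket name, object key, and library choice are assumptions made for illustration; the templatized job performs this read for you.

```python
# Illustrative only: the generated Databricks job handles the source read itself.
# The bucket name, object key, and use of pandas/boto3 are assumptions.
import io

import boto3
import pandas as pd  # reading .xlsx files also requires the openpyxl package

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-source-bucket", Key="sales/2024/report.xlsx")

# sheet_name=None loads every sheet into a dict of DataFrames, which mirrors
# the one-table-per-sheet behavior described later in Data Management.
sheets = pd.read_excel(io.BytesIO(obj["Body"].read()), sheet_name=None)
for sheet_name, frame in sheets.items():
    print(sheet_name, frame.shape)
```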
Target
The fields in this stage are auto-populated, based on the configured target node:
- Target
- Datastore
- Warehouse
- Database
- Schema
Click Next.
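For reference, here is a minimal sketch of a Snowflake write from a Databricks notebook using the Spark Snowflake connector, assuming the `sheets` dictionary from the source sketch above. The connection values and table name are placeholders, and `spark` is the SparkSession that Databricks provides; the templatized job generates and runs the actual load.

```python
# Illustrative only: connection values and the table name are placeholders.
# `spark` is the Databricks-provided SparkSession; `sheets` comes from the
# earlier source sketch (assumed names, not generated by the platform).
df = spark.createDataFrame(sheets["Orders"])  # one sheet -> one Spark DataFrame

sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",    # in practice, read credentials from a secret scope
    "sfWarehouse": "<warehouse>",  # Warehouse field above
    "sfDatabase": "<database>",    # Database field above
    "sfSchema": "<schema>",        # Schema field above
}

(df.write
   .format("snowflake")           # Snowflake Spark connector bundled with Databricks
   .options(**sf_options)
   .option("dbtable", "TARGET_TABLE")
   .mode("append")                # or "overwrite", per the operation type
   .save())
```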
Data Management
In this stage, you do the following:
- Provide target tables for each source folder that is selected. You can either select an existing table from the dropdown or create a new table.
- Select the operation type: Append or Overwrite.
Note:
If the selected source folder contains an Excel file, the selected target table serves as the audit table. Additional tables must be created to map each sheet in the Excel file to a table in the target, as illustrated in the sketch after this step.
Click Next.
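The sketch below illustrates, with assumed sheet and table names and a hypothetical write helper, how each sheet in the Excel file could map to its own target table using the selected operation type. The platform derives this mapping from your Data Management settings; the code is not part of the generated job.

```python
# Illustrative only: sheet names, table names, and the write helper are assumptions.
from typing import Dict

import pandas as pd


def write_table(frame: pd.DataFrame, table: str, mode: str) -> None:
    """Hypothetical stand-in for the Snowflake write shown in the Target step."""
    print(f"{mode}: {len(frame)} rows -> {table}")


# One target table per sheet; the table selected in Data Management acts as the audit table.
sheet_to_table: Dict[str, str] = {
    "Orders": "ORDERS_RAW",
    "Returns": "RETURNS_RAW",
}
operation = "append"  # or "overwrite", per the operation type selected above

sheets = {name: pd.DataFrame({"id": [1, 2]}) for name in sheet_to_table}  # placeholder data
for sheet_name, table in sheet_to_table.items():
    write_table(sheets[sheet_name], table, operation)
```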
Cluster Configuration
You can select an all-purpose cluster or a job cluster to run the configured job. In case your Databricks cluster is not created through the Calibo Accelerate platform and you want to update custom environment variables, refer to the following:
All Purpose Clusters
Cluster - Select the all-purpose cluster that you want to use for the data integration job, from the dropdown list.
Note:
If you do not see a cluster configuration in the dropdown list, the configured Databricks cluster may have been deleted.
In this case, you must create a new Databricks cluster configuration in the Data Integration section of Cloud Platform Tools and Technologies. Delete the data integration node from the data pipeline, add a new node with the newly created configuration, and configure the job again. You can then select the newly configured Databricks cluster.
Job Cluster
Cluster Details
- Choose Cluster - Provide a name for the job cluster that you want to create.
- Job Configuration Name - Provide a name for the job cluster configuration.
- Databricks Runtime Version - Select the appropriate Databricks version.
- Worker Type - Select the worker type for the job cluster.
- Workers - Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or you can choose autoscaling.
- Enable Autoscaling - Autoscaling scales the number of workers up or down within the range that you specify. This helps in reallocating workers to a job during its compute-intensive phase. Once the compute requirement reduces, the excess workers are removed, which helps control your resource costs.
Cloud Infrastructure Details
- First on Demand - Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - Choose the type of EC2 instances on which to launch your Apache Spark clusters, from the following options: Spot, On-demand, or Spot with fallback.
- Zone - Identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - Provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - The type of EBS volume that is launched with this cluster.
- EBS Volume Count - The number of volumes launched for each instance of the cluster.
- EBS Volume Size - The size of the EBS volume to be used for the cluster.
Additional Details
- Spark Config - To fine-tune Spark jobs, provide custom Spark configuration properties as key-value pairs.
- Environment Variables - Configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - Provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - Provide the init (initialization) scripts that run during the startup of each cluster.
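As a rough point of reference, the job cluster fields above broadly correspond to a Databricks Jobs API new_cluster specification. The sketch below uses placeholder values and is an assumption about the mapping, not the exact payload the platform generates.

```python
# Illustrative only: all values are placeholders; the field names follow the
# Databricks Jobs/Clusters API, but the platform builds the real specification for you.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",                 # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                         # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},   # or "num_workers": <n> for a fixed size
    "aws_attributes": {
        "first_on_demand": 1,                            # First on Demand
        "availability": "SPOT_WITH_FALLBACK",            # Spot | On-demand | Spot with fallback
        "zone_id": "us-east-1a",                         # Zone
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-access",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",        # EBS Volume Type
        "ebs_volume_count": 1,                           # EBS Volume Count
        "ebs_volume_size": 100,                          # EBS Volume Size (GB)
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},               # Spark Config
    "spark_env_vars": {"ENVIRONMENT": "dev"},                            # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}}, # Logging Path (DBFS Only)
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/install-libs.sh"}}],  # Init Scripts
}
```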
Notifications
You can configure the SQS and SNS services to send notifications related to the node in this job. This provides information about various events related to the node without connecting to the Calibo Accelerate platform.
- SQS and SNS Configurations - Select an SQS or SNS configuration that is integrated with the Calibo Accelerate platform.
- Events - Enable the events for which you want to receive notifications:
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - From the dropdown list, select the details of the events that you want to include in the notifications.
- Additional Parameters - Provide any additional parameters to be added to the SQS and SNS notifications. A sample JSON is provided; you can use it to write logic for processing the events.
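Below is a minimal sketch of consuming these notifications from an SQS queue with boto3. The queue URL and the event field names (eventType, nodeName, pipelineName) are hypothetical; use the sample JSON provided in the UI for the actual payload structure.

```python
# Illustrative only: the queue URL and event fields below are assumptions,
# not the platform's documented payload.
import json

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"  # placeholder

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    event = json.loads(msg["Body"])
    # Hypothetical keys; replace with the keys from the sample JSON.
    if event.get("eventType") == "NODE_EXECUTION_FAILED":
        print("Alert:", event.get("pipelineName"), event.get("nodeName"))
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```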
What happens after the first job run?
- After the first job run, target tables are created according to the number of sheets in the source Excel file, and the source data is loaded into the respective tables.
- For subsequent job runs, the delta data is loaded into the tables.
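Conceptually, delta loading compares the files in the source folder with the entries already recorded in the audit table and processes only the new ones. The sketch below illustrates that idea with boto3; the audit table contents and the matching logic shown are assumptions, not the platform's actual implementation.

```python
# Illustrative only: a conceptual sketch of delta detection against an audit table.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-source-bucket", "sales/2024/"  # placeholders

# Keys already recorded in the audit table (in practice, queried from Snowflake).
already_loaded = {"sales/2024/report_january.xlsx"}

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
new_files = [
    obj["Key"]
    for obj in resp.get("Contents", [])
    if obj["Key"].endswith(".xlsx") and obj["Key"] not in already_loaded
]

# Only the new files would be read sheet-by-sheet, loaded into the target tables,
# and then appended to the audit table.
print("Delta files to process:", new_files)
```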
What's next? Databricks Templatized Data Integration Jobs