Data Ingestion from SFTP to Snowflake
You can create a data ingestion pipeline that uses an SFTP data source and pushes the data into a Snowflake data lake. The file formats supported for SFTP are CSV, JSON, XLSX, and PARQUET. The data integration tool used is Databricks.
- Create a data pipeline with the following nodes: SFTP > Databricks > Snowflake
- Configure the SFTP and Snowflake nodes.
- Click the Databricks node and click Create Job.
- Complete the following steps to create the Databricks data integration job:
  - Template - this is automatically selected depending on the stages selected.
  - Job Name - provide a name for the data integration job.
  - Node Rerun Attempts - specify the number of times the job is rerun in case of failure. The default value is set at the pipeline level.
Click Next.
The source is pre-selected based on the SFTP instance that you have configured. The data format that is currently supported is Parquet.
Decide whether you want to partition the data when loading it into the target data lake. Partitioning the data helps you manage and query it easily. Configure the following details:
- Source - the source is preselected based on the configuration that is selected.
- Datastore - the datastore is preselected based on the configuration that is selected.
- Select Path - select the folder from the SFTP folder structure.
- File Format - one of the following file formats is selected based on the data source node that is configured:
  - CSV
  - JSON
  - XLSX
  - PARQUET
- Select File - browse to select the required file and click Add File.
- Path - the path is displayed based on the folder structure that you selected.
Click Next.
- Target - the target is preselected based on the selected configuration. You can select one of the following options:
  - Configured Datastore
  - New Datastore
- Datastore - the datastore is preselected based on the selected configuration.
- Warehouse - the warehouse is preselected.
- Database - the database is preselected.
- Schema - the schema is preselected.
Click Next.
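The preselected warehouse, database, and schema correspond to existing objects in the configured Snowflake datastore. If you want to confirm them in Snowflake before running the job, a minimal check could look like the following sketch; the object names INGEST_WH, SALES_DB, and PUBLIC are hypothetical placeholders, so substitute the values preselected in your configuration:

-- Hypothetical object names; substitute the values preselected in the job
SHOW WAREHOUSES LIKE 'INGEST_WH';
SHOW SCHEMAS IN DATABASE SALES_DB;
SHOW TABLES IN SCHEMA SALES_DB.PUBLIC;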
- Map source data to target tables - map the source data to the tables in the target. Note the following points while mapping the source data to target tables:
  - If the source is an Excel file, it may consist of multiple sheets. The target table name that you provide is treated as an audit table. Lazsa generates the audit table with the required schema, along with an individual table for each sheet in the Excel file. For example, if your Excel file contains three sheets and the table name is "Sales", the following tables are created:
    - Sales
    - Sales_Sheet1
    - Sales_Sheet2
    - Sales_Sheet3
  - If the path selected for the source is at the folder level, then the target table must contain the LAZSA_AUDITING_COL column. If this column is not present, you must add it at the end of the table.
    Note: If the target table was created outside of the Lazsa Platform, then either the user who created the table must add the LAZSA_AUDITING_COL column, or the user creating the job must have ALTER permissions on the table. If the source node is Amazon S3, the data type for LAZSA_AUDITING_COL must be number. If the source node is SFTP, the data type for LAZSA_AUDITING_COL must be timestamp.
    Sample code for adding LAZSA_AUDITING_COL:
    Amazon S3 source:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL NUMBER(38,0);
    SFTP source:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL TIMESTAMP_NTZ(9);
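    If you are not sure whether an existing target table already has the auditing column, you can run an optional check such as the following sketch (<TABLE_NAME> is the same placeholder used above):

    -- Returns a row only if the auditing column already exists on the target table
    SHOW COLUMNS LIKE 'LAZSA_AUDITING_COL' IN TABLE <TABLE_NAME>;

    If the command returns no rows, run the ALTER TABLE statement that matches your source node.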
Click Next.
You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the related documentation.
- Select an All-Purpose Cluster - an all-purpose cluster is already configured; select one from the dropdown.
- Job Cluster - provide the required details to create a job cluster.
Click Complete.
SQS and SNS
- Configurations - select an SQS or SNS configuration that is integrated with the Lazsa Platform.
- Events - select the events for which you want to enable SQS or SNS queues:
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - select the details of the events for which notifications are enabled.
- Additional Parameters - provide any additional parameters to be considered for SQS and SNS queues.
Click Complete.
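After the job runs successfully, you can verify the ingested data directly in Snowflake. The following is a minimal sketch that reuses the hypothetical "Sales" table name from the mapping step; your table, schema, and column names will differ:

-- List the audit table and the per-sheet tables created for an Excel source named "Sales"
SHOW TABLES LIKE 'SALES%';

-- Spot-check the rows loaded from one sheet
SELECT * FROM SALES_SHEET1 LIMIT 10;

-- If the target table contains LAZSA_AUDITING_COL (required for folder-level paths),
-- the latest value shows when data was last ingested from an SFTP source
SELECT MAX(LAZSA_AUDITING_COL) AS LAST_INGESTED_AT FROM <TABLE_NAME>;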
What's next? Snowflake Bulk Ingest with Storage Integration