Data Ingestion from SFTP to Snowflake
You can create a data ingestion pipeline that uses an SFTP data source and pushes the data into a Snowflake data lake. The file formats supported for SFTP are CSV, JSON, XLSX, and PARQUET. The data integration tool used is Databricks.
- Create a data pipeline with the following nodes: SFTP > Databricks > Snowflake
- Configure the SFTP and Snowflake nodes.
- Click the Databricks node and click Create Job.
- Complete the following steps to create the Databricks data integration job:
  - Template - this is automatically selected depending on the stages selected.
  - Job Name - provide a name for the data integration job.
  - Node Rerun Attempts - specify the number of times the job is rerun in case of failure. The default value is set at the pipeline level.
Click Next.
The source is pre-selected based on the SFTP instance that you have configured. The data format that is currently supported is Parquet.
Decide whether you want to partition the data when loading it into the target data lake. Partitioning the data helps you manage and query it easily. Configure the following details:
- Source - the source is preselected based on the configuration that is selected.
- Datastore - the datastore is preselected based on the configuration that is selected.
- Select Path - select the folder from the SFTP folder structure.
- File Format - one of the following file formats is selected based on the data source node that is configured:
  - CSV
  - JSON
  - XLSX
  - PARQUET
- Select File - browse to select the required file and click Add File.
- Path - the path is displayed based on the folder structure that you selected.
Click Next.
- Target - the target is preselected based on the selected configuration. You can select one of the following options:
  - Configured Datastore
  - New Datastore
- Datastore - the datastore is preselected based on the selected configuration.
- Warehouse - the warehouse is preselected.
- Database - the database is preselected.
- Schema - the schema is preselected.
Click Next.
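The preselected warehouse, database, and schema correspond to existing objects in the configured Snowflake datastore. If you want to confirm them in Snowflake before running the job, a minimal check could look like the following sketch; the object names INGEST_WH, SALES_DB, and PUBLIC are hypothetical placeholders, so substitute the values preselected in your configuration:

-- Hypothetical object names; substitute the values preselected in the job
SHOW WAREHOUSES LIKE 'INGEST_WH';
SHOW SCHEMAS IN DATABASE SALES_DB;
SHOW TABLES IN SCHEMA SALES_DB.PUBLIC;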
- Map source data to target tables - map the source data to the tables in the target. Note the following points while mapping the source data to target tables:
  - If the source is an Excel file, it may consist of multiple sheets. The target table name that you provide is treated as an audit table. Lazsa generates the audit table with the required schema, along with an individual table for each sheet in the Excel file. For example, if your Excel file contains three sheets and the table name is "Sales", the following tables are created:
    - Sales
    - Sales_Sheet1
    - Sales_Sheet2
    - Sales_Sheet3
  - If the path selected for the source is at the folder level, then the target table must contain the LAZSA_AUDITING_COL column. If this column is not present, you must add it at the end of the table.
    Note: If the target table was created outside of the Lazsa Platform, then either the user who created the table must add the LAZSA_AUDITING_COL column, or the user creating the job must have ALTER permissions on the table. If the source node is Amazon S3, the data type for LAZSA_AUDITING_COL must be number. If the source node is SFTP, the data type for LAZSA_AUDITING_COL must be timestamp.
    Sample code for adding LAZSA_AUDITING_COL:
    Amazon S3 source:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL NUMBER(38,0);
    SFTP source:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL TIMESTAMP_NTZ(9);
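    If you are not sure whether an existing target table already has the auditing column, you can run an optional check such as the following sketch (<TABLE_NAME> is the same placeholder used above):

    -- Returns a row only if the auditing column already exists on the target table
    SHOW COLUMNS LIKE 'LAZSA_AUDITING_COL' IN TABLE <TABLE_NAME>;

    If the command returns no rows, run the ALTER TABLE statement that matches your source node.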
Click Next.
You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the related documentation.
- Select an All-Purpose Cluster - an all-purpose cluster is already configured; select one from the dropdown.
- Job Cluster - provide the required details to create a job cluster.
Click Complete.
SQS and SNS
- Configurations - select an SQS or SNS configuration that is integrated with the Lazsa Platform.
- Events - select the events for which you want to enable SQS or SNS queues:
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - select the details of the events for which notifications are enabled.
- Additional Parameters - provide any additional parameters to be considered for SQS and SNS queues.
Click Complete.
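After the job runs successfully, you can verify the ingested data directly in Snowflake. The following is a minimal sketch that reuses the hypothetical "Sales" table name from the mapping step; your table, schema, and column names will differ:

-- List the audit table and the per-sheet tables created for an Excel source named "Sales"
SHOW TABLES LIKE 'SALES%';

-- Spot-check the rows loaded from one sheet
SELECT * FROM SALES_SHEET1 LIMIT 10;

-- If the target table contains LAZSA_AUDITING_COL (required for folder-level paths),
-- the latest value shows when data was last ingested from an SFTP source
SELECT MAX(LAZSA_AUDITING_COL) AS LAST_INGESTED_AT FROM <TABLE_NAME>;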
What's next? Snowflake Bulk Ingest with Storage Integration