Ingesting Data using Snowflake Stream Ingest into a Snowflake Data Lake
In today’s fast-paced business environment, organizations need to effectively leverage the value of data to make informed decisions and stay competitive. With data coming in from IoT devices, mobile apps, and websites, the need for real-time data processing becomes increasingly critical.
Data Pipeline Studio (DPS) supports processing of streaming data using Snowflake stream ingest in the Calibo Accelerate platform.
Before you create a stream ingest data pipeline, ensure that the role assigned to you has the following permissions on Snowflake streams and tasks:
- GRANT CREATE STREAM ON SCHEMA <SCHEMA_NAME> TO ROLE <ROLE>;
- GRANT CREATE TASK ON SCHEMA <SCHEMA_NAME> TO ROLE <ROLE>;
- GRANT EXECUTE TASK ON ALL TASKS IN SCHEMA <SCHEMA_NAME> TO ROLE <ROLE>;
- GRANT EXECUTE TASK ON FUTURE TASKS IN SCHEMA <SCHEMA_NAME> TO ROLE <ROLE>;
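To confirm that these privileges are in place, you can list the grants for your role in Snowflake (replace <ROLE> with the role assigned to you):

-- Lists all privileges granted to the role, including stream and task privileges.
SHOW GRANTS TO ROLE <ROLE>;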
If you are using S3 as a data lake and ingesting data into Snowflake, then this is what your pipeline looks like:
Amazon S3 (Data Lake) > Snowflake Stream Ingest (Data Integration) > Snowflake (Data Lake)
The data is loaded into a landing layer temporarily and then into the unification layer after the selected operation is performed on it. When you ingest streaming data from an S3 bucket into a Snowflake table, you must select a preconfigured storage integration in Snowflake and ensure that your S3 bucket has access to the selected storage integration. See Configuring a Snowflake storage integration to access Amazon S3.
Snowflake Stream Ingest uses Snowpipe to continuously load data from files as soon as they are available, making near real-time data available for processing. When you create a Snowflake stream ingest job, you also create a task and specify its interval. The task interval is the polling frequency at which data is loaded from the source to the target after the specified operation is performed in the unification layer.
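For context, a Snowpipe-based stream ingest of this kind is typically backed by a stage, a pipe, and a stream in the landing schema. The following is only an illustrative sketch; all names (ORDERS_LANDING, ORDERS_LANDING_STAGE, MY_S3_INTEGRATION, and the bucket path) are hypothetical, and DPS creates and manages the actual objects for you.

-- Illustrative sketch only; DPS creates and manages these objects.
-- External stage over the S3 base path, using a preconfigured storage integration.
CREATE STAGE ORDERS_LANDING_STAGE
  URL = 's3://my-bucket/base/path/'
  STORAGE_INTEGRATION = MY_S3_INTEGRATION
  FILE_FORMAT = (TYPE = PARQUET);

-- Snowpipe continuously copies new files from the stage into the landing table.
CREATE PIPE ORDERS_LANDING_PIPE AUTO_INGEST = TRUE AS
  COPY INTO ORDERS_LANDING
  FROM @ORDERS_LANDING_STAGE
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- The stream tracks new rows in the landing table so a scheduled task can process them.
CREATE STREAM ORDERS_LANDING_STREAM ON TABLE ORDERS_LANDING;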
To create a stream ingest data integration job
On the home page of DPS, add the following stages:
Data Lake: Amazon S3
Data Integration: Snowflake Stream Ingest
Data Lake: Snowflake
Configure the Amazon S3 and Snowflake nodes.
Click the data integration node and click Create Job.
For the data integration job creation, provide the following inputs:

Job Name: Provide a name for the data integration job.
Node Rerun Attempts: The number of times the pipeline run is reattempted on this node in case of failure. By default, the setting defined at the pipeline level is used. To override it, select a different option from the dropdown.

Datastore: This is populated based on the datastore that you configure for the data lake (source) node.
Source Format: Currently, DPS supports the Parquet format.
Enable Auto Load with Historical Data - Turn on this toggle to load existing historical data and continue to automatically ingest new data as it arrives.
Add Base Path:
Click Add Base Path.
In the Choose Base Path screen, select a path and then click Select.
Click the arrow to view the schema of the selected file.
Click Next.

On the Landing Layer Details screen, provide the following inputs:
Database – This is populated based on the selected database.
Landing Layer Schema – This is populated based on the selected schema.
Create/Choose Landing Layer Table – Either select a table from the dropdown list or create a new table in the landing layer where the data is stored temporarily.
Choose File Format – This is populated based on the file format that you selected for the source stage.
Stage Name for S3 – This is created based on the landing layer table that you create or choose. A suffix Stage is added to it to form the stage name for S3.
Pipe Name – This is created based on the landing layer table that you create or choose. A suffix Pipe is added to it to form the Pipe name for the landing table.
Stream Name – This is created based on the landing layer table that you create or choose. A suffix Stream is added to it to form the Stream name for the landing table. (A verification sketch for these objects follows this step.)
Click Next.
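Optionally, if you want to confirm the stage, pipe, and stream created for the landing table, you can check them directly in Snowflake. This is only a sketch; the database, schema, and pipe names below are hypothetical.

-- List the objects created in the landing schema (hypothetical names).
SHOW STAGES IN SCHEMA MY_DB.LANDING_SCHEMA;
SHOW PIPES IN SCHEMA MY_DB.LANDING_SCHEMA;
SHOW STREAMS IN SCHEMA MY_DB.LANDING_SCHEMA;

-- Check whether the pipe is running and whether any files are pending.
SELECT SYSTEM$PIPE_STATUS('MY_DB.LANDING_SCHEMA.ORDERS_LANDING_PIPE');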

The Datastore is populated based on the options selected.
Provide the following information for Unification Layer Details:
Warehouse - Select a warehouse to store the data.
Unification Layer Details
Database - This is populated based on the selected database.
Target Schema - This is populated based on the selected schema.
Create/Choose Unification Layer Table - Either select a table from the dropdown list or create a new table in the unification layer where the processed data is stored.
Task Name - Select the task name.
Specify Schedule Interval (in minutes) - The interval at which the source is polled and data is loaded into the target.
Operation Type: Select the type of operation to be performed on the source data during the job run. Choose one of the following options:
Append - adds new data at the end of the table without erasing the existing content.
Overwrite – replaces the existing data in the table.
Merge – combines the existing and new data in the table. Select a primary key on which the merge is performed, as shown in the sketch after this list.
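Behind the scenes, the schedule interval and operation type correspond to a Snowflake task that periodically moves data from the landing stream into the unification table. The following is only an illustrative sketch of a Merge-style task; the warehouse, table, stream, column, and key names (INGEST_WH, UNIFIED_ORDERS, ORDERS_LANDING_STREAM, ORDER_ID) are hypothetical, and DPS generates the actual task for you.

-- Illustrative sketch only; DPS generates and manages the actual task.
CREATE TASK UNIFIED_ORDERS_TASK
  WAREHOUSE = INGEST_WH
  SCHEDULE = '5 MINUTE'                              -- schedule interval in minutes
  WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_LANDING_STREAM')
AS
  MERGE INTO UNIFIED_ORDERS t
  USING ORDERS_LANDING_STREAM s
    ON t.ORDER_ID = s.ORDER_ID                       -- primary key selected for the Merge operation
  WHEN MATCHED THEN UPDATE SET t.AMOUNT = s.AMOUNT, t.STATUS = s.STATUS
  WHEN NOT MATCHED THEN INSERT (ORDER_ID, AMOUNT, STATUS)
    VALUES (s.ORDER_ID, s.AMOUNT, s.STATUS);

-- Tasks are created in a suspended state; resuming them starts the schedule.
ALTER TASK UNIFIED_ORDERS_TASK RESUME;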

Select a storage integration from the dropdown list.
A storage integration is an object created in Snowflake that stores a generated identity and access management (IAM) user for your S3 cloud storage, along with an optional set of allowed or blocked storage locations (i.e. buckets).
Ensure that the storage integration that you select has access to the selected S3 bucket in the source stage.
For information on how to create a storage integration, refer to the following link: Configuring a Snowflake storage integration to access Amazon S3
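For reference, a Snowflake administrator typically creates a storage integration for S3 with statements similar to the following. This is a sketch only; the integration name, role ARN, and bucket path are hypothetical, and the full setup, including the AWS trust policy, is described in the linked Snowflake documentation.

-- Illustrative sketch only; names, ARN, and locations are hypothetical.
CREATE STORAGE INTEGRATION MY_S3_INTEGRATION
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-access-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/base/path/');

-- Retrieve the IAM user and external ID to add to the AWS role's trust policy.
DESC INTEGRATION MY_S3_INTEGRATION;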

You can configure the SQS and SNS services to send notifications related to the node in this job. This provides information about various events related to the node without actually connecting to the Calibo Accelerate platform.
SQS and SNS
Configurations - Select an SQS or SNS configuration that is integrated with the Calibo Accelerate platform.
Events - Select the events for which you want to send notifications.
Event Details - From the dropdown list, select the event details that you want to include in the notifications.
Additional Parameters - Provide any additional parameters to be added to the SQS and SNS notifications. A sample JSON is provided, which you can use to write logic for processing the events.
Click Complete.
To run the Stream Ingest data integration job
Publish the pipeline with the changes.
Notice that the Run Pipeline option is disabled. Click the down arrow adjacent to it and enable the toggle switch for Snowflake Stream Ingest 1.
The stream ingest job starts and its status changes to Running.
To stop running the Stream Ingest data integration job
On the DPS home page, click the down arrow (adjacent to Run Pipeline) and disable the toggle for Snowflake Stream Ingest. The job stops running and the status changes to Terminated.
What's next? Ingesting Data from Amazon Kinesis Data Streams into an S3 Data Lake