Data Ingestion using Amazon Kinesis Data Streams with Snowflake Data Lake

This topic describes how to create a data ingestion pipeline that uses Amazon Kinesis Data Streams as the data source, Databricks for data integration, and a Snowflake data lake as the destination.

Prerequisites

  • Access to a configured Snowflake account, which will be used as a data lake in the pipeline.

  • A configured instance of Amazon Kinesis Data Streams. For information about configuring Kinesis, see Configuring Amazon Kinesis Data Streams.

Creating a data ingestion pipeline

  1. On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:
    • Data Source: Amazon Kinesis Data Streams
    • Data Integration: Databricks
    • Data Lake: Snowflake

    Kinesis Data Streams with Snowflake data lake

  2. Configure the Kinesis node and Snowflake node.

  3. Click the Databricks node and click Create Job.

  4. Complete the steps to create a data integration job.
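The data integration job that Databricks runs typically writes to Snowflake through the Spark Snowflake connector, which is configured with a map of `sf*` options. The sketch below shows how such an option map might be assembled; the account URL, credentials, and object names are placeholders, and the exact options your generated job uses may differ.

```python
def snowflake_writer_options(account_url, user, password,
                             database, schema, warehouse):
    """Build the option map used by the Spark Snowflake connector
    (format "snowflake") when writing a DataFrame to Snowflake.
    All values passed in here are illustrative placeholders."""
    return {
        "sfUrl": account_url,    # e.g. "myaccount.snowflakecomputing.com"
        "sfUser": user,
        "sfPassword": password,  # prefer a secret scope in real jobs
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


opts = snowflake_writer_options(
    "myaccount.snowflakecomputing.com", "etl_user", "***",
    "LAKE_DB", "PUBLIC", "ETL_WH",
)
# Inside the Databricks job, the options would be applied along these lines
# (sketch only; "KINESIS_EVENTS" is a hypothetical target table):
#   df.write.format("snowflake").options(**opts) \
#     .option("dbtable", "KINESIS_EVENTS").mode("append").save()
```

Keeping the connection details in one helper makes it easy to point the same job at a different Snowflake database or warehouse.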

Running a data ingestion pipeline

After you have created the data integration job with Amazon Kinesis Data Streams, you can run the pipeline in the following way:

  1. After the job is created, publish the pipeline if you have not already done so by clicking Publish.

  2. Click Run Kinesis Data Stream pipeline. The Data Streams window opens, listing the data streams in the pipeline. Enable the toggle for the stream from which you want to fetch data.

    Kinesis Data Streams with Snowflake data lake

    You can see that the data stream that you enabled is now running. Click the refresh icon to view the latest information about the number of events processed.
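Behind the events-processed counter, records fetched from an enabled data stream carry their payload as an encoded blob that the integration job decodes before landing the data in Snowflake. The following self-contained sketch illustrates that decode-and-count step, assuming base64-encoded JSON payloads as a Kinesis GetRecords consumer would receive them over the wire:

```python
import base64
import json

def decode_kinesis_records(records):
    """Decode a batch of Kinesis-style records and count events processed.
    Each record carries its payload base64-encoded under "Data"; payloads
    are assumed (for this sketch) to be JSON objects."""
    events = []
    for record in records:
        payload = base64.b64decode(record["Data"])
        events.append(json.loads(payload))
    return events, len(events)


# Two sample records, as if fetched from the enabled stream:
batch = [
    {"Data": base64.b64encode(json.dumps({"id": i}).encode())}
    for i in (1, 2)
]
events, processed = decode_kinesis_records(batch)
print(processed)  # 2
```

The count returned here corresponds to the per-refresh "events processed" figure you see in the Data Streams window.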

Troubleshooting a failed data integration job

When you click the Databricks node in the pipeline, you can tell whether your data integration job has failed by looking at the status of the job.

  1. Click the Databricks node in the pipeline.

  2. Check the status of the Databricks integration job. The status could be one of the following:

    • Running

    • Canceled

    • Pending

    • Failed

  3. If the job status is Failed, click the ellipsis (...) and then click Open Databricks Dashboard.

    Troubleshooting Kinesis data stream job

  4. You are taken to the specific Databricks job, which shows the list of job runs. Click the job run for which you want to view the details.

    Databricks Data Integration job list

  5. View the details and check for errors.

    Databricks job details troubleshooting
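Each Databricks job run reports a life-cycle state and, once terminated, a result state through the Databricks Jobs API. The helper below sketches how those API states could map onto the four pipeline statuses listed above (Running, Canceled, Pending, Failed); this mapping is an assumption for illustration, not Data Pipeline Studio's documented behavior.

```python
def pipeline_status(life_cycle_state, result_state=None):
    """Map Databricks Jobs API run states onto the pipeline job statuses.

    life_cycle_state: e.g. "PENDING", "RUNNING", "TERMINATING",
    "TERMINATED", "INTERNAL_ERROR" (from the Jobs API runs/get response).
    result_state: e.g. "SUCCESS", "FAILED", "CANCELED" once terminated.
    The mapping is illustrative, not Data Pipeline Studio's own logic.
    """
    if life_cycle_state == "PENDING":
        return "Pending"
    if life_cycle_state in ("RUNNING", "TERMINATING"):
        return "Running"
    if life_cycle_state == "INTERNAL_ERROR" or result_state == "FAILED":
        return "Failed"
    if result_state == "CANCELED":
        return "Canceled"
    return result_state or life_cycle_state


print(pipeline_status("RUNNING"))                 # Running
print(pipeline_status("TERMINATED", "FAILED"))    # Failed
print(pipeline_status("TERMINATED", "CANCELED"))  # Canceled
```

A run that ends as Failed (or INTERNAL_ERROR) is the case where opening the Databricks dashboard, as in steps 3 to 5, lets you inspect the run's error details.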

What's next? Data Ingestion using Amazon Kinesis Data Streams