Data Ingestion using Databricks Autoloader

The Calibo Accelerate platform currently supports data ingestion from an Amazon S3 data source into a Unity Catalog data lake using the Databricks Autoloader feature.

When S3 is used as the data source, Databricks Autoloader currently supports the following file formats:

  • JSON

  • CSV

  • Parquet

What is Databricks Autoloader and how does it work?

Databricks Autoloader incrementally processes new data files as they arrive in cloud storage. It provides a Structured Streaming source called cloudFiles, which automatically picks up and processes files as they land in the monitored location.

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of the Autoloader pipeline. This key-value store ensures that each file is processed exactly once. In case of failure, Autoloader resumes from where it left off using the information stored in the checkpoint location, eliminating the need to maintain state manually.
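The behavior described above can be illustrated with a minimal PySpark sketch. The S3 paths below are placeholders, and spark is assumed to be the session available in a Databricks notebook or job:

  # Minimal Autoloader sketch; all S3 paths are illustrative placeholders.
  raw = (
      spark.readStream
      .format("cloudFiles")                                         # Autoloader's structured streaming source
      .option("cloudFiles.format", "json")                          # JSON, CSV, or Parquet, as listed above
      .option("cloudFiles.schemaLocation",
              "s3://example-bucket/_autoloader/schema/events")      # where the inferred schema is tracked
      .load("s3://example-bucket/landing/events/")                  # folder that Autoloader monitors for new files
  )

  (
      raw.writeStream
      .format("delta")
      .option("checkpointLocation",
              "s3://example-bucket/_autoloader/checkpoints/events") # file-discovery state (RocksDB) lives here
      .start("s3://example-bucket/bronze/events/")                  # a restart resumes from this checkpoint
  )

If the stream stops and is restarted with the same checkpointLocation, only files that are not yet recorded in the checkpoint are processed.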

Creating a data integration job

Complete the following steps to create a Databricks data integration job with S3 as the source and Unity Catalog as the target.

  1. Create a data pipeline using the following nodes:

    Data Integration using Databricks Autoloader into Unity Catalog data lake

  2. Configure the source and target nodes.

    Note:

    While configuring the S3 source node, be sure to select a folder, not a file. If you select a file, an error is shown and the job creation cannot be completed.

  3. Complete the remaining steps to create the job. For reference, a standalone sketch of the equivalent Autoloader code follows these steps.

 
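For reference, a hand-written Autoloader job covering the same S3-folder-to-Unity-Catalog path might look roughly like the sketch below. The bucket, catalog, schema, and table names are placeholders, and the job generated by the platform may differ in its details:

  # Hypothetical standalone equivalent of the configured job; all names are placeholders.
  orders = (
      spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")                           # or "json" / "parquet"
      .option("header", "true")                                     # regular CSV reader option, passed through
      .option("cloudFiles.schemaLocation",
              "s3://example-bucket/_autoloader/schema/orders")
      .load("s3://example-bucket/landing/orders/")                  # a folder, not a single file, per the note above
  )

  (
      orders.writeStream
      .option("checkpointLocation",
              "s3://example-bucket/_autoloader/checkpoints/orders")
      .trigger(availableNow=True)                                   # ingest all new files, then stop
      .toTable("example_catalog.example_schema.orders")             # Unity Catalog target: catalog.schema.table
  )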

What's next? Data Issue Resolver using Unity Catalog