Data Ingestion from Data Catalogs to Unity Catalog

Data Pipeline Studio supports data ingestion from Lazsa Data Ingestion Catalogs to a Unity Catalog data lake. You can crawl data from various types of sources and create a data catalog from it, which can then be used in the data source stage of a data pipeline. Data crawlers can connect to various data sources, including databases, data lakes, APIs, and file systems, providing wider visibility into and deeper access to data.

To use a data ingestion catalog as a data source in a data pipeline, you must first create a crawler using sources such as CSV, MS Excel, Parquet, FTP, or SFTP. After creating a crawler, you create a data catalog from it. While creating the data catalog, you can filter the data that you bring from the crawler into the data catalog. For more information, see Data Crawler and Data Catalog.

Data Pipeline Studio currently supports the following sources for creating a crawler and catalog, which you can then use as a Lazsa Data Ingestion Catalog in the source stage of a pipeline:

  • CSV

  • MS Excel

  • Parquet

  • FTP (using CSV, XLSX, JSON, and Parquet file formats)

  • SFTP (using CSV, XLSX, JSON, and Parquet file formats)

  • MySQL

  • MS SQL Server

  • Oracle

  • PostgreSQL

  • REST API (using CSV and JSON formats)

  • Snowflake

Prerequisites

  • Access to a Databricks node with Unity Catalog enabled, which will be used as the data integration node in the data ingestion pipeline. The Databricks Runtime version of the cluster must be 14.3. You can verify these cluster settings programmatically, as shown in the sketch after this list.

  • Access to a Databricks Unity Catalog node, which will be used as the data lake in the data ingestion pipeline.
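
    You can confirm these prerequisites in the Databricks workspace UI, or programmatically. The following is a minimal sketch that queries the Databricks Clusters REST API for the runtime version and access mode of the integration cluster. The workspace host, token, and cluster ID are placeholders (assumptions) that you must replace with values from your own workspace.

```python
# Minimal sketch: check the integration cluster's runtime version and access
# mode via the Databricks Clusters REST API. Host, token, and cluster ID below
# are placeholders, not values supplied by Data Pipeline Studio.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                                  # assumption
CLUSTER_ID = "<cluster-id>"                                        # assumption

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
    timeout=30,
)
resp.raise_for_status()
cluster = resp.json()

# The data integration cluster must run Databricks Runtime 14.3.
print("Runtime:", cluster.get("spark_version"))          # e.g. "14.3.x-scala2.12"

# Unity Catalog requires a supported access mode (reported on the cluster
# spec as data_security_mode, e.g. SINGLE_USER or USER_ISOLATION).
print("Access mode:", cluster.get("data_security_mode"))
```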

Creating a data ingestion pipeline

  1. On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:

    For this example, we use FTP in the data source node.

    Data Catalog data ingestion pipeline into Unity Catalog data lake

  2. Configure the data source node.

  3. Configure the data lake node.

    • From the Use an existing Databricks Unity Catalog dropdown, select an instance, and then click Add to data pipeline.

    • From the Schema Name dropdown, select a schema.

    • (Optional) Click Data Browsing to browse the folders and view the required files.

    • Click Save.

  4. Click the Databricks node in the data integration stage, and then click Create Templatized Job. Complete the following steps to create the job:
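
    The templatized job performs the actual data movement. As a point of reference only, the following PySpark sketch shows conceptually what such an ingestion amounts to: reading a CSV file from an FTP server and writing it as a table into the Unity Catalog schema selected on the data lake node. The FTP host, credentials, file path, and the catalog.schema.table name are placeholder assumptions, not values generated by Data Pipeline Studio, and this is not the job the platform creates for you.

```python
# Conceptual sketch of an FTP-to-Unity-Catalog ingestion (not the templatized
# job produced by Data Pipeline Studio). All connection details and names are
# placeholders.
import io
from ftplib import FTP

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# Download the source file from the FTP server configured on the data source node.
buffer = io.BytesIO()
with FTP("ftp.example.com") as ftp:            # assumption: example host
    ftp.login("ftp_user", "ftp_password")      # assumption: example credentials
    ftp.retrbinary("RETR /exports/orders.csv", buffer.write)
buffer.seek(0)

# Convert the CSV into a Spark DataFrame and write it to the Unity Catalog
# data lake using the three-level namespace: <catalog>.<schema>.<table>.
df = spark.createDataFrame(pd.read_csv(buffer))
df.write.mode("overwrite").saveAsTable("main.sales.orders_raw")  # assumption
```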

What's next? Data Issue Resolver using Unity Catalog