Data Profiler using Unity Catalog

In this topic we describe how to create a data quality pipeline that reads data from a Databricks Unity Catalog data lake, profiles it using Unity Catalog-enabled data quality nodes, and writes the data to a Databricks Unity Catalog data lake. The source and target nodes can use the same data lake or different ones.

Prerequisites

You must complete the following prerequisites before creating a data profiler job:

  • The data quality nodes have specific requirements for the Databricks Runtime version and the access mode of the cluster. The following are the requirements for Unity Catalog-enabled Databricks used as a data profiler node in the data pipeline (a verification sketch follows this list):

    Data Quality Node    Databricks Cluster Runtime Version    Access Mode
    Data Profiler        12.2 LTS                              Dedicated
  • Access to a Databricks Unity Catalog node that will be used as a data lake in the data ingestion pipeline.
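
Both prerequisites can be checked from a notebook attached to the cluster that the data profiler node will use. The following is a minimal sketch, assuming a Databricks notebook with a predefined `spark` session; the `clusterUsageTags` configuration key and the `main` catalog are assumptions used for illustration.

```python
# Minimal verification sketch. Assumes a Databricks notebook, where the
# `spark` session is predefined.

# Databricks Runtime version of the attached cluster; this should report
# a 12.2 LTS image. The clusterUsageTags key is an assumption based on
# the tags Databricks sets on clusters.
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))

# Confirm Unity Catalog access: these commands fail on clusters whose
# access mode does not support Unity Catalog. 'main' is a placeholder
# catalog name.
spark.sql("SHOW CATALOGS").show()
spark.sql("SHOW SCHEMAS IN main").show()
```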

Creating a data profiler job

  1. On the home page of Data Pipeline Studio, add the following stages:

    • Data Quality (Databricks - Unity Catalog enabled)

    • Data Lake (Databricks Unity Catalog)

    As an example, two pipelines are shown below:

    • Pipeline 1, which uses two different Unity Catalog data lake nodes as the source and target, along with data quality nodes.

    • Pipeline 2, which uses the same Unity Catalog data lake node as both the source and target, along with data quality nodes.

  2. Configure the data lake nodes. Do the following:
    1. The catalog name and workspace are already selected. To edit these details, click the ellipsis (...) and select the required options.
    2. Select the required schema.
    3. Click the dropdown arrow in Data Browsing to verify the data in the source. (A programmatic preview is sketched after these steps.)
  3. In the data quality stage, click the Unity Catalog node for the data profiler. On the Unity Catalog Profiler Job tab, click Create Job, and then complete the following steps to create the job:
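
The data that you verify through Data Browsing in step 2 can also be previewed from a notebook. The following is a minimal sketch, assuming a Databricks notebook with a predefined `spark` session; the three-level table name `main.sales.orders` is a hypothetical placeholder.

```python
# Preview a Unity Catalog table, similar to the Data Browsing panel.
# Assumes a Databricks notebook where `spark` is predefined.
# `main.sales.orders` (catalog.schema.table) is a placeholder name.
df = spark.table("main.sales.orders")

df.printSchema()                    # column names and types
df.show(10)                         # first 10 rows, a quick source check
print(f"Row count: {df.count()}")   # a simple profiling-style metric
```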