Databricks Data Profiler

Data profiling is the process of examining, analyzing, and summarizing a sample dataset to gain insights into the quality of the data, based on the selected parameters. To add a data profiler stage in the Calibo Accelerate platform, do the following:

  1. Add a data quality stage after the data lake stage.

  2. Add a data profiler node to the data quality stage. Connect the node to and from the data lake, as shown below:

    Add Data Profiler stage

  3. Click the data profiler node and then click Create Job to create a data profiler job.

    Create Data Profiler job

    Note: The Profiler Result tab becomes visible after you create a job and run it for the first time. If you use the Validate constraint in your profiler job, you can also view the Validated Profiler Result.

  4. Provide the following information for the data profiler job:

  5. After you create the job, you can run the job in the following two ways:

    • Publish the pipeline with the changes and then run the pipeline.

    • Click the Data Profiler node, and click Start.

      DQ Start Data Profiler job

  6. Click View Profiler Results to view the results of the Data Profiler job. After viewing the results, you can validate columns based on a specified pattern.

    DQ View Profiler Results

  7. Specify the validation pattern and click Validate to check the data in the selected columns against that pattern.

    DQ Validate Profiler Results

    Note: The pipeline must be in Edit mode for the Validate button to be enabled.

  8. Click Start or run the pipeline to run the validation job. Once the job is complete, you can view its results under the Validated Profiler Result tab. You can download the results as a CSV file.
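
To make the profiling and pattern-validation steps above more concrete, the following is a minimal conceptual sketch in plain Python. It does not represent how the Calibo Accelerate platform or Databricks implements the data profiler. The function names (profile_column, validate_column), the sample columns, and the pattern are all hypothetical and are used only for illustration. The sketch assumes that a profile is a per-column summary (row count, nulls, distinct values, most common value) and that validation means checking each non-empty value against a regular expression.

```python
import re
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, null count, distinct count, most common value."""
    # Treat None and empty strings as missing values (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def validate_column(values, pattern):
    """Return the non-missing values that do not fully match the given regex pattern."""
    compiled = re.compile(pattern)
    return [
        v
        for v in values
        if v is not None and v != "" and not compiled.fullmatch(str(v))
    ]


# Hypothetical sample data; column names and values are illustrative only.
sample = {
    "customer_id": ["C001", "C002", "C003", "c04", None],
    "country": ["IN", "US", "US", "IN", "US"],
}

# Step 1: profile every column in the sample.
profile = {name: profile_column(column) for name, column in sample.items()}

# Step 2: validate one column against a pattern (an uppercase C followed by three digits).
invalid_ids = validate_column(sample["customer_id"], r"C\d{3}")

print(profile)
print(invalid_ids)
```

In this sketch, the profile plays the role of the profiler result, and the list of non-matching values plays the role of the validated profiler result for the selected column.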

What's next? Databricks Data Analyzer