Databricks Data Analyzer

In the data analyzer stage, you analyze the complete dataset against the constraints you select. To do this, add the Data Analyzer node to the data quality stage and then create a data analyzer job.
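
Conceptually, an analyzer pass computes per-column metrics over the full dataset for each selected constraint. The PySpark sketch below is illustrative only; the table and column names are hypothetical, and this is not the product's internal implementation:

```python
# Illustrative sketch of constraint-driven analysis (not the product's code).
# "sales", "order_id", "customer_id", and "amount" are hypothetical names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("sales")

metrics = df.agg(
    F.count("*").alias("row_count"),
    # Completeness: fraction of non-null values in a column
    (F.count("order_id") / F.count("*")).alias("order_id_completeness"),
    # Distinctness: number of unique values in a column
    F.countDistinct("customer_id").alias("customer_id_distinct"),
    # Min/max bounds for a numeric column
    F.min("amount").alias("amount_min"),
    F.max("amount").alias("amount_max"),
)
metrics.show()
```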

  1. In the data quality stage, add a Data Analyzer node. Connect the node's input and output to the data lake.

    DQ Add Data Analyzer stage

  2. Click the Data Analyzer node and then click Create Job.

    DQ Data Analyzer Create Job

  3. Provide the required information to create the Data Analyzer job.

  4. The Data Analyzer job is created. Click Start to run the job. Alternatively, publish the pipeline and run it, which also runs the data analyzer job.

    DQ Start Data Analyzer job

  5. Once the job is complete, click the Analyzer Result tab and then click View Analyzer Results.

    DQ View Data Analyzer job results

  6. The results you see depend on the constraints you selected.

    Note:

    If you selected the data type constraint in the data analyzer job, additional entries are generated in the output results. See Data type constraints in data analyzer jobs.

    You can download the results as a CSV file.
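
If you want to inspect the downloaded file programmatically, a minimal sketch with pandas follows; the file name and the column name used in the filter are assumptions, since the actual schema depends on the constraints you selected:

```python
# Assumption: the downloaded CSV has one row per (column, constraint) metric.
# The file name and the "column" field are hypothetical.
import pandas as pd

results = pd.read_csv("analyzer_results.csv")
print(results.head())

# Example: look at the metrics for a single column.
print(results[results["column"] == "order_id"])
```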

Once the data analyzer job is complete and the results are available, the next step is to create a data validator job.

Note: The pipeline must be in Edit mode to create a data validator job.

Create a data validator job

  1. Click the Data Analyzer node in the pipeline, click the ellipsis (...), and then click Configuration.

    Select Data Validator job

  2. Notice that the job now has an additional Validators step.

    DQ Data Validator job

  3. Provide the following information to create a data validator job:

    • Job Name

      • Template - automatically selected based on the stages in the pipeline.

      • Job Name - provide a name for the data validator job.

      • Node Rerun Attempts - the number of times the job is rerun if it fails. The default value is set at the pipeline level.

    Click Next.

    • Source

      DQ Data Analyzer job

      • Source - this is automatically selected depending on the type of source added in the pipeline.

      • Datastore - this is automatically selected depending on the configured datastore.

      • Source Format - select either Parquet or Delta table.

      • Choose Base Path - this is automatically populated from the data analyzer path.

      • Constraint - the list of constraints selected in the data analyzer job is automatically populated. You can add additional constraints in the Validators step.

    • Validators

      DQ Validator Add Constraints

      • Do you want the pipeline run to be aborted if the validator result fails? - Enable this option depending on your requirement. If enabled, the pipeline run is terminated when the validator job fails (see the sketch after this list).

      • Do you want constraints used in Data Analyzer to be used in Data Validator? - Click Add Constraints, and then do one of the following:

        • Add New Constraints - Click this option to add new constraints. Select a constraint from the dropdown list. Select a column. Click Add. Repeat the steps to add all the required constraints. Then click Done.

          Refer to Data Quality Constraints.

        • From Data Analyzer - Click this option to view the constraints added in the data analyzer. Review the list, select a condition for each constraint, and click Add for those you want to include. Click Done when you have added the required constraints.

      DQ DV List of constraints

      • View the list of constraints that are added for the data validator job and then click Next.

    • Target

      • Target - this is automatically selected depending on the configured datastores to which you have access.

      • Choose Target Format - select either Parquet or Delta table.

      • Target Folder - select the target folder where you want to store the data validator job output.

      • Target Path - you can provide an additional folder path, which is appended to the target folder.

      • Audit Tables Path - this path is formed from the selected folders. A folder named Data_Analyzer_Job_audit_table is created for the data analyzer and another named Data_Analyzer_Job_audit_table_validator for the data validator.

      • Final File Path - the final path is created as follows: /S3 bucket name/Target Folder/Target Path
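
To make the validator's pass/fail semantics concrete, here is a minimal PySpark sketch, assuming a completeness constraint on a hypothetical order_id column of a hypothetical sales table; this is not the product's implementation. Aborting the pipeline run on failure corresponds to the raise at the end:

```python
# Minimal sketch of validator semantics (names and threshold are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("sales")

total = df.count()
non_null = df.filter(F.col("order_id").isNotNull()).count()
completeness = non_null / total if total else 0.0

passed = completeness >= 0.99  # hypothetical condition chosen for the constraint
print(f"order_id completeness={completeness:.4f} passed={passed}")

# With "abort pipeline run on failure" enabled, a failed constraint
# behaves like raising an error here, terminating the run.
if not passed:
    raise RuntimeError("Validator failed: order_id completeness below threshold")
```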

  4. Click the Data Analyzer node and click Start to initiate the data validator job run.

  5. Once the job succeeds, click the Validator Result tab and then click View Validator Results.

    DQ Validator Results
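
You can also read the job output directly from the target location; the bucket and folder names below are placeholders following the Final File Path pattern above, and the format matches whichever you chose in the Target step:

```python
# Placeholders for bucket/folder names; use "parquet" instead of "delta"
# if you selected Parquet as the target format.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Final File Path pattern: /S3 bucket name/Target Folder/Target Path
output = spark.read.format("delta").load("s3://my-bucket/target-folder/target-path")
output.show()
```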

What's next? Databricks Issue Resolver